tensorflow, transformer-model, attention-model, autoregressive-models

Is the TensorFlow multi-head attention layer autoregressive? (e.g. "tfa.layers.MultiHeadAttention")


I have looked at the difference between autoregressive and non-autoregressive transformer architectures, but I am wondering: is the attention layer in TensorFlow actually autoregressive, or do I need to implement the autoregressive mechanism myself?

I don't see any option for causal masking (e.g. causal=True/False).

I also do not see anything in the documentation that states whether "tfa.layers.MultiHeadAttention" is autoregressive or not.

Any thoughts on that would be appreciated.
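
For reference, if a layer exposes no causal option, the autoregressive behaviour can be added by hand with a look-ahead mask. The sketch below is only an illustration of that general idea, not part of any specific layer's API (the helper name causal_mask and the additive -1e9 masking are my own assumptions): it builds a lower-triangular mask with tf.linalg.band_part and applies it to the attention logits before the softmax.

    import tensorflow as tf

    def causal_mask(size):
        # Hypothetical helper: lower-triangular matrix where entry (i, j) is 1
        # when j <= i, so position i can only attend to itself and earlier steps.
        return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

    scores = tf.random.normal((4, 4))             # stand-in for attention logits
    mask = causal_mask(4)
    masked_scores = scores + (1.0 - mask) * -1e9  # push future positions to -inf
    weights = tf.nn.softmax(masked_scores, axis=-1)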


Solution

  • I found the solution:

    I found that TensorFlow has a single-head attention layer with a causal option (a boolean that can be set to True or False), which was the best fit for my case. The layer's source code is linked below:

    https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/dense_attention.py

    With causal=True, this layer adds a mask such that position i cannot attend to positions j > i. This prevents the flow of information from the future towards the past.

    It can be used as shown below:

    tf.keras.layers.Attention(causal=True, dropout=0.5)
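
    For context, a minimal end-to-end sketch might look like the following. The input shapes and the dropout rate are just illustrative assumptions; note also that in newer TF releases the causal constructor argument may be deprecated in favour of passing use_causal_mask=True when calling the layer.

        import tensorflow as tf

        # Toy inputs: batch of 2 sequences, 5 time steps, 8 features each.
        query = tf.random.normal((2, 5, 8))
        value = tf.random.normal((2, 5, 8))

        # causal=True builds the look-ahead mask internally, so time step i
        # only attends to steps <= i.
        attention = tf.keras.layers.Attention(causal=True, dropout=0.5)
        output = attention([query, value])  # shape: (2, 5, 8)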