I looked at the difference between autoregressive and non-autoregressive transformer architectures, but I am wondering: is the attention layer in TensorFlow actually autoregressive, or do I need to implement the autoregressive (causal masking) mechanism myself?
I don't see any option for causal masking (e.g. causal=True/False).
I also cannot find documentation stating whether tfa.layers.MultiHeadAttention is autoregressive or not.
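For context, this is roughly what I mean by implementing the mechanism myself. The sketch below is my own assumption, not something from the tfa docs: it uses the core Keras tf.keras.layers.MultiHeadAttention (not the Addons layer) and passes an explicit lower-triangular attention_mask; the names and dimensions are made up for illustration.

import tensorflow as tf

# Hypothetical example of adding causal masking by hand with the core
# Keras multi-head attention layer (self-attention on a dummy input).
batch, seq_len, dim = 2, 5, 16
x = tf.random.normal((batch, seq_len, dim))

# Lower-triangular mask: entry (i, j) is True only when j <= i,
# so position i cannot attend to future positions j > i.
causal_mask = tf.cast(
    tf.linalg.band_part(tf.ones((1, seq_len, seq_len)), -1, 0), tf.bool
)

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=dim)
out = mha(query=x, value=x, attention_mask=causal_mask)
print(out.shape)  # (2, 5, 16)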
Any thoughts on that would be appreciated.
I found the solution:
I found that TensorFlow has a single-head attention layer, tf.keras.layers.Attention, with a boolean causal option, which was the best fit for my case. Its documentation describes the option as follows:
This layer adds a mask such that position i cannot attend to positions j > i. This prevents the flow of information from the future towards the past.
It can be used as shown below:
tf.keras.layers.Attention(causal=True, dropout=0.5)
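For completeness, here is a minimal sketch of how I call it on dummy tensors. The shapes are made up for illustration, and depending on your TensorFlow version the causal option may have been moved or deprecated, so check the docs for your release.

import tensorflow as tf

# Dummy shapes: batch of 2, sequence length 5, feature dim 16.
query = tf.random.normal((2, 5, 16))
value = tf.random.normal((2, 5, 16))

# causal=True applies the lower-triangular mask described above;
# dropout is applied to the attention scores during training.
attention = tf.keras.layers.Attention(causal=True, dropout=0.5)

# The layer takes a list [query, value] (an optional key can be passed as a third element).
output = attention([query, value], training=True)
print(output.shape)  # (2, 5, 16)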