tensorflow, keras, deep-learning, lstm, autoencoder

What is the timestep in Keras LSTM layers, and how do I choose a value for this parameter?


I have trouble understanding the "timestep" parameter of the Keras LSTM layer. I have found several explanations, but I am quite confused now. Some say it is the amount of data per batch that enters the model during training; others say it is the number of times the cell is unrolled within the LSTM layer, with the states being passed from one step to the next.

The point is that I have the following form of the training data:

(sequences, number of frames per sequence, width, height, channel = 1)
(2000, 5, 80, 80, 1)

My model must predict the following sequence of frames, in this case 5 future frames. The model is a variational autoencoder: first I use 3D convolutional layers to compress the sequence of 5 frames, then I reshape the outputs so they can enter the LSTM layer, which only accepts (batch, timestep, features).

Model: "sequential"
____________________________________________________________________________________________________
Layer (type)                                 Output Shape                            Param #        
====================================================================================================
conv3d (Conv3D)                              (None, 2, 27, 27, 32)                   19392          
____________________________________________________________________________________________________
batch_normalization (BatchNormalization)     (None, 2, 27, 27, 32)                   128            
____________________________________________________________________________________________________
conv3d_1 (Conv3D)                            (None, 1, 14, 14, 32)                   2654240        
____________________________________________________________________________________________________
batch_normalization_1 (BatchNormalization)   (None, 1, 14, 14, 32)                   128            
____________________________________________________________________________________________________
conv3d_2 (Conv3D)                            (None, 1, 7, 7, 64)                     3211328        
____________________________________________________________________________________________________
batch_normalization_2 (BatchNormalization)   (None, 1, 7, 7, 64)                     256            
____________________________________________________________________________________________________
flatten (Flatten)                            (None, 3136)                            0              
____________________________________________________________________________________________________
reshape (Reshape)                            (None, 4, 784)                          0              
____________________________________________________________________________________________________
lstm (LSTM)                                  (None, 64)                              217344         
____________________________________________________________________________________________________
repeat_vector (RepeatVector)                 (None, 4, 64)                           0              
____________________________________________________________________________________________________
lstm_1 (LSTM)                                (None, 4, 64)                           33024          
____________________________________________________________________________________________________
time_distributed (TimeDistributed)           (None, 4, 784)                          50960          
____________________________________________________________________________________________________
reshape_1 (Reshape)                          (None, 1, 7, 7, 64)                     0              
____________________________________________________________________________________________________
conv3d_transpose (Conv3DTranspose)           (None, 2, 14, 14, 64)                   6422592        
____________________________________________________________________________________________________
batch_normalization_3 (BatchNormalization)   (None, 2, 14, 14, 64)                   256            
____________________________________________________________________________________________________
conv3d_transpose_1 (Conv3DTranspose)         (None, 4, 28, 28, 32)                   5308448        
____________________________________________________________________________________________________
batch_normalization_4 (BatchNormalization)   (None, 4, 28, 28, 32)                   128            
____________________________________________________________________________________________________
conv3d_transpose_2 (Conv3DTranspose)         (None, 8, 84, 84, 1)                    19361          
____________________________________________________________________________________________________
batch_normalization_5 (BatchNormalization)   (None, 8, 84, 84, 1)                    4              
____________________________________________________________________________________________________
cropping3d (Cropping3D)                      (None, 8, 80, 80, 1)                    0              
____________________________________________________________________________________________________
cropping3d_1 (Cropping3D)                    (None, 5, 80, 80, 1)                    0              
====================================================================================================

I have finally used a Reshape layer to get into the LSTM layer with shape (batch, 4, 784); in other words, I have set timestep = 4. I think it should be 5, or perhaps it does not necessarily have to equal the number of frames I want to predict.
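As a minimal sketch (not the full model above), this shows how a flattened feature vector is split into a (timestep, features) shape before an LSTM. Any factorization works as long as timestep * features equals the flattened size; here 4 * 784 == 3136, matching the Flatten/Reshape pair in the summary:

```python
# Minimal sketch: reshape flattened conv features into (timestep, features)
# so an LSTM can consume them. Shapes match the summary above (3136 = 4 * 784).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(3136,)),   # flattened conv features
    layers.Reshape((4, 784)),      # (batch, timestep=4, features=784)
    layers.LSTM(64),               # reads the 4 timesteps, outputs (batch, 64)
])

out = model(np.zeros((2, 3136), dtype="float32"))
print(tuple(out.shape))  # (2, 64)
```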

What is the true meaning of timestep in this case? Do I need to reorder the values of my layers? I really appreciate your support.

On the other hand, I am thinking of applying the convolutional layers frame by frame rather than to the entire 5-frame sequence, then connecting the outputs of the convolutional layers to LSTM layers, and finally joining the output states of the LSTM layers for each frame, respecting the order of the frames. In that case I would consider using timestep = 1.
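A sketch of that per-frame alternative (with assumed filter counts): the usual Keras idiom is TimeDistributed, which applies the same Conv2D to each of the 5 frames; the LSTM then reads the 5 per-frame feature vectors, so the timestep is 5 (one per frame), not 1:

```python
# Sketch of the per-frame idea: TimeDistributed applies one Conv2D to every
# frame independently; the LSTM then sees 5 timesteps of per-frame features.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(5, 80, 80, 1)),  # (timestep, height, width, channel)
    layers.TimeDistributed(layers.Conv2D(16, 3, strides=2, activation="relu")),
    layers.TimeDistributed(layers.Flatten()),  # -> (batch, 5, features)
    layers.LSTM(64),                           # timestep = 5, one per frame
])

out = model(np.zeros((2, 5, 80, 80, 1), dtype="float32"))
print(tuple(out.shape))  # (2, 64)
```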


Solution

  • I have set timestep = 4. I think it should be 5, or perhaps it does not necessarily have to equal the number of frames I want to predict.

    You are right. The timestep is not equal to the number of frames you want to predict.

    Let us frame it in a natural-language-friendly description.

    The timestep is, in essence, the number of past units (seconds, minutes, hours, days, frames of a video, etc.) used to predict the future step(s).

    For example, suppose you want to predict the stock price taking into account the last 5 days. In this case timestep = 5, where T-5 = current_day - 5, T-4 = current_day - 4, and so on. Notice that current_day plays the role of the 'future day' here: you are predicting today's value in advance.

    You may want to predict only the stock price for the current day; that is a one-step forecast. However, you may also want to predict the stock price for today, tomorrow, and the day after tomorrow, i.e. predict T, T+1, T+2 by taking into account T-5, T-4, T-3, T-2, T-1. The acknowledged nomenclature for the second case is a multi-step forecast.
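The windowing described above can be sketched in a few lines of NumPy (the series here is made-up stand-in data): timestep = 5 fixes the size of the input window, while the one-step vs. multi-step choice only changes the target, not the window:

```python
# Build (input window, target) pairs from a series: timestep = 5 past values,
# targets for both a one-step forecast (T) and a 3-step forecast (T, T+1, T+2).
import numpy as np

series = np.arange(20)          # stand-in for daily prices / frames
timestep, horizon = 5, 3

X, y_one, y_multi = [], [], []
for t in range(timestep, len(series) - horizon + 1):
    X.append(series[t - timestep:t])        # [T-5 .. T-1]
    y_one.append(series[t])                 # T            (one-step)
    y_multi.append(series[t:t + horizon])   # T, T+1, T+2  (multi-step)

X, y_one, y_multi = np.array(X), np.array(y_one), np.array(y_multi)
print(X.shape, y_one.shape, y_multi.shape)  # (13, 5) (13,) (13, 3)
```

Note that X is identical in both cases; only the target array changes shape.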

    Notice how the timestep, which relates strictly to the "past", plays no role in computing how many future steps the multi-step forecast produces.

    Evidently, depending on your problem, it is often the case that for a multi-step forecast you need a bigger "past" window, i.e. a larger number of timesteps, to help your LSTM capture more correlations in the data.

    If you want to relate it to the amount of data per batch, a batch size of 2 corresponds to 2 chunks of data, in each of which [T-5, T-4, T-3, T-2, T-1] is used to predict T. That is, 2 chunks of the form ([T-5, T-4, T-3, T-2, T-1], [T]).
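Concretely, stacking 2 such chunks (with made-up values) yields exactly the (batch, timestep, features) input shape the LSTM expects:

```python
# A batch of 2 chunks: each chunk is ([T-5..T-1], T). Stacking them gives
# an LSTM input of shape (batch=2, timestep=5, features=1) and 2 targets.
import numpy as np

chunk_a = (np.array([1, 2, 3, 4, 5]), 6)        # from one video, say
chunk_b = (np.array([10, 11, 12, 13, 14]), 15)  # from another video

X = np.stack([chunk_a[0], chunk_b[0]])[..., None]  # add the features axis
y = np.array([[chunk_a[1]], [chunk_b[1]]])
print(X.shape, y.shape)  # (2, 5, 1) (2, 1)
```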

    When you prepare the data to predict the next frame, you of course need the past values (T-5, T-4, ...) in exact order inside a chunk. What you do not need is for the chunks themselves to be consecutive slices of the same video, say.

    In other words, you can have a chunk like the one described above from video 1, a chunk from video 9, etc.