
Transfer learning for video classification


How can I use a pre-trained model to train a video classification model? My dataset has shape (4000, 10, 150, 150, 1), and I am trying to classify human actions with TimeDistributed Conv2D layers. I can train without transfer learning, but the accuracy is poor. Here is what I have tried:

from keras.applications import VGG16
conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(150, 150, 3))

model = models.Sequential()
model.add(conv_base)
model.add(TimeDistributed(Conv2D(96, (3, 3), padding='same',
                        input_shape=x_train.shape[1:])))
model.add(TimeDistributed(Activation('relu')))
model.add(TimeDistributed(Conv2D(128, (3, 3))))
model.add(TimeDistributed(Activation('relu')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Dropout(0.35)))
.
.
.
.

But I got this error: ValueError: strides should be of length 1, 1 or 3 but was 2
Does anyone have any idea?


Solution

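  • I'm assuming you have 10 frames per video. This is a simple model that extracts VGG16 features for each frame (followed by GlobalAveragePooling2D) and uses an LSTM to classify the frame sequence.

    You can experiment by adding a few more layers or changing the hyperparameters.

    N.B.: Your model has several inconsistencies, including passing 5-D data directly to VGG16, which expects 4-D data. Note also that VGG16 with ImageNet weights expects 3-channel input, while your data has a single channel, so the frames need to be converted to 3 channels first.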
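
To make the 5-D versus 4-D point concrete, here is a minimal NumPy sketch of the idea behind applying a per-frame model to a batch of videos: merge the batch and time axes, run a 4-D function on the frames, then split the axes again. The names `apply_per_frame` and `toy_extractor` are hypothetical, and `toy_extractor` is only a stand-in for the real feature extractor; this illustrates the shape handling, not how Keras implements it.

```python
import numpy as np

def apply_per_frame(videos, frame_fn):
    """Apply frame_fn to every frame of a (batch, time, H, W, C) array."""
    batch, time = videos.shape[:2]
    # Merge batch and time so that each frame becomes one 4-D sample.
    frames = videos.reshape((batch * time,) + videos.shape[2:])
    features = frame_fn(frames)  # takes (N, H, W, C), returns (N, D)
    # Split the merged axis back into (batch, time).
    return features.reshape((batch, time) + features.shape[1:])

def toy_extractor(frames):
    # Hypothetical stand-in for a CNN followed by global average pooling:
    # averages over height and width, leaving one value per channel.
    return frames.mean(axis=(1, 2))

videos = np.random.rand(2, 10, 150, 150, 3)
features = apply_per_frame(videos, toy_extractor)
print(features.shape)  # (2, 10, 3)
```

A recurrent layer can then consume the resulting (batch, time, features) array as an ordinary sequence.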

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import (Dense, GlobalAveragePooling2D, Input,
                                         LSTM, TimeDistributed)
    from tensorflow.keras.models import Model
    
    IMG_SIZE = (150, 150, 3)  # VGG16 with ImageNet weights needs 3 channels
    num_class = 3
    
    def create_base():
        # VGG16 without its classifier head; global average pooling turns
        # each frame's feature map into a 512-dimensional vector.
        conv_base = VGG16(weights='imagenet',
                          include_top=False,
                          input_shape=IMG_SIZE)
        x = GlobalAveragePooling2D()(conv_base.output)
        base_model = Model(conv_base.input, x)
        return base_model
    
    conv_base = create_base()
    
    # 10 frames per video; TimeDistributed applies the 4-D model to each frame
    ip = Input(shape=(10,) + IMG_SIZE)
    t_conv = TimeDistributed(conv_base)(ip)  # VGG16 feature extractor
    
    # The LSTM reads the 10 frame vectors and returns only its last output
    t_lstm = LSTM(10, return_sequences=False)(t_conv)
    
    f_softmax = Dense(num_class, activation='softmax')(t_lstm)
    
    model = Model(ip, f_softmax)
    
    model.summary()
    
    Model: "model_5"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_32 (InputLayer)        [(None, 10, 150, 150, 3)] 0         
    _________________________________________________________________
    time_distributed_4 (TimeDist (None, 10, 512)           14714688  
    _________________________________________________________________
    lstm_1 (LSTM)                (None, 10)                20920     
    _________________________________________________________________
    dense (Dense)                (None, 3)                 33        
    =================================================================
    Total params: 14,735,641
    Trainable params: 14,735,641
    Non-trainable params: 0
    _________________________________________________________________
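
The model above takes input of shape (10, 150, 150, 3), while the dataset in the question has a single channel. One simple option, sketched here as an assumption rather than the only approach, is to repeat the gray channel three times with NumPy. The helper name `gray_to_rgb` is hypothetical, and `x_gray` is a small random stand-in for the real `x_train`.

```python
import numpy as np

def gray_to_rgb(videos):
    """Repeat the single gray channel three times along the last axis."""
    if videos.shape[-1] != 1:
        raise ValueError("expected a trailing channel axis of size 1")
    return np.repeat(videos, 3, axis=-1)

# Small stand-in for x_train; the real array would be (4000, 10, 150, 150, 1).
x_gray = np.random.rand(4, 10, 150, 150, 1).astype("float32")
x_rgb = gray_to_rgb(x_gray)
print(x_rgb.shape)  # (4, 10, 150, 150, 3)
```

Converting all 4000 videos at once would take roughly 10.8 GB in float32, so it may be preferable to convert one batch at a time. If the pixel values are in the 0 to 255 range, `tensorflow.keras.applications.vgg16.preprocess_input` can be applied afterwards to match the preprocessing VGG16 was trained with.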