
Hand Landmark Coordinate Neural Network Not Converging


I'm currently trying to train a custom model with TensorFlow to detect 17 landmarks/keypoints on each of 2 hands shown in an image (fingertips, first knuckles, bottom knuckles, wrist, and palm), for 34 points in total (and therefore 68 values to predict for x & y). However, I cannot get the model to converge; instead, the output is an array of points that is pretty much the same for every prediction.

I started off with a dataset that has images like this: [image]

Each image is annotated so that the red dots correspond to the keypoints. To expand the dataset and try to get a more robust model, I took photos of the hands with various backgrounds, angles, positions, poses, lighting conditions, reflectivity, etc., as exemplified by these further images: [6 images]

I have about 3000 images created now, with the landmarks stored inside a csv as such:

[image: screenshot of the CSV]

I have a train-test split of .67 train and .33 test, with the images randomly assigned to each. I load the images with all 3 color channels, and scale both the color values & keypoint coordinates between 0 & 1.
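For reference, the scaling described above could look roughly like the following. This is only a sketch: the helper name `scale_sample`, the (x, y) column order, and the random stand-in data are assumptions, since the actual loading code and CSV layout are not shown.

```python
import numpy as np

def scale_sample(image, keypoints_px):
    """Scale a uint8 image and pixel-space keypoints into the range [0, 1].

    Assumes keypoints_px has shape (n_points, 2) in (x, y) order.
    """
    height, width = image.shape[:2]
    image_scaled = image.astype(np.float32) / 255.0
    # x is divided by the image width, y by the image height
    scale = np.array([width, height], dtype=np.float32)
    keypoints_scaled = keypoints_px.astype(np.float32) / scale
    # Flatten to x0, y0, x1, y1, ... so 34 points give 68 target values
    return image_scaled, keypoints_scaled.reshape(-1)

# Random stand-in data with the image size used in the first model
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(225, 400, 3), dtype=np.uint8)
keypoints_px = np.column_stack([rng.uniform(0, 400, 34), rng.uniform(0, 225, 34)])

image_scaled, target = scale_sample(image, keypoints_px)
print(image_scaled.shape, target.shape)
```

If the images are resized (for example to 224x224 for VGG-16), coordinates scaled this way stay valid as long as the whole image is stretched rather than cropped or padded.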

I've tried a couple different approaches, each involving a CNN. The first keeps the images as they are, and uses a neural network model built as such:

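import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()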

model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu', input_shape = (225,400,3)))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2), strides = 2))

filters_convs = [(128, 2), (256, 3), (512, 3), (512,3)]
  
for n_filters, n_convs in filters_convs:
  for _ in np.arange(n_convs):
    model.add(Conv2D(filters = n_filters, kernel_size = (3,3), padding = 'same', activation = 'relu'))
  model.add(MaxPooling2D(pool_size = (2,2), strides = 2))

model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dense(96, activation="relu"))
model.add(Dense(72, activation="relu"))
model.add(Dense(68, activation="sigmoid"))

opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
model.summary()  # summary() prints the table itself and returns None

I've modified the various hyperparameters, yet nothing seems to make any noticeable difference.

The other thing I've tried is resizing the images to fit within a 224x224x3 array to use with a VGG-16 network, as such:

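from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Pretrained convolutional base without the ImageNet classifier head
vgg = VGG16(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))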
vgg.trainable = False

flatten = vgg.output
flatten = Flatten()(flatten)

points = Dense(256, activation="relu")(flatten)
points = Dense(128, activation="relu")(points)
points = Dense(96, activation="relu")(points)
points = Dense(68, activation="sigmoid")(points)

model = Model(inputs=vgg.input, outputs=points)

opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
model.summary()  # summary() prints the table itself and returns None

This model gives similar results to the first. Whatever I change, I get the same outcome: the MSE loss bottoms out around .009, with an MAE around .07, no matter how many epochs I run: [image]

Furthermore, when I run predictions with the model, the predicted output is basically the same for every image, with only slight variation between them. The model seems to predict an array of coordinates that looks somewhat like a splayed hand, placed in the general areas where hands are most likely to be found: a catch-all solution that minimizes deviation, as opposed to a custom solution for each image. These images illustrate this, with green being the predicted points and red being the actual points for the left hand: [4 images]

So, I was wondering what might be causing this, be it the model, the data, or both, because nothing I've tried with either modifying the model or augmenting the data seems to have done any good. I've even tried reducing the complexity to predict for one hand only, to predict a bounding box for each hand, and to predict a single keypoint, but no matter what I try, the results are pretty inaccurate.

Thus, any suggestions for what I could do to help the model converge to create more accurate & custom predictions for each image of hands it sees would be very greatly appreciated.

Thanks,

Sam


Solution

  • Networks that regress exact landmark coordinates through dense layers usually have a very hard time doing so. A better approach is probably a fully convolutional network. This would work as follows:

    1. You omit the dense layers at the end and thus end up with an output of (m, n, n_filters) with m and n being the dimensions of your downsampled feature maps (since you use maxpooling at some earlier stage in the network they will be lower resolution than your input image).
    2. You set n_filters for the last (output-)layer to the number of different landmarks you want to detect plus one more to indicate no landmark.
    3. You remove some of the max pooling so that your final output has a fairly high resolution (so the m and n referenced earlier are bigger). Now your output has shape m x n x (n_landmarks+1), and each of the m x n vectors, of dimension (n_landmarks+1), indicates which landmark is present at the position in the image that corresponds to that position in the m x n grid. So the activation for your last (output) convolutional layer needs to be a softmax, to represent probabilities.
    4. Now you can train your network to predict the landmarks locally without having to use dense layers.
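    The steps above could be sketched in Keras roughly as follows. This is a minimal illustration and not a tuned architecture: the function name `build_fcn`, the filter counts, and the choice of two pooling stages are assumptions made for the example; the input shape and the 34 landmarks come from the question.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_LANDMARKS = 34  # 17 keypoints on each of 2 hands, as in the question

def build_fcn(input_shape=(225, 400, 3), n_landmarks=N_LANDMARKS):
    """Fully convolutional sketch: no Flatten and no Dense layers."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Only two pooling stages, so the output grid keeps 1/4 of the input resolution
    for n_filters in (32, 64):
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    # 1x1 convolution with one channel per landmark plus one "no landmark" channel;
    # the softmax is applied over the channel axis at every grid cell
    outputs = layers.Conv2D(n_landmarks + 1, 1, activation="softmax")(x)
    return Model(inputs, outputs)

model = build_fcn()
# Integer class labels of shape (batch, m, n) work with this loss
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
)
print(model.output_shape)
```

    Note that almost every grid cell belongs to the "no landmark" class, so some form of class weighting or a softer target may be needed in practice; that is outside the scope of this sketch.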

    This is a very simple architecture and for optimal results a more sophisticated architecture might be needed, but I think this should give you a first idea of a better approach than using the dense layers for the prediction.
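    To train such a network, the coordinate labels have to be turned into an m x n grid of class indices, and predictions have to be turned back into coordinates. One possible way to do both with NumPy is sketched below; the helper names `encode_labels` and `decode_predictions` are hypothetical, and the keypoints are assumed to be (x, y) pairs scaled to [0, 1].

```python
import numpy as np

def encode_labels(keypoints, grid_shape):
    """Turn (n_landmarks, 2) keypoints in [0, 1] into an (m, n) grid of class indices."""
    m, n = grid_shape
    background = len(keypoints)  # the last class index means "no landmark"
    labels = np.full((m, n), background, dtype=np.int64)
    cols = np.clip((keypoints[:, 0] * n).astype(int), 0, n - 1)
    rows = np.clip((keypoints[:, 1] * m).astype(int), 0, m - 1)
    # If two landmarks fall into the same cell, the later one overwrites the earlier
    labels[rows, cols] = np.arange(len(keypoints))
    return labels

def decode_predictions(probs):
    """Recover one (x, y) in [0, 1] per landmark from an (m, n, n_landmarks + 1) map."""
    m, n, n_classes = probs.shape
    # Drop the "no landmark" channel and find the most likely cell per landmark
    flat = probs[..., :-1].reshape(m * n, n_classes - 1)
    rows, cols = np.unravel_index(flat.argmax(axis=0), (m, n))
    # Report the centre of each winning cell
    return np.stack([(cols + 0.5) / n, (rows + 0.5) / m], axis=1)

# Round trip on three made-up keypoints and a 56 x 100 grid
keypoints = np.array([[0.123, 0.456], [0.5, 0.25], [0.871, 0.902]])
labels = encode_labels(keypoints, (56, 100))
perfect_probs = np.eye(len(keypoints) + 1)[labels]  # one-hot stand-in for a prediction
decoded = decode_predictions(perfect_probs)
print(np.abs(decoded - keypoints).max())
```

    The decoded positions are only accurate to within one grid cell, which is one reason to keep the output resolution high.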

    As for why your network predicts the same values every time: this is probably because the architecture is not suited to learning what you want it to learn. If that is the case, the network will just learn to predict a value that is fairly good for most of the images (so basically the "average" position of each landmark across all of your images).
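    One way to check this explanation is to compute the error of a constant prediction, namely the mean of the training labels, and compare it with the loss the model reaches (an MSE around .009 and an MAE around .07 in the question). The sketch below assumes label arrays of shape (n_images, 68) scaled to [0, 1]; the function name `mean_baseline` is hypothetical and the random arrays are only stand-ins for the real labels.

```python
import numpy as np

def mean_baseline(y_train, y_test):
    """MSE and MAE on the test set when always predicting the training-set mean."""
    mean_prediction = y_train.mean(axis=0)  # one average value per output
    error = y_test - mean_prediction
    return float(np.mean(error ** 2)), float(np.mean(np.abs(error)))

# Synthetic stand-in labels; replace with the real scaled keypoint arrays
rng = np.random.default_rng(0)
y_train = rng.uniform(0.0, 1.0, size=(2000, 68))
y_test = rng.uniform(0.0, 1.0, size=(1000, 68))

baseline_mse, baseline_mae = mean_baseline(y_train, y_test)
print(baseline_mse, baseline_mae)
```

    If the trained model's MSE and MAE are close to these baseline values on the real labels, the model has learned little beyond the average landmark positions.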