I am working on a project where I need to track and analyze hand movements across all frames of multiple videos using MediaPipe. The challenge I'm facing is that the subject's distance from the camera varies, so the size of the detected hand changes from frame to frame, and because the subject also moves I can't rely on the raw positions of the landmarks. I want to standardize the size of the hand landmarks across frames so I can compare movements more accurately.
How can I normalize the positions and/or sizes of hand landmarks detected in a video, considering changes in orientation and distance from the camera? I'm looking for a method to adjust for scale, rotation, and translation of the hand landmarks.
On the one hand, you can normalise your coordinates to account for scale and translation very easily.
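A minimal sketch of what this step could look like, assuming the 21 landmarks come from MediaPipe Hands as objects with `.x` and `.y` attributes (e.g. `results.multi_hand_landmarks[0].landmark`): translation is removed by centring the points on the wrist (landmark 0), and scale by dividing by the largest wrist-to-landmark distance.

```python
import numpy as np

def normalise_landmarks(hand_landmarks):
    """Remove translation and scale from one frame of hand landmarks."""
    # Assumed input: an iterable of landmark objects with .x and .y attributes,
    # e.g. results.multi_hand_landmarks[0].landmark from MediaPipe Hands.
    pts = np.array([[lm.x, lm.y] for lm in hand_landmarks], dtype=float)

    # Translation: express every point relative to the wrist (landmark 0).
    pts -= pts[0]

    # Scale: divide by a characteristic hand size; here, the largest
    # wrist-to-landmark distance, so the hand always fits in a unit circle.
    size = np.linalg.norm(pts, axis=1).max()
    if size > 0:
        pts /= size

    return pts
```

Any characteristic length works as the divisor (wrist-to-middle-knuckle distance, bounding-box diagonal, etc.); what matters is that you use the same choice for every frame.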
Now all coordinates are normalised for scale and translation. However, you should note that rotating your hand will still affect the normalised coordinates. Rotation is trickier to work around, for two reasons:
If the plane of rotation is parallel to the image plane (e.g., your palm faces the camera and the axis of rotation runs through your palm and the camera), the procedure explained above is not enough on its own: even though you are making the same gesture, the relative positions of the fingers change as the hand rotates. There is a workaround, though, if you can assume that two landmarks always mark the bottom and top of the hand, for example the wrist (0) and the middle fingertip (12) in MediaPipe's hand landmark numbering. Should this be the case, a little trigonometry lets you replace the x- and y-axes proposed by MediaPipe with new axes defined by the line going through landmarks (0) and (12) and its perpendicular (a sketch of this step follows after the second case below).
If the plane of rotation is perpendicular to the image plane (e.g., your palm faces the camera and the axis of rotation runs from your wrist to your middle finger), normalising the coordinates is much harder, mainly because MediaPipe has quite a hard time predicting depth coordinates. Nonetheless, if handling this type of rotation is essential for your use case, I recommend you take a look at the following links:
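For the first case, here is a minimal sketch of the trigonometry step, under the assumption that the points have already been normalised as above (a `(21, 2)` array with the wrist at the origin): it rotates the whole set of landmarks so that the wrist-to-middle-fingertip line (0 → 12) becomes the new y-axis.

```python
import numpy as np

def align_rotation(pts):
    """Rotate 2D landmarks so the wrist->middle-fingertip line is the y-axis."""
    # Assumed input: the (21, 2) array returned by normalise_landmarks above,
    # so pts[0] is the wrist and pts[12] the middle fingertip.
    v = pts[12] - pts[0]

    # Angle of the wrist->fingertip vector measured from the +y axis.
    angle = np.arctan2(v[0], v[1])

    # 2D rotation matrix that undoes this angle.
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s],
                  [s,  c]])

    # Apply the rotation to every landmark (points are row vectors).
    return pts @ R.T
```

After this step an in-plane rotation of the hand no longer changes the coordinates, since every frame is re-expressed in the axes defined by landmarks (0) and (12) and their perpendicular.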
It is quite a tricky task, so the complexity you want to add to the normalisation depends solely on your use case. Good luck! And may the code be with you...