I am working on a project where I need to track and analyze hand movements across all frames of multiple videos using MediaPipe. The challenge I'm facing is that the subject's distance from the camera varies, so the size of the detected hand changes from frame to frame, and because the subject also moves I can't rely on the raw positions of the landmarks. I want to standardize the size of the hand landmarks across frames so I can compare movements more accurately.
How can I normalize the positions and/or sizes of hand landmarks detected in a video, considering changes in orientation and distance from the camera? I'm looking for a method to adjust for scale, rotation, and translation of the hand landmarks.
On the one hand, you can normalise your coordinates to account for scale and translation very easily.
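A minimal sketch of what this step could look like, assuming the 21 landmarks come from MediaPipe Hands as objects with `.x` and `.y` attributes (e.g. `results.multi_hand_landmarks[0].landmark`): translation is removed by centring the points on the wrist (landmark 0), and scale by dividing by the largest wrist-to-landmark distance.

```python
import numpy as np

def normalise_landmarks(hand_landmarks):
    """Remove translation and scale from one frame of hand landmarks."""
    # Assumed input: an iterable of landmark objects with .x and .y attributes,
    # e.g. results.multi_hand_landmarks[0].landmark from MediaPipe Hands.
    pts = np.array([[lm.x, lm.y] for lm in hand_landmarks], dtype=float)

    # Translation: express every point relative to the wrist (landmark 0).
    pts -= pts[0]

    # Scale: divide by a characteristic hand size; here, the largest
    # wrist-to-landmark distance, so the hand always fits in a unit circle.
    size = np.linalg.norm(pts, axis=1).max()
    if size > 0:
        pts /= size

    return pts
```

Any characteristic length works as the divisor (wrist-to-middle-knuckle distance, bounding-box diagonal, etc.); what matters is that you use the same choice for every frame.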
Now all coordinates are normalised for scale and translation. However, you should note that rotating your hand will still affect the normalised coordinates. Rotation is trickier to work around, for two reasons:
If the plane of rotation is parallel to the image plane (e.g., your palm faces the camera and the axis of rotation runs through your palm and the camera), the procedure explained above is not enough on its own: even though you are making the same gesture, the relative positions of the fingers change as the hand rotates. There is a workaround, though, if you can assume that two landmarks always mark the bottom and top of the hand, for example the wrist (0) and the middle fingertip (12) in MediaPipe's hand landmark numbering. Should this be the case, a little trigonometry lets you replace the x- and y-axes proposed by MediaPipe with new axes defined by the line going through landmarks (0) and (12) and its perpendicular (a sketch of this step follows after the second case below).
If the plane of rotation is perpendicular to the image plane (e.g., your palm faces the camera and the axis of rotation runs from your wrist to your middle finger), normalising the coordinates is much harder, mainly because MediaPipe has quite a hard time predicting depth coordinates. Nonetheless, if handling this type of rotation is essential for your use case, I recommend you take a look at the following links:
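For the first case, here is a minimal sketch of the trigonometry step, under the assumption that the points have already been normalised as above (a `(21, 2)` array with the wrist at the origin): it rotates the whole set of landmarks so that the wrist-to-middle-fingertip line (0 → 12) becomes the new y-axis.

```python
import numpy as np

def align_rotation(pts):
    """Rotate 2D landmarks so the wrist->middle-fingertip line is the y-axis."""
    # Assumed input: the (21, 2) array returned by normalise_landmarks above,
    # so pts[0] is the wrist and pts[12] the middle fingertip.
    v = pts[12] - pts[0]

    # Angle of the wrist->fingertip vector measured from the +y axis.
    angle = np.arctan2(v[0], v[1])

    # 2D rotation matrix that undoes this angle.
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s],
                  [s,  c]])

    # Apply the rotation to every landmark (points are row vectors).
    return pts @ R.T
```

After this step an in-plane rotation of the hand no longer changes the coordinates, since every frame is re-expressed in the axes defined by landmarks (0) and (12) and their perpendicular.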
It is quite a tricky task, so the complexity you want to add to the normalisation depends solely on your use case. Good luck! And may the code be with you...