I would like to know how predicted hand tracking works in ARKit. Does it use a Kalman filter, or some other approach, like machine learning?
I searched a lot but found no paper or website explaining how this prediction works.
Two opposing requirements of skeletal hand tracking are accuracy and latency: whatever filtering/AI/solver algorithm you choose, you trade low latency for high accuracy, or vice versa. A Kalman filter/predictor gives a good balance of both. In this Disney research, the team ran 22 individual Kalman filters simultaneously, one for each joint. It's a shame that Apple very rarely publishes scientific papers about the principles its products and APIs are based on; however, I'm 99% sure it's a Kalman filter/predictor that's used in ARKit/RealityKit body tracking and hand tracking. Why reinvent the wheel?
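To make the idea concrete, here is a minimal sketch (my own illustration, not Apple's or Disney's implementation) of a constant-velocity Kalman filter that you would run independently for each joint, one instance per axis: `predict(dt:)` extrapolates the joint a few milliseconds ahead (the low-latency part), and `update(measuredPosition:)` corrects the estimate whenever a fresh tracked sample arrives (the accuracy part).

```swift
import simd

// One filter per joint axis. State x = [position, velocity].
struct JointKalmanFilter {
    private var x = SIMD2<Double>(0, 0)                                  // state estimate
    private var P = simd_double2x2(diagonal: SIMD2<Double>(1, 1))        // state covariance
    private let Q = simd_double2x2(diagonal: SIMD2<Double>(1e-4, 1e-3))  // process noise
    private let R = 1e-3                                                 // measurement noise

    /// Extrapolate the joint position `dt` seconds ahead (prediction step).
    mutating func predict(dt: Double) -> Double {
        let F = simd_double2x2(rows: [SIMD2(1.0, dt), SIMD2(0.0, 1.0)])  // constant-velocity model
        x = F * x
        P = F * P * F.transpose + Q
        return x[0]
    }

    /// Fold in a freshly tracked joint position (correction step).
    mutating func update(measuredPosition z: Double) {
        let H = SIMD2<Double>(1, 0)        // we observe position only
        let y = z - simd_dot(H, x)         // innovation
        let S = simd_dot(H, P * H) + R     // innovation covariance
        let K = (P * H) / S                // Kalman gain
        x += K * y
        let KH = simd_double2x2(rows: [SIMD2(K[0], 0.0), SIMD2(K[1], 0.0)])
        P = (matrix_identity_double2x2 - KH) * P
    }
}

// Usage: 22 joints × 3 axes → 66 such filters updated every frame.
var thumbTipX = JointKalmanFilter()
thumbTipX.update(measuredPosition: 0.12)    // measured x-coordinate in meters
let xIn11ms = thumbTipX.predict(dt: 0.011)  // where the joint will likely be next frame
```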
I would like to add that when using the `.predicted` tracking mode, nothing prevents Cupertino engineers from combining several approaches: a Kalman predictor (which uses past data to predict a joint's motion), predictive ML models (for when some joints are occluded), and inverse kinematics based on a specific solver.
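For completeness, on visionOS the choice between the two behaviors is exposed directly to developers. Here's a quick sketch (assuming visionOS 2's `AnchoringComponent` initializer that takes a `trackingMode:` parameter, not anything documented about its internals) of attaching an entity to a fingertip with predicted tracking:

```swift
import RealityKit

@MainActor
func makePredictedFingertip() -> Entity {
    let tip = ModelEntity(mesh: .generateSphere(radius: 0.005))

    // .predicted extrapolates the hand pose to reduce perceived latency,
    // at the cost of some accuracy during fast motion; .continuous favors accuracy.
    tip.components.set(
        AnchoringComponent(
            .hand(.right, location: .indexFingerTip),
            trackingMode: .predicted
        )
    )
    return tip
}
```

Whatever machinery sits behind that flag, the trade-off it exposes (lower perceived latency versus per-frame accuracy) is exactly the one a Kalman-style predictor is designed to manage.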