I am currently programming with the Microsoft Kinect for Windows SDK 2 on Windows 8.1. Things are going well, but in a home dev environment there is obviously much less background noise than in the 'real world'.
I would like to seek advice from those with experience using the Kinect in 'real world' applications. How does the Kinect (especially v2) fare in a live environment with passers-by, onlookers and unexpected objects in the background? I expect the space between the Kinect sensor and the user will usually be free of interference; what I am most mindful of right now is the background noise as such.
While I am aware that the Kinect does not track well under direct sunlight (either on the sensor or the user), are there certain lighting conditions or other external factors I need to account for in my code?
The answer I am looking for is:
I created an application for home use like you have, and then presented that same application in a public setting. The result was embarrassing for me, because there were many errors I would never have anticipated in a controlled environment. However, it did help me, because it led me to add some interesting adjustments to my code, which are centered on human detection only.
Have conditions for checking the validity of a "human".
When I showed my application in the middle of a presentation floor with many other objects and props, I found that even chairs could be mistaken for people for brief moments. This led to my application switching between the user and an inanimate object, causing it to lose track of the user and lose their progress. To counter this and other false-positive human detections, I added my own additional checks for a human. My most successful method was comparing the proportions of a human's body, measured in head units. (head units picture) Below is the code for how I did this (SDK version 1.8, C#):
bool PersonDetected = false;
double[] humanRatios = { 1.0, 4.0, 2.33, 3.0 };
/* Array indexes
 * 0 - Head (shoulder to head)
 * 1 - Leg length (foot to knee to hip)
 * 2 - Width (shoulder to shoulder center to shoulder)
 * 3 - Torso (hips to shoulder)
 */
...
double[] currentRatios = new double[4];
double headSize = Distance(skeletons[0].Joints[JointType.ShoulderCenter], skeletons[0].Joints[JointType.Head]);
currentRatios[0] = 1.0;
currentRatios[1] = (Distance(skeletons[0].Joints[JointType.FootLeft], skeletons[0].Joints[JointType.KneeLeft])
                    + Distance(skeletons[0].Joints[JointType.KneeLeft], skeletons[0].Joints[JointType.HipLeft])) / headSize;
currentRatios[2] = (Distance(skeletons[0].Joints[JointType.ShoulderLeft], skeletons[0].Joints[JointType.ShoulderCenter])
                    + Distance(skeletons[0].Joints[JointType.ShoulderCenter], skeletons[0].Joints[JointType.ShoulderRight])) / headSize;
currentRatios[3] = Distance(skeletons[0].Joints[JointType.HipCenter], skeletons[0].Joints[JointType.ShoulderCenter]) / headSize;

const double MaximumDiff = 0.2; // maximum allowed deviation from the ideal ratio; I used 0.2
int correctProportions = 0;
for (int i = 1; i < currentRatios.Length; i++)
{
    double diff = currentRatios[i] - humanRatios[i];
    if (Math.Abs(diff) <= MaximumDiff)
        correctProportions++;
}
if (correctProportions >= 2)
    PersonDetected = true;
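The Distance helper used above is not an SDK method; a minimal sketch, assuming it simply returns the Euclidean distance between two joint positions:
double Distance(Joint a, Joint b)
{
    // Euclidean distance between two joint positions (skeleton space, meters)
    double dx = a.Position.X - b.Position.X;
    double dy = a.Position.Y - b.Position.Y;
    double dz = a.Position.Z - b.Position.Z;
    return Math.Sqrt(dx * dx + dy * dy + dz * dz);
}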
Another method I had success with was summing the squared distances between each joint and the previous one. I found that non-human detections had more variable summed distances, whereas humans were more consistent. I learned the threshold using a one-dimensional support vector machine (I found that users' summed distances were generally less than 9).
// in AllFramesReady or SkeletonFrameReady
Skeleton data;
...
float lastPosX = 0; // previous joint position, used to detect false positives
float lastPosY = 0;
float lastPosZ = 0;
float diff = 0;
foreach (Joint joint in data.Joints)
{
    // add the squared distance from the previous joint (the first joint is measured from the origin)
    diff += (joint.Position.X - lastPosX) * (joint.Position.X - lastPosX);
    diff += (joint.Position.Y - lastPosY) * (joint.Position.Y - lastPosY);
    diff += (joint.Position.Z - lastPosZ) * (joint.Position.Z - lastPosZ);
    lastPosX = joint.Position.X;
    lastPosY = joint.Position.Y;
    lastPosZ = joint.Position.Z;
}
if (diff < 9) // this is the threshold my SVM learned
    PersonDetected = true;
Use player IDs and indexes to remember who is who
This ties in with the previous issue: if the Kinect swapped one of the two users it was tracking for someone else, my application would crash because of the sudden changes in data. To counter this, I kept track of each player's skeletal index and their player ID. To learn more about how I did this, see Kinect user Detection.
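For completeness, SDK 1.8 also lets the application decide which skeletons are tracked instead of leaving it to the runtime; a minimal sketch of pinning tracking to a player you have already validated (the lockedTrackingId field is my own bookkeeping, not an SDK member):
int lockedTrackingId; // TrackingId of the validated player, compared against later frames

void LockOntoPlayer(KinectSensor sensor, Skeleton validatedSkeleton)
{
    // The application, not the runtime, now decides which players are tracked,
    // so a passer-by cannot silently replace the current user.
    sensor.SkeletonStream.AppChoosesSkeletons = true;

    lockedTrackingId = validatedSkeleton.TrackingId;
    sensor.SkeletonStream.ChooseSkeletons(lockedTrackingId);
}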
Add adjustable parameters to adapt to varying situations
Where I was presenting, the same tilt angle and other basic Kinect parameters (like near mode) did not work in the new environment. Let the user adjust some of these parameters so they can get the best setup for the job.
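For example (SDK 1.8), the tilt angle and near mode can be driven from a settings screen or config file rather than hard-coded; the AppSettings type below is just a hypothetical stand-in for whatever configuration mechanism you use:
void ApplySensorSettings(KinectSensor sensor, AppSettings settings)
{
    // AppSettings is a hypothetical settings object exposing TiltAngle and UseNearMode.
    // Clamp the requested tilt to what the hardware actually supports.
    int tilt = Math.Max(sensor.MinElevationAngle,
               Math.Min(sensor.MaxElevationAngle, settings.TiltAngle));
    sensor.ElevationAngle = tilt; // moves the motor, so call sparingly (see the next point)

    // Near mode is only supported by Kinect for Windows hardware.
    sensor.DepthStream.Range = settings.UseNearMode ? DepthRange.Near : DepthRange.Default;
}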
Expect people to do stupid things
The next time I presented, I had made the tilt adjustable, and you can guess what happened: someone burned out the Kinect's motor. Anything on the Kinect that can be broken, someone will break. Leaving a warning in your documentation will not be sufficient. Add cautionary checks around the Kinect's hardware so people don't get upset when they break something inadvertently. Here is some code that checks whether the user has adjusted the motor more than 20 times in two minutes.
int motorAdjustments = 0;
DateTime firstAdjustment;
...
// in motor adjustment code
if (motorAdjustments == 0)
    firstAdjustment = DateTime.Now;
++motorAdjustments;

if (motorAdjustments < 20)
{
    // adjust the tilt
}
else
{
    // limit reached: only allow adjustments again once two minutes have passed
    if (DateTime.Now > firstAdjustment.AddMinutes(2))
    {
        // reset all variables
        motorAdjustments = 1;
        firstAdjustment = DateTime.Now;
        // adjust the tilt
    }
}
I would note that all of these were issues for me with the first version of the Kinect, and I don't know how many of them have been solved in the second version, as I sadly haven't gotten my hands on one yet. However, I would still implement some of these techniques, if only as back-up checks, because there will always be exceptions, especially in computer vision.