I'm learning about the Viola-Jones detection framework and I read that it uses a 24x24 base detection window[1][2]. I'm having trouble understanding this base detection window. Let's say I have an image of size 1280x960 pixels with 3 people in it. When I try to perform face detection on this image, will the algorithm:
Any help is appreciated, even a link to another explanation.
Source: https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf
[1] - page 2, last paragraph before Integral images
[2] - page 4, Results
Does this video help? It is 40 minutes long.
Adam Harvey Explains Viola-Jones Face Detection
Also known as Haar cascades, the algorithm is very popular for face detection.
About halfway down that page is another video which shows a super slow-mo scan in progress, so you can see how the window starts small (although much larger than 24x24 for the purpose of demonstration) and shifts around the image pixel by pixel, then does it again and again on successively larger square portions. At every scale, each window is still evaluated as though it had been resampled to the 24x24 base size.
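If it helps to see the scan as code, here is a rough sketch of the window enumeration only (no classification). The function name `sliding_windows` and the parameters `scale_factor=1.25` and `step=1` are my own assumptions for illustration; real implementations choose their own scale factor and stride, so treat this as a picture of the idea rather than the detector's actual code.

```python
def sliding_windows(img_w, img_h, base=24, scale_factor=1.25, step=1):
    """Yield (x, y, size) for every square window in a multi-scale scan.

    base is the detector's native window size; scale_factor and step
    are assumed values, not taken from any particular implementation.
    """
    scale = 1.0
    while True:
        size = int(round(base * scale))
        if size > min(img_w, img_h):
            break  # the window no longer fits inside the image
        # Shift by a stride that grows with the window (at least 1 pixel).
        stride = max(1, int(round(step * scale)))
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield x, y, size
        scale *= scale_factor  # next pass uses a larger square


if __name__ == "__main__":
    # Count how many candidate windows a 1280x960 image would produce.
    total = sum(1 for _ in sliding_windows(1280, 960))
    print("candidate windows:", total)
```

Every `(x, y, size)` it yields is one candidate that the detector treats as a 24x24 patch, regardless of how big `size` is in the original image.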
You can also see how it quickly rejects many of those windows and spends most of its time in areas that seem face-like while it computes more and more complex comparisons that become more stringent. This is where the term "cascade" comes into play.
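The early-rejection idea can be sketched in a few lines. This is a toy, not the trained detector: `cascade_classify` is a hypothetical name, and each stage here is just an assumed `(score_fn, threshold)` pair standing in for a boosted group of Haar-like features.

```python
def cascade_classify(window, stages):
    """Run a window through the stages in order.

    stages is a list of (score_fn, threshold) pairs (an assumed,
    simplified representation). Returns (is_face, stages_evaluated).
    """
    for i, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, i  # rejected early; later stages never run
    return True, len(stages)  # survived every stage


if __name__ == "__main__":
    # Toy stages: a cheap check first, a stricter one second.
    toy_stages = [
        (lambda w: sum(w), 1),  # stage 1: any signal at all?
        (lambda w: max(w), 5),  # stage 2: a strong enough response?
    ]
    for w in ([0, 0], [1, 1], [1, 6]):
        print(w, cascade_classify(w, toy_stages))
```

Because most windows in a typical image fail one of the first cheap stages, the expensive later stages only run on the few windows that look face-like, which is why the scan in the video lingers there.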