A similar question has been asked here. However I could not understand it clearly.
I understand that SIFT computation has the following steps:
My question is for the fourth step: How to set the region over which the SIFT descriptor is computed? Also how is the shape of the region for SIFT computation determined?
Suppose the scale space extrema was found at scale "s" in the second octave. I use the gradient orientation to align to a canonical orientation. How do I set the region of computation of the SIFT descriptor using these information? Do I use the scale or the magnitude of the gradient to find the region on which SIFT is to be computed? Also how is the shape of the region determined?
So this was surprisingly tricky to find an answer for. David Lowe's original paper only seemed to provide vague theoretical explanation on how his algorithm worked. And as far as I know, his official implementation never had its feature descriptor code open-sourced. So I'm basing my answer off what I consider the next-most canonical implementation of the SIFT algorithm, being Rob Hess' OpenSIFT implementation; which became the base for OpenCV's official implementation.
Anyway, here is my understanding of how SIFT roughly works:
Once you have located your extrema, you should know which octave & interval of the Gaussian Pyramid the extrema belongs to. Based on Rob's code (these two functions on lines 1026-1112), the feature descriptor is calculated from the blurred image of that octave & interval. And the region for calculating SIFT is a square shape surrounding the keypoint. This medium article also seems to agree (see illustration).
The SIFT formula for the Gaussian Kernel scale, relative to the original image size is (reference):
base_scale * 2^(octave + interval / intervals_per_octave)
Or this formula if working relative to the halved image in each octave:
base_scale * 2^(interval / intervals_per_octave)
Where the original paper defined the parameters through experiments as:
base_scale = 1.6
and intervals_per_octave = 3
So if your SIFT was set to have 3 intervals per octave
, with a base Gaussian scale of 1.6
, and the extrema was found on octave 2
, interval 3
;
the image will have been blurred by a Gaussian Kernel of scale : 1.6 * 2^(2 + 3/3) = 12.80 pixels
Now the actual array size of the Gaussian kernel will depend on the code you use, as the scale and the kernel size can be set independently.
In cases like MATLAB, I've found a helpful guidelines from this SO thread.
The selected answer recommends kernel width of 6 times the scale (i.e. 3 sigma rule), our kernel width (and height) is 12.80 * 6 ≈ 77 pixels
;
thus, a SIFT descriptor region of size 77x77
pixels.
Meanwhile, the OpenCV implementation appears to leave the size of the kernel to be determined by OpenCV's own built-in Gaussian Blur function.
Line 246 from OpenCV's code leaves the Gaussian Blur function parameter ksize
as zeroes,
which the official docs only states the kernel size will be "computed from sigma", and never defines how it is actually calculated...
Finally, for Rob's implementation, I have to admit that I couldn't quite understand what was happening in this final step. ¯\_(ツ)_/¯
From lines 1026-1112 Rob defined the code below, which shows show how he calculates the orientation histogram for the SIFT descriptor.
The code shows he defined a radius
and used the nested for-loops with i
and j
to iterate through the square region around the keypoint, located at point (r,c)
.
Yet what I don't really understand is:
radius
, with the Gaussian scale scl
multiplied with some unknown constant SIFT_DESCR_SCL_FCTR = 3.0
hist_width * sqrt(2) * ( d + 1.0 ) * 0.5 + 0.5
, where d = SIFT_DESCR_WIDTH = 4
hist_width = SIFT_DESCR_SCL_FCTR * scl;
radius = hist_width * sqrt(2) * ( d + 1.0 ) * 0.5 + 0.5;
for( i = -radius; i <= radius; i++ )
for( j = -radius; j <= radius; j++ )
{
/*
Calculate sample's histogram array coords rotated relative to ori.
Subtract 0.5 so samples that fall e.g. in the center of row 1 (i.e.
r_rot = 1.5) have full weight placed in row 1 after interpolation.
*/
c_rot = ( j * cos_t - i * sin_t ) / hist_width;
r_rot = ( j * sin_t + i * cos_t ) / hist_width;
rbin = r_rot + d / 2 - 0.5;
cbin = c_rot + d / 2 - 0.5;
if( rbin > -1.0 && rbin < d && cbin > -1.0 && cbin < d )
if( calc_grad_mag_ori( img, r + i, c + j, &grad_mag, &grad_ori ))
{
grad_ori -= ori;
while( grad_ori < 0.0 )
grad_ori += PI2;
while( grad_ori >= PI2 )
grad_ori -= PI2;
obin = grad_ori * bins_per_rad;
w = exp( -(c_rot * c_rot + r_rot * r_rot) / exp_denom );
interp_hist_entry( hist, rbin, cbin, obin, grad_mag * w, d, n );
}
}
But regardless of how the exact size of the region is calculated, I think the general concept is the same. To calculate the region size based on the original Gaussian scale. Besides, given that the features are supposed to be "weighted by a Gaussian window" (original paper, section 6.1, page 15); as long as the region you define is large enough to contain most of the meaningful orientation histograms, you are fine.