OpenCV ORB descriptor - how exactly is it stored in a set of bytes?

I'm currently using OpenCV's ORB features extractor and I did notice the strange (at least for me) way the ORB-descriptor is stored (it is basically a BRIEF-32 with a modification that is not relevant to my question). As some of you know ORB takes the keypoints extracted using a modified FAST-9 (circle radius = 9 pixels; also stores orientation of the keypoint) and uses those with a modified BRIEF-32 descriptor to store the feature that the keypoint represents.

BRIEF (ORB version) works as follows: we take a 31x31 pixels patch (represents a feature) and create a bunch of random 5x5 pixels test points. Then we take pairs of those points and evaluate their intensities resulting in a binary decision (0 or 1) based on whether the intensity of the first point in the pair is greater or less equal than the intensity of the second one. Then we take all those bits and use a basic sum-formula to build a binary string of length n (for BRIEF-32 we have 32 bytes * 8 = 256 bit long binary string):

SUM(2^(i-1)*bit_pair_test)

where the bit_pair_test is the bit value that we've calculated from the test for a pair of test points. The final result is something like (for a set of binary tests (...,0,1,0,1,1)):

(2⁰*1) + (2¹*1) + (2²*0) + (2³*1) + (2⁴*0) + ...

Now the way OpenCV's ORB stores those bitstrings is what is a mystery to me. If we look at the matrix that contains the descriptors for a whole image with each row being a single descriptor for a single keypoint we can see that each descriptor has 32 8-bit numbers, which altogether results in those 256 bits that BRIEF-32 uses to store the information. I don't understand why we split those 256 bits in 32 bytes. The official documentation (http://docs.opencv.org/trunk/doc/py_tutorials/py_feature2d/py_brief/py_brief.html) only says that OpenCV stores such descriptors in bytes, however it does not explain why it does that. I've considered three possibilities without excluding the possibility that some combination of those might be the answer:

Some storage technique that I just can't see
Some performance issue with computing Hamming distance for binary string that long (256 bits in our case)
Optimization on the matching process - the matching basically compares the descriptor of a keypoint from one image to the descriptor of a keypoint in a second image. Because we have binary strings Hamming distance is the obvious choice here. It might be that somehow each of those 32 sub-strings is compared to their counterparts in the descriptor of the other keypoint in the second image (sub-string at position 0 (keypoint X, image 1) with sub-string at position 0 (keypoint Y, image 2). Finally it might be that OpenCV says: "Okay, we have 80% match rate of the descriptor since approx. 26 of all sub-string are the same in both descriptors) so we have a winner." However I was unable to find any evidence to confirm that.

PS: You can read the paper on ORB here and the paper on BRIEF here.

Solution

The choice of 8 and 32 bits pattern is due to storage and efficiency issues.

Matching process

The bit order in BRIEF, ORB and BRISK is not relevant (unlike FREAK). Thus, all bits of these descriptors are of equal significance, and you can't just compare the first part of the bitstream, etc.

On the other had, FREAK was designed with such a matching process (called a cascade in FREAK's paper) in mind.

Storage issues

Well, computers don't store individual bits. Thus, you won't see anybody storing BRIEF and the like in bit arrays.

The smallest component that can be read from memory is a byte (corresponding usually to 8 bits, although some DSPs cannot read chunks that are smaller than 16 bits, but that's another story). Thus, you could see people storing their descriptors in byte arrays (type unsigned char in C/C++, which is the underlying OpenCV implementation language).

Plus, memory accesses are usually better (faster) when the variables are aligned on CPU word boundaries. Most CPU's nowadays have words of 32 or 64 bits, 32 bits words being the better choice because 64 bits architectures were designed with the legacy 32-bits processors in mind.

Efficiency issues

Hamming distance is computed via an XOR operation. It happens that many processors have dedicated instruction sets that can compute XOR efficiently with 32 bits words (the common size of an integer before 64 bits CPUs became more common). Even more, they may also support computing several XOR values on multiple 32 bits words in parallel, which is a parallelism technique called SIMD (Single Input Multiple Data). For example, SSE extensions can be leveraged to further accelerate the Hamming distance computation of BRIEF/ORB/... descriptors whose size is a multiple of 32 bits.