I am building Morton number for spatial indexing, I have 8 unsigned 16 bit numbers that will turn into __int128 number. The efficiency is crucial, so naive solution (loop over everything) or building separate 8 128bit numbers is too expensive.
I am using GCC, the target machine is 64 bits but without BMI2 support.
How can I speed up the computation?
If your machine is x86 and supports SSE2, there is a clever answer using movmsk
instructions. Google SSE2 bit matrix transpose
for full code.