How can I encode/decode Morton codes (Z-order) given [x, y] as 32-bit unsigned integers producing a 64-bit Morton code, and vice versa? I do have xy2d and d2xy, but only for coordinates that are 16 bits wide, producing a 32-bit Morton number. I searched a lot online but couldn't find anything. Please help.
If you can use architecture-specific instructions, you'll likely be able to accelerate the operation beyond what is possible with bit-twiddling hacks.
For example, if you write code for Intel Haswell and later CPUs, you can use the BMI2 instruction set, which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.
Here is a complete example (tested with GCC):
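#include <immintrin.h>
#include <stdint.h>

// On GCC, compile with -mbmi2; requires Haswell or newer (x86-64).

// Interleave: x is deposited into the even bits, y into the odd bits.
// The 64-bit variant _pdep_u64 is needed because the result is 64 bits wide.
uint64_t xy_to_morton(uint32_t x, uint32_t y)
{
    return _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);
}

// De-interleave: extract the even bits into x and the odd bits into y.
void morton_to_xy(uint64_t m, uint32_t *x, uint32_t *y)
{
    *x = (uint32_t)_pext_u64(m, 0x5555555555555555);
    *y = (uint32_t)_pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}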
If you have to support earlier CPUs or the ARM platform, not all is lost. You may still get help, at least for the xy_to_morton function, from instructions intended for cryptography.
A lot of CPUs support carry-less multiplication these days. On ARM that is vmull_p8 (the widening polynomial multiply) from the NEON instruction set. On x86 you'll find it as PCLMULQDQ from the CLMUL instruction set (available since 2010).
The trick here is that a carry-less multiplication of a number with itself returns a bit pattern that contains the original bits of the argument with zero bits interleaved. So it does the same job as the pdep with the 0x5555... mask shown above. E.g. it turns the following byte:
+----+----+----+----+----+----+----+----+
| b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0 |
+----+----+----+----+----+----+----+----+
Into:
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|  0 | b7 |  0 | b6 |  0 | b5 |  0 | b4 |  0 | b3 |  0 | b2 |  0 | b1 |  0 | b0 |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
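To see why squaring spreads the bits, here is a minimal software sketch of a carry-less multiply. It is only meant as an illustration and is far slower than the hardware instruction; the name clmul32_soft is made up for this example.

```c
#include <stdint.h>

/* Software carry-less (polynomial, GF(2)) multiply of two 32-bit values.
   Works like schoolbook multiplication, except that the partial products
   are combined with XOR instead of addition, so no carries propagate.
   The name clmul32_soft is hypothetical. */
uint64_t clmul32_soft(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u)
            result ^= (uint64_t)a << i;   /* XOR in a shifted copy of a */
    }
    return result;
}
```

With a == b all the cross terms cancel in pairs under XOR, so only bit i times bit i survives, and it lands at position 2*i. For instance, clmul32_soft(0xFF, 0xFF) gives 0x5555.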
Now you can build the xy_to_morton function as follows (shown here for the CLMUL instruction set):
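#include <wmmintrin.h>
#include <stdint.h>

// On GCC, compile with -mpclmul.

// Carry-less square of x: bit i of x ends up at bit 2*i of the result.
uint64_t carryless_square(uint32_t x)
{
    uint64_t val[2] = {x, 0};
    // Unaligned load/store: val is not guaranteed to be 16-byte aligned,
    // so dereferencing it through a __m128i pointer would be unsafe.
    __m128i a = _mm_loadu_si128((const __m128i *)val);
    a = _mm_clmulepi64_si128(a, a, 0);
    _mm_storeu_si128((__m128i *)val, a);
    return val[0];   // lower 64 bits of the 128-bit product
}

uint64_t xy_to_morton(uint32_t x, uint32_t y)
{
    return carryless_square(x) | (carryless_square(y) << 1);
}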
_mm_clmulepi64_si128 generates a 128-bit result, of which we only use the lower 64 bits. So you can even improve upon the version above and use a single _mm_clmulepi64_si128 to do the job.
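One way the single-multiply variant might look, as a sketch: pack x into the low half and y into the high half of one 64-bit operand and square that. The spread bits of x then land in the lower 64 bits of the product and those of y in the upper 64 bits. The function name and the software fallback (used when the code is not compiled with -mpclmul) are my additions for illustration.

```c
#include <stdint.h>
#if defined(__PCLMUL__)
#include <emmintrin.h>   /* _mm_loadu_si128, _mm_storeu_si128 */
#include <wmmintrin.h>   /* _mm_clmulepi64_si128 */
#endif

/* Hypothetical name. One carry-less square handles both coordinates. */
uint64_t xy_to_morton_single_clmul(uint32_t x, uint32_t y)
{
    /* x occupies bits 0..31, y occupies bits 32..63 of the operand. */
    uint64_t val[2] = { ((uint64_t)y << 32) | x, 0 };
#if defined(__PCLMUL__)
    __m128i a = _mm_loadu_si128((const __m128i *)val);
    a = _mm_clmulepi64_si128(a, a, 0);      /* low qword times low qword */
    _mm_storeu_si128((__m128i *)val, a);
#else
    /* Fallback without the instruction: spread the bits one at a time. */
    uint64_t lo = 0, hi = 0;
    for (int i = 0; i < 32; i++) {
        lo |= (uint64_t)((x >> i) & 1u) << (2 * i);
        hi |= (uint64_t)((y >> i) & 1u) << (2 * i);
    }
    val[0] = lo;
    val[1] = hi;
#endif
    /* val[0] holds x on the even bits, val[1] holds y on the even bits. */
    return val[0] | (val[1] << 1);
}
```

Whether this actually beats two separate multiplies depends on the CPU, so it is worth measuring before committing to it.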
That is as good as you can get on mainstream platforms (e.g. modern ARM with NEON and x86). Unfortunately, I don't know of any trick to speed up the morton_to_xy function using the cryptography instructions, and I tried really hard for several months.
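For targets where neither BMI2 nor a carry-less multiply is available, the usual fallback is the plain shift-and-mask approach, i.e. the bit-twiddling hack mentioned at the top. The following is a portable sketch of that technique for 32-bit inputs, not something specific to any instruction set; the helper names are made up for this example.

```c
#include <stdint.h>

/* Spread the 32 bits of v onto the even bit positions of a 64-bit word.
   Each step doubles the gap between groups of bits. */
static uint64_t spread_bits(uint32_t v)
{
    uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}

/* Inverse of spread_bits: gather the even bits of x into a 32-bit value. */
static uint32_t compact_bits(uint64_t x)
{
    x &= 0x5555555555555555ULL;
    x = (x | (x >> 1))  & 0x3333333333333333ULL;
    x = (x | (x >> 2))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x >> 4))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x >> 8))  & 0x0000FFFF0000FFFFULL;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFULL;
    return (uint32_t)x;
}

/* x goes to the even bits, y to the odd bits (same layout as above). */
uint64_t xy_to_morton_portable(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

void morton_to_xy_portable(uint64_t m, uint32_t *x, uint32_t *y)
{
    *x = compact_bits(m);
    *y = compact_bits(m >> 1);
}
```

This produces the same bit layout as the pdep/pext version, so the two can be mixed, e.g. encoding with the carry-less multiply and decoding with compact_bits.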