[SOLVED] Morton curve for non cubic areas that are a power of two

While optimizing a ray tracer I was trying to improve data locality for the intersection datastructure using a Morton space filling curve, except that my 3D space is not cubic (eg. 512x512x256). All sides are a power of two, but not all sides are the same length.

I have been unable to find any examples for non square Morton curves where the sides are a power of two. If it matters I can guarantee that the x/y axis are the same size with only the z axis being a different length.

Note how the width is 2x height, but it could also be 3x or 4x or any other. I have been unable to find a way how to do this.

Ideally the solution would be fast as the Morton code has to be calculated a lot. So my question is: How do I generate a space filling morton curve for non-cubic spaces? This is specifically for the GPU (Cuda).

The conditions on the dimensions are:
x, y, z are a power of two
x == y
x, y >= z
Or if easier
x, y > z

It would probably break for, say, nz=11, but for half of the size of the XY square it seems to work for me

#include <cstdint>
#include <iostream>

static inline uint32_t spread(uint32_t x)
{
    x = (x | (x << 10)) & 0x000F801F;
    x = (x | (x << 4)) & 0x00E181C3;
    x = (x | (x << 2)) & 0x03248649;
    x = (x | (x << 2)) & 0x09249249;

    return x;
}

static inline uint32_t morton(const uint32_t x, const uint32_t y, const uint32_t z)
{
    return spread(x) << 0 | spread(y) << 1 | spread(z) << 2;
}


auto main() -> int {
    int nx = 32;
    int ny = 32;
    int nz = 16;

    for (int iz = 0; iz != nz; ++iz)
    {
        for (int iy = 0; iy != ny; ++iy)
        {
            for (int ix = 0; ix != nx; ++ix)
            {
                auto m = morton(ix, iy, iz);

                std::cout << m << '\n';
            }
        }
    }

    return 0;
}

UPDATE

How to make Morton code work for, say, 256x256x64 (8bit*8bit*6bit): you have to spread X and Y non-equidistantly, taking into account number of bits in Z. Basically, for cube you spread evenly: each bit at position 0, 3, 6, 9, 12, 15, 18, 21, 24, leaving space for other two bits from orthogonal axes.

So there is equidistant spread for a cube. But for the case when you have only 6 bits from Z to insert, you have to have 6 distances of 3, but no Z bits for last gap, thus last gap for X and Y spread should be only 1 bit wide. Thus, non-equidistant spread in X and Y.

Something along the line: if Nx=Ny are number of bits in XY plane, and Nz!=Nx or Ny is number of bits along Z axis, spread gap should be 2 bits for Nz bits and gap of 1 bit for what is left. So two spread routines - one for X&Y with non-equidistant spread which now depends on Nz, and existing spread function for Z axis.

Ok, here is a working version, seems to be doing right thing

#include <cstdint>
#include <iostream>

#define func auto

func spreadZ(uint32_t v) -> uint32_t { // 2bit gap spread
    v = (v | (v << 10)) & 0x000F801F;
    v = (v | (v << 4)) & 0x00E181C3;
    v = (v | (v << 2)) & 0x03248649;
    v = (v | (v << 2)) & 0x09249249;

    return v;
}

func spreadXY(const uint32_t v, const uint32_t bitsZ) -> uint32_t {
    uint32_t mask_z = (1U << bitsZ) - 1U; // to mask bits which are going to have 2bit gap

    uint32_t lo{ v & mask_z }; // lower part of the value where there are Z bits
    lo = spreadZ(lo);          // 2bit gap spread

    uint32_t hi = v >> bitsZ; // high part of the value, 1bit gap

    // 1bit gap spread
    hi = (hi ^ (hi << 8)) & 0x00ff00ffU;
    hi = (hi ^ (hi << 4)) & 0x0f0f0f0fU;
    hi = (hi ^ (hi << 2)) & 0x33333333U;
    hi = (hi ^ (hi << 1)) & 0x55555555U;

    return lo + (hi << 3*bitsZ); // combine them all together
}

func morton(const uint32_t x, const uint32_t y, const uint32_t z, const uint32_t bitsZ) -> uint32_t {
    return spreadXY(x, bitsZ) << 0 | spreadXY(y, bitsZ) << 1 | spreadZ(z) << 2;
}

func ispowerof2(const uint32_t n) -> bool {
    return n && (!(n & (n - 1u)));
}

func bit_pos(const uint32_t n) -> uint32_t {
    if (!ispowerof2(n))
        throw -1;

    uint32_t i{ 1u }, pos{ 1u };

    while (!(i & n)) { // Iterate through bits of n till we find a set bit, i&n will be non-zero only when 'i' and 'n' have a same bit         
        i = i << 1; // unset current bit and set the next bit in 'i' 

        ++pos; // increment position 
    }

    return pos;
}

func main() -> int {
    int nx = 256;
    int ny = 256;

    int nz = 256; //256...128...64...32...16...8...4...2...1 all works
    int bitsZ = bit_pos(nz) - 1; // should be doing try/catch

    for (int iz = 0; iz != nz; ++iz)
    {
        for (int iy = 0; iy != ny; ++iy)
        {
            for (int ix = 0; ix != nx; ++ix)
            {
                auto m = morton(ix, iy, iz, bitsZ);

                std::cout << m << '\n';
            }
        }
    }

    return 0;
}