[SOLVED] What is the fastest way to generate all n-bit gray codes using NumPy?

My goal is to create images using gray codes, an example would be this:

It is all modulo 64 groups in gray codes in polar form.

Now of course I know of the simple mapping n ^ (n >> 1) from binary, but I had found more efficient ways to generate gray codes directly than using said mapping. But since binary codes are related I will post code that generates binary codes as well.

I want a function that generates all n-bit gray codes in the form np.zeros((1 << n, n), dtype=bool). I want it as efficient as possible, and it has to be implemented in numpy and only implemented in numpy, no other libraries are allowed.

Why do I disallow other libraries? Because I have installed scipy, PIL, cv2, matplotlib, numba... all of them require different versions of numpy and updating one breaks the dependency of another, and all of them provide a number of methods, it is a huge learning curve to know how to use them well. I am currently trying to familiarize myself with numpy, so I invented this challenge to make myself learn.

I have implemented a bunch of different methods, they all work correctly, I have rigorously tested them, but none of them strikes me as efficient. So far, I have found that np.unpackbits is the most efficient method to get binary bits of a number, but it only works with np.uint8, that is easy to solve, just using .view(np.uint8), but the output is in mixed endianness and that is somewhat trickier to solve. But even if I use np.unpackbits, converting it from binary to gray code is less efficient than generating gray codes directly.

And according to my tests, np.concatenate(arrs) is more efficient than np.vstack(arrs), np.concatenate(arrs, axis=-1) beats np.hstack(arrs), and np.concatenate(arrs).reshape((w, h)).T beats np.dstack(arrs). And somehow initializing an array and then broadcasting to individual columns using a loop can be more efficient than using np.concatenate.

And using numpy broadcasting to get a & b column-wise in which a is a 1d array and b is a 1d array to get binary decomposition of a can be much less efficient than just looping through the columns and apply & column by column. In particular, (a & b[:, None]).T.astype(bool) is much more efficient than (a[:, None] & b).astype(bool).

Code

import numpy as np

lo = 1
hi = 8
UINT_BITS = {}
for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
    for i in range(lo, hi + 1):
        UINT_BITS[i] = dtype

    lo = hi + 1
    hi <<= 1


def get_dtype(n: int) -> np.uint8 | np.uint16 | np.uint32 | np.uint64:
    if dtype := UINT_BITS.get(n):
        return dtype

    raise ValueError(f"Argument {n} is not a valid bit width")


def validate(n: int) -> None:
    if not (n and isinstance(n, int)):
        raise ValueError(f"Argument {n} is not a valid bit width")


def binary_codes_0(n: int) -> np.ndarray:
    validate(n)
    count = 1 << n
    rect = np.zeros((count, n), dtype=bool)
    r = 1
    for i in range(n - 1, -1, -1):
        count >>= 1
        rect[:, i] = np.tile(
            np.concatenate([np.zeros(r, dtype=bool), np.ones(r, dtype=bool)]), count
        )
        r <<= 1

    return rect


def binary_codes_1(n: int) -> np.ndarray:
    validate(n)
    r = total = 1 << n
    return (
        np.concatenate(
            [
                np.tile(
                    np.concatenate(
                        [np.zeros((r := r >> 1), dtype=bool), np.ones(r, dtype=bool)]
                    ),
                    1 << i,
                )
                for i in range(n)
            ]
        )
        .reshape((n, total))
        .T
    )


def binary_codes_2(n: int) -> np.ndarray:
    validate(n)
    chunks = np.array([(0,), (1,)], dtype=bool)
    l = 2
    for _ in range(n - 1):
        chunks = np.concatenate(
            [
                np.concatenate([np.zeros((l, 1), dtype=bool), chunks], axis=-1),
                np.concatenate([np.ones((l, 1), dtype=bool), chunks], axis=-1),
            ]
        )
        l <<= 1

    return chunks


def binary_codes_3(n: int) -> np.ndarray:
    validate(n)
    rect = np.zeros([2] * n + [n], dtype=bool)
    for i, a in enumerate(np.ix_(*[(0, 1)] * n)):
        rect[..., i] = a

    return rect.reshape(-1, n)


def binary_codes_4(n: int) -> np.ndarray:
    numbers = np.arange(1 << n, dtype=get_dtype(n))
    return (
        np.concatenate([(numbers & 1 << i).astype(bool) for i in range(n - 1, -1, -1)])
        .reshape(n, 1 << n)
        .T
    )


def binary_codes_5(n: int) -> np.ndarray:
    numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
    result = np.zeros((count, n), dtype=bool)
    mask = count
    for i in range(n):
        result[:, i] = numbers & (mask := mask >> 1)

    return result


def binary_codes_6(n: int) -> np.ndarray:
    return np.unpackbits(
        np.arange(1 << n, dtype=get_dtype(n))[:, None].view(np.uint8),
        axis=1,
        bitorder="little",
        count=n,
    )[:, ::-1]


def binary_codes_7(n: int) -> np.ndarray:
    validate(n)
    return np.array(np.meshgrid(*[(0, 1)] * n, indexing="ij")).reshape((n, 1 << n)).T


def gray_codes_0(n: int) -> np.ndarray:
    numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
    gray = numbers ^ (numbers >> 1)
    return (
        np.concatenate([(gray & 1 << i).astype(bool) for i in range(n - 1, -1, -1)])
        .reshape((n, count))
        .T
    )


def gray_codes_1(n: int) -> np.ndarray:
    numbers = np.arange((count := 1 << n), dtype=get_dtype(n))
    gray = numbers ^ (numbers >> 1)
    result = np.zeros((count, n), dtype=bool)
    for i in range(n):
        result[:, i] = gray & (count := count >> 1)

    return result


def gray_codes_2(n: int) -> np.ndarray:
    validate(n)
    binary = binary_codes_6(n)
    shifted = np.roll(binary, 1, axis=-1)
    shifted[:, 0] = 0
    return binary ^ shifted


def gray_codes_3(n: int) -> np.ndarray:
    validate(n)
    gray = np.array([(0,), (1,)], dtype=bool)
    l = 2
    for _ in range(n - 1):
        gray = np.concatenate(
            [
                np.concatenate([np.zeros((l, 1), dtype=bool), gray], axis=-1),
                np.concatenate([np.ones((l, 1), dtype=bool), gray[::-1]], axis=-1),
            ]
        )
        l <<= 1

    return gray

Testing

import numpy as np
zeros = np.zeros(524288, dtype=bool)
ones = np.ones(524288, dtype=bool)
zeros1 = np.zeros((524288, 32), dtype=bool)
ones1 = np.ones((524288, 32), dtype=bool)
million = [list(range(i*4096, i*4096+4096)) for i in range(256)]
numbers = np.arange(1 << 16, dtype=np.uint64)
mask = np.array([1 << i for i in range(15, -1, -1)], dtype=np.uint64)

In [3]: %timeit (numbers & mask[:, None]).T.astype(bool)
4.1 ms ± 97.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit (numbers[:, None] & mask).astype(bool)
6.1 ms ± 423 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit binary_codes_5(16)
2.02 ms ± 19.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit binary_codes_4(16)
2.32 ms ± 27.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit np.hstack([zeros, ones])
312 μs ± 12.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit np.concatenate([zeros, ones])
307 μs ± 9.97 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit np.vstack([zeros, ones])
315 μs ± 11.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: %timeit np.hstack([zeros1, ones1])
19.8 ms ± 800 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit np.concatenate([zeros1, ones1], axis=-1)
18.1 ms ± 265 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit np.concatenate([zeros1, ones1])
9.73 ms ± 413 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit np.vstack([zeros1, ones1])
10.3 ms ± 229 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [14]: %timeit np.dstack(million)[0]
78.7 ms ± 973 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %timeit np.concatenate(million).reshape((256, 4096)).T
69.9 ms ± 251 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit binary_codes_0(16)
2.32 ms ± 18 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [17]: %timeit binary_codes_1(16)
6.37 ms ± 182 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [18]: %timeit binary_codes_2(16)
1.46 ms ± 28 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [19]: %timeit binary_codes_3(16)
1.64 ms ± 29.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [20]: %timeit binary_codes_6(16)
1.12 ms ± 9.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [21]: %timeit gray_codes_0(16)
2.12 ms ± 25.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %timeit gray_codes_1(16)
2.17 ms ± 29 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [23]: %timeit gray_codes_2(16)
4.51 ms ± 151 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [24]: %timeit gray_codes_3(16)
1.46 ms ± 19.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Is there a more efficient way to generate all n-bit gray codes?

I have figured out how to use np.meshgrid to do Cartesian product, and it is much slower than expected. I have edited the code above to include it.

In [82]: %timeit binary_codes_7(16)
6.96 ms ± 249 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [83]: %timeit binary_codes_5(16)
1.74 ms ± 36.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [84]: %timeit binary_codes_3(16)
1.65 ms ± 15.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [85]: %timeit np.meshgrid(*[(0, 1)] * 16, indexing="ij")
4.33 ms ± 49.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [86]: np.all(np.array(np.meshgrid(*[(0, 1)] * 5, indexing="ij")).reshape((5, 32)).T == binary_codes_3(5))
Out[86]: np.True_

Now I have implemented everything I can think of.

At this point I have realized this problem is extremely simple, we don't even need any bit operations at all.

In both binary and gray codes, each cell can only have two values, zero and one. It can only be one or zero.

Now if we have np.zeros((1 << n, n), dtype=bool) our job is half way done. Exactly half of the cells have the correct value: zero. We just have to flip the ones.

If we look at the sequences row-wise, there isn't much we can do; but if we look at the columns, it just repeats. There are groups of ones with equal length separated by groups of zeros with the same length.

We can just create a 1d array as a binary mask to flip everything on for each column except those gaps. Job done. The question is, how?

The rightmost column in binary is straightforward, just do arr[:, -1][1::2] = 1. But what about the second last column? It needs to be (0, 0, 1, 1) repeat, in other words every other pair of cells are ones, I know the indices of the start and end points, it needs to be on in [range(2, 4), range(6, 8), range(10, 12)...] but what is the simplest way to tell the computer to flip those cells? And the third last column, the bands of ones are [range(4, 8), range(12, 16), range(20, 24)...], how do I flip those cells?

Surprisingly I haven't found a good answer, or perhaps unsurprising, since Google is useless, but I did find this: Indexing in NumPy: Access every other group of values.

And no, this is not a duplicate, because doing reshape then ravel for each column would be terribly inefficient, and that doesn't create a boolean mask for indexing the array, it creates a smaller array...

Currently I can do this:

arr = np.zeros((16, 4), dtype=bool)
l = 1
for i in (3, 2, 1, 0):
    l2 = l * 2
    for a, b in zip(range(l, 16, l2), range(l2,17,l2)):
        arr[:, i][a:b] = 1
    
    l = l2

But this supposedly is slow (I haven't benchmarked this), however if this is implemented in numpy then I think this would be the most efficient algorithm for this type of sequences. The question is, how to implement this?