I want to create a 1D array of length n, where every element is either 0 or 1, arranged in alternating runs of 0s and 1s. Every run has the same length as every other run: each run of 1s is followed by a run of 0s of the same length and vice versa, so the pattern is periodic.
For example, if we start with five 0s, the next run is guaranteed to be five 1s, and then five 0s, and so on.
To better illustrate what I mean:
In [65]: (np.arange(100) // 5) & 1
Out[65]:
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
The runs can also be shifted, meaning we don't have a full run at the start:
In [66]: ((np.arange(100) - 3) // 7) & 1
Out[66]:
array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
Now as you can see I have found a way to do it; in fact I have found three ways, but all are flawed. The method above works with shifted runs but is slow; another is faster but doesn't allow shifts.
In [82]: %timeit (np.arange(524288) // 16) & 1
6.45 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [83]: range1 = np.arange(524288)
In [84]: %timeit (range1 // 16) & 1
3.14 ms ± 201 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [85]: %timeit np.tile(np.concatenate([np.zeros(16, dtype=np.uint8), np.ones(16, dtype=np.uint8)]), 16384)
81.6 μs ± 843 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [86]: %timeit np.repeat([0, 1], 262144).reshape(32, 16384).T.flatten()
5.42 ms ± 74.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [87]: np.array_equal((range1 // 16) & 1, np.tile(np.concatenate([np.zeros(16, dtype=np.uint8), np.ones(16, dtype=np.uint8)]), 16384))
Out[87]: True
In [88]: np.array_equal(np.repeat([0, 1], 262144).reshape(32, 16384).T.flatten(), np.tile(np.concatenate([np.zeros(16, dtype=np.uint8), np.ones(16, dtype=np.uint8)]), 16384))
Out[88]: True
Is there a way faster than the np.tile-based solution that also allows shifts?
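For reference, the np.tile approach can be extended to shifted runs by tiling one extra period and slicing into the result. This is a sketch, assuming 0 <= shift <= 2*run and that 2*run divides n; the names n, run and shift are illustrative:

```python
import numpy as np

n, run, shift = 524288, 16, 3  # illustrative parameters
# Build one spare period so the slice can start mid-run.
period = np.concatenate([np.zeros(run, np.uint8), np.ones(run, np.uint8)])
full = np.tile(period, n // (2 * run) + 1)
# full[p] == (p // run) & 1, so starting the slice at 2*run - shift
# reproduces ((np.arange(n) - shift) // run) & 1.
out = full[2 * run - shift : 2 * run - shift + n]
```

The slice is a view, so the extra cost over plain np.tile is only one additional period.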
I have made the code blocks output the same result for fair comparison, and for completeness I have added another inefficient method.
Another method:
In [89]: arr = np.zeros(524288, dtype=np.uint8)
In [90]: %timeit arr = np.zeros(524288, dtype=np.uint8)
19.9 μs ± 156 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [91]: arr[262144:] = 1
In [92]: %timeit arr[262144:] = 1
9.91 μs ± 52 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [93]: %timeit arr.reshape(32, 16384).T.flatten()
932 μs ± 11.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [94]: %timeit arr.reshape(32, 16384).T
406 ns ± 1.81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [95]: %timeit list(arr.reshape(32, 16384).T.flat)
24.7 ms ± 242 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As you can see, np.repeat is extremely slow, creating the array and assigning 1 to half the values is very fast, and arr.reshape(32, 16384).T is extremely fast, but flattening the transposed view is very slow.
Since you are almost certainly on Windows (which is not mentioned in this question and is critical for performance) and you likely use pip packages, your Numpy implementation has been compiled with MSVC, which does not use the SIMD units of the CPU for many functions, including np.tile. This means your Numpy package is pretty inefficient.
One simple solution is not to use such a package, but a build of Numpy compiled with Clang (or not to use Windows), as previously mentioned in a past answer. Most functions should then be automagically faster with no effort.
Another solution is to use a trick (SLP vectorization) to bypass this fundamental limitation. The idea is to reduce the number of instructions executed by operating on wider items. Here is the code:
arr = np.zeros(524288, dtype=np.uint8)
tmp = arr.view(np.uint64)          # 8 bytes per word, so 4 words per 32-byte period
tmp[2::4] = 0x01010101_01010101    # set bytes 16..23 of each period to 1
tmp[3::4] = 0x01010101_01010101    # set bytes 24..31 of each period to 1
It takes 47 μs on my machine (with an i5-9600KF CPU, Numpy 2.1.3, on Windows). This is a bit faster than the 57 μs of the np.tile solution (which is the standard way of doing this in Numpy). All other proposed solutions are slower.
Note that the two store operations are sub-optimal, especially for larger arrays. That being said, on Windows a vectorised single store is slower (due to Numpy internals, and more specifically generators).
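The same word-level trick can also accommodate shifts by over-allocating one extra period and slicing the byte view. This is only a sketch under illustrative parameter names; it assumes a run length of 16 (matching the two hard-coded stores) and 0 <= shift <= 32:

```python
import numpy as np

n, run, shift = 524288, 16, 3        # illustrative; run is fixed at 16 here
buf = np.zeros(n + 2 * run, dtype=np.uint8)
words = buf.view(np.uint64)          # 4 words of 8 bytes per 32-byte period
words[2::4] = 0x01010101_01010101    # bytes 16..23 of each period
words[3::4] = 0x01010101_01010101    # bytes 24..31 of each period
out = buf[2 * run - shift : 2 * run - shift + n]  # view, no copy
```

The slice returns a view, so the shift costs essentially nothing beyond the 32 extra bytes written.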
I advise you to choose the first solution instead of using Numpy tricks for sake of performance. If you cannot, another solution is simply to use Cython or Numba for that:
import numpy as np
import numba as nb

@nb.njit('(uint32,uint32)')
def gen(size, repeat):
    res = np.ones((repeat, size), dtype=np.uint8)
    for i in range(repeat):
        for j in range(size // 2):
            res[i, j] = 0
    return res.reshape(-1)
The Numba code only takes 35 µs on my machine. It is cleaner and should be faster on large arrays. It is also easy to debug/maintain. Note that the array allocation takes 40% of the time (14 µs)...
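As a sanity check, here is the same kernel with the Numba decorator removed (so the sketch runs without Numba installed), together with an assumed invocation, gen(32, 16384), that reproduces the 16-zeros/16-ones array of length 524288 from the question:

```python
import numpy as np

# Same loop as the Numba kernel above, decorator removed so it runs anywhere.
def gen(size, repeat):
    res = np.ones((repeat, size), dtype=np.uint8)
    for i in range(repeat):
        for j in range(size // 2):
            res[i, j] = 0
    return res.reshape(-1)

# size = 2 * run_length, repeat = n // size (assumed call, not in the original)
out = gen(32, 16384)
```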