I assume you all know what prime numbers are and what the Sieve of Eratosthenes is, so I won't waste time explaining them.
Now, all prime numbers except 2 are odd, so we only need to check odd numbers. This is very obvious, but worth mentioning because this simple optimization halves the number of candidates we need to check; correspondingly, we only need to mark odd multiples of primes other than 2.
Another simple optimization is to only check numbers up to the square root of the limit, and yet another is to start marking composites at the square of each prime. As the reasons for these should be obvious, I won't explain them; beyond that, I can't think of further optimizations regarding the marking of composites.
But we can narrow down the search space further: all prime numbers except 2, 3, 5 must be congruent to one of [1, 7, 11, 13, 17, 19, 23, 29] modulo 30. This is evident from the nature of the modulus: there are only 30 possible residues; if the residue is even, the number is a multiple of 2, and each of the remaining residues not in the list means the number is a multiple of 3, a multiple of 5, or both. In other words, all prime numbers except 2, 3, 5 must be coprime to 30.
In Python, this is:
import math

wheel3 = [i for i in range(1, 30) if math.gcd(i, 30) == 1]
# [1, 7, 11, 13, 17, 19, 23, 29]
Now we calculate the differences between consecutive elements, starting at 7. Because of the nature of the modulo operation, 1 comes immediately after 29 (for example, 31 % 30 == 1), so the difference between them is 2. We obtain the following: [4, 2, 4, 2, 4, 6, 2, 6].
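For what it's worth, this difference list can be computed rather than typed by hand; a quick sketch (variable names are my own):

```python
import math

wheel3 = [i for i in range(1, 30) if math.gcd(i, 30) == 1]
# rotate so the sequence starts at 7, appending 31 (i.e. 1 + 30) to wrap around
seq = wheel3[1:] + [wheel3[0] + 30]  # [7, 11, 13, 17, 19, 23, 29, 31]
diffs = [b - a for a, b in zip(seq, seq[1:] + [seq[0] + 30])]
print(diffs)  # [4, 2, 4, 2, 4, 6, 2, 6]
```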
Now, out of every 30 numbers we only need to check 8 and can skip the other 22. This is a significant improvement over only brute-forcing odd numbers: that meant processing 15 out of every 30, so we now process 7 fewer, and the search space is narrowed to 8/30 = 4/15, which is 0.2666...
We can optimize this further by using a bigger wheel with 2 * 3 * 5 * 7 = 210 as the basis: all prime numbers from 11 onward must be coprime to 210.
wheel4 = [i for i in range(1, 210) if math.gcd(i, 210) == 1]
wheel4 == [
1 , 11 , 13 , 17 , 19 , 23 ,
29 , 31 , 37 , 41 , 43 , 47 ,
53 , 59 , 61 , 67 , 71 , 73 ,
79 , 83 , 89 , 97 , 101, 103,
107, 109, 113, 121, 127, 131,
137, 139, 143, 149, 151, 157,
163, 167, 169, 173, 179, 181,
187, 191, 193, 197, 199, 209
]
And the list of increments between consecutive residues is:
[
2 , 4 , 2 , 4 , 6 , 2 ,
6 , 4 , 2 , 4 , 6 , 6 ,
2 , 6 , 4 , 2 , 6 , 4 ,
6 , 8 , 4 , 2 , 4 , 2 ,
4 , 8 , 6 , 4 , 6 , 2 ,
4 , 6 , 2 , 6 , 6 , 4 ,
2 , 4 , 6 , 2 , 6 , 4 ,
2 , 4 , 2 , 10, 2 , 10
]
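This table can be sanity-checked programmatically; a generic helper (my own, not from any library) reproduces the increment list for any wheel basis:

```python
import math

def wheel_increments(basis: int) -> list[int]:
    # Residues coprime to the basis, e.g. [1, 7, 11, ..., 29] for basis 30.
    residues = [i for i in range(1, basis) if math.gcd(i, basis) == 1]
    # Start at the second residue and wrap 1 around to 1 + basis.
    seq = residues[1:] + [residues[0] + basis]
    return [b - a for a, b in zip(seq, seq[1:] + [seq[0] + basis])]

print(len(wheel_increments(210)), sum(wheel_increments(210)))  # 48 210
```

The increments of any wheel always sum to the basis, since they telescope over one full cycle.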
Now, out of every 210 numbers we only need to process 48, down from the previous 56 (8 per 30 scales to 56 per 210), so we process 8 fewer, and the search space is narrowed to 48/210 = 8/35, which is 0.22857142857142857... and less than a quarter.
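To double-check the fractions (my own arithmetic sketch, not part of the benchmarks):

```python
from fractions import Fraction

odd_only = Fraction(15, 30)   # odd candidates per 30 numbers
wheel30 = Fraction(8, 30)     # candidates per 30 with the 2*3*5 wheel
wheel210 = Fraction(48, 210)  # candidates per 210 with the 2*3*5*7 wheel

assert wheel30 == Fraction(4, 15)
assert wheel210 == Fraction(8, 35)
assert wheel210 / wheel30 == Fraction(6, 7)  # predicted runtime ratio
```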
So I expected the version using the 210-based wheel to take only 6/7, or about 85.71%, of the time the 30-based wheel version takes to execute, but that isn't so.
import math
import numpy as np
import numba as nb

@nb.njit(cache=True)
def prime_wheel_sieve(n: int) -> np.ndarray:
    wheel = [4, 2, 4, 2, 4, 6, 2, 6]
    primes = np.ones(n + 1, dtype=np.uint8)
    primes[:2] = False
    # pre-mark multiples of the wheel primes 2, 3, 5
    for square, step in ((4, 2), (9, 6), (25, 10)):
        primes[square::step] = False
    k = 7
    lim = int(math.sqrt(n) + 0.5)
    i = 0
    while k <= lim:
        if primes[k]:
            primes[k**2 :: 2 * k] = False
        k += wheel[i]
        i = (i + 1) & 7  # the wheel has 8 entries, so AND with 7 wraps around
    return np.nonzero(primes)[0]
# fmt: off
WHEEL4 = np.array([
2 , 4 , 2 , 4 , 6 , 2 ,
6 , 4 , 2 , 4 , 6 , 6 ,
2 , 6 , 4 , 2 , 6 , 4 ,
6 , 8 , 4 , 2 , 4 , 2 ,
4 , 8 , 6 , 4 , 6 , 2 ,
4 , 6 , 2 , 6 , 6 , 4 ,
2 , 4 , 6 , 2 , 6 , 4 ,
2 , 4 , 2 , 10, 2 , 10
], dtype=np.uint8)
# fmt: on
@nb.njit(cache=True)
def prime_wheel_sieve4(n: int) -> np.ndarray:
    primes = np.ones(n + 1, dtype=np.uint8)
    primes[:2] = False
    for square, step in ((4, 2), (9, 6), (25, 10), (49, 14)):
        primes[square::step] = False
    k = 11
    lim = int(math.sqrt(n) + 0.5)
    i = 0
    while k <= lim:
        if primes[k]:
            primes[k**2 :: 2 * k] = False
        k += WHEEL4[i]
        i = (i + 1) % 48
    return np.nonzero(primes)[0]
In [549]: np.array_equal(prime_wheel_sieve(65536), prime_wheel_sieve4(65536))
Out[549]: True
In [550]: %timeit prime_wheel_sieve(65536)
161 μs ± 1.13 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [551]: %timeit prime_wheel_sieve4(65536)
163 μs ± 1.79 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [552]: %timeit prime_wheel_sieve4(131072)
330 μs ± 10.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [553]: %timeit prime_wheel_sieve(131072)
328 μs ± 7.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [554]: %timeit prime_wheel_sieve4(262144)
680 μs ± 14.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [555]: %timeit prime_wheel_sieve(262144)
669 μs ± 7.79 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [556]: %timeit prime_wheel_sieve(524288)
1.44 ms ± 16.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [557]: %timeit prime_wheel_sieve4(524288)
1.48 ms ± 13.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [558]: %timeit prime_wheel_sieve4(1048576)
3.25 ms ± 81.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [559]: %timeit prime_wheel_sieve(1048576)
3.23 ms ± 115 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [560]: %timeit prime_wheel_sieve(2097152)
7.08 ms ± 80.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [561]: %timeit prime_wheel_sieve4(2097152)
7.1 ms ± 85.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [562]: %timeit prime_wheel_sieve4(4194304)
14.8 ms ± 120 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [563]: %timeit prime_wheel_sieve(4194304)
14.2 ms ± 145 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [564]: %timeit prime_wheel_sieve(8388608)
39.4 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [565]: %timeit prime_wheel_sieve4(8388608)
41.7 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
According to my tests, using a bigger wheel makes the sieve slower, not faster. Why is that? Theoretically, a bigger wheel narrows the search space, so it shouldn't increase execution time. Why does using a bigger wheel slow the code down?
Okay, using the scientific method, I controlled the differences between the two functions so that the only remaining difference, the only quantity that can affect performance, is the wheel used. I made the first function use the modulo operator instead of bitwise AND, and I made the second function use a local list, just like the first. (I wanted to separate code and data, but whatever.)
@nb.njit(cache=True)
def prime_wheel_sieve3(n: int) -> np.ndarray:
    wheel = [4, 2, 4, 2, 4, 6, 2, 6]
    primes = np.ones(n + 1, dtype=np.uint8)
    primes[:2] = False
    for square, step in ((4, 2), (9, 6), (25, 10)):
        primes[square::step] = False
    k = 7
    lim = int(math.sqrt(n) + 0.5)
    i = 0
    while k <= lim:
        if primes[k]:
            primes[k**2 :: 2 * k] = False
        k += wheel[i]
        i = (i + 1) % 8
    return np.nonzero(primes)[0]
@nb.njit(cache=True)
def prime_wheel_sieve4_1(n: int) -> np.ndarray:
    # fmt: off
    wheel = [
        2 , 4 , 2 , 4 , 6 , 2 ,
        6 , 4 , 2 , 4 , 6 , 6 ,
        2 , 6 , 4 , 2 , 6 , 4 ,
        6 , 8 , 4 , 2 , 4 , 2 ,
        4 , 8 , 6 , 4 , 6 , 2 ,
        4 , 6 , 2 , 6 , 6 , 4 ,
        2 , 4 , 6 , 2 , 6 , 4 ,
        2 , 4 , 2 , 10, 2 , 10
    ]
    # fmt: on
    primes = np.ones(n + 1, dtype=np.uint8)
    primes[:2] = False
    for square, step in ((4, 2), (9, 6), (25, 10), (49, 14)):
        primes[square::step] = False
    k = 11
    lim = int(math.sqrt(n) + 0.5)
    i = 0
    while k <= lim:
        if primes[k]:
            primes[k**2 :: 2 * k] = False
        k += wheel[i]
        i = (i + 1) % 48
    return np.nonzero(primes)[0]
I had to add the formatting comments to prevent the Black formatter from messing up my formatted table in VS Code; of course, they don't affect performance at all.
The only differences between the functions are the initial value of k, the primes that have to be processed before rolling the wheel proper, and of course the wheel itself (and its size). These have to differ because the functions use different wheels, but everything else is unchanged.
In [679]: np.array_equal(prime_wheel_sieve3(65536), prime_wheel_sieve4_1(65536))
Out[679]: True
In [680]: %timeit prime_wheel_sieve3(65536)
162 μs ± 2.27 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [681]: %timeit prime_wheel_sieve4_1(65536)
158 μs ± 1.83 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [682]: %timeit prime_wheel_sieve3(131072)
326 μs ± 7.91 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [683]: %timeit prime_wheel_sieve4_1(131072)
322 μs ± 8.89 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [684]: %timeit prime_wheel_sieve3(262144)
659 μs ± 7.74 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [685]: %timeit prime_wheel_sieve4_1(262144)
655 μs ± 12.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [686]: %timeit prime_wheel_sieve3(524288)
1.45 ms ± 14.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [687]: %timeit prime_wheel_sieve4_1(524288)
1.42 ms ± 8.13 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [688]: %timeit prime_wheel_sieve3(1048576)
3.2 ms ± 68.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [689]: %timeit prime_wheel_sieve4_1(1048576)
3.2 ms ± 199 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now, interestingly, prime_wheel_sieve4_1 performs a little faster than prime_wheel_sieve3, but only by a tiny bit. The speedup is insignificant, but I know that execution time is stochastic in nature, so prime_wheel_sieve4_1 consistently beating prime_wheel_sieve3 by a small margin might be statistically significant. Though, since I haven't tested much, this doesn't exclude the possibility of coincidence.
But according to my theory, I should see a 14.29% decrease in execution time, not essentially no improvement. My tests only made my case stronger.
So why does using a bigger wheel not speed up the code?
The speedup is insignificant because you only sped up an insignificant part of the overall work. Most of the time is spent in the primes[...] = False assignments, and those are the same for both wheels: you do one for each prime up to sqrt(n). It doesn't matter that the larger wheel finds those primes a bit faster.
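To make that concrete, here's a rough sketch (my own helper, not code from the question) counting the slice assignments each sieve performs: one per wheel prime in the setup loop, plus one per prime found in the main loop. The totals come out identical for both wheels:

```python
import math

def marking_passes(n: int, wheel_primes: int) -> int:
    # Counts the `primes[...] = False` slice assignments a wheel sieve does:
    # `wheel_primes` setup passes (3 for the 30-wheel, 4 for the 210-wheel),
    # plus one pass per prime found by the main loop up to sqrt(n).
    lim = math.isqrt(n)
    last_wheel_prime = [2, 3, 5, 7][wheel_primes - 1]
    is_prime = [True] * (lim + 1)
    for k in range(2, lim + 1):
        if is_prime[k]:
            for m in range(k * k, lim + 1, k):
                is_prime[m] = False
    loop_passes = sum(
        1 for k in range(last_wheel_prime + 1, lim + 1) if is_prime[k]
    )
    return wheel_primes + loop_passes

print(marking_passes(65536, 3), marking_passes(65536, 4))  # 54 54
```

The 210-wheel does one more setup pass and one fewer loop pass, so the total number of marking passes, which is where the time goes, is exactly the same.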