I decided to compare just for interest the performance of basic algorithm implemented in PHP/Lua/Python with the same program in C++ as reference point. Search of primes algorithm and code examples are here. I wrote this algorithm in PHP/Lua/Py/C++ without any problems and in addition made simple numeric search limit input for convenience (until user exits 'prompt loop'). Here my results for searching all primes up to 1'000'000 (totally 78498 primes) and for C++ code up to 100'000 (totally 9592 primes):
Limit is 1e6
Limit is 1e5
Imagine my disappointment, when I saw C++ code performance with 10 times lower number limit, because it should perform much faster than Lua anyway. I tried to compile my code with optimizations (see below), compile code in MS Visual Studio 2022 (MSVC++20 ISO), compile and run code on laptop and again on desktop but under Ubuntu. I even sent my executable to my relative to run on his machine, but the results were always the same: still ~12-17 seconds depending on machine.
What am I doing wrong or not doing right at all?
My compiler options and PC settings:
MinGW64/Win10 and GNU GCC/Ubuntu:
g++ -std=c++20 -march=x86-64 -mtune=ivybridge -O3 primenum.cpp -o primenum.exe
MSVC++20/Win10: Release, x64, ISO C++20 Standard
Desktop PC: Win10x64/Ubuntu server 24.04 LTS, Intel i5-3570K Ivy Bridge;
Laptop PC: Win10x64, Intel i7-3537U Ivy Bridge;
My C++ full code:
#include <cmath>
#include <iostream>
#include <string>
using namespace std;
//=============================================================================
/* This C++ program generates all prime numbers (primes) up to
* integer limit entered by user. Primes searching method implements Sieve
* of Atkin algorithm.*/
//=============================================================================
// Convert string chars to lowercase
void str_tolower(string &str) {
int length = str.length();
for (int i = 0; i < length; ++i) {
str[i] = (char)tolower(str[i]);
}
}
// Sieve of Atkin algorithm main function
// ======================================
void SieveOfAtkin(size_t limit) {
/* Initialise the sieve array with false values.
* "2" and "3" are known to be prime.*/
size_t length = limit + 1;
size_t *sieve = new size_t[length]{};
sieve[2] = 1;
sieve[3] = 1;
/* Mark sieve[n] is true if one of the following is true:
* a) n = (4*x*x)+(y*y) has odd number of solutions, i.e., there exist
* odd number of distinct pairs (x, y) that satisfy the equation and
* n % 12 = 1 or n % 12 = 5.
* b) n = (3*x*x)+(y*y) has odd number of solutions and n % 12 = 7
* c) n = (3*x*x)-(y*y) has odd number of solutions, x > y and
* n % 12 = 11.*/
size_t n{}, x{}, y{};
for (x = 0; x*x <= limit; ++x) {
for (y = 0; y*y <= limit; ++y) {
// (a)
n = 4*x*x + y*y;
if ((n <= limit) && (n % 12 == 1 || n % 12 == 5)) {
sieve[n] = !sieve[n];
}
// (b)
n = 3*x*x + y*y;
if ((n <= limit) && (n % 12 == 7)) {
sieve[n] = !sieve[n];
}
// (c)
n = 3*x*x - y*y;
if ((x > y) && (n <= limit) && (n % 12 == 11)) {
sieve[n] = !sieve[n];
}
}
}
// Mark all multiples of squares as non-prime
size_t r{}, i{};
for (r = 5; r*r <= limit; ++r) {
if (sieve[r]) {
for (i = r*r; i <= limit; i += r*r) {
sieve[i] = 0;
}
}
}
// Disable listing if 'limit' > 1086 (print only first 180 primes)
bool disable_list_output = limit >= 1087 ? true : false;
// Print primes using sieve[]
size_t count{0};
const unsigned char columns{20};
for (size_t a = 2; a <= limit; ++a) {
if (sieve[a]) {
++count;
if (!disable_list_output) {
cout.setf(ios::dec);
cout.width(10);
cout << a;
if (count > 0 && count % columns == 0)
cout << endl;
}
}
}
delete[] sieve;
if (disable_list_output) {
cout << endl << "===========================================";
cout << endl << "Primes count is over 180, listing disabled.";
}
cout << endl << "-------------------------------------------" << endl;
cout << "Primes found: " << count << endl;
}
// Main loop of the program
// ========================
int main(void) {
time_t start_time{};
size_t limit{};
string input{};
float f_limit{};
while (true) {
// Read user input string
cout << endl << "Enter positive integer, N > 3 : ";
getline(cin, input);
str_tolower(input);
// Processing user input string
// ----------------------------
if (!cin or input.empty()) { // if input failed --> repeat 'while' loop
continue;
} else if (input == "exit") {
// Successful program termination
exit(EXIT_SUCCESS);
} else {
try { // if input is float --> OK...
f_limit = stof(input);
} catch (...) { // ...else --> repeat 'while' loop
continue;
}
f_limit = stof(input);
// Check if input is integer
if (floor(f_limit) == f_limit) {
limit = static_cast<size_t>(f_limit);
start_time = time(NULL);
SieveOfAtkin(limit);
cout << "Execution time : " << time(NULL) - start_time << endl;
cout << "===========================================" << endl;
}
}
}
return 0;
}
I think that vector<bool>
is a culprit, not the only one... But a culprint here.
vector<bool>
uses a specialization that tries to be 'smart' and save space by storing the bools inside of bits instead of bytes... Which means that accessing an element of the vector is complicated. Instead of doing address = vec.array + index
which is how normal array access individual members. vector<bool>
performs division to get the relevant byte: address = vec.array + index / (sizeof(uint) * 8)
, and then performs a modulus and a shift to figure out which bit you are actually looking at bit_shift = index % (sizeof(uint) * 8); value = (bool) (*address & 1 << bit_shift).
to get/modify the value of each 'bool'. I believe this modulo shouldn't cost anything as it can get the remainder from the previous division of index... But I could be wrong, and it might perform 2 division here...
And why is this a problem? Division is slow, And it is being performed O(3n^2) in your code (because your code is doing x <= limit
instead of x * x <= limit
or the far which might be a contributing factor. Your code does division about (3n^2 -n) more often than the correct implementation does)... because of vector<bool>
, and your coding mistake. on my hunk of junk, it takes about 30x as long to perform division vs addition, which means it would take my CPU 2670 X the time to perform this calculation then it should if I decided to use a vector<bool>
. your CPU is more modern, I am guessing with pipelining it probably takes 4-5 cycles per division. (if you look up your CPU specs, how far off am I?)
if you use a vector<char>
instead to store your bool
values it should significantly faster. That being said, vectors perform dynamic memory allocation. And they resize Which takes maloc time (long, long time) + O(size of the previous vector) each time they resize... It would be more performant to start the vector at the correct size using vector<char> sieve(limit)
to avoid a resize during this process (faster would to just do it on the stack with an array, bool sieve[limit]
as long as your limit doesn't exceed your stack size.)
x <= ceil(Math.sqrt(limit))
multiplication is also slow, not division slow... but slow.Edit:
Your code is also including the time it takes to print things to the terminal. Which is not very fast. If you stick that outside of the timed portion as well, it should be more performant.