The C++ armadillo libraries function for finding eigenvalues and eigenvectors of a complex sparse matrix eigs_gen
fails (crashes or give wrong result) if run concurrently in multiple threads, even when these threads do not read or write from the same data. Why is this and can it be run multithreaded?
Consider the following code, which uses the Armadillo library to find the lowest eigenvalue and corresponding eigenvector of multiple (here 2) large complex square matrices (~20000x20000) in C++ in two different threads. The threads do not share any data whatsoever, so no mutex should be needed, even so, this example places a mutex before the function which actually diagonalises the matrix:
#include<iostream>
#include<complex>
#include<exception>
#include <armadillo>
#include<thread>
#include<string>
using namespace std;
using namespace arma;
std::mutex stupid_lock;
void get_eigenvalue(int id, size_t N, double* out)
{
sp_cx_mat H (N, N);//Sparse matrix, as we have mostly 0 entries, initialized as only zeros by default
//As a demonstration, a hermitian matrix, which should have real eigenvalues
for (uint i = 0; i <N; ++i)
{
H(i, i)=1*i;
if (i>0)
H(i, i-1)=-1i*double(i);
if (i<N-1)
H(i, i+1)=+1i*double(i);
}
//Somewhere to put the eigenvalues and vectors
cx_vec eigval;
cx_mat eigvec;
stupid_lock.lock();
//It makes no sense that I need thism everything was created locally in this function
eigs_gen( eigval, eigvec, H, 1, "sr");//The lowest eigenvalue
stupid_lock.unlock();
*out = eigval[0].real();//Should be real, as the matrix is hermitian
if ( eigval[0].imag()> 1e-9 )//Sanity check, if this is not real something has failed (in practice allow for slight rounding errors)
{
throw std::runtime_error("ERROR in thread "+to_string(id)+", energy "+to_string(eigval[0].real())+"+"+to_string(eigval[0].imag())+"*i + has non-zero imaginary part");//I don't even bother catching the possible error, since it should be mathematically impossible for it to trigger
}
}
int main()
{
size_t N = 20000;//The error only happens if the functions take a substantial amount of time, this takes around half a second to diagonalise on my computer
double out1=0;
double out2=0;
thread t1(get_eigenvalue,1,N,&out1);
thread t2(get_eigenvalue,2,N,&out2);
t1.join();
t2.join();
cout<<"These numbers should be the same "<<out1<<" "<<out2<<endl;
return 0;
}
Note that the matrices used in this example are hermitian, so we expect the eigenvalue printed to be a real number, the program throws an exception if this was not the case: The output of this program is exactly what you would expect
$ bin/minimum_failure.exe
These numbers should be the same -19936.6 -19936.6
This works perfectly every single time.
But the one thing which takes a long time here is running eigs_gen
, so having one thread wait for the other to finish is almost as slow as if no threads were used at all, and the threads do not write to the same data, so I see no reason the mutex should be there at all, alas, commenting out the mutex \\stupid_lock.lock();
and \\stupid_lock.unlock();
causes the program to fail in a large number of different ways, most often:
$ bin/minimum_failure.exe
[1] 540245 segmentation fault (core dumped) bin/minimum_failure.exe
Or
$ bin/minimum_failure.exe
warning: eigs_gen(): ARPACK error
warning: eigs_gen(): ARPACK error -14 in neupd()-15
in neupd()
[1] 541755 segmentation fault (core dumped) bin/minimum_failure.exe
But sometimes the threads "succeed" but the eigenvalues found are wrong (and random each time) sometimes they are real but wrong:
$ bin/minimum_failure.exe
These numbers should be the same -4.78507e+16 -3.69731e+12
And other times the hermitian matrix has complex eigenvalues (Which is mathematically impossible!):
$ bin/minimum_failure.exe
terminate called after throwing an instance of 'std::runtime_error'
what(): ERROR in thread 2, energy -16346042235.187666+52718696738.912788*i + has non-zero imaginary part
[1] 540310 IOT instruction (core dumped) bin/minimum_failure.exe
This clearly not a rounding error, this is a completely wrong result.
But why does this happen? a mutex should only be needed when we read or write to the same data from different threads, and these threads do not do that. Does Armadillo behind the scene use some kind of global scope variables? And is it impossible to use eigenvalue decomposition of matrices in multiple different threads at the same time?
Distro Arch Linux with Linux 6.1.15-1-lts
Armadillo 12.0.1-1
And openblas 0.3.21-4 which provides blas 3.9.0
GCC 12.2.1-2
I am compiling with flags: -std=c++17 -O2 -larmadillo -lpthread
My CPU has room for 20 threads (found by calling uint64_t max_threads = thread::hardware_concurrency();
in C++ or lscpu
in the terminal)
Armadillo can use two backends for the eigs_sym()
and eigs_gen()
functions. One is the standard ARPACK library. The other is an alternative C++ implementation of ARPACK called NEWARP, which is included with Armadillo.
Standard ARPACK is an old implementation in Fortran that has known problems like not being re-entrant so it can't be used from multiple threads.
The alternative NEWARP backend is enabled by a two step process. First, define a macro called ARMA_USE_NEWARP
. Second, disable ARPACK by defining a macro called ARMA_DONT_USE_ARPACK
. Both macros have to be defined before including the armadillo header file.
#define ARMA_USE_NEWARP
#define ARMA_DONT_USE_ARPACK
#include <armadillo>
You can also modify the config.hpp
file in Armadillo to do the same.