linuxfloating-pointmpichrobustnessscalapack

netlib scalapack example1.f - execution problems


I'm trying to compile, link and run netlib's scalapack example1.f program. code for example1.f

The code compiles and link OK, but when executed it shows instabilities in the results. Sometimes the residual is very low. Other times it is around 10E+13.

I also see that I get a lot of IEEE_DENORMAL and IEEE_UNDERFLOW flags when executing example1 and some other scalapack testing programs (shipped with scalapack source code and built along with the library libscalapack.a).

I had to edit this line of the example1.f:

DATA               NPROW / 2 / , NPCOL / 3 /

to

DATA               NPROW / 2 / , NPCOL / 2 /

because I have only 4 processing cores.

I have MPICH version 4.0.2 - built from source scaLapack version 2.2.0 LAPACK version 3.10.1 BLAS - reference implementation of LAPACK 3.10.1

The libraries were built and tested OK.

Compile-link command: mpif77 example1.f -o example1 libmpi.a libscalapack.a liblapack.a librefblas.a

Run/execution command:

mpiexec -n 4 ./example1

or mpirun -n 4 ./example1

Sometimes the result is:

According to the normalized residual the solution is correct.
||A*x - b|| / ( ||x||*||A||*eps*N ) =   1.47973536-253

But other times I get an incorrect one: According to the normalized residual the solution is incorrect. ||Ax - b|| / ( ||x||||A||epsN ) = 1.87065413E+13

The output of example1 is very erratic.

The program example1 uses the function PDGESV to get the results. I searchd the use of that function in scalapack testing functions in TESTING folder and found that the program xdsvd uses it. I tested xdsvd and found that the function passes the tests AND that the results are very robust, i.e., the numbers displayed in the output are always the same.

Thanks for any tips pointing the way to solve this problem.


Solution

  • I was able to come around my difficulties.

    The real problem is that it was not necessary to change the process grid in the code of example1.f.

    I thought the number of processes was limited to the actual number of cores in the physical processor. No, that's not the case. So, when I built and executed the original code of example1.f with:

    mpif77 example1.f -o example1 libscalapack.a liblapack.a librefblas.a
    

    and then mpirun -n 6 ./example1

    I got a correct answer and a zero INFO code, indicating no problems in the scalapack routine PDGESV.

    INFO code returned by PDGESV =   0
    

    It's worth mentioning that the IEEE floating point flags disappeared.

    I also noted that I need not to provide the libmpi.a in the build command.

    This is not a complete answer, because I still don't know the changing of the process grid messed up the whole thing. I experimented with changing the dimensions of the block submatrices. Tried NB=3 and MB=3 with process grid 1x3 and other combinations with no success.

    That's what I'm going to investigate now. Maybe I'll have to ansk another question on how to check the validity of the scalapack computation parameters.