[c++] [opencl] [nvidia]

Unpredictable changes in calculation accuracy within the OpenCL kernel


Good afternoon, everyone!

I have the following OpenCL kernel code:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

double calcA(double2 point1, double2 point2);

__kernel void mainProgram()
{
    size_t index = get_global_id(0);

    if (index == 0)
    {
        double2 p1   = (double2)(0.891949793450334649, 0.513485940430910115);
        double2 p2   = (double2)(0.891949846176460226, 0.513485940430959964);

        double dist = calcA(p1, p2);

        printf("NV A        = %.18e\n", dist);
    }
}


double calcA(double2 p1, double2 p2)
{
    double cos1 = cos(p1.x);
    double cos2 = cos(p2.x);
    double sin1 = sin(p1.x);
    double sin2 = sin(p2.x);
    double cosDelta = cos(p2.y - p1.y);

    double A_h1 = cos1 * sin2;
    double A_h2 = sin1 * cos2 * cosDelta;
    double A = A_h1 - A_h2;

    // printf("NV A_h1     = %.18e\n", A_h1);

    return A;
}

This is a minimal excerpt from a much larger program that demonstrates the underlying problem. As you can see, the kernel function calls a helper function that computes some value A (in three steps) and prints it. The host program performs the same calculation and prints it as well, so the two results can be compared. The problem is that the value of A calculated on the GPU agrees with the CPU value only to about 9 decimal places:

CPU A = 5.27261255 7671862987e-08
NV  A = 5.27261255 9188061227e-08

However, if one of the intermediate variables (A_h1 or A_h2) is printed inside the calcA function (for example, by uncommenting the printf line above), then the GPU suddenly outputs exactly the same value as the CPU:

CPU A = 5.272612557671862987e-08
NV  A = 5.272612557671862987e-08

I note that I use the <CL/opencl.hpp> C++ bindings directly, running on an RTX 4080 (CUDA 12.4, driver 551.78). I have also seen advice to use the compilation flag --fmad=false, but that is an nvcc flag; no such flag exists in the OpenCL specification.

Thank you so much for your help!

Edit 1: @ProjectPhysX suggested the -cl-opt-disable flag. It had an effect on the test example shown here, but unfortunately not on the main program, where the inaccurate values merely changed slightly without becoming more accurate.


Solution

  • When printing the intermediate values, the GPU performs the math exactly as you coded it, and the result matches the CPU. When not printing them, the compiler contracts the multiplication and subtraction into a single fma/mad instruction, which rounds differently and may use reduced precision:

    double A = fma(cos1, sin2, -sin1 * cos2 * cosDelta); 
    

    You can disable this behavior with the -cl-opt-disable compiler flag, passed to clBuildProgram.

    Note that double-precision (FP64) performance on Nvidia Ada GPUs is abysmal, most likely even below the FP64 performance of your CPU. I'd recommend using the GPU only if you can get away with single precision (FP32).