floating-point x86 mips numerical-computing flops

What's the relative speed of floating point add vs. floating point multiply?


A decade or two ago, it was worthwhile to write numerical code that avoided multiplies and divides in favor of additions and subtractions. A good example is using forward differences to evaluate a polynomial curve instead of computing the polynomial directly.
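To make the forward-difference trick concrete, here is a minimal C++ sketch, assuming the polynomial is sampled at evenly spaced points. The names `horner` and `forward_differences` are made up for this illustration. The setup evaluates the polynomial directly at the first few points; after that, each new sample costs only additions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct evaluation with Horner's rule (coefficients in ascending order):
// one multiply and one add per coefficient.
double horner(const std::vector<double>& c, double x) {
    double r = 0.0;
    for (std::size_t k = c.size(); k-- > 0;) {
        r = r * x + c[k];
    }
    return r;
}

// Tabulate the polynomial at x0, x0+h, ..., x0+(count-1)*h.
// After the setup phase, each further sample uses additions only.
std::vector<double> forward_differences(const std::vector<double>& c,
                                        double x0, double h,
                                        std::size_t count) {
    assert(!c.empty());
    const std::size_t n = c.size();  // degree + 1

    // d[k] starts as p(x0 + k*h) ...
    std::vector<double> d(n);
    for (std::size_t k = 0; k < n; ++k) {
        d[k] = horner(c, x0 + static_cast<double>(k) * h);
    }
    // ... and is turned in place into the k-th forward difference at x0.
    for (std::size_t level = 1; level < n; ++level) {
        for (std::size_t k = n - 1; k >= level; --k) {
            d[k] -= d[k - 1];
        }
    }

    std::vector<double> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        out.push_back(d[0]);
        // Step every difference forward by h: no multiplies here.
        for (std::size_t k = 0; k + 1 < n; ++k) {
            d[k] += d[k + 1];
        }
    }
    return out;
}
```

Note that rounding errors accumulate from sample to sample with this approach, so over long runs it can be less accurate than evaluating each point directly.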

Is this still the case, or have modern computer architectures advanced to the point where * and / are no longer many times slower than + and -?

To be specific, I'm interested in compiled C/C++ code running on modern typical x86 chips with extensive on-board floating point hardware, not a small micro trying to do FP in software. I realize pipelining and other architectural enhancements preclude specific cycle counts, but I'd still like to get a useful intuition.


Solution

  • It also depends on the instruction mix. Your processor has several computation units standing by at any time, and you'll get maximum throughput if all of them are kept busy all the time. So executing a loop of muls is just as fast as executing a loop of adds, but the same doesn't hold once the expression becomes more complex.

    For example, take this loop:

    for(int j=0;j<NUMITER;j++) {
      for(int i=1;i<NUMEL;i++) {
        bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i] ;
      }
    }
    

    With NUMITER=10^7 and NUMEL=10^2, and all four arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit processor. If I replace the loop body with

    bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;
    

    it only takes 1.7 seconds, so since we "overdid" the additions, the muls were essentially free, and the reduction in additions helped. It gets more confusing:

    bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;
    

    This has the same mul/add distribution, but the constant is now added in rather than multiplied in, and it takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently, so dot-product-like sums of muls and scaled sums are about as good as it gets; adding constants isn't nearly as common, so that's slower...

    bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/
    

    again takes 1.7 seconds.

    bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/
    

    (same as the initial loop, but without the expensive constant addition: 2.1 seconds)

    bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/
    

    (mostly muls, but one addition: 1.9 seconds)

    So, basically, it's hard to say which is faster. If you wish to avoid bottlenecks, it matters more to have a sane mix of operations, to avoid NaN or INF, and to avoid adding constants. Whatever you do, make sure you test, and test with various compiler settings, since small changes can often make the difference.

    Some more cases:

    bla *= someval; // someval very near 1.0; takes 2.1 seconds
    bla *= arr1[i]; // arr1[i] all very near 1.0; takes 66(!) seconds
    bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 1.6 seconds
    bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, 2.2 seconds
    bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, floats, 2.2 seconds
    bla += someval * arr1[i] * arr2[i]; // 0.9 in x64, 1.6 in x86
    bla += someval * arr1[i]; // 0.55 in x64, 0.8 in x86
    bla += arr1[i] * arr2[i]; // 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86
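To try these variants on your own machine, a small timing harness along the following lines may help. This is only a sketch and not the code used for the numbers above: `Timing`, `time_kernel`, and the array fill values are assumptions made for illustration, and the inner loop starts at index 0. The accumulated sum is returned so that the compiler cannot discard the loop as dead code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Elapsed wall-clock time plus the accumulated value. Returning the value
// keeps the optimizer from removing the loop entirely.
struct Timing {
    double seconds;
    double result;
};

// Runs `bla += body(arr1[i], arr2[i], arr3[i], arr4[i])` num_iter * num_el
// times and measures it. The fill values are arbitrary small positive
// numbers (an assumption for this sketch).
template <typename Body>
Timing time_kernel(std::size_t num_iter, std::size_t num_el, Body body) {
    assert(num_el > 0);
    std::vector<double> arr1(num_el, 1.0), arr2(num_el, 0.5);
    std::vector<double> arr3(num_el, 0.25), arr4(num_el, 0.125);

    double bla = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t j = 0; j < num_iter; ++j) {
        for (std::size_t i = 0; i < num_el; ++i) {
            bla += body(arr1[i], arr2[i], arr3[i], arr4[i]);
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(stop - start).count(), bla};
}
```

A call such as `time_kernel(10000000, 100, [](double a, double b, double c, double d) { return 2.1 + a * b + c * d; })` would correspond to one of the variants above. Expect the absolute numbers to differ by compiler, optimization flags, and CPU, so compare variants only within one build configuration.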