c++cperformancearmthumb

Is it redundant to check if a modulo operation is needed, then performing it?


Consider next code:

unsigned idx;
//.. some work with idx
if( idx >= idx_max )
    idx %= idx_max;

Could be simplified to only second line:

idx %= idx_max;

and will achieve the same result.


Several times I met next code:

unsigned x;
//... some work with x
if( x!=0 )
  x=0;

Could be simplified to

x=0;

The questions:


Solution

  • If you want to understand what the compiler is doing, you'll need to just pull up some assembly. I recommend this site (I already entered code from the question)): https://godbolt.org/g/FwZZOb.

    The first example is more interesting.

    int div(unsigned int num, unsigned int num2) {
        if( num >= num2 ) return num % num2;
        return num;
    }
    
    int div2(unsigned int num, unsigned int num2) {
        return num % num2;
    }
    

    Generates:

    div(unsigned int, unsigned int):          # @div(unsigned int, unsigned int)
            mov     eax, edi
            cmp     eax, esi
            jb      .LBB0_2
            xor     edx, edx
            div     esi
            mov     eax, edx
    .LBB0_2:
            ret
    
    div2(unsigned int, unsigned int):         # @div2(unsigned int, unsigned int)
            xor     edx, edx
            mov     eax, edi
            div     esi
            mov     eax, edx
            ret
    

    Basically, the compiler will not optimize away the branch, for very specific and logical reasons. If integer division was about the same cost as comparison, then the branch would be pretty pointless. But integer division (which modulus is performed together with typically) is actually very expensive: http://www.agner.org/optimize/instruction_tables.pdf. The numbers vary greatly by architecture and integer size but it typically could be a latency of anywhere from 15 to close to 100 cycles.

    By taking a branch before performing the modulus, you can actually save yourself a lot of work. Notice though: the compiler also does not transform the code without a branch into a branch at the assembly level. That's because the branch has a downside too: if the modulus ends up being necessary anyway, you just wasted a bit of time.

    There's no way to make a reasonable determination about the correct optimization without knowing the relative frequency with which idx < idx_max will be true. So the compilers (gcc and clang do the same thing) opt to map the code in a relatively transparent way, leaving this choice in the hands of the developer.

    So that branch might have been a very reasonable choice.

    The second branch should be completely pointless, because comparison and assignment are of comparable cost. That said, you can see in the link that compilers will still not perform this optimization if they have a reference to the variable. If the value is a local variable (as in your demonstrated code) then the compiler will optimize the branch away.

    In sum the first piece of code is perhaps a reasonable optimization, the second, probably just a tired programmer.