c++performance assembly micro-optimization branch-prediction

What is faster in C++: mod (%) or another counter?

At the risk of this being a duplicate, maybe I just can't find a similar post right now:

I am writing in C++ (C++20 to be specific). I have a loop with a counter that counts up every turn. Let's call it counter. And if this counter reaches a page-limit (let's call it page_limit), the program should continue on the next page. So it looks something like this:

const size_t page_limit = 4942;
size_t counter = 0;
while (counter < foo) {
    if (counter % page_limit == 0) {
        // start new page
    }
    
    // some other code
    counter += 1;
}

Now I am wondering since the counter goes pretty high: would the program run faster, if I wouldn't have the program calculate the modulo counter % page_limit every time, but instead make another counter? It could look something like this:

const size_t page_limit = 4942;
size_t counter = 0;
size_t page_counter = 4942;
while (counter < foo) {
    if (page_counter == page_limit) {
        // start new page
        page_counter = 0;
    }

    // some other code
    counter += 1;
    page_counter += 1;
}

Solution

(I assume you meant to write if(x%y==0) not if(x%y), to be equivalent to the counter.)

I don't think compilers will do this optimization for you, so it could be worth it. It's going to be smaller code-size, even if you can't measure a speed difference. The x % y == 0 way still branches (so is still subject to a branch misprediction those rare times when it's true). Its only advantage is that it doesn't need a separate counter variable, just some temporary registers at one point in the loop. But it does need the divisor every iteration.

Overall this should be better for code size, and isn't less readable if you're used to the idiom. (Especially if you use if(--page_count == 0) { page_count=page_limit; ... so all pieces of the logic are in two adjacent lines.)

If your page_limit were not a compile-time constant, this is even more likely to help. dec/jz that's only taken once per many decrements is a lot cheaper than div/test edx,edx/jz, including for front-end throughput. (div is micro-coded on Intel CPUs as about 10 uops, so even though it's one instruction it still takes up the front-end for multiple cycles, taking away throughput resources from getting surrounding code into the out-of-order back-end).

(With a constant divisor, it's still multiply, right shift, sub to get the quotient, then multiply and subtract to get the remainder from that. So still several single-uop instructions. Although there are some tricks for divisibility testing by small constants see @Cassio Neri's answer on Fast divisibility tests (by 2,3,4,5,.., 16)? which cites his journal articles; recent GCC may have started using these.)

But if your loop body doesn't bottleneck on front-end instruction/uop throughput (on x86), or the divider execution unit, then out-of-order exec can probably hide most of the cost of even a div instruction. It's not on the critical path so it could be mostly free if its latency happens in parallel with other computation, and there are spare throughput resources. (Branch prediction + speculative execution allow execution to continue without waiting for the branch condition to be known, and since this work is independent of other work it can "run ahead" as the compiler can see into future iterations.)

Still, making that work even cheaper can help the CPU see and handle a branch mispredict sooner. But modern CPUs with fast recovery can keep working on old instructions from before the branch while recovering. (What exactly happens when a skylake CPU mispredicts a branch? / Avoid stalling pipeline by calculating conditional early )

And of course a few loops do fully keep the CPU's throughput resources busy, not bottlenecking on cache misses or a latency chain. And fewer uops executed per iteration is more friendly to the other hyperthread (or SMT in general).

Or if you care about your code running on in-order CPUs (common for ARM and other non-x86 ISAs that target low-power implementations), the real work has to wait for the branch-condition logic. (Only hardware prefetch or cache-miss loads and things like that can be doing useful work while running extra code to test the branch condition.)

The divide work does take up more space in the ROB (ReOrder Buffer) so reduces how far ahead the CPU can see for out-of-order execution to find independent work between code before / after it.

Use a down-counter

Instead of counting up, you'd actually want to hand-hold the compiler into using a down-counter that can compile to dec reg / jz .new_page or similar; all normal ISAs can do that quite cheaply because it's the same kind of thing you'd find at the bottom of normal loops. (dec/jnz to keep looping while non-zero)

    if(--page_counter == 0) {
        /*new page*/;
        page_counter = page_limit;
    }

A down-counter is more efficient in asm and equally readable in C (compared to an up-counter), so if you're micro-optimizing you should write it that way. Related: using that technique in hand-written asm FizzBuzz. Maybe also a code review of asm sum of multiples of 3 and 5, but it does nothing for no-match so optimizing it is different.

Notice that page_limit is only accessed inside the if body, so if the compiler is low on registers it can easily spill that and only read it as needed, not tying up a register with it or with multiplier constants.

Or if it's a known constant, just a move-immediate instruction. (Most ISAs also have compare-immediate, but not all. e.g. MIPS and RISC-V only have compare-and-branch instructions that use the space in the instruction word for a target address, not for an immediate.) Many RISC ISAs have special support for efficiently setting a register to a wider constant than most instructions that take an immediate (like ARM movw with a 16-bit immediate, so 4092 can be done in one instruction for mov but not cmp: it doesn't fit in 12 bits rotated by an even count).

Compared to dividing (or multiplicative inverse), most RISC ISAs don't have multiply-immediate, and a multiplicative inverse is usually wider than one immediate can hold. (x86 does have multiply-immediate, but not for the form that gives you a high-half.) Divide-immediate is even rarer, not even x86 has that at all, but no compiler would use that unless optimizing for space instead of speed if it did exist.

CISC ISAs like x86 can typically multiply or divide with a memory source operand, so if low on registers the compiler could keep the divisor in memory (especially if it's a runtime variable). Loading once per iteration (hitting in cache) is not expensive. But spilling and reloading an actual variable that changes inside the loop (like page_count) could introduce a store/reload latency bottleneck if the loop is short enough and there aren't enough registers. (Although that might not be plausible: if your loop body is big enough to need all the registers, it probably has enough latency to hide a store/reload.)