Tags: c++, rust, optimization, benchmarking, edit-distance

Why is the Rust port of this function 2x slower than C++?


I have a function to compute string edit distance in C++:

#include <string>
#include <vector>
#include <algorithm>

size_t sed_diff(const std::string & a, const std::string & b) {

    std::vector<size_t> cp(b.length() + 1);
    std::vector<size_t> cc(b.length() + 1);

    // Edit distance on postorder traversal.
    for (int j = 0; j <= b.length(); ++j) {
        cp[j] = j;
    }
    for (int i = 1; i <= a.length(); ++i) {
        cc[0] = i;
        for (int j = 1; j <= b.length(); ++j) {
            unsigned int c1 = a[i - 1];
            unsigned int c2 = b[j - 1];
            cc[j] = std::min({
                                     cp[j - 1] + ((c1 == c2) ? 0 : 1),
                                     cc[j - 1] + 1,
                                     cp[j] + 1
                             });
        }
        std::swap(cp, cc);
    }
    return cp[b.length()];
}

int main() {
    // call
    sed_diff(std::string("banana"), std::string("mango"));
    return 0;
}

I attempted to rewrite the same function in Rust:

fn sed(s1: &[u8], s2: &[u8]) -> u32 {
    let mut cp = vec![0u32; s2.len() + 1];
    let mut cc = vec![0u32; s2.len() + 1];

    for i in 0..=s2.len() {
        cp[i] = i as u32;
    }

    for i in 1..=s1.len() {
        cc[0] = i as u32;
        for j in 1..=s2.len() {
            let c1 = s1[i - 1];
            let c2 = s2[j - 1];

            cc[j] = std::cmp::min(
                std::cmp::min(cp[j - 1] + if c1 == c2 { 0 } else { 1 }, cc[j - 1] + 1),
                cp[j] + 1,
            )
        }
        std::mem::swap(&mut cp, &mut cc);
    }
    cp[s2.len()]
}

fn main() {
    sed(
        String::from("banana").as_bytes(),
        String::from("mango").as_bytes(),
    );
}

When benchmarking the code on the same dataset, the C++ version averages around 15 seconds, while the compiled Rust version averages around 30 seconds of total runtime.

Both tests run in nested for loops, using 20 000 input strings and 50 test strings.

But I can't understand why the Rust version is so much slower.

EDIT:

The CPU is an AMD Ryzen 5 4500U. I compiled the C++ code using g++ (v13.2.1) + CMake in Release mode, so AFAIK only the -O3 flag was used. The rustc version is v1.75.0.

As for the tested inputs, as I mentioned in the comments, the strings are generated DNA sequences with lengths of 95 up to 105 chars. E.g.:

ATGATGGTCAGGTAGTGGAGAGGCTTACATATGGCCACATATCGGTCAAAAGCCATGGCTATGAGCAGCACCATCTCAGTTCCCCCAACTGCATGGATAA

I used 20 000 input strings and 50 query strings in nested for loops (query string in the outer loop).
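
A simplified sketch of that benchmark loop (illustrative only, not the exact harness; the function and variable names here are just for the example):

use std::time::Instant;

// Rough shape of the benchmark: 50 query strings in the outer loop,
// 20 000 input strings in the inner loop, calling sed() from above.
fn benchmark(queries: &[Vec<u8>], inputs: &[Vec<u8>]) {
    let start = Instant::now();
    let mut total: u64 = 0;
    for q in queries {
        for s in inputs {
            // Accumulate the results so the calls can't be optimized away.
            total += sed(q, s) as u64;
        }
    }
    println!("sum = {}, elapsed = {:?}", total, start.elapsed());
}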


Solution

  • Rustc fails to keep cc[j] = min(...) in a register to become next iteration's cc[j - 1]; instead it actually stores and reloads it, creating a longer latency bottleneck. This is on top of the branchless 2x cmp/cmov plus the adds (+1), which is probably about 5 cycles to get the min of 3 inputs. Store-forwarding latency is usually about 5 cycles on typical modern x86, so that explains the 2x speed difference.

    See https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ and https://agner.org/optimize/. On some CPUs, like Zen 2, Zen 4, and Ice Lake, zero-latency store forwarding is possible, but that usually requires the same addressing mode, not just the same final address (https://www.agner.org/forum/viewtopic.php?t=41).

    Or maybe there's a throughput bottleneck as well for the GCC version. GCC does min(cp[j] + 1, cp[j - 1] + (c1 != c2)) first because that's independent of last iteration's min result that's now in cc[j - 1], so the actual critical path latency is only add/cmp/cmov, 3 cycles on Broadwell and later, or AMD.

    Without compiler versions/options and CPU model, and details on your microbenchmark to see if the function inlined and compiled differently, the precise explanation for the exact speed difference will have to wait, but the bottom line is a missed optimization by the Rust compiler.

    You could probably help it out by using a local variable instead of re-reading an array element with the bounds-checking that entails in Rust.


    The inner loop has a dependency on the previous iteration via cc[j-1] + 1, so compilers can only vectorize the cp[j] = j; init loop.

    It's this loop-carried dependency that appears to be a throughput bottleneck in the Rust -C opt-level=3 code-gen, because Rustc / LLVM made asm that reloads it from memory instead of keeping the result of last iteration's min in a register.

    Perhaps it has a harder time optimizing than in C++ because of the bounds checks (there are some extra cmp/jcc conditional branches in the inner loop, but I don't think that would account for a factor of 2 in throughput). But LittleBoxOfSunshine's answer tested with get_unchecked and found negligible speedups, so probably LLVM just misses this.
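
    For reference, here is the kind of bounds-check-free rewrite that test refers to (my own sketch, not the exact code from that answer), using an iterator for the init loop and get_unchecked in the inner loop; as noted, it reportedly gave only negligible speedups:

    // Sketch of a bounds-check-free variant (similar in spirit to what that answer
    // tested). The unchecked indexing is sound because i and j stay within the
    // lengths the vectors were sized with.
    fn sed_unchecked(s1: &[u8], s2: &[u8]) -> u32 {
        let mut cp = vec![0u32; s2.len() + 1];
        let mut cc = vec![0u32; s2.len() + 1];

        // Iterator form of the init loop: no bounds checks, easy to auto-vectorize.
        for (j, v) in cp.iter_mut().enumerate() {
            *v = j as u32;
        }

        for i in 1..=s1.len() {
            cc[0] = i as u32;
            let c1 = s1[i - 1];
            for j in 1..=s2.len() {
                unsafe {
                    let c2 = *s2.get_unchecked(j - 1);
                    let v = std::cmp::min(
                        std::cmp::min(
                            *cp.get_unchecked(j - 1) + (c1 != c2) as u32,
                            *cp.get_unchecked(j) + 1,
                        ),
                        *cc.get_unchecked(j - 1) + 1,
                    );
                    *cc.get_unchecked_mut(j) = v;
                }
            }
            std::mem::swap(&mut cp, &mut cc);
        }
        cp[s2.len()]
    }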

    GCC code-gen for the inner loop looks like this (after changing the type to unsigned to match your use of u32 in Rust), Godbolt. I know you actually benchmarked a version that used 64-bit size_t elements for the vectors, assuming you compiled for x86-64, but that's not significant. For your small strings, these easily fit in L1d cache.

    # GCC13 -O3 -march=x86-64-v3 -Wall   but this loop didn't use any new instructions
    .L15:
            mov     eax, DWORD PTR [rbx+rdx*4]        # EAX = cp[j]
            xor     esi, esi
            cmp     BYTE PTR [r9-1+rdx], dil
            setne   sil                               # ESI = (c1 != c2)
            add     esi, DWORD PTR [rbx-4+rdx*4]      # ESI += cp[j-1]
            add     eax, 1
            cmp     esi, eax
            cmovbe  eax, esi                          # eax = (esi<=eax) ? esi : eax
            add     ecx, 1
            cmp     eax, ecx
            cmovbe  ecx, eax
            mov     rax, rdx
            mov     DWORD PTR [r14+rdx*4], ecx     # store the result
            add     rdx, 1
            cmp     r15, rax
            jne     .L15
    

    Note that there are only two DWORD loads plus the BYTE load (of the char from the string). And none of the loads are using r14 in the addressing mode like the store is. So we can see there are no loads from cc[], just a store: cc[j - 1] + 1 is computed from last iteration's min result, which is kept in a register (ecx) rather than reloaded from the store.

    Another possible optimization would have been to keep cp[j] around for use as cp[j-1] next iteration, but looking at the addressing modes is enough to figure out that it didn't do that.

    Clang is weird, with what looks like two versions of the loop, one with setcc/cmovcc and another with branches (starting at .LBB0_38: I think).

    But Rustc (Godbolt) has one inner loop:

    .LBB0_20:
            cmp     rdi, r15
            adc     r10, 0                 # r10 += (rdi<r15) unsigned
            lea     r11, [rdi - 1]
            cmp     r11, r15
            jae     .LBB0_21               # bounds check?
            cmp     rdi, rbx
            jae     .LBB0_27               # these are based on 64-bit unsigned compares, so they're not u32 or char data
            xor     r11d, r11d
            cmp     r9b, byte ptr [rbp + rdi - 1]
            setne   r11b                           # r11 = (c1 != c2)
            add     r11d, dword ptr [r13 + 4*rdi - 4]
            mov     r12d, dword ptr [r14 + 4*rdi - 4]   # reload cc[j-1] from [r14 + scaled-idx], same base as the store
            inc     r12d
            cmp     r11d, r12d
            cmovb   r12d, r11d
            mov     r11d, dword ptr [r13 + 4*rdi]
            inc     r11d
            cmp     r12d, r11d
            cmovb   r11d, r12d
            mov     dword ptr [r14 + 4*rdi], r11d      # store to [r14 + scaled-idx]
            cmp     rdi, r15
            jae     .LBB0_31
            mov     rdi, r10
            cmp     r10, r15
            jbe     .LBB0_20
    

    So the main work is still done branchlessly like GCC.

    But unlike GCC, it's actually storing/reloading cc[j - 1], so store-forwarding latency becomes part of the loop-carried dependency chain critical path which is the bottleneck.


    Optimized Rust version

    Seeing these problems in the asm, we can change the source to make it easier for the compiler to avoid them. Use a local mut lastccj variable to hold cc[j] across iterations, and use lastccj only in the outer min, not as an input to the inner one, to keep the critical-path latency down to inc / cmp / cmov, with the first cmp/cmov being independent.

    Godbolt

    // Improved version of the loop.
        for i in 1..=s1.len() {
            cc[0] = i as u32;
            let c1 = s1[i - 1];
            let mut lastccj = cc[0];
            for j in 1..=s2.len() {
                let c2 = s2[j - 1];
                lastccj = std::cmp::min(
                    std::cmp::min(cp[j - 1] + (c1 != c2) as u32, cp[j] + 1),
                    lastccj + 1   // only needed after the first min result is ready
                );
                cc[j] = lastccj;
            }
            std::mem::swap(&mut cp, &mut cc);
        }
    

    This is obviously somewhat less readable and harder to follow, so it would be nice if compilers did a better job. There might also be something to gain from keeping cp[j] in a temporary variable for reuse as next iteration's cp[j-1], but that's just a throughput issue, so we'd have to check whether it actually saves instructions. Modern x86 can do 2 or 3 loads per clock cycle, so the back-end execution units aren't the limit; the limit is getting all the instructions through the front-end and issued into the back-end. Especially with the bounds-check overhead, this loop is far from tiny.
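
    For example, a further tweak along those lines (an untested sketch; whether it actually saves instructions would need checking in the asm) carries cp[j] in a local so it can serve as next iteration's cp[j - 1], leaving only one load from cp[] per iteration:

    // Sketch: also carry cp[j] across iterations as next iteration's cp[j - 1].
        for i in 1..=s1.len() {
            cc[0] = i as u32;
            let c1 = s1[i - 1];
            let mut lastccj = cc[0];
            let mut cpjm1 = cp[0];            // plays the role of cp[j - 1]
            for j in 1..=s2.len() {
                let c2 = s2[j - 1];
                let cpj = cp[j];
                lastccj = std::cmp::min(
                    std::cmp::min(cpjm1 + (c1 != c2) as u32, cpj + 1),
                    lastccj + 1,
                );
                cc[j] = lastccj;
                cpjm1 = cpj;                  // becomes cp[j - 1] next iteration
            }
            std::mem::swap(&mut cp, &mut cc);
        }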

    Perhaps some SIMD to prepare a buffer of std::cmp::min(cp[j - 1] + (c1 != c2) as u32, cp[j] + 1) results would help. That could maybe even auto-vectorize. It could be very good with AVX-512 merge-masking, but it could still go 8 elements at a time with AVX2 using a few instructions.
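
    Here's an untested sketch of that two-pass idea (tmp is an extra scratch buffer introduced just for this example; whether the first pass really auto-vectorizes would need checking on Godbolt):

    // First pass: tmp[j] = min(cp[j-1] + (c1 != c2), cp[j] + 1) doesn't depend on
    // cc[], so it's a candidate for auto-vectorization.
    // Second pass: the serial min chain through lastccj.
        let mut tmp = vec![0u32; s2.len() + 1];
        for i in 1..=s1.len() {
            cc[0] = i as u32;
            let c1 = s1[i - 1];
            for j in 1..=s2.len() {
                let c2 = s2[j - 1];
                tmp[j] = std::cmp::min(cp[j - 1] + (c1 != c2) as u32, cp[j] + 1);
            }
            let mut lastccj = cc[0];
            for j in 1..=s2.len() {
                lastccj = std::cmp::min(tmp[j], lastccj + 1);
                cc[j] = lastccj;
            }
            std::mem::swap(&mut cp, &mut cc);
        }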

    But the final loop is going to have a 3 cycle loop-carried dependency and the bounds-check overhead, so there's a limit to how much speedup we can get.

    https://uica.uops.info/ predicts that the inner loop from this optimized version bottlenecks at one iteration per 5.36 cycles on Skylake, one per 4.7 cycles on Ice Lake / Rocket Lake. (Or 4.2 cycles with perfect scheduling). Those are both worse than the 3-cycle latency bottleneck. So the bounds-check branching becomes part of the throughput bottleneck after fixing the latency bottleneck. CPUs with 6-wide pipelines like Alder Lake P-cores or Zen 3 or 4 could run it even faster, in theory one per 3.5 cycles.

    Would it work to manually vectorize with SIMD prefix-scan techniques for the chain of min operations? Maybe not, because I'm not sure it's separable into a prefix-min of cc[j]+1; the min with elements from cp[j] can affect an older cc[j] in ways that we can't correct for later. So the mix of +1 and min makes it not associative, I think.