I have a function to compute string edit distance in C++:
#include <string>
#include <vector>
#include <algorithm>

size_t sed_diff(const std::string & a, const std::string & b) {
    std::vector<size_t> cp(b.length() + 1);  // previous DP row
    std::vector<size_t> cc(b.length() + 1);  // current DP row
    // Edit distance on postorder traversal.
    for (int j = 0; j <= b.length(); ++j) {
        cp[j] = j;
    }
    for (int i = 1; i <= a.length(); ++i) {
        cc[0] = i;
        for (int j = 1; j <= b.length(); ++j) {
            unsigned int c1 = a[i - 1];
            unsigned int c2 = b[j - 1];
            cc[j] = std::min({
                cp[j - 1] + ((c1 == c2) ? 0 : 1),  // substitution (or match)
                cc[j - 1] + 1,                     // insertion
                cp[j] + 1                          // deletion
            });
        }
        std::swap(cp, cc);
    }
    return cp[b.length()];
}

int main() {
    // call
    sed_diff(std::string("banana"), std::string("mango"));
    return 0;
}
I attempted to rewrite the same function in Rust:
fn sed(s1: &[u8], s2: &[u8]) -> u32 {
    let mut cp = vec![0u32; s2.len() + 1]; // previous DP row
    let mut cc = vec![0u32; s2.len() + 1]; // current DP row
    for i in 0..=s2.len() {
        cp[i] = i as u32;
    }
    for i in 1..=s1.len() {
        cc[0] = i as u32;
        for j in 1..=s2.len() {
            let c1 = s1[i - 1];
            let c2 = s2[j - 1];
            cc[j] = std::cmp::min(
                std::cmp::min(cp[j - 1] + if c1 == c2 { 0 } else { 1 }, cc[j - 1] + 1),
                cp[j] + 1,
            );
        }
        std::mem::swap(&mut cp, &mut cc);
    }
    cp[s2.len()]
}

fn main() {
    sed(
        String::from("banana").as_bytes(),
        String::from("mango").as_bytes(),
    );
}
When benchmarking on the same dataset, the C++ code averages around 15 seconds, while the compiled Rust code averages around 30 seconds of total runtime.
Both tests are run in nested for loops, using 20 000 input strings and 50 test strings.
But I can't understand why the Rust version is so much slower.
EDIT:
CPU is an AMD Ryzen 5 4500U.
I compiled the C++ code using g++ (v13.2.1) + CMake in Release mode, so AFAIK only the -O3 flag was used.
Rustc version is v1.75.0.
As for the tested inputs, as I mentioned in the comments, the strings are generated DNA sequences with lengths of 95 up to 105 chars, e.g.:
ATGATGGTCAGGTAGTGGAGAGGCTTACATATGGCCACATATCGGTCAAAAGCCATGGCTATGAGCAGCACCATCTCAGTTCCCCCAACTGCATGGATAA
I used 20 000 input strings with 50 query strings in nested for loops (query string in the outer loop).
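Roughly, the benchmark driver has this shape (a minimal sketch with placeholder data and names, not my exact harness; the real strings come from the generated dataset, and `sed` is the function shown above):

```rust
// Minimal sketch of the benchmark shape (placeholder data, not the real dataset).
use std::time::Instant;

fn main() {
    // Stand-ins for the 20 000 generated DNA sequences and the 50 query strings.
    let inputs: Vec<String> = (0..20_000usize).map(|i| "ACGT".repeat(24 + i % 3)).collect();
    let queries: Vec<String> = (0..50usize).map(|i| "TGCA".repeat(24 + i % 3)).collect();

    let start = Instant::now();
    let mut total: u64 = 0;
    for q in &queries {          // query string in the outer loop
        for s in &inputs {
            total += sed(q.as_bytes(), s.as_bytes()) as u64;
        }
    }
    println!("total = {total}, elapsed = {:?}", start.elapsed());
}
```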
Rustc fails to keep `cc[j] = min(...)` in a register to become next iteration's `cc[j - 1]`; instead it actually stores + reloads, creating a longer latency bottleneck. This is on top of the branchless 2x `cmp`/`cmov` and `add` (+1) that's probably about 5 cycles to get the `min` of 3 inputs. Store-forwarding latency is usually about 5 cycles on typical modern x86, so that explains the 2x speed difference. See https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ and https://agner.org/optimize/. On some CPUs, like Zen 2, Zen 4, and Ice Lake, zero-latency store forwarding is possible, but that usually requires the same addressing mode, not just the same final address (https://www.agner.org/forum/viewtopic.php?t=41).
Or maybe there's a throughput bottleneck as well for the GCC version. GCC does `min(cp[j] + 1, cp[j - 1] + (c1 != c2))` first because that's independent of last iteration's `min` result that's now in `cc[j - 1]`, so the actual critical-path latency is only `add`/`cmp`/`cmov`, 3 cycles on Broadwell and later, or AMD.
Without compiler versions/options, the CPU model, and details of your microbenchmark (to see whether the function inlined and compiled differently there), the precise explanation for the exact speed difference will have to wait, but the bottom line is a missed optimization by the Rust compiler.
You could probably help it out by using a local variable instead of re-reading an array element (with the bounds checking that entails in Rust); see the improved loop further down.
The inner loop has a dependency on the previous iteration via `cc[j-1] + 1`, so compilers can only vectorize the `cp[j] = j;` init loop.
It's this loop-carried dependency that appears to be a latency bottleneck in the Rust `-C opt-level=3` code-gen, because Rustc / LLVM made asm that reloads it from memory instead of keeping the result of last iteration's `min` in a register.
Perhaps it has a harder time optimizing than in C++ because of the bounds checks (there are some extra `cmp`/`jcc` conditional branches in the inner loop, but I don't think that would account for a factor of 2 in throughput). But LittleBoxOfSunshine's answer tested with `get_unchecked` and found negligible speedups, so probably LLVM just misses this.
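For reference, removing the bounds checks in the inner loop looks roughly like the sketch below. This is my own illustration of that kind of change, not the code from that answer; the `unsafe` is only sound because the loop ranges keep every index within the slice and `Vec` lengths.

```rust
// Sketch: the same inner loops with bounds checks removed via get_unchecked.
// Safety: 1..=s1.len() and 1..=s2.len() keep every index in range, and
// cp/cc both have s2.len() + 1 elements.
for i in 1..=s1.len() {
    cc[0] = i as u32;
    for j in 1..=s2.len() {
        unsafe {
            let c1 = *s1.get_unchecked(i - 1);
            let c2 = *s2.get_unchecked(j - 1);
            let sub = *cp.get_unchecked(j - 1) + (c1 != c2) as u32;
            let del = *cp.get_unchecked(j) + 1;
            let ins = *cc.get_unchecked(j - 1) + 1;
            *cc.get_unchecked_mut(j) = sub.min(del).min(ins);
        }
    }
    std::mem::swap(&mut cp, &mut cc);
}
```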
GCC code-gen for the inner loop looks like this (Godbolt), after changing the type to `unsigned` to match your use of `u32` in Rust. I know you actually benchmarked a version that used 64-bit `size_t` elements for the vectors (assuming you compiled for x86-64), but that's not significant; for your small strings, these arrays easily fit in L1d cache.
# GCC13 -O3 -march=x86-64-v3 -Wall   but this loop didn't use any new instructions
.L15:
        mov     eax, DWORD PTR [rbx+rdx*4]       # EAX = cp[j]
        xor     esi, esi
        cmp     BYTE PTR [r9-1+rdx], dil
        setne   sil                              # ESI = (c1 != c2)
        add     esi, DWORD PTR [rbx-4+rdx*4]     # ESI += cp[j-1]
        add     eax, 1
        cmp     esi, eax
        cmovbe  eax, esi                         # EAX = (ESI<=EAX) ? ESI : EAX, i.e. min
        add     ecx, 1                           # ECX = cc[j-1] + 1, still in a register
        cmp     eax, ecx
        cmovbe  ecx, eax                         # ECX = min(EAX, ECX) = cc[j]
        mov     rax, rdx
        mov     DWORD PTR [r14+rdx*4], ecx       # store the result
        add     rdx, 1
        cmp     r15, rax
        jne     .L15
Note that there are only two DWORD loads plus the BYTE load (of the char from the string), and none of the loads use `r14` in the addressing mode like the store does. So we can see there are no loads from `cc[]`, just the store; `cc[j - 1] + 1` must therefore be computed from last iteration's `min` result, which GCC keeps live in ECX instead of reloading it from the store.
Another possible optimization would have been to keep `cp[j]` around for use as `cp[j-1]` next iteration, but looking at the addressing modes is enough to see that it didn't do that.
Clang is weird, with what looks like two versions of the loop, one with setcc/cmovcc and another with branches (starting at `.LBB0_38:`, I think).
But Rustc (Godbolt) has one inner loop:
.LBB0_20:
        cmp     rdi, r15
        adc     r10, 0                              # r10 += (rdi<r15) unsigned
        lea     r11, [rdi - 1]
        cmp     r11, r15
        jae     .LBB0_21                            # bounds check?
        cmp     rdi, rbx
        jae     .LBB0_27                            # these are based on 64-bit unsigned compares, so they're not u32 or char data
        xor     r11d, r11d
        cmp     r9b, byte ptr [rbp + rdi - 1]
        setne   r11b                                # r11 = (c1 != c2)
        add     r11d, dword ptr [r13 + 4*rdi - 4]
        mov     r12d, dword ptr [r14 + 4*rdi - 4]   # reload cc[j-1] from [r14 + scaled-idx], same base as the store below
        inc     r12d
        cmp     r11d, r12d
        cmovb   r12d, r11d
        mov     r11d, dword ptr [r13 + 4*rdi]
        inc     r11d
        cmp     r12d, r11d
        cmovb   r11d, r12d
        mov     dword ptr [r14 + 4*rdi], r11d       # store to [r14 + scaled-idx]
        cmp     rdi, r15
        jae     .LBB0_31
        mov     rdi, r10
        cmp     r10, r15
        jbe     .LBB0_20
So the main work is still done branchlessly, like GCC. But unlike GCC, it's actually storing/reloading `cc[j - 1]`, so store-forwarding latency becomes part of the loop-carried dependency chain critical path, which is the bottleneck.
Seeing these problems in the asm, we can change the source to make it easier for the compiler to avoid them: use a local `mut lastccj` variable to hold `cc[j]` across iterations, and use `lastccj` only in the outer `min`, not as an input to the inner one, to keep the critical-path latency down to `inc`/`cmp`/`cmov`, with the first `cmp`/`cmov` being independent.
// Improved version of the loop.
for i in 1..=s1.len() {
    cc[0] = i as u32;
    let c1 = s1[i - 1];
    let mut lastccj = cc[0];
    for j in 1..=s2.len() {
        let c2 = s2[j - 1];
        lastccj = std::cmp::min(
            std::cmp::min(cp[j - 1] + (c1 != c2) as u32, cp[j] + 1),
            lastccj + 1, // only needed after the first min result is ready
        );
        cc[j] = lastccj;
    }
    std::mem::swap(&mut cp, &mut cc);
}
This is obviously somewhat less readable and harder to follow, so it would be nice if compilers did a better job. There might also be something to gain from `cp[j-1]` vs. `cp[j]`, using a temporary variable for that as well, but that's just a throughput issue, so we'd have to check whether it actually saves instructions. Modern x86 can do 2 or 3 loads per clock cycle, so the back-end execution units aren't the limit; the limit is getting all the instructions through the front-end and issued into the back-end. Especially with the bounds-check overhead, this loop is far from tiny.
Perhaps some SIMD to prepare a buffer of `std::cmp::min(cp[j - 1] + (c1 != c2) as u32, cp[j] + 1)` results would help; that could maybe even auto-vectorize. It could be very good with AVX-512 merge-masking, but could still go 8 elements at a time with AVX2 in a few instructions.
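A sketch of that two-pass idea (an assumption about what LLVM might be willing to auto-vectorize, not something I've profiled; the scratch buffer `e` is my own name): fill a buffer with the part that has no loop-carried dependency, then run a short serial loop for the `cc[j - 1] + 1` chain.

```rust
// Sketch: split each row into a vectorizable pass (no loop-carried dependency)
// and a serial pass for the min against cc[j - 1] + 1.
let mut e = vec![0u32; s2.len() + 1];     // scratch buffer, reused every row
for i in 1..=s1.len() {
    cc[0] = i as u32;
    let c1 = s1[i - 1];
    // Pass 1: each e[j] depends only on cp[] and the strings, so it can vectorize.
    for j in 1..=s2.len() {
        e[j] = std::cmp::min(cp[j - 1] + (c1 != s2[j - 1]) as u32, cp[j] + 1);
    }
    // Pass 2: the unavoidable serial chain through cc[j - 1] + 1.
    let mut lastccj = cc[0];
    for j in 1..=s2.len() {
        lastccj = std::cmp::min(e[j], lastccj + 1);
        cc[j] = lastccj;
    }
    std::mem::swap(&mut cp, &mut cc);
}
```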
But the final loop is going to have a 3 cycle loop-carried dependency and the bounds-check overhead, so there's a limit to how much speedup we can get.
https://uica.uops.info/ predicts that the inner loop from this optimized version bottlenecks at one iteration per 5.36 cycles on Skylake, one per 4.7 cycles on Ice Lake / Rocket Lake. (Or 4.2 cycles with perfect scheduling). Those are both worse than the 3-cycle latency bottleneck. So the bounds-check branching becomes part of the throughput bottleneck after fixing the latency bottleneck. CPUs with 6-wide pipelines like Alder Lake P-cores or Zen 3 or 4 could run it even faster, in theory one per 3.5 cycles.
Would it work to manually vectorize with SIMD prefix-scan techniques for the chain of `min` operations? Maybe not, because I'm not sure it's separable into a prefix-min of `cc[j]+1`; the min with elements from `cp[j]` can affect an older `cc[j]` in ways that we can't correct for later. So the mix of `+1` and `min` makes it not associative, I think.