rust, iterator, char, byte, element

bytes().nth() vs chars().nth() in rust


I recently found out that bytes().nth(i) is way faster than chars().nth(i) if you want to iterate over a String.

pub fn test(word1: String) {
    for i in (0..word1.len()).rev() {
        word1.bytes().nth(i);
    }
}
// runs in +- 820 microseconds (word1.len() = 42551)

vs

pub fn test(word1: String) {
    for i in (0..word1.len()).rev() {
        word1.chars().nth(i);
    }
}
// runs in +- 17.5 seconds (word1.len() = 42551)

I found the explanation to be that bytes().nth() has time complexity O(1) while chars().nth() has time complexity O(n). But why is this the case?

I also saw a suggestion to instantiate an iterator variable and use chars().next() / .next_back(), like in the example below.

let mut word1_chars = word1.chars();

for _i in (0..word1.len()).rev() {
    word1_chars.next_back();
}
// runs in +- 600 microseconds (word1.len() = 42551)

I was shocked when I saw this was even faster. I presume there is a logical explanation for all of this, so if you know it please share it :).

Thanks in advance

Edit: The times were recorded when building in debug mode. When building for release there is no difference between bytes().nth() and chars().next() (while chars().nth() is still more than 10 000 times slower). Why is there such a "big" difference between bytes().nth() and chars().next() in debug mode but not in release?


Solution

  • .chars() enumerates the chars of the string, not simply the bytes.
    Coming from C, for example, there is a historical confusion between a char (with text in mind) and a byte (a small integer); in quick-and-dirty disposable programs we don't bother making the distinction (although we should).

    On the other hand, in Rust a char is not a single byte, but a wider integer which could be the result of decoding a UTF-8 sequence of bytes.
    Such a single char could come from a sequence of one, two, three or four bytes.
    Rust strings are UTF-8 encoded, so accessing the Nth char cannot be performed in constant time, because each of the preceding chars could be made of a varying number of bytes.
    Each one has to be decoded in order, and we stop when N chars have been decoded.
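
    As a quick illustration (the string and the numbers below are my own example, not from the original post), len() counts bytes while chars() decodes UTF-8 sequences, which is why the two counts differ as soon as non-ASCII characters appear:

    fn main() {
        let s = "héllo✓"; // 'h', 'l', 'o' take 1 byte each, 'é' takes 2, '✓' takes 3
        assert_eq!(s.len(), 9);           // len() counts bytes
        assert_eq!(s.chars().count(), 6); // chars() counts decoded chars
        // Reaching the Nth char requires decoding every char before it:
        assert_eq!(s.chars().nth(5), Some('✓'));
    }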

    Concerning your second question about .next() being "faster", I guess you are comparing something like

    for i in 0..N {
        let c = word.chars().nth(i);
        // ...
    }
    

    against something like

    for c in word.chars() {
        // ...
    }
    

    which is similar to

    let mut it = word.chars();
    while let Some(c) = it.next() {
        // ...
    }
    

    In the first case, each iteration implies decoding all the chars from the beginning of the string up to index i.
    The two other cases decode, at each iteration, only the next char, starting from the byte offset kept inside the iterator from the previous iteration.
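
    To make that byte offset concrete, here is a small sketch (the string is just a toy example of mine) using char_indices(), which exposes the byte offset at which each char starts; the step between offsets varies with the char's encoded length:

    fn main() {
        let word = "aé✓b";
        for (byte_offset, c) in word.char_indices() {
            // prints: a at 0, é at 1, ✓ at 3, b at 6
            println!("{c} starts at byte {byte_offset}");
        }
    }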

    Note that .next_back() is also efficient because a UTF-8 sequence of bytes can be decoded backwards.
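
    For instance (again my own toy example), the same iterator can be consumed from the back, and multi-byte chars are decoded correctly:

    fn main() {
        let mut it = "aé✓b".chars();
        assert_eq!(it.next_back(), Some('b'));
        assert_eq!(it.next_back(), Some('✓'));
        assert_eq!(it.next_back(), Some('é'));
        assert_eq!(it.next_back(), Some('a'));
        assert_eq!(it.next_back(), None);
    }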