utf-8 rust

How to decode a single UTF-8 character and step onto the next using only the Rust standard library?


Does Rust provide a way to decode a single character (a Unicode scalar value, to be exact), which may span multiple bytes, from a &[u8]?

Something like GLib's g_utf8_get_char & g_utf8_next_char:

// Example of what GLib's functions might look like once ported to Rust.
let mut i = 0;
while i < slice.len() {
    let unicode_char = g_utf8_get_char(&slice[i..]);

    // do something with the unicode character
    function(unicode_char);

    // move onto the next.
    i += g_utf8_next_char(&slice[i..]);
}

Short of porting parts of the GLib API to Rust, does Rust provide a way to do this, besides some trial & error calls to from_utf8 which stop once the second character is reached?

See GLib's code.


Solution

  • Since Rust 1.79, &[u8] has a utf8_chunks method, which returns an iterator of type Utf8Chunks.

    This can be used to get the functionality you want, even if the byte slice contains invalid UTF-8, though it's not a perfect API. The simplest way to use it looks like this:

    let b : &[u8] = b"1\xD02";
    for chunk in b.utf8_chunks() {
        for c in chunk.valid().chars() {
            println!("{} valid",c);
        }
        for b in chunk.invalid() {
            println!("{} invalid",b);
        }
    }
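
    If it helps to step through the input one item at a time, the nested loops above can be flattened into a single iterator. This is only a sketch: the Decoded enum and the decode_items function are names made up for illustration, not part of the standard library, and it needs Rust 1.79 or later for utf8_chunks.

```rust
/// One step of decoding: a scalar value, or a run of bytes that is not valid UTF-8.
/// (Hypothetical type, introduced only for this example.)
#[derive(Debug, PartialEq)]
enum Decoded<'a> {
    Char(char),
    Invalid(&'a [u8]),
}

/// Flattens the chunks of `bytes` into one item per character or invalid run.
/// (Hypothetical helper, introduced only for this example.)
fn decode_items<'a>(bytes: &'a [u8]) -> impl Iterator<Item = Decoded<'a>> + 'a {
    bytes.utf8_chunks().flat_map(|chunk| {
        // Each chunk is a valid string followed by an invalid run, which is
        // empty only for the last chunk of the input.
        let invalid = chunk.invalid();
        chunk
            .valid()
            .chars()
            .map(Decoded::Char)
            .chain((!invalid.is_empty()).then_some(Decoded::Invalid(invalid)))
    })
}
```

    For the input b"1\xD02" this yields Char('1'), then Invalid([0xD0]), then Char('2'), one per call to next().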
    

    If you want something more directly equivalent to your GLib example, one that gives you a value and a length, you can use:

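    use std::cmp::min;

    let mut i = 0;
    while i < b.len() {
        // A scalar value is at most 4 bytes long, so look at no more than that.
        let sliceend = min(i + 4, b.len());
        let remain = &b[i..sliceend];
        // The first chunk's valid part is the longest valid prefix of `remain`.
        let chunk = remain.utf8_chunks().next().unwrap().valid();
        if let Some(c) = chunk.chars().next() {
            println!("{} valid", c);
            // Advance by this character only, not by the whole valid prefix.
            i += c.len_utf8();
        } else {
            println!("{} invalid", remain[0]);
            i += 1;
        }
    }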
    

    Unfortunately, the utf8_chunks API can be rather inefficient in some cases, because it validates a whole "chunk" of valid UTF-8 at once, even if you only need to decode a few characters.
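
    One way to bound that cost, sketched here rather than offered as the definitive approach, is to validate at most four bytes (the longest UTF-8 encoding of a scalar value) with std::str::from_utf8 and use Utf8Error::valid_up_to to recover the valid prefix. decode_first_char and decode_all are hypothetical names chosen for this example, not standard library functions. Since it avoids utf8_chunks, it should also work on compilers older than 1.79.

```rust
use std::str;

/// Decodes the first UTF-8 encoded scalar value in `bytes`.
/// Returns the character and its encoded length in bytes, or `None` if
/// `bytes` is empty or does not start with a valid sequence.
fn decode_first_char(bytes: &[u8]) -> Option<(char, usize)> {
    // Only the first 4 bytes can belong to the first character.
    let window = &bytes[..bytes.len().min(4)];
    let valid = match str::from_utf8(window) {
        Ok(s) => s,
        // The window may contain bad bytes or cut a later character in half;
        // keep the prefix that did validate.
        Err(e) => str::from_utf8(&window[..e.valid_up_to()]).ok()?,
    };
    let c = valid.chars().next()?;
    Some((c, c.len_utf8()))
}

/// Walks `bytes` front to back, recording `Ok(char)` for each decoded
/// character and `Err(byte)` for each byte that could not be decoded.
fn decode_all(bytes: &[u8]) -> Vec<Result<char, u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match decode_first_char(&bytes[i..]) {
            Some((c, len)) => {
                out.push(Ok(c));
                i += len;
            }
            None => {
                out.push(Err(bytes[i]));
                i += 1;
            }
        }
    }
    out
}
```

    Here decode_first_char plays the role of g_utf8_get_char and g_utf8_next_char combined: the returned length is what you add to the index to step onto the next character.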