rust encoding

How to efficiently get char_indices for a string in an encoding other than UTF-8 in Rust?


I want to know the byte position of each character in a string when it is encoded in something other than UTF-8. For example, I'm looking for something like string.char_indices("cp936") that yields the byte position of each character as encoded in code page 936. One candidate solution is to iterate over the characters of the string, convert each character to a string, find the number of bytes needed to encode it, yield the current byte position, and then add that byte count to the position. Pseudocode (may not compile) below:

// Using https://github.com/lifthrasiir/rust-encoding.git
use encoding::{Encoding, EncoderTrap};
use encoding::label::encoding_from_whatwg_label;

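// Usage: char_indices("Acme\u{a9}", "ISO-8859-2")
// Returns None if the label is unknown or a character cannot be encoded.
fn char_indices(s: &str, encoding_label: &str) -> Option<Vec<(usize, char)>> {
    let mut pos = 0;
    let mut result = Vec::new();
    let encoder = encoding_from_whatwg_label(encoding_label)?;
    for ch in s.chars() {
        // Allocates a fresh String for every character.
        let ch_str = ch.to_string();
        // `encode` returns a Result; Strict mode errors on unmappable characters.
        let bytes = encoder.encode(&ch_str, EncoderTrap::Strict).ok()?;
        result.push((pos, ch));
        pos += bytes.len();
    }

    Some(result)
}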

But it looks very slow, and I need something more efficient. (I haven't actually benchmarked it, so I may be wrong.) Is this achievable in Rust? I'd prefer a solution that relies on a well-known crate rather than requiring me to encode the string manually.


Solution

  • let ch_str = ch.to_string();
    

    This allocation is unnecessary. char_indices() gives you the start index of each code point in the source string, and char::len_utf8 tells you how many bytes that code point occupies. Combining the two yields a string slice covering just that code point, which you can then encode.
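
To make the slicing idea concrete, here is a minimal std-only sketch. The helper name char_slices is hypothetical (it is not part of any crate); it only shows how char_indices() and char::len_utf8 combine to borrow each code point as a &str without allocating.

```rust
/// Hypothetical helper: yields every character of `s` together with the
/// `&str` slice that holds it. The slices borrow from `s`, so nothing is
/// allocated per character.
fn char_slices<'a>(s: &'a str) -> impl Iterator<Item = (char, &'a str)> + 'a {
    s.char_indices()
        // `i` is the UTF-8 byte offset of `c`; `len_utf8` gives its width,
        // so `i..i + c.len_utf8()` always falls on char boundaries.
        .map(move |(i, c)| (c, &s[i..i + c.len_utf8()]))
}

fn main() {
    for (c, slice) in char_slices("étoile 中") {
        println!("{c:?} -> {slice:?} ({} UTF-8 bytes)", slice.len());
    }
}
```

Each slice can then be handed to an encoder in place of the String produced by ch.to_string().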

    rust-encoding is abandoned, but with encoding_rs you can encode into a sufficiently large byte buffer using encode_from_utf8_without_replacement, which returns the result status together with the number of bytes it read and the number of bytes it wrote.

    With some error handling for the various runtime conditions, this seems to work:

    use encoding_rs::{EncoderResult, Encoding};
    use itertools::{Itertools, Position};
    
    fn char_indices<'a>(
        s: &'a str,
        encoding_label: &'a str,
    ) -> Option<impl Iterator<Item = Result<(usize, char), EncoderResult>> + 'a> {
        let mut e = Encoding::for_label(encoding_label.as_bytes())?.new_encoder();
    
        let mut idx = 0;
        Some(s.char_indices().with_position().map(move |(pos, (i, c))| {
            // 8 bytes should be ample for a single code point: most
            // encodings need at most 4 bytes per character, though a
            // stateful one such as ISO-2022-JP can add escape sequences
            let mut buf = [0u8; 8];
            match e.encode_from_utf8_without_replacement(
                &s[i..i + c.len_utf8()],
                &mut buf,
                matches!(pos, Position::Only | Position::Last),
            ) {
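                // All input consumed: yield the offset at which this
                // character starts in the encoded output, then advance.
                (EncoderResult::InputEmpty, _, out_len) => {
                    let idx_before = idx;
                    idx += out_len;
                    Ok((idx_before, c))
                }
                // Unmappable character or full buffer: surface the status.
                (r, _, _) => Err(r),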
            }
        }))
    }
    
    fn main() {
        for s in ["thing", "étoile", "中华人民共和国"] {
            println!("{s}");
            for e in ["utf-8", "iso-8859-2", "Shift_JIS", "GBK"] {
                println!("\t{e}");
                for r in char_indices(s, e).expect("a valid encoding label") {
                    println!("\t\t{r:?}")
                }
            }
        }
    }