string rust unicode devanagari grapheme

How to split Devanagari bi-, tri-, and tetra-conjunct consonants as a whole from a string?


I am trying to split Devanagari text into vowels and bi-, tri-, and tetra-conjunct consonants, keeping each conjunct whole together with its vowel sign and virama, so that I can later map the pieces to other Indic scripts. I first tried Rust's chars(), which didn't work. Then I came across grapheme clusters.

I have used grapheme clusters in my current code, but it does not give me the desired output. I understand that this method may not work for complex scripts like Devanagari or other Indic scripts.

How can I achieve the desired output?

Here's the Devanagari script and conjuncts wiki:

Here's what I wrote to split:

use unicode_segmentation::UnicodeSegmentation;


fn main() {
    let hs = "हिन्दी मुख्यमंत्री हिमंत";
    let hsi = hs.graphemes(true).collect::<Vec<&str>>();
    for i in hsi {
        print!("{}  ", i); // two spaces between clusters for readability
    }
}

Current output:

हि  न्  दी   मु  ख्  य  मं  त्  री    हि  मं  त

Desired output:

हि न्दी  मु ख्य मं त्री  हि मं त

My other try:

I also tried to build grapheme clusters by hand, following this answer to "Combining Devanagari characters":

fn split_conjuncts(text: &str) -> Vec<String> {
    let mut result = vec![];
    let mut temp = String::new();

    for c in text.chars() {
        if (c as u32) >= 0x0300 && (c as u32) <= 0x036F {
            temp.push(c);
        } else {
            temp.push(c);
            if !temp.is_empty() {
                result.push(temp.clone());
                temp.clear();
            }
        }
    }
    if !temp.is_empty() {
        result.push(temp);
    }
    result
}

fn main() {
    let text = "संस्कृतम्";
    let split_tokens = split_conjuncts(text);
    println!("{:?}", split_tokens);

}

Output:

["स", "\u{902}", "स", "\u{94d}", "क", "\u{943}", "त", "म", "\u{94d}"]
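One likely reason this attempt yields one token per code point: U+0300 to U+036F is the Combining Diacritical Marks block, while the Devanagari signs in this string sit in the Devanagari block (U+0900 to U+097F), so the `if` branch never fires. A minimal sketch to check this, where the helper name `in_combining_diacriticals` is made up for illustration:

```rust
// Hypothetical helper: the same range test used in split_conjuncts above.
fn in_combining_diacriticals(c: char) -> bool {
    (0x0300..=0x036F).contains(&(c as u32))
}

fn main() {
    // Print every code point of the sample word and whether the range test matches it.
    for c in "संस्कृतम्".chars() {
        println!(
            "U+{:04X} in U+0300..=U+036F: {}",
            c as u32,
            in_combining_diacriticals(c)
        );
    }
}
```

Since none of the signs match, every character takes the `else` branch and is flushed immediately as its own token.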

So, how can I get the desired output?

Desired output:

हि न्दी  मु ख्य मं त्री  हि मं त

I also checked other SO answers (links below) dealing with Unicode, graphemes, and UTF-8 issues, but no luck yet.


Solution

  • Here's my Rust implementation:

    use unicode_segmentation::UnicodeSegmentation;
    
    // Define a struct that holds a grapheme iterator
    struct DevanagariSplitter<'a> {
        graphemes: std::iter::Peekable<unicode_segmentation::Graphemes<'a>>,
    }
    
    // Implement Iterator trait for DevanagariSplitter
    impl<'a> Iterator for DevanagariSplitter<'a> {
        type Item = String;
    
        fn next(&mut self) -> Option<Self::Item> {
            // Get the next grapheme from the iterator
            let mut akshara = match self.graphemes.next() {
                Some(g) => g.to_string(),
                None => return None,
            };
    
            // While the cluster ends with a virama, keep absorbing the next
            // grapheme if it starts with a letter, so that tri- and
            // tetra-conjuncts are joined as well as bi-conjuncts
            while akshara.ends_with('\u{094D}') {
                let starts_with_letter = self
                    .graphemes
                    .peek()
                    .map_or(false, |next| next.starts_with(|c: char| c.is_alphabetic()));
                if !starts_with_letter {
                    break;
                }
                // Append the next grapheme to the current cluster
                akshara.push_str(self.graphemes.next().unwrap());
            }
    
            // Return the akshara as an option
            Some(akshara)
        }
    }
    
    // Define a function that takes a string and returns a DevanagariSplitter
    fn aksharas(s: &str) -> DevanagariSplitter<'_> {
        // Use UnicodeSegmentation to split the string into graphemes
        let graphemes = s.graphemes(true).peekable();
        // Create and return a DevanagariSplitter from the graphemes
        DevanagariSplitter { graphemes }
    }
    
    fn main() {
        // Define an input string in devanagari script
        let input = "हिन्दी मुख्यमंत्री हिमंत";
    
        // Print each akshara separated by spaces using aksharas function
        for akshara in aksharas(input) {
            print!("{} ", akshara);
        }
    }
    
     
    

    Output:

    हि न्दी  मु ख्य मं त्री  हि मं त
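For comparison, here is a rough sketch of the same idea using only the standard library, working on code points rather than on the crate's grapheme clusters. It is not the implementation above: the names `split_aksharas`, `is_dependent_sign`, and `is_consonant` are hypothetical, and the code point ranges are a simplified assumption about the Devanagari block (ZWJ/ZWNJ and the extended letters are ignored).

```rust
const VIRAMA: char = '\u{094D}';

// Assumption: a simplified set of Devanagari signs that attach to what precedes
// them (candrabindu/anusvara/visarga, nukta, vowel signs, virama, accents).
fn is_dependent_sign(c: char) -> bool {
    matches!(
        c as u32,
        0x0900..=0x0903 | 0x093A..=0x093C | 0x093E..=0x094F | 0x0951..=0x0957 | 0x0962..=0x0963
    )
}

// Assumption: only the basic and nukta consonants are treated as consonants.
fn is_consonant(c: char) -> bool {
    matches!(c as u32, 0x0915..=0x0939 | 0x0958..=0x095F)
}

fn split_aksharas(text: &str) -> Vec<String> {
    let mut result = Vec::new();
    let mut current = String::new();

    for c in text.chars() {
        // A character joins the current cluster if it is a dependent sign, or a
        // consonant that directly follows a virama (this chains across any
        // number of viramas, so tri- and tetra-conjuncts stay whole).
        let joins = !current.is_empty()
            && (is_dependent_sign(c) || (is_consonant(c) && current.ends_with(VIRAMA)));

        if !joins && !current.is_empty() {
            // Flush the finished cluster and start a new one.
            result.push(std::mem::take(&mut current));
        }
        current.push(c);
    }
    if !current.is_empty() {
        result.push(current);
    }
    result
}

fn main() {
    for akshara in split_aksharas("हिन्दी मुख्यमंत्री हिमंत") {
        print!("{} ", akshara);
    }
    println!();
}
```

Any character outside these ranges, such as a space, starts its own token, so the caller can filter whitespace tokens out if they are not wanted.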