I am currently working on an iterator which splits a given string and yields the substrings one at a time. For a special character it returns just that character by itself; otherwise it returns a substring of alphanumeric characters, split at whitespace.
I guess there is some kind of problem with the indexing because of the UTF-8 chars, but I do not know how to handle it.
This is the struct and its iterator implementation.
pub struct SpecialStr<'a> {
    string: &'a str,
    back: usize, // index of the back of the &str substring.
}

impl<'a> SpecialStr<'a> {
    pub fn new(input: &'a str) -> Self {
        SpecialStr { string: input, back: 0 }
    }
}

// anything which is not an alphanumeric or a whitespace.
pub fn is_special(c: char) -> bool {
    !c.is_ascii_alphanumeric() && !c.is_whitespace()
}

impl<'a> Iterator for SpecialStr<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<Self::Item> {
        let input_string: &str = self.string;
        let max_index = self.string.len();
        for front in self.back..max_index {
            let character = match self.string.chars().nth(front) {
                Some(character) => character,
                None => return None,
            };
            // if the present char is a special character just return it by itself.
            if is_special(character) {
                self.back += character.len_utf8();
                return Some(&input_string[self.back - character.len_utf8()..self.back]);
            } else if !character.is_whitespace() {
                // if it is not a special character then we are going to select a substring whose end will be at:
                // -- the one before the next following special character
                // -- or the one before a whitespace
                // -- or the one before the end of the sentence.
                // then we are going to determine the substring to be selected based on this comparison.
                for back in front + character.len_utf8()..max_index {
                    let character_2 = match self.string.chars().nth(back) {
                        Some(character) => character,
                        None => return None,
                    };
                    if is_special(character_2) || character_2.is_whitespace() || back == max_index - 1 {
                        self.back = back;
                        return Some(&input_string[front..self.back]);
                    }
                }
            } else {
                self.back += 1;
            }
        }
        None
    }
}
And this is the test.

#[test]
fn divide_n_print_3() {
    use super::tokenisation::SpecialStr;
    let input = "` i love mine, too . happy mother½s day to all";
    let new_one = SpecialStr::new(&input);
    for i in new_one.into_iter() {
        println!("{}", i);
    }
}
I am getting the error:

thread 'feature_extraction::tokenisation_test::divide_n_print_3' panicked at 'byte index 38 is not a char boundary; it is inside '½' (bytes 37..39) of `` i love mine, too . happy mother½s day to all`', src\feature_extraction\tokenisation.rs:74:38
I understand the meaning of the error, but I do not have any idea how to solve it. Any kind of help would be appreciated.
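The panic comes from mixing two different units. `for front in self.back..max_index` counts *bytes* (`len()` is a byte length), but `self.string.chars().nth(front)` interprets `front` as a *character* count, and the slice `&input_string[front..self.back]` uses byte offsets again. The two coincide for pure ASCII and drift apart as soon as a multi-byte character such as '½' appears, at which point a slice boundary can land in the middle of a character. A small sketch of the mismatch, using a shortened sample string:

```rust
fn main() {
    let s = "mother½s";
    // '½' is the 7th char (index 6) but occupies two bytes (6..8),
    // so the final 's' is char index 7 yet starts at byte offset 8:
    assert_eq!(s.chars().nth(7), Some('s'));
    assert_eq!(s.char_indices().nth(7), Some((8, 's')));
    // Byte 7 falls inside '½'; slicing there (e.g. &s[..7]) would panic:
    assert!(!s.is_char_boundary(7));
    assert!(s.is_char_boundary(8));
}
```

Iterating with `char_indices()`, which yields the byte offset alongside each char, keeps lookup and slicing in the same units (and also avoids the quadratic cost of calling `nth()` inside a loop).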
Maybe this is a case where using regular expressions truly makes the problem easier.
use regex::Regex;
use std::sync::OnceLock;

fn tokenize(s: &str) -> impl Iterator<Item = &str> {
    static REGEX: OnceLock<Regex> = OnceLock::new();
    let regex = REGEX.get_or_init(|| Regex::new(r"[[:alnum:]]+|\S").unwrap());
    regex.find_iter(s).map(|m| m.as_str())
}
This returns any consecutive run of ASCII alphanumeric characters, otherwise any single non-whitespace character, and skips all whitespace. (Note that it skips all Unicode whitespace while only considering ASCII alphanumeric characters, since this is what your code does.)
If you prefer to implement the iterator yourself, here is one option:
struct Tokenizer<'a> {
    s: &'a str,
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<Self::Item> {
        self.s = self.s.trim_start();
        let c = self.s.chars().next()?;
        let len = if c.is_ascii_alphanumeric() {
            self.s
                .find(|c: char| !c.is_ascii_alphanumeric())
                .unwrap_or(self.s.len())
        } else {
            c.len_utf8()
        };
        let result;
        (result, self.s) = self.s.split_at(len);
        Some(result)
    }
}
This avoids most of the issues you had by using string methods for the actual iteration – trim_start() to skip whitespace, and find() to locate the end of a run of alphanumeric characters. Both work in byte offsets, so slicing always happens on char boundaries.
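A quick usage sketch (the `Tokenizer` definition from above is repeated so the snippet compiles on its own); it yields the same tokens as the regex version and handles the multi-byte '½' without panicking:

```rust
struct Tokenizer<'a> {
    s: &'a str,
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<Self::Item> {
        self.s = self.s.trim_start();
        let c = self.s.chars().next()?;
        let len = if c.is_ascii_alphanumeric() {
            self.s
                .find(|c: char| !c.is_ascii_alphanumeric())
                .unwrap_or(self.s.len())
        } else {
            c.len_utf8()
        };
        let result;
        (result, self.s) = self.s.split_at(len);
        Some(result)
    }
}

fn main() {
    let tokens: Vec<&str> =
        Tokenizer { s: "` i love mine, too . happy mother½s day to all" }.collect();
    assert_eq!(
        tokens,
        ["`", "i", "love", "mine", ",", "too", ".", "happy",
         "mother", "½", "s", "day", "to", "all"]
    );
    println!("{:?}", tokens);
}
```

Note that `(result, self.s) = self.s.split_at(len);` uses destructuring assignment, which requires Rust 1.59 or later.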