I have a MySQL table containing Latin-script text, and I am trying to tokenize this text into words.
I came across the Boost and ICU tokenizers. The problem is that these libraries expect me to figure out the word boundaries myself.
I tried the following Boost code (with the default tokenizer, i.e. splitting on spaces and punctuation):
#include &lt;iostream>
#include &lt;string>
#include &lt;boost/tokenizer.hpp>

int main() {
    using namespace std;
    using namespace boost;

    string s = "Tänk efter nu – förr'n vi föser dig bort";
    tokenizer<> tok(s);  // default TokenizerFunc: spaces and punctuation as delimiters
    for (tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        cout << *beg << "\n";
    }
    return 0;
}
It does give me a list of words, but I am assuming here that whitespace is the correct word separator.
Considering the set of languages with complete ISO/IEC 8859-1 coverage ( http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Languages_with_complete_coverage ), is it safe to use the code above?
Or can you recommend another solution?
ICU has support for boundary analysis that takes the characteristics of the text's language into account: