I have a MySQL table containing Latin-script text, and I am trying to tokenize this text into words.
I came across the Boost and ICU tokenizers. The problem is that these libraries expect me to figure out the word boundaries myself.
I tried the following Boost code (with the default tokenizer, i.e. splitting on spaces and punctuation):
#include &lt;iostream>
#include &lt;string>
#include &lt;boost/tokenizer.hpp>

int main() {
    using namespace std;
    using namespace boost;

    string s = "Tänk efter nu – förr'n vi föser dig bort";
    tokenizer<> tok(s);  // default TokenizerFunc: spaces and punctuation as delimiters
    for (tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        cout << *beg << "\n";
    }
    return 0;
}
It does give me a list of words, but I am assuming here that whitespace is the correct word separator.
Considering the set of languages with complete ISO/IEC 8859-1 coverage ( http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Languages_with_complete_coverage ), is it safe to use the code above?
Or can you recommend another solution?
ICU has support for boundary analysis that takes the characteristics of the text's language into account: