I am building a classifier to categorize documents. So the first step is to represent each document as a "feature vector" for training.
After some research, I found that I can use either the Bag of Words approach or the N-gram approach to represent a document as a vector.
The text in each document (scanned PDFs and images) is retrieved using OCR, so some words contain errors. I also have no prior knowledge of the language used in these documents, so I can't use stemming.
So, as far as I understand, I have to use the N-gram approach. Or are there other approaches to represent a document?
N-grams are just sequences of N items. In classification by topic you normally use N-grams of words or of their roots (though there are also models based on N-grams of characters). The most popular N-grams are unigrams (single words), bigrams (2 consecutive words) and trigrams (3 consecutive words). So, from the sentence
Hello, my name is Frank
you should get the following unigrams:
[hello, my, name, is, frank] (or [hello, I, name, be, frank], if you use roots)
and the following bigrams:
[hello_my, my_name, name_is, is_frank]
and so on.
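As a minimal illustrative sketch in Python (my own code, using a simple lowercase regex tokenizer, which the answer does not prescribe), this produces exactly those unigrams and bigrams:

    import re

    def ngrams(tokens, n):
        # Join each window of n consecutive tokens with "_".
        return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = re.findall(r"\w+", "Hello, my name is Frank".lower())
    print(ngrams(tokens, 1))  # ['hello', 'my', 'name', 'is', 'frank']
    print(ngrams(tokens, 2))  # ['hello_my', 'my_name', 'name_is', 'is_frank']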
In the end, your feature vector should have as many positions (dimensions) as there are distinct words across all your texts, plus 1 for unknown words. Every position in an instance vector should somehow reflect the number of corresponding words in that instance's text. This may be the number of occurrences, a binary feature (1 if the word occurs, 0 otherwise), a normalized count, or tf-idf (very popular in classification by topic).
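As a sketch of these weighting schemes, assuming Python with scikit-learn (my choice of library, not something the answer requires), each option maps to a vectorizer parameter; character N-grams, which can be more robust to OCR noise, are just a different analyzer setting:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["hello my name is frank", "my name is not frank"]  # toy corpus

    counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)               # occurrence counts
    binary = CountVectorizer(ngram_range=(1, 2), binary=True).fit_transform(docs)  # 1/0 presence
    tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)                # tf-idf weights
    chars = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(docs)  # char N-grams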
The classification process itself is then the same as for any other domain.
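For completeness, an end-to-end sketch of that process, again assuming scikit-learn and an arbitrary classifier choice (multinomial Naive Bayes here; the texts and labels are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder OCR texts and category labels, just to show the API shape.
    train_texts = ["ocr text of an invoice ...", "ocr text of a contract ..."]
    train_labels = ["invoice", "contract"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(train_texts, train_labels)
    print(model.predict(["ocr text of a new document ..."]))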