Here is my code. I have a sentence that I want to tokenize and stem before passing it to TfidfVectorizer, to finally get a tf-idf representation of the sentence:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer_ita = SnowballStemmer("italian")

def tokenizer_stemmer_ita(text):
    return [stemmer_ita.stem(word) for word in text.split()]

def sentence_tokenizer_stemmer(text):
    return " ".join([stemmer_ita.stem(word) for word in text.split()])

X_train = ['il libro è sul tavolo']
X_train = [sentence_tokenizer_stemmer(text) for text in X_train]

tfidf = TfidfVectorizer(preprocessor=None, tokenizer=None, use_idf=True, stop_words=None, ngram_range=(1, 2))
X_train = tfidf.fit_transform(X_train)

# let's see the features
print(tfidf.get_feature_names())
I get as output:
['il', 'il libr', 'libr', 'libr sul', 'sul', 'sul tavol', 'tavol']
If I change the parameter
tokenizer=None
to:
tokenizer=tokenizer_stemmer_ita
and comment out this line:
X_train = [sentence_tokenizer_stemmer(text) for text in X_train]
I expect to get the same result, but the output is different:
['il', 'il libr', 'libr', 'libr è', 'sul', 'sul tavol', 'tavol', 'è', 'è sul']
Why? Am I implementing the external stemmer correctly? It seems, at least, that stopwords ("è") are removed in the first run, even though stop_words=None.
[edit] As suggested by Vivek, the problem seems to be the default token pattern, which is applied anyway when tokenizer=None. So if I add these two lines at the beginning of tokenizer_stemmer_ita (with import re at the top of the file):
token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
text = " ".join(token_pattern.findall(text))
I should get the correct behaviour, and in fact I do for the simple example above. But for a different example:
X_train = ['0.05%.\n\nVedete?']
I don't; the two outputs are different:
['05', '05 ved', 'ved']
and
['05', '05 vedete', 'vedete']
Why? In this case the question mark seems to be the problem; without it, the outputs are identical.
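Checking the stemmer directly shows the effect of the question mark (a minimal check, assuming NLTK's SnowballStemmer behaves as the two outputs above suggest):

from nltk.stem.snowball import SnowballStemmer

stemmer_ita = SnowballStemmer("italian")
print(stemmer_ita.stem('Vedete'))   # 'ved' -- the verb suffix is stripped
print(stemmer_ita.stem('Vedete?'))  # 'vedete?' -- the trailing '?' blocks the suffix match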
[edit2] It seems I have to stem first and then apply the regex; in that case, the two outputs are identical.
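A sketch of that stem-first tokenizer (my reading of edit2: stem the whitespace-split words first, then apply the default regex to the stemmed string):

import re
from nltk.stem.snowball import SnowballStemmer

stemmer_ita = SnowballStemmer("italian")
token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')

def tokenizer_stemmer_ita(text):
    # stem whitespace-split words first, mirroring sentence_tokenizer_stemmer,
    # then tokenize the stemmed string with the default pattern
    stemmed = " ".join(stemmer_ita.stem(word) for word in text.split())
    return token_pattern.findall(stemmed)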
That's because of the default token pattern token_pattern used in TfidfVectorizer:
token_pattern : string
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
So the single-character token è is not selected, because the default pattern requires two or more word characters.
import re

token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
print(token_pattern.findall('il libro è sul tavolo'))
# Output
# ['il', 'libro', 'sul', 'tavolo']
This default token_pattern is used when tokenizer is None, as you are experiencing.
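If you want a custom tokenizer that reproduces this default tokenization before stemming, one option (a sketch, not the only way; note it applies the regex before stemming, so per the question's edit2 it can still differ from the stem-first pipeline on inputs like 'Vedete?') is to reuse the vectorizer's own build_tokenizer():

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer_ita = SnowballStemmer("italian")
# build_tokenizer() returns a callable that applies the default token_pattern
default_tokenize = TfidfVectorizer().build_tokenizer()

def tokenizer_stemmer_ita(text):
    return [stemmer_ita.stem(token) for token in default_tokenize(text)]

tfidf = TfidfVectorizer(tokenizer=tokenizer_stemmer_ita, ngram_range=(1, 2))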