I am trying to build a CountVectorizer with a custom tokenizer function, and I am running into a strange problem. In the code below, temp_tok is a list of 5 values that is later used as the vocabulary.
temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj", "Normal sinus"]
def tokenize(text):
    return [temp_tok[0], temp_tok[1], "sinus", "Normal sinus"]
def tokenize2(text):
    return [i for i in temp_tok if i in text]
text = "Normal sinus rhythm"
Both functions return the same output for text:
tokenize(text)
output = ['or', 'Normal sinus rhythm', 'sinus', 'Normal sinus']
But when I build vectorizers with these tokenizers, the one using tokenize2 gives unexpected output. The vocabulary is temp_tok for both. I experimented with ngram_range, but it did not help.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize)
vectorizer2 = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize2)
vectorizer.transform([text]) gives the expected output, but vectorizer2.transform([text]) gives 1 only for "or" and "sinus":
vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 1, 1, 0, 1]])
vectorizer2.transform(["Normal sinus rhythm"]).toarray()
array([[1, 0, 1, 0, 0]])
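I also tried passing a dictionary instead of the list temp_tok as the vocabulary to CountVectorizer, but it didn't help. Is this an sklearn problem, or am I doing something wrong?
CountVectorizer lowercases the text before passing it to the tokenizer (lowercase=True is the default). tokenize2 therefore receives "normal sinus rhythm", so its case-sensitive substring check no longer matches "Normal sinus rhythm" or "Normal sinus". tokenize ignores its input, so it is unaffected. You can see this by adding a print call to tokenize2: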
def tokenize2(text):
    print(text)
    return [i for i in temp_tok if i in text]
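The effect can also be reproduced without sklearn. The sketch below is only a stand-in for the preprocessing step: it assumes that calling str.lower() on the input is a fair approximation of what the tokenizer receives under the default settings, and the variable names are made up for illustration.

```python
temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj", "Normal sinus"]

def tokenize2(text):
    # Case-sensitive substring check against every vocabulary entry.
    return [tok for tok in temp_tok if tok in text]

original = "Normal sinus rhythm"
# Assumption: approximates the lowercasing applied before tokenization.
lowered = original.lower()

tokens_original = tokenize2(original)
tokens_lowered = tokenize2(lowered)

print(tokens_original)  # ['or', 'Normal sinus rhythm', 'sinus', 'Normal sinus']
print(tokens_lowered)   # ['or', 'sinus']
```

Only "or" and "sinus" survive on the lowercased text, which lines up with the [1, 0, 1, 0, 0] row from vectorizer2.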
A good solution is to lowercase the elements of temp_tok. Alternatively, any approach that handles the case mismatch will work, for example passing lowercase=False to CountVectorizer.
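Here is a minimal sketch of two possible fixes, assuming scikit-learn is installed. The names vocab_lower, tokenize_lower, vec_lower and vec_cased are hypothetical and used only for illustration. Passing a custom tokenizer may trigger a harmless warning about token_pattern being unused.

```python
from sklearn.feature_extraction.text import CountVectorizer

temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj", "Normal sinus"]
text = "Normal sinus rhythm"

# Option 1: lowercase the vocabulary so it matches the lowercased input.
vocab_lower = [tok.lower() for tok in temp_tok]

def tokenize_lower(text):
    return [tok for tok in vocab_lower if tok in text]

vec_lower = CountVectorizer(vocabulary=vocab_lower, tokenizer=tokenize_lower)
result_lower = vec_lower.transform([text]).toarray()

# Option 2: keep the original casing and switch off lowercasing.
def tokenize2(text):
    return [tok for tok in temp_tok if tok in text]

vec_cased = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize2,
                            lowercase=False)
result_cased = vec_cased.transform([text]).toarray()

print(result_lower)  # [[1 1 1 0 1]]
print(result_cased)  # [[1 1 1 0 1]]
```

Option 1 makes matching case-insensitive for all inputs. Option 2 keeps matching case-sensitive, so an input such as "normal sinus rhythm" would still miss the capitalized entries.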