python scikit-learn nlp tfidfvectorizer

TfidfVectorizer producing concatenated word tokens


I am using the Cranfield dataset to build an indexer and query processor, and I am using TfidfVectorizer to tokenize the data. But when I check the vocabulary after fitting, there are a lot of tokens formed by concatenating two words.

I am using the following code to achieve it:

import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
# reading the data
with open('cran.all', 'r') as f:
    content = [line.replace('\n','') for line in f]
    content = "".join(content)
    doc = re.split(r'\.I\s[0-9]{1,4}', content)
# some data cleaning
doc = [line.replace('.T',' ').replace('.B',' ').replace('.A',' ').replace('.W',' ') for line in doc]
del doc[0]
doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]




vectorizer = TfidfVectorizer(analyzer ='word', ngram_range=(1,1), stop_words=text.ENGLISH_STOP_WORDS,lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)

Here are a few of the tokens that appear when I print the vocabulary:

'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752

With a small sample of the data this does not happen. How can I prevent these concatenated tokens?


Solution

  • My guess would be that the issue is caused by this line:

    content = [line.replace('\n','') for line in f]
    

    When you replace the line breaks, the last word of line 1 is concatenated with the first word of line 2, and since this happens on every line, you get a lot of these fused tokens. The fix is simple: instead of replacing each line break with nothing (i.e. just removing it), replace it with a space:

    content = [line.replace('\n',' ') for line in f]

    (note the space between the quotes)
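To illustrate, here is a minimal sketch of the effect (using two made-up lines rather than the actual Cranfield file):

```python
# Two lines as they would come from iterating over the file
lines = ["free vibration of a\n", "slender sharp cone\n"]

# Removing the newline outright fuses the last word of one line
# with the first word of the next
bad = "".join(line.replace("\n", "") for line in lines)
print(bad)   # free vibration of aslender sharp cone

# Replacing the newline with a space keeps the words separate
good = "".join(line.replace("\n", " ") for line in lines)
print(good)  # free vibration of a slender sharp cone
```

With the first version, the tokenizer sees `aslender` as a single word, which is exactly the kind of token showing up in your vocabulary; with the second, `a` and `slender` stay separate.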