I am using the Cranfield dataset to build an indexer and query processor, and I am using TfidfVectorizer to tokenize the data. However, when I check the vocabulary after fitting, many of the tokens are concatenations of two words.
This is the code I am using:
import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
#reading the data
with open('cran.all', 'r') as f:
    content_string=""
    content = [line.replace('\n','') for line in f]
    content = content_string.join(content)
doc=re.split('.I\s[0-9]{1,4}',content)
f.close()
#some data cleaning
doc = [line.replace('.T',' ').replace('.B',' ').replace('.A',' ').replace('.W',' ') for line in doc]
del doc[0]
doc= [ re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(analyzer ='word', ngram_range=(1,1), stop_words=text.ENGLISH_STOP_WORDS,lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)
Below are a few examples of what I get when I print the vocabulary:
'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752
This does not happen when I use a small dataset. How can I prevent it?
My guess would be that the issue is caused by this line:
content = [line.replace('\n','') for line in f]
When the line breaks are removed, the last word of each line is joined directly to the first word of the next line. Since this happens at every line break, you get a lot of these concatenated tokens. The fix is simple: instead of replacing the line break with an empty string (i.e. just removing it), replace it with a space:
content = [line.replace('\n',' ') for line in f]
---
(Note the space between the quotes: ' ')
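To see the effect in isolation, here is a minimal sketch that uses a made-up three-line list in place of the lines read from cran.all (the sample text is an assumption, chosen only to mimic tokens such as 'rateof' and 'freevibration' from the question). It compares removing the line breaks with replacing them by a space:

```python
# Hypothetical sample standing in for lines read from cran.all
lines = ["the rate\n", "of free\n", "vibration\n"]

# Approach from the question: remove the line break, then join with ""
joined_without_space = "".join(line.replace("\n", "") for line in lines)

# Suggested fix: replace the line break with a space, then join with ""
joined_with_space = "".join(line.replace("\n", " ") for line in lines)

print(joined_without_space.split())  # ['the', 'rateof', 'freevibration']
print(joined_with_space.split())     # ['the', 'rate', 'of', 'free', 'vibration']
```

The trailing space left at the end of the joined string is harmless here, since the later regex cleanup and the vectorizer's tokenizer both treat runs of whitespace as separators.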