I want to reduce the size of the sparse matrix produced by the tf-idf vectorizer, since I am using it with cosine similarity and it takes a long time to go through each vector. I have about 44,000 sentences, so the vocabulary size is also very large.
I was wondering if there is a way to combine a group of words so they count as one word: for example, teal, navy and turquoise would all mean blue and would share the same tf-idf value.
I am dealing with a dataset of clothing items, so things like colours, and similar clothing articles like shirts, t-shirts and sweatshirts, are what I want to group.
I know I can use stop words to remove certain words entirely, but is it possible to group words so they share the same value?
Here is my code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
dataset_2 = "/dataset_files/styles_2.csv"
df = pd.read_csv(dataset_2)
df = df.drop(['gender', 'masterCategory', 'subCategory', 'articleType', 'baseColour', 'season', 'year', 'usage'], axis = 1)
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['ProductDisplayName'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
Unfortunately we can't use the vocabulary optional argument to TfidfVectorizer to signal synonyms; I tried and got ValueError: Vocabulary contains repeated indices.
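For reference, a minimal reproduction of that error (the three-colour vocabulary dict below is made up for illustration; any mapping that reuses a column index fails the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Mapping several words to the same column index is rejected at fit time.
vec = TfidfVectorizer(vocabulary={'blue': 0, 'navy': 0, 'teal': 0})
try:
    vec.fit_transform(['the navy teal blue sea'])
except ValueError as e:
    print(e)  # Vocabulary contains repeated indices.
```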
Instead, you can run the TfidfVectorizer once, then manually merge the columns that correspond to synonyms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
## DATA
corpus = ['The grey cat eats the navy mouse.',
'The ashen cat drives the red car.',
'There is a mouse on the brown banquette of the crimson car.',
'The teal car drove over the poor cat and tarnished its beautiful silver fur with scarlet blood.',
'I bought a turquoise sapphire shaped like a cat and mounted on a rose gold ring.',
'Mice and cats alike are drowning in the deep blue sea.']
synonym_groups = [['grey', 'gray', 'ashen', 'silver'],
['red', 'crimson', 'rose', 'scarlet'],
['blue', 'navy', 'sapphire', 'teal', 'turquoise']]
## VECTORIZING FIRST TIME TO GET vectorizer0.vocabulary_
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
## MERGING SYNONYM COLUMNS
vocab = vectorizer.vocabulary_
synonym_representants = { group[0] for group in synonym_groups }
redundant_synonyms = { word: group[0] for group in synonym_groups for word in group[1:] }
syns_dict = {group[0]: group for group in synonym_groups}
# syns_dict = {next(word for word in group if word in vocab): group for group in synonym_groups} ## SHOULD BE MORE ROBUST
nonredundant_columns = sorted( v for k, v in vocab.items() if k not in redundant_synonyms )
# Note: writing into a CSR matrix triggers a SparseEfficiencyWarning, but it works for this one-off merge.
for rep in synonym_representants:
    X[:, vocab[rep]] = X[:, [vocab[syn] for syn in syns_dict[rep] if syn in vocab]].sum(axis=1)
Y = X[:, nonredundant_columns]
new_vocab = [w for w in sorted(vocab, key=vocab.get) if w not in redundant_synonyms]
## COSINE SIMILARITY
cos_sim = cosine_similarity(Y, Y)
## RESULTS
print(' ', ''.join('{:11.11}'.format(word) for word in new_vocab))
print(Y.toarray())
print()
print('Cosine similarity')
print(cos_sim)
Output:
alike banquette beautiful blood blue bought brown car cat cats deep drives drove drowning eats fur gold grey like mice mounted mouse poor red ring sea shaped tarnished
[[0. 0. 0. 0. 0.49848319 0. 0. 0. 0.29572971 0. 0. 0. 0. 0. 0.49848319 0. 0. 0.49848319 0. 0. 0. 0.40876335 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0. 0. 0.35369727 0.30309169 0. 0. 0.51089257 0. 0. 0. 0. 0. 0.51089257 0. 0. 0. 0. 0. 0.51089257 0. 0. 0. 0. ]
[0. 0.490779 0. 0. 0. 0. 0.490779 0.3397724 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.4024458 0. 0.490779 0. 0. 0. 0. ]
[0. 0. 0.31893014 0.31893014 0.31893014 0. 0. 0.2207993 0.18920822 0. 0. 0. 0.31893014 0. 0. 0.31893014 0. 0.31893014 0. 0. 0. 0. 0.31893014 0.31893014 0. 0. 0. 0.31893014]
[0. 0. 0. 0. 0.65400152 0.32700076 0. 0. 0.19399619 0. 0. 0. 0. 0. 0. 0. 0.32700076 0. 0.32700076 0. 0.32700076 0. 0. 0.32700076 0.32700076 0. 0.32700076 0. ]
[0.37796447 0. 0. 0. 0.37796447 0. 0. 0. 0. 0.37796447 0.37796447 0. 0. 0.37796447 0. 0. 0. 0. 0. 0.37796447 0. 0. 0. 0. 0. 0.37796447 0. 0. ]]
Cosine similarity
[[1. 0.34430458 0.16450509 0.37391712 0.3479721 0.18840894]
[0.34430458 1. 0.37091192 0.46132163 0.20500145 0. ]
[0.16450509 0.37091192 1. 0.23154573 0.14566346 0. ]
[0.37391712 0.46132163 0.23154573 1. 0.3172916 0.12054426]
[0.3479721 0.20500145 0.14566346 0.3172916 1. 0.2243601 ]
[0.18840894 0. 0. 0.12054426 0.2243601 1. ]]
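Alternatively, you can collapse the synonyms in the raw text before vectorizing, via the preprocessor argument of TfidfVectorizer, so the vocabulary never contains the redundant words in the first place. A minimal sketch (the synonym_map and the three-item clothing corpus below are made up for illustration):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical synonym map: each variant points to one canonical word.
synonym_map = {'navy': 'blue', 'teal': 'blue', 'turquoise': 'blue',
               't-shirt': 'shirt', 'sweatshirt': 'shirt'}

def merge_synonyms(text):
    # Replace every synonym with its canonical word before tokenization.
    return re.sub(r'\b[\w-]+\b',
                  lambda m: synonym_map.get(m.group(0), m.group(0)),
                  text.lower())

corpus = ['Navy blue t-shirt', 'Teal sweatshirt', 'Red shirt']
vec = TfidfVectorizer(stop_words='english', preprocessor=merge_synonyms)
X = vec.fit_transform(corpus)
print(sorted(vec.vocabulary_))  # ['blue', 'red', 'shirt']
```

Since supplying a preprocessor replaces the default one, the lowercasing is done inside merge_synonyms. This keeps the vocabulary (and the sparse matrix) small up front, instead of merging columns after the fact.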