python, python-multiprocessing, dask, gensim, wmd

Can I optimize this Word Mover's Distance look-up function?


I am trying to measure the Word Mover's Distance between a lot of texts using Gensim's Word2Vec tools in Python. I am comparing each text with every other text, so I first use itertools to create pairwise combinations like [1, 2, 3] -> [(1, 2), (1, 3), (2, 3)]. To save memory, I don't build the combinations as a big dataframe with all the texts repeated; instead I make a reference dataframe, combinations, that holds only the indices of the texts and looks like:

    0   1
0   0   1
1   0   2
2   0   3

Then, in the comparison function, I use these indices to look up the texts in the original dataframe. The solution works fine, but I am wondering whether it will scale to big datasets. For instance, I have a 300,000-row dataset of texts, which gives me about a hundred years' worth of computation on my laptop:

C(300000, 2) = 300000! / (2! * (300000 − 2)!)
             = (300000 * 299999) / (2 * 1)
             = 44,999,850,000 combinations
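
As a quick sanity check on that pair count, here is a minimal sketch using only the standard library (math.comb needs Python 3.8+):

    import itertools
    import math

    # Pairwise index combinations, as in the reference dataframe above: [(0, 1), (0, 2), (1, 2)]
    print(list(itertools.combinations(range(3), 2)))

    # Number of unordered pairs for 300,000 texts: 44999850000
    print(math.comb(300_000, 2))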

Is there any way this could be optimized better?

My code right now:

import multiprocessing
import itertools
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from gensim.models.word2vec import Word2Vec
from gensim.corpora.wikicorpus import WikiCorpus

def get_distance(row):
    try: 
        sent1 = df.loc[row[0], 'text'].split()
        sent2 = df.loc[row[1], 'text'].split()
        return model.wv.wmdistance(sent1, sent2)  # Compute WMD
    except Exception as e:
        return np.nan

df = pd.read_csv('data.csv')

# I then set up the gensim model, let me know if you need that bit of code too.

# Make pairwise combination of all indices
combinations = pd.DataFrame(itertools.combinations(df.index, 2))

# To dask df and apply function
dcombinations = dd.from_pandas(combinations, npartitions=2 * multiprocessing.cpu_count())
# meta tells dask the output column's name and dtype up front
dcombinations['distance'] = dcombinations.apply(get_distance, axis=1, meta=('distance', 'f8'))
with ProgressBar():
    combinations = dcombinations.compute()

Solution

  • You might use wmd-relax for a performance improvement. However, you'll first have to convert your model to spaCy and use the SimilarityHook as described on their webpage:

    import spacy
    import wmd
    
    nlp = spacy.load('en_core_web_md')
    nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
    doc1 = nlp("Politician speaks to the media in Illinois.")
    doc2 = nlp("The president greets the press in Chicago.")
    print(doc1.similarity(doc2))
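
To keep using the Word2Vec vectors already trained in gensim instead of spaCy's bundled en_core_web_md vectors, one option is to copy them into a blank spaCy vocab before adding the hook. A rough sketch, assuming gensim 3.x and spaCy 2.x (the model path is a placeholder):

    import spacy
    import wmd
    from gensim.models.word2vec import Word2Vec

    model = Word2Vec.load('word2vec.model')   # placeholder path to your trained model

    # Blank pipeline: tokenizer only, no bundled vectors
    nlp = spacy.blank('en')
    for word in model.wv.index2word:          # gensim 4.x: model.wv.index_to_key
        nlp.vocab.set_vector(word, model.wv[word])

    nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
    doc1 = nlp("Politician speaks to the media in Illinois.")
    doc2 = nlp("The president greets the press in Chicago.")
    print(doc1.similarity(doc2))

That way the comparisons run against the same embedding space as your wv.wmdistance calls, with wmd-relax's faster WMD implementation doing the work.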