python matrix nlp word2vec wmd

Iterate efficiently over a list of strings to get matrix of pairwise WMD distances


I am trying to generate a matrix of pairwise distances from a list of strings (newspaper articles).

Word Mover's Distance (WMD) is not implemented in scipy.spatial.distance.pdist, so I hooked this implementation, https://github.com/src-d/wmd-relax, into spaCy. However, I cannot figure out how to iterate over my list to generate the distance matrix.


Solution

  • As per the wmd-relax docs:

    
    import spacy
    import wmd
    import numpy as np
    
    
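    nlp = spacy.load('en_core_web_md')
    # spaCy 2.x style: add_pipe takes the component object itself
    # (spaCy 3 expects a registered component name instead)
    nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
    
    # articles is assumed to be a list of strings
    docs = [nlp(article) for article in articles]
    
    # build the matrix as a plain list of lists, one row per document
    m = []
    for doc1 in docs:
        row = []
        for doc2 in docs:
            # with the WMD hook in the pipeline, similarity() is expected
            # to return the WMD distance between the two documents
            row.append(doc1.similarity(doc2))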
        m.append(row)
    
    result = np.matrix(m)
    

    See the NumPy matrix documentation.
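Since the question asks for efficiency, here is a minimal sketch of one way to roughly halve the work. It assumes the distance is symmetric and zero between a document and itself, which should hold for exact WMD but is worth checking for an approximate implementation. The helper name `pairwise_distance_matrix` and the toy `length_gap` distance are made up for illustration and are not part of spaCy or wmd-relax. It returns a plain `np.array`, which NumPy now generally recommends over `np.matrix`.

```python
import numpy as np


def pairwise_distance_matrix(items, dist):
    """Return a symmetric n x n array of dist(items[i], items[j]).

    Assumes dist is symmetric and dist(x, x) == 0, so only the upper
    triangle is computed and then mirrored into the lower triangle.
    """
    n = len(items)
    result = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(items[i], items[j])
            result[i, j] = d
            result[j, i] = d
    return result


# Toy stand-in distance so the sketch runs without spaCy or wmd installed.
def length_gap(a, b):
    return abs(len(a) - len(b))


toy = pairwise_distance_matrix(["a", "bbb", "cc"], length_gap)

# With the hooked pipeline from the answer, the call would look like:
# result = pairwise_distance_matrix(docs, lambda a, b: a.similarity(b))
```

For n documents this makes n(n-1)/2 distance calls instead of n². The distance computation itself still dominates the running time, so for a large corpus this is a constant-factor saving only.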