Tags: python, vectorization, sparse-matrix, knn, nmslib

Why does NMSLIB scale badly when I insert a CSR Matrix in a cosinesimil HNSW Index?


I'm working with text embeddings stored in a sparse format as a csr_matrix (generated via a TfidfVectorizer). I'd like to insert them into an NMSLIB cosinesimil/HNSW index and run a nearest-neighbor search.

My problem is that inserting embeddings.toarray() doesn't scale once I have more than roughly 1M embeddings, because the dense array no longer fits comfortably in memory. I noticed in an example that inserting a csr_matrix directly, without calling toarray(), seems to be supported:

    import nmslib
    from scipy import sparse

    test_features = sparse.csr_matrix(test_features)
    train_features = sparse.csr_matrix(train_features)

    # Sparse space + sparse data type: the csr_matrix is passed as-is
    nsw = nmslib.init(method='sw-graph', space='cosinesimil_sparse', data_type=nmslib.DataType.SPARSE_VECTOR)
    nsw.addDataPointBatch(train_features)

However, when I try inserting my embeddings, I get this error:

    self.similar_items_index = nmslib.init(space='cosinesimil', method='hnsw')
    self.similar_items_index.addDataPointBatch(self.embeddings)

This produces:

Traceback (most recent call last):
  File "/home/pln/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/213.7172.26/plugins/python/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/pln/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/213.7172.26/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/pln/Work/project/foo/bar/baz.py", line 140, in <module>
    cbf_model.train()
  File "/home/pln/Work/project/foo/bar/baz.py", line 152, in timing_wrapper
    value = func(*args, **kwargs)
  File "/home/pln/Work/project/foo/bar/baz.py", line 130, in train
    self.insert_datapoints()
  File "/home/pln/Work/project/foo/bar/baz.py", line 152, in timing_wrapper
    value = func(*args, **kwargs)
  File "/home/pln/Work/project/foo/bar/baz.py", line 159, in insert_datapoints
    self.similar_items_index.addDataPointBatch(self.embeddings)
ValueError: setting an array element with a sequence.
python-builtins.ValueError

Is this expected, or should I be able to insert a csr_matrix as-is to such an index?


Solution

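  • The problem with your code is the space used: cosinesimil is the dense space and expects a dense array, which is most likely why passing a csr_matrix raises the ValueError. As the example you quoted shows, the proper way to insert a Compressed Sparse Row matrix is to create the index with the cosinesimil_sparse space and data_type=nmslib.DataType.SPARSE_VECTOR.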

    See NMSLIB's documentation for spaces, in particular the section on Input Format:

    For sparse spaces that include the Lp-spaces, the sparse cosine similarity, and the maximum-inner product space, the input data is a sparse scipy matrix. An example can be found here.