Tags: numpy, scikit-learn, scipy, pairwise-distance

Optimized way to find pairwise cosine distance matrix using pairwise_distances_chunked


I have a NumPy array with 42000 rows and 110000 dimensions. I am trying to create a 42000 × 42000 pairwise distance matrix on a machine with 32 GB of RAM and 8 cores.

I tried pairwise_distances_chunked, but it only gives me a 3120 × 42000 distance matrix. I also tried pairwise_distances, but it fails with an out-of-memory error.

Any suggestions on what can be done?


Solution

  • According to the documentation, pairwise_distances_chunked is a generator: it yields the distance matrix one chunk of rows at a time. Based on the way you phrased your question, it seems like you did this:

    D_chunk = next(pairwise_distances_chunked(X))
    

    That code (which is the first example from the documentation) only gives you the first chunk.

    What you want to do is this:

    for chunk in pairwise_distances_chunked(X):
        do_something(chunk)
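
One possible way to fill in do_something, sketched under some assumptions: the full 42000 × 42000 result holds 1,764,000,000 entries, which is roughly 14.1 GB as float64 or 7.1 GB as float32, so instead of collecting the chunks in a list it may be easier to write each one into a disk-backed array. The helper name cosine_distances_to_memmap, the output file name, and the choice of float32 are illustrative assumptions, not something from the question or the scikit-learn documentation. The sketch also relies on the chunks arriving in row order, which is how the generator slices X.

```python
import os
import tempfile

import numpy as np
from sklearn.metrics import pairwise_distances_chunked


def cosine_distances_to_memmap(X, path, working_memory=None, n_jobs=None):
    """Write the full cosine distance matrix of X to a .npy file, chunk by chunk.

    Hypothetical helper: the name, the float32 output dtype and the
    on-disk layout are assumptions made for this sketch.
    """
    n = X.shape[0]
    # Disk-backed (n, n) array, so the result does not have to fit in RAM.
    D = np.lib.format.open_memmap(path, mode="w+", dtype=np.float32, shape=(n, n))

    start = 0
    chunks = pairwise_distances_chunked(
        X, metric="cosine", n_jobs=n_jobs, working_memory=working_memory
    )
    for chunk in chunks:
        # Each chunk is a block of consecutive rows of the distance matrix.
        stop = start + chunk.shape[0]
        D[start:stop] = chunk
        start = stop

    D.flush()
    return D


# Small demo on random data; the real X would be 42000 x 110000.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 8))
demo_path = os.path.join(tempfile.mkdtemp(), "cosine_dist.npy")
# A tiny working_memory (in MiB) forces more than one chunk here.
D_demo = cosine_distances_to_memmap(X_demo, demo_path, working_memory=0.01)
```

The saved file can later be reopened without loading it fully, for example with np.load(demo_path, mmap_mode="r"). For the real data, working_memory (in MiB) controls how many rows each chunk holds, and n_jobs=8 would let the computation use all eight cores, at the cost of extra memory per worker.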