[SOLVED] Postgres: index on cosine similarity of float arrays for one-to-many search

Postgres: index on cosine similarity of float arrays for one-to-many search

Cosine similarity between two equally-sized vectors (of reals) is defined as the dot product divided by the product of the norms.

To represent vectors, I have a large table of float arrays, e.g. CREATE TABLE foo(vec float[])'. Given a certain float array, I need to quickly (with an index, not a seqscan) find the closest arrays in that table by cosine similarity, e.g. SELECT * FROM foo ORDER BY cos_sim(vec, ARRAY[1.0, 4.5, 2.2]) DESC LIMIT 10; But what do I use?

pg_trgm's cosine similarity support is different. It compares text, and I'm not sure what it does exactly. An extension called smlar (here) also has cosine similarity support for float arrays but again is doing something different. What I described is commonly used in data analysis to compare features of documents, so I was thinking there'd be support in Postgres for it.

Solution

If you're ok with an inexact solution, you can use random projection: https://en.wikipedia.org/wiki/Random_projection.

Randomly generate k different vectors of the same length as your other vectors and store them somewhere. You will use these to spatially bin your data. For each vector in your table, do a dot product with each of the random vectors and store the product's sign.

Vectors with the same sign for each random vector go in the same bin, and generally vectors that have high cosine similarity will end up in the same bin. You can pack the signs as bits into an integer and use a normal index to pull vectors in the same bin as your query, then do a sequential search to find those with the highest cosine similarity.