I have many users, each with an associated vector, and I would like to compute the cosine similarity between every pair of users. This is prohibitive at my scale. LSH seems like a good approximation step: as I understand it, it creates buckets such that users mapped to the same bucket are similar with high probability. In PySpark, the following example:
from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

ss = SparkSession.builder.getOrCreate()

dataA = [(0, Vectors.dense([1.0, 1.0]),),
         (1, Vectors.dense([1.0, -1.0]),),
         (4, Vectors.dense([1.0, -1.0]),),
         (5, Vectors.dense([1.1, -1.0]),),
         (2, Vectors.dense([-1.0, -1.0]),),
         (3, Vectors.dense([-1.0, 1.0]),)]
dfA = ss.createDataFrame(dataA, ["id", "features"])

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=1.0, numHashTables=3)
model = brp.fit(dfA)
model.transform(dfA).show(truncate=False)
+---+-----------+-----------------------+
|id |features |hashes |
+---+-----------+-----------------------+
|0 |[1.0,1.0] |[[-1.0], [0.0], [-1.0]]|
|1 |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|4 |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|5 |[1.1,-1.0] |[[-2.0], [-2.0], [1.0]]|
|2 |[-1.0,-1.0]|[[0.0], [-1.0], [0.0]] |
|3 |[-1.0,1.0] |[[1.0], [1.0], [-2.0]] |
+---+-----------+-----------------------+
Any pointers on how to best set bucketLength and numHashTables are appreciated.
Assuming I have the above with 3 hash tables, how do I determine which buckets to compute the cosine similarity within, given that there is more than one table? I assumed the point of using LSH for this task is to group by the value in the "hashes" column and only perform pairwise similarity within each group. Is this correct?
Yes, LSH reduces dimensionality while preserving similarity. It hashes your data into buckets, and only items that land in the same bucket are then compared (i.e. have their distance calculated).
The magic is in tuning the number of buckets and the number of hash functions to reduce false positives and false negatives. There isn't a set number; it depends on your data.
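To see concretely which items end up being compared (this is what your "group by the hashes column" idea amounts to), here is a minimal sketch, assuming Spark 3.x (for vector_to_array) and reusing the dfA/model from your question. It explodes the hashes column so each row is (hash table index, bucket value, id) and groups on that pair; ids that share any (table, bucket) pair are the candidate pairs. In practice approxSimilarityJoin does this candidate generation for you.
from pyspark.sql.functions import col, posexplode, collect_list
from pyspark.ml.functions import vector_to_array  # Spark 3.0+

# One row per (id, hash table); each hash is a one-element vector,
# so element 0 is the bucket value for that table.
buckets = (model.transform(dfA)
           .select("id", posexplode("hashes").alias("table", "hash"))
           .withColumn("bucket", vector_to_array(col("hash"))[0])
           .groupBy("table", "bucket")
           .agg(collect_list("id").alias("ids")))

buckets.show(truncate=False)
With numHashTables=3, a pair only needs to collide in at least one of the three tables to become a candidate, which is how the approximate join below combines the tables.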
r is your bucket size, and b is the number of hash functions to use (or the number of buckets you will be using to detect matches).
The following is from an article that helped me understand what was happening:
Let’s say your signature matrix has 100 rows. Consider 2 cases:
b1 = 10 → r = 10
b2 = 20 → r = 5
In the 2nd case, there is a higher chance for 2 [vectors] to appear in the same bucket at least once, as they have more opportunities (20 vs 10) and fewer elements of the signature are being compared (5 vs 10).
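To make the quoted trade-off concrete: with b bands of r rows each, two items whose signatures agree in any position with probability s become a candidate pair in at least one band with probability 1 - (1 - s^r)^b (the standard banding formula from MinHash-style LSH). A quick sketch comparing the two cases:
# Probability that two items with similarity s collide in at least one of
# b bands when each band compares r signature elements.
def candidate_prob(s, b, r):
    return 1 - (1 - s ** r) ** b

for s in (0.5, 0.7, 0.9):
    print(s, candidate_prob(s, b=10, r=10), candidate_prob(s, b=20, r=5))
The b2 = 20, r = 5 setting catches moderately similar pairs far more often, at the cost of more false positives to filter out later.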
If you need to join you can use approxSimilarityJoin and set the acceptable distance. (This is another parameter that you need to tune. The distance here is the distance between vectors that have fallen into at least one of the same hash buckets, making them likely to be close to each other.)
distance = 300
model.approxSimilarityJoin(df, df2, distance, distCol="EuclideanDistance").select(
    col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("EuclideanDistance")).show()
You can get an idea of what is reasonable for the distance between vectors by reviewing the data (from the join) or by playing around with approxNearestNeighbors. If you want the 10 nearest neighbors, here's how you can find their distance:
NumberOfNeighbors = 10
CandidateVector = Vectors.dense([1.0, 2.0])
model.approxNearestNeighbors(df2, CandidateVector, NumberOfNeighbors).collect()
[Row(id=4, features=DenseVector([2.0, 2.0]), hashes=[DenseVector([1.0])], distCol=1.0)]
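To turn that into a starting threshold for approxSimilarityJoin (a sketch, reusing df2, CandidateVector and NumberOfNeighbors from above), you could take the largest distCol among the returned neighbors:
neighbors = model.approxNearestNeighbors(df2, CandidateVector, NumberOfNeighbors)
# Largest distance among the k nearest neighbors; a rough starting point
# for the approxSimilarityJoin threshold.
suggested_distance = neighbors.agg({"distCol": "max"}).collect()[0][0]
print(suggested_distance)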