I have a list minhash_sig = ['112', '223'], and I would like to find the Jaccard similarity between this list and every element in a PySpark dataframe's column. Unfortunately, I'm not able to do so.
I've tried using array_intersect and array_union to do the comparison, but I get the error Resolved attribute missing.
Here is the pyspark dataframe that I have created so far.
from pyspark.sql import Row

df = spark.createDataFrame(
    [
        (1, ['112', '333']),
        (2, ['112', '223'])
    ],
    ["id", "minhash"]  # column names
)
minhash_sig = ['112', '223']
df2 = spark.createDataFrame([Row(c1=minhash_sig)])
And here is the code I've used to try to compare the list to the PySpark column elements:
from pyspark.sql.functions import size, array_intersect

df.withColumn('minhash_sim', size(array_intersect(df2.c1, df.minhash)))
Does anyone know how I can do this comparison without this error?
The column from df2 is not known to df unless you join them into one object. Try cross-joining the two dataframes first, then applying your logic:
df.crossJoin(df2).withColumn('minhash_sim', size(array_intersect("c1", "minhash")))\
  .show()
+---+----------+----------+-----------+
| id| minhash| c1|minhash_sim|
+---+----------+----------+-----------+
| 1|[112, 333]|[112, 223]| 1|
| 2|[112, 223]|[112, 223]| 2|
+---+----------+----------+-----------+
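Note that minhash_sim above is only the intersection size, not the Jaccard similarity the question asks for. Jaccard is |A ∩ B| / |A ∪ B|, so in PySpark you could divide size(array_intersect(...)) by size(array_union(...)) on the cross-joined frame. As a sanity check, here is the same computation sketched in plain Python against the question's example data (no Spark session needed):

```python
# Plain-Python sketch of the per-row Jaccard similarity that
# size(array_intersect(...)) / size(array_union(...)) would compute.
def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

minhash_sig = ['112', '223']
rows = [(1, ['112', '333']), (2, ['112', '223'])]

# Compare the signature list against every row's minhash column.
sims = {row_id: jaccard(minhash, minhash_sig) for row_id, minhash in rows}
print(sims)  # row 1: 1/3 (one shared element of three total), row 2: 1.0
```

This matches the table above: row 1 shares one element ('112') out of three distinct values, and row 2 is identical to the signature.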