python, apache-spark, pyspark, databricks, minhash

Compare list to every element in a pyspark column


I have a list minhash_sig = ['112', '223'], and I would like to find the Jaccard similarity between this list and every element in a PySpark DataFrame's column. Unfortunately, I'm not able to do so.

I've tried using array_intersect as well as array_union to do the comparison, but this fails with the error "Resolved attribute(s) missing".

Here is the PySpark DataFrame that I have created so far.

from pyspark.sql import Row

df = spark.createDataFrame(
    [
        (1, ['112', '333']),
        (2, ['112', '223'])
    ],
    ["id", "minhash"]  # column names
)
minhash_sig = ['112', '223']
df2 = spark.createDataFrame([Row(c1=minhash_sig)])

And here is the code that I've used to try to compare the list to the PySpark column's elements.

from pyspark.sql.functions import size, array_intersect

df.withColumn('minhash_sim', size(array_intersect(df2.c1, df.minhash)))

Does anyone know how I can do this comparison without this error?


Solution

  • The column from df2 is not known to df unless you join the two DataFrames into one object. You can first cross-join them and then apply your expression:

    df.crossJoin(df2).withColumn('minhash_sim', size(array_intersect("c1", "minhash")))\
      .show()
    

    +---+----------+----------+-----------+
    | id|   minhash|        c1|minhash_sim|
    +---+----------+----------+-----------+
    |  1|[112, 333]|[112, 223]|          1|
    |  2|[112, 223]|[112, 223]|          2|
    +---+----------+----------+-----------+
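
    Since the question asks for the Jaccard similarity rather than just the intersection size, the same cross-join pattern can be extended by dividing by the size of the union. A minimal self-contained sketch (the result and jaccard names are my own choices, not from the question):

    ```python
    from pyspark.sql import SparkSession, Row
    from pyspark.sql.functions import array_intersect, array_union, size

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, ['112', '333']), (2, ['112', '223'])],
        ["id", "minhash"],
    )
    df2 = spark.createDataFrame([Row(c1=['112', '223'])])

    # Jaccard similarity = |A ∩ B| / |A ∪ B|
    result = df.crossJoin(df2).withColumn(
        "jaccard",
        size(array_intersect("c1", "minhash"))
        / size(array_union("c1", "minhash")),
    )
    # id=1: intersection {112} / union {112, 333, 223} -> 1/3
    # id=2: intersection {112, 223} / union {112, 223} -> 1.0
    result.show()
    ```

    If you'd rather avoid the cross join entirely, you can also turn the Python list into a literal array column with array(*[lit(x) for x in minhash_sig]) and compare against it directly with the same array functions.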