dataframe, pyspark, lsh

Is it possible to store a custom class object in a Spark DataFrame as a column value?


I am working on a duplicate-document detection problem using the LSH algorithm. To handle large-scale data, we are using Spark.

I have around 300K documents with at least 100-200 words each. These are the steps we perform on the data frame on the Spark cluster:

  1. Run a Spark ML pipeline to convert the text into tokens.

from pyspark.ml import Pipeline

# The individual stages (docAssembler, tokenizer, etc.) are defined earlier (not shown)
pipeline = Pipeline().setStages([
    docAssembler,
    tokenizer,
    normalizer,
    stemmer,
    finisher,
    stopwordsRemover,
    # emptyRowsRemover
])
model = pipeline.fit(spark_df)
final_df = model.transform(spark_df)

  2. For each document, compute a MinHash value using the datasketch (https://github.com/ekzhu/datasketch/) library and store it as a new column.
final_df_limit.rdd.map(lambda x: (CalculateMinHash(x),)).toDF()

The second step fails because Spark does not allow storing a custom-typed value as a column; the value is an object of the MinHash class.
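
For context, the CalculateMinHash helper is not shown in the question. A plausible sketch of what it might do, assuming the tokens produced by the pipeline live in a column named tokens (both the column name and num_perm are assumptions), would be:

from datasketch import MinHash

def CalculateMinHash(row, num_perm=128):
    # Hypothetical reconstruction: build a datasketch MinHash from the row's tokens
    m = MinHash(num_perm=num_perm)
    for token in row["tokens"]:
        m.update(token.encode("utf8"))
    # Returning the MinHash object itself is what Spark cannot store
    return m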

Does anyone know how I can store MinHash objects in DataFrames?


Solution

  • I don't think it is possible to save Python objects in DataFrames directly, but you can circumvent this in a couple of ways (one is sketched below):
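
One such workaround is to serialize the MinHash to a byte array with pickle and store it in a BinaryType column, since Spark handles raw bytes natively. A minimal sketch, assuming the tokens are in a column named tokens (the column name and num_perm are assumptions):

import pickle
from datasketch import MinHash
from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

def minhash_bytes(tokens, num_perm=128):
    # Build the MinHash and pickle it so Spark can store it as binary
    m = MinHash(num_perm=num_perm)
    for token in tokens:
        m.update(token.encode("utf8"))
    return pickle.dumps(m)

minhash_udf = udf(minhash_bytes, BinaryType())
final_df = final_df.withColumn("minhash", minhash_udf("tokens"))

To use the object later, call pickle.loads on the stored bytes. Alternatively, store only m.hashvalues (a numpy array of integers, converted with .tolist()) in an array column and rebuild the object with MinHash(num_perm=..., hashvalues=...).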