apache-sparkvectorpysparkapache-spark-sqlapache-spark-ml

Convert string into vector in Spark


I have two PySpark dataframes of the following structure. I would like to perform cross join and calculate cosine similarity. The qry_emb is a string column with comma separated values.

How to convert this string into dense vector?

Pyspark dataframe

df.printSchema()
# root
# |-- query: string (nullable = true)
# |-- qry_emb: string (nullable = true)

Solution

  • To convert string to vector, first convert your string to array (split), then use array_to_vector

    from pyspark.sql import functions as F
    from pyspark.ml.functions import array_to_vector
    df = df.withColumn('qry_emb', array_to_vector(F.split('qry_emb', ',[ ]*').cast('array<double>')))