pyspark, apache-spark-mllib

TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector


I have a dataframe with multiple rows that look like this; df.head() gives:

Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))

Now I want to compute the columnSimilarities() on my dataframe, and I do the following:

rdd2 = df.rdd
mat = RowMatrix(rdd2)
sims = mat.columnSimilarities()

However, I get the following error:

 File "/opt/apache-spark/spark-3.2.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py", line 67, in _convert_to_vector
    raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

Can someone help me with this? Thanks!


Solution

  • In its present form, each element of the RDD is a Row:

    Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))
    

    As per the example in the official documentation, it will work if each element has the form:

    [DenseVector([1.02, 4.23, 4.534, 0.342])]
    

    Construct the RowMatrix as:

    RowMatrix(df.rdd.map(list))
    

    Here is a full example which reproduces and fixes your problem:

    df = spark.createDataFrame(data=[([1.02, 4.23, 4.534, 0.342],)], schema=["features"])

    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.mllib.linalg.distributed import RowMatrix

    @udf(returnType=VectorUDT())
    def arrayToVector(arrCol):
        # Convert the array column into an ml DenseVector
        return Vectors.dense(arrCol)

    df = df.withColumn("features", arrayToVector("features"))
    # print(df.head())   # Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))
    # df.printSchema()

    # mat = RowMatrix(df.rdd)  # TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
    mat = RowMatrix(df.rdd.map(list))
    sims = mat.columnSimilarities()
    print(sims.entries.collect())
    
    [Out]:
    [MatrixEntry(2, 3, 1.0), MatrixEntry(0, 1, 1.0), MatrixEntry(1, 2, 1.0), MatrixEntry(0, 3, 1.0), MatrixEntry(1, 3, 1.0), MatrixEntry(0, 2, 1.0)]