I have a DataFrame with multiple rows that look like this (df.head() gives):
Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))
Now I want to compute the columnSimilarities() on my dataframe, and I do the following:
rdd2 = df.rdd
mat = RowMatrix(rdd2)
sims = mat.columnSimilarities()
However, I get the following error:
File "/opt/apache-spark/spark-3.2.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py", line 67, in _convert_to_vector
raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
Can someone help me with this? Thanks!
The RDD currently contains rows of the form:
Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))
As per the example in the official documentation, it works if each element is of the form:
[DenseVector([1.02, 4.23, 4.534, 0.342])]
Construct the RowMatrix as:
RowMatrix(df.rdd.map(list))
Here is a full example which reproduces and fixes your problem:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.linalg.distributed import RowMatrix

df = spark.createDataFrame(data=[([1.02, 4.23, 4.534, 0.342],)], schema=["features"])

@udf(returnType=VectorUDT())
def arrayToVector(arrCol):
    # Convert the array column into an ml DenseVector
    return Vectors.dense(arrCol)

df = df.withColumn("features", arrayToVector("features"))
# print(df.head())
# df.printSchema()
# mat = RowMatrix(df.rdd)  # TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
mat = RowMatrix(df.rdd.map(list))
sims = mat.columnSimilarities()
print(sims.entries.collect())
[Out]:
[MatrixEntry(2, 3, 1.0), MatrixEntry(0, 1, 1.0), MatrixEntry(1, 2, 1.0), MatrixEntry(0, 3, 1.0), MatrixEntry(1, 3, 1.0), MatrixEntry(0, 2, 1.0)]