I have a DataFrame with many columns. I also have a function
def getFeatureVector(features:Array[String]) : Vector
that is fairly complex, but takes some strings and returns a Spark MLlib vector.
Now, I want to look at some columns in the DF (I don't know which beforehand), pass them to getFeatureVector, and add a new column containing the resulting vectors.
I have access to an array of the names of the columns I want to use, and I wrote a function that casts each of them to string and builds an array column:
val colNamesToEncode = Array("col1", "col2", "col3", "col4")
def getColsToEncode: Column = {
  val cols = colNamesToEncode.map(x => col(x).cast("string"))
  array(cols: _*)
}
Finally, I try to make a udf and apply it to the DF:
val encoderUDF = udf(getFeatureVector _)
val cols = getColsToEncode()
data.withColumn(featuresColName,encoderUDF(cols))
but when I run that, I get java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
How can I apply the function to the DF?
PS: I was using this answer (Spark UDF with varargs) as a guide while writing my code.
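Just remove the () from the line below; that resolved the error. getColsToEncode is declared without a parameter list, so the trailing () is not an empty argument list for it but gets applied to the Column it returns.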
From:
val cols = getColsToEncode()
To:
val cols = getColsToEncode
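To show where the BoxedUnit in the message most likely comes from, here is a minimal sketch that runs without Spark. FakeColumn and getCols are hypothetical stand-ins for Column and getColsToEncode, not Spark API. The only thing taken from Spark is that Column declares apply(extraction: Any): Column. As far as I can tell, Scala 2 turns getColsToEncode() into a call to that apply with the unit value () as its argument (with a deprecation warning), and Spark then fails when it tries to make a literal out of it. The sketch writes that call out explicitly so it compiles the same way on any Scala version.

```scala
// Hypothetical stand-in for org.apache.spark.sql.Column, reduced to the one
// method that matters here. The real Column also declares
// apply(extraction: Any): Column, used to pull out a field or array element.
final case class FakeColumn(expr: String) {
  def apply(extraction: Any): FakeColumn = FakeColumn(s"$expr[$extraction]")
}

object ParensDemo {
  // Declared without a parameter list, like getColsToEncode in the question.
  def getCols: FakeColumn = FakeColumn("array(col1, col2)")

  // The fixed line: no parentheses, so we simply get the column back.
  val withoutParens: FakeColumn = getCols

  // What `getCols()` amounts to in Scala 2: the parentheses go to the
  // returned value's apply method, and the missing argument is filled
  // in with the unit value ().
  val withParens: FakeColumn = getCols.apply(())

  def main(args: Array[String]): Unit = {
    println(withoutParens.expr) // the plain array column
    println(withParens.expr)    // the array column with () "extracted" from it
  }
}
```

In the stand-in the unit value just ends up inside a string. In Spark the same argument has to become a literal expression, which is where the "Unsupported literal type" exception is thrown. One more thing to check once this error is gone: Spark usually passes an array column to a Scala UDF as a Seq[String] rather than an Array[String], so getFeatureVector may need to accept a Seq[String] or be wrapped in a function that does.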