There is a dataframe with null values in one column(not all being null), it need to fill the null value with uuid, is there a way?
cala> val df = Seq(("stuff2",null,null), ("stuff2",null,Array("value1","value2")),("stuff3","stuff3",null)).toDF("field","field2","values")
df: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> df.show
+------+------+----------------+
| field|field2| values|
+------+------+----------------+
|stuff2| null| null|
|stuff2| null|[value1, value2]|
|stuff3|stuff3| null|
+------+------+----------------+
I tried this way, but each row of the "field2" has the same uuid.
scala> val fillDF = df.na.fill(java.util.UUID.randomUUID().toString(), Seq("field2"))
fillDF: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> fillDF.show
+------+--------------------+----------------+
| field| field2| values|
+------+--------------------+----------------+
|stuff2|d007ffae-9134-4ac...| null|
|stuff2|d007ffae-9134-4ac...|[value1, value2]|
|stuff3| stuff3| null|
+------+--------------------+----------------+
How to make it? in case there is more than 1,000,000 rows
You can do it using UDF
and coalesce
like below.
import org.apache.spark.sql.functions.udf
val arr = udf(() => java.util.UUID.randomUUID().toString())
val df2 = df.withColumn("field2", coalesce(df("field2"), arr()))
df2.show()
You will get different UUID like below.
+------+--------------------+----------------+
| field| field2| values|
+------+--------------------+----------------+
|stuff2|fda6bc42-1265-407...| null|
|stuff2|3fa74767-abd7-405...|[value1, value2]|
|stuff3| stuff3| null|
+------+--------------------+----------------+