I have a dataframe in Spark which contains a column of coordinates stored as strings with a German decimal comma:
df.select("y_wgs84").show
+----------------+
| y_wgs84|
+----------------+
|47,9882373902965|
|47,9848921211406|
|47,9781530280939|
|47,9731284286555|
|47,9889813907224|
|47,9881440349524|
|47,9744969812356|
|47,9779388492231|
|48,0107946653620|
|48,0161245749621|
|48,0176065577678|
|48,0029496680229|
|48,0061848607139|
|47,9947482295108|
|48,0055828684523|
|48,0148743653486|
|48,0163361315735|
|48,0071490870937|
|48,0178054077099|
|47,8670099558802|
+----------------+
As these values were read by spark.read.csv(), the column's schema is of type String. Now I want to convert it to a Double as follows:
import java.text.NumberFormat, java.util.Locale
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val format = NumberFormat.getInstance(Locale.GERMANY)
def toDouble: UserDefinedFunction = udf[Double, String](format.parse(_).doubleValue)
df.withColumn("y_wgs84", toDouble('y_wgs84)).collect
but it fails with java.lang.NumberFormatException: For input string: ".E0". Strangely though, when grepping the file, there is not a single record containing an E.
Additionally, parsing the same column on the driver works just fine:

df.select("y_wgs84").as[String].collect.map(format.parse(_).doubleValue)
What is wrong here when calling the function as a UDF in Spark?
Actually, thread safety is the problem: java.text.NumberFormat is documented as not thread-safe, and the single format instance captured by the UDF ends up being used concurrently by multiple task threads in the same executor JVM, corrupting its internal state mid-parse. Collecting to the driver works because the values are then parsed sequentially on a single thread. So changing the parsing function to a stateless one

def toDouble: UserDefinedFunction = udf[Double, String](_.replace(',', '.').toDouble)

works just fine.
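If you want to keep locale-aware parsing rather than a plain character replace, one option is to give every thread its own NumberFormat via a ThreadLocal. A minimal sketch, assuming the same df and column as above (germanFormat and toDoubleSafe are names chosen here, not from the original post):

import java.text.NumberFormat
import java.util.Locale
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// One NumberFormat per thread, since the class itself is not safe for concurrent use.
val germanFormat = new ThreadLocal[NumberFormat] {
  override def initialValue: NumberFormat = NumberFormat.getInstance(Locale.GERMANY)
}

def toDoubleSafe: UserDefinedFunction =
  udf[Double, String](s => germanFormat.get.parse(s).doubleValue)

df.withColumn("y_wgs84", toDoubleSafe('y_wgs84)).collect

Alternatively, the UDF can be dropped entirely in favour of Spark's built-in functions, which sidesteps the thread-safety question and lets Catalyst optimize the expression:

import org.apache.spark.sql.functions.regexp_replace

df.withColumn("y_wgs84", regexp_replace('y_wgs84, ",", ".").cast("double"))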