I tried to merge two files in a Datalake using scala in data bricks and saved it back to the Datalake using the following code:
val df =sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("adl://xxxxxxxx/Test/CSV")
df.coalesce(1).write.
format("com.databricks.spark.csv").
mode("overwrite").
option("header", "true").
save("adl://xxxxxxxx/Test/CSV/final_data.csv")
However the file final_data.csv is saved as a directory instead of a file with multiple files and the actual .csv file is saved as 'part-00000-tid-dddddddddd-xxxxxxxxxx.csv'.
How do I rename this file so that I can move it to another directory?
Got it. It can be renamed and placed into another destination using the following code. Also current files that were merged will be deleted.
val x = "Source"
val y = "Destination"
val df = sqlContext.read.format("csv")
.option("header", "true").option("inferSchema", "true")
.load(x+"/")
df.repartition(1).write.
format("csv").
mode("overwrite").
option("header", "true").
save(y+"/"+"final_data.csv")
dbutils.fs.ls(x).filter(file=>file.name.endsWith("csv")).foreach(f => dbutils.fs.rm(f.path,true))
dbutils.fs.mv(dbutils.fs.ls(y+"/"+"final_data.csv").filter(file=>file.name.startsWith("part-00000"))(0).path,y+"/"+"data.csv")
dbutils.fs.rm(y+"/"+"final_data.csv",true)