Tags: scala, azure-blob-storage, azure-databricks, apache-spark-xml

Read files and modify the file name in Azure Storage containers from Azure Databricks


I am ingesting a large XML file and generating an individual JSON file for each XML element, using spark-xml in Azure Databricks. The code that creates the JSON file is:

commercialInfo
.write
.mode(SaveMode.Overwrite)
.json("/mnt/processed/" + "commercialInfo")

I am able to extract the XML element node and write it to the Azure storage container. A folder is created in the container, and the file inside it is named with a GUID rather than a meaningful file name.


Can anyone suggest whether we have control over the file name created in the container, i.e. changing part-0000 to something meaningful, so that the file can be picked up by an Azure Blob trigger?


Solution

  • Unfortunately, it is not possible to control the file name with the standard Spark library, but you can use the Hadoop API to manage the file system: save the output to a temporary directory and then move the file to the requested path.

    Spark writes through the Hadoop file output format, which produces one file per data partition; that is why you get part-0000 files.

    To change the file name, try adding something like this to your code. In Scala it looks like:

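    import org.apache.hadoop.fs._

    // Assumes the output was first written to the temporary directory "mydata.csv-temp"
    val fs = FileSystem.get(sc.hadoopConfiguration)
    // Locate the part file that Spark produced inside the temporary directory
    val partFile = fs.globStatus(new Path("mydata.csv-temp/part*"))(0).getPath()

    // Move the part file to its final name, then remove the temporary directory
    fs.rename(partFile, new Path("mydata.csv"))
    fs.delete(new Path("mydata.csv-temp"), true)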
    

    OR

    import org.apache.hadoop.fs._
    val fs = FileSystem.get(sc.hadoopConfiguration)
    // Replace part-0000 with the actual part file name found in the output directory
    fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
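
Applied to the JSON output from the question, the same approach could be sketched as below. This is only a sketch under assumptions: `promoteSinglePartFile` is a hypothetical helper name (it is not part of Spark or Hadoop), the "-temp" directory and the target name `commercialInfo.json` are made up for illustration, and the output is assumed to be small enough to write as a single partition with `coalesce(1)`.

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper: moves the only part file found in tempDir to target
// and then deletes tempDir. Returns false and changes nothing if tempDir
// does not contain exactly one part file.
def promoteSinglePartFile(fs: FileSystem, tempDir: Path, target: Path): Boolean = {
  // globStatus can return null when nothing matches, so wrap it defensively
  val parts = Option(fs.globStatus(new Path(tempDir, "part-*")))
    .getOrElse(Array.empty[FileStatus])
  parts match {
    case Array(single) =>
      // Remove a stale target from a previous run so the rename is not rejected
      if (fs.exists(target)) fs.delete(target, false)
      val moved = fs.rename(single.getPath, target)
      // Drop the temporary directory, including marker files such as _SUCCESS
      if (moved) fs.delete(tempDir, true)
      moved
    case _ => false
  }
}

// Possible usage in a Databricks notebook, where `spark` and the DataFrame
// `commercialInfo` already exist (paths are illustrative assumptions):
//   val tempDir = new Path("/mnt/processed/commercialInfo-temp")
//   commercialInfo.coalesce(1).write.mode(SaveMode.Overwrite).json(tempDir.toString)
//   val fs = tempDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
//   promoteSinglePartFile(fs, tempDir, new Path("/mnt/processed/commercialInfo.json"))
```

Refusing to act when there are several part files is a deliberate choice here: moving only the first one and deleting the directory would silently lose the rest of the data.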