azure, azure-databricks, azure-data-factory, databricks-community-edition

Saving a PySpark dataframe from Azure Databricks to Azure Data Lake in a specific, dynamically created folder


I am doing an ETL process in Azure.

1. The source data is in Azure Data Lake.
2. It is processed in Azure Databricks.
3. The output dataframe is loaded into Azure Data Lake, into a specific folder
   based on the current year / month / date, with the file name in CSV format.

I am stuck on the 3rd step:

1. Tried loading the dataframe to the mnt location:

   outPath = '/dbfs/mnt/abcd/<outputfoldername>/' + cy_year + "/" + cy_month + "/" + cy_date + "/"

   df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv(outPath)

This saves the data to DBFS but not to ADLS, even though many links on the internet suggest this approach.
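   For reference, a minimal sketch of the same write addressed through the mount path rather than the local /dbfs FUSE path (Spark APIs resolve /mnt/... against DBFS and its mounts, while /dbfs/... is only meaningful to local, non-Spark file APIs; cy_year, cy_month and cy_date are the variables from above):

   # write through the DBFS mount so the files land in ADLS, not in the DBFS root
   outPath = "/mnt/abcd/<outputfoldername>/" + cy_year + "/" + cy_month + "/" + cy_date + "/"

   df.coalesce(1).write.mode("overwrite").option("header", "true").csv(outPath)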

2. Tried working like this:

   spark.conf.set("fs.azure.account.key.<storage account name>.dfs.core.windows.net", "<ACCESS KEY>")

   output_container_path = "abfss://<container-name>@salesdetails.dfs.core.windows.net/<dir path>"

   df.coalesce(1).write.format("csv").mode("overwrite").option("header", "true").save(output_container_path)

This saves the data into ADLS, but as 4 files: 3 of them are supporting/marker files and only one holds the data. I want a single final file with a chosen name, for example abc.csv.

3. Tried with a pandas dataframe, which gives the flexibility to choose the
   file name, but that approach needs a specific, pre-existing folder name,
   which is not my case since the folders are created dynamically.
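   A minimal sketch of the pandas route combined with dbutils.fs.mkdirs to create the dynamic folder first (assuming the /mnt/abcd mount from attempt 1, and noting that toPandas() collects everything to the driver, so it only suits small outputs):

   from datetime import datetime

   now = datetime.now()
   target_dir = "/mnt/abcd/<outputfoldername>/{}/{}/{}".format(now.year, now.month, now.day)

   # create the dynamic folder on the mount, then write a single named CSV via the local FUSE path
   dbutils.fs.mkdirs(target_dir)
   df.toPandas().to_csv("/dbfs" + target_dir + "/abc.csv", index=False)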

Please assist at the earliest. Many thanks in advance.


Solution

  • Follow these steps:

    1. Mount your storage account with Azure Data Lake Gen2 as per MS_Doc.

    configs = {"fs.azure.account.auth.type": "OAuth",
              "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
              "fs.azure.account.oauth2.client.id": "f4dab6c8-5009-4857xxxxxxxxxxxxx",
              "fs.azure.account.oauth2.client.secret":"3GF8Q~3ZGkgflxxxxxxxxx",
              "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/72f988bfxxxxxxx/oauth2/token"}
    
    dbutils.fs.mount(
      source = "abfss://demo123@vamblob.dfs.core.windows.net/",
      mount_point = "/mnt/abcd11",
      extra_configs = configs)
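    A quick check that the mount is readable (a sketch; the mount point name is the one used above):

    # list the mount root to verify the mount succeeded
    display(dbutils.fs.ls("/mnt/abcd11"))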
    


    2. Configure your storage account and read the source data from Azure Data Lake.

    df11 = spark.read.format("csv").load("abfss://<container>@<Storage_acccount>.dfs.core.windows.net/")
    
    display(df11)
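    Since the container was mounted in step 1, the same data can equivalently be read through the mount point (a sketch, assuming the /mnt/abcd11 mount above and source CSV files that have a header row):

    df11 = spark.read.format("csv").option("header", "true").load("/mnt/abcd11/")

    display(df11)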
    


    Sample_Code:

    from datetime import datetime

    now = datetime.now()
    year = now.year
    month = now.month
    day = now.day

    # build the dynamic Year/Month/Date folder path under the mount point
    folder12 = "/mnt/abcd11/{}/{}/{}/output.csv".format(year, month, day)

    # write the dataframe into the folder in CSV format
    df11.write.option("header", "true").csv(folder12, mode="overwrite")
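    The write above still produces a folder named output.csv containing one part-*.csv file plus commit marker files. A minimal sketch, under the same assumptions as the sample code (df11 and the /mnt/abcd11 mount), that collapses the output to a single file with a chosen name such as abc.csv by writing to a temporary folder and renaming the part file with dbutils.fs:

    from datetime import datetime

    now = datetime.now()
    base = "/mnt/abcd11/{}/{}/{}".format(now.year, now.month, now.day)
    tmp_dir = base + "/_tmp"

    # write a single part file into a temporary folder
    df11.coalesce(1).write.option("header", "true").mode("overwrite").csv(tmp_dir)

    # find the part file, move it to the final name, then remove the temporary folder
    part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
    dbutils.fs.mv(part_file, base + "/abc.csv")
    dbutils.fs.rm(tmp_dir, True)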
     
    


    Output: screenshot showing the CSV written under the year/month/day folder structure in the storage account.