I am doing an ETL process in Azure:
1. The source data is in Azure Data Lake.
2. It is processed in Azure Databricks.
3. The output dataframe is loaded into a specific folder in Azure Data Lake, organized as current year / month / date, with the file written in CSV format.
I am stuck on the 3rd step -
1. Tried loading the dataframe to the mnt location:
outPath = '/dbfs/mnt/abcd/<outputfoldername>/' + cy_year + "/" + cy_month + "/" + cy_date + "/"
df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv(outPath)
This saves the data to DBFS but not to ADLS, even though many links on the internet suggest this approach (see the corrected sketch after this list).
2. Tried working like -
spark.conf.set("fs.azure.account.key.<storage account name>.dfs.core.windows.net", "<ACCESS KEY>")
output_container_path = "abfss://<container-name>@salesdetails.dfs.core.windows.net/<dir path>"
df.coalesce(1).write.mode("overwrite").option("header", "true").format("com.databricks.spark.csv").save(output_container_path)
This saves the data into ADLS, but as 4 files; 3 of them are support files (_SUCCESS, _committed_*, _started_*). I want only one final file with a specific name, for example abc.csv.
3. Tried with a pandas dataframe, which gives the flexibility to name the output file, but it requires a specific folder path, which does not fit my case (see the pandas sketch after this list).
Please assist at the earliest. Many thanks in advance.
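For context, here are minimal sketches of attempts 1 and 3 (placeholders such as <outputfoldername> are kept as-is; /mnt/abcd is assumed to be an existing ADLS mount).

Attempt 1: the Spark DataFrame writer resolves plain paths against DBFS, so '/dbfs/mnt/...' points at a literal dbfs:/dbfs/... folder rather than the mount; dropping the /dbfs prefix sends the write through the mount to ADLS:

# Sketch: use the DBFS mount path, not the local /dbfs FUSE path
outPath = '/mnt/abcd/<outputfoldername>/' + cy_year + "/" + cy_month + "/" + cy_date + "/"
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(outPath)

Attempt 3: pandas can write a single file with an exact name through the /dbfs FUSE path, as long as the target folder already exists:

import os

# Sketch, assuming the same hypothetical mount; pandas does not create missing folders
local_dir = '/dbfs/mnt/abcd/<outputfoldername>/' + cy_year + "/" + cy_month + "/" + cy_date
os.makedirs(local_dir, exist_ok=True)
df.toPandas().to_csv(local_dir + '/abc.csv', index=False)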
Follow these steps:
1. Mount your storage account with Azure Data Lake Gen2 as per MS_Doc.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "f4dab6c8-5009-4857xxxxxxxxxxxxx",  # service principal (application) ID
    "fs.azure.account.oauth2.client.secret": "3GF8Q~3ZGkgflxxxxxxxxx",  # service principal secret
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/72f988bfxxxxxxx/oauth2/token"  # tenant token endpoint
}

# Mount the ADLS Gen2 container so it is reachable under /mnt/abcd11
dbutils.fs.mount(
    source = "abfss://demo123@vamblob.dfs.core.windows.net/",
    mount_point = "/mnt/abcd11",
    extra_configs = configs)
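You can optionally verify the mount before writing (dbutils.fs.ls is a standard Databricks utility):

# Optional check: list the mounted folder to confirm the mount works
display(dbutils.fs.ls("/mnt/abcd11"))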
2. Configure your storage account and read the source data from Azure Data Lake.
df11 = spark.read.format("csv").load("abfss://<container>@<Storage_acccount>.dfs.core.windows.net/")
display(df11)
Sample code:
from datetime import datetime

# Build the year/month/day folder path from the current date
now = datetime.now()
year = now.year
month = now.month
day = now.day
folder12 = "/mnt/abcd11/{}/{}/{}/output.csv".format(year, month, day)

# Write the dataframe into the folder in CSV format
# (note: Spark creates a folder named output.csv containing part files)
df11.write.option("header", "true").csv(folder12, mode="overwrite")
Output: (screenshot of the resulting files not included)
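3. If you need exactly one file with a fixed name such as abc.csv, a common pattern is to write to a temporary folder, then promote the single part file with dbutils.fs. A minimal sketch, reusing the variables from the sample code above:

# Write to a temporary folder first; Spark always produces a folder of part files
tmp_folder = "/mnt/abcd11/{}/{}/{}/tmp_output".format(year, month, day)
final_path = "/mnt/abcd11/{}/{}/{}/abc.csv".format(year, month, day)

df11.coalesce(1).write.option("header", "true").csv(tmp_folder, mode="overwrite")

# Locate the single part-*.csv file, copy it to the final name, drop the temp folder
part_file = [f.path for f in dbutils.fs.ls(tmp_folder) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, final_path)
dbutils.fs.rm(tmp_folder, True)

coalesce(1) keeps everything in one partition so there is exactly one part file; for very large outputs this funnels the write through a single task, so use it only when a single CSV is a hard requirement.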