Tags: acl, azure-databricks, azure-data-lake-gen2

Azure ADLS Gen2 file created by Azure Databricks doesn't inherit ACL


I have a Databricks notebook that writes a dataframe to a file in ADLS Gen2 storage.

It creates a temp folder, outputs the file, and then copies that file to a permanent folder. For some reason the file doesn't inherit the ACL correctly, although the folder it creates has the correct ACL.

The code for the notebook:

# Get data into a dataframe
df_export = spark.sql(SQL)

# Output the file to a temp directory; coalesce(1) creates a single output data file
(df_export.coalesce(1)
    .write.format("parquet")
    .mode("overwrite")
    .save(TempFolder))

# Get the parquet file name. It's always last in the folder listing,
# as the other files are created starting with _
file = dbutils.fs.ls(TempFolder)[-1][0]

# Create the permanent copy
dbutils.fs.cp(file, FullPath)
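As an aside, relying on the parquet part file always sorting last in the listing is a little fragile; filtering by extension is more explicit. A minimal sketch with hypothetical file names (in the real notebook, the paths would come from `dbutils.fs.ls(TempFolder)`):

```python
# Hypothetical listing of a Spark output folder: metadata files start with "_",
# and coalesce(1) produces a single part-*.parquet data file.
listing = [
    "abfss://container@account.dfs.core.windows.net/temp/_SUCCESS",
    "abfss://container@account.dfs.core.windows.net/temp/_committed_123",
    "abfss://container@account.dfs.core.windows.net/temp/part-00000-abc.snappy.parquet",
]

# Pick the parquet file explicitly instead of assuming it sorts last.
parquet_files = [p for p in listing if p.endswith(".parquet")]
assert len(parquet_files) == 1  # coalesce(1) should yield exactly one data file
file = parquet_files[0]
```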

The temp folder that is created shows the following for the relevant account:

[Image: Folder Permissions]

Whereas the file shows the following:

[Image: File Permission]

There is also a mask. I'm not really familiar with masks, so I'm not sure how this differs.

The mask permission on the folder shows:

[Image: Mask Folder Permissions]

On the file it shows as:

[Image: Mask File Permission]
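For context on masks: in POSIX-style ACLs (which ADLS Gen2 uses), the mask entry caps the permissions of named users, named groups, and the owning group; the effective permission is the bitwise AND of the ACL entry and the mask. A quick sketch of that calculation (the values here are illustrative, not taken from the screenshots):

```python
# POSIX ACL semantics: effective permission = ACL entry AND mask.
R, W, X = 0b100, 0b010, 0b001

def effective(entry: int, mask: int) -> int:
    """Effective permission is the bitwise AND of the ACL entry and the mask."""
    return entry & mask

# A named user is granted rwx on the directory...
entry = R | W | X   # rwx
# ...but the file's mask is r-- ...
mask = R            # r--
# ...so the user's effective access on the file is read-only,
# even though the ACL entry itself still says rwx.
assert effective(entry, mask) == R
```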

Does anyone have any idea why this wouldn't be inheriting the ACL from the parent folder?


Solution

  • I've had a response from Microsoft support which has resolved this issue for me.

    Cause: Databricks-created files have the service principal as their owner, with permission -rw-r--r--. This forces the effective permission of the rest of the batch users in ADLS down from rwx (the directory permission) to r--, which in turn causes jobs to fail.

    Resolution: To resolve this, change the default umask (022) to a custom umask (000) on the Databricks side. You can set the following in the Spark configuration settings under your cluster configuration:

    spark.hadoop.fs.permissions.umask-mode 000
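To see why the default umask produces -rw-r--r--: new files start from a base mode of 666 (rw for everyone), and the umask bits are cleared from it. A small illustration of the arithmetic (plain Python, not a Databricks API):

```python
BASE_FILE_MODE = 0o666  # files start as rw-rw-rw- before the umask is applied

def apply_umask(base: int, umask: int) -> int:
    # The umask *clears* bits: resulting mode = base AND NOT umask.
    return base & ~umask

# The default umask 022 strips write from group and other -> 644 (-rw-r--r--),
# which is exactly the restrictive file permission seen in the question.
assert apply_umask(BASE_FILE_MODE, 0o022) == 0o644

# A umask of 000 clears nothing -> 666 (-rw-rw-rw-),
# so the parent directory's ACL is no longer masked down to r--.
assert apply_umask(BASE_FILE_MODE, 0o000) == 0o666
```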