Tags: azure, azure-databricks, delta-lake, data-processing

Delta table partition folder name is getting changed


I am facing an issue where the date partition folders should be named in the format date=yyyymmdd, but they are instead being created with unexpected names.

Sometimes a separate folder is created for each parquet file written to the Delta path.

I don't see any issues with the source data or the PySpark code, since it works perfectly for other data sources. The same data also writes perfectly to a separate Delta path.

It's not causing any functional issues, since the date is captured correctly in the Delta table and can be queried. But if I rename the folders manually in the storage account, it throws an error.

I am expecting the data for each date to be written into a dedicated folder named with that date value, in the date=yyyymmdd format.


The PySpark code creates a date column from a timestamp value that looks like 2021-10-27T11:56:41.380416Z. I tried converting the field into a timestamp and then extracting the date, but that creates the folder as date=. The existing code was working for this database earlier, but it suddenly started behaving this way.
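A minimal sketch of the pattern described above, i.e. deriving a date partition column from an ISO-8601 timestamp and writing it to a Delta path. The DataFrame, column names, and output path are hypothetical stand-ins for the real source, not the actual code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the real source.
events = spark.createDataFrame(
    [("2021-10-27T11:56:41.380416Z", "a"),
     ("2021-10-28T09:12:03.000001Z", "b")],
    ["event_ts", "payload"],
)

# Derive a yyyymmdd string from the ISO-8601 timestamp and partition on it,
# which is expected to produce folders such as date=20211027.
events = events.withColumn(
    "date", F.date_format(F.to_timestamp("event_ts"), "yyyyMMdd")
)

(events.write
    .format("delta")
    .mode("append")
    .partitionBy("date")
    .save("/mnt/delta/events"))  # hypothetical output path
```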


Solution

  • Thanks for the response, but it turned out to be an issue with the Delta table protocol version. To remove or rename any column in a Delta table, the table protocol must be upgraded to reader version 2 and writer version 5 (which enables column mapping), and that upgrade is what causes the behaviour I was seeing. As per Databricks, this protocol change is irreversible.

    For a Delta table with the default reader version 1 and writer version 2, the same code works fine and the date folders are created as expected. The protocol versions can be checked as in the sketch below.
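    A quick way to confirm which protocol a table is on is to read it from DESCRIBE DETAIL. This is a minimal sketch; the table path is a hypothetical placeholder:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical path; substitute the actual Delta table location.
    detail = spark.sql("DESCRIBE DETAIL delta.`/mnt/delta/events`")

    # Tables on the default protocol (reader 1 / writer 2) keep the familiar
    # date=yyyymmdd partition folders; tables upgraded to reader 2 / writer 5
    # are laid out differently on storage.
    detail.select("minReaderVersion", "minWriterVersion").show()
    ```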