azure, pyspark, azure-synapse, azure-synapse-analytics, azure-notebooks

Parquet file not overwriting in Azure Synapse notebooks


I am working in Azure Synapse Analytics notebooks, where I load Parquet files into DataFrames, perform some transformations, and then overwrite the original files with the transformed data. This process works perfectly for all my files except for one, which throws an error.

Here is a simplified version of my code:

%%pyspark
folder_path = 'abfss://example_folder_path/'

df = spark.read.load(folder_path, format='parquet').cache()
display(df.limit(10))

df.createOrReplaceTempView("dcp")

# Transformations (elided) that produce the transformed DataFrame `final`
# final = spark.sql("... FROM dcp")

display(final.limit(10))

df.write.mode('overwrite').parquet(folder_path)

df.unpersist()

The code fails when overwriting this specific file, and I get this error:

[screenshot of the error message]

I have verified that the problematic file is indeed a Parquet file and confirmed that there are no permission issues with the file or directory. I have also tried to find a difference between this file and the rest that would explain the failure, but I haven't found anything.
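
For reference, this is roughly how I compared the folders (both paths are placeholders, and mssparkutils is the file-system helper that ships with Synapse notebooks):

    from notebookutils import mssparkutils

    good_path = 'abfss://example_folder_path_ok/'  # placeholder: a folder that overwrites fine
    bad_path = 'abfss://example_folder_path/'      # placeholder: the folder that fails

    for path in (good_path, bad_path):
        print(path)
        # List what Spark will actually pick up: file names, sizes,
        # and any stray non-parquet files sitting in the folder
        for f in mssparkutils.fs.ls(path):
            print(f.name, f.size)
        # Compare the schema Spark infers for each folder
        spark.read.load(path, format='parquet').printSchema()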

Does anyone have any ideas as to why this specific file is causing issues or what I can do to fix it? Any help would be greatly appreciated.


Solution

  • This is happening because you are reading from and overwriting the same folder. Spark reads lazily, so the overwrite starts deleting the source files while the job still needs to read them, which typically shows up as a FileNotFoundException on one of the underlying files. I wouldn't recommend this pattern, but if you want to keep it, you have two options:

    1. Write to a temp location, read the data back from there, and then overwrite the original folder (recommended).
    2. Cache the DataFrame after all transformations and materialize it with an action before writing.

    Option 1 -

    folder_path = 'abfss://example_folder_path/'
    temp_folder = 'abfss://example_temp_folder_path/'  # staging location (placeholder path)

    df = spark.read.load(folder_path, format='parquet')

    ## Transformations
    # df = df.transform(...)

    # Stage the transformed data somewhere else first, so the job
    # never deletes the folder it is still reading from
    df.write.mode('overwrite').parquet(temp_folder)

    # Then overwrite the original folder from the staged copy
    spark.read.load(temp_folder, format='parquet').write.mode('overwrite').parquet(folder_path)
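
    Once the copy back succeeds, you can drop the staging folder; a minimal sketch using the Synapse file-system helpers (temp_folder is the placeholder defined above):

    from notebookutils import mssparkutils

    # Delete the staging folder and everything under it
    mssparkutils.fs.rm(temp_folder, recurse=True)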
    

    Option 2 -

    %%pyspark
    folder_path = 'abfss://example_folder_path/'
    df = spark.read.load(folder_path, format='parquet')

    ## Transformations
    # df = df.transform(...)

    # Cache after all transformations and run a full action so that
    # every partition is materialized before the source files are
    # deleted; show(10) alone would only pull a few rows into the cache
    df.cache()
    df.count()

    df.write.mode('overwrite').parquet(folder_path)
    df.unpersist()

    Note that option 1 is the safer of the two: with option 2, if an executor is lost during the write, its cached partitions have to be recomputed from the source files, which by then have already been deleted.