python, databricks, azure-databricks

How to write parquet file to Databricks Volume?


I'd like to export data from tables within my Databricks Unity Catalog, turning each table into a single parquet file that I can download. I thought I could just write each table to a parquet file in my Unity Catalog Volume (whose files I can also see within Microsoft Azure Storage Explorer) so that I could download it easily. That did not work. These are the approaches I tried:

  1. spark.table(my_unity_catalog_table_path).repartition(1).write.format('parquet').mode('overwrite').save('/Volumes/my_volume_name/my_table'). Databricks told me that I'm not allowed to write to a Volume like that.
  2. Write the same table to the Workspace, e.g. "/Workspace/Users/myuser/my_table". That didn't work either: no file was created, although I didn't get any error at all.
  3. Write the same table to the tmp directory, e.g. "/tmp/my_table". Same result: no file was created and no error was raised.
  4. Convert the table to pandas and write a parquet file to the Workspace, e.g. spark.table(my_unity_catalog_table_path).toPandas().to_parquet('/Workspace/Users/myuser/my_table.parquet'). This worked, but not for bigger tables; I guess the Workspace has limits regarding file size.
  5. Convert the table to pandas and write a parquet file to the Volume directly, e.g. spark.table(my_unity_catalog_table_path).toPandas().to_parquet('/Volumes/my_volume_name/my_table.parquet'), but that didn't work out either...
  6. Convert the table to pandas, write a parquet file to the tmp folder, e.g. spark.table(my_unity_catalog_table_path).toPandas().to_parquet('/tmp/my_table.parquet'), and then move it to the Volume afterwards using dbutils.fs.mv or shutil.move (see the sketch after this list). None of those options worked either.
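
For completeness, a minimal sketch of what attempt 6 looked like (the table and volume names are placeholders for my real ones):

    import shutil

    # pandas writes to the driver's local filesystem, so /tmp here is a
    # plain POSIX path on the driver node.
    spark.table(my_unity_catalog_table_path).toPandas().to_parquet('/tmp/my_table.parquet')

    # Then move the file into the Volume, either via dbutils (the 'file:'
    # scheme marks the source as a local driver path rather than DBFS) ...
    dbutils.fs.mv('file:/tmp/my_table.parquet', '/Volumes/my_volume_name/my_table.parquet')

    # ... or via plain Python over the Volume's FUSE mount:
    # shutil.move('/tmp/my_table.parquet', '/Volumes/my_volume_name/my_table.parquet')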

So how can this be done?


Solution

  • If you're working with Azure Data Lake, you can try writing directly to it and then downloading the file from there using Storage Explorer. Try specifying the path like in the snippet below:

    spark.table(my_unity_catalog_table_path).repartition(1).write.format('parquet').mode('overwrite').save("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
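
  • Note that even with repartition(1), Spark's save() creates a directory at that path containing a single part-*.parquet file (plus _SUCCESS marker files), not a standalone file. If you want one plainly-named file to download, here's a minimal sketch for copying that part file out, assuming the same placeholder path as above and an example destination name:

    out_dir = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>"

    # Locate the single parquet part file inside the output directory ...
    part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.endswith(".parquet")][0]

    # ... and copy it out under a predictable name (the destination
    # file name here is just an example).
    dbutils.fs.cp(part_file, "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/my_table.parquet")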