I want to save a Dataframe (pyspark.pandas.Dataframe) as an Excel file on the Azure Data Lake Gen2 using Azure Databricks in Python. I've switched to the pyspark.pandas.Dataframe because it is the recommended one since Spark 3.2.
There's a method called to_excel (here the doc) that allows to save a file to a container in ADL but I'm facing problems with the file system access protocols. From the same class I use the methods to_csv and to_parquet using abfss and I'd like to use the same for the excel.
So when I try so save it using:
import pyspark.pandas as ps
# Omit the df initialization
file_name = "abfss://CONTAINER@SERVICEACCOUNT.dfs.core.windows.net/FILE.xlsx"
sheet = "test"
df.to_excel(file_name, test)
I get the error from fsspec:
ValueError: Protocol not known: abfss
Can someone please help me?
Thanks in advance!
The pandas dataframe does not support the protocol. It seems on Databricks you can only access and write the file on abfss via Spark dataframe. So, the solution is to write file locally and manually move to abfss. See this answer here.