I have a container in Azure Blob Storage that holds around 10,000,000 CSV and ZIP files.
I want to use "dbutils.fs.ls" in a Databricks notebook to get a list of the files. However, after running the command and waiting more than 30 minutes, I got the error below:
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
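For reference, the call is essentially just a flat listing of the mounted container; the mount path below is a placeholder for my actual mount point, and dbutils is only available inside the Databricks notebook:

# Placeholder mount path; this is the call that eventually crashes the driver.
files = dbutils.fs.ls("/mnt/<container>")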
I used a multi-node cluster, but it seems the cluster cannot handle listing all the files. I was wondering whether I could push down a filter on the filenames and only retrieve the list after filtering; I am only interested in files whose names start with "Energy". That way it might be possible to get the list of desired files without running into the above error.
Use the Azure Storage SDK for Python instead:
for blob in container_client.list_blobs(name_starts_with="Energy"):
    ...
list_blobs can filter results server-side via the name_starts_with parameter; moreover, it returns a lazy, paged iterator instead of materializing all results upfront.
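A minimal end-to-end sketch, assuming the azure-storage-blob package is installed on the cluster and using placeholder values for the connection string and container name:

from azure.storage.blob import ContainerClient

# Placeholders; substitute your own storage account credentials and container.
conn_str = "<storage-account-connection-string>"
container_client = ContainerClient.from_connection_string(
    conn_str, container_name="<container-name>"
)

# name_starts_with pushes the prefix filter to the storage service, and the
# returned paged iterator fetches results page by page, so the 10M-blob
# container is never listed in full in driver memory.
energy_files = [
    blob.name
    for blob in container_client.list_blobs(name_starts_with="Energy")
]

Under the hood the service returns the listing in pages (up to 5,000 blobs per request), so only one page is held in memory at a time while you iterate.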