I have a container in Azure Blob Storage that holds around 10,000,000 CSV and ZIP files.
I want to use "dbutils.fs.ls" in a Databricks notebook to get a list of the files. However, after running the command and waiting more than 30 minutes, I got the error below:
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
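For reference, the call is essentially just a flat listing of the mounted container; the mount path below is a placeholder for my actual mount point, and dbutils is only available inside the Databricks notebook:

# Placeholder mount path; this is the call that eventually crashes the driver.
files = dbutils.fs.ls("/mnt/<container>")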
I used a multi-node cluster, but it seems the cluster cannot handle listing all the files. I was wondering whether I could push down a filter on the filenames and only retrieve the list after filtering; I am only interested in files whose names start with "Energy". That way it might be possible to get the list of desired files without running into the above error.
Use the Azure Storage SDK for Python instead:
for blob in container_client.list_blobs(name_starts_with="Energy"):
    ...
list_blobs can filter results server-side via the name_starts_with parameter; moreover, it returns a lazy, paged iterator instead of materializing all results upfront.
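A minimal end-to-end sketch, assuming the azure-storage-blob package is installed on the cluster and using placeholder values for the connection string and container name:

from azure.storage.blob import ContainerClient

# Placeholders; substitute your own storage account credentials and container.
conn_str = "<storage-account-connection-string>"
container_client = ContainerClient.from_connection_string(
    conn_str, container_name="<container-name>"
)

# name_starts_with pushes the prefix filter to the storage service, and the
# returned paged iterator fetches results page by page, so the 10M-blob
# container is never listed in full in driver memory.
energy_files = [
    blob.name
    for blob in container_client.list_blobs(name_starts_with="Energy")
]

Under the hood the service returns the listing in pages (up to 5,000 blobs per request), so only one page is held in memory at a time while you iterate.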