filter, azure-blob-storage, databricks, dbutils

Is there any way to push down a filter when running "dbutils.fs.ls" in a Databricks notebook?


I have a container in an Azure blob storage that contains around 10,000,000 CSV and Zip files. 

I want to use "dbutils.fs.ls" in a Databricks notebook to get a list of the files. However, after running the command and waiting for more than 30 minutes, I got the following error:

    The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
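
For context, the failing call was essentially of this form (the mount path below is a hypothetical placeholder):

    # Hypothetical mount path; dbutils.fs.ls eagerly returns a list of every
    # object under it, which is what overwhelms the driver with ~10,000,000 files.
    files = dbutils.fs.ls("/mnt/my-container/")
    print(len(files))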

I used a multi-node cluster.

It seems the cluster cannot handle listing all of the files at once. I was wondering whether I could push down a filter on the filenames so that only matching files are returned; I am only interested in files whose names start with "Energy". That way, it might be possible to get the list of desired files without triggering the above error.


Solution

  • Use the Azure SDK instead:

    for blob in container_client.list_blobs(name_starts_with="Energy"):
      ...
    

    list_blobs can filter results on the server side; moreover, it returns a lazy, paged iterator rather than materializing all results upfront. A fuller, self-contained sketch follows below.
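
    As a minimal sketch using the azure-storage-blob (v12) package: the connection string and container name below are placeholders you would substitute with your own, and the snippet can be run from the same Databricks notebook.

    from azure.storage.blob import ContainerClient

    # Placeholders: substitute your storage account connection string and container name.
    container_client = ContainerClient.from_connection_string(
        conn_str="<storage-account-connection-string>",
        container_name="<container-name>",
    )

    # The name_starts_with prefix is applied by the service, and list_blobs returns
    # a lazy, paged iterator, so only matching blobs are pulled back page by page.
    energy_blobs = [
        blob.name
        for blob in container_client.list_blobs(name_starts_with="Energy")
    ]
    print(f"Found {len(energy_blobs)} blobs starting with 'Energy'")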