Tags: azure, pyspark, azure-blob-storage, spark-csv, azure-databricks

PySpark on Databricks: Reading a CSV file copied from the Azure Blob Storage results in java.io.FileNotFoundException


I am running Azure Databricks 4.3 (includes Apache Spark 2.3.1, Scala 2.11).

I copied a CSV file from Azure Blob Storage onto the Databricks cluster's local disk using dbutils.fs.cp, adding file: in front of the absolute local_path:

copy_to = "file:" + local_path
dbutils.fs.cp(blob_storage_path, copy_to)

When I then try to read the file using the same path with file: added in front:

csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load(copy_to)

I get an error message indicating that the given path does not exist:

java.io.FileNotFoundException: File file:/<local_path>

When I instead mount the Azure Blob Storage container, as described in the documentation linked below, I can read the file correctly with Spark using the same snippet above, passing the absolute local_path of the file in the mounted directory:

https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs
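For reference, the mount step from that page looks roughly like the sketch below; the container, storage account, mount point, and secret scope/key names are placeholders for values from my own setup:

# Placeholder names; the actual container, storage account, mount point, and
# secret scope/key come from your own environment.
dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
    }
)

# After mounting, the same read snippet works against the mount point:
csv_spark_df = sqlContext.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load("/mnt/<mount-name>/<path-to-file>.csv")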

Is it possible at all to read the CSV file that was copied from Azure Blob Storage, or is mounting the Azure Blob Storage container the preferred solution anyway?


Solution

  • I'm not certain what the file: prefix will map to.

    I would have expected the path to be a DBFS path:

    copy_to = "/path/file.csv"
    

    This will be interpreted as a DBFS path.

    You can always do:

    dbutils.fs.ls("/path")
    

    to verify that the file was copied. A short end-to-end sketch of this approach follows below.

    Though please note you do not need to copy the file to DBFS to load it into a dataframe - you can read directly from the blob storage account (see the second sketch below). That would be the normal approach. Is there a reason you want to copy it locally?
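
    A minimal end-to-end sketch of the DBFS approach, using a hypothetical /tmp target path (blob_storage_path is the same source path as in the question):

    # Hypothetical DBFS target path; with no "file:" prefix it is treated as DBFS.
    copy_to = "/tmp/file.csv"
    dbutils.fs.cp(blob_storage_path, copy_to)

    # Confirm the copy landed where expected.
    display(dbutils.fs.ls("/tmp"))

    # Read it back with the same options as in the question.
    csv_spark_df = spark.read.format('csv') \
        .options(header='true', inferSchema='true') \
        .load(copy_to)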
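
    And a sketch of reading the CSV directly from the storage account with a wasbs:// URL, following the same documentation page linked in the question; the storage account, container, and access key below are placeholders:

    # Placeholder storage account, container, and key; the key is set for this session.
    spark.conf.set(
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
        "<storage-account-access-key>")

    # Load the CSV straight from Blob Storage; no copy or mount is needed.
    csv_spark_df = spark.read.format('csv') \
        .options(header='true', inferSchema='true') \
        .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<path-to-file>.csv")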