When reading data from an Azure storage container with a path that leads to nothing, I get different behaviour for the no-isolation-shared and Standard (previously called Shared) cluster types.
no-isolation-shared (and personal as well) throws an error upon trying to read the data from the path.
Standard, however, "reads" a dataframe lacking any schema or content. Upon trying to display() the dataframe, it throws the "path does not exist" error. I assume the read is lazily evaluated later, but is there any specific reason for this difference?
When you try to read data from a path in Azure Storage that does not exist, Databricks may behave differently depending on the type of cluster you are using, such as no-isolation-shared, personal, or Standard. Each cluster type has a different level of isolation and sharing, which affects how it handles tasks like checking whether a file exists.
Also, Spark uses lazy evaluation: it does not run your code right away when you ask it to do something like spark.read.parquet(). The actual work only happens when an action such as display(), count(), or collect() is executed.
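For example, on a cluster that defers the existence check, the read itself returns and the error only appears at the action. This is just a sketch of the behaviour you described; the abfss path below is a placeholder:

# On a cluster that defers the check, this returns a DataFrame immediately,
# even though the path points to nothing -- no files are actually scanned yet
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/missing-path/")

# The "Path does not exist" error only surfaces once an action forces Spark to read
df.count()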
Now, if you are trying to read from a path that does not exist, the point at which the error appears can depend on the type of cluster you are using:
Standard Clusters wait until you trigger an action before they check if the path exists, so the error only shows up later.
No Isolation Shared and Personal Clusters might check the path sooner, which means they can give you an error right away, even before you run an action.
So the difference comes down to when the cluster looks for the file, and that depends on the cluster type.
To handle scenarios where the data path might not exist:
Explicitly Check Path Existence: Before reading data, you can verify whether the path exists with dbutils.fs.ls().
# List the path first; dbutils.fs.ls() raises an error if the path does not exist
dbutils.fs.ls("abfss://container@account.dfs.core.windows.net/path/")
# Only read once the listing above has succeeded
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path/")
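If you want to handle the missing-path case gracefully instead of failing, you can wrap the check in a small helper. This is only a sketch: the exact exception raised by dbutils.fs.ls() can vary by runtime, so it matches on the error message, and the path and function name are placeholders.

def path_exists(path):
    # Returns True if the path can be listed, False if it does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if "java.io.FileNotFoundException" in str(e):
            return False
        raise

base_path = "abfss://container@account.dfs.core.windows.net/path/"
if path_exists(base_path):
    df = spark.read.parquet(base_path)
else:
    print(f"Path {base_path} does not exist, skipping read")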
Enforce the Schema: If you know what the data should look like (column names and types), pass the schema explicitly so Spark does not try to figure it out on its own, which can sometimes cause it to check file paths at read time.
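A minimal sketch of passing an explicit schema; the column names and types here are made up, so replace them with your own:

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical schema -- substitute the real column names and types
expected_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

# With an explicit schema, Spark skips schema inference from the files
df = (spark.read
      .schema(expected_schema)
      .parquet("abfss://container@account.dfs.core.windows.net/path/"))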
Resources:
Lazy Evaluation Databricks
Admin Isolation on Shared Clusters Blog
No-isolation-shared Documents