I am learning Databricks through Databricks Academy. The training comes with data stored in Azure Data Lake Storage (adl://) for use with the course.
However, the data doesn't appear to be accessible. We are getting the error:
com.microsoft.azure.datalake.store.ADLException: Error getting info for file /dbacademy/people10m.parquet
The location of the data is:
people10m = spark.read.parquet("adl://devszendsadlsrdpacqncd.azuredatalakestore.net/dbacademy/people10m.parquet")
Can someone explain why we're unable to access the data?
To add some clarity to this question: the Databricks notebook in question covers Aggregations, JOINs and Nested Queries. To work through the notebook, you first have to run the classroom setup with the following code: %run "./Includes/Classroom-Setup"
This executes the following code in a notebook called "Classroom-Setup":
people10m = spark.read.parquet("adl://devszendsadlsrdpacqncd.azuredatalakestore.net/dbacademy/people10m.parquet")
However, when the notebook runs the code I get the following error:
com.microsoft.azure.datalake.store.ADLException: Error getting info for file /dbacademy/people10m.parquet
Therefore, can someone let me know why I'm getting the error, and provide a workaround?
As per the code you shared, you are trying to read data from Azure Data Lake Storage Gen1 (ADLS Gen1), but this service is no longer supported in Azure.
You will not be able to access data in ADLS Gen1. You first need to migrate the data from ADLS Gen1 to ADLS Gen2. You can refer to the Microsoft documentation on migration for more information.
After migrating the data to ADLS Gen2, you can access it from Azure Databricks with the code below:
# Connect to ADLS Gen2 using the storage account access key stored in a secret scope
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
# Read the file from ADLS Gen2
spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
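Applied to the path from the question, the rewrite from an adl:// location to an abfss:// location might look like the sketch below. The helper name adl_to_abfss, the container "training" and the storage account "mystorageaccount" are hypothetical, and the sketch assumes the migrated data keeps the same relative path (/dbacademy/people10m.parquet) in the new container.

```python
from urllib.parse import urlparse


def adl_to_abfss(adl_uri: str, container: str, storage_account: str) -> str:
    """Rewrite an ADLS Gen1 adl:// URI as an ADLS Gen2 abfss:// URI.

    Hypothetical helper: assumes the file keeps the same relative path
    after migration.
    """
    parsed = urlparse(adl_uri)
    if parsed.scheme != "adl":
        raise ValueError(f"not an ADLS Gen1 URI: {adl_uri}")
    path = parsed.path.lstrip("/")
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path}"


old_uri = "adl://devszendsadlsrdpacqncd.azuredatalakestore.net/dbacademy/people10m.parquet"

# "training" and "mystorageaccount" are placeholder names, not real resources.
new_uri = adl_to_abfss(old_uri, container="training", storage_account="mystorageaccount")
print(new_uri)

# On a Databricks cluster, once the account key is configured as shown above:
# people10m = spark.read.parquet(new_uri)
```

The Classroom-Setup notebook would then need its spark.read.parquet call pointed at the new location instead of the adl:// path.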