databricks azure-databricks

Unable to read in data from Databricks Academy


I am using Databricks Academy to learn Databricks. The training comes with sample data stored in Azure Data Lake Storage (an adl:// location).

However, the data doesn't appear to be accessible. I am getting the following error:

com.microsoft.azure.datalake.store.ADLException: Error getting info for file /dbacademy/people10m.parquet

The location of the data is:

people10m = spark.read.parquet("adl://devszendsadlsrdpacqncd.azuredatalakestore.net/dbacademy/people10m.parquet")

Can someone explain why I'm unable to access the data?

To add some clarity to this question: the linked Databricks notebook covers Aggregations, JOINs and Nested Queries. Before working through the notebook, you have to run the classroom setup with the following code: %run "./Includes/Classroom-Setup"

This executes the following code in a notebook called "Classroom-Setup":

people10m = spark.read.parquet("adl://devszendsadlsrdpacqncd.azuredatalakestore.net/dbacademy/people10m.parquet")

However, when the notebook runs the code I get the following error:

com.microsoft.azure.datalake.store.ADLException: Error getting info for file /dbacademy/people10m.parquet

Can someone let me know why I'm getting this error and suggest a workaround?


Solution

  • From the code you shared, you are trying to read data from Azure Data Lake Storage Gen1 (ADLS Gen1, the adl:// scheme). Microsoft has retired this service, so it is no longer supported in Azure.

    You will not be able to access data in ADLS Gen1. You first need to migrate the data from ADLS Gen1 to ADLS Gen2; refer to the Microsoft documentation on Gen1-to-Gen2 migration for details.

    After migrating the data to ADLS Gen2, you can access it from Azure Databricks with the code below:

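    # Connect to ADLS Gen2: register the storage account access key,
    # fetched from a Databricks secret scope instead of being hard-coded
    spark.conf.set(
        "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
        dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
    
    # Read the file from ADLS Gen2 (load() reads Parquet unless the default source format was changed)
    people10m = spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")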
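
If the setup notebook contains several adl:// paths, rewriting each one by hand is error-prone. The helper below is a minimal sketch of how an ADLS Gen1 URI could be mapped to the abfss:// form used above. It assumes the data was migrated with the same folder layout; the function name `adl_to_abfss` and the container and storage account values are hypothetical and not part of the Databricks Academy material.

```python
from urllib.parse import urlparse


def adl_to_abfss(adl_uri: str, container: str, storage_account: str) -> str:
    """Map an ADLS Gen1 (adl://) URI to an ADLS Gen2 (abfss://) URI.

    Assumption: the files were migrated into `container` of
    `storage_account` with the same folder structure as in Gen1.
    Both names are placeholders chosen by whoever did the migration.
    """
    parsed = urlparse(adl_uri)
    if parsed.scheme != "adl":
        raise ValueError(f"expected an adl:// URI, got: {adl_uri!r}")
    # Drop the leading slash so the path joins cleanly after the host name
    path = parsed.path.lstrip("/")
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path}"
```

In a notebook, the result can then be passed to the reader, for example `spark.read.parquet(adl_to_abfss(old_path, "<container-name>", "<storage-account-name>"))`, once the account key has been set as shown above.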