azure azure-machine-learning-service

Access individual files in a folder data asset in Azure Machine Learning


I have a data asset in Azure Machine Learning. It is of type folder, and the folder contains 4 different files with different schemas. When I consume this data asset in an Azure ML notebook, it treats the different files as partitions and messes up the schema. I want to select individual files when pulling the data into the notebook.

I tried to pass the file name as a parameter in the path variable as shown below:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("data_asset_name", version="1")

path = {
  'folder': data_asset.path + "file_name.csv"
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df

But this gives the following error:

UserErrorException: 
Error Code: ScriptExecution.StreamAccess.NotFound
Native Error: Dataflow visit error: ExecutionError(StreamError(NotFound))
    VisitError(ExecutionError(StreamError(NotFound)))
=> Failed with execution error: error in streaming from input data sources
    ExecutionError(StreamError(NotFound))
Error Message: The requested stream was not found. Please make sure the request uri is correct.| session_id= <some id>

How do I pull in individual files?


Solution

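  • According to the documentation, from_delimited_files supports paths given as

    files or folders with local or cloud paths

    So the dictionary key has to match what the path points to: use 'file' when the path is a single file and 'folder' when it is a folder. Your code uses the 'folder' key for a path that points to a file, which is the likely cause of the NotFound error.

    Change your code as shown below. The delimiter=';' argument is there because the sample file used here, winequality-white.csv, is semicolon-separated; use whatever delimiter matches your own file.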

    path = {
      'file': data_asset.path + "winequality-white.csv"
    }
    
    tbl = mltable.from_delimited_files(paths=[path],delimiter=';')
    df = tbl.to_pandas_dataframe()
    df
    

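
Since the folder holds several files with different schemas, the same idea can be wrapped in a small helper that loads each file into its own DataFrame. This is only a sketch: file_entry and load_files are hypothetical names, not part of mltable, and the file names and delimiters in the usage note are placeholders. It also joins the folder path and the file name with exactly one "/", on the assumption that data_asset.path may or may not end with a slash depending on how the asset was registered.

```python
def file_entry(base_path: str, file_name: str) -> dict:
    """Build a {'file': ...} path entry for one file under a folder URI."""
    # Join with exactly one "/" so the result is the same whether or not
    # the folder URI already ends with a slash.
    return {"file": base_path.rstrip("/") + "/" + file_name.lstrip("/")}


def load_files(base_path: str, delimiters: dict) -> dict:
    """Load each named file into its own pandas DataFrame.

    `delimiters` maps file name -> delimiter (a hypothetical helper argument).
    """
    # Imported here so file_entry can be used without mltable installed.
    import mltable

    frames = {}
    for name, delim in delimiters.items():
        tbl = mltable.from_delimited_files(
            paths=[file_entry(base_path, name)],
            delimiter=delim,
        )
        frames[name] = tbl.to_pandas_dataframe()
    return frames
```

With the data_asset object from the question, a call might look like load_files(data_asset.path, {"winequality-white.csv": ";", "other_file.csv": ","}), where the second file name is a placeholder. Each file then gets its own table and its own schema instead of being merged as partitions.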