I want to modify azureml.data.dataset_factory.register_pandas_dataframe() for my use case so that, by default, it returns relative_path_with_guid in addition to registered_dataset.
The default azureml.data.dataset_factory.register_pandas_dataframe() function definition is:
@staticmethod
@track(_get_logger, custom_dimensions={'app_name': 'TabularDataset'}, activity_type=_PUBLIC_API)
def register_pandas_dataframe(dataframe, target, name, description=None, tags=None, show_progress=True):
    """Create a dataset from pandas dataframe.

    :param dataframe: Required, in memory dataframe to be uploaded.
    :type dataframe: pandas.DataFrame
    :param target: Required, the datastore path where the dataframe parquet data will be uploaded to.
        A guid folder will be generated under the target path to avoid conflict.
    :type target: typing.Union[azureml.data.datapath.DataPath, azureml.core.datastore.Datastore,
        tuple(azureml.core.datastore.Datastore, str)]
    :param name: Required, the name of the registered dataset.
    :type name: str
    :param description: Optional. A text description of the dataset. Defaults to None.
    :type description: str
    :param tags: Optional. Dictionary of key value tags to give the dataset. Defaults to None.
    :type tags: dict[str, str]
    :param show_progress: Optional, indicates whether to show progress of the upload in the console.
        Defaults to be True.
    :type show_progress: bool
    :return: The registered dataset.
    :rtype: azureml.data.TabularDataset
    """
    import pandas as pd
    from azureml.data.datapath import DataPath
    from uuid import uuid4

    console = get_progress_logger(show_progress)
    console("Validating arguments.")
    _check_type(dataframe, "dataframe", pd.core.frame.DataFrame)
    _check_type(name, "name", str)
    datastore, relative_path = parse_target(target, True)
    console("Arguments validated.")

    guid = uuid4()
    relative_path_with_guid = "%s/%s/" % (relative_path, guid)
    console("Successfully obtained datastore reference and path.")

    console("Uploading file to {}".format(relative_path_with_guid))
    sanitized_df = _sanitize_pandas(dataframe)
    dflow = dataprep().read_pandas_dataframe(df=sanitized_df, in_memory=True)
    target_directory_path = DataReference(datastore=datastore).path(relative_path_with_guid)
    dflow.write_to_parquet(directory_path=target_directory_path).run_local()
    console("Successfully uploaded file to datastore.")

    console("Creating and registering a new dataset.")
    datapath = DataPath(datastore, relative_path_with_guid)
    saved_dataset = TabularDatasetFactory.from_parquet_files(datapath)
    registered_dataset = saved_dataset.register(datastore.workspace, name,
                                                description=description,
                                                tags=tags,
                                                create_new_version=True)
    console("Successfully created and registered a new dataset.")
    return registered_dataset
I have learned that changing the source code is not good practice and that I should instead make changes to the package in develop mode. Even if that is an option, I don't know where to find a setup.py for the azureml-sdk package. I run into an error when I run:
pip install azureml-sdk -e /path/to/azureml-dev/folder
ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: /path/to/azureml-dev/folder
I was wondering whether anyone has experimented with tweaking the azureml-sdk like this and, if so, how you got around the setup.py issue.
Since azureml sdk-v2 is a closed-source Python module, its code cannot be modified.
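If the goal is only to know where the data was uploaded, one possible workaround is to wrap the call instead of editing the package. The sketch below is a rough idea and not part of the SDK: register_and_return_path, register_fn, and base_path are hypothetical names. It relies on the docstring above, which says target may be a (Datastore, str) tuple. The wrapper generates its own unique folder and returns it. Judging from the function body above, the SDK still appends its own GUID sub-folder under that target, so the returned value is the unique parent folder rather than the exact relative_path_with_guid.

```python
from uuid import uuid4


def register_and_return_path(register_fn, dataframe, datastore, base_path, name, **kwargs):
    """Register a dataframe and also return the unique folder it was uploaded under.

    register_fn is assumed to be TabularDatasetFactory.register_pandas_dataframe;
    it is passed in so that the installed SDK stays untouched.
    """
    # Create a caller-owned unique folder so its location is known up front.
    unique_folder = "%s/%s" % (base_path.rstrip("/"), uuid4())
    # Pass the target as a (datastore, path) tuple; the SDK is expected to add
    # its own GUID sub-folder below unique_folder.
    dataset = register_fn(dataframe, (datastore, unique_folder), name, **kwargs)
    return dataset, unique_folder


# Intended usage (assumes azureml-core is installed and df/datastore already exist):
# from azureml.data.dataset_factory import TabularDatasetFactory
# dataset, folder = register_and_return_path(
#     TabularDatasetFactory.register_pandas_dataframe,
#     df, datastore, "my/data", "my_dataset", show_progress=False)
```

Because the registration function is injected, the wrapper can be tried out with a stand-in function before being pointed at a real workspace.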