Tags: azure-machine-learning-service, azureml-python-sdk, azuremlsdk

How to modify azureml python sdk v2 to serve custom use cases?


I want to modify azureml.data.dataset_factory.register_pandas_dataframe() for my use case so that, by default, it returns relative_path_with_guid in addition to registered_dataset.

The default definition of azureml.data.dataset_factory.register_pandas_dataframe() is:

    @staticmethod
    @track(_get_logger, custom_dimensions={'app_name': 'TabularDataset'}, activity_type=_PUBLIC_API)
    def register_pandas_dataframe(dataframe, target, name, description=None, tags=None, show_progress=True):
        """Create a dataset from pandas dataframe.

        :param dataframe: Required, in memory dataframe to be uploaded.
        :type dataframe: pandas.DataFrame
        :param target: Required, the datastore path where the dataframe parquet data will be uploaded to.
            A guid folder will be generated under the target path to avoid conflict.
        :type target: typing.Union[azureml.data.datapath.DataPath, azureml.core.datastore.Datastore,
            tuple(azureml.core.datastore.Datastore, str)]
        :param name: Required, the name of the registered dataset.
        :type name: str
        :param description: Optional. A text description of the dataset. Defaults to None.
        :type description: str
        :param tags: Optional. Dictionary of key value tags to give the dataset. Defaults to None.
        :type tags: dict[str, str]
        :param show_progress: Optional, indicates whether to show progress of the upload in the console.
            Defaults to be True.
        :type show_progress: bool
        :return: The registered dataset.
        :rtype: azureml.data.TabularDataset
        """
        import pandas as pd
        from azureml.data.datapath import DataPath
        from uuid import uuid4

        console = get_progress_logger(show_progress)
        console("Validating arguments.")
        _check_type(dataframe, "dataframe", pd.core.frame.DataFrame)
        _check_type(name, "name", str)
        datastore, relative_path = parse_target(target, True)
        console("Arguments validated.")

        guid = uuid4()
        relative_path_with_guid = "%s/%s/" % (relative_path, guid)
        console("Successfully obtained datastore reference and path.")

        console("Uploading file to {}".format(relative_path_with_guid))
        sanitized_df = _sanitize_pandas(dataframe)
        dflow = dataprep().read_pandas_dataframe(df=sanitized_df, in_memory=True)
        target_directory_path = DataReference(datastore=datastore).path(relative_path_with_guid)
        dflow.write_to_parquet(directory_path=target_directory_path).run_local()
                
        console("Successfully uploaded file to datastore.")

        console("Creating and registering a new dataset.")
        datapath = DataPath(datastore, relative_path_with_guid)
        saved_dataset = TabularDatasetFactory.from_parquet_files(datapath)
        registered_dataset = saved_dataset.register(datastore.workspace, name,
                                                    description=description,
                                                    tags=tags,
                                                    create_new_version=True)
        console("Successfully created and registered a new dataset.")

        return registered_dataset

I have learned that editing the installed source code directly is not good practice and that I should instead make changes to the package in develop (editable) mode. Even if that is an option, I don't know where to find a setup.py for the azureml-sdk package. I get the following error when I run pip install azureml-sdk -e /path/to/azureml-dev/folder:

ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: /path/to/azureml-dev/folder

Has anyone experimented with tweaking the azureml-sdk like this? If so, how did you get around the setup.py issue?


Solution

  • Since the azureml SDK v2 is a closed-source Python module, its code cannot be modified.
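
Even without changing the installed package, the extra return value can probably be obtained by repeating the same steps in your own helper and building the GUID path yourself. The following is only a minimal sketch, not the SDK's implementation. It assumes azureml-core (where azureml.data.dataset_factory lives), pandas, and a parquet engine such as pyarrow are installed. The names build_guid_path and register_dataframe_with_path are hypothetical, the target is simplified to a datastore plus a relative path string, and the SDK's internal dataframe sanitizing step is not reproduced.

```python
import os
import tempfile
from uuid import uuid4


def build_guid_path(relative_path):
    """Build a "<relative_path>/<guid>/" folder path, as the SDK function does."""
    return "%s/%s/" % (relative_path.rstrip("/"), uuid4())


def register_dataframe_with_path(dataframe, datastore, relative_path, name,
                                 description=None, tags=None, show_progress=True):
    """Hypothetical helper: upload, register, and also return the upload path."""
    # Imported lazily so build_guid_path stays usable without the SDK installed.
    from azureml.core import Dataset
    from azureml.data.datapath import DataPath

    relative_path_with_guid = build_guid_path(relative_path)

    with tempfile.TemporaryDirectory() as tmp_dir:
        # Write the dataframe to a local parquet file (needs pyarrow or similar).
        dataframe.to_parquet(os.path.join(tmp_dir, "data.parquet"), index=False)
        # Upload the local folder to the GUID folder on the datastore.
        Dataset.File.upload_directory(
            src_dir=tmp_dir,
            target=(datastore, relative_path_with_guid),
            show_progress=show_progress,
        )

    # Create a tabular dataset from the uploaded parquet data and register it.
    datapath = DataPath(datastore, relative_path_with_guid)
    dataset = Dataset.Tabular.from_parquet_files(datapath)
    registered_dataset = dataset.register(datastore.workspace, name,
                                          description=description,
                                          tags=tags,
                                          create_new_version=True)
    return registered_dataset, relative_path_with_guid
```

A call would then look like registered_dataset, path = register_dataframe_with_path(df, datastore, "some/folder", "my_dataset"), where df, datastore, and the argument values are placeholders. Keeping the logic in your own module means it survives SDK upgrades, at the cost of not tracking any future changes to the library function.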