azure, azure-machine-learning-service, fine-tuning, slm-phi3

How to prepare data for batch inference in Azure ML?


The .csv data format that I am using for inferencing produces the following error when running the batch scoring job:

"each data point should be a conversation array"

All the documentation available online deals with image data, so I am unable to figure out the correct format for my data.

I am trying to create a job from the "Create job" button under the batch endpoint.

I am using the Azure ML platform and have fine-tuned a phi-3-mini-4k-instruct model on the "eurlex" dataset available on Hugging Face. The training process required the training data in JSONL format, so I converted it accordingly. However, when trying to run batch inference, data assets can only be stored in formats such as .csv and .png. The training data looked something like the first image.

Img 1, training format

I created a batch endpoint and deployed the model there. I then created a job and provided the data formatted as shown in the second image: I simply passed the whole prompt as a string in a single-column dataframe, which I wrote out to a .csv file.

Img 2, inferencing format

I have also tried a dataframe with three columns (system, assistant, user), but that doesn't work either.


Solution

  • The CSV file you provide must have column names matching the signature expected by the deployment's scoring script.

    For reference, see the documented example of a batch deployment for a text summarization model.

    Its driver code takes the mini-batch of input files, loads them with load_dataset, and reads the text column for predictions:

    import time
    from datasets import load_dataset

    def run(mini_batch):
        resultList = []

        print(f"[INFO] Reading new mini-batch of {len(mini_batch)} file(s).")
        ds = load_dataset("csv", data_files={"score": mini_batch})

        start_time = time.perf_counter()
        for idx, text in enumerate(ds["score"]["text"]):
            # ... the example then runs the model on `text` and
            # appends each prediction to resultList
            ...
        return resultList

    Also, when invoking the batch endpoint, the data is passed as an Input object:

    from azure.ai.ml import Input
    from azure.ai.ml.constants import AssetTypes

    input = Input(type=AssetTypes.URI_FOLDER, path="data")
    job = ml_client.batch_endpoints.invoke(
        endpoint_name=endpoint.name,
        input=input,
    )
    

    Here, the input is of folder type, and the CSV files are saved in the data folder.
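A minimal sketch of preparing that data folder, assuming the scoring script reads a column named text (the column name and prompts here are placeholders; match them to your deployment):

```python
# Write the batch input as CSV files inside a local "data" folder, which is
# then passed to the batch endpoint as a URI_FOLDER input.
import csv
import os

prompts = [
    "Summarize the following regulation ...",
    "Summarize the following directive ...",
]

os.makedirs("data", exist_ok=True)
with open(os.path.join("data", "batch_0.csv"), "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])  # header must match what the scoring script reads
    for prompt in prompts:
        writer.writerow([prompt])
```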

    So, check which column name your driver/scoring script accesses and use the same column name in your CSV.
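If the scoring script really expects a "conversation array" per data point (as your error message suggests), one possible approach is to serialize each conversation as JSON into a single CSV column. This is a sketch; the column name text is an assumption:

```python
# Serialize a chat-style conversation into one CSV cell; the scoring script
# can rebuild it with json.loads().
import csv
import json

conversation = [
    {"role": "system", "content": "You are a legal summarization assistant."},
    {"role": "user", "content": "Summarize the following regulation ..."},
]

with open("chat_batch.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])  # assumed column name; match your scoring script
    writer.writerow([json.dumps(conversation)])
```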

    Also, you can invoke the batch endpoint with a folder containing any number of CSV files, as shown in the code above.

    If you cannot find the scoring script or its signature, deploy the batch endpoint with your own custom scoring script, using code/batch_driver.py from the documentation referenced above as a starting point.

    Create CSV files with a column named message containing the prompt string, then read that column in your custom scoring script for prediction.
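A hypothetical custom scoring script along those lines, in the style of batch_driver.py; the init() body and the model call are placeholders you would replace with your fine-tuned Phi-3 loading and generation code:

```python
# Reads the "message" column from each CSV file in the mini-batch.
import csv

model = None

def init():
    global model
    # e.g. load the registered fine-tuned model from the AZUREML_MODEL_DIR path
    model = ...

def run(mini_batch):
    results = []
    for file_path in mini_batch:
        with open(file_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                message = row["message"]
                # results.append(model.generate(message))  # hypothetical call
                results.append(message)  # placeholder: echo the input
    return results
```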