Tags: python, json, amazon-web-services, amazon-s3, amazon-personalize

JSON malformed error for Batch Inference Job Input - Amazon Personalize


I have created a solution version using the "similar-items" recipe in Amazon Personalize and am trying to test it with a batch inference job. I followed the AWS documentation, which states that the input should be a list of itemIds, with a maximum of 500 items, and each itemId separated by a new line:

{"itemId": "105"}
{"itemId": "106"}
{"itemId": "441"}
...

Accordingly, I wrote the following code to transform my item_ids column into the described JSON format:

    import json
    import boto3
    import pandas as pd

    # convert item_id column to required JSON format with new lines entered between items
    items_json = items_df['ITEM_ID'][1:200].to_json(orient='columns').replace(',','}\n{')

    # write output to json file
    with open('items_json.json', 'w') as f:
        json.dump(items_json, f)

    # write file to S3 (connects with the default profile)
    s3 = boto3.client('s3')

    s3.put_object(
         Body=json.dumps(items_json),
         Bucket='bucket',
         Key='personalize/batch-recommendations-input/items_json.json'
    )

Then when I run the batch inference job with that as input, it gives the following error: "User error: Input JSON is malformed."

My sample JSON input looks as follows:

    "{"itemId":"12637"} {"itemId":"12931"} {"itemId":"13005"}"

and after copying it to S3 it looks as follows (backslashes were added to it; I don't know if that's significant in any way):

    "{\"itemId\":\"12637\"}\n{\"itemId\":\"12931\"}\n{\"itemId\":\"13005\"}"

To me, my format looks quite similar to what they asked for. Any clue what might be causing the error?


Solution

  • You just need some small changes to the use of to_json. Specifically, orient should be records and lines should be True.

    Full example:

    import pandas as pd
    import boto3
    
    items_df = pd.read_csv("...")
    
    # Make sure item ID column name is "itemId"
    item_ids_df = items_df.rename(columns={"ITEM_ID": "itemId"})[["itemId"]]
    
    # Write df to file in JSON lines format
    item_ids_df.to_json("job_input.json", orient="records", lines=True)
    
    # Upload to S3
    bucket = "your-bucket-name"  # replace with your bucket
    boto3.Session().resource('s3').Bucket(bucket).Object("job_input.json").upload_file("job_input.json")
    

    Lastly, you mentioned that the maximum number of input items is 500. Actually, your input file can have up to 50M input items or a file size of 1GB.
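    As a sanity check (using made-up item IDs, not your real catalog), the sketch below reproduces the double-encoding that produced the backslashes you saw in S3 — serializing an already-formatted string with `json.dumps` wraps it in quotes and escapes the inner quotes — and then verifies that `to_json(orient="records", lines=True)` emits valid JSON Lines, one standalone object per line:

    ```python
    import json
    import pandas as pd

    # Hypothetical sample data standing in for the real item catalog
    items_df = pd.DataFrame({"ITEM_ID": ["12637", "12931", "13005"]})

    # What the original code effectively did: serialize a Python *string*
    # with json.dumps, which wraps it in quotes and escapes the inner
    # quotes -- hence the backslashes in the uploaded file.
    handmade = '{"itemId":"12637"}\n{"itemId":"12931"}\n{"itemId":"13005"}'
    double_encoded = json.dumps(handmade)
    print(double_encoded)  # a single quoted string full of \" escapes

    # The fix: let pandas emit JSON Lines directly
    item_ids_df = items_df.rename(columns={"ITEM_ID": "itemId"})[["itemId"]]
    jsonl = item_ids_df.to_json(orient="records", lines=True)
    print(jsonl)

    # Every line is now a standalone JSON object, as Personalize expects
    for line in jsonl.splitlines():
        assert set(json.loads(line).keys()) == {"itemId"}
    ```

    Writing the `to_json` output straight to a file (or uploading the file as-is) avoids the extra `json.dump`/`json.dumps` pass entirely, which is what was corrupting the input.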