Tags: python, amazon-web-services, anaconda3, amazon-comprehend

How to merge the AWS Comprehend batch_detect_key_phrases() ResultList and ErrorList


I have a dataframe of tweets, one tweet per row. I can obtain the key phrases using AWS Comprehend's batch_detect_key_phrases(), which returns a ResultList and an ErrorList in the payload. To merge the key phrase results back into the dataframe, they must line up with the original tweets, so I need to keep the ResultList and ErrorList in alignment.

The code here on line 267 processes the ErrorList and ResultList separately.

According to the Python Boto docs, "ErrorList (list) -- A list containing one object for each document that contained an error. The results are sorted in ascending order by the Index field and match the order of the documents in the input list..."

The code I wrote below uses the Index numbers in the ResultList and ErrorList to merge them in the right order into a keyPhrases list, which is then merged back into the original dataframe. Essentially, keyPhrases[0] holds the key phrases for dataframe row 0. If there was an error processing a tweet, a placeholder error message is added to that row of the dataframe instead.

The only other way I could think of to keep the ResultList and ErrorList in alignment would be to merge the two lists into one larger list sorted in ascending order by Index, and then process that single list.

Is there an easier way to process ResultList and ErrorList such that they are kept in alignment?

keyphraseResults = {'ResultList': [
            {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]}, 
            {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},                 
            {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]}, 
            {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}], 
'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
              {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}], 
'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}

# Holds the ordered list of key phrases that correspond to the data frame. 
keyPhrases = []

# Sentinel: an arbitrarily large number so that, if ErrorList is empty,
# there is still a value to compare against below.
errIndexlist = [9999]

# This will be inserted for the rows corresponding to the ErrorList. 
ErrorMessage = "* Error processing keyphrases"

# Since the rows of the response need to be kept in alignment with the rows of the dataframe,
# get the error indices first, if any. These will be compared to the ResultList below.
if 'ErrorList' in keyphraseResults and len(keyphraseResults['ErrorList']) > 0:
    batchErrorResults = keyphraseResults["ErrorList"]
    errIndexlist = []

    for entry in batchErrorResults:
        errIndexlist.append(entry["Index"])
        print(entry)

# Sort the indices to ensure they are in ascending order since that order is
# important for the logic below.
errIndexlist.sort()

if 'ResultList' in keyphraseResults:

    batchResults = keyphraseResults["ResultList"]

    for entry in batchResults:

        # Insert a placeholder for every error whose Index comes before this
        # result. The 9999 sentinel never satisfies this test.
        while len(errIndexlist) > 0 and errIndexlist[0] < entry['Index']:
            keyPhrases.append(ErrorMessage)
            errIndexlist.pop(0)

        # Then add the comma-separated key phrases for the current result.
        results = ", ".join(textDict['Text'] for textDict in entry["KeyPhrases"])
        keyPhrases.append(results)

# Any remaining entries are errors located after the last result
# (skip the 9999 sentinel used when ErrorList is empty).
for errIndex in errIndexlist:
    if errIndex != 9999:
        keyPhrases.append(ErrorMessage)

print("\nFinal results are:")
for text in keyPhrases:
    print(text)

Solution

  • I figured it out based on this SO post.

    In short: merge the ResultList and ErrorList, sort the merged list on Index, then process the merged list sequentially.

    from operator import itemgetter
    
    keyphraseResults = {'ResultList': [
            {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]}, 
            {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},                 
            {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]}, 
            {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}], 
            'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
              {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}], 
            'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b',   'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}
    
    keyPhrases = []
    
    # Placeholder inserted for the rows listed in ErrorList (set it to "" to leave them empty).
    ErrorMessage = "* Error processing keyphrases"
    
    # Concatenation builds a new list, so the original response is left untouched
    # and an empty ResultList or ErrorList needs no special handling.
    processResults = keyphraseResults["ResultList"] + keyphraseResults["ErrorList"]
    
    processResults = sorted(processResults, key=itemgetter('Index'))
    
    for entry in processResults:
    
        if 'ErrorCode' in entry:
            keyPhrases.append(ErrorMessage)
    
        elif 'KeyPhrases' in entry:
            # Join the phrase texts with ", " (an empty KeyPhrases list gives "").
            results = ", ".join(textDict['Text'] for textDict in entry["KeyPhrases"])
    
            keyPhrases.append(results)
    
    print("\nFinal results are:")
    for text in keyPhrases:
        print(text)
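
Since each entry's Index is its position in the input list, another option is to skip the merge and sort altogether: pre-fill a list with the placeholder and write each result into its own slot. The sketch below assumes the number of submitted documents is known. The names align_key_phrases, n_docs and placeholder are illustrative and not part of boto3, and the sample is a trimmed copy of the response above.

```python
def align_key_phrases(response, n_docs, placeholder="* Error processing keyphrases"):
    """Return one string per input document, positioned by the Index field."""
    # Start with the placeholder in every slot. Documents reported in
    # ErrorList never get overwritten, so they keep the placeholder.
    aligned = [placeholder] * n_docs
    for entry in response.get("ResultList", []):
        aligned[entry["Index"]] = ", ".join(
            phrase["Text"] for phrase in entry["KeyPhrases"]
        )
    return aligned


# Trimmed sample with the same Index layout as the response above.
sample = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "financial status"}, {"Text": "my job"}]},
        {"Index": 1, "KeyPhrases": [{"Text": "Holy moley"}, {"Text": "Batman"}]},
        {"Index": 3, "KeyPhrases": [{"Text": "USA"}]},
        {"Index": 5, "KeyPhrases": [{"Text": "home town"}]},
    ],
    "ErrorList": [
        {"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
        {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4},
    ],
}

keyPhrases = align_key_phrases(sample, n_docs=6)
for text in keyPhrases:
    print(text)
```

With this approach the ErrorList only needs to be read if you want to log or store the individual error messages. The resulting list has exactly one entry per input row, so it can be assigned to the dataframe as a new column.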