python · aws-glue · aws-glue-data-catalog · amazon-athena

AWS Glue create_partition using boto3 successful, but Athena not showing results for query


I have a Glue script that creates new partitions using create_partition(). The script runs successfully, and I can see the partitions in the Athena console when using SHOW PARTITIONS. For the create_partition calls, I referred to this sample code: https://medium.com/@bv_subhash/demystifying-the-ways-of-creating-partitions-in-glue-catalog-on-partitioned-s3-data-for-faster-e25671e65574


When I run an Athena query against one of the newly added partitions, I get no results.


Do I need to run the MSCK REPAIR TABLE command even when I add the partitions using create_partition? Any suggestions are appreciated.


Solution

  • I found the solution myself and wanted to share it with the SO community, in case it is useful to someone else. The following code, when run as a Glue job, creates partitions that can then be queried in Athena on the new partition columns. Change/add the parameter values (database name, table name, partition columns) as needed.

    import boto3
    import sys
    
    # Configure database / table name; emp_id and file_id could come from workflow params
    DATABASE_NAME = 'my_db'
    TABLE_NAME = 'enter_table_name'
    emp_id_tmp = ''
    file_id_tmp = ''
    
    # Initialise the Glue client using boto3
    glue_client = boto3.client('glue')
    
    # Get the current table schema for the given database name & table name
    def get_current_schema(database_name, table_name):
        try:
            response = glue_client.get_table(
                DatabaseName=database_name,
                Name=table_name
            )
        except Exception as error:
            print("Exception while fetching table info:", error)
            sys.exit(-1)
        
        # Parsing table info required to create partitions from table
        table_data = {}
        table_data['input_format'] = response['Table']['StorageDescriptor']['InputFormat']
        table_data['output_format'] = response['Table']['StorageDescriptor']['OutputFormat']
        table_data['table_location'] = response['Table']['StorageDescriptor']['Location']
        table_data['serde_info'] = response['Table']['StorageDescriptor']['SerdeInfo']
        table_data['partition_keys'] = response['Table']['PartitionKeys']
        
        return table_data
    
    # Prepare the partition input list using table_data and the emp_id / file_id values
    def generate_partition_input_list(table_data):
        input_list = []  # Initializing empty list
        part_location = "{}/emp_id={}/file_id={}/".format(table_data['table_location'], emp_id_tmp, file_id_tmp)
        input_dict = {
            'Values': [
                emp_id_tmp, file_id_tmp  # order must match the table's PartitionKeys
            ],
            'StorageDescriptor': {
                'Location': part_location,
                'InputFormat': table_data['input_format'],
                'OutputFormat': table_data['output_format'],
                'SerdeInfo': table_data['serde_info']
            }
        }
        input_list.append(input_dict.copy())
        return input_list
    
    # Create the partition dynamically using the partition input list
    table_data = get_current_schema(DATABASE_NAME, TABLE_NAME)
    input_list = generate_partition_input_list(table_data)
    try:
        create_partition_response = glue_client.batch_create_partition(
                DatabaseName=DATABASE_NAME,
                TableName=TABLE_NAME,
                PartitionInputList=input_list
            )
        print('Glue partition created successfully.')
        print(create_partition_response)
    except Exception as e:
        # Handle exception as per your business requirements
        print(e)
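Since batch_create_partition accepts up to 100 PartitionInput entries per call, the single-partition helper above generalizes naturally to creating several partitions in one request. A minimal sketch of that idea (the helper name and the list-of-pairs argument are my own, not part of the original job; it assumes the same emp_id/file_id partition layout):

```python
# Hypothetical helper: build PartitionInput entries for several
# (emp_id, file_id) pairs at once, to pass to batch_create_partition.
def generate_partition_inputs(table_data, id_pairs):
    input_list = []
    for emp_id, file_id in id_pairs:
        location = "{}/emp_id={}/file_id={}/".format(
            table_data['table_location'], emp_id, file_id)
        input_list.append({
            'Values': [emp_id, file_id],  # order must match PartitionKeys
            'StorageDescriptor': {
                'Location': location,
                'InputFormat': table_data['input_format'],
                'OutputFormat': table_data['output_format'],
                'SerdeInfo': table_data['serde_info'],
            },
        })
    return input_list
```

The resulting list can be passed directly as PartitionInputList; note that the API reports per-entry failures (e.g. AlreadyExistsException) in the response's Errors field rather than raising, so it is worth checking that field after the call.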