aws-glueamazon-athenaaws-glue-data-cataloghive-serde

Utility that will create an AWS Athena table definition from AWS Glue catalog so I can add a WITH SERDEPROPERTIES section


[Update: looks like a aws glue get-table --database-name xyz --name tablename will give me the raw materials for the table definition, so that's progress--just wondering if something exists that automatically assembles the pieces]

[Update 2: You can have Cloudtrail show an Athena table definition, without using glue or a glue crawler as discussed in this CloudTrail specific Athena documentation. It's still got some quirks (not defining some of the more complex data elements as structs), but it's better than what glue generates if pointed at cloudtrail.]

I've got a Glue cataloged S3 bucket with cloud trail logs. The JSON nests pretty deep. Down deep, there are some properties like encryptionContext:struct<aws\:cloudtrail\:arn:string,aws\:s3\:arn:string>. Those colons in property names are throwing off Athena and causing queries to fail.

I've found out it would be easy to supply a mapping via Serde that could overcome this, but for that, I have to generate an external table definition for Athena, so that I can write a WITH SERDEPROPERTIES section in the table definition.

I'd like to auto-generate the Athena table definition (fairly complex) from the Glue catalog. Does anyone have pointers to such code or utility?

The approach to mapping the offending property names is described in this AWS Big Data Blog post, in the section "Walkthrough: Handling forbidden characters with mappings".

Here's an example of the error I'm currently getting when I do a simple select from the Athena table whose metadata is provided by Glue:

HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: : expected at the position 997 of 'struct,stackName:string,keyId:string,aggregateField:string,filter:struct,startTimes:array>>,host:array,bucketName:string,location:array,roleArn:string,roleSessionName:string,templateURL:string,encryption:array,parameters:array>,includeShared:boolean,topicArn:string,policyName:string,attributeName:string,attributeValue:string,protocol:string,endpoint:string,returnSubscriptionArn:boolean,name:string,attributes:struct,eventCategory:string,maxResults:int,notificationARNs:array,capabilities:array,tags:array,disableRollback:boolean,lookupAttributes:array>,policy:string,description:string,keyUsage:string,customerMasterKeySpec:string,origin:string,bypassPolicyLockoutSafetyCheck:boolean,aliasName:string,targetKeyId:string,trailName:string,encryptionContext:struct<aws:cloudtrail:arn:string,aws:s3:arn:string>,keySpec:string,trailNameList:array,includeShadowTrails:boolean,s3BucketName:string,s3KeyPrefix:string,snsTopicName:string,includeGlobalServiceEvents:boolean,isMultiRegionTrail:boolean,enableLogFileValidation:boolean,kmsKeyId:string,bucketPolicy:struct,Sid:string,Condition:struct>>>>,eventSelectors:array,excludeManagementEventSources:array>>,ServerSideEncryptionConfiguration:struct>>,tagging:array,Tagging:struct>>>,x-amz-acl:array,resourceIdList:array,logging:array,website:array,lifecycle:array,notification:array,versioning:array,publicAccessBlock:array,acl:array,cors:array,object-lock:array,requestPayment:array,replication:array,resourceArn:string,DescribeFlowLogsRequest:string>' but '\' is found. (Service: null; Status Code: 0; Error Code: null; Request ID: null)


Solution

  • The usual way to get the DDL from a Glue table is to run SHOW CREATE TABLE foo, but since Glue has created a table that does not work in Athena, I assume this fails.

    In your specific case you should just us the schema suggested by the Athena documentation on querying CloudTrail. Apart from the partitioning it's as good as it gets. As you know, there are free-form properties that are service-dependent in CloudTrail events, and there is no schema that will capture everything (even if there was, as soon as a new service was launched it would be outdated). Stick with string for the columns corresponding to the free-form properties and use Athena/Presto's JSON functions to query these columns.

    I would make one small modification to the schema and have a different set of partition keys. The docs use region, year, month, day. You may want to add account ID and organization ID (if your trail is an organization trail) – but more importantly you should not have year, month, and date as separate partition keys, it makes querying date ranges unnecessarily complicated. There is no reason not to simply use "date" (or "dt" if you want to avoid having to quote it), typed either as a string or a date as the partition key, and add partitions like this:

    ALTER TABLE cloud_trail ADD 
    PARTITION (account_id = '1234567890', region = 'us-east-1', dt = '2020-05-17')
    LOCATION 's3://trails/AWSLogs/Account_ID/CloudTrail/us-east-1/2020/05/17/'
    

    Just because something is separated by slashes in the S3 key doesn't mean it has to be separate partition keys. Using a single key makes it easy to do range queries like WHERE "date" BETWEEN '2019-12-01' AND '2020-06-01'.

    Glue Crawlers are pretty terrible when you don't hit the use case they were intended for spot on, and I'm not surprised at all that it creates an unusable schema. In a way it's pretty amazing that there are so many cases of AWS services that predate Glue and where Glue just creates unusable results when used on their outputs.

    For CloudTrail the schema-discovery aspect of Glue Crawlers is not necessary, and will probably mostly cause problems due to the free-form properties. The other aspect, adding new partitions, can instead be solved with a Lambda function running once per day adding the next day's partitions (since tomorrow's partitions are deterministic you don't have to wait until there is data to add partitions).