Consider CloudFront Logs injected by AWS are stored in S3 with pattern:
s3://aws-cloudfront-log-[AWS Account ID]/[Any prefix I preferred]/E[CloudFront Distribution ID].[Year]-[Month]-[Day]-[Hour].[Hash].gz
Here is an example
s3://aws-cloudfront-log-1290287349012/my-app/E12KDDSA1S7.2024-06-26-23.9e4a2b9e.gz
According to official doc, this is the Athena DDL for CloudFront standard log: https://docs.aws.amazon.com/athena/latest/ug/cloudfront-logs.html#create-cloudfront-table-standard-logs
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_standard_logs (
`date` DATE,
time STRING,
x_edge_location STRING,
sc_bytes BIGINT,
c_ip STRING,
cs_method STRING,
cs_host STRING,
cs_uri_stem STRING,
sc_status INT,
cs_referrer STRING,
cs_user_agent STRING,
cs_uri_query STRING,
cs_cookie STRING,
x_edge_result_type STRING,
x_edge_request_id STRING,
x_host_header STRING,
cs_protocol STRING,
cs_bytes BIGINT,
time_taken FLOAT,
x_forwarded_for STRING,
ssl_protocol STRING,
ssl_cipher STRING,
x_edge_response_result_type STRING,
cs_protocol_version STRING,
fle_status STRING,
fle_encrypted_fields INT,
c_port INT,
time_to_first_byte FLOAT,
x_edge_detailed_result_type STRING,
sc_content_type STRING,
sc_content_len BIGINT,
sc_range_start BIGINT,
sc_range_end BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://DOC-EXAMPLE-BUCKET/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
I have tried to modify the DDL SQL to the following so that it is partitioned by date
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs.editorial (
-- same set of columns
)
PARTITIONED BY(
date_filter string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://aws-cloudfront-log-1290287349012/my-app/'
TBLPROPERTIES (
'skip.header.line.count'='2',
'projection.date_filter.format'='yyyy-MM-dd-HH',
'projection.date_filter.interval'='1',
'projection.date.interval.unit'='HOURS',
'projection.date_filter.range'='2021-01-01-00,NOW',
'projection.date_filter.type'='date',
'projection.enabled'='true',
'storage.location.template'='s3://aws-cloudfront-log-1290287349012/my-app/E12KDDSA1S7.${date}'
);
However when I run select * FROM cloudfront_logs.editorial WHERE date_filter = '2024-06-26-23' LIMIT 1;
, there is nothing returned.
What is missing or wrong? It seems Athena only support partition by folder level but not filename level although for S3 objects the so call folder is not really a folder but just part of the object key in most cases.
I also found this AWS Blog which is kind of hilarious as I have to pay and maintain a set of components to fix the bad logging folder structure which AWS should cater at the beginning. https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/