amazon-web-servicesamazon-athenapartition

How to query on AWS Athena partitioned table


Summary of the Question

When I attempts to SELECT query the partitioned table with WHERE clause, Athena produce an error.

There are 4 types of partitions, in my log table.

I tried SELECT query on partitioned table. But got the below error message.

ERROR Message

GENERIC_INTERNAL_ERROR: No value present
This query ran against the "default" database, unless qualified by the query.

SELECT Query that I Tried

SELECT *
FROM logs
WHERE year='2020'
  AND month='10'
  AND day ='05';

AND

SELECT *
FROM "default"."logs"
WHERE year='2020'
  AND month='10'
  AND day ='05';

Since error message about No value present, I checked the partitions results.

SHOW PARTITIONS logs;

Results

year=2020/month=10/day=05/hour=17
year=2020/month=10/day=05/hour=11
year=2020/month=10/day=05/hour=19
year=2020/month=10/day=05/hour=04
year=2020/month=10/day=05/hour=18
year=2020/month=10/day=05/hour=15
year=2020/month=10/day=05/hour=14
year=2020/month=10/day=05/hour=16
year=2020/month=10/day=05/hour=13
year=2020/month=10/day=05/hour=21
year=2020/month=10/day=05/hour=05
year=2020/month=10/day=05/hour=08
year=2020/month=10/day=05/hour=20
year=2020/month=10/day=05/hour=12
year=2020/month=10/day=05/hour=03
year=2020/month=10/day=05/hour=01
year=2020/month=10/day=05/hour=10
year=2020/month=10/day=05/hour=02
year=2020/month=10/day=05/hour=09
year=2020/month=10/day=05/hour=22
year=2020/month=10/day=05/hour=23
year=2020/month=10/day=05/hour=06
year=2020/month=10/day=05/hour=07
year=2020/month=10/day=05/hour=00
year=2020/month=10/day=04/hour=00

I would greatly appreciate your help.


Further Info

CREATE TABLE command that I used

Create Table

CREATE EXTERNAL TABLE `logs`(
  `date` date, 
  `time` string, 
  `location` string, 
  `bytes` bigint, 
  `request_ip` string, 
  `method` string, 
  `host` string, 
  `uri` string, 
  `status` int, 
  `referrer` string, 
  `user_agent` string, 
  `query_string` string, 
  `cookie` string, 
  `result_type` string, 
  `request_id` string, 
  `host_header` string, 
  `request_protocol` string, 
  `request_bytes` bigint, 
  `time_taken` float, 
  `xforwarded_for` string, 
  `ssl_protocol` string, 
  `ssl_cipher` string, 
  `response_result_type` string, 
  `http_version` string, 
  `fle_status` string, 
  `fle_encrypted_fields` int)
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string, 
  `hour` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
WITH SERDEPROPERTIES ( 
  'input.regex'='^(?!#)([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)\\\\s+([^ \\\\t]+)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mybucket/path'
TBLPROPERTIES (
  'projection.date.format'='yyyy/MM/dd', 
  'projection.date.interval'='1', 
  'projection.date.interval.unit'='DAYS', 
  'projection.date.range'='2019/11/27, NOW-1DAYS', 
  'projection.date.type'='date', 
  'projection.day.type'='string', 
  'projection.enabled'='true', 
  'projection.hour.type'='string', 
  'projection.month.type'='string', 
  'projection.year.type'='string', 
  'skip.header.line.count'='2', 
  'storage.location.template'='s3://mybucket/path/distributionID/${year}/${month}/${day}/${hour}/', 
  'transient_lastDdlTime'='1575005094')

Solution

  • Your table uses partition projection, but your configuration does not match the partitioning correctly. Partition projection is a fairly new feature, and the documentation still leaves a bit to be desired so I completely understand that it's confusing. I think I see what you're trying to do.

    The partition projection configuration must exactly match the partition keys of the table. In your case there are four partition keys of the table, and the partition projection configuration mentions five. In addition to that the types of four are wrong, there is no string partition projection type.

    You can fix the problem by making two changes. First change the partition keys like this:

    PARTITIONED BY (
      `date` string,
      `hour` string
    )
    

    This removes the "year", "month", and "day" partition keys in favour of a "date" key. Having separate date components just because they are separate "directories" is not necessary, and just having a "date" key will make queries much easier to write.

    Then you change the table properties to this:

    TBLPROPERTIES (
      'projection.date.format' = 'yyyy/MM/dd', 
      'projection.date.interval' = '1', 
      'projection.date.interval.unit' = 'DAYS', 
      'projection.date.range' = '2019/11/27, NOW-1DAYS', 
      'projection.date.type' = 'date', 
      'projection.hour.type' = 'integer',
      'projection.hour.range' = '0-23',
      'projection.hour.digits' = '2',
      'projection.enabled' = 'true',
      'storage.location.template'='s3://mybucket/path/distributionID/${date}/${hour}/', 
      'skip.header.line.count' = '2'
    )
    

    This tells Athena that the "date" partition key is of type date, and that it's formatted as "YYYY/MM/DD" (which corresponds to the format in the S3 URIs, this is important). It also tells Athena that the "hour" partition key is an integer with range 0-23, formatted with two digits (i.e. zero-filled). Finally it specifies how these partition keys map to the locations of partitions on S3. When the date in a query is "2020/10/06" that string will be inserted verbatim in the location template.

    With those changes you should be able to run queries like the following ("date" is a reserved word and must be quoted when it's the name of a column):

    SELECT *
    FROM logs
    WHERE "date" = '2020/10/06'
    
    SELECT *
    FROM logs
    WHERE "date" BETWEEN '2020/10/01' AND '2020/10/06'
    AND hour BETWEEN 9 AND 21
    

    Note that the date format must be exactly like the format in the partition projection configuration, i.e. YYYY/MM/DD.