sqlamazon-athenaamazon-waf

Using Athena to get terminatingrule from rulegrouplist in AWS WAF logs


I followed these instructions to get my AWS WAF data into an Athena table.

I would like to query the data to find the latest requests with an action of BLOCK. This query works:

SELECT
  from_unixtime(timestamp / 1000e0) AS date,
  action,
  httprequest.clientip AS ip,
  httprequest.uri AS request,
  httprequest.country as country,
  terminatingruleid,
  rulegrouplist
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;

My issue is cleanly identifying the "terminatingrule" - the reason the request was blocked. As an example, a result has

terminatingrule = AWS-AWSManagedRulesCommonRuleSet

And

rulegrouplist = [
  {
    "nonterminatingmatchingrules": [],
    "rulegroupid": "AWS#AWSManagedRulesAmazonIpReputationList",
    "terminatingrule": "null",
    "excludedrules": "null"
  },
  {
    "nonterminatingmatchingrules": [],
    "rulegroupid": "AWS#AWSManagedRulesKnownBadInputsRuleSet",
    "terminatingrule": "null",
    "excludedrules": "null"
  },
  {
    "nonterminatingmatchingrules": [],
    "rulegroupid": "AWS#AWSManagedRulesLinuxRuleSet",
    "terminatingrule": "null",
    "excludedrules": "null"
  },
  {
    "nonterminatingmatchingrules": [],
    "rulegroupid": "AWS#AWSManagedRulesCommonRuleSet",
    "terminatingrule": {
      "rulematchdetails": "null",
      "action": "BLOCK",
      "ruleid": "NoUserAgent_HEADER"
    },
    "excludedrules":"null"
  }
]

The piece of data I would like separated into a column is rulegrouplist[terminatingrule].ruleid which has a value of NoUserAgent_HEADER

AWS provide useful information on querying nested Athena arrays, but I have been unable to get the result I want.

I have framed this as an AWS question but since Athena uses SQL queries, it's likely that anyone with good SQL skills could work this out.


Solution

  • It's not entirely clear to me exactly what you want, but I'm going to assume you are after the array element where terminatingrule is not "null" (I will also assume that if there are multiple you want the first).

    The documentation you link to say that the type of the rulegrouplist column is array<string>. The reason why it is string and not a complex type is because there seems to be multiple different schemas for this column, one example being that the terminatingrule property is either the string "null", or a struct/object – something that can't be described using Athena's type system.

    This is not a problem, however. When dealing with JSON there's a whole set of JSON functions that can be used. Here's one way to use json_extract combined with filter and element_at to remove array elements where the terminatingrule property is the string "null" and then pick the first of the remaining elements:

    SELECT
      element_at(
        filter(
          rulegrouplist,
          rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
        ),
        1
      ) AS first_non_null_terminatingrule
    FROM waf_logs
    WHERE action = 'BLOCK'
    ORDER BY date DESC
    

    You say you want the "latest", which to me is ambiguous and could mean both first non-null and last non-null element. The query above will return the first non-null element, and if you want the last you can change the second argument to element_at to -1 (Athena's array indexing starts from 1, and -1 is counting from the end).

    To return the individual ruleid element of the json:

    SELECT from_unixtime(timestamp / 1000e0) AS date, action, httprequest.clientip AS ip, httprequest.uri AS request, httprequest.country as country, terminatingruleid, json_extract(element_at(filter(rulegrouplist,rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)  ),1), '$.terminatingrule.ruleid') AS ruleid
    FROM waf_logs
    WHERE action='BLOCK'
    ORDER BY date DESC