amazon-web-servicesamazon-cloudwatchamazon-cloudwatchlogsaws-cloudwatch-log-insights

CloudWatch Insights Query - How to get a single count from counts


I have a log file which contains playerId values, some players have multiple entries in the file. I want to get an exact distinct count of to unique players, regardless of if they have 1 or multiple entries in the log file.

Using the query below it scans 497 records and finds 346 unique rows (346 is the number I want) Query:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/ 
| parse @message "\"playerId\": \"*\"" as playerId
| stats count(playerId) as CT by playerId

497 records 346 rows

If I change my query to use count_distinct instead, I get exactly what I want. Example below:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/ 
| parse @message "\"playerId\": \"*\"" as playerId
| stats count_distinct(playerId) as CT 

with count_distinct function

The problem with count_distinct however is that as the query expands to a larger timeframe/more records the number of entries get into the thousands, and tens of thousands. This presents an issue as the numbers become approximations, due to the nature of Insights count_distinct behaviour...

"Returns the number of unique values for the field. If the field has very high cardinality (contains many unique values), the value returned by count_distinct is just an approximation.".

Docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html

This is not acceptable, as I require exact numbers. Playing with the query a little, and sticking with count(), not count_distinct() I believe is the answer, however I've not been able to come to a single number... Examples which do not work... Any thoughts?

Ex 1:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/ 
| parse @message "\"playerId\": \"*\"" as playerId
| stats count(playerId) as CT by playerId
| stats count(*) 

We are having trouble understanding the query.

To be clear, I'm looking for an exact count to be returned in a single row showing the number.


Solution

  • What if we introduce a dummy field that's hardcoded to "1"? The idea is to retrieve its min value so that it stays as a "1" even if the same playerId occurs more than once. And then we sum this field.

    The log entry might look like this:

    [1]"playerId": "1b45b168-00ed-42fe-a977-a8553440fe1a"
    

    Query:

    fields @timestamp, @message
    | sort @timestamp desc
    | filter @message like /(playerId)/ 
    | parse @message "[*]\"playerId\": \"*\"" as dummyValue, playerId
    | stats sum(min(dummyValue)) by playerId as CT 
    

    References used: