I have a table with ORC
Serde in Athena. The table contains a string column named greeting_message
. It can contain null values as well. I want to find how many rows in the table have a particular text as the pattern.
Let's say my sample data looks like below:
|greeting_message |
|-----------------|
|hello world |
|What's up |
| |
|hello Sam |
| |
|hello Ram |
|good morning, hello |
| |
|the above row has null |
| Good morning Sir |
Now for the above table, if we see there are a total of 10 rows. 7 of them are having not null values and 3 of them just has null/empty value.
I want to know what percentage of rows contain a specific word.
For example, consider the word hello
. It is present in 4 rows, so the percentage of such rows is 4/10 which is 40 %.
Another example: the word morning
is present in 2 messages. So the percentage of such rows is 2/10 which is 20 %.
Note that I am considering null
also in the count of the denominator.
The syntax of prestoDB (Amazon Athena engine) is different than MySQL. The following example is creating a temp table WITH greetings AS
and then SELECT
from that table:
WITH greetings AS
(SELECT 'hello world' as greeting_message UNION ALL
SELECT 'Whats up' UNION ALL
SELECT '' UNION ALL
SELECT 'hello Sam' UNION ALL
SELECT '' UNION ALL
SELECT 'hello Ram' UNION ALL
SELECT 'good morning, hello' UNION ALL
SELECT '' UNION ALL
SELECT 'the above row has null' UNION ALL
SELECT 'Good morning Sir')
SELECT count_if(regexp_like(greeting_message, '.*hello.*')) / cast(COUNT(1) as real) AS hello_percentage,
count_if(regexp_like(greeting_message, '.*morning.*')) / cast(COUNT(1) as real) AS morning_percentage
FROM greetings
will give the following results
hello_percentage | morning_percentage |
---|---|
0.4 | 0.2 |
The regex_like function can support many regex options including spaces (\s) and other string matching requirements.