mysqlcontainsamazon-athenaprestostring-function

contains function for String in presto Athena


I have a table with ORC Serde in Athena. The table contains a string column named greeting_message. It can contain null values as well. I want to find how many rows in the table have a particular text as the pattern.

Let's say my sample data looks like below:

|greeting_message |
|-----------------|
|hello world      |
|What's up        |
|                 |
|hello Sam        |
|                 |
|hello Ram        |
|good morning, hello |
|                 |
|the above row has null |
| Good morning Sir |

Now for the above table, if we see there are a total of 10 rows. 7 of them are having not null values and 3 of them just has null/empty value.

I want to know what percentage of rows contain a specific word.

For example, consider the word hello. It is present in 4 rows, so the percentage of such rows is 4/10 which is 40 %.

Another example: the word morning is present in 2 messages. So the percentage of such rows is 2/10 which is 20 %.

Note that I am considering null also in the count of the denominator.


Solution

  • The syntax of prestoDB (Amazon Athena engine) is different than MySQL. The following example is creating a temp table WITH greetings AS and then SELECT from that table:

    WITH greetings AS
      (SELECT 'hello world' as greeting_message UNION ALL
       SELECT 'Whats up' UNION ALL
       SELECT '' UNION ALL
       SELECT 'hello Sam' UNION ALL
       SELECT '' UNION ALL
       SELECT 'hello Ram' UNION ALL
       SELECT 'good morning, hello' UNION ALL
       SELECT '' UNION ALL
       SELECT 'the above row has null' UNION ALL
       SELECT 'Good morning Sir')
    
     SELECT count_if(regexp_like(greeting_message, '.*hello.*')) / cast(COUNT(1) as real) AS hello_percentage, 
           count_if(regexp_like(greeting_message, '.*morning.*')) / cast(COUNT(1) as real) AS morning_percentage 
    FROM greetings
    

    will give the following results

    hello_percentage morning_percentage
    0.4 0.2

    The regex_like function can support many regex options including spaces (\s) and other string matching requirements.