sqlimpala

APACHE IMPALA SQL Average calculating


I need help to calculate the average of picks/lines by an area.
AREA = CLASSIFICATION_NAME
PICKS/Lines = PICKS

The average should be an hourly average.

I have an Article Table:

| ARTICLE_ID | CLASSIFICATION_NAME |
| ---------- | ------------------- |
| 1          | 15deg_A_class       |
| 2          | 15deg_B_class       |
| 3          | 15deg_C_class       |
| 4          | 4deg_C_class        |
| 5          | frozen_manual_class |

And I have a pick_table:

|ID |Datetime           |Picks|Article|
|-- |-------------------|-----|-------|
|1  |16.05.2023 11:50:00|1    |1      |
|2  |16.05.2023 11:02:00|1    |3      |
|3  |16.05.2023 11:02:01|1    |3      |
|...|...                |...  |...    |

For each pick line which exists the column PICK is set with 1. One day has about 40000 picks in all.

I join this two tables like:

SELECT
from_timestamp(to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss'), 'dd.MM.yyyy HH:mm:ss') AS pick_date,
date_part('hour', to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss')) AS pick_hour,
CASE WHEN article.classification_name LIKE '15deg_%' THEN pick_table.picks END AS '15DEG',
pick_table.picks AS picks
FROM
pick_table
JOIN article ON article.id = pick_table.article;

I tried to get the average like:

SELECT
date_part('hour', to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss')) AS pick_hour,
AVG(CASE WHEN article.classification_name LIKE '15deg_%' THEN pick_table.picks END) AS '15DEG'
FROM
pick_table
JOIN article ON article.id = pick_table.article
GROUP BY date_part('hour', to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss'))
ORDER BY date_part('hour', to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss')) desc;

And also tried it like:

WITH test AS(
SELECT
from_timestamp(to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss'), 'dd.MM.yyyy HH:mm:ss') AS pick_date,
    date_part('hour', to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss')) AS pick_hour,
    SUM(CASE
        WHEN article.classification_name LIKE '15deg_%' THEN picks END) AS '15DEG',
    SUM(CASE     
        WHEN article.classification_name LIKE '4deg_%' THEN picks END) AS '4DEG',
    SUM(CASE 
        WHEN article.classification_name LIKE 'frozen%' THEN picks END) AS 'FROZEN',
    SUM(CASE 
        WHEN article.classification_name LIKE '15deg_A%' THEN picks END) AS 'A',
    SUM(CASE 
        WHEN article.classification_name LIKE '15deg_B%' THEN picks END) AS 'B',
    pick_table.picks AS picks
FROM pick_table
JOIN article ON article.id = pick_table.article
GROUP BY
from_timestamp(to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss'), 'dd.MM.yyyy HH:mm:ss'),
date_part('hour', to_timestamp(CONCAT(substr(pick_table.datetime,1,10),' ', substr(pick_table.datetime,12,8)),'yyyy-MM-dd HH:mm:ss')),
picks)
SELECT
pick_hour,
AVG(15DEG) AS '15DEG',
AVG(4DEG) AS '4DEG',
AVG(FROZEN) AS 'FROZEN',
AVG(A) AS 'A',
AVG(B) AS 'B'
FROM test
WHERE substr(pick_date,1,10) >= '14.05.2023' AND substr(pick_date,1,10) < '15.05.2023' 
GROUP BY pick_hour
ORDER BY pick_hour desc;

I don't get the result I want to have.

Result should be like:

|Hour|15DEG|4DEG|FROZEN|A  |B   |
|----|-----|----|------|---|----|
|0   |2560 |500 |0     |150|1200|
|1   |3750 |480 |1     |120|1150|
|... |...  |... |...   |...|... |

And at least in front end I would have an output like:

|Average  |15DEG|4DEG|FROZEN|A  |B   |
|---------|-----|----|------|---|----|
|Last Hour|4032 |2667|377   |571|3928|
|Today    |3904 |2835|384   |573|3821|
|Daily AVG|3967 |2751|381   |562|3870|

How an I do that? Can anybody help me please. Thank you.


Solution

  • Tested on PostgreSQL, so you may have to adapt a little bit:

    First, the test tables and data :

    create table article as select * from (values
    (1, '15deg_A_class'),
    (2, '15deg_B_class'),
    (3, '15deg_C_class'),
    (4, '4deg_C_class'),
    (5, 'frozen_manual_class')) l(article_id, classification_name);
    
    select * from article;
    
     article_id | classification_name 
    ------------+---------------------
              1 | 15deg_A_class
              2 | 15deg_B_class
              3 | 15deg_C_class
              4 | 4deg_C_class
              5 | frozen_manual_class
    
    create table pick_table as select * from (values
    (1, '2023-05-16 11:50:00'::timestamp, 1, 1),
    (2, '2023-05-16 11:02:00', 1, 3),
    (3, '2023-05-16 11:02:01', 1, 3)) l(id, datetime, picks, article);
    
    select * from pick_table;
    
     id |      datetime       | picks | article 
    ----+---------------------+-------+---------
      1 | 2023-05-16 11:50:00 |     1 |       1
      2 | 2023-05-16 11:02:00 |     1 |       3
      3 | 2023-05-16 11:02:01 |     1 |       3
    

    The trick is to group by classification and hour like that :

    select substring(a.classification_name, 1, 5) as the_class,
           date_trunc('HOUR', datetime) as the_hour,
           sum(picks) nbpicks
    from pick_table p join article a on a.article_id = p.article
    group by the_class, the_hour
    
     the_class |      the_hour       | nbpicks 
    -----------+---------------------+---------
     15deg     | 2023-05-16 11:00:00 |       3
    

    You have also different types of aggregation that overlap each other, so we also need the following query to build the result set for A and B columns:

    select substring(a.classification_name, 1, 7) as the_class,
           date_trunc('HOUR', datetime) as the_hour,
           sum(picks) nbpicks
    from pick_table p join article a on a.article_id = p.article
    where a.classification_name like '15deg%'
    group by the_class, the_hour
    
     the_class |      the_hour       | nbpicks 
    -----------+---------------------+---------
     15deg_A   | 2023-05-16 11:00:00 |       1
     15deg_C   | 2023-05-16 11:00:00 |       2
    

    Then using the result set composed of those two results set above, we pivot it to compute the stats :

    with X as (select substring(a.classification_name, 1, 5) as the_class,
                      date_trunc('HOUR', datetime) as the_hour,
                      sum(picks) nbpicks
               from pick_table p join article a on a.article_id = p.article
               group by the_class, the_hour
               union all
               select substring(a.classification_name, 1, 7) as the_class,
                      date_trunc('HOUR', datetime) as the_hour,
                      sum(picks) nbpicks
               from pick_table p join article a on a.article_id = p.article
               where a.classification_name like '15deg%'
               group by the_class, the_hour)
    select 'Last Hour',
           sum(case when the_class = '15deg' then nbpicks else 0 end) as "15DEG",
           sum(case when the_class = '4deg_' then nbpicks else 0 end) as "4DEG",
           sum(case when the_class = 'froze' then nbpicks else 0 end) as "FROZEN",
           sum(case when the_class = '15deg_A' then nbpicks else 0 end) as "A",
           sum(case when the_class = '15deg_B' then nbpicks else 0 end) as "B"
    from X
    where the_hour = date_trunc('HOUR', now() - interval '1' hour)
    union all
    select 'Today',
           avg(case when the_class = '15deg' then nbpicks else 0 end) as "15DEG",
           avg(case when the_class = '4deg_' then nbpicks else 0 end) as "4DEG",
           avg(case when the_class = 'froze' then nbpicks else 0 end) as "FROZEN",
           avg(case when the_class = '15deg_A' then nbpicks else 0 end) as "A",
           avg(case when the_class = '15deg_B' then nbpicks else 0 end) as "B"
    from X
    where date_trunc('DAY', the_hour) = date_trunc('DAY', now())
    union all
    select 'Daily AVG',
           avg(case when the_class = '15deg' then nbpicks else 0 end) as "15DEG",
           avg(case when the_class = '4deg_' then nbpicks else 0 end) as "4DEG",
           avg(case when the_class = 'froze' then nbpicks else 0 end) as "FROZEN",
           avg(case when the_class = '15deg_A' then nbpicks else 0 end) as "A",
           avg(case when the_class = '15deg_B' then nbpicks else 0 end) as "B"
    from X;