sqlhadoopimpalaqsqlquery

Combining Aggregate Function with resampling in Impala


I have Table in Hadoop in which I have data for different sensor units with a sampling time ts of 1 mSec. I can resample the data for a single unit with a combination of different aggregate functions using the following query in Impala (Let's say I want to resample the data for each 5 minute using LAST_VALUE() as aggregate function):

SELECT DISTINCT * 
from ( select ts_resample, unit,
last_value(Val1) over (partition by ts_resample order by ts rows between unbounded preceding and unbounded following) as Val1, 
last_value(Val2) over (partition by ts_resample order by ts rows between unbounded preceding and unbounded following) as Val2
from (
SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/300 as bigint)*300 as TIMESTAMP) as ts_resample, 
ts as ts, unit as unit, Val1 as Val1, Val2 as Val2
FROM Sensor_Data.Table1 WHERE unit='Unit1') as t) as tt

If I am running this query for a single unit then I am getting the correct asnwer and there is no issue.

But if I want to resample the data for each Unit based on some aggregation function e.g. LAST_VALUE() then I am getting the wrong answer and resample result for each unit is same although I have different data for each unit. The query that I am running is given below where I am not specifying any unit in WHERE clause:

SELECT DISTINCT * 
from(
select ts_resample, unit,
last_value(Val1) over (partition by ts_resample order by ts rows between unbounded preceding and unbounded following) as Val1, 
last_value(Val2) over (partition by ts_resample order by ts rows between unbounded preceding and unbounded following) as Val2
from (
SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/300 as bigint)*300 as TIMESTAMP) as ts_resample, 
ts as ts, unit as unit, Val1 as Val1, Val2 as Val2
FROM Sensor_Data.Table1) as t) as tt

The result obtained using the above query for present three units in the data is given below:

ts_resample             unit    Val1    Val2
2020-12-01 00:00:00     unit1   0.8974  10.485
2020-12-01 00:00:00     unit2   0.8974  10.485
2020-12-01 00:00:00     unit3   0.8974  10.485
2020-12-01 00:05:00     unit1   0.9041  11.854
2020-12-01 00:05:00     unit2   0.9041  11.854
2020-12-01 00:05:00     unit3   0.9041  11.854

Actually I want ot get the last value for each unit that is different for each unit. Like below:

ts_resample             unit    Val1    Val2
2020-12-01 00:00:00     unit1   0.8974  10.485
2020-12-01 00:00:00     unit2   0.9014  11.954
2020-12-01 00:00:00     unit3   0.7854  10.821
2020-12-01 00:05:00     unit1   0.9841  11.125
2020-12-01 00:05:00     unit2   0.8742  10.963
2020-12-01 00:05:00     unit3   0.9632  11.784

Could anybody tell me that what is the problem in my query ?

Thanks


Solution

  • I solved this problem by giving unit information in the partition by with ts_resample. The final solution is given below:

    SELECT DISTINCT * 
    from(
    select ts_resample, unit,
    last_value(Val1) over (partition by ts_resample, unit order by ts rows between unbounded preceding and unbounded following) as Val1, 
    last_value(Val2) over (partition by ts_resample, unit order by ts rows between unbounded preceding and unbounded following) as Val2
    from (
    SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/300 as bigint)*300 as TIMESTAMP) as ts_resample, 
    ts as ts, unit as unit, Val1 as Val1, Val2 as Val2
    FROM Sensor_Data.Table1) as t) as tt
    

    After this I have the result as I wanted and showed in my question.