I'm working with a dataset where an IoT device sends a `wh` (watt-hour) value at discrete timestamps, identified by an `operationId`. I want to visualize this data by plotting the sum of average power per operation, broken down by hour.
Here is the raw data:
| timestamp | operationId | wh |
|---|---|---|
| 2025-04-01T14:10:59.000000Z | 1001 | 0 |
| 2025-04-01T14:20:01.000000Z | 1002 | 0 |
| 2025-04-01T15:06:29.000000Z | 1003 | 0 |
| 2025-04-01T18:18:05.000000Z | 1001 | 200 |
| 2025-04-01T20:06:36.000000Z | 1003 | 200 |
| 2025-04-01T22:20:10.000000Z | 1002 | 300 |
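For reference, here is a simplified version of the table and the sample rows (the types are an approximation of my real schema; only the column names matter):

CREATE TABLE meter (
    timestamp TIMESTAMP,
    operationId INT,
    wh DOUBLE
) TIMESTAMP(timestamp) PARTITION BY DAY;

INSERT INTO meter VALUES
    ('2025-04-01T14:10:59.000000Z', 1001, 0),
    ('2025-04-01T14:20:01.000000Z', 1002, 0),
    ('2025-04-01T15:06:29.000000Z', 1003, 0),
    ('2025-04-01T18:18:05.000000Z', 1001, 200),
    ('2025-04-01T20:06:36.000000Z', 1003, 200),
    ('2025-04-01T22:20:10.000000Z', 1002, 300);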
I'm using `lag()` as a window function to get the previous `wh` and timestamp, and compute the average power:
SELECT
timestamp AS end_time,
cast(prev_ts AS timestamp) AS start_time,
operationId,
-- energy delta in Wh divided by the elapsed time in hours
-- (QuestDB timestamps are in microseconds, so 3600000000.0 µs = 1 hour)
(wh - prev_wh) / ((cast(timestamp AS DOUBLE) - prev_ts) / 3600000000.0) AS mean_power_w
FROM (
SELECT
timestamp,
wh,
operationId,
lag(wh) OVER (PARTITION BY operationId ORDER BY timestamp) AS prev_wh,
lag(cast(timestamp AS DOUBLE)) OVER (PARTITION BY operationId ORDER BY timestamp) AS prev_ts
FROM meter
)
WHERE prev_ts IS NOT NULL
ORDER BY timestamp;
This gives me:
| end_time | start_time | mean_power_w | operationId |
|---|---|---|---|
| 2025-04-01T18:18:05.000000Z | 2025-04-01T14:10:59.000000Z | 48.563334682314 | 1001 |
| 2025-04-01T20:06:36.000000Z | 2025-04-01T15:06:29.000000Z | 39.984450491475 | 1003 |
| 2025-04-01T22:20:10.000000Z | 2025-04-01T14:20:01.000000Z | 37.488284910965 | 1002 |
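As a sanity check on the first row: operation 1001 runs from 14:10:59 to 18:18:05, which is 14,826 s ≈ 4.1183 h, and gains 200 Wh over that span, so the mean power is 200 / 4.1183 ≈ 48.56 W, matching the value above.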
Now I'd like to expand each row into hourly buckets between `start_time` and `end_time`, filling each hour with the cumulative energy based on the mean power. For our example:
| ts | operationId | energy (Wh) |
|---|---|---|
| 2025-04-01T14:00:00.000Z | 1001 | 39.67354647241263 |
| 2025-04-01T14:00:00.000Z | 1002 | 24.981776528168066 |
| 2025-04-01T15:00:00.000Z | 1001 | 48.563334682314 |
| 2025-04-01T15:00:00.000Z | 1003 | 35.66390848003506 |
| 2025-04-01T15:00:00.000Z | 1002 | 37.488284910965 |
| 2025-04-01T16:00:00.000Z | 1001 | 48.563334682314 |
| 2025-04-01T16:00:00.000Z | 1003 | 39.984450491475 |
| 2025-04-01T16:00:00.000Z | 1002 | 37.488284910965 |
| 2025-04-01T17:00:00.000Z | 1001 | 48.563334682314 |
| 2025-04-01T17:00:00.000Z | 1003 | 39.984450491475 |
| 2025-04-01T17:00:00.000Z | 1002 | 37.488284910965 |
| 2025-04-01T18:00:00.000Z | 1001 | 14.636449480641858 |
| 2025-04-01T18:00:00.000Z | 1003 | 39.984450491475 |
| 2025-04-01T18:00:00.000Z | 1002 | 37.488284910965 |
| 2025-04-01T19:00:00.000Z | 1003 | 39.984450491475 |
| 2025-04-01T19:00:00.000Z | 1002 | 37.488284910965 |
| 2025-04-01T20:00:00.000Z | 1003 | 4.39828955406225 |
| 2025-04-01T20:00:00.000Z | 1002 | 37.488284910965 |
| 2025-04-01T21:00:00.000Z | 1002 | 37.488284910965 |
| 2025-04-01T22:00:00.000Z | 1002 | 12.600229095074349 |
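To spell out where these numbers come from: operation 1001 has a mean power of ≈ 48.56 W. Its first bucket (14:00) only covers 14:10:59 to 15:00:00, i.e. 2,941 s ≈ 0.8169 h, giving 48.5633 × 0.8169 ≈ 39.67 Wh; each fully covered hour gets the full ≈ 48.56 Wh; and the last bucket (18:00) only covers 18:00:00 to 18:18:05, i.e. 1,085 s ≈ 0.3014 h, giving ≈ 14.64 Wh. The per-hour energies for an operation therefore add back up to the original 200 Wh.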
How can I expand the results between `start_time` and `end_time` per hour and fill each hour with the correct `mean_power_w` (converted to `wh` for that period)?
"Cumulative" might not be the right term here. What I’m actually trying to do is expand each operation’s average power output over time.
So for each `operationId`, I calculate the average power in W between two measurements, then I want to apply that average power to every hour between the start and end timestamps, converting it to Wh for each hour.
So the goal isn’t to sum up values cumulatively across time, but rather to spread the average energy contribution of an operation across each hour it spans.
Is there a way to achieve this using `SAMPLE BY`, joins, or maybe generate_series-like logic in QuestDB?
It took me a while to come up with a suitable solution. I thought I had one (which was mega complex), but then I realised, speaking with a colleague, that this should be easier. And here it is:
WITH
sampled AS (
    -- one row per operationId and hourly bucket, filling the gaps with 0 wh
    SELECT timestamp, operationId, SUM(wh) AS wh
    FROM meter
    SAMPLE BY 1h
    FILL(0)
),
sessions AS (
    -- walking backwards in time, every row with wh > 0 bumps the counter,
    -- so it and the zero-wh hours before it share the same session number
    SELECT *,
           SUM(CASE WHEN wh > 0 THEN 1 END)
               OVER (PARTITION BY operationId ORDER BY timestamp DESC) AS session
    FROM sampled
),
counts AS (
    SELECT timestamp, operationId,
           -- the single non-zero wh reported in this session
           FIRST_VALUE(wh) OVER (PARTITION BY operationId, session ORDER BY timestamp DESC) AS wh,
           -- how many hourly rows the session spans
           COUNT(*) OVER (PARTITION BY operationId, session) AS attributable_hours
    FROM sessions
)
SELECT timestamp, operationId, wh / attributable_hours AS wh_avg
--     , wh, attributable_hours
FROM counts;
Basically, what you need here is to slice, for each operationId, all the hours with no wh value (or 0) preceding a row with a value for wh. Then we divide the wh of that row across all the rows in the same slice, to attribute the average. I am calling each of these slices a "session".
My query starts with the `sampled` subquery, which just makes sure we have an entry for each operationId and each missing hourly interval, filling all the interpolated rows with 0 wh.
Now for the real trick of the query: we divide the data into "sessions". A session is defined by all the rows with no value for wh that come before a row with a value for wh. Or, if we reverse the timestamp order, a session is a row with a value for wh followed by several rows with zero value for the same operationId. We can express this with this window function:
SUM(case when wh>0 THEN 1 END) OVER (partition by operationId order by timestamp desc) as session
Now, for each operationId we will have multiple sessions (1,2,3...), and if we did something like...
COUNT() as attributable_hours GROUP BY operationId, session
...we would get how many attributable rows each session has.
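For example, if one operationId's hourly wh values read newest-first as 200, 0, 0, 0, 0, the running sum assigns session 1 to all five rows, so that session has 5 attributable hours and the 200 Wh ends up split as 40 Wh per hour in the final step; a second non-zero reading further back in time would start session 2.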
With this info I can go to the next subquery, the one I named `counts`, in which I am using a window function to `COUNT` the number of rows per session (notice how the `count` window function is not using `order by`, so this is not a running count; all the rows for the same session get the same value for `attributable_hours`).
I am also getting the `FIRST_VALUE` for the session sorted by reverse timestamp, which is the `wh` value of the only non-zero row in each session.
The final query is then just a simple division, averaging the `wh` reported in the session across the number of `attributable_hours`.
I left a line commented out in case you want to see more clearly the intermediate values for the calculation.
Note that if you want to filter the results by timestamp or operationId, you should ideally add the filter to the first subquery, the one named `sampled`, so the rest of the process runs on the relevant subset of data.
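For example, something along these lines (the date interval and operationId below are just placeholders to adapt to your data):

sampled AS (
    SELECT timestamp, operationId, SUM(wh) AS wh
    FROM meter
    WHERE timestamp IN '2025-04-01'   -- interval filter on the designated timestamp (placeholder)
      AND operationId = 1001          -- placeholder operationId
    SAMPLE BY 1h
    FILL(0)
),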