I'm quite a novice SQL user and I'm struggling with a SQLite query over a GROUP BY. Here's a simplified example of my problem using this table data:
id, pid, tid, duration
1, 1, 12, 0.099999
2, 1, 13, 0.105
3, 1, 14, 0.102
4, 1, 15, 0.1
5, 1, 22, 0.12
6, 1, 23, 0.101
7, 1, 24, 0.11
8, 2, 13, 0.105
9, 2, 14, 0.102
10, 2, 15, 0.1
11, 2, 16, 0.11
12, 2, 17, 0.11
13, 2, 18, 0.0995
14, 2, 19, 0.0998
15, 1, 12, 0.099999
16, 1, 13, 0.105
17, 1, 23, 0.101
18, 1, 24, 0.11
19, 2, 15, 0.1
20, 2, 16, 0.11
21, 2, 17, 0.11
22, 2, 18, 0.0995
23, 2, 19, 0.0998
24, 1, 13, 0.105
25, 1, 15, 0.1
26, 1, 22, 0.12
27, 1, 23, 0.101
28, 1, 24, 0.11
29, 2, 13, 0.105
30, 2, 14, 0.102
31, 2, 15, 0.1
32, 2, 16, 0.11
33, 2, 19, 0.0998
I'm trying to count the number of unique tids for each pid and sum the duration for each pid. While each tid identifies a single timestep, its corresponding duration is not unique: two different tids can have the same duration. Note: in the real case duration is actually a joined column on tid and I'm only showing the columns I'm trying to operate on here.
I also want to figure out the number of "visits" - if there is a break in the "unique_timesteps" of more than 5 tids, then that counts as a separate visit - so in this example, pid=1 would have 2 visits, while pid=2 would have one visit.
Here's what I expect the final result to be:
┌───────┬───────────┬────────────────────┬────────┐
│ pid │ num_times │ exposure_time │ visits │
│ int64 │ int64 │ double │ int64 │
├───────┼───────────┼────────────────────┼────────┤
│ 1 │ 7 │ 0.7379990000000001 │ 2 │
│ 2 │ 7 │ 0.7263000000000001 │ 1 │
└───────┴───────────┴────────────────────┴────────┘
I run the following query on the data but it's not returning the expected results for the group by.
SELECT
  pid,
  COUNT(DISTINCT(tid)) AS num_times,
  SUM(DISTINCT(duration)) AS exposure_time,
  GROUP_CONCAT(DISTINCT(id)) AS rows,
  GROUP_CONCAT(DISTINCT(tid)) AS unique_timesteps
FROM
  distinct_example
GROUP BY
  pid
ORDER BY
  pid;
Here's the result where I'm adding the last two columns for context:
┌───────┬───────────┬────────────────────┬──────────────────────────────────────────────────┬──────────────────────┐
│ pid │ num_times │ exposure_time │ rows │ unique_timesteps │
│ int64 │ int64 │ double │ varchar │ varchar │
├───────┼───────────┼────────────────────┼──────────────────────────────────────────────────┼──────────────────────┤
│ 1 │ 7 │ 0.7379990000000001 │ 24,4,2,7,25,27,16,1,15,18,3,5,28,6,17,26 │ 12,23,15,22,24,13,14 │
│ 2 │ 7 │ 0.6163 │ 13,14,32,9,8,19,23,33,12,31,30,20,22,11,10,21,29 │ 16,19,17,13,14,18,15 │
└───────┴───────────┴────────────────────┴──────────────────────────────────────────────────┴──────────────────────┘
In each case, there are 7 unique timesteps recorded for each pid, as correctly shown in the second column. The third column should be the sum of the duration column within each pid group. This should give (0.737999, 0.726300) as the result for exposure_time, as duration should be summed over the unique_timesteps rather than over the unique durations. However, I've misunderstood the use of DISTINCT within the GROUP BY, and one of the durations for pid=2 has been neglected (0.11).
I guess I should be doing some subquery to select the correct rows to use as an index but I can't get my head around how to do this.
I have no clue on how to calculate the visits
result but presume it could be accomplished with some subqueries and a window function?
exposure_time: ensuring uniqueness

As you correctly analyzed, your DISTINCT duration returns only one occurrence when two tids share the same duration. What naturally maps to your need would be PostgreSQL's DISTINCT ON, which doesn't exist in SQLite, so you'll have to emulate it.
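To make the pitfall concrete, here is a minimal sketch using Python's standard sqlite3 module and only the pid=2 rows from the question. The in-memory database and the helper names DURATION and TIDS are assumptions made for illustration. It compares SUM(DISTINCT duration) with a sum taken over one row per tid.

```python
import sqlite3

# tid -> duration, copied from the question's sample data (only the tids seen by pid=2).
DURATION = {13: 0.105, 14: 0.102, 15: 0.1, 16: 0.11, 17: 0.11, 18: 0.0995, 19: 0.0998}
# tids recorded for pid=2, in id order (ids 8-14, 19-23 and 29-33 in the question).
TIDS = [13, 14, 15, 16, 17, 18, 19, 15, 16, 17, 18, 19, 13, 14, 15, 16, 19]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distinct_example (pid INTEGER, tid INTEGER, duration REAL)")
conn.executemany(
    "INSERT INTO distinct_example VALUES (2, ?, ?)",
    [(tid, DURATION[tid]) for tid in TIDS],
)

# DISTINCT applies to the duration values, so tids 16 and 17 (both 0.11) collapse into one.
(distinct_sum,) = conn.execute(
    "SELECT SUM(DISTINCT duration) FROM distinct_example"
).fetchone()

# Deduplicate on (tid, duration) first, then sum: every tid contributes exactly once.
(per_tid_sum,) = conn.execute(
    "SELECT SUM(duration) FROM (SELECT DISTINCT tid, duration FROM distinct_example)"
).fetchone()

print(round(distinct_sum, 4), round(per_tid_sum, 4))
```

The first value should come out near 0.6163 (the result the question observed) and the second near 0.7263 (the result the question expected); the gap is the 0.11 shared by tids 16 and 17.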
We could use subselects with LIMIT 1 (to get 1 value for each selected tid), but I prefer using Common Table Expressions, which can be seen as a sequence of intermediate temporary tables. They help you stay organized (you construct your query progressively, intermediate table by intermediate table) and ease diagnosis (you can select * from any intermediate table to check that its step of the process matches your expectations).
So we'll first renormalize all your data into CTEs, ensuring they only hold unique values (of course, if you still have your original tid / duration table, feel free to start from it), then sum up from those tables without DISTINCT (because we're sure we only have one row per tid / duration entry, and per tid / pid pair):
WITH
  -- Get only 1 tid / pid pair
  pidtid AS (SELECT DISTINCT pid, tid FROM distinct_example),
  -- Reconstitute single tids
  tid AS (SELECT DISTINCT tid, duration FROM distinct_example),
  -- Now sum up pid by pid
  sum AS
  (
    SELECT pid, COUNT(1) num_times, SUM(duration) exposure_time
    FROM tid JOIN pidtid USING (tid)
    GROUP BY pid
  )
SELECT * FROM sum;
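For comparison, the subselect-with-LIMIT 1 route mentioned earlier could look roughly like the sketch below. It is driven from Python's sqlite3 module so it can be run as is; the in-memory table and the DURATION and ROWS helpers are assumptions that simply rebuild the question's sample data. The correlated subquery picks a single duration per tid, and the outer DISTINCT then leaves one row per pid / tid pair.

```python
import sqlite3

# tid -> duration, as listed in the question's sample data.
DURATION = {12: 0.099999, 13: 0.105, 14: 0.102, 15: 0.1, 16: 0.11, 17: 0.11,
            18: 0.0995, 19: 0.0998, 22: 0.12, 23: 0.101, 24: 0.11}
# (pid, tid) pairs in id order, ids 1..33 as in the question.
ROWS = [
    (1, 12), (1, 13), (1, 14), (1, 15), (1, 22), (1, 23), (1, 24),  # ids 1-7
    (2, 13), (2, 14), (2, 15), (2, 16), (2, 17), (2, 18), (2, 19),  # ids 8-14
    (1, 12), (1, 13), (1, 23), (1, 24),                             # ids 15-18
    (2, 15), (2, 16), (2, 17), (2, 18), (2, 19),                    # ids 19-23
    (1, 13), (1, 15), (1, 22), (1, 23), (1, 24),                    # ids 24-28
    (2, 13), (2, 14), (2, 15), (2, 16), (2, 19),                    # ids 29-33
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE distinct_example (id INTEGER PRIMARY KEY, pid INTEGER, tid INTEGER, duration REAL)"
)
conn.executemany(
    "INSERT INTO distinct_example VALUES (?, ?, ?, ?)",
    [(i, pid, tid, DURATION[tid]) for i, (pid, tid) in enumerate(ROWS, start=1)],
)

result = conn.execute("""
    SELECT pid, COUNT(*) AS num_times, SUM(duration) AS exposure_time
    FROM (
        -- One row per (pid, tid); the subselect fetches a single duration for that tid.
        SELECT DISTINCT pid, tid,
               (SELECT d.duration FROM distinct_example d WHERE d.tid = e.tid LIMIT 1) AS duration
        FROM distinct_example e
    )
    GROUP BY pid
    ORDER BY pid
""").fetchall()

for pid, num_times, exposure_time in result:
    print(pid, num_times, round(exposure_time, 6))
```

This relies on every tid having a single duration, which the question states. Both pids should come out with 7 timesteps, with exposure times of about 0.737999 and 0.7263.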
visits: window function to measure how far apart same-pid entries are

For your visits, you're right, a window function will be the easiest way (we could also join the table to itself ON b.id BETWEEN a.id - 6 AND a.id - 1 WHERE b.pid = a.pid to find near predecessors).
We cannot compare previous rows to the current row inside a window function (something like MIN(CASE WHEN pid = CURRENT.pid THEN 0 ELSE 1 END) OVER (ORDER BY id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS INCREASE is not valid), so we'll partition by pid and use each row's position within the full dataset, which I understand id to be, to see how far each entry is from the previous entry of the same pid.
WITH
  prev AS
  (
    SELECT
      id, pid, tid,
      CASE WHEN id <= 6 + LAG(id) OVER (PARTITION BY pid ORDER BY id) THEN 0 ELSE 1 END AS new_visit
    FROM distinct_example
  ),
  visits AS (SELECT pid, SUM(new_visit) visits FROM prev GROUP BY pid)
SELECT * FROM visits;
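The self-join alternative mentioned above can be sketched without window functions. The version below is one possible reading of it, written as a NOT EXISTS anti-join rather than an explicit JOIN: a row starts a new visit when no row of the same pid exists among the 6 preceding ids. It runs through Python's sqlite3 module; the PIDS list is an assumed helper that reproduces the pid column of the question's data in id order.

```python
import sqlite3

# pid of each row in id order (ids 1..33), taken from the question's sample data.
PIDS = [1] * 7 + [2] * 7 + [1] * 4 + [2] * 5 + [1] * 5 + [2] * 5

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distinct_example (id INTEGER PRIMARY KEY, pid INTEGER)")
conn.executemany(
    "INSERT INTO distinct_example VALUES (?, ?)",
    list(enumerate(PIDS, start=1)),
)

visits = dict(conn.execute("""
    SELECT a.pid, COUNT(*) AS visits
    FROM distinct_example a
    -- Keep only rows that have no same-pid predecessor within the 6 previous ids.
    WHERE NOT EXISTS (
        SELECT 1
        FROM distinct_example b
        WHERE b.pid = a.pid
          AND b.id BETWEEN a.id - 6 AND a.id - 1
    )
    GROUP BY a.pid
    ORDER BY a.pid
""").fetchall())

print(visits)
```

On the sample data this should report 2 visits for pid=1 and 1 visit for pid=2, the same counts as the LAG version. Like that version, it assumes ids are sequential.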
Now we can join our two tables (each of them guaranteeing one entry per pid):
WITH
  -- Get only 1 tid / pid pair
  pidtid AS (SELECT DISTINCT pid, tid FROM distinct_example),
  -- Reconstitute single tids
  tid AS (SELECT DISTINCT tid, duration FROM distinct_example),
  -- Now sum up pid by pid
  sum AS
  (
    SELECT pid, COUNT(1) num_times, SUM(duration) exposure_time
    FROM tid JOIN pidtid USING (tid)
    GROUP BY pid
  ),
  -- Start another "thread" for visits (another = we do not SELECT from the previous CTEs, we start straight from distinct_example)
  -- The following CTE is optional, it recomputes ids *if they are not sequential*.
  -- (some RDBMS allow choosing the same name as the original table to transparently override any use of the original table in the subsequent SELECTs)
  distexreindexed AS (SELECT ROW_NUMBER() OVER (ORDER BY id) id, pid, tid, duration FROM distinct_example),
  prev AS
  (
    SELECT
      id, pid, tid,
      CASE WHEN id <= 6 + LAG(id) OVER (PARTITION BY pid ORDER BY id) THEN 0 ELSE 1 END AS new_visit
    FROM distexreindexed -- ← Switch to the reindexed table.
  ),
  visits AS (SELECT pid, SUM(new_visit) visits FROM prev GROUP BY pid)
SELECT sum.*, visits FROM sum JOIN visits USING (pid) ORDER BY pid;
pid | num_times | exposure_time | visits |
---|---|---|---|
1 | 7 | 0.737999 | 2 |
2 | 7 | 0.7263000000000001 | 1 |
We probably could have compacted it (by merging some of the CTEs), but keeping those two distinct problems separate in the final query will ease maintenance.
Everything runs in this fiddle.
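As a small illustration of what the optional distexreindexed CTE does, the sketch below feeds ROW_NUMBER() a few rows whose ids have holes. The ids and pids here are hypothetical and not taken from the question, and the new_id alias is an assumption chosen so both numberings are visible side by side. It needs an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distinct_example (id INTEGER, pid INTEGER)")
# Hypothetical ids with holes (for instance after deletions); pids are illustrative.
conn.executemany(
    "INSERT INTO distinct_example VALUES (?, ?)",
    [(10, 1), (20, 1), (35, 2), (90, 1)],
)

# ROW_NUMBER() assigns 1, 2, 3, ... in id order, so the distance between two
# rows counts intervening rows instead of depending on the raw id values.
reindexed = conn.execute("""
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS new_id, id, pid
    FROM distinct_example
    ORDER BY id
""").fetchall()

print(reindexed)  # [(1, 10, 1), (2, 20, 1), (3, 35, 2), (4, 90, 1)]
```

With the raw ids, the row at id 90 would look far away from id 20; after reindexing, the two pid=1 rows are only 2 positions apart, which is what the "id <= 6 + LAG(id)" test is meant to measure.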