I have a dataset in SQL like this:
CREATE TABLE hospital_visits (
name VARCHAR(50),
hospital_visit_date DATE
);
INSERT INTO hospital_visits (name, hospital_visit_date) VALUES
('patient1', '2024-01-15'),
('patient2', '2024-02-20'),
('patient3', '2024-03-10'),
('patient4', '2024-04-05'),
('patient5', '2024-05-12'),
('patient6', '2024-01-08'),
('patient6', '2024-03-15'),
('patient7', '2024-02-14'),
('patient7', '2024-04-20'),
('patient8', '2024-01-25'),
('patient8', '2024-05-30'),
('patient9', '2024-01-12'),
('patient9', '2024-02-28'),
('patient9', '2024-05-15'),
('patient10', '2024-03-05'),
('patient10', '2024-04-10'),
('patient10', '2024-06-01'),
('patient11', '2024-01-20'),
('patient11', '2024-02-25'),
('patient11', '2024-04-15'),
('patient11', '2024-06-10'),
('patient12', '2024-01-05'),
('patient12', '2024-02-12'),
('patient12', '2024-03-20'),
('patient12', '2024-05-08'),
('patient12', '2024-06-15');
I first ran a query which does a group by/count on the number of visits:
WITH visit_counts AS (
SELECT name, COUNT(*) as num_visits
FROM hospital_visits
GROUP BY name
)
SELECT num_visits, COUNT(*) as num_people
FROM visit_counts
GROUP BY num_visits
ORDER BY num_visits;
num_visits num_people
1 5
2 3
3 2
4 1
5 1
I then wrote an SQL query which looks at the average time between consecutive visits for different visit combinations:
WITH visit_sequence AS (
SELECT
name,
hospital_visit_date,
ROW_NUMBER() OVER (PARTITION BY name ORDER BY hospital_visit_date) as visit_number,
LAG(hospital_visit_date) OVER (PARTITION BY name ORDER BY hospital_visit_date) as prev_visit_date
FROM hospital_visits
),
visit_transitions AS (
SELECT
name,
CASE WHEN visit_number = 1 THEN NULL ELSE visit_number - 1 END as current_visit,
visit_number as next_visit,
hospital_visit_date,
prev_visit_date,
hospital_visit_date - prev_visit_date as days_between_visits
FROM visit_sequence
)
SELECT
current_visit,
next_visit,
COUNT(*) as num_people,
ROUND(AVG(days_between_visits), 1) as avg_days_between
FROM visit_transitions
GROUP BY current_visit, next_visit
ORDER BY current_visit, next_visit;
current_visit next_visit num_people avg_days_between
NA 1 12 NA
1 2 7 59.4
2 3 4 54.0
3 4 2 52.5
4 5 1 38.0
I am wondering: how can I check to make sure that both queries are correct? For example, the first row of the second query is 12 and the sum of the second column in the first query is also 12. Is this a logical check I can use?
num_people
Yes, for the count of people you're right finding that 1st row of query 2 = sum(column 2) over (query 1)
, because
sum(column 2) over (query 1)
represents "count of people that visited once + count of people that visited twice + … + count of people that visited 5 times",1st row of query 2 = sum(column 2) over (query 1)
represents "1st visit of all those people" (so even those who visited 5 times had a first visit)Logically, only those who visited twice or more had a second visit. Thus 2nd row of query 2 = sum(column 2) over (query 1) where num_visits >= 2
(7 = sum of only the 4 last rows of query 1 = 3 + 2 + 1 + 1).
Thus the count of people having a nth visit is the sum of every people having visited a total of n times or more,
which we can express in terms of an SQL windowed sum:
WITH visit_people_counts AS (
/* your first query here */
)
SELECT num_visits, SUM(num_people) OVER (ORDER BY num_visits DESC) people_having_a_nth_visit
FROM visit_people_counts
ORDER BY num_visits;
(see it as the first SELECT
in this fiddle)
avg_days_between
We could cross validate by verifying the grand total (relying on "the sum of averages should be the average of sums"), and by hand-validating one item of this grand total.
For the hand validation, beware measuring something significant; and for both the hand validation and the grand total, beware measuring the same reality.
For example, we shouldn't consider that patient12
has an average of 38 days between his visits, because the "average for 5" in the results array doesn't mean "the average for patient12
", but "the average for the 5th visit of patient12
" (so it's an uninteresting case: OK, 38 days between 05-08 and 06-15 is quickly validated, but we didn't do any average: it's unsignificant).
On the other hand, we cannot isolate "people with 1 visit only" in your results table either, because "1" regroups the first (and only visit) of patients with 1 visit, as well as the first visit of patients up to 5 visits.
So for now we'll not try verifying an individual line from your query (which would force us to at least reuse the first CTE, thus trust it, which is not the way to validate…).
Instead we'll take the individual validation in the "average of sums (of delays between two visits)" pool, which will be easier.
So let's start with the "average of sums" way:
The people doing 5 visits with "on average" n days between each could have done each visit exactly n days after the previous, to get this average (thus on days 0, n, 2n, 3n, 4n). Or it could have visited 4 times in a row (days 0, 1, 2, 3) and then 4n-3 days after (→ day 4n), it would give the same average.
What is important here is that the last visit always occur on day 4n: the difference between first and last visit always = (count of visits - 1) * avg(duration between visits).
Thus we can simply our "sum of delays between visits" as the MAX(hospital_visit_date) - MIN(hospital_visit_date)
divided by COUNT(hospital_visit_date) - 1
:
WITH by_people AS
(
SELECT
DAYS(MAX(hospital_visit_date)) - DAYS(MIN(hospital_visit_date)) as total_span,
COUNT(hospital_visit_date) as num_visits
FROM hospital_visits
GROUP BY name
)
SELECT
num_visits, COUNT(*) num_people, AVG(1.0 * total_span) avg_total_span,
CASE WHEN num_visits > 1 THEN ROUND(AVG(1.0 * total_span) / (num_visits - 1), 1) END as avg_days_between
FROM by_people
GROUP BY num_visits
ORDER BY num_visits;
(second SELECT
of the same fiddle)
NUM_VISITS | NUM_PEOPLE | AVG_TOTAL_SPAN | AVG_DAYS_BETWEEN |
---|---|---|---|
1 | 5 | 0 | null |
2 | 3 | 86.3 | 86.3 |
3 | 2 | 106 | 53.0 |
4 | 1 | 142 | 47.3 |
5 | 1 | 162 | 40.5 |
In fact at first I got only round numbers, but then I found it suspect that it had no decimal part, and understood that AVG(162) / 5
returned the euclidian divide of 162 by 5, that is, 40 (with remainder 2) instead of 40.5.
So I introduced a 1.0 *
in my expression to make it a floating point AVG()
.
So I learnt two things while doing this query:
DAYS(…) - DAYS(…)
to get day count between two dates. Just using … - …
for periods spanning over more than a month returns an unexpected number.patient12
got 510, which reads as "5 months, 10 days" (January 5th to June 10th) but gives us an unexpected average of course.AVG()
Let's consider that this rounding error served as the hand-validation of one row (here patient12
).
And thus our grand total of average days between visits should be (3 * 1 * 86.3 + 2 * 2 * 53 + 3 * 47.3 + 4 * 40.5) / 10 =
55.34 as an average delay over all visits.
If I wrap your last SELECT
in a CTE then compute the grand total:
SELECT SUM(1.0 * num_people * avg_days_between) / SUM(num_people) FROM res WHERE current_visit IS NOT NULL;
I only get 55.0 within the Developer-C 11.1 DB2 fiddle.
So based on our now long-time experience with AVG()
, we get back to your query, put an 1.0 into AVG(1.0 * days_between_visits)
, and here we go with the fixed results:
CURRENT_VISIT | NEXT_VISIT | NUM_PEOPLE | AVG_DAYS_BETWEEN |
---|---|---|---|
1 | 2 | 7 | 59.4 |
2 | 3 | 4 | 54 |
3 | 4 | 2 | 52.5 |
4 | 5 | 1 | 38 |
null | 1 | 12 | null |
(last query in the fiddle)
… But after having re-read your question, I saw that you already had averages with decimals.
While copy-pasting your data and query to the fiddle returned:
DATE - DATE
rule that I discoveredMorality: even if your query works in your environment, to make it universal and avoid unexpected side effects in other versions / configurations of DB2, you should apply the 2 rules that I learnt:
AVG()
ing them