sqldb2

Comparing transition counts to total counts in SQL?


I have a dataset in SQL like this:

CREATE TABLE hospital_visits (
    name VARCHAR(50),
    hospital_visit_date DATE
);


INSERT INTO hospital_visits (name, hospital_visit_date) VALUES

('patient1', '2024-01-15'),
('patient2', '2024-02-20'),
('patient3', '2024-03-10'),
('patient4', '2024-04-05'),
('patient5', '2024-05-12'),


('patient6', '2024-01-08'),
('patient6', '2024-03-15'),
('patient7', '2024-02-14'),
('patient7', '2024-04-20'),
('patient8', '2024-01-25'),
('patient8', '2024-05-30'),


('patient9', '2024-01-12'),
('patient9', '2024-02-28'),
('patient9', '2024-05-15'),
('patient10', '2024-03-05'),
('patient10', '2024-04-10'),
('patient10', '2024-06-01'),


('patient11', '2024-01-20'),
('patient11', '2024-02-25'),
('patient11', '2024-04-15'),
('patient11', '2024-06-10'),


('patient12', '2024-01-05'),
('patient12', '2024-02-12'),
('patient12', '2024-03-20'),
('patient12', '2024-05-08'),
('patient12', '2024-06-15');

I first ran a query which does a group by/count on the number of visits:

WITH visit_counts AS (
    SELECT name, COUNT(*) as num_visits
    FROM hospital_visits
    GROUP BY name
)
SELECT num_visits, COUNT(*) as num_people
FROM visit_counts

GROUP BY num_visits

ORDER BY num_visits;

  num_visits num_people
          1          5
          2          3
          3          2
          4          1
          5          1

I then wrote an SQL query which looks at the average time between consecutive visits for different visit combinations:

 WITH visit_sequence AS (
    SELECT 
        name,
        hospital_visit_date,
        ROW_NUMBER() OVER (PARTITION BY name ORDER BY hospital_visit_date) as visit_number,
        LAG(hospital_visit_date) OVER (PARTITION BY name ORDER BY hospital_visit_date) as prev_visit_date
    FROM hospital_visits
),
visit_transitions AS (
    SELECT 
        name,
        CASE WHEN visit_number = 1 THEN NULL ELSE visit_number - 1 END as current_visit,
        visit_number as next_visit,
        hospital_visit_date,
        prev_visit_date,
        hospital_visit_date - prev_visit_date as days_between_visits
    FROM visit_sequence
)
SELECT 
    current_visit,
    next_visit,
    COUNT(*) as num_people,
    ROUND(AVG(days_between_visits), 1) as avg_days_between
FROM visit_transitions
GROUP BY current_visit, next_visit
ORDER BY current_visit, next_visit;


 current_visit next_visit num_people avg_days_between
            NA          1         12               NA
             1          2          7             59.4
             2          3          4             54.0
             3          4          2             52.5
             4          5          1             38.0

I am wondering: how can I check to make sure that both queries are correct? For example, the first row of the second query is 12 and the sum of the second column in the first query is also 12. Is this a logical check I can use?


Solution

  • Logically verifying column num_people

    Yes, for the count of people you're right finding that 1st row of query 2 = sum(column 2) over (query 1), because

    Logically, only those who visited twice or more had a second visit. Thus 2nd row of query 2 = sum(column 2) over (query 1) where num_visits >= 2 (7 = sum of only the 4 last rows of query 1 = 3 + 2 + 1 + 1).

    Thus the count of people having a nth visit is the sum of every people having visited a total of n times or more,
    which we can express in terms of an SQL windowed sum:

    WITH visit_people_counts AS (
    /* your first query here */
    )
    SELECT num_visits, SUM(num_people) OVER (ORDER BY num_visits DESC) people_having_a_nth_visit
    FROM visit_people_counts
    ORDER BY num_visits;
    

    (see it as the first SELECT in this fiddle)

    Logically verifying column avg_days_between

    We could cross validate by verifying the grand total (relying on "the sum of averages should be the average of sums"), and by hand-validating one item of this grand total.

    For the hand validation, beware measuring something significant; and for both the hand validation and the grand total, beware measuring the same reality.

    For example, we shouldn't consider that patient12 has an average of 38 days between his visits, because the "average for 5" in the results array doesn't mean "the average for patient12", but "the average for the 5th visit of patient12" (so it's an uninteresting case: OK, 38 days between 05-08 and 06-15 is quickly validated, but we didn't do any average: it's unsignificant).
    On the other hand, we cannot isolate "people with 1 visit only" in your results table either, because "1" regroups the first (and only visit) of patients with 1 visit, as well as the first visit of patients up to 5 visits.

    So for now we'll not try verifying an individual line from your query (which would force us to at least reuse the first CTE, thus trust it, which is not the way to validate…).
    Instead we'll take the individual validation in the "average of sums (of delays between two visits)" pool, which will be easier.

    So let's start with the "average of sums" way:

    Computing average delay patient by patient

    The people doing 5 visits with "on average" n days between each could have done each visit exactly n days after the previous, to get this average (thus on days 0, n, 2n, 3n, 4n). Or it could have visited 4 times in a row (days 0, 1, 2, 3) and then 4n-3 days after (→ day 4n), it would give the same average.
    What is important here is that the last visit always occur on day 4n: the difference between first and last visit always = (count of visits - 1) * avg(duration between visits).

    Thus we can simply our "sum of delays between visits" as the MAX(hospital_visit_date) - MIN(hospital_visit_date) divided by COUNT(hospital_visit_date) - 1:

    WITH by_people AS
    (
        SELECT
            DAYS(MAX(hospital_visit_date)) - DAYS(MIN(hospital_visit_date)) as total_span,
            COUNT(hospital_visit_date) as num_visits
        FROM hospital_visits
        GROUP BY name
    )
    SELECT
        num_visits, COUNT(*) num_people, AVG(1.0 * total_span) avg_total_span,
        CASE WHEN num_visits > 1 THEN ROUND(AVG(1.0 * total_span) / (num_visits - 1), 1) END as avg_days_between
    FROM by_people
    GROUP BY num_visits
    ORDER BY num_visits;
    

    (second SELECT of the same fiddle)

    NUM_VISITS NUM_PEOPLE AVG_TOTAL_SPAN AVG_DAYS_BETWEEN
    1 5 0 null
    2 3 86.3 86.3
    3 2 106 53.0
    4 1 142 47.3
    5 1 162 40.5

    In fact at first I got only round numbers, but then I found it suspect that it had no decimal part, and understood that AVG(162) / 5 returned the euclidian divide of 162 by 5, that is, 40 (with remainder 2) instead of 40.5.
    So I introduced a 1.0 * in my expression to make it a floating point AVG().

    So I learnt two things while doing this query:

    Let's consider that this rounding error served as the hand-validation of one row (here patient12).

    And thus our grand total of average days between visits should be (3 * 1 * 86.3 + 2 * 2 * 53 + 3 * 47.3 + 4 * 40.5) / 10 = 55.34 as an average delay over all visits.

    Validating against delay number of visits by number of visits

    If I wrap your last SELECT in a CTE then compute the grand total:
    SELECT SUM(1.0 * num_people * avg_days_between) / SUM(num_people) FROM res WHERE current_visit IS NOT NULL;
    I only get 55.0 within the Developer-C 11.1 DB2 fiddle.

    So based on our now long-time experience with AVG(), we get back to your query, put an 1.0 into AVG(1.0 * days_between_visits), and here we go with the fixed results:

    CURRENT_VISIT NEXT_VISIT NUM_PEOPLE AVG_DAYS_BETWEEN
    1 2 7 59.4
    2 3 4 54
    3 4 2 52.5
    4 5 1 38
    null 1 12 null

    (last query in the fiddle)

    Conclusion

    … But after having re-read your question, I saw that you already had averages with decimals.

    While copy-pasting your data and query to the fiddle returned:

    1. integer averages
    2. "random" results for the number of days due to the DATE - DATE rule that I discovered

    Morality: even if your query works in your environment, to make it universal and avoid unexpected side effects in other versions / configurations of DB2, you should apply the 2 rules that I learnt: