I have a subscription table with 4 fields: id, customer_id, start_date and end_date. It lists all subscriptions of my customers. No subscription has an empty end_date, and a customer can have several subscriptions at once.
For example, a customer id 37 can have the following subscriptions:
id | customer_id | start_at | end_at |
---|---|---|---|
44 | 37 | 2019-03-21 | 2019-03-21 |
17819 | 37 | 2020-03-23 | 2020-03-23 |
22302 | 37 | 2020-04-24 | 2021-07-25 |
42213 | 37 | 2021-04-25 | 2023-04-26 |
92013 | 37 | 2023-04-26 | 2024-04-26 |
These records mean that customer 37 was a subscriber on 2019-03-21, then on 2020-03-23, then from 2020-04-24 to 2024-04-26, for a total of 1463 days.
I am trying to write a query to get the number of days each customer has been a subscriber in a given period. Customer 37 has been a subscriber 365 days in 2023. Subscriptions can overlap, as a subscriber can have multiple subscriptions at once.
The result of the query should be something like:
customer_id | total_subscription_days |
---|---|
37 | 1463 |
38 | 526 |
39 | 426 |
40 | 365 |
41 | 325 |
My database is running on MySQL 8.2.12.
I tried using lag, lead, CTEs, least and greatest, to no avail. I tried chatgpt and stackoverflow.
EDIT: here below is what I tried so far:
1st try:
SELECT
customer_id,
SUM(DATEDIFF(
LEAST(end_at, '2023-12-31'),
GREATEST(start_at, '2023-01-01')
) + 1) AS total_subscription_days
FROM (
SELECT
customer_id,
start_at,
end_at
FROM
subscription
WHERE
start_at <= '2023-12-31' AND end_at >= '2023-01-01'
UNION ALL
SELECT
s1.customer_id,
LEAD(s1.end_at) OVER (PARTITION BY s1.customer_id ORDER BY s1.end_at),
'2023-12-31'
FROM
subscription s1
LEFT JOIN
subscription s2 ON s1.customer_id = s2.customer_id
AND s1.end_at < s2.start_at
WHERE
s2.start_at IS NOT NULL
) AS merged_subscriptions
GROUP BY
customer_id;
Even though I wanted to know subscription days in 2023, I had results bigger than 365 days. So it seems to count duplicates because of the join.
2nd try:
WITH subscription_periods AS (
SELECT
customer_id,
start_at,
end_at,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY start_at) AS period_number
FROM
subscription
WHERE
start_at <= '2023-12-31' AND end_at >= '2023-01-01' AND customer_id < 100
),
subscription_days AS (
SELECT
customer_id,
SUM(
DATEDIFF(
LEAST(end_at, '2023-12-31'),
GREATEST(start_at, '2023-01-01')
) + 1
) AS days
FROM
(
SELECT
customer_id,
start_at,
LEAD(end_at) OVER (PARTITION BY customer_id ORDER BY start_at) AS end_at
FROM
subscription_periods
) AS overlapping_periods
WHERE
end_at >= '2023-01-01'
GROUP BY
customer_id
)
SELECT
customer_id,
SUM(days) AS total_subscription_days
FROM
subscription_days
GROUP BY
customer_id;
I restricted to the first 100 customers, otherwise I was getting a 504 error. This query seems to not be able to take into accounts gaps between subscription. For a customer having been subscribed from 2023-01-01 to 2023-04-01 and then from 2023-05-01 to 2023-08-01, it seems to be counting days between 2023-01-01 and 2023-08-01.
It is kind of a merge overlapping intervals problem:
set @date1 = '2023-01-01';
set @date2 = '2023-12-31';
with cte1 as (
select customer_id, start_at, end_at
from t
where start_at <= @date2 and end_at >= @date1
), cte2 as (
select *, case when start_at <= max(end_at) over (partition by customer_id order by start_at rows between unbounded preceding and 1 preceding) then 0 else 1 end as newgrp
from cte1
), cte3 as (
select *, sum(newgrp) over (partition by customer_id order by start_at) as grpnum
from cte2
)
select
customer_id,
greatest(min(start_at), @date1) as date1,
least(max(end_at), @date2) as date2,
datediff(
least(max(end_at), @date2),
greatest(min(start_at), @date1)
) + 1 as diff
from cte3
group by customer_id, grpnum
DB<>Fiddle - returns total for each island
DB<>Fiddle - returns total for all islands