Here is a simplified version of my problem
We got a table called positions
, that stores the movement of an item across several containers.
Each record contains
container
)date_from
and date_to
, that contain the timestamps of when the item entered and left the containerBetween two consecutive records there may be "time gaps". I.e. the item is inside container A until 10am, then it appears in container B at 4pm, and there is nothing in between.
Here is an example dataset
ID | container |
date_from |
date_to |
---|---|---|---|
1 | A | 2023-10-01T00:00:00 | 2023-10-01T10:00:00 |
2 | A | 2023-10-03T09:00:00 | 2023-10-03T11:00:00 |
3 | B | 2023-10-04T02:00:00 | 2023-10-04T03:00:00 |
4 | C | 2023-10-04T06:00:00 | 2023-10-04T08:00:00 |
5 | C | 2023-10-05T00:00:00 | 2023-10-06T10:00:00 |
6 | A | 2023-10-06T11:00:00 | 2023-10-06T20:00:00 |
7 | C | 2023-10-06T21:00:00 | 2023-10-07T10:00:00 |
I need to squash all consecutive adjacent positions that
date_from
of the second position is within a certain time threshold from date_to
of the previous position.For each subsequence that I squash, I need to take the first value of date_from
and the last value of date_to
, and put them together in the same result row.
For instance, if there are 5 consecutive records inside container A, and according to the rule they are close enough to be squashed, the final row in which I squash those positions will have
container
= Adate_from
taken from the first of the 5 positions I squasheddate_to
taken from the last of the 5 positions WITH with_next_position AS (
SELECT
id,
container,
date_from,
date_to,
(
SELECT subquery.id
FROM positions subquery
WHERE subquery.date_from > base.date_from
ORDER BY subquery.date_from ASC
LIMIT 1
) AS next_position_id
FROM positions
),
with_time_lapse AS (
SELECT
with_next_position.date_from AS date_from,
with_next_position.date_to AS date_from,
with_next_position.container AS container,
CASE
WHEN join_table.date_from IS NOT NULL
THEN EXTRACT(EPOCH FROM (join_table.date_from - with_next_position.date_to))
ELSE
NULL
END AS time_lapse,
join_table.marina_id AS next_container
FROM
with_next_position
FULL OUTER JOIN with_next_position join_table ON join_table.id = with_next_position.next_position_id
WHERE
with_next_position.container IS NOT NULL
),
with_marked_to_squash AS (
SELECT
date_from,
date_to,
container,
CASE
WHEN next_container = container AND time_lapse <= 10000000 # This is where I put the threshold
THEN TRUE
ELSE
FALSE
END AS to_squash
FROM with_time_lapse
)
with_marked_first_to_squash AS (
SELECT
date_from,
date_to,
container,
CASE
WHEN to_squash
THEN (
SELECT CASE WHEN to_squash THEN FALSE ELSE TRUE END
FROM with_marked_to_squash subquery
WHERE subquery.date_from < with_marked_to_squash.date_from
ORDER BY subquery.date_from DESC
LIMIT 1
)
ELSE
FALSE
END AS first_to_squash
FROM with_marked_to_squash
),
with_first_to_squash AS (
SELECT
date_from,
date_to,
container,
(
SELECT subquery.date_from
FROM with_marked_first_to_squash subquery
WHERE subquery.date_from < with_marked_first_to_squash.date_from AND first_to_squash IS TRUE
ORDER BY subquery.date_from DESC
LIMIT 1
) AS first_date_in_position
FROM with_marked_first_to_squash
WHERE to_squash IS FALSE
)
SELECT
COALESCE(first_date_in_position, date_from) AS date_from,
date_to,
container
EXTRACT(EPOCH FROM (date_to - COALESCE(first_date_in_position, date_from))) AS time_spent
FROM with_first_to_squash
ORDER BY date_from
The query above is correct, it does what I expect it to do. However, a performance problem arises when extracting the subquery with_first_to_squash
. If I chop the query until BEFORE with_first_to_squash
, the performance improves exponentially.
I think the cause of the performance issue is that, by running consecutively with_marked_first_to_squash
and with_first_to_squash
, I am making the database engine go over two nested loops:
with_marked_first_to_squash
)date_from
In the moment in which I remove the second subquery, things get blazing fast.
I am sure that there is a solution that would allow to extract date_from
from the first position in the subsequence, probably something involving partitions, but I am not familiar with partitions and with their syntax. Is there anybody who could give me a hint?
I suspect the subqueries in your select
lists are what is killing your performance.
Please try the following window function solution to your gaps-and-islands problem as it needs to sort only once:
with squashes as (
select *,
case
when container = lag(container) over w
and date_from - lag(date_to) over w <= interval '5 days' then false
else true
end as keep_me
from positions
window w as (order by date_from)
), islands as (
select *, sum(keep_me::int) over (order by date_from) as group_num
from squashes
)
select container, min(date_from) as date_from, max(date_to) as date_to
from islands
group by group_num, container
order by group_num;
Working fiddle