Tags: sql, postgresql, performance, database-performance, partition

PostgreSQL - performance of squashing adjacent rows by a distance threshold


The context

Here is a simplified version of my problem.

We have a table called positions that stores the movement of an item across several containers.

Each record contains an id, a container, a date_from and a date_to.

Between two consecutive records there may be "time gaps": i.e. the item is inside container A until 10 am, then it appears in container B at 4 pm, and there is nothing in between.

Here is an example dataset

    ID  container  date_from            date_to
    1   A          2023-10-01T00:00:00  2023-10-01T10:00:00
    2   A          2023-10-03T09:00:00  2023-10-03T11:00:00
    3   B          2023-10-04T02:00:00  2023-10-04T03:00:00
    4   C          2023-10-04T06:00:00  2023-10-04T08:00:00
    5   C          2023-10-05T00:00:00  2023-10-06T10:00:00
    6   A          2023-10-06T11:00:00  2023-10-06T20:00:00
    7   C          2023-10-06T21:00:00  2023-10-07T10:00:00

The requirements

I need to squash all consecutive adjacent positions that

  1. Are in the same container (the item never leaves that container over the subsequence)
  2. And are "close enough" to each other: i.e. for which date_from of the second position is within a certain time threshold from date_to of the previous position.

For each subsequence that I squash, I need to take the first value of date_from and the last value of date_to, and put them together in the same result row.

For instance, if there are 5 consecutive records inside container A, and according to the rule they are close enough to be squashed, the final result row will have the date_from of the first of those records and the date_to of the last one.
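To make the rule concrete, here is a minimal Python sketch of the squashing logic applied to the sample dataset above (this is only an illustration of the requirement, not the SQL; the 5-day threshold is an assumption picked for the example):

```python
from datetime import datetime, timedelta

# Sample rows from the question: (id, container, date_from, date_to)
rows = [
    (1, "A", datetime(2023, 10, 1, 0),  datetime(2023, 10, 1, 10)),
    (2, "A", datetime(2023, 10, 3, 9),  datetime(2023, 10, 3, 11)),
    (3, "B", datetime(2023, 10, 4, 2),  datetime(2023, 10, 4, 3)),
    (4, "C", datetime(2023, 10, 4, 6),  datetime(2023, 10, 4, 8)),
    (5, "C", datetime(2023, 10, 5, 0),  datetime(2023, 10, 6, 10)),
    (6, "A", datetime(2023, 10, 6, 11), datetime(2023, 10, 6, 20)),
    (7, "C", datetime(2023, 10, 6, 21), datetime(2023, 10, 7, 10)),
]

def squash(rows, threshold=timedelta(days=5)):
    """Merge consecutive rows in the same container whose gap is within threshold."""
    result = []
    for _id, container, date_from, date_to in sorted(rows, key=lambda r: r[2]):
        if (result
                and result[-1][0] == container
                and date_from - result[-1][2] <= threshold):
            # Close enough and same container: keep the group's first
            # date_from, extend its date_to to the current row's date_to.
            result[-1] = (container, result[-1][1], date_to)
        else:
            # New container or gap too large: start a new result row.
            result.append((container, date_from, date_to))
    return result

for container, date_from, date_to in squash(rows):
    print(container, date_from, date_to)
```

With that threshold, rows 1-2 collapse into one A row and rows 4-5 into one C row, yielding 5 result rows in total.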

The PostgreSQL query I wrote

    WITH with_next_position AS (
      SELECT
        id,
        container,
        date_from,
        date_to,
        (
          SELECT subquery.id
          FROM positions subquery
          WHERE subquery.date_from > base.date_from
          ORDER BY subquery.date_from ASC
          LIMIT 1
        ) AS next_position_id

      FROM positions base
    ),

    with_time_lapse AS (
      SELECT
        with_next_position.date_from AS date_from,
        with_next_position.date_to AS date_to,
        with_next_position.container AS container,
        CASE
        WHEN join_table.date_from IS NOT NULL
          THEN EXTRACT(EPOCH FROM (join_table.date_from - with_next_position.date_to))
        ELSE
          NULL
        END AS time_lapse,
        join_table.container AS next_container

      FROM
        with_next_position
        FULL OUTER JOIN with_next_position join_table ON join_table.id = with_next_position.next_position_id

      WHERE
        with_next_position.container IS NOT NULL
    ),

    with_marked_to_squash AS (
      SELECT
        date_from,
        date_to,
        container,
        CASE
        WHEN next_container = container AND time_lapse <= 10000000 -- This is where I put the threshold (in seconds)
          THEN TRUE
        ELSE
          FALSE
        END AS to_squash

      FROM with_time_lapse
    ),

    with_marked_first_to_squash AS (
      SELECT
        date_from,
        date_to,
        container,
        CASE
        WHEN to_squash
          THEN (
            SELECT CASE WHEN to_squash THEN FALSE ELSE TRUE END
            FROM with_marked_to_squash subquery
            WHERE subquery.date_from < with_marked_to_squash.date_from
            ORDER BY subquery.date_from DESC
            LIMIT 1
          )
        ELSE
          FALSE
        END AS first_to_squash

      FROM with_marked_to_squash
    ),

    with_first_to_squash AS (
      SELECT
        date_from,
        date_to,
        container,
        (
          SELECT subquery.date_from
          FROM with_marked_first_to_squash subquery
          WHERE subquery.date_from < with_marked_first_to_squash.date_from AND first_to_squash IS TRUE
          ORDER BY subquery.date_from DESC
          LIMIT 1
        ) AS first_date_in_position

      FROM with_marked_first_to_squash

      WHERE to_squash IS FALSE
    )

    SELECT
      COALESCE(first_date_in_position, date_from) AS date_from,
      date_to,
      container,
      EXTRACT(EPOCH FROM (date_to - COALESCE(first_date_in_position, date_from))) AS time_spent

    FROM with_first_to_squash

    ORDER BY date_from

The performance problem

The query above is correct: it does what I expect it to do. However, a performance problem arises in the with_first_to_squash subquery. If I cut the query off just before with_first_to_squash, performance improves dramatically.

I think the cause of the performance issue is that, by running with_marked_first_to_squash and with_first_to_squash back to back, I am making the database engine run two nested loops over the table, one inside the other.

The moment I remove the second subquery, things get blazing fast.

I am sure there is a solution that would allow me to extract date_from from the first position in each subsequence, probably something involving partitions, but I am not familiar with partitions or their syntax. Could anybody give me a hint?


Solution

  • I suspect the subqueries in your select lists are what is killing your performance.

    Please try the following window function solution to your gaps-and-islands problem as it needs to sort only once:

    with squashes as (
      select *,
             -- keep_me is true when a row starts a new group:
             -- different container, or too far from the previous row
             case
               when     container = lag(container) over w
                    and date_from - lag(date_to) over w <= interval '5 days' then false 
               else true
             end as keep_me
        from positions
      window w as (order by date_from)
    ), islands as (
      -- a running count of the group-start flags numbers the islands
      select *, sum(keep_me::int) over (order by date_from) as group_num
        from squashes
    )
    select container, min(date_from) as date_from, max(date_to) as date_to
      from islands
     group by group_num, container
     order by group_num;
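For readers less familiar with window functions, the two CTEs can be sketched step by step in plain Python (a rough illustration only, using the question's sample rows and the answer's 5-day threshold):

```python
from datetime import datetime, timedelta

# (container, date_from, date_to), already sorted by date_from
rows = [
    ("A", datetime(2023, 10, 1, 0),  datetime(2023, 10, 1, 10)),
    ("A", datetime(2023, 10, 3, 9),  datetime(2023, 10, 3, 11)),
    ("B", datetime(2023, 10, 4, 2),  datetime(2023, 10, 4, 3)),
    ("C", datetime(2023, 10, 4, 6),  datetime(2023, 10, 4, 8)),
    ("C", datetime(2023, 10, 5, 0),  datetime(2023, 10, 6, 10)),
    ("A", datetime(2023, 10, 6, 11), datetime(2023, 10, 6, 20)),
    ("C", datetime(2023, 10, 6, 21), datetime(2023, 10, 7, 10)),
]

threshold = timedelta(days=5)

# Step 1 (the "squashes" CTE): keep_me is True when a row starts a new island,
# i.e. lag() would show a different container or a gap larger than the threshold.
keep_me = []
for i, (container, date_from, _) in enumerate(rows):
    prev = rows[i - 1] if i > 0 else None
    if prev and prev[0] == container and date_from - prev[2] <= threshold:
        keep_me.append(False)
    else:
        keep_me.append(True)

# Step 2 (the "islands" CTE): a running sum of the flags numbers the islands,
# so every row in the same island shares the same group_num.
group_num, groups = 0, []
for flag in keep_me:
    group_num += int(flag)
    groups.append(group_num)

print(groups)  # → [1, 1, 2, 3, 3, 4, 5]
```

Grouping by group_num and taking min(date_from) / max(date_to) then produces the squashed rows, and the whole thing needs only the single sort implied by order by date_from.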
    

    Working fiddle