sqloracle-database

How to group by changes in one column and order other columns


I have a table that is similar to this:

NAME POSITION START_DATE END_DATE OFFICE
John Doe Technician 01-JAN-2018 31-DEC-2019 Floor 7
John Doe Technician 01-JAN-2020 31-AUG-2020 Floor 8
John Doe Technician 01-SEP-2020 31-DEC-2020 Floor 9
John Doe Senior Welder 01-JAN-2021 28-FEB-2025 Floor 9
John Doe Technician 01-MAR-2025 31-DEC-2025 Floor 7

Now I am trying to write a query to order by changes in the POSITION column but orders in the sequence of changes to POSITION and also modifies the start and end dates based on the minimum and maximum. So the end result would be:

NAME POSITION START_DATE END_DATE
John Doe Technician 01-JAN-2018 31-AUG-2020
John Doe Senior Welder 01-JAN-2021 28-FEB-2025
John Doe Technician 01-MAR-2025 31-DEC-2025

I am stuck at:

select
name, position, min(start_date), max(end_date)
from employee_table
group by name, position 

but unsure how to proceed

Many Thanks


Solution

  • Theory

    Analytic functions

    The easiest way for a row to know its neighbours is to use an analytic function (also called window function, because they apply to a shifting window):
    they are computed internally by the database server when it has started accumulated the rows that will make up the result set, after the where but before the group by, select, and order by,
    and allow each result row to have a value depending on its neighbors, for example you could easily return the previous position with lag(position) over (partition by name order by start_date) as previous_position:

    NAME POSITION PREVIOUS_POSITION
    John Doe Technician null
    John Doe Technician Technician
    John Doe Technician Technician
    John Doe Senior Welder Technician
    John Doe Technician Senior Welder

    Note that the order by of a window is totally independent from the order by of the query as a whole;
    thus you could definitely compute your windows with a chronological order, but display the end result in reverse chronological order.
    But this implies you must explicitely declare an order by in windows (at least in order-based analytic functions: sum() does not need one), you cannot rely on the query's one.

    CTE (Common Table Expression)

    So we'll want to group by "some kind of group number computed by comparing the row's position with the previous one".
    But this makes two analytic functions (first comparing with the previous, then using it to get a group number),
    and SQL forbids nested windows in a select.

    We will have to use one select per step of our groups numbering:

    PostgreSQL's documentation is a great introduction,
    but the general structure of CTEs is:

    with
    cte1 as (select … from employee_table),
    cte2 as (select … from cte1),
    cte3 as (select … from cte2 join cte1 on … join employee_table on …) -- you can join CTEs from any level, and of course real tables too.
    select … from cte3 join cte1 on …; -- And there comes our main query.
    

    Let's go

    with
    -- Detect if each position is new compared to the previous one.
    -- Using the lag() analytic function we can compare the current row's position to the position of its predecessor row.
    newpos as
    (
        select
            e.*,
            -- In fact do it in reverse logic: detect if the position is _not_ new compared to the previous, so that the first position (which has no previous) does not get null.
            case when lag(position) over (partition by name order by start_date) = position then 0 else 1 end newpos
        from employee_table e
    ),
    -- Label each row with its group identifier: we can compute the group ID as "the count of all new positions until now" 
    posgroup as
    (
        select
            newpos.*,
            -- We will use an implicit feature of analytic functions: if they are given an order by, they compute from the first row of the group (partition) until the current row (instead of over the whole partition).
            sum(newpos) over (partition by name order by start_date) gid
        from newpos
    )
    -- And now our main query only has to group by name and group id
    select name, position, min(start_date) start_date, max(end_date) end_date
    from posgroup
    group by name, gid, position
    order by name, gid;
    
    NAME POSITION START_DAT END_DATE
    John Doe Technician 01-JAN-18 31-DEC-20
    John Doe Senior Welder 01-JAN-21 28-FEB-25
    John Doe Technician 01-MAR-25 31-DEC-25

    (as running in a fiddle)