I have a table that is similar to this:
NAME | POSITION | START_DATE | END_DATE | OFFICE |
---|---|---|---|---|
John Doe | Technician | 01-JAN-2018 | 31-DEC-2019 | Floor 7 |
John Doe | Technician | 01-JAN-2020 | 31-AUG-2020 | Floor 8 |
John Doe | Technician | 01-SEP-2020 | 31-DEC-2020 | Floor 9 |
John Doe | Senior Welder | 01-JAN-2021 | 28-FEB-2025 | Floor 9 |
John Doe | Technician | 01-MAR-2025 | 31-DEC-2025 | Floor 7 |
Now I am trying to write a query to order by changes in the POSITION column but orders in the sequence of changes to POSITION and also modifies the start and end dates based on the minimum and maximum. So the end result would be:
NAME | POSITION | START_DATE | END_DATE |
---|---|---|---|
John Doe | Technician | 01-JAN-2018 | 31-AUG-2020 |
John Doe | Senior Welder | 01-JAN-2021 | 28-FEB-2025 |
John Doe | Technician | 01-MAR-2025 | 31-DEC-2025 |
I am stuck at:
select
name, position, min(start_date), max(end_date)
from employee_table
group by name, position
but unsure how to proceed
Many Thanks
The easiest way for a row to know its neighbours is to use an analytic function (also called window function, because they apply to a shifting window):
they are computed internally by the database server when it has started accumulated the rows that will make up the result set, after the where
but before the group by
, select
, and order by
,
and allow each result row to have a value depending on its neighbors, for example you could easily return the previous position with lag(position) over (partition by name order by start_date) as previous_position
:
NAME | POSITION | PREVIOUS_POSITION |
---|---|---|
John Doe | Technician | null |
John Doe | Technician | Technician |
John Doe | Technician | Technician |
John Doe | Senior Welder | Technician |
John Doe | Technician | Senior Welder |
Note that the order by
of a window is totally independent from the order by
of the query as a whole;
thus you could definitely compute your windows with a chronological order, but display the end result in reverse chronological order.
But this implies you must explicitely declare an order by
in windows (at least in order-based analytic functions: sum()
does not need one), you cannot rely on the query's one.
So we'll want to group by "some kind of group number computed by comparing the row's position
with the previous one".
But this makes two analytic functions (first comparing with the previous, then using it to get a group number),
and SQL forbids nested windows in a select
.
We will have to use one select
per step of our groups numbering:
select
select
, allows an intermediate select
to be reused as if it was a table, but adds:
select
s is stacked CTEsselect
can access a CTE of any level (due to the stacked instead of nested nature)PostgreSQL's documentation is a great introduction,
but the general structure of CTEs is:
with
cte1 as (select … from employee_table),
cte2 as (select … from cte1),
cte3 as (select … from cte2 join cte1 on … join employee_table on …) -- you can join CTEs from any level, and of course real tables too.
select … from cte3 join cte1 on …; -- And there comes our main query.
with
-- Detect if each position is new compared to the previous one.
-- Using the lag() analytic function we can compare the current row's position to the position of its predecessor row.
newpos as
(
select
e.*,
-- In fact do it in reverse logic: detect if the position is _not_ new compared to the previous, so that the first position (which has no previous) does not get null.
case when lag(position) over (partition by name order by start_date) = position then 0 else 1 end newpos
from employee_table e
),
-- Label each row with its group identifier: we can compute the group ID as "the count of all new positions until now"
posgroup as
(
select
newpos.*,
-- We will use an implicit feature of analytic functions: if they are given an order by, they compute from the first row of the group (partition) until the current row (instead of over the whole partition).
sum(newpos) over (partition by name order by start_date) gid
from newpos
)
-- And now our main query only has to group by name and group id
select name, position, min(start_date) start_date, max(end_date) end_date
from posgroup
group by name, gid, position
order by name, gid;
NAME | POSITION | START_DAT | END_DATE |
---|---|---|---|
John Doe | Technician | 01-JAN-18 | 31-DEC-20 |
John Doe | Senior Welder | 01-JAN-21 | 28-FEB-25 |
John Doe | Technician | 01-MAR-25 | 31-DEC-25 |
(as running in a fiddle)