I have this table in R:
CREATE TABLE myt
(
name VARCHAR(50),
start_date DATE,
end_date DATE,
var VARCHAR(10)
);
INSERT INTO myt (name, start_date, end_date, var) VALUES
('RED', '2020-01-01', '2021-06-01', 'a'),
('RED', '2021-01-01', '2022-01-01', 'b'),
('RED', '2022-06-01', '2023-01-01', 'a'),
('BLUE', '2020-01-01', '2021-01-01', 'x'),
('BLUE', '2020-06-01', '2021-06-01', 'y'),
('BLUE', '2021-02-01', '2022-01-01', 'x'),
('GREEN', '2020-01-01', '2021-01-01', 'p'),
('GREEN', '2021-01-01', '2022-01-01', 'q'),
('GREEN', '2022-01-01', '2023-01-01', 'p'),
('YELLOW', '2020-01-01', '2023-01-01', 'alpha'),
('YELLOW', '2021-01-01', '2022-01-01', 'beta'),
('PURPLE', '2020-01-01', '2021-06-01', 'same'),
('PURPLE', '2021-01-01', '2022-01-01', 'same'),
('ORANGE', '2020-01-01', '2021-01-01', 'red'),
('ORANGE', '2020-06-01', '2021-06-01', 'blue'),
('ORANGE', '2020-09-01', '2021-03-01', 'green');
I want to see if for any names, var has more than 1 distinct value in the same time period. I want to only display rows where this happens.
I think the result should look like this:
name var start_date end_date
---------------------------------
BLUE x 2020-01-01 2021-01-01
BLUE y 2020-06-01 2021-06-01
BLUE x 2021-02-01 2022-01-01
ORANGE red 2020-01-01 2021-01-01
ORANGE blue 2020-06-01 2021-06-01
ORANGE green 2020-09-01 2021-03-01
RED a 2020-01-01 2021-06-01
RED b 2021-01-01 2022-01-01
YELLOW alpha 2020-01-01 2023-01-01
YELLOW beta 2021-01-01 2022-01-01
I tried to do this with a self join:
SELECT DISTINCT t1.name, t1.var, t1.start_date, t1.end_date
FROM myt t1
JOIN myt t2 ON t1.name = t2.name
WHERE t1.var != t2.var
AND t1.start_date < t2.end_date
AND t1.end_date > t2.start_date
ORDER BY t1.name, t1.start_date;
Is this the correct way to use a self-join?
If I got it right - you could get your expected result in a way you tried - using self join - just handle the border dates in ON conditions.
Try it like this:
Select m.name, m.var, m.start_date, m.end_date
From myt m
Inner Join myt m2 ON(m2.name = m.name AND
(m2.start_date Between m.start_date+1 And m.end_date-1
OR
m.start_date Between m2.start_date+1 And m2.end_date-1
)AND
m2.var != m.var)
Group By m.name, m.var, m.start_date, m.end_date
Order By m.name, m.start_date
R e s u l t :
NAME | VAR | START_DATE | END_DATE |
---|---|---|---|
BLUE | x | 2020-01-01 | 2021-01-01 |
BLUE | y | 2020-06-01 | 2021-06-01 |
BLUE | x | 2021-02-01 | 2022-01-01 |
ORANGE | red | 2020-01-01 | 2021-01-01 |
ORANGE | blue | 2020-06-01 | 2021-06-01 |
ORANGE | green | 2020-09-01 | 2021-03-01 |
RED | a | 2020-01-01 | 2021-06-01 |
RED | b | 2021-01-01 | 2022-01-01 |
YELLOW | alpha | 2020-01-01 | 2023-01-01 |
YELLOW | beta | 2021-01-01 | 2022-01-01 |
NOTE:
The key here is to define the join that should connect rows with the same name having any overlaping period that includes different values in var column. The +1 and -1 when establishing overlaping periods are here to handle border dates. Without it all three rows with name = GREEN (using the data provided) would be fetched. So, we are selecting same names having any overlaping period for different vars. Group By clause eliminates cross joined duplicate rows - you can use Distinct instead.