sqldb2

Find out if a name has more than one distinct value in the same time period


I have this table in R:

CREATE TABLE myt 
(
    name VARCHAR(50),
    start_date DATE,
    end_date DATE,
    var VARCHAR(10)
);

INSERT INTO myt (name, start_date, end_date, var) VALUES
('RED', '2020-01-01', '2021-06-01', 'a'),
('RED', '2021-01-01', '2022-01-01', 'b'),
('RED', '2022-06-01', '2023-01-01', 'a'),

('BLUE', '2020-01-01', '2021-01-01', 'x'),
('BLUE', '2020-06-01', '2021-06-01', 'y'),
('BLUE', '2021-02-01', '2022-01-01', 'x'),

('GREEN', '2020-01-01', '2021-01-01', 'p'),
('GREEN', '2021-01-01', '2022-01-01', 'q'),
('GREEN', '2022-01-01', '2023-01-01', 'p'),

('YELLOW', '2020-01-01', '2023-01-01', 'alpha'),
('YELLOW', '2021-01-01', '2022-01-01', 'beta'),

('PURPLE', '2020-01-01', '2021-06-01', 'same'),
('PURPLE', '2021-01-01', '2022-01-01', 'same'),

('ORANGE', '2020-01-01', '2021-01-01', 'red'),
('ORANGE', '2020-06-01', '2021-06-01', 'blue'),
('ORANGE', '2020-09-01', '2021-03-01', 'green');

I want to see if for any names, var has more than 1 distinct value in the same time period. I want to only display rows where this happens.

I think the result should look like this:

   name   var start_date   end_date
   ---------------------------------
   BLUE     x 2020-01-01 2021-01-01
   BLUE     y 2020-06-01 2021-06-01
   BLUE     x 2021-02-01 2022-01-01
 ORANGE   red 2020-01-01 2021-01-01
 ORANGE  blue 2020-06-01 2021-06-01
 ORANGE green 2020-09-01 2021-03-01
    RED     a 2020-01-01 2021-06-01
    RED     b 2021-01-01 2022-01-01
 YELLOW alpha 2020-01-01 2023-01-01
 YELLOW  beta 2021-01-01 2022-01-01

I tried to do this with a self join:

  SELECT DISTINCT t1.name, t1.var, t1.start_date, t1.end_date
  FROM myt t1
  JOIN myt t2 ON t1.name = t2.name
  WHERE t1.var != t2.var
    AND t1.start_date < t2.end_date 
    AND t1.end_date > t2.start_date
  ORDER BY t1.name, t1.start_date;

Is this the correct way to use a self-join?


Solution

  • If I got it right - you could get your expected result in a way you tried - using self join - just handle the border dates in ON conditions.
    Try it like this:

    Select   m.name, m.var, m.start_date, m.end_date
    From     myt m
    Inner Join myt m2 ON(m2.name = m.name AND 
                        (m2.start_date Between m.start_date+1 And m.end_date-1
                         OR 
                         m.start_date Between m2.start_date+1 And m2.end_date-1
                        )AND
                        m2.var != m.var)
    Group By  m.name, m.var, m.start_date, m.end_date
    Order By m.name, m.start_date
    

    R e s u l t :

    NAME VAR START_DATE END_DATE
    BLUE x 2020-01-01 2021-01-01
    BLUE y 2020-06-01 2021-06-01
    BLUE x 2021-02-01 2022-01-01
    ORANGE red 2020-01-01 2021-01-01
    ORANGE blue 2020-06-01 2021-06-01
    ORANGE green 2020-09-01 2021-03-01
    RED a 2020-01-01 2021-06-01
    RED b 2021-01-01 2022-01-01
    YELLOW alpha 2020-01-01 2023-01-01
    YELLOW beta 2021-01-01 2022-01-01

    fiddle

    NOTE:
    The key here is to define the join that should connect rows with the same name having any overlaping period that includes different values in var column. The +1 and -1 when establishing overlaping periods are here to handle border dates. Without it all three rows with name = GREEN (using the data provided) would be fetched. So, we are selecting same names having any overlaping period for different vars. Group By clause eliminates cross joined duplicate rows - you can use Distinct instead.