sqldb2

Using Lag and Lead functions in SQL


I have this table on medical patients:

CREATE TABLE SAMPLE_DATA (
    patient_name VARCHAR(50),
    
    year INTEGER,
    
    gender CHAR(1),
    
    patient_weight DECIMAL(5,2),
    
    location VARCHAR(20)
);

INSERT INTO SAMPLE_DATA (patient_name, year, gender, patient_weight, location) VALUES
    ('Sarah', 2010, 'F', 65.00, 'hospital'),
    ('Sarah', 2012, 'F', 66.00, 'home'),
    ('Sarah', 2013, 'F', 67.00, 'hospital'),
    
    ('Michael', 2011, 'M', 78.00, 'hospital'),
    ('Michael', 2013, 'M', 76.00, 'home'),
    ('Michael', 2015, 'M', 77.00, 'hospital'),
    
    ('James', 2010, 'M', 82.00, 'home'),
    ('James', 2014, 'M', 80.00, 'hospital'),
    
    ('Emma', 2012, 'F', 70.00, 'hospital'),
    ('Emma', 2013, 'F', 71.00, 'home'),
    ('Emma', 2015, 'F', 71.00, 'hospital'),
    
    ('Robert', 2011, 'M', 88.00, 'hospital'),
    ('Robert', 2014, 'M', 85.00, 'home'),
    
    ('Maria', 2010, 'F', 63.00, 'hospital'),
    ('Maria', 2012, 'F', 64.00, 'home'),
    ('Maria', 2015, 'F', 64.00, 'hospital');

It looks like this originally:

patient_name | year | gender | patient_weight | location
-------------|------|--------|----------------|----------
Sarah        | 2010 | F      | 65.00         | hospital
Sarah        | 2012 | F      | 66.00         | home
Sarah        | 2013 | F      | 67.00         | hospital
Michael      | 2011 | M      | 78.00         | hospital
Michael      | 2013 | M      | 76.00         | home
Michael      | 2015 | M      | 77.00         | hospital
James        | 2010 | M      | 82.00         | home
James        | 2014 | M      | 80.00         | hospital
Emma         | 2012 | F      | 70.00         | hospital
Emma         | 2013 | F      | 71.00         | home
Emma         | 2015 | F      | 71.00         | hospital
Robert       | 2011 | M      | 88.00         | hospital
Robert       | 2014 | M      | 85.00         | home
Maria        | 2010 | F      | 63.00         | hospital
Maria        | 2012 | F      | 64.00         | home
Maria        | 2015 | F      | 64.00         | hospital

Desired Result: I want to transform this table such that it shows what happened to the patient between each pair of measurements:

 patient_name | start_year | gender | start_weight | years_until_next | location_change    
-------------|------------|---------|--------------|------------------|-------------------
Sarah        | 2010       | F       | 65.00       | 2               | hospital-home     
Sarah        | 2012       | F       | 66.00       | 1               | home-hospital     
Michael      | 2011       | M       | 78.00       | 2               | hospital-home     
Michael      | 2013       | M       | 76.00       | 2               | home-hospital     
James        | 2010       | M       | 82.00       | 4               | home-hospital     
Emma         | 2012       | F       | 70.00       | 1               | hospital-home     
Emma         | 2013       | F       | 71.00       | 2               | home-hospital     
Robert       | 2011       | M       | 88.00       | 3               | hospital-home     
Maria        | 2010       | F       | 63.00       | 2               | hospital-home     
Maria        | 2012       | F       | 64.00       | 3               | home-hospital

I am new to LAG and LEAD functions in SQL, I tried to do the following:

WITH next_measurements AS (
    SELECT 
        patient_name,
        year as start_year,
        gender,
        patient_weight as start_weight,
        location as start_location,
        LEAD(year) OVER (
            PARTITION BY patient_name 
            ORDER BY year
        ) as next_year,
        LEAD(location) OVER (
            PARTITION BY patient_name 
            ORDER BY year
        ) as next_location
    FROM sample_data
)
SELECT 
    patient_name,
    start_year,
    gender,
    start_weight,
    (next_year - start_year) as years_until_next,
    LOWER(start_location) || '-' || LOWER(next_location) as location_change
FROM next_measurements
WHERE next_year IS NOT NULL
ORDER BY patient_name, start_year;

The code seems to work:

 patient_name start_year gender start_weight years_until_next location_change
         Emma       2012      F           70                1   hospital-home
         Emma       2013      F           71                2   home-hospital
        James       2010      M           82                4   home-hospital
        Maria       2010      F           63                2   hospital-home
        Maria       2012      F           64                3   home-hospital
      Michael       2011      M           78                2   hospital-home
      Michael       2013      M           76                2   home-hospital
       Robert       2011      M           88                3   hospital-home
        Sarah       2010      F           65                2   hospital-home
        Sarah       2012      F           66                1   home-hospital

Is this the correct way to use these functions?


Solution

  • Lag: Used to access data or information from previous rows.

    Lead: Used to access data or information from subsequent / following rows.

    Both Lag and Lead helps you in accessing data from multiple rows without utilizing self-joins and can be used for comparing your current data with previous or next rows respectively.

    Your query is correctly retrieving data from next rows using Lead based on year and location and finding time gaps and location changes.

    But I don't see Lag being used any where in your query and I don't think it is needed in your desired results. But if you have used both to get prior changes as well then it would have looked some thing like this:

    WITH next_measurements AS (
        SELECT 
            patient_name,
            year as start_year,
            gender,
            patient_weight as start_weight,
            location as start_location,
            LEAD(year) OVER (
                PARTITION BY patient_name 
                ORDER BY year
            ) as next_year,
            LEAD(location) OVER (
                PARTITION BY patient_name 
                ORDER BY year
            ) as next_location ,
            LAG(year) OVER (
                PARTITION BY patient_name 
                ORDER BY year
            ) as prior_year,
            LAG(location) OVER (
                PARTITION BY patient_name 
                ORDER BY year
            ) as prior_location
    
        FROM sample_data
    )
    SELECT 
        patient_name,
        start_year,
        gender,
        start_weight,
        (next_year - start_year) as years_until_next,
        LOWER(start_location) || '-' || LOWER(next_location) as location_change ,
    
        (start_year - prior_year) as years_until_prior,
        LOWER(prior_location) || '-' || LOWER(start_location) as prior_location_change 
    
    FROM next_measurements
    WHERE next_year IS NOT NULL
    ORDER BY patient_name, start_year;
    

    Official Documentation link: https://www.ibm.com/docs/en/db2/12.1?topic=expressions-olap-specification#sdx-synid_frag-lag-function

    Examples: https://www.ibm.com/docs/en/psfa/7.1.0?topic=functions-lag-lead-family-syntax