I have this table on medical patients:
CREATE TABLE SAMPLE_DATA (
patient_name VARCHAR(50),
year INTEGER,
gender CHAR(1),
patient_weight DECIMAL(5,2),
location VARCHAR(20)
);
INSERT INTO SAMPLE_DATA (patient_name, year, gender, patient_weight, location) VALUES
('Sarah', 2010, 'F', 65.00, 'hospital'),
('Sarah', 2012, 'F', 66.00, 'home'),
('Sarah', 2013, 'F', 67.00, 'hospital'),
('Michael', 2011, 'M', 78.00, 'hospital'),
('Michael', 2013, 'M', 76.00, 'home'),
('Michael', 2015, 'M', 77.00, 'hospital'),
('James', 2010, 'M', 82.00, 'home'),
('James', 2014, 'M', 80.00, 'hospital'),
('Emma', 2012, 'F', 70.00, 'hospital'),
('Emma', 2013, 'F', 71.00, 'home'),
('Emma', 2015, 'F', 71.00, 'hospital'),
('Robert', 2011, 'M', 88.00, 'hospital'),
('Robert', 2014, 'M', 85.00, 'home'),
('Maria', 2010, 'F', 63.00, 'hospital'),
('Maria', 2012, 'F', 64.00, 'home'),
('Maria', 2015, 'F', 64.00, 'hospital');
It looks like this originally:
patient_name | year | gender | patient_weight | location
-------------|------|--------|----------------|----------
Sarah | 2010 | F | 65.00 | hospital
Sarah | 2012 | F | 66.00 | home
Sarah | 2013 | F | 67.00 | hospital
Michael | 2011 | M | 78.00 | hospital
Michael | 2013 | M | 76.00 | home
Michael | 2015 | M | 77.00 | hospital
James | 2010 | M | 82.00 | home
James | 2014 | M | 80.00 | hospital
Emma | 2012 | F | 70.00 | hospital
Emma | 2013 | F | 71.00 | home
Emma | 2015 | F | 71.00 | hospital
Robert | 2011 | M | 88.00 | hospital
Robert | 2014 | M | 85.00 | home
Maria | 2010 | F | 63.00 | hospital
Maria | 2012 | F | 64.00 | home
Maria | 2015 | F | 64.00 | hospital
Desired Result: I want to transform this table such that it shows what happened to the patient between each pair of measurements:
patient_name | start_year | gender | start_weight | years_until_next | location_change
-------------|------------|---------|--------------|------------------|-------------------
Sarah | 2010 | F | 65.00 | 2 | hospital-home
Sarah | 2012 | F | 66.00 | 1 | home-hospital
Michael | 2011 | M | 78.00 | 2 | hospital-home
Michael | 2013 | M | 76.00 | 2 | home-hospital
James | 2010 | M | 82.00 | 4 | home-hospital
Emma | 2012 | F | 70.00 | 1 | hospital-home
Emma | 2013 | F | 71.00 | 2 | home-hospital
Robert | 2011 | M | 88.00 | 3 | hospital-home
Maria | 2010 | F | 63.00 | 2 | hospital-home
Maria | 2012 | F | 64.00 | 3 | home-hospital
I am new to LAG and LEAD functions in SQL, I tried to do the following:
WITH next_measurements AS (
SELECT
patient_name,
year as start_year,
gender,
patient_weight as start_weight,
location as start_location,
LEAD(year) OVER (
PARTITION BY patient_name
ORDER BY year
) as next_year,
LEAD(location) OVER (
PARTITION BY patient_name
ORDER BY year
) as next_location
FROM sample_data
)
SELECT
patient_name,
start_year,
gender,
start_weight,
(next_year - start_year) as years_until_next,
LOWER(start_location) || '-' || LOWER(next_location) as location_change
FROM next_measurements
WHERE next_year IS NOT NULL
ORDER BY patient_name, start_year;
The code seems to work:
patient_name start_year gender start_weight years_until_next location_change
Emma 2012 F 70 1 hospital-home
Emma 2013 F 71 2 home-hospital
James 2010 M 82 4 home-hospital
Maria 2010 F 63 2 hospital-home
Maria 2012 F 64 3 home-hospital
Michael 2011 M 78 2 hospital-home
Michael 2013 M 76 2 home-hospital
Robert 2011 M 88 3 hospital-home
Sarah 2010 F 65 2 hospital-home
Sarah 2012 F 66 1 home-hospital
Is this the correct way to use these functions?
Lag: Used to access data or information from previous rows.
Lead: Used to access data or information from subsequent / following rows.
Both Lag and Lead helps you in accessing data from multiple rows without utilizing self-joins and can be used for comparing your current data with previous or next rows respectively.
Your query is correctly retrieving data from next rows using Lead based on year and location and finding time gaps and location changes.
But I don't see Lag being used any where in your query and I don't think it is needed in your desired results. But if you have used both to get prior changes as well then it would have looked some thing like this:
WITH next_measurements AS (
SELECT
patient_name,
year as start_year,
gender,
patient_weight as start_weight,
location as start_location,
LEAD(year) OVER (
PARTITION BY patient_name
ORDER BY year
) as next_year,
LEAD(location) OVER (
PARTITION BY patient_name
ORDER BY year
) as next_location ,
LAG(year) OVER (
PARTITION BY patient_name
ORDER BY year
) as prior_year,
LAG(location) OVER (
PARTITION BY patient_name
ORDER BY year
) as prior_location
FROM sample_data
)
SELECT
patient_name,
start_year,
gender,
start_weight,
(next_year - start_year) as years_until_next,
LOWER(start_location) || '-' || LOWER(next_location) as location_change ,
(start_year - prior_year) as years_until_prior,
LOWER(prior_location) || '-' || LOWER(start_location) as prior_location_change
FROM next_measurements
WHERE next_year IS NOT NULL
ORDER BY patient_name, start_year;
Official Documentation link: https://www.ibm.com/docs/en/db2/12.1?topic=expressions-olap-specification#sdx-synid_frag-lag-function
Examples: https://www.ibm.com/docs/en/psfa/7.1.0?topic=functions-lag-lead-family-syntax