I have a large database where rows can be linked to individuals, one or more rows can be identified as belonging to the same event in addition one individual might have more than one event. I can’t find the correct way to declare what data items to return and correctly specify date filters around my selection. I’m using an SQL select statement which is actually running embedded within an R script but I’m not sure that should matter - I am pretty out of my comfort zone using SQL.
For each event_id I need to know the distinct values for individual, event_id, first start_date, last end_date, first type, first sample, minimum age – there are a few fields the dataset must be sorted by to ensure the correct order to extract first records.
I want to apply the following filters to my dataset where the first ‘type’ is between 20-22 or 30-39, minimum age >=65 and the maximum end_date was between ‘1 April 2019’ and ’31 March 2020’ It’s the data filter that is causing me the issue.
below is an example of my dataset
individual | event_id | start_date | end_date | type | sample | age | sort1 | sort2 | sort3 |
---|---|---|---|---|---|---|---|---|---|
A | 6 | 10/11/2014 | 10/11/2014 | 10 | A | 82 | 0 | 1 | 10 |
A | 8 | 21/02/2019 | 27/02/2019 | 20 | B | 85 | 2 | 0 | 12 |
A | 8 | 27/02/2019 | 06/04/2019 | 10 | B | 85 | 2 | 1 | 13 |
B | 19 | 22/06/2015 | 22/06/2015 | 10 | B | 64 | 0 | 1 | 110 |
B | 20 | 30/03/2019 | 01/04/2019 | 10 | C | 68 | 0 | 0 | 120 |
B | 20 | 01/04/2019 | 05/04/2019 | 10 | C | 68 | 2 | 0 | 130 |
B | 20 | 05/04/2019 | 20/04/2019 | 10 | C | 68 | 2 | 1 | 140 |
B | 5 | 10/04/2000 | 20/04/2000 | 30 | E | 49 | 0 | 1 | 190 |
C | 17 | 01/03/2018 | 23/03/2018 | 30 | A | 80 | 0 | 1 | 220 |
C | 17 | 23/03/2018 | 04/04/2019 | 10 | B | 81 | 0 | 1 | 230 |
expected result
individual | event_id | start_date | end_date | type | sample | age |
---|---|---|---|---|---|---|
A | 8 | 21/02/2019 | 06/04/2019 | 20 | B | 85 |
C | 17 | 01/03/2018 | 04/04/2019 | 30 | A | 80 |
I can almost achieve what I want using the SQL selection below however I can’t figure out how/where to also add a filter within the date range based on max(end_date).
SELECT distinct individual, event_id,
min(start_date) OVER (PARTITION BY individual, event_id) dstart,
max(end_date) OVER (PARTITION BY individual, event_id) dend,
min(age) OVER (PARTITION BY individual, event_id) age,
FIRST_VALUE(sample) OVER (PARTITION BY individual, event_id ORDER BY individual, start_date, end_date, sort1, sort2,sort3) sample,
FIRST_VALUE(type) OVER (PARTITION BY individual, event_id ORDER BY individual, start_date, end_date, sort1, sort2,sort3) adm_type
FROM ANALYSIS.TABLE z
WHERE
exists (
select * from ANALYSIS.TABLE where individual=z.individual and event_id=z.event_id
and (type between '20' and '22' or type between '30' and '39'))
AND AGE >64
ORDER BY individual, event_id
Any suggestions would be most appreciated
Some of your filters apply to the calculated values, and when using an over()
clause you cannot use those in the same "level" of where clause. That is, you need to "nest" the query one layer down, and then you can refer to those ccalculated columns by their column alias e.g. age > 64
SELECT
*
from (
SELECT
individual
, event_id
, row_number() OVER (PARTITION BY individual, event_id) rm
, min(start_date) OVER (PARTITION BY individual, event_id) dstart
, max(end_date) OVER (PARTITION BY individual, event_id) dend
, min(age) OVER (PARTITION BY individual, event_id) age
, FIRST_VALUE(sample) OVER (PARTITION BY individual, event_id
ORDER BY individual, start_date, end_date, sort1, sort2, sort3) sample
, FIRST_VALUE(type) OVER (PARTITION BY individual, event_id
ORDER BY individual, start_date, end_date, sort1, sort2, sort3) adm_type
FROM ANALYSIS.TABLE z
WHERE EXISTS (
select *
from ANALYSIS.TABLE
where individual = z.individual
and event_id = z.event_id
and (type between '20' and '22' or type between '30' and '39')
)
) d
WHERE rn = 1
AND AGE > 64
AND dend BETWEEN '2019-04-01' and '2020-03-31'
ORDER BY
individual
, event_id
I have left the inner query to use the where exists
subquery because I'm not sure if you want the start/end dates or age restricted by only those types, or if want the start/end dates and age restricted by those who have at least participated in those types.
Also, I'm not sure that you really need "select distinct" but I have used row_number() over()
as an alternative method to ensure you only get one row per individual, event_id. I suspect this will be more efficient than sorting all columns of the result to decide if there are duplicates.