sqlselectaggregate-functionsdatefilter

Filtering on date range defined using SQL aggregate function


I have a large database where rows can be linked to individuals, one or more rows can be identified as belonging to the same event in addition one individual might have more than one event. I can’t find the correct way to declare what data items to return and correctly specify date filters around my selection. I’m using an SQL select statement which is actually running embedded within an R script but I’m not sure that should matter - I am pretty out of my comfort zone using SQL.

For each event_id I need to know the distinct values for individual, event_id, first start_date, last end_date, first type, first sample, minimum age – there are a few fields the dataset must be sorted by to ensure the correct order to extract first records.

I want to apply the following filters to my dataset where the first ‘type’ is between 20-22 or 30-39, minimum age >=65 and the maximum end_date was between ‘1 April 2019’ and ’31 March 2020’ It’s the data filter that is causing me the issue.

below is an example of my dataset

individual event_id start_date end_date type sample age sort1 sort2 sort3
A 6 10/11/2014 10/11/2014 10 A 82 0 1 10
A 8 21/02/2019 27/02/2019 20 B 85 2 0 12
A 8 27/02/2019 06/04/2019 10 B 85 2 1 13
B 19 22/06/2015 22/06/2015 10 B 64 0 1 110
B 20 30/03/2019 01/04/2019 10 C 68 0 0 120
B 20 01/04/2019 05/04/2019 10 C 68 2 0 130
B 20 05/04/2019 20/04/2019 10 C 68 2 1 140
B 5 10/04/2000 20/04/2000 30 E 49 0 1 190
C 17 01/03/2018 23/03/2018 30 A 80 0 1 220
C 17 23/03/2018 04/04/2019 10 B 81 0 1 230

expected result

individual event_id start_date end_date type sample age
A 8 21/02/2019 06/04/2019 20 B 85
C 17 01/03/2018 04/04/2019 30 A 80

I can almost achieve what I want using the SQL selection below however I can’t figure out how/where to also add a filter within the date range based on max(end_date).

SELECT distinct individual, event_id,
            min(start_date) OVER (PARTITION BY individual, event_id) dstart,
            max(end_date) OVER (PARTITION BY individual, event_id) dend,
            min(age) OVER (PARTITION BY individual, event_id) age,
            FIRST_VALUE(sample) OVER (PARTITION BY individual, event_id ORDER BY individual, start_date, end_date, sort1, sort2,sort3) sample,
            FIRST_VALUE(type) OVER (PARTITION BY individual, event_id ORDER BY individual, start_date, end_date, sort1, sort2,sort3) adm_type
   FROM ANALYSIS.TABLE z
   WHERE
    exists (
      select * from ANALYSIS.TABLE where individual=z.individual and event_id=z.event_id
                                    and (type between '20' and '22' or type between '30' and '39'))
      AND AGE >64
     ORDER BY individual, event_id

Any suggestions would be most appreciated


Solution

  • Some of your filters apply to the calculated values, and when using an over() clause you cannot use those in the same "level" of where clause. That is, you need to "nest" the query one layer down, and then you can refer to those ccalculated columns by their column alias e.g. age > 64

    SELECT
        *
    from (
        SELECT
              individual
            , event_id
            , row_number() OVER (PARTITION BY individual, event_id)     rm
            , min(start_date) OVER (PARTITION BY individual, event_id)  dstart
            , max(end_date) OVER (PARTITION BY individual, event_id)    dend
            , min(age) OVER (PARTITION BY individual, event_id)         age
            , FIRST_VALUE(sample) OVER (PARTITION BY individual, event_id
                                        ORDER BY individual, start_date, end_date, sort1, sort2, sort3) sample
            , FIRST_VALUE(type) OVER (PARTITION BY individual, event_id 
                                        ORDER BY individual, start_date, end_date, sort1, sort2, sort3) adm_type
        FROM ANALYSIS.TABLE z
        WHERE EXISTS (
            select * 
            from ANALYSIS.TABLE 
            where individual = z.individual 
            and event_id = z.event_id
            and (type between '20' and '22' or type between '30' and '39')
            )
        ) d
    WHERE rn = 1
    AND AGE > 64
    AND dend BETWEEN '2019-04-01' and '2020-03-31'
    ORDER BY
          individual
        , event_id
    

    I have left the inner query to use the where exists subquery because I'm not sure if you want the start/end dates or age restricted by only those types, or if want the start/end dates and age restricted by those who have at least participated in those types.

    Also, I'm not sure that you really need "select distinct" but I have used row_number() over() as an alternative method to ensure you only get one row per individual, event_id. I suspect this will be more efficient than sorting all columns of the result to decide if there are duplicates.