sas

Dedup based on the value of a variable


suppose to have the following:

data DB;
input ID :$20. Admission :date09. Discharge :date09. Index1;
format Admission date9. Discharge date9.;
cards;
0001 13JAN2017 25JAN2017 0 
0001 22FEB2018 03MAR2018 0 
0001 30JAN2019 04MAR2019 0 
0001 30JAN2019 04MAR2019 1 
0001 25DEC2020 02JAN2021 0 
0002 15JAN2014 29JAN2014 1 
0002 01FEB2015 10FEB2015 0 
0002 01FEB2015 10FEB2015 0 
0003 22MAR2019 23MAR2019 1 
0003 22MAR2019 23MAR2019 0
0003 02JUL2019 28AUG2019 0
0004 01SEP2022 15SEP2022 1
0004 02DEC2023 05DEC2023 1
;run;

Is there a way to get the following? In other words I need to remove duplicated rows (identical admission-discharge date) and retain the inly one where Index1 = 1. I tried with proc sort to order by the date and the Index1 (descending) without success. Thank you in advance!


data DB1;
input ID :$20. Admission :date09. Discharge :date09. Index1;
format Admission date9. Discharge date9.;
cards;
0001 13JAN2017 25JAN2017 0 
0001 22FEB2018 03MAR2018 0 
0001 30JAN2019 04MAR2019 1 
0001 25DEC2020 02JAN2021 0 
0002 15JAN2014 29JAN2014 1 
0002 01FEB2015 10FEB2015 0 
0003 22MAR2019 23MAR2019 1   
0003 02JUL2019 28AUG2019 0              
0004 01SEP2022 15SEP2022 1            
0004 02DEC2023 05DEC2023 1
;run;

Solution

  • You can sort it by id admission discharge index1 which will ensure that if there are any duplicate values of the combination of id admission discharge, index1 is last. From there, you can just use a DATA Step with by-group processing and only output the last value of the group of id admission discharge.

    Note that in the code below, if(last.discharge) automatically takes into account the groups before it. In other words, last.discharge means it's the last value within the combination of id admission discharge.

    proc sort data=db;
        by id admission discharge index1;
    run;
    
    data want;
        set db;
        by id admission discharge;
        if(last.discharge);
    run;
    

    Or, you can do it equivalently with a SQL having clause:

    proc sql;
        create table want as
            select *
            from db
            group by id, admission, discharge
            having index1 = max(index1)
        ;
    quit;
    
    ID      Admission   Discharge   Index1
    0001    13JAN2017   25JAN2017   0
    0001    22FEB2018   03MAR2018   0
    0001    30JAN2019   04MAR2019   1
    0001    25DEC2020   02JAN2021   0
    0002    15JAN2014   29JAN2014   1
    0002    01FEB2015   10FEB2015   0
    0003    22MAR2019   23MAR2019   1
    0003    02JUL2019   28AUG2019   0
    0004    01SEP2022   15SEP2022   1
    0004    02DEC2023   05DEC2023   1