I am working with a dataset that has ID, Event ID, Event date time, and event reason. My problem is that sometimes there is a unique event ID but the exact same event date time, ID, and event reason. Is there a way to create a variable to flag the duplicate observations in this case?
The data looks like the below
data test;
input ID event_ID$ event_dt_tm :datetime18. event$;
format event_dt_tm datetime20.;
datalines;
1 62-1 25NOV22:16:09:00 SaO2
1 62-2 25NOV22:16:09:00 SaO2
2 74-1 24FEB19:10:26:00 RR
2 74-2 03MAR19:19:11:00 HR
3 51-1 04DEC22:09:10:00 NC
3 51-2 04DEC22:09:10:00 NC
3 51-3 12DEC22:13:28:00 SaO2
;
run;
Thank you in advance! Any help would be appreciated
This is a good use case for by-group processing. First, sort your data by event_id event_dt_tm event
so we know it's in order. Next we'll apply by-group processing and check if the unique combination of event_id event_dt_tm event
is not the first value. If it isn't, then we know there is more than one value in that group.
proc sort data=test;
by id event_dt_tm event;
run;
data want;
set test;
by id event_dt_tm event;
flag_duplicate_event = (first.event = 0);
run;
ID event_ID event_dt_tm event flag_duplicate_event
1 62-1 25NOV2022:16:09:00 SaO2 0
1 62-2 25NOV2022:16:09:00 SaO2 1
2 74-1 24FEB2019:10:26:00 RR 0
2 74-2 03MAR2019:19:11:00 HR 0
3 51-1 04DEC2022:09:10:00 NC 0
3 51-2 04DEC2022:09:10:00 NC 1
3 51-3 12DEC2022:13:28:00 SaO2 0
Explanation
When you use by-group processing, you unlock the first.
and last.
special variables. These are 1 or 0 depending on if we're at the start or end of a group, respectively. You can tell where you are in the group depending on the variable values:
This is also done hierarchically. If we have a by group of group1 group2 group3
, first.group3
automatically implies we're looking at the unique combination of group1 group2 group3
.
In other words, we're checking if the combination of id event_dt_tm event
is unique. If the value of first.event
is 0, then we know there is more than one value in the group. Another way to look at this is by using put
statements in the log and seeing how the values change:
data _null_;
set test;
by id event_dt_tm event;
if(_N_ = 1) then do;
put 'ID' +3 'first.id' +4 'first.event_dt_tm' +3 'first.event';
end;
put id +3 first.id +10 first.event_dt_tm +18 first.event;
run;
ID first.id first.event_dt_tm first.event
1 1 1 1
1 0 0 0
2 1 1 1
2 0 1 1
3 1 1 1
3 0 0 0
3 0 1 1