I am trying to create a dummy variable A to indicate the number of ID
appearances. If ID
appears exactly twice, then A
equals 0; if ID
appears more than 2 times, then A
equals 1.
My desired output is as follows:
ID YEAR A
1078 1989 0
1078 1999 0
1161 1969 0
1161 2002 0
1230 1995 0
1230 2002 0
1279 1996 0
1279 2003 0
1447 1993 0
1447 2001 0
1487 1967 1
1487 2008 1
1487 2009 1
1678 1979 0
1678 2002 0
My code is:
data new data;
set data;
by ID YEAR;
if First.ID then count=0;
count+1;
run;
data newdata;
set newdata;
by ID YEAR count;
if sum(count)=3 then A=0;
if sum(count)>3 then A=1;
run;
But the output is incorrect.
You can't use sum(count) to sum the count across multiple observations. Instead you need to first count the observations, and then merge the counts into the original data.
Data step solution
data newdata;
set data;
by ID YEAR;
if First.ID then count=0;
count+1;
if last.id then output;
drop year;
run;
if last.id then output
means that newdata will only have the last observation for each id. We want that because the last observation has the county of the number of observations per id.
This step merges the original data "data" with the counts from "newdata" and computes a.
data newdata2;
merge data newdata;
by ID;
if count=2 then A=0;
if count>2 then A=1;
drop count;
run;
proc sql solution You can do it even more quickly with proc sql:
proc sql;
create table newdata as
select id, year, (case when count(*) = 2 then 0 else 1 end) as a
from data
group by id;
quit;
group by id
means that summary statistics such as count() are calculated by id, so the count(*)
means count observations in each id. (case when count(*) = 2 then 0 else 1 end) as a
creates a variable a. If the count for a given id is 2 then a takes the value of 0, else a takes the value of 1.