I have a large dataset with hundreds of variables collected over several timepoints. The variables are already defined by timepoint, but each observation is a different timepoint. It's like the dataset was planned in wide format but collected in long format, like this:
data have;
input id timepoint $ var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
1 time_1 1 . . note1 . .
1 time_2 . 2 . . note2 .
1 time_3 . . 3 . . note3
2 time_1 1 . . note1 . .
2 time_2 . 2 . . note2 .
2 time_3 . . 3 . . note3
;
run;
The variable timepoint
is redundant; the variables already describe the timepoint. I need to collapse the dataset into a single observation per id (into just the standard wide format), like this:
data want;
input id var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
1 1 2 3 note1 note2 note3
2 1 2 3 note1 note2 note3
;
run;
Note: Unfortunately, variable names do not always end in _t1, _t2, etc. as in my example (e.g., note_t1_2) so I cannot easily reference suffixes.
My first thought was to break up the dataset into separate visit datasets (data timepoint_1_have; set have; if timepoint = "time_1";
, etc.), and then merge them by id. A straight merge resulted in loss of data (I'm sure that was obvious, but I thought maybe the missing values would be overwritten). So I thought, I'll drop all the variables with only missing values before merging by id. This proved incredibly difficult and I cannot find a way to do this for both character and numeric variables without pages of macros...
So I tried different strategies using retain
. Since first.id for variables at timepoints >1 have missing values, I thought retaining instances of nonmissing values down timepoints and then keeping last.id might work:
data timepoint_1_want;
set timepoint_1_have;
array Nums[*] _numeric_;
array Chars[*] _character_;
by id;
do i = 1 to dim(Nums);
if not missing(Nums[i]) then do;
retain Nums[i];
end;
do i = 1 to dim(Chars);
if not missing(Chars[i]) then do;
retain Chars[i];
end;
drop i;
if last.id then output;
run;
But retain
can't be used within a do loop, so this didn't work either:
12155 data timepoint_1_want;
12156 set timepoint_1_have;
12157 array Nums[*] _numeric_;
12158 array Chars[*] _character_;
12159 by id;
12160 do i = 1 to dim(Nums);
12161 IF not missing(Nums[i]) THEN do;
12162 retain Nums[i];
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, (, -, :, ;, _ALL_,
_CHARACTER_, _CHAR_, _NUMERIC_.
ERROR 76-322: Syntax error, statement will be ignored.
12163 end;
12164 do i = 1 to dim(Chars);
12165 IF not missing(Chars[i]) THEN do;
12166 retain Chars[i];
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, (, -, :, ;, _ALL_,
_CHARACTER_, _CHAR_, _NUMERIC_.
ERROR 76-322: Syntax error, statement will be ignored.
12167 end;
12168 drop i;
12169 IF last.id THEN output;
12170 run;
You can use the update trick.
data have;
input id timepoint $ var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
1 time_1 1 . . note1 . .
1 time_2 . 2 . . note2 .
1 time_3 . . 3 . . note3
2 time_1 1 . . note1 . .
2 time_2 . 2 . . note2 .
2 time_3 . . 3 . . note3
;
run;
data want;
update have(keep=id obs=0) have(drop=timepoint);
by id;
run;