I have table in SAS with missing values like below:
col1 | col2 | col3 | ... | coln
-----|------|------|-----|-------
111 | | abc | ... | abc
222 | 11 | C1 | ... | 11
333 | 18 | | ... | 12
... | ... | ... | ... | ...
And I need to delete from above table variables where is more than 80% missing values (>=80%).
How can I do taht in SAS ?
The macro below will create a macro variable named &drop_vars
that holds a list of variables to drop from your dataset that exceed missing threshold. This works for both character and numeric variables. If you have a ton of them then this macro will fail but it can easily be modified to handle any number of variables. You can save and reuse this macro.
%macro get_missing_vars(lib=, dsn=, threshold=);
%global drop_vars;
/* Generate a select statement that calculates the proportion missing:
nmiss(var1)/count(*) as var1, nmiss(var2)/count(*) as var2, ... */
proc sql noprint;
select cat('nmiss(', strip(name), ')/count(*) as ', strip(name) )
into :calculate_pct_missing separated by ','
from dictionary.columns
where libname = upcase("&lib")
AND memname = upcase("&dsn")
;
quit;
/* Calculate the percent missing */
proc sql;
create table pct_missing as
select &calculate_pct_missing.
from &lib..&dsn.
;
quit;
/* Convert to a long table */
proc transpose data=pct_missing out=drop_list;
var _NUMERIC_;
run;
/* Get a list of variables to drop that are >= the drop threshold */
proc sql noprint;
select _NAME_
into :drop_vars separated by ' '
from drop_list
where COL1 GE &threshold.
;
quit;
%mend;
It has three parameters:
lib
: Library of your dataset
dsn
: Dataset name without the library
threshold
: Proportion of missing values a variable must meet or exceed to be dropped
For example, let's generate some sample data and use this. col1 col2 col3
all have 80% missing values.
data have;
array col[10];
do i = 1 to 10;
do j = 1 to 10;
col[j] = i;
if(i > 2 AND j in(1, 2, 3) ) then col[j] = .;
end;
output;
end;
drop i j;
run;
We'll run the macro and check the log:
%get_missing_vars(lib=work, dsn=have, threshold=0.8);
%put &drop_vars;
The log shows:
col1 col2 col3
Now we can pass this into a simple data step.
data want;
set have;
drop &drop_vars;
run;