sasmissing-dataproc-sql4gl

How to delete variables with huge percent of missings in table in SAS?


I have table in SAS with missing values like below:

col1 | col2 | col3 | ... | coln
-----|------|------|-----|-------
111  |      | abc  | ... | abc
222  | 11   | C1   | ... | 11
333  | 18   |      | ... | 12
...  | ...  | ...  | ... | ...

And I need to delete from above table variables where is more than 80% missing values (>=80%).

How can I do taht in SAS ?


Solution

  • The macro below will create a macro variable named &drop_vars that holds a list of variables to drop from your dataset that exceed missing threshold. This works for both character and numeric variables. If you have a ton of them then this macro will fail but it can easily be modified to handle any number of variables. You can save and reuse this macro.

    %macro get_missing_vars(lib=, dsn=, threshold=);
        %global drop_vars;
    
        /* Generate a select statement that calculates the proportion missing:
           nmiss(var1)/count(*) as var1, nmiss(var2)/count(*) as var2, ... */   
        proc sql noprint;
            select cat('nmiss(', strip(name), ')/count(*) as ', strip(name) )
            into :calculate_pct_missing separated by ','
            from dictionary.columns
            where     libname = upcase("&lib")
                  AND memname = upcase("&dsn")
            ;
        quit;
    
        /* Calculate the percent missing */
        proc sql;
            create table pct_missing as
                select &calculate_pct_missing.
                from &lib..&dsn.
            ;
        quit;
    
        /* Convert to a long table */
        proc transpose data=pct_missing out=drop_list;
            var _NUMERIC_;
        run;
    
        /* Get a list of variables to drop that are >= the drop threshold */
        proc sql noprint;
            select _NAME_
            into :drop_vars separated by ' '
            from drop_list
            where COL1 GE &threshold.
            ;
        quit;
    %mend;
    

    It has three parameters:

    lib: Library of your dataset

    dsn: Dataset name without the library

    threshold: Proportion of missing values a variable must meet or exceed to be dropped

    For example, let's generate some sample data and use this. col1 col2 col3 all have 80% missing values.

    data have;
        array col[10];
        do i = 1 to 10;
            do j = 1 to 10;
                col[j] = i;
                if(i > 2 AND j in(1, 2, 3) ) then col[j] = .;
            end;
            
            output;
        end;
    
        drop i j;
    run; 
    

    We'll run the macro and check the log:

    %get_missing_vars(lib=work, dsn=have, threshold=0.8);
    
    %put &drop_vars;
    

    The log shows:

    col1 col2 col3
    

    Now we can pass this into a simple data step.

    data want;
        set have;
        drop &drop_vars;
    run;