[SOLVED] Create sub-datasets based on pattern matching on the variable names

Create sub-datasets based on pattern matching on the variable names

I have the following dataset:

data have;
  input ID :$20. M1r2c1: $20. M2r2c1: $20. M11rxc1: $20. M11rxc2: $20.; 
cards;
00x1 Frequent     Rank 1    10-500    Frequent
0sx2 Infreqent    Rank 5    10-500    Infrequent
0tb3 Often                  10-30     Frequent
er04 Not Frequent Rank 3              Frequent
;

Is there a way to create a list of datasets (subsets of the dataset "have") based on the first pattern i.e., M*? In other words:

dataset1 will contain variable ID and M1*
dataset2 will contain variable ID and M2*
dataset3 will contain variable ID and M11* (i.e., M11rxc1 and M11rxc2 but not M1r2c1)
........

The dataset "have" contains 300 columns but M* is from 1 to 27.

Solution

It would be much easier if you renamed the variables so there was a delimiter after the numeric suffix.

Obs    _NAME_     newname

 1     ID         ID
 2     M1r2c1     M1_r2c1
 3     M2r2c1     M2_r2c1
 4     M11rxc1    M11_rxc1
 5     M11rxc2    M11_rxc2

Then you could use the : wildcard in your keep statement.

keep ID M1_: ;

Here is one way to convert strings like M1r2c1 into M1_r2c1.

prxchange('s/^(M\d+)/$1_/',1,name)

Let's first fix your data step so it can actually run.

data have;
  length ID M1r2c1 M2r2c1 M11rxc1 M11rxc2 $20; 
  input ID 1-4 M1r2c1 6-17 M2r2c1 19-25 M11rxc1 29-35 M11rxc2 39-50; 
cards;
00x1 Frequent     Rank 1    10-500    Frequent
0sx2 Infreqent    Rank 5    10-500    Infrequent
0tb3 Often                  10-30     Frequent
er04 Not Frequent Rank 3              Frequent
;

Now lets use an SQL query to make a macro variable with oldname=newname pairs that we can use to generate a RENAME statement or the RENAME= dataset option.

proc sql noprint;
select catx('='
        ,nliteral(name)
        ,nliteral(prxchange('s/^(M\d+)/$1_/',1,name))
        )
  into :renames separated by ' '
  from dictionary.columns
  where libname='WORK'
    and memname='HAVE'
;
quit;

Now we can run a data step to make all three of your output datasets.

data want1(keep=id m1_: ) want2(keep=id m2_:) want11(keep=id m11_:) ;
  set have(rename=(&renames));
run;

Results:

If you don't want to bother trying to learn how to use regular expressions you could use some other simpler SAS functions. Like VERIFY() and SUBSTR() to find the prefix for your variable names.

proc sql ;
create table groups as
  select name
       , substrn(name,1,verify(upcase(name),'M0123456789')-1) as prefix length=32
       , input(substr(calculated prefix,2),32.) as group
  from dictionary.columns
  where libname='WORK'
    and memname='HAVE'
    and upcase(name) like 'M%'
  order by group,name
;
quit;

Result

Obs     name      prefix    group

 1     M1r2c1      M1          1
 2     M2r2c1      M2          2
 3     M11rxc1     M11        11
 4     M11rxc2     M11        11