sqlsassas-macro

Extracting all the 16 digit account numbers from a text field and creating a field for that account number


I have a dataset with a text field that contains long free flowing text values , I need to identify and extract all the 16 digit account number from that text field and create a column from those extracted account number

Data that I have

    input acct_num   txt_field  ;
    DATALINES; 
    3435436     Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024  
    7789976     Data declined and assigned to 7890512323454545  

Data that I need

acct_num              txt_field                                                                       acct1               acct2
    3435436     Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024   3456123789065322    0909876789432123
    7789976     Data declined and assigned to 7890512323454545                                 7890512323454545  

As of now I have used Prxparse and prxmatch functions but those work when you know what exactly to look for in the text field, here I am just looking for any 16 digit values


Solution

  • You're on the right track with regex. Use call prxnext() to iterate through all instances of 16-digit account numbers. The regex \b\d{16}\b will find these.

    data want;
        set have;
        length acct_num_16 $200.;
        retain exprid;
       
        /* Generate an expression ID */
        if(_N_ = 1) then exprid = prxparse('/\b\d{16}\b/');
    
        /* Scan all of the text */
        start = 1;
        stop  = length(txt_field);
    
        /* Find the first value */
        call prxnext(exprid, start, stop, txt_field, pos, len);
    
        /* Keep scanning until there are no more account numbers found and 
           append it to a comma-separated list */
        do while (pos > 0);
            acct_num_16 = catx(',', acct_num_16, substr(txt_field, pos, len) );
            call prxnext(exprid, start, stop, txt_field, pos, len);
        end;
    
        keep acct_num txt_field acct_num_16;
    run;
    
    acct_num    txt_field                    acct_num_16
    3435436     Payment issue reported ...   3456123789065322,0909876789432123
    7789976     Data declined and ...        7890512323454545