I have a dataset with a text field that contains long free flowing text values , I need to identify and extract all the 16 digit account number from that text field and create a column from those extracted account number
Data that I have
input acct_num txt_field ;
DATALINES;
3435436 Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024
7789976 Data declined and assigned to 7890512323454545
Data that I need
acct_num txt_field acct1 acct2
3435436 Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024 3456123789065322 0909876789432123
7789976 Data declined and assigned to 7890512323454545 7890512323454545
As of now I have used Prxparse and prxmatch functions but those work when you know what exactly to look for in the text field, here I am just looking for any 16 digit values
You're on the right track with regex. Use call prxnext()
to iterate through all instances of 16-digit account numbers. The regex \b\d{16}\b
will find these.
data want;
set have;
length acct_num_16 $200.;
retain exprid;
/* Generate an expression ID */
if(_N_ = 1) then exprid = prxparse('/\b\d{16}\b/');
/* Scan all of the text */
start = 1;
stop = length(txt_field);
/* Find the first value */
call prxnext(exprid, start, stop, txt_field, pos, len);
/* Keep scanning until there are no more account numbers found and
append it to a comma-separated list */
do while (pos > 0);
acct_num_16 = catx(',', acct_num_16, substr(txt_field, pos, len) );
call prxnext(exprid, start, stop, txt_field, pos, len);
end;
keep acct_num txt_field acct_num_16;
run;
acct_num txt_field acct_num_16
3435436 Payment issue reported ... 3456123789065322,0909876789432123
7789976 Data declined and ... 7890512323454545