sasproc-sql

Comparing two strings without the last word


I have a dataset with two name variables. I want to compare both names, but excluding the last word from them before the comparison. This is because sometimes married people change their last names.

So, my data:

data have;
    infile datalines delimiter=',';
    length oldname $ 50 newname $ 50;
    input oldname $ newname $;
    datalines;
Maria Jose Silva, Maria José Silva Sampaoli
Juana Alencastro Albuquerque, Juana Alencastro Albuquerque
Julia Souza Maciel, Julia Souza Maciel Augusto
;
run;

Some letters might also differ between both names, for example "Jose" and "José", as the second word of the first record. So I want account for minor mistakes in the records and thougth of using compged function to get a score of the difference. But I want to remove the last name before doing so.

I asked chatGPT for a solution and it gave me an answer combining scan and countw functions:

proc sql;
create table want as
select
*,
compged(catx(' ', scan(oldname, 1, countw(oldname)-1)), catx(' ', scan(newname, 1, countw(newname)-1))) as GED
from have;

However this didnt work, since scan function third argument need to me a modifier, and not a number.

I also checked for previous answers on StackOverflow, and found this one. But I couldnt find out how to fit this solution into my data. I am more familiar PROC SQL framework, and I would appreciate a solution within it.


Solution

  • If your source string always use space as delimeter, you can use findc() function to find the position of last word, then use substr() to subset the source string:

    proc sql noprint;
      create table want as 
      select *, 
        substr(oldname,1,findc(oldname,'','stb')) as oldname_sub,
        substr(newname,1,findc(newname,'','stb')) as newname_sub
      from have;
    quit;