I have a dataset with two name variables. I want to compare both names, but excluding the last word from them before the comparison. This is because sometimes married people change their last names.
So, my data:
data have;
infile datalines delimiter=',';
length oldname $ 50 newname $ 50;
input oldname $ newname $;
datalines;
Maria Jose Silva, Maria José Silva Sampaoli
Juana Alencastro Albuquerque, Juana Alencastro Albuquerque
Julia Souza Maciel, Julia Souza Maciel Augusto
;
run;
Some letters might also differ between both names, for example "Jose" and "José", as the second word of the first record. So I want account for minor mistakes in the records and thougth of using compged
function to get a score of the difference. But I want to remove the last name before doing so.
I asked chatGPT for a solution and it gave me an answer combining scan
and countw
functions:
proc sql;
create table want as
select
*,
compged(catx(' ', scan(oldname, 1, countw(oldname)-1)), catx(' ', scan(newname, 1, countw(newname)-1))) as GED
from have;
However this didnt work, since scan function third argument need to me a modifier, and not a number.
I also checked for previous answers on StackOverflow, and found this one. But I couldnt find out how to fit this solution into my data. I am more familiar PROC SQL
framework, and I would appreciate a solution within it.
If your source string always use space as delimeter, you can use findc()
function to find the position of last word, then use substr()
to subset the source string:
proc sql noprint;
create table want as
select *,
substr(oldname,1,findc(oldname,'','stb')) as oldname_sub,
substr(newname,1,findc(newname,'','stb')) as newname_sub
from have;
quit;