I have a character variable that's long (up to 12,000 characters), and I would like to find a string within the variable that sounds like a certain word.
I'd also like to create a variable that equals one if the string is in the variable. Let's say, for argument's sake, the word that I'm trying to find is "Mary" (not case-sensitive). Here are four sample strings in a variable called "string" in a dataset called "question":
The flag variable should equal 1 for strings 1 and 3 (because they contain "Mary" and "marry").
Unfortunately, I don't think I can use this code:
DATA answer;
SET question;
IF FINDW(string, SOUNDEX("Mary")) ne 0 THEN flag=1;
ELSE flag=0;
RUN;
It doesn't work because SAS is trying to find the soundex code for "Mary" in the string (not the actual string "Mary"). Any thoughts on how to get around this?
Here's one way: loop through each word in the string and compute the soundex for that word. As soon as a word's soundex matches the soundex of "mary", the loop exits early, for efficiency.
data test_set;
  /* DSD makes the comma the delimiter, so each comma-free line is read whole */
  infile datalines dsd;
  length string $100;
  input string;
  datalines;
Mary had a little lamb its fleece was white as snow
Jack be nimble Jack be quick Jack jump over the candlestick
I think you and I should marry each other
I actually do not want to get married
;
run;
data test_set1(keep=string flag);
  set test_set;
  length i_word $100;
  flag = 0;
  /* soundex code to match against */
  mary_soundex = soundex("mary");
  word_count = countw(string);
  i = 1;
  /* stop at the last word or at the first match, whichever comes first */
  do while (i le word_count and flag ne 1);
    i_word = scan(string, i);
    i_word_soundex = soundex(i_word);
    if mary_soundex eq i_word_soundex then flag = 1;
    i = i + 1;
  end;
run;
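The same idea can be written more compactly with an iterative DO loop that has an UNTIL clause. This is only a sketch of an alternative, not a replacement for the step above: the output dataset name test_set2 is made up for illustration, and it assumes the test_set dataset created earlier.

```sas
data test_set2(keep=string flag);  /* test_set2 is a hypothetical name */
  set test_set;
  flag = 0;
  /* UNTIL is checked at the end of each pass, so the loop stops right after the first match */
  do i = 1 to countw(string) until (flag = 1);
    if soundex(scan(string, i)) = soundex("mary") then flag = 1;
  end;
run;
```

This drops the intermediate variables and the manual counter, at the cost of recomputing soundex("mary") on every pass.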
More on breaking sentences into words: http://blogs.sas.com/content/iml/2016/07/11/break-sentence-into-words-sas.html