
How to flag whether any word within a character field sounds like a given word, using SOUNDEX?


I have a long character variable (up to 12,000 characters), and I would like to find a word within it that sounds like a certain word.
I'd also like to create a flag variable that equals 1 if such a word is present. Let's say, for argument's sake, that the word I'm trying to find is "Mary" (not case-sensitive). Here are four sample strings in a variable called "string" in a dataset called "question":

Mary had a little lamb its fleece was white as snow
Jack be nimble Jack be quick Jack jump over the candlestick
I think you and I should marry each other
I actually do not want to get married

The flag variable should equal 1 for strings 1 and 3 (because of "Mary" and "marry").

Unfortunately, I don't think I can use this code:

DATA answer;
   SET question;
   IF FINDW(string, SOUNDEX("Mary")) ne 0 THEN flag=1;
     ELSE flag=0;
RUN;

It doesn't work because FINDW searches the string for the literal SOUNDEX code of "Mary", not for words that sound like "Mary". Any thoughts on how to get around this?


Solution

  • Here's one way: loop through the words in the string, compute the SOUNDEX code of each one, and compare it with the code for "Mary". The loop stops at the first match, for efficiency.

    data test_set;
        infile datalines dsd;
        length string $100;
        input string;
        datalines;
    Mary had a little lamb its fleece was white as snow
    Jack be nimble Jack be quick Jack jump over the candlestick
    I think you and I should marry each other
    I actually do not want to get married
    ;
    run;
    
    data test_set1(keep=string flag);
        set test_set;
    
        length i_word $100;
    
        flag = 0;
    
        /* SOUNDEX ignores case, so "mary" and "Mary" give the same code */
        mary_soundex = soundex("mary");
    
        word_count = countw(string);
    
        i = 1;
    
        /* Check one word at a time and stop as soon as a match is found */
        do while (i le word_count and flag ne 1);
            i_word = scan(string, i);
            i_word_soundex = soundex(i_word);
            if mary_soundex eq i_word_soundex then flag = 1;
            i = i + 1;
        end;
    run;
    

    More on breaking sentences into words: http://blogs.sas.com/content/iml/2016/07/11/break-sentence-into-words-sas.html
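
    To see why rows 1 and 3 are flagged but row 4 is not, it can help to print the SOUNDEX codes of a few words. This is only a small sketch: the word list and the variable names `word` and `code` are made up for illustration and are not part of the question's data.

    data _null_;
        length word $20 code $20;
        /* Illustrative words only: three from the sample strings, plus "more" */
        do word = "Mary", "marry", "married", "more";
            code = soundex(word);
            /* Write each word and its SOUNDEX code to the log */
            put word= code=;
        end;
    run;

    Under SAS's documented SOUNDEX rules, "Mary" and "marry" should both encode to M6, while "married" picks up an extra digit for the "d" (M63) and so does not match. Note that "more" also encodes to M6, so a SOUNDEX match is fairly loose and may flag words that you would not consider to sound like "Mary".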