sasproc-sqlregex-replace

How to remove strings and punctuation from a columns using SAS or PROC SQL?


I have a column with a lot of company names without any cleaning as below. But I want to do the cleaning to remove a list of strings from company names so I can group them together.

list of strings: com,llc,ltd,corp,stores

| Company |
| --- |
| Amazon |
| Amazon.com |
| Amazon.llc |
| Amazon.ltd |
| Amazon corp |
| Amazon stores |

I also want to remove punctuation and strip spaces from the company names, so the desired outcome can be like:

| Company |
| --- |
| Amazon |
| Amazon |
| Amazon |
| Amazon |
| Amazon |
| Amazon |

I have done my project using Python but I need to convert it to SAS or PROC SQL, but I'm not familiar with these 2 languages, so appreciate any help.


Solution

  • A regular expression should make this easy:

    ([\. ])(com|ltd|llc|corp|stores)

    https://regex101.com/r/FhnEwN/1

    In SAS, finding and replacing strings with regex is done with the prxchange function, which works like this:

    string = prxchange('s/regex here/replacement string/', times, string);

    Where times is the number of replacements you want to make. -1 means to replace every match. If you leave your replacement string as a null value, like this: 's/regex here//' then you will effectively remove the matched value.

    Converting it to SAS:

    data want;
        set have;
        company = prxchange('s/([\. ])(com|ltd|llc|corp|stores)//', -1, company);
    run;
    
    Company
    Amazon
    Amazon
    Amazon
    Amazon
    Amazon
    Amazon