I have a column with a lot of company names without any cleaning as below. But I want to do the cleaning to remove a list of strings from company names so I can group them together.
list of strings: com,llc,ltd,corp,stores
| Company |
| --- |
| Amazon |
| Amazon.com |
| Amazon.llc |
| Amazon.ltd |
| Amazon corp |
| Amazon stores |
I also want to remove punctuation and strip spaces from the company names, so the desired outcome can be like:
| Company |
| --- |
| Amazon |
| Amazon |
| Amazon |
| Amazon |
| Amazon |
| Amazon |
I have done my project using Python but I need to convert it to SAS or PROC SQL, but I'm not familiar with these 2 languages, so appreciate any help.
A regular expression should make this easy:
([\. ])(com|ltd|llc|corp|stores)
https://regex101.com/r/FhnEwN/1
In SAS, finding and replacing strings with regex is done with the prxchange
function, which works like this:
string = prxchange('s/regex here/replacement string/', times, string);
Where times
is the number of replacements you want to make. -1 means to replace every match. If you leave your replacement string as a null value, like this: 's/regex here//'
then you will effectively remove the matched value.
Converting it to SAS:
data want;
set have;
company = prxchange('s/([\. ])(com|ltd|llc|corp|stores)//', -1, company);
run;
Company
Amazon
Amazon
Amazon
Amazon
Amazon
Amazon