string parsing stata

How can I split a string variable at the first numeric character?


I am trying to separate medication names from dosages in a string variable. My end goal is to create two variables: one for medication change and one for dosage change. Here is a small example of my data:

frame create test
frame change test
input str109 med_name
"NITROFURANTOIN MACROCRYSTAL 100 MG CAPSULE"
"ACETAMINOPHEN 500 MG TABLET"
"APIXABAN 5 MG TABLET"
"ATOVAQUONE 500 MG/5 ML ORAL SUSPENSION"
"ATOVAQUONE 750 MG/5 ML ORAL SUSPENSION"
"ATOVAQUONE 750 MG/5 ML ORAL SUSPENSION"
end

I tried installing and using the strkeep package, but it splits "ACETAMINOPHEN 500 MG TABLET" into "500" and "ACETAMINOPHENMGTABLET".


Solution

  • I used moss from SSC (ssc install moss) to find the first instance of a space followed by a digit, then split the string at that position.

    clear 
    input str109 med_name
    "NITROFURANTOIN MACROCRYSTAL 100 MG CAPSULE"
    "ACETAMINOPHEN 500 MG TABLET"
    "APIXABAN 5 MG TABLET"
    "ATOVAQUONE 500 MG/5 ML ORAL SUSPENSION"
    "ATOVAQUONE 750 MG/5 ML ORAL SUSPENSION"
    "ATOVAQUONE 750 MG/5 ML ORAL SUSPENSION"
    end 
    
    moss med_name, match("( [0-9])") regex 
    
    gen wanted1 = substr(med_name, 1, _pos1 - 1)
    gen wanted2 = substr(med_name, _pos1, .)
    
    l wanted?, sep(0)
    
         +------------------------------------------------------------+
         |                     wanted1                        wanted2 |
         |------------------------------------------------------------|
      1. | NITROFURANTOIN MACROCRYSTAL                 100 MG CAPSULE |
      2. |               ACETAMINOPHEN                  500 MG TABLET |
      3. |                    APIXABAN                    5 MG TABLET |
      4. |                  ATOVAQUONE    500 MG/5 ML ORAL SUSPENSION |
      5. |                  ATOVAQUONE    750 MG/5 ML ORAL SUSPENSION |
      6. |                  ATOVAQUONE    750 MG/5 ML ORAL SUSPENSION |
         +------------------------------------------------------------+
    

    This approach fails if any word in a drug name begins with a numeral, because the string is then split at that word rather than at the dosage.
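
    If installing moss is not an option, roughly the same split can be sketched with Stata's built-in Unicode regex functions ustrregexm() and ustrregexs() (Stata 14 or later). The pattern below additionally requires a unit after the number, which is one possible way to keep a numeral inside a drug name from triggering the split. The unit list (MG, MCG, ML, G) is an assumption for illustration and is not taken from the full data, so adjust it before relying on the result.

    ```stata
    * Sketch only: the unit list (MG|MCG|ML|G) is an assumption about the data.
    * Group 1 = shortest text before the first " <number> <unit>";
    * group 2 = everything from that number to the end of the string.
    gen wanted1 = ustrregexs(1) if ///
        ustrregexm(med_name, "^(.*?) ([0-9][0-9.,]* ?(MG|MCG|ML|G)\b.*)")
    gen wanted2 = ustrregexs(2) if ///
        ustrregexm(med_name, "^(.*?) ([0-9][0-9.,]* ?(MG|MCG|ML|G)\b.*)")

    * Rows that do not match the pattern are left empty; inspect them by hand.
    list med_name if wanted1 == ""
    ```

    Because the space separating the two groups is matched outside both of them, neither new variable picks up a leading or trailing blank from the split.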