
Stata - Generate a dummy variable if any string variable of a list begins with specific characters


I am trying to create dummy variables in Stata that equal 1 if any of the variables dx1 through dx25 starts with a specific string. I know I can do this with something like the following, extended to all 25 dx variables:

gen dummy=0
replace dummy=1 if substr(dx1,1,4)=="6542" | substr(dx2,1,4)=="6542" 

I would then create other dummies equal to 1 if any of the dxs start with these:

   6542 6522 6696 6410 6411 6412 6630 218 6426 459 490 491 492 493 494 495 496 9971 250 2810 28249 05410 054 657 V27.2 V27.3 V27.4 V27.5 V27.6 V27.7

I have been trying to figure out a more efficient and elegant way to do this.

Data Structure example (I will keep it to dx1 through dx5 here for space reasons):

     +---------------------------------------+
     |   dx1     dx2     dx3     dx4     dx5 |
     |---------------------------------------|
  1. | 65421    V270                         |
  2. | 65221   65801   64232   65951   64892 |
  3. | 64511    V270                         |
  4. | 64781    V270                         |
  5. | 65571   66331   64891     340    V270 |
     |---------------------------------------|
  6. | 66401   67202   66331    V270         |
  7. | 66411    V270   V1321                 |
  8. | 65571    V270   V5864                 |
  9. | 65421    V270    V252                 |
 10. | 64511   64231   66331   66401    V270 |
     |---------------------------------------|
 11. | 65651   66401    V270                 |
 12. |   650    V270                         |
 13. | 64881   66541   66331    V270    V161 |
 14. | 66311   65971    V270                 |
 15. | 64781    V270   V1589                 |
     |---------------------------------------|
 16. | 65571   66191    V270                 |
 17. | 64241   66401    V270                 |
 18. | 66031   65971   66071    V270         |
 19. | 64841   66401   30520    V270         |
     +---------------------------------------+

Solution

  • I first try to make things work. If the result is too inefficient for my needs (or, sometimes, aesthetically unpleasant), I then look for a different approach. Following your line of thought, why not try loops:

    clear all
    set more off
    
    *----- Example data -----
    
    input ///
    str10(dx1     dx2     dx3     dx4     dx5)
        65421    V270                         
        65221   65801   64232   65951   64892 
        64511    V270                         
        64781    V270                         
        65571   66331   64891     340    V270 
        66401   67202   66331    V270         
        66411    V270   V1321                 
        65571    V270   V5864                 
        65421    V270    V252                 
       64511   64231   66331   66401    V270 
       65651   66401    V270                 
         650    V270                         
       64881   66541   66331    V270    V161 
       66311   65971    V270                 
       64781    V270   V1589                
       65571   66191    V270                 
       64241   66401    V270                 
       66031   65971   66071    V270         
       64841   66401   30520    V270         
    end
    
    list in 1/15
    
    *----- what you want -----
    
    local li "6542 6522 6696 6410 6411 6412 6630 218 6426 459 490 491 492 493 494 495 496 9971 250 2810 28249 05410 054 657 V27.2 V27.3 V27.4 V27.5 V27.6 V27.7"
    
    quietly foreach val of local li {
    
        local tname = strtoname("ind`val'")
        gen byte `tname' = 0     
    
        foreach var of varlist dx* {
            * strpos() == 1 means `var' begins with the code, whatever the code's length
            replace `tname' = 1 if strpos(`var', "`val'") == 1
        }
    
    }
    
    browse
    

    I'm using the strings of interest to name the indicator variables (you call them dummy). Because some strings would make illegal Stata names, I use the strtoname() function. This naming convention is not mandatory of course.
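
As a small illustration of the naming step, the lines below show what strtoname() returns for two of the codes once the "ind" prefix is attached. This is only a sketch; the prefix is the one used in the loop above, and nothing else is assumed.

```stata
* A code made only of digits is already a legal name once prefixed
display strtoname("ind6542")

* The period in V27.2 is not allowed in a name and is replaced by an underscore
display strtoname("indV27.2")
```

The first call displays ind6542 and the second displays indV27_2.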

    The loop does more evaluation than strictly necessary, though it may be good enough as it is. For a given element of the local li, once an observation has been set to 1 there is no need to test the remaining dx variables, yet the code still checks all of them.
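
One possible way to avoid the repeated replace is to build a single logical expression per code and evaluate it once in generate. The following is only a sketch under assumptions: the three-variable toy data and the macro name cond are made up for illustration, and strpos(...) == 1 is used as the "begins with" test so that codes of any length are handled.

```stata
clear
input str10(dx1 dx2 dx3)
"65421" "V270"  ""
"65221" "65801" "64232"
"2181"  "65421" ""
end

local li "6542 6522 218"

foreach val of local li {
    local tname = strtoname("ind`val'")

    * Accumulate one "| strpos(dxN, code) == 1" term per dx variable
    local cond
    foreach var of varlist dx* {
        local cond `cond' | strpos(`var', "`val'") == 1
    }

    * 0 | term1 | term2 ... is 1 if any dx variable begins with the code, else 0
    gen byte `tname' = 0 `cond'
}

list
```

Each indicator is created in one generate statement instead of one replace per dx variable. With 25 dx variables the expression gets long but should remain well within what Stata accepts.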

    Maybe there's a better way to achieve your end result, but you don't say what that is. This seems to be only some intermediate step.

    Run help <command_or_function> for details on specific syntax.

    (Note that in your original post

    list dx1 dx2 dx3 dx4 dx5 in 1/20
    

    is more efficient than

    list dx1 dx2 dx3 dx4 dx5 if _n<20
    

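    because Stata does not need to check whether the if condition is met for every observation in the dataset; it simply lists the first 20 observations. Note also that if _n<20 selects only the first 19 observations, so the two commands are not equivalent.)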
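
The difference between the two qualifiers can be checked with count. This is a minimal sketch that assumes the auto example dataset shipped with Stata (74 observations); any dataset with at least 20 observations would do.

```stata
sysuse auto, clear

* in 1/20 takes observations 1 through 20
count in 1/20

* _n < 20 is true only for observations 1 through 19
count if _n < 20

* _n <= 20 selects the same observations as in 1/20
count if _n <= 20
```

The three commands report 20, 19 and 20 observations respectively.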