I am trying to create dummy variables in Stata that are 1 if any of the variables dx1 through dx25 start with a specific string. I know that I can do this using something like the following, but for all 25 dx variables:
gen dummy=0
replace dummy=1 if substr(dx1,1,4)=="6542" | substr(dx2,1,4)=="6542"
I would then create other dummies equal to 1 if any of the dx variables start with one of these:
6542 6522 6696 6410 6411 6412 6630 218 6426 459 490 491 492 493 494 495 496 9971 250 2810 28249 05410 054 657 V27.2 V27.3 V27.4 V27.5 V27.6 V27.7
I have been trying to figure out a more efficient and elegant way to do this.
Data structure example (I will keep it to dx1 through dx5 here for space reasons):
+---------------------------------------+
| dx1 dx2 dx3 dx4 dx5 |
|---------------------------------------|
1. | 65421 V270 |
2. | 65221 65801 64232 65951 64892 |
3. | 64511 V270 |
4. | 64781 V270 |
5. | 65571 66331 64891 340 V270 |
|---------------------------------------|
6. | 66401 67202 66331 V270 |
7. | 66411 V270 V1321 |
8. | 65571 V270 V5864 |
9. | 65421 V270 V252 |
10. | 64511 64231 66331 66401 V270 |
|---------------------------------------|
11. | 65651 66401 V270 |
12. | 650 V270 |
13. | 64881 66541 66331 V270 V161 |
14. | 66311 65971 V270 |
15. | 64781 V270 V1589 |
|---------------------------------------|
16. | 65571 66191 V270 |
17. | 64241 66401 V270 |
18. | 66031 65971 66071 V270 |
19. | 64841 66401 30520 V270 |
+---------------------------------------+
I first try to make things work. After that, if it's too inefficient for my needs (and sometimes if aesthetically unpleasant), I try to work things out in a different way. Following your line of thought, why not try loops:
clear all
set more off
*----- Example data -----
input ///
str10(dx1 dx2 dx3 dx4 dx5)
65421 V270
65221 65801 64232 65951 64892
64511 V270
64781 V270
65571 66331 64891 340 V270
66401 67202 66331 V270
66411 V270 V1321
65571 V270 V5864
65421 V270 V252
64511 64231 66331 66401 V270
65651 66401 V270
650 V270
64881 66541 66331 V270 V161
66311 65971 V270
64781 V270 V1589
65571 66191 V270
64241 66401 V270
66031 65971 66071 V270
64841 66401 30520 V270
end
list in 1/15
*----- what you want -----
local li "6542 6522 6696 6410 6411 6412 6630 218 6426 459 490 491 492 493 494 495 496 9971 250 2810 28249 05410 054 657 V27.2 V27.3 V27.4 V27.5 V27.6 V27.7"
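quietly foreach val of local li {
    * build a legal variable name from the code (V27.2 becomes indV27_2)
    local tname = strtoname("ind`val'")
    gen byte `tname' = 0
    foreach var of varlist dx* {
        * strpos() == 1 is a "starts with" test that works for codes of
        * any length; substr(.,1,4) could only ever match 4-character codes
        replace `tname' = 1 if strpos(`var', "`val'") == 1
    }
}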
browse
I'm using the strings of interest to name the indicator variables (you call them dummies). Because some strings would make illegal Stata names, I use the strtoname() function. This naming convention is not mandatory, of course.
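To see what strtoname() does with the codes used here, a quick check along these lines may help. The three sample codes are taken from the list above, and the "ind" prefix is just the naming convention chosen in this answer:

```stata
* strtoname() replaces characters that are illegal in names with underscores
* already a legal name, so it is returned unchanged: ind6542
display strtoname("ind6542")
* the dot is illegal in a variable name, so this gives: indV27_2
display strtoname("indV27.2")
* leading zeros after the prefix are fine: ind05410
display strtoname("ind05410")
```

Without the prefix, a code such as 6542 would start with a digit, which is not allowed in a Stata variable name.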
There's more evaluation going on than is strictly needed, but it might suffice as it is. For each element of the local li, an observation needs no further checking once a replace has set its indicator to 1, yet the code still checks every dx variable.
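If the repeated replace statements ever become a bottleneck, one possible alternative (a sketch, not the approach used above) is to join all dx variables into a single string once and then search that string once per code. It assumes the codes themselves never contain spaces; the three-variable toy data, the helper variable alldx, and the short code list are made up for illustration:

```stata
clear
input str10(dx1 dx2 dx3)
"65421" "V270"  ""
"65221" "65801" "64232"
"64511" ""      "V270"
end

* join all dx variables into one string, separated by spaces
egen alldx = concat(dx*), punct(" ")

local li "6542 6522 6580"
foreach val of local li {
    local tname = strtoname("ind`val'")
    * prepending a space means every code, including the first, follows a
    * space, so searching for space + code is a "starts with" test
    gen byte `tname' = strpos(" " + alldx, " `val'") > 0
}

drop alldx
list
```

This trades the inner loop over dx variables for one extra string variable, so whether it is actually faster will depend on the size of the data.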
Maybe there's a better way to achieve your end result, but you don't say what that is. This seems to be only some intermediate step.
Run help <command_or_function> for details on specific syntax.
(Note that
list dx1 dx2 dx3 dx4 dx5 in 1/19
is more efficient than the command in your original post,
list dx1 dx2 dx3 dx4 dx5 if _n<20
because Stata does not need to check whether the if condition is met for every observation in the dataset. It simply lists the first 19 observations, which is exactly what _n<20 selects.)
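A small check of how the two qualifiers line up, using a made-up dataset of 100 observations (the variable id is only there so the dataset is not empty). Note that _n<20 selects 19 observations, so the matching range is in 1/19:

```stata
clear
set obs 100
gen id = _n

* if: the condition is evaluated for all 100 observations
count if _n < 20
local n_if = r(N)

* in: Stata goes straight to observations 1 through 19
count in 1/19
local n_in = r(N)

display "if selects `n_if' observations, in selects `n_in'"
```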