I am trying to generate a new variable that is equal to the share of winners by state for each year in Stata.
I am using the egen command and would like to know whether it is the appropriate command for this. My dataset is extremely large, so it is hard for me to check manually. I have created a dummy variable for each year, and award_winner is a binary variable equal to 1 if the business won the award that year and 0 if it did not.
sort state year_dummy*
by state year_dummy*: egen winner_bystate_year = mean(award_winner)
This is easy enough to test with a small fake dataset in which the correct answers are clear. I don't know why you introduced dummy variables when you could work directly with year, but the answer's the same.
clear
set obs 12
gen state = cond(_n < 7, "A", "B")
egen year = seq(), from(2019) to(2020) block(3)
gen award_winner = real(word("0 0 0 0 0 1 0 1 1 1 1 1", _n))
gen order = _n
tab year, gen(year)
bysort state year?: egen suggested = mean(award_winner)
bysort state year: egen better = mean(award_winner)
sort order
list, sepby(state year)
     +-----------------------------------------------------------------------+
     | state   year   award_~r   order   year1   year2   sugges~d     better |
     |-----------------------------------------------------------------------|
  1. |     A   2019          0       1       1       0          0          0 |
  2. |     A   2019          0       2       1       0          0          0 |
  3. |     A   2019          0       3       1       0          0          0 |
     |-----------------------------------------------------------------------|
  4. |     A   2020          0       4       0       1   .3333333   .3333333 |
  5. |     A   2020          0       5       0       1   .3333333   .3333333 |
  6. |     A   2020          1       6       0       1   .3333333   .3333333 |
     |-----------------------------------------------------------------------|
  7. |     B   2019          0       7       1       0   .6666667   .6666667 |
  8. |     B   2019          1       8       1       0   .6666667   .6666667 |
  9. |     B   2019          1       9       1       0   .6666667   .6666667 |
     |-----------------------------------------------------------------------|
 10. |     B   2020          1      10       0       1          1          1 |
 11. |     B   2020          1      11       0       1          1          1 |
 12. |     B   2020          1      12       0       1          1          1 |
     +-----------------------------------------------------------------------+
The general principle is simple and important: to test code for statistical software, use a simple dataset for which there are known or obvious answers. Here "known" could be answers given by an existing implementation in the same or other software that is presumed correct.
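With a dataset too large to inspect by eye, the same idea can be made mechanical with assert, which stops with an error if a condition fails for any observation. The following is a minimal sketch, not a definitive test: it rebuilds the toy dataset, recomputes the share independently as the number of winners divided by the group size, and compares that with the egen result. The variable names total_win and by_hand are hypothetical, and the tolerance of 1e-6 is an assumption chosen to allow for float rounding.

```stata
* Rebuild the 12-observation toy dataset
clear
set obs 12
gen state = cond(_n < 7, "A", "B")
egen year = seq(), from(2019) to(2020) block(3)
gen award_winner = real(word("0 0 0 0 0 1 0 1 1 1 1 1", _n))

* Group means computed with egen, working directly with year
bysort state year: egen better = mean(award_winner)

* Independent check (total_win and by_hand are hypothetical names):
* share = number of winners in the group / number of observations in the group
bysort state year: egen double total_win = total(award_winner)
bysort state year: gen double by_hand = total_win / _N

* assert exits with an error if the condition is false for any observation
assert reldif(better, by_hand) < 1e-6

* Spot-check the groups whose answers are obvious by inspection
assert better == 0 if state == "A" & year == 2019
assert better == 1 if state == "B" & year == 2020
```

If every assert passes silently, the two calculations agree. The comparison of the egen result with the by-hand share could then be run on the full dataset as well, since it does not depend on knowing the answers in advance.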