performanceloopsstatacalculated-columnsprocessing-efficiency

How to speed up row-specific operation based on values of other variables


Say I have this data:

sysuse auto2, clear
keep if _n<=4

describe
local N = r(N)

gen a1 = price
gen a2 = mpg
gen a3 = headroom 
gen a4 = trunk
gen a5 = weight
gen a6 = length

input yearA yearB 
1 4
1 5
2 5
1 6

keep a1-a6 yearA yearB

I'd like to do a row-specific operation based on the value of other variables. As an example, I'd like to add up all a columns corresponding to some row-specific rule, in this case starting a year after yearA and a year before yearB. So, if yearA==1 and yearB==5, the starting year is 2 and the end year is 4, so we would add a2, a3, and a4 together to get that row's total. Each row has its own rule corresponding to (a function of) its values of yearA and yearB.

I came up with the following solution, which works, but it is clunky and slow:

gen total = .
forvalues i = 1/`N' {
    local start = yearA[`i']+1
    local end = yearB[`i']-1
    display "`start' `end'"
    *annoyingly, you can't replace with egen, so create a new variable and delete it
    egen total`i' = rowtotal(a`start'-a`end')
    replace total = total`i' if _n==`i'
    drop total`i'
}

As noted in the comment in the loop, I resorted to creating a new variable for each row and deleting it after using its value. Why? Because it doesn't seem like one can use replace with egen.

The actual application creates multiple variables and there are millions of observations, so it takes many hours or even days to run. What is a faster way to accomplish my goal? I am in now tied to doing things row-by-row if there is a better way.


Solution

  • gen wanted = 0 
    
    forval j = 1/6 { 
        replace wanted = wanted + a`j' if inrange(`j', yearA + 1, yearB - 1)
    }