In a database containing firm and patent class values, I want to calculate the following variables:
Technological abandonment: Number of previously active technological patent classes abandoned annually.
Specifically, I want to create variables that calculate the number of patent classes (variable = class) that the firm has used in the past 3 years (t-3, t-2, and t-1) (min observation of one year prior is acceptable if the firm history initially doesn't have 3 years) but are missing in this year (t) I would like to do the same with a 5 year window as well.
I have a dataset containing millions of rows, so a fast data.table solution is much preferred.
In the following dataset:
df <- data.table(year=c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
category = c("A","A","B","C","A","D","F","F","C","A","B"))
The desired outcome would be (for a three year window):
year class tech_aband_3
1: 1979 A 0
2: 1979 A 0
3: 1980 B 1
4: 1980 C 1
5: 1981 A 2
6: 1981 D 2
7: 1982 F 4
8: 1983 F 3
9: 1983 C 3
10: 1984 A 3
11: 1984 B 3
Many thanks in advance.
Assuming that all years are represented in the data (if not, you'd need to fill missing years for the following to work), you can try:
library(data.table)
df[, .(category = list(unique(category))), by = year
][, tech_aband_3 := lengths(mapply(\(x, y) setdiff(unlist(x), y),
transpose(shift(list(category), 1:3, fill = first(category[[1]]))),
category))
][, .(category = unlist(category)), by = .(year, tech_aband_3)
][ df, on = .(year, category)
]
year tech_aband_3 category
<num> <int> <char>
1: 1979 0 A
2: 1979 0 A
3: 1980 1 B
4: 1980 1 C
5: 1981 2 A
6: 1981 2 D
7: 1982 4 F
8: 1983 3 F
9: 1983 3 C
10: 1984 3 A
11: 1984 3 B