rdplyrgroup-bylagplm

lag/lead N time periods over a grouped variables


I'm grouping some variables and lagging (4 time periods) but I get NA values. I've tried using dplyr::lag and plm::lag

test = structure(list(cnty90 = c(1005, 1005, 1005, 1005, 1005, 1005, 
                                 1005, 1005, 1005, 1005, 1005, 1005, 1005, 1005, 1005, 1005, 1015, 
                                 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 
                                 1015, 1015, 1015, 1015, 1031, 1031, 1031, 1031, 1031, 1031, 1031, 
                                 1031, 1031, 1031, 1031, 1031, 1031, 1031, 1031, 1031, 1039, 1039), 
                      year = c(1868, 1872, 1876, 1880, 1884, 1888, 1892, 1896, 1900,
                               1904, 1908, 1912, 1916, 1920, 1924, 1928, 1868, 1872, 1876, 1880,
                               1884, 1888, 1892, 1896, 1900, 1904, 1908, 1912, 1916, 1920, 1924,
                               1928, 1868, 1872, 1876, 1880, 1884, 1888, 1892, 1896, 1900, 1904,
                               1908, 1912, 1916, 1920, 1924, 1928, 1868, 1872), 
                      numdailies = c(0, 0, 0, 0, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 
                                     0, 2,2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
                                     0, 0, 0, 0, 1, 0, 0, 0, 0, 0)), row.names = c(NA, 50L), class = "data.frame")


using dplyr

test %>% 
    group_by(cnty90, year) %>% 
    mutate(numdailies_l1 = lag(numdailies, n = 1),
           changedailies_for=lead(numdailies, n=1))

# A tibble: 50 x 6
# Groups:   cnty90, year [50]
   cnty90  year numdailies numdailies_l1 changedailies changedailies_for
    <dbl> <dbl>      <dbl>         <dbl>         <dbl>             <dbl>
 1   1005  1868          0            NA            NA                NA
 2   1005  1872          0            NA            NA                NA
 3   1005  1876          0            NA            NA                NA
 4   1005  1880          0            NA            NA                NA
 5   1005  1884          2            NA            NA                NA
 6   1005  1888          2            NA            NA                NA
 7   1005  1892          2            NA            NA                NA
 8   1005  1896          1            NA            NA                NA
 9   1005  1900          1            NA            NA                NA
10   1005  1904          1            NA            NA                NA
11   1005  1908          1            NA            NA                NA
12   1005  1912          1            NA            NA                NA
13   1005  1916          1            NA            NA                NA
14   1005  1920          1            NA            NA                NA
15   1005  1924          1            NA            NA                NA
16   1005  1928          1            NA            NA                NA
17   1015  1868          0            NA            NA                NA
18   1015  1872          0            NA            NA                NA
19   1015  1876          0            NA            NA                NA
20   1015  1880          0            NA            NA                NA


using data.table

test = data.table::data.table(test)
test[, ":=" (numdailies_l1 = lag(numdailies, n = 1),
             changedailies=numdailies-numdailies_l1,
             changedailies_for=lead(numdailies, n=1)), by = .(cnty90, year)]

What I need to get

# A tibble: 20 x 6
   cnty90  year numdailies numdailies_l1 changedailies changedailies_for
    <int> <int>      <int>         <int>         <int>             <int>
 1   1005  1868          0            NA            NA                 0
 2   1005  1872          0             0             0                 0
 3   1005  1876          0             0             0                 0
 4   1005  1880          0             0             0                 2
 5   1005  1884          2             0             2                 0
 6   1005  1888          2             2             0                 0
 7   1005  1892          2             2             0                -1
 8   1005  1896          1             2            -1                 0
 9   1005  1900          1             1             0                 0
10   1005  1904          1             1             0                 0
11   1005  1908          1             1             0                 0
12   1005  1912          1             1             0                 0
13   1005  1916          1             1             0                 0
14   1005  1920          1             1             0                 0
15   1005  1924          1             1             0                 0
16   1005  1928          1             1             0                NA
17   1015  1868          0            NA            NA                 0
18   1015  1872          0             0             0                 0
19   1015  1876          0             0             0                 0
20   1015  1880          0             0             0                 0

Solution

  • Try not to group by year. test %>% group_by(cnty90) %>%...

    When you are grouping, you are actually splitting the original data frame into a series of smaller data frames, with each "new" piece holding the grouping variables equal.

    In your provided sample:

      cnty90  year numdailies numdailies_l1 changedailies changedailies_for
        <dbl> <dbl>      <dbl>         <dbl>         <dbl>             <dbl>
     1   1005  1868          0            NA            NA                NA
     2   1005  1872          0            NA            NA                NA
     3   1005  1876          0            NA            NA                NA
     4   1005  1880          0            NA            NA                NA
     5   1005  1884          2            NA            NA                NA 
    

    For every combination of "Cntry90" and "year" there is only 1 row each, thus there are no other rows to lag, thus creating the NA.

    By only grouping by cnty90 variable, there are now 16 rows for cnty90 =1005 thus the ability to perform the lag function. The same will also apply when cnty90 =1015