rcombinationspanel

Estimate combinations in panel data in r?


I am trying to maximize the number of data points in a cross sectional panel data. The structure of the matrix that I have is the following with years on the y axis and countries on the x axis:

        A     B    C     D 
2000   NA    50    NA    85
2001   110   75    76    86
2002   120   NA    78    87
2003   130   100   80    88

Therefore I am trying to find all of the possible combinations of yearly data points to get the most countries per combination. Using the example above I am trying to produce vectors, lists or other sort of object which resembles something like this:

2000, 2001, 2002, 2003 = D
2000, 2001, 2003 = D and B
2001, 2002, 2003 = D, A and C
2000, 2001 = D and B
2001, 2002 = D, A and C
2002, 2003 = D, A and C
2000 = D and B
2001 = A, B, C and D
2002 = A, C and D
2003 = A, B, C and D

It is sort of an abstract thing to do and I cannot wrap my head around it. I would appreciate any help.


Solution

  • Update

    Here is a solution that is a good starting place, but could probably be improved:

    library(RcppAlgos)
    getCombs <- function(myMat, upper = NULL, minYears = NULL) {
    
        numRows <- nrow(myMat)
        myColNames <- colnames(myMat)
    
        if (is.null(minYears))  ## set default
            repZero <- numRows - 1
        else if (minYears >= numRows || minYears < 1)  ## check for extreme cases
            repZero <- numRows - 1
        else
            repZero <- numRows - minYears
    
        combs <- comboGeneral(v = 0:numRows, m = numRows,
                              freqs = c(repZero, rep(1, numRows)),
                              upper = upper)
    
        ## I think this part could be improved
        out <- lapply(1:nrow(combs), \(x) {
            myRows <- myMat[combs[x,],]
    
            if (is.null(nrow(myRows)))
                result <- !is.na(myRows)
            else
                result <- complete.cases(t(myRows))
    
            myColNames[result]
        })
    
        myRowNames <- rownames(myMat)
        names(out) <- lapply(1:nrow(combs), \(x) {
            myRowNames[combs[x, combs[x, ] > 0]]
        })
        out
    }
    

    Here is the output for the OP’s example. (The OP is missing 5 of the outcomes below):

    testMat <- matrix(
        c(NA, 50, NA, 85,
          110, 75, 76, 86,
          120, NA, 78, 87,
          130, 100, 80, 88),
        nrow = 4, byrow = TRUE
    )
    
    row.names(testMat) <- 2000:2003
    colnames(testMat) <- LETTERS[1:4]
    getCombs(testMat)
    #> $`2000`
    #> [1] "B" "D"
    #> 
    #> $`2001`
    #> [1] "A" "B" "C" "D"
    #> 
    #> $`2002`
    #> [1] "A" "C" "D"
    #> 
    #> $`2003`
    #> [1] "A" "B" "C" "D"
    #> 
    #> $`c("2000", "2001")`
    #> [1] "B" "D"
    #> 
    #> $`c("2000", "2002")`
    #> [1] "D"
    #> 
    #> $`c("2000", "2003")`
    #> [1] "B" "D"
    #> 
    #> $`c("2001", "2002")`
    #> [1] "A" "C" "D"
    #> 
    #> $`c("2001", "2003")`
    #> [1] "A" "B" "C" "D"
    #> 
    #> $`c("2002", "2003")`
    #> [1] "A" "C" "D"
    #> 
    #> $`c("2000", "2001", "2002")`
    #> [1] "D"
    #> 
    #> $`c("2000", "2001", "2003")`
    #> [1] "B" "D"
    #> 
    #> $`c("2000", "2002", "2003")`
    #> [1] "D"
    #> 
    #> $`c("2001", "2002", "2003")`
    #> [1] "A" "C" "D"
    #> 
    #> $`c("2000", "2001", "2002", "2003")`
    #> [1] "D"
    

    However, this answer, or any future answer for that matter, won’t get you every combination as you have 144 countries and 47 years of data. That produces a very very large number. Every combination of any length up to n is equivalent to the power set. The number of elements in the power set is simply 2^n. Since we are not counting the equivalent of the empty set, we need to subtract one, thus:

    suppressWarnings(suppressMessages(library(gmp)))
    sub.bigz(pow.bigz(2, 47), 1)
    #> Big Integer ('bigz') :
    #> [1] 140737488355327
    

    Yes, that is over one hundred trillion!!! You will probably need to rethink your approach as there are simply too many outcomes.

    All is not lost! You can make use of the upper argument, to limit the number of outcomes, so you can still investigate possible combinations. Observe:

    set.seed(11111)
    biggerTest <- matrix(sample(100, 20*20, replace = TRUE), nrow = 20)
    
    colnames(biggerTest) <- LETTERS[1:20]
    rownames(biggerTest) <- 1988:2007
    
    ## set 10% of values to NA
    myNAs <- sample(400, 400 / 10)
    biggerTest[myNAs] <- NA
    
    biggerTest[1:6, 1:10]
    #>       A  B  C  D  E  F  G  H  I  J
    #> 1988 95 12 29 19 47 38 48 75 68 80
    #> 1989 98 NA 87 77 99 30 92 30 66 27
    #> 1990 33 49 17  2 77 NA 80 67  6 93
    #> 1991 70 35 49 45 52 NA 46 13 27 86
    #> 1992 42 37 28 95 34 13 53 NA  6 98
    #> 1993 23 14 10 96 61  6 41  8 93 NA
    
    
    ## Getting all 1,048,575 results takes a good bit of time
    system.time(allResults <- getCombs(biggerTest))
    #>    user  system elapsed 
    #>  20.488   0.330  20.852
    
    
    ## Using upper greatly reduces the amount of time
    system.time(smallSampTest <- getCombs(biggerTest, upper = 10000))
    #>    user  system elapsed 
    #>   0.083   0.001   0.084
    

    Alternatively, you can use the minYears argument to only return results with a minimum number of combinations of years. For example, per the OP’s comments to @CPak’s answer, if you only want to see results with 15 or more years of combinations, we have:

    system.time(minYearTest <- getCombs(biggerTest, minYears = 15))
    #>    user  system elapsed 
    #>   0.414   0.003   0.417
    
    
    set.seed(123)
    minYearTest[sample(length(minYearTest), 5)]
    #> $`c("1988", "1990", "1991", "1992", "1993", "1994", "1997", "1998", "1999", "2001", "2002", "2003", "2004", "2005", "2006", "2007")`
    #> [1] "E" "L" "R"
    #> 
    #> $`c("1988", "1990", "1991", "1992", "1993", "1995", "1996", "1997", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007")`
    #> [1] "E" "L" "R"
    #> 
    #> $`c("1988", "1989", "1990", "1991", "1992", "1996", "1998", "1999", "2000", "2001", "2002", "2003", "2005", "2006", "2007")`
    #> [1] "E" "L" "P" "R"
    #> 
    #> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1996", "1997", "1998", "2000", "2001", "2002", "2003", "2005", "2006")`
    #> [1] "E" "L" "R"
    #> 
    #> $`c("1988", "1989", "1990", "1991", "1993", "1994", "1995", "1997", "1998", "1999", "2001", "2002", "2003", "2004", "2007")`
    #> [1] "C" "D" "E" "I" "L" "R"
    

    Or use both arguments together:

    system.time(
        bothConstraintsTest <- getCombs(biggerTest, 10000, minYears = 10)
    )
    #>    user  system elapsed 
    #>   0.137   0.002   0.138
    
    
    bothConstraintsTest[1:5]
    #> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997")`
    #> [1] "A" "C" "D" "E" "I" "L" "R" "S"
    #> 
    #> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1998")`
    #> [1] "A" "C" "D" "E" "I" "L" "P" "R" "S"
    #> 
    #> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1999")`
    #> [1] "C" "D" "E" "I" "L" "P" "R"
    #> 
    #> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2000")`
    #> [1] "A" "C" "D" "E" "I" "L" "P" "R" "S"
    #> 
    #> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2001")`
    #> [1] "A" "C" "D" "E" "I" "L" "P" "R"
    

    Explanation

    The first thing we need to do is to determine every combination of n years. This boils down to finding all n-tuples of the multiset c(rep(0, n-1), 1:n) or equivalently, the power set of an n element set minus the empty set. For example, for the years 2000:2003 (4 year span), the possible combinations are given by:

    comboGeneral(v = 0:4, m = 4, freqs = c(3, rep(1, 4)))
    #>       [,1] [,2] [,3] [,4]
    #>  [1,]    0    0    0    1
    #>  [2,]    0    0    0    2
    #>  [3,]    0    0    0    3
    #>  [4,]    0    0    0    4
    #>  [5,]    0    0    1    2
    #>  [6,]    0    0    1    3
    #>  [7,]    0    0    1    4
    #>  [8,]    0    0    2    3
    #>  [9,]    0    0    2    4
    #> [10,]    0    0    3    4
    #> [11,]    0    1    2    3
    #> [12,]    0    1    2    4
    #> [13,]    0    1    3    4
    #> [14,]    0    2    3    4
    #> [15,]    1    2    3    4
    

    Now, we iterate over each row of our combinations where each row tells us which combination of rows from the original matrix to test for NAs. If the particular combination only contains one result, we determine which indices are not NA. That is easily carried out by !is.na(. If we have more than one row, we employ complete.cases(t to obtain the columns that have only numbers (i.e. no occurrences of NA).

    After this we are just using indexing to obtain names for our outcomes and Voila, we have our desired results.