I am trying to maximize the number of data points in a cross sectional panel data. The structure of the matrix that I have is the following with years on the y axis and countries on the x axis:
A B C D
2000 NA 50 NA 85
2001 110 75 76 86
2002 120 NA 78 87
2003 130 100 80 88
Therefore I am trying to find all of the possible combinations of yearly data points to get the most countries per combination. Using the example above I am trying to produce vectors, lists or other sort of object which resembles something like this:
2000, 2001, 2002, 2003 = D
2000, 2001, 2003 = D and B
2001, 2002, 2003 = D, A and C
2000, 2001 = D and B
2001, 2002 = D, A and C
2002, 2003 = D, A and C
2000 = D and B
2001 = A, B, C and D
2002 = A, C and D
2003 = A, B, C and D
It is sort of an abstract thing to do and I cannot wrap my head around it. I would appreciate any help.
Here is a solution that is a good starting place, but could probably be improved:
library(RcppAlgos)
getCombs <- function(myMat, upper = NULL, minYears = NULL) {
numRows <- nrow(myMat)
myColNames <- colnames(myMat)
if (is.null(minYears)) ## set default
repZero <- numRows - 1
else if (minYears >= numRows || minYears < 1) ## check for extreme cases
repZero <- numRows - 1
else
repZero <- numRows - minYears
combs <- comboGeneral(v = 0:numRows, m = numRows,
freqs = c(repZero, rep(1, numRows)),
upper = upper)
## I think this part could be improved
out <- lapply(1:nrow(combs), \(x) {
myRows <- myMat[combs[x,],]
if (is.null(nrow(myRows)))
result <- !is.na(myRows)
else
result <- complete.cases(t(myRows))
myColNames[result]
})
myRowNames <- rownames(myMat)
names(out) <- lapply(1:nrow(combs), \(x) {
myRowNames[combs[x, combs[x, ] > 0]]
})
out
}
Here is the output for the OP’s example. (The OP is missing 5 of the outcomes below):
testMat <- matrix(
c(NA, 50, NA, 85,
110, 75, 76, 86,
120, NA, 78, 87,
130, 100, 80, 88),
nrow = 4, byrow = TRUE
)
row.names(testMat) <- 2000:2003
colnames(testMat) <- LETTERS[1:4]
getCombs(testMat)
#> $`2000`
#> [1] "B" "D"
#>
#> $`2001`
#> [1] "A" "B" "C" "D"
#>
#> $`2002`
#> [1] "A" "C" "D"
#>
#> $`2003`
#> [1] "A" "B" "C" "D"
#>
#> $`c("2000", "2001")`
#> [1] "B" "D"
#>
#> $`c("2000", "2002")`
#> [1] "D"
#>
#> $`c("2000", "2003")`
#> [1] "B" "D"
#>
#> $`c("2001", "2002")`
#> [1] "A" "C" "D"
#>
#> $`c("2001", "2003")`
#> [1] "A" "B" "C" "D"
#>
#> $`c("2002", "2003")`
#> [1] "A" "C" "D"
#>
#> $`c("2000", "2001", "2002")`
#> [1] "D"
#>
#> $`c("2000", "2001", "2003")`
#> [1] "B" "D"
#>
#> $`c("2000", "2002", "2003")`
#> [1] "D"
#>
#> $`c("2001", "2002", "2003")`
#> [1] "A" "C" "D"
#>
#> $`c("2000", "2001", "2002", "2003")`
#> [1] "D"
However, this answer, or any future answer for that matter, won’t get you every combination as you have 144 countries and 47 years of data. That produces a very very large number. Every combination of any length up to n is equivalent to the power set. The number of elements in the power set is simply 2^n
. Since we are not counting the equivalent of the empty set, we need to subtract one, thus:
suppressWarnings(suppressMessages(library(gmp)))
sub.bigz(pow.bigz(2, 47), 1)
#> Big Integer ('bigz') :
#> [1] 140737488355327
Yes, that is over one hundred trillion!!! You will probably need to rethink your approach as there are simply too many outcomes.
All is not lost! You can make use of the upper
argument, to limit the number of outcomes, so you can still investigate possible combinations. Observe:
set.seed(11111)
biggerTest <- matrix(sample(100, 20*20, replace = TRUE), nrow = 20)
colnames(biggerTest) <- LETTERS[1:20]
rownames(biggerTest) <- 1988:2007
## set 10% of values to NA
myNAs <- sample(400, 400 / 10)
biggerTest[myNAs] <- NA
biggerTest[1:6, 1:10]
#> A B C D E F G H I J
#> 1988 95 12 29 19 47 38 48 75 68 80
#> 1989 98 NA 87 77 99 30 92 30 66 27
#> 1990 33 49 17 2 77 NA 80 67 6 93
#> 1991 70 35 49 45 52 NA 46 13 27 86
#> 1992 42 37 28 95 34 13 53 NA 6 98
#> 1993 23 14 10 96 61 6 41 8 93 NA
## Getting all 1,048,575 results takes a good bit of time
system.time(allResults <- getCombs(biggerTest))
#> user system elapsed
#> 20.488 0.330 20.852
## Using upper greatly reduces the amount of time
system.time(smallSampTest <- getCombs(biggerTest, upper = 10000))
#> user system elapsed
#> 0.083 0.001 0.084
Alternatively, you can use the minYears
argument to only return results with a minimum number of combinations of years. For example, per the OP’s comments to @CPak’s answer, if you only want to see results with 15 or more years of combinations, we have:
system.time(minYearTest <- getCombs(biggerTest, minYears = 15))
#> user system elapsed
#> 0.414 0.003 0.417
set.seed(123)
minYearTest[sample(length(minYearTest), 5)]
#> $`c("1988", "1990", "1991", "1992", "1993", "1994", "1997", "1998", "1999", "2001", "2002", "2003", "2004", "2005", "2006", "2007")`
#> [1] "E" "L" "R"
#>
#> $`c("1988", "1990", "1991", "1992", "1993", "1995", "1996", "1997", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007")`
#> [1] "E" "L" "R"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1996", "1998", "1999", "2000", "2001", "2002", "2003", "2005", "2006", "2007")`
#> [1] "E" "L" "P" "R"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1996", "1997", "1998", "2000", "2001", "2002", "2003", "2005", "2006")`
#> [1] "E" "L" "R"
#>
#> $`c("1988", "1989", "1990", "1991", "1993", "1994", "1995", "1997", "1998", "1999", "2001", "2002", "2003", "2004", "2007")`
#> [1] "C" "D" "E" "I" "L" "R"
Or use both arguments together:
system.time(
bothConstraintsTest <- getCombs(biggerTest, 10000, minYears = 10)
)
#> user system elapsed
#> 0.137 0.002 0.138
bothConstraintsTest[1:5]
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997")`
#> [1] "A" "C" "D" "E" "I" "L" "R" "S"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1998")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R" "S"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1999")`
#> [1] "C" "D" "E" "I" "L" "P" "R"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2000")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R" "S"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2001")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R"
The first thing we need to do is to determine every combination of n years. This boils down to finding all n-tuples of the multiset c(rep(0, n-1), 1:n)
or equivalently, the power set of an n element set minus the empty set. For example, for the years 2000:2003
(4 year span), the possible combinations are given by:
comboGeneral(v = 0:4, m = 4, freqs = c(3, rep(1, 4)))
#> [,1] [,2] [,3] [,4]
#> [1,] 0 0 0 1
#> [2,] 0 0 0 2
#> [3,] 0 0 0 3
#> [4,] 0 0 0 4
#> [5,] 0 0 1 2
#> [6,] 0 0 1 3
#> [7,] 0 0 1 4
#> [8,] 0 0 2 3
#> [9,] 0 0 2 4
#> [10,] 0 0 3 4
#> [11,] 0 1 2 3
#> [12,] 0 1 2 4
#> [13,] 0 1 3 4
#> [14,] 0 2 3 4
#> [15,] 1 2 3 4
Now, we iterate over each row of our combinations where each row tells us which combination of rows from the original matrix to test for NAs
. If the particular combination only contains one result, we determine which indices are not NA
. That is easily carried out by !is.na(
. If we have more than one row, we employ complete.cases(t
to obtain the columns that have only numbers (i.e. no occurrences of NA
).
After this we are just using indexing to obtain names for our outcomes and Voila, we have our desired results.