The overall goal of the exercise is to find perfectly collinear columns in a very large but sparse matrix X in R, in the context of a linear regression problem. One approach that pops up from time to time is to make use of the QR decomposition underlying lm() to extract the OLS coefficients and drop all variables whose estimate is NA. While base::qr() is much too slow to be a viable option, Matrix::qr() handles the sparse matrix well and is quite fast. Unfortunately, the results of the two implementations are quite different.
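For reference, a minimal sketch of that lm()-based idiom (with made-up variables z1, z2, z3 used purely for illustration):
# lm() pivots internally and reports NA for aliased (collinear) coefficients
z1 <- rnorm(50)
z2 <- rnorm(50)
z3 <- z1 + z2                       # perfectly collinear by construction
fit <- lm(rnorm(50) ~ z1 + z2 + z3)
coef(fit)                           # the estimate for z3 is NA
keep <- !is.na(coef(fit))           # flags the coefficients to retain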
Consider the toy data set:
# random design matrix
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
X <- cbind(1, x1, x2, x3)
# random outcome variable
y <- rnorm(100)
where variable x3 is a linear combination of x1 and x2, and therefore the goal is to drop x3 from X. We can quickly estimate the OLS coefficients using base as follows:
# base implementation
qr.base <- base::qr(X)
coef.base <- base::qr.coef(qr.base, y)
coef.base
                     x1          x2          x3 
-0.04330410 -0.04889519  0.07719935          NA 
Here, the coefficient of x3 is automatically set to NA, as desired. Similar results can be obtained by using the dense matrix X as input for Matrix::qr(). However, things change when working with a sparse matrix class:
# Matrix implementation (sparse)
X.sparse <- Matrix::sparse.model.matrix(~ x1 + x2 + x3)
qr.matrix <- Matrix::qr(X.sparse)
coef.matrix <- Matrix::qr.coef(qr.matrix, y)
coef.matrix
  (Intercept)            x1            x2            x3 
-3.125000e-02 -2.088811e+14 -2.088811e+14  2.088811e+14 
where something clearly goes wrong because X is not of full rank. Is there a way to generate results similar to what base::qr() gives using Matrix::qr()?
It looks like Matrix::qr.coef(qr.matrix, y) gives a different answer because it never drops columns/pivots to deal with collinearity. It thus seems like you would need to deal with the collinear columns before computing the QR decomposition of X.sparse. If you wanted to implement this, one way to determine which columns need to be dropped uses a trick based on the reduced row echelon form of t(X) %*% X, which works as follows.
Consider a slight variation on your toy data set:
# random design matrix
set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
x4 <- rnorm(100)
x5 <- x4
x6 <- rnorm(100)
X <- cbind(intercept=1, x1, x2, x3, x4, x5, x6)
# random outcome variable
y <- rnorm(100)
I've added two "flavors" of collinearity because I think it's helpful to see the distinction in what follows. I also added an independent predictor x6 to show how the pattern extends with additional predictors.
We can extract information about collinearity from the matrix t(X) %*% X. This is because it's the inability to invert this singular matrix that causes the key problems with collinearity in the first place. In ordinary least squares, this is also what prevents us from directly calculating
beta <- solve(t(X) %*% X) %*% t(X) %*% y
Error in solve.default(t(X) %*% X) :
Lapack routine dgesv: system is exactly singular: U[6,6] = 0
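As a quick side check (optional, and subject to the default numerical tolerance of the rank computations), the rank deficiency can be confirmed directly:
# X has 7 columns, but x3 and x5 are linear combinations of other columns
qr(X)$rank             # 5, while ncol(X) is 7
Matrix::rankMatrix(X)  # same conclusion via the Matrix package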
The process works as follows:
# compute t(X) %*% X
tX_X <- t(X) %*% X
# compute the reduced row echelon form of tX_X
rref_tX_X <- pracma::rref(tX_X)
rref_tX_X
          intercept x1 x2 x3 x4 x5 x6
intercept         1  0  0  0  0  0  0
x1                0  1  0  1  0  0  0
x2                0  0  1  1  0  0  0
x3                0  0  0  0  1  1  0
x4                0  0  0  0  0  0  1
x5                0  0  0  0  0  0  0
x6                0  0  0  0  0  0  0
# relate predictors to each other to "visualize" non-independence
collinear_sets_matrix <- t(rref_tX_X) %*% rref_tX_X
collinear_sets_matrix
          intercept x1 x2 x3 x4 x5 x6
intercept         1  0  0  0  0  0  0
x1                0  1  0  1  0  0  0
x2                0  0  1  1  0  0  0
x3                0  1  1  2  0  0  0
x4                0  0  0  0  1  1  0
x5                0  0  0  0  1  1  0
x6                0  0  0  0  0  0  1
In collinear_sets_matrix, any row/column i with a 1 in position i and 0s elsewhere does not belong to a collinear set. In the example above, this only includes the intercept and predictor x6.
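If you'd rather check this programmatically than by eye, here is a small sketch using the objects computed above (the helper logic is mine, not from any package):
# a predictor is outside every collinear set when its row/column in
# collinear_sets_matrix has a 1 on the diagonal and 0s elsewhere
standalone <- diag(collinear_sets_matrix) == 1 &
  colSums(collinear_sets_matrix != 0) == 1
names(which(standalone))
[1] "intercept" "x6"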
Given rref_tX_X, one could also work through the matrix columns left to right to determine which predictors to keep and which to drop. Keep a counter k of columns to keep, starting at 1. When a column of rref_tX_X has a 1 in position k and 0s elsewhere, keep that column and increment k. Continue until all columns have been checked.
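Here is a short sketch of that scan (my own helper code, reusing rref_tX_X from above; with noisier input you'd want a numeric tolerance instead of exact comparisons to 0 and 1):
keep <- logical(ncol(rref_tX_X))
k <- 1
for (j in seq_len(ncol(rref_tX_X))) {
  col_j <- rref_tX_X[, j]
  # keep column j only if it is the k-th standard basis vector
  if (k <= nrow(rref_tX_X) && col_j[k] == 1 && sum(col_j != 0) == 1) {
    keep[j] <- TRUE
    k <- k + 1
  }
}
colnames(X)[keep]
[1] "intercept" "x1"        "x2"        "x4"        "x6"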
In the example above, we would keep intercept, x1, x2, x4, and x6, which also matches the values that are not NA following the base::qr() and base::qr.coef() process:
qr.base <- base::qr(X)
coef.base <- base::qr.coef(qr.base, y)
coef.base
   intercept           x1           x2           x3           x4           x5 
-0.040390807 -0.067473889  0.001486699           NA  0.046349008           NA 
          x6 
-0.098773117 
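Finally, to tie this back to the original question: once the collinear columns are dropped, Matrix::qr() has nothing left to trip over. A sketch, reusing the keep vector from the scan above (the estimates should match the non-NA entries of coef.base):
# sparse QR on the reduced, full-rank design matrix
X.sparse.reduced <- Matrix::Matrix(X[, keep], sparse = TRUE)
qr.reduced <- Matrix::qr(X.sparse.reduced)
Matrix::qr.coef(qr.reduced, y)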