The overall goal of the exercise is to find perfectly collinear columns in a very large but sparse matrix X in R, in the context of a linear regression problem. One approach that pops up from time to time is to make use of the QR decomposition underlying lm() to extract the OLS coefficients and drop all variables whose estimate is NA. While base::qr() is much too slow to be a viable option, Matrix::qr() handles the sparse matrix well and is quite fast. Unfortunately, the results of the two implementations are quite different.
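For reference, a minimal sketch of that lm()-based idiom (with made-up variables z1, z2, z3 used purely for illustration):
# lm() pivots internally and reports NA for aliased (collinear) coefficients
z1 <- rnorm(50)
z2 <- rnorm(50)
z3 <- z1 + z2                       # perfectly collinear by construction
fit <- lm(rnorm(50) ~ z1 + z2 + z3)
coef(fit)                           # the estimate for z3 is NA
keep <- !is.na(coef(fit))           # flags the coefficients to retain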
Consider the toy data set:
# random design matrix
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
X <- cbind(1, x1, x2, x3)
# random outcome variable
y <- rnorm(100)
where variable x3 is a linear combination of x1 and x2, and therefore the goal is to drop x3 from X. We can quickly estimate the OLS coefficients using base as follows:
# base implementation
qr.base <- base::qr(X)
coef.base <- base::qr.coef(qr.base, y)
coef.base
                     x1          x2          x3 
-0.04330410 -0.04889519  0.07719935          NA 
Here, the coefficient of x3 is automatically set to NA, as desired. Similar results can be obtained by using the dense matrix X as input for Matrix::qr(). However, things change when working with a sparse matrix class:
# Matrix implementation (sparse)
X.sparse <- Matrix::sparse.model.matrix(~ x1 + x2 + x3)
qr.matrix <- Matrix::qr(X.sparse)
coef.matrix <- Matrix::qr.coef(qr.matrix, y)
coef.matrix
  (Intercept)            x1            x2            x3 
-3.125000e-02 -2.088811e+14 -2.088811e+14  2.088811e+14 
where something clearly goes wrong because X is not of full rank. Is there a way to generate results similar to what base::qr() gives using Matrix::qr()?
It looks like Matrix::qr.coef(qr.matrix, y) gives a different answer because it never drops columns/pivots to deal with collinearity. It thus seems like you would need to deal with the collinear columns before computing the QR decomposition of X.sparse. If you wanted to implement this, one way to determine which columns need to be dropped uses a trick based on the reduced row echelon form of t(X) %*% X, which works as follows.
Consider a slight variation on your toy data set:
# random design matrix
set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
x4 <- rnorm(100)
x5 <- x4
x6 <- rnorm(100)
X <- cbind(intercept=1, x1, x2, x3, x4, x5, x6)
# random outcome variable
y <- rnorm(100)
I've added two "flavors" of collinearity because I think it's helpful to see the distinction in what follows. I also added an independent predictor x6 to show how the pattern extends with additional predictors.
We can extract information about collinearity from the matrix t(X) %*% X. This is because it's the inability to invert this singular matrix that causes the key problems with collinearity in the first place. In ordinary least squares, this is also what prevents us from directly calculating
beta <- solve(t(X) %*% X) %*% t(X) %*% y
Error in solve.default(t(X) %*% X) :
Lapack routine dgesv: system is exactly singular: U[6,6] = 0
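As a quick side check (optional, and subject to the default numerical tolerance of the rank computations), the rank deficiency can be confirmed directly:
# X has 7 columns, but x3 and x5 are linear combinations of other columns
qr(X)$rank             # 5, while ncol(X) is 7
Matrix::rankMatrix(X)  # same conclusion via the Matrix package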
The process works as follows:
# compute t(X) %*% X
tX_X <- t(X) %*% X
# compute the reduced row echelon form of tX_X
rref_tX_X <- pracma::rref(tX_X)
rref_tX_X
          intercept x1 x2 x3 x4 x5 x6
intercept         1  0  0  0  0  0  0
x1                0  1  0  1  0  0  0
x2                0  0  1  1  0  0  0
x3                0  0  0  0  1  1  0
x4                0  0  0  0  0  0  1
x5                0  0  0  0  0  0  0
x6                0  0  0  0  0  0  0
# relate predictors to each other to "visualize" non-independence
collinear_sets_matrix <- t(rref_tX_X) %*% rref_tX_X
collinear_sets_matrix
          intercept x1 x2 x3 x4 x5 x6
intercept         1  0  0  0  0  0  0
x1                0  1  0  1  0  0  0
x2                0  0  1  1  0  0  0
x3                0  1  1  2  0  0  0
x4                0  0  0  0  1  1  0
x5                0  0  0  0  1  1  0
x6                0  0  0  0  0  0  1
In collinear_sets_matrix, any row/column i with a 1 in position i and 0s elsewhere does not belong to a collinear set. In the example above, this only includes the intercept and predictor x6.
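If you'd rather check this programmatically than by eye, here is a small sketch using the objects computed above (the helper logic is mine, not from any package):
# a predictor is outside every collinear set when its row/column in
# collinear_sets_matrix has a 1 on the diagonal and 0s elsewhere
standalone <- diag(collinear_sets_matrix) == 1 &
  colSums(collinear_sets_matrix != 0) == 1
names(which(standalone))
[1] "intercept" "x6"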
Given rref_tX_X, one could also work through the matrix columns left to right to determine which predictors to keep and which to drop. Keep a counter k of columns to keep, starting at 1. When a column of rref_tX_X has a 1 in position k and 0s elsewhere, keep that column and increment k. Continue until all columns have been checked.
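Here is a short sketch of that scan (my own helper code, reusing rref_tX_X from above; with noisier input you'd want a numeric tolerance instead of exact comparisons to 0 and 1):
keep <- logical(ncol(rref_tX_X))
k <- 1
for (j in seq_len(ncol(rref_tX_X))) {
  col_j <- rref_tX_X[, j]
  # keep column j only if it is the k-th standard basis vector
  if (k <= nrow(rref_tX_X) && col_j[k] == 1 && sum(col_j != 0) == 1) {
    keep[j] <- TRUE
    k <- k + 1
  }
}
colnames(X)[keep]
[1] "intercept" "x1"        "x2"        "x4"        "x6"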
In the example above, we would keep intercept, x1, x2, x4, and x6, which also matches the values that are not NA following the base::qr() and base::qr.coef() process:
qr.base <- base::qr(X)
coef.base <- base::qr.coef(qr.base, y)
coef.base
   intercept           x1           x2           x3           x4           x5 
-0.040390807 -0.067473889  0.001486699           NA  0.046349008           NA 
          x6 
-0.098773117 
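Finally, to tie this back to the original question: once the collinear columns are dropped, Matrix::qr() has nothing left to trip over. A sketch, reusing the keep vector from the scan above (the estimates should match the non-NA entries of coef.base):
# sparse QR on the reduced, full-rank design matrix
X.sparse.reduced <- Matrix::Matrix(X[, keep], sparse = TRUE)
qr.reduced <- Matrix::qr(X.sparse.reduced)
Matrix::qr.coef(qr.reduced, y)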