rregressionlinear-regressionmatrix-factorizationqr-decomposition

Unable to get Linear Regression Cofficients in R after Successfully finding Q & R via Householder


I am manually trying to calculate the regression coefficients rather than using any default for data http://people.sc.fsu.edu/~jburkardt/datasets/regression/x31.txt

Here is my code, which properly produces Q&R satisfying A=QR. But I am not able to find coefficients as dimensions of Q & R creating issues. Can any experts help me here? When I have proper Q & R how finding coefficients can go wrong?

library(xlsx)
data.df<-read.xlsx("regression.xlsx",2,header = F)
#Remove unneccesary Index position
data.df<-data.df[2:5]

#Decomposing
#coefficients [b]=inv(t(X)(Matrix Multiplication)(X))(Matrix Multiplication)t(X)(Matrix Multiplication)y
Y<-as.matrix(data.df[4])
#But note that if you need to express Y=Xb, the coefficient of b_0 must be X_0 
#which is 1
X_0<-as.data.frame(c(1,1,1,1,1,1,1,1,1,1))
X<-(cbind(X_0,data.df[1:3]))
names(X)<-c("X1","X2","X3","X4")
X<-as.matrix(X)
#Create copy for final evaluvations
A<-X  


x1<-as.matrix(X[,1])
deter_x<-sqrt(sum(x1^2))
n=dim(x1)[1]
deter_e1<-as.matrix(c(deter_x,rep(0,n-1)))
v1=x1+deter_e1
#c_i=2/(transpose(v_i)%*%(v_i))
c1<-as.numeric(2/(t(v1)%*%v1))
#H_i = I - c_i*v_i%*%transpose(v_i)
I<-diag(n)
H1<-I-c1*(v1%*%t(v1))
R1<-H1%*%X
#Check R1 and see if it is Upper Triangle Matrix
R1
#We will take rest of the interesting portion of matrix R1.
n=dim(R1)[1]
X<-as.matrix(as.data.frame(cbind(R1[2:n,2],R1[2:n,3],R1[2:n,4])))
x1<-as.matrix(X[,1])
deter_x<-sqrt(sum(x1^2))
n=dim(x1)[1]
deter_e1<-as.matrix(c(deter_x,rep(0,n-1)))
v1=x1-deter_e1
#c_i=2/(transpose(v_i)%*%(v_i))
c1<-as.numeric(2/(t(v1)%*%v1))
#H_i = I - c_i*v_i%*%transpose(v_i)
I<-diag(n)
H2<-I-c1*(v1%*%t(v1))
R2<-H2%*%X

#Check R2 and see if it is Upper Triangle Matrix, if no go for R3
n=dim(R2)[1]
X<-as.matrix(as.data.frame(cbind(R2[2:n,2],R2[2:n,3])))
x1<-as.matrix(X[,1])
deter_x<-sqrt(sum(x1^2))
n=dim(x1)[1]
deter_e1<-as.matrix(c(deter_x,rep(0,n-1)))
v1=x1+deter_e1
#c_i=2/(transpose(v_i)%*%(v_i))
c1<-as.numeric(2/(t(v1)%*%v1))
#H_i = I - c_i*v_i%*%transpose(v_i)
I<-diag(n)
H3<-I-c1*(v1%*%t(v1))
R3<-H3%*%X
R3

#Check R3 and see if it is Upper Triangle Matrix, if no go for R4
n=dim(R3)[1]
X<-as.matrix(as.data.frame(cbind(R3[2:n,2])))
x1<-as.matrix(X[,1])
deter_x<-sqrt(sum(x1^2))
n=dim(x1)[1]
deter_e1<-as.matrix(c(deter_x,rep(0,n-1)))
v1=x1-deter_e1
#c_i=2/(transpose(v_i)%*%(v_i))
c1<-as.numeric(2/(t(v1)%*%v1))
#H_i = I - c_i*v_i%*%transpose(v_i)
I<-diag(n)
H4<-I-c1*(v1%*%t(v1))
R4<-H4%*%X
R4
#As we can see R4 has all values except first element as zero
#Let us replace Matrices iteratively in R1 from R2 to R4 and round it of
R1[2:10,2:4]<-R2
R1[3:10,3:4]<-R3
R1[4:10,4]<-R4
R<-round(R1,5)
R
#Find Complete H1
#Q=H1%*%H2%*%H3%*%H4
H1_COM<-H1
# 
H_temp<-diag(10)
n=dim(H_temp)[1]
dim(H_temp[2:n,2:n])
dim(H2)
H_temp[2:n,2:n]<-H2
H2_COM<-H_temp
H2_COM
H_temp<-diag(10)
n=dim(H_temp)[1]
dim(H_temp[3:n,3:n])
dim(H3)
H_temp[3:n,3:n]<-H3
H3_COM<-H_temp
H3_COM

H_temp<-diag(10)
n=dim(H_temp)[1]
dim(H_temp[4:n,4:n])
dim(H4)
H_temp[4:n,4:n]<-H4
H4_COM<-H_temp
Q=H1_COM%*%H2_COM%*%H3_COM%*%H4_COM
# The following code properly reconstructs A Matrix proving proper Q & R
A=round(Q%*%R)
# When you try to find coefficients using Q&R you will end up in error.
solve(R)%*%t(Q)%*%Y
#Error in solve.default(R) : 'a' (10 x 4) must be square
#So trying to get matrix R without all 0 rows R[1:4,1:4]
solve(R[1:4,1:4])%*%t(Q)%*%Y
#Error in solve(R[1:4, 1:4]) %*% t(Q) : non-conformable arguments
dim(solve(R[1:4,1:4]))
dim(solve(R[1:4,1:4]))
#4 4
dim(t(Q))
#[1] 10 10
dim(Y)
#10  1

Solution

  • I would like to point you to this thread which I have answered (rather comprehensively): Multiple regression analysis in R using QR decomposition. Whether your question will be seen as a duplicate will be judged by the community. Note, among the options there, the last one is using a QR factorization written in plain R code myself.

    Given that your toy code for QR factorization is correct (as you commented along your code), the major issue is with your last few lines.

    The solution is simple:

    solve(R) %*% (t(Q) %*% Y)[1:4,]
    

    1:4 selects the first 4 elements in the single-column matrix t(Q) %*% Y.

    If you check with my linked answer, you will see that instead of solve, I am using backsolve as this is a triangular system of equations.

    You will also spot that I use crossprod when I can instead of t and %*%. My recent answer When is 'crossprod' preferred to '%*%', and when isn't? has a comprehensive discussion on these two fashions.