# NMSA407 Linear Regression: Tutorial

Matrix algebra background of linear regression

## Dataset

Everything in this tutorial will be illustrated on pseudodata that we now create.

y <- c(-9, -11, 1, 19)
x <- c(-3, -1, 1, 3)
x2 <- c(9, 1, 1, 9)
x2 <- x^2                     ### the same
X <- cbind(1, x, x2)
colnames(X) <- c("intcpt", "x", "x^2")
print(X)

##      intcpt  x x^2
## [1,]      1 -3   9
## [2,]      1 -1   1
## [3,]      1  1   1
## [4,]      1  3   9

Data <- data.frame(y = y, x = x, x2 = x2)
print(Data)

##     y  x x2
## 1  -9 -3  9
## 2 -11 -1  1
## 3   1  1  1
## 4  19  3  9


## Linear model

We will be fitting the following linear model: $\mathsf{E}Y = \beta_0 + \beta_1\,x + \beta_2\,x^2.$

## Raw solution to normal equations

First, we calculate the solution to the normal equations directly from the theoretically derived expression, i.e., as $$\widehat{\beta} = \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}\mathbf{X}^\top\mathbf{Y}$$.

### Matrix $$\mathbf{X}^\top\mathbf{X}$$

XtX <- t(X) %*% X
XtX <- crossprod(X)    ## the same
print(XtX)

##        intcpt  x x^2
## intcpt      4  0  20
## x           0 20   0
## x^2        20  0 164


### Matrix $$\bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}$$

invXtX <- solve(XtX)
print(invXtX)

##           intcpt    x       x^2
## intcpt  0.640625 0.00 -0.078125
## x       0.000000 0.05  0.000000
## x^2    -0.078125 0.00  0.015625


### Solution to normal equations using already calculated inverse

b <- invXtX %*% t(X) %*% y
print(b)

##         [,1]
## intcpt -6.25
## x       4.80
## x^2     1.25


### Solution to normal equations calculated directly as a solution to a linear system

solve(XtX, t(X) %*% y)

##         [,1]
## intcpt -6.25
## x       4.80
## x^2     1.25

solve(crossprod(X), crossprod(X, y))   ## the same

##         [,1]
## intcpt -6.25
## x       4.80
## x^2     1.25


## Linear algebra and solution to normal equations

Calculating the solution to the normal equations via the above “raw” approach is numerically suboptimal. In practice, numerically more stable and more efficient methods are used, based on decompositions of the model matrix $$\mathbf{X}$$ known from linear algebra. An additional advantage of these methods is that once the decomposition of the model matrix $$\mathbf{X}$$ is available, calculating everything else (fitted values, residuals, solution to the normal equations, …) is trivial and computationally fast. This is particularly useful when least squares solutions are needed for a series of “response vectors” $$\mathbf{Y}$$ while the model matrix $$\mathbf{X}$$ is kept fixed, as the sketch below illustrates.
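
A small sketch of the last point (qrX0 and Ymat are illustrative names introduced here; qr and qr.coef are base R functions discussed in detail below):

qrX0 <- qr(X)                            ## decompose X once
Ymat <- cbind(y1 = y, y2 = -2 * y + 3)   ## two response vectors sharing the model matrix X
qr.coef(qrX0, Ymat)                      ## least squares solutions for both columns at once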

The most often used decompositions are

• QR decomposition (used also by the lm function in R);
• singular value decomposition (SVD) which is also useful for calculation of the Moore-Penrose pseudoinverse matrix.

In the following, the above two decompositions will be explored in more detail. First, however, we explore two other decompositions of the symmetric positive semidefinite matrix $$\mathbf{X}^\top\mathbf{X}$$ that can also be used to solve the normal equations $$\mathbf{X}^\top\mathbf{X}\,\mathbf{b}= \mathbf{X}^\top\mathbf{Y}$$ and possibly some other tasks.

These will be

• spectral decomposition;
• Cholesky (square root) decomposition.

Both the spectral decomposition and the Cholesky decomposition should be known from the bachelor courses Linear Algebra and Fundamentals of Numerical Mathematics, the Cholesky decomposition perhaps in the more general form of the LU decomposition.

## Spectral decomposition of a symmetric positive semidefinite matrix $$\mathbf{X}^\top\mathbf{X}$$

$\mathbf{X}^\top\mathbf{X} = \mathbf{V}\,\Lambda\,\mathbf{V}^\top = \sum_{j=1}^p \lambda_j\,\mathbf{v}_j\,\mathbf{v}_j^\top, \qquad \lambda_j \geq 0,\;\;j=1,\ldots,p.$

• Columns of matrix $$\mathbf{V}$$: orthonormal eigenvectors $$\mathbf{v}_1,\ldots,\mathbf{v}_p$$ of a matrix $$\mathbf{X}^\top\mathbf{X}$$.
• Matrix $$\Lambda$$: diagonal matrix with eigenvalues $$\lambda_1, \ldots, \lambda_p$$ on a diagonal.

Additionally, $\mathbf{V}\,\mathbf{V}^\top = \sum_{j=1}^p \mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{I}_p, \qquad \mathbf{V}^\top\,\mathbf{V} = \bigl(\mathbf{v}_j^\top\,\mathbf{v}_l\bigr) = \mathbf{I}_p,$ that is, $\mathbf{V}^{-1} = \mathbf{V}^\top.$ If $$\lambda_j > 0$$ for all $$j=1,\ldots,p$$ then $\bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1} = \sum_{j=1}^p \lambda_j^{-1}\,\mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{V}\,\Lambda^{-1}\,\mathbf{V}^\top.$

### Spectral decomposition of $$\mathbf{X}^\top\mathbf{X}$$

eigen(XtX)

## $values
## [1] 166.462113  20.000000   1.537887
##
## $vectors
##            [,1] [,2]       [,3]
## [1,] -0.1221833    0  0.9925076
## [2,]  0.0000000   -1  0.0000000
## [3,] -0.9925076    0 -0.1221833


### Elements of an object returned by the eigen function

e <- eigen(XtX)
names(e)

## [1] "values"  "vectors"


#### Does it satisfy what it should satisfy?

V <- e$vectors
Lambda <- diag(e$values)

V %*% Lambda %*% t(V)

##      [,1] [,2] [,3]
## [1,]    4    0   20
## [2,]    0   20    0
## [3,]   20    0  164

round(V %*% t(V), 13)             ## only numerically identity matrix

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

round(t(V) %*% V, 13)             ## only numerically identity matrix

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

round(XtX %*% V - V %*% Lambda, 13)    ## X'X * V = V * Lambda

##        [,1] [,2] [,3]
## intcpt    0    0    0
## x         0    0    0
## x^2       0    0    0
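
As an aside (our addition), the spectral decomposition also yields the determinant as the product of the eigenvalues, and the inverse can be assembled from the rank-one expansion displayed above (invSum is an illustrative name):

prod(e$values)               ## det(X'X): determinant = product of eigenvalues

invSum <- matrix(0, ncol(X), ncol(X))
for (j in 1:ncol(X)) invSum <- invSum + (1 / e$values[j]) * tcrossprod(e$vectors[, j])
round(invSum - invXtX, 13)   ## zero matrix: the expansion reproduces (X'X)^(-1)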

invLambda <- diag(1 / e$values)
V %*% invLambda %*% t(V)

##           [,1] [,2]      [,3]
## [1,]  0.640625 0.00 -0.078125
## [2,]  0.000000 0.05  0.000000
## [3,] -0.078125 0.00  0.015625

round(V %*% invLambda %*% t(V) - invXtX, 13)    ## V * Lambda^(-1) * V' = (X'X)^(-1)

##        intcpt x x^2
## intcpt      0 0   0
## x           0 0   0
## x^2         0 0   0


## Cholesky (square root) decomposition of a symmetric positive semidefinite matrix $$\mathbf{X}^\top\mathbf{X}$$

$\mathbf{X}^\top\mathbf{X} = \mathbf{L}\,\mathbf{L}^\top,$ where $$\mathbf{L}$$ is a lower triangular matrix. This is a special case of the so-called LU decomposition in which, for a symmetric positive semidefinite matrix, $$\mathbf{U} = \mathbf{L}^\top$$ (or $$\mathbf{L} = \mathbf{U}^\top$$). Symmetric positive semidefinite matrices are encountered frequently in statistics since all covariance matrices are of this type. The Cholesky decomposition is in fact a generalization of the square root to matrices (as will immediately be shown); the Cholesky decomposition of a covariance matrix is thus a multivariate version of a standard deviation, and its availability simplifies many problems.

If $$\mathbf{D}$$ is a symmetric positive semidefinite matrix (e.g., a covariance matrix or a matrix $$\mathbf{X}^\top\mathbf{X}$$ in a regression context) then it can be decomposed as $$\mathbf{D} = \mathbf{L}\,\mathbf{L}^\top$$, where $$\mathbf{L}$$ is a lower triangular matrix. Among other things, the following tasks are practically trivial once $$\mathbf{L}$$ is available.

• Linear systems $$\mathbf{D}\,\mathbf{b} = \mathbf{c}$$ are easily solved by forward and backward substitution:
  • $$\mathbf{L}\,\mathbf{L}^\top\,\mathbf{b} = \mathbf{c}$$,
  • first solve $$\mathbf{L}\,\mathbf{v} = \mathbf{c}$$ (forward substitution),
  • then solve $$\mathbf{L}^\top\,\mathbf{b} = \mathbf{v}$$ (backward substitution).
  This is especially useful when the linear system $$\mathbf{D}\,\mathbf{b} = \mathbf{c}$$ is to be solved for (many) different right-hand-side vectors $$\mathbf{c}$$.
• $$\mathsf{det}(\mathbf{D})$$ $$= \mathsf{det}\bigl(\mathbf{L}\,\mathbf{L}^\top\bigr)$$ $$= \mathsf{det}(\mathbf{L})\,\mathsf{det}(\mathbf{L})$$ $$= (\mbox{product of diagonal elements of }\mathbf{L})^2$$.
• $$\mathbf{D}^{-1}$$ $$= \bigl(\mathbf{L}\,\mathbf{L}^\top\bigr)^{-1}$$ $$= \bigl(\mathbf{L}^{-1}\bigr)^\top\,\mathbf{L}^{-1}$$. That is, to calculate $$\mathbf{D}^{-1}$$, it is only needed to invert a triangular matrix (which is easy).
• If a random vector $$\mathbf{Z}$$ has a distribution with a unit covariance matrix then $$\mathbf{L}\,\mathbf{Z}$$ has a distribution with covariance matrix $$\mathbf{D}$$. This is useful for random number generation (see the short sketch below).
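
A minimal sketch of the last point (Sigma, Lmat, Z, W are illustrative names, not part of the tutorial data):

Sigma <- matrix(c(2, 1, 1, 2), nrow = 2)   ## a small example covariance matrix
Lmat <- t(chol(Sigma))                     ## chol() returns U = L', hence the transpose
Z <- matrix(rnorm(2 * 10000), nrow = 2)    ## columns: iid vectors with unit covariance
W <- Lmat %*% Z                            ## columns of W have covariance Sigma
cov(t(W))                                  ## empirical covariance, approximately Sigma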
### Cholesky decomposition of a matrix $$\mathbf{X}^\top\mathbf{X}$$

R returns the matrix $$\mathbf{U} = \mathbf{L}^\top$$.

chol(XtX)

##        intcpt        x x^2
## intcpt      2 0.000000  10
## x           0 4.472136   0
## x^2         0 0.000000   8


#### Does it satisfy what it should satisfy?

U <- chol(XtX)
t(U) %*% U

##        intcpt  x x^2
## intcpt      4  0  20
## x           0 20   0
## x^2        20  0 164


### Calculate $$\bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}$$ from the Cholesky decomposition

chol2inv(U)

##           [,1] [,2]      [,3]
## [1,]  0.640625 0.00 -0.078125
## [2,]  0.000000 0.05  0.000000
## [3,] -0.078125 0.00  0.015625

invXtX2 <- chol2inv(U)
round(invXtX2 - invXtX, 13)    ## compare with previously calculated (X'X)^(-1), numerically the same

##        intcpt x x^2
## intcpt      0 0   0
## x           0 0   0
## x^2         0 0   0


### Determinant of $$\mathbf{X}^\top\mathbf{X}$$

prod(diag(U))^2

## [1] 5120

det(XtX)       ## R function to calculate the determinant

## [1] 5120


### Solution to normal equations via backward and forward substitution using the Cholesky decomposition of a matrix $$\mathbf{X}^\top\mathbf{X}$$

We need to solve $$\mathbf{X}^\top\mathbf{X}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y}$$. Since $$\mathbf{X}^\top\mathbf{X} = \mathbf{U}^\top\mathbf{U}$$, we have to solve $$\mathbf{U}^\top\mathbf{U}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y}$$.

(v <- forwardsolve(t(U), crossprod(X, y)))   ## solution to U' * v = X'Y

##          [,1]
## [1,]  0.00000
## [2,] 21.46625
## [3,] 10.00000

(bchol <- backsolve(U, v))                   ## solution to U * b = v

##       [,1]
## [1,] -6.25
## [2,]  4.80
## [3,]  1.25

print(b)      ## compare to previously calculated solution

##         [,1]
## intcpt -6.25
## x       4.80
## x^2     1.25


## Singular value decomposition (SVD) of the model matrix $$\mathbf{X}$$

Suppose that $$n = \mathrm{nrow}(\mathbf{X})$$, $$k = \mathrm{ncol}(\mathbf{X})$$, $$r = \mathrm{rank}(\mathbf{X}) \leq k \leq n$$. The singular value decomposition (SVD) of the model matrix $$\mathbf{X}$$ takes the following form: $\mathbf{X} = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^\top,$ where

• Columns of a matrix $$\mathbf{U}_{n\times k}$$: $$\mathbf{u}_1,\ldots,\mathbf{u}_k$$ $$=$$ the first $$k$$ orthonormal eigenvectors of a matrix $$\mathbf{X}\,\mathbf{X}^\top$$;
• Columns of a matrix $$\mathbf{V}_{k\times k}$$: $$\mathbf{v}_1,\ldots,\mathbf{v}_k$$ $$=$$ (all) orthonormal eigenvectors of a matrix $$\mathbf{X}^\top\,\mathbf{X}$$;
• Matrix $$\mathbf{D}_{k\times k}$$: diagonal matrix with $$d_1, \ldots, d_k$$ on a diagonal;
  • $$d_j^2$$, $$j=1,\ldots,k$$: the first $$k$$ eigenvalues of a matrix $$\mathbf{X}\,\mathbf{X}^\top$$, which are the same as (all) eigenvalues of a matrix $$\mathbf{X}^\top\,\mathbf{X}$$.

Numbers $$d_1,\ldots,d_k$$ are called the singular values of the matrix $$\mathbf{X}$$. It holds:

• $$r = k$$: $$d_1 \geq d_2 \geq \cdots \geq d_k > 0$$,
• $$r < k$$: $$d_1 \geq d_2 \geq \cdots \geq d_r > d_{r+1} = \cdots = d_k = 0$$.

Further,

• $$\mathbf{U}^\top\,\mathbf{U} = \bigl(\mathbf{u}_i^\top\mathbf{u}_j\bigr) = \mathbf{I}_{k\times k}$$;
• $$\mathbf{V}^\top\,\mathbf{V} = \bigl(\mathbf{v}_i^\top\mathbf{v}_j\bigr) = \mathbf{I}_{k\times k}$$;
• $$\mathbf{V}\,\mathbf{V}^\top = \sum_{j=1}^k\mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{I}_{k\times k}$$;
• $$\mathbf{X} = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^\top = \sum_{j=1}^k d_j\,\mathbf{u}_j\,\mathbf{v}_j^\top$$;
• $$\mathbf{X}^\top\,\mathbf{X} = \mathbf{V}\,\mathbf{D}\,\mathbf{D}\,\mathbf{V}^\top = \sum_{j=1}^k d_j^2\,\mathbf{v}_j\,\mathbf{v}_j^\top$$.

If $$d_j > 0$$ for all $$j=1,\ldots,k$$, the matrix $$\mathbf{X}^\top\,\mathbf{X}$$ is invertible and $\bigl(\mathbf{X}^\top\,\mathbf{X}\bigr)^{-1} = \mathbf{V}\,\mathbf{D}^{-2}\,\mathbf{V}^\top = \sum_{j=1}^k \frac{1}{d_j^2}\,\mathbf{v}_j\,\mathbf{v}_j^\top,$ where $$\mathbf{D}^{-2}$$ is a diagonal matrix with $$d_1^{-2},\ldots,d_k^{-2}$$ on a diagonal.
Since the matrix $$\mathbf{U}$$ contains an orthonormal basis of $$M(\mathbf{X})$$ in its columns, the projection matrix onto $$M(\mathbf{X})$$ is $\mathbf{H} = \mathbf{U}\,\mathbf{U}^\top.$

The SVD can also be used to calculate the Moore-Penrose pseudoinverse of $$\mathbf{X}$$: $\mathbf{X}^- = \tilde{\mathbf{V}}\,{\tilde{\mathbf{D}}}^{-1}\,{\tilde{\mathbf{U}}}^\top,$ where the matrices $$\tilde{\mathbf{V}}_{k\times r},\,\tilde{\mathbf{D}}_{r\times r},\,\tilde{\mathbf{U}}_{n\times r}$$ correspond to the positive singular values only.

### Singular value decomposition of a matrix $$\mathbf{X}$$

svd(X)

## $d
## [1] 12.902020  4.472136  1.240116
##
## $u
##            [,1]       [,2]        [,3]
## [1,] 0.70180882  0.6708204 -0.08639661
## [2,] 0.08639661  0.2236068  0.70180882
## [3,] 0.08639661 -0.2236068  0.70180882
## [4,] 0.70180882 -0.6708204 -0.08639661
##
## $v
##           [,1] [,2]       [,3]
## [1,] 0.1221833    0  0.9925076
## [2,] 0.0000000   -1  0.0000000
## [3,] 0.9925076    0 -0.1221833


### Content of an object returned by the svd function

svdX <- svd(X)
names(svdX)

## [1] "d" "u" "v"

(U <- svdX$u)

##            [,1]       [,2]        [,3]
## [1,] 0.70180882  0.6708204 -0.08639661
## [2,] 0.08639661  0.2236068  0.70180882
## [3,] 0.08639661 -0.2236068  0.70180882
## [4,] 0.70180882 -0.6708204 -0.08639661

(D <- diag(svdX$d))

##          [,1]     [,2]     [,3]
## [1,] 12.90202 0.000000 0.000000
## [2,]  0.00000 4.472136 0.000000
## [3,]  0.00000 0.000000 1.240116

(V <- svdX$v)

##           [,1] [,2]       [,3]
## [1,] 0.1221833    0  0.9925076
## [2,] 0.0000000   -1  0.0000000
## [3,] 0.9925076    0 -0.1221833


### Matrices $$\mathbf{D}^2,\,\mathbf{D}^{-1},\,\mathbf{D}^{-2}$$

(D2 <- diag(svdX$d^2))        ## = D * D

##          [,1] [,2]     [,3]
## [1,] 166.4621    0 0.000000
## [2,]   0.0000   20 0.000000
## [3,]   0.0000    0 1.537887

(iD  <- diag(1 / svdX$d))     ## = D^(-1)

##            [,1]      [,2]      [,3]
## [1,] 0.07750724 0.0000000 0.0000000
## [2,] 0.00000000 0.2236068 0.0000000
## [3,] 0.00000000 0.0000000 0.8063762

(iD2 <- diag(1 / svdX$d^2))   ## = (D * D)^(-1)

##             [,1] [,2]      [,3]
## [1,] 0.006007373 0.00 0.0000000
## [2,] 0.000000000 0.05 0.0000000
## [3,] 0.000000000 0.00 0.6502426


#### Do the above matrices satisfy what they should satisfy?

U %*% D %*% t(V)              ### --> X

##      [,1] [,2] [,3]
## [1,]    1   -3    9
## [2,]    1   -1    1
## [3,]    1    1    1
## [4,]    1    3    9

V %*% D %*% t(D) %*% t(V)     ### --> X'X

##      [,1] [,2] [,3]
## [1,]    4    0   20
## [2,]    0   20    0
## [3,]   20    0  164

round(t(U) %*% U, 13)         ### --> I (numerically)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

round(t(V) %*% V, 13)         ### --> I (numerically)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

round(V %*% t(V), 13)         ### --> I (numerically)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1


Xmin <- V %*% diag(1 / svdX$d) %*% t(U)
print(Xmin)

##         [,1]    [,2]    [,3]    [,4]
## [1,] -0.0625  0.5625  0.5625 -0.0625
## [2,] -0.1500 -0.0500  0.0500  0.1500
## [3,]  0.0625 -0.0625 -0.0625  0.0625


### Properties of the Moore-Penrose pseudoinverse

round(X %*% Xmin %*% X - X, 13)         ## X * X^- * X = X

##      intcpt x x^2
## [1,]      0 0   0
## [2,]      0 0   0
## [3,]      0 0   0
## [4,]      0 0   0

round(Xmin %*% X %*% Xmin - Xmin, 13)   ## X^- * X * X^- = X^-

##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0

round(t(X %*% Xmin) - X %*% Xmin, 13)   ## (X * X^-)' = X * X^-

##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0
## [4,]    0    0    0    0

round(t(Xmin %*% X) - Xmin %*% X, 13)   ## (X^- * X)' = X^- * X

##        [,1] [,2] [,3]
## intcpt    0    0    0
## x         0    0    0
## x^2       0    0    0
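
Since $$\mathbf{X}$$ has full column rank here, the pseudoinverse also yields the least squares solution directly (b_svd is an illustrative name):

(b_svd <- Xmin %*% y)   ## X^- Y; equals the solution b (-6.25, 4.80, 1.25) computed above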
### $$\mathbf{H}$$ matrix to project into $$M(\mathbf{X})$$ (regression space)

(H <- U %*% t(U))

##       [,1]  [,2]  [,3]  [,4]
## [1,]  0.95  0.15 -0.15  0.05
## [2,]  0.15  0.55  0.45 -0.15
## [3,] -0.15  0.45  0.55  0.15
## [4,]  0.05 -0.15  0.15  0.95


### $$\mathbf{M}$$ matrix to project into the orthogonal complement of $$M(\mathbf{X})$$ (residual space)

(M <- diag(nrow(X)) - H)

##       [,1]  [,2]  [,3]  [,4]
## [1,]  0.05 -0.15  0.15 -0.05
## [2,] -0.15  0.45 -0.45  0.15
## [3,]  0.15 -0.45  0.45 -0.15
## [4,] -0.05  0.15 -0.15  0.05


### Fitted values and residuals

(yhat <- H %*% y)

##      [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4

(u <- M %*% y)

##      [,1]
## [1,]  0.4
## [2,] -1.2
## [3,]  1.2
## [4,] -0.4


### Fitted values and residuals using the solution to normal equations (to compare)

(yhat_from_b <- X %*% b)

##      [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4

(u_from_b <- y - yhat_from_b)

##      [,1]
## [1,]  0.4
## [2,] -1.2
## [3,]  1.2
## [4,] -0.4


### Matrix $$\mathbf{X}^\top\mathbf{X}$$

V %*% D2 %*% t(V)       ## X'X

##      [,1] [,2] [,3]
## [1,]    4    0   20
## [2,]    0   20    0
## [3,]   20    0  164

crossprod(D %*% t(V))   ## X'X calculated in an even better way

##      [,1] [,2] [,3]
## [1,]    4    0   20
## [2,]    0   20    0
## [3,]   20    0  164

print(XtX)              ## previously calculated X'X

##        intcpt  x x^2
## intcpt      4  0  20
## x           0 20   0
## x^2        20  0 164


### Matrix $$\bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}$$

V %*% iD2 %*% t(V)       ## (X'X)^(-1)

##           [,1] [,2]      [,3]
## [1,]  0.640625 0.00 -0.078125
## [2,]  0.000000 0.05  0.000000
## [3,] -0.078125 0.00  0.015625

crossprod(iD %*% t(V))   ## (X'X)^(-1) calculated in an even better way

##           [,1] [,2]      [,3]
## [1,]  0.640625 0.00 -0.078125
## [2,]  0.000000 0.05  0.000000
## [3,] -0.078125 0.00  0.015625

print(invXtX)            ## previously calculated (X'X)^(-1)

##           intcpt    x       x^2
## intcpt  0.640625 0.00 -0.078125
## x       0.000000 0.05  0.000000
## x^2    -0.078125 0.00  0.015625


## QR decomposition of the model matrix $$\mathbf{X}$$

Still suppose that $$n = \mathrm{nrow}(\mathbf{X})$$, $$k = \mathrm{ncol}(\mathbf{X})$$, $$r = \mathrm{rank}(\mathbf{X}) \leq k \leq n$$. The QR decomposition of the model matrix $$\mathbf{X}$$ takes the following form: $\mathbf{X} = \mathbf{Q}\,\mathbf{R},$ where

• Columns of a matrix $$\mathbf{Q}_{n\times k}$$: $$\mathbf{q}_1,\ldots,\mathbf{q}_k$$;
  • the first $$r$$ of these columns: orthonormal basis of the linear span of the columns of $$\mathbf{X}$$ (regression space $$M(\mathbf{X})$$);
• Matrix $$\mathbf{R}_{k\times k}$$: upper triangular matrix.

Further, $\mathbf{X}^\top\mathbf{X} = \mathbf{R}^\top\,\mathbf{Q}^\top\mathbf{Q}\,\mathbf{R} = \mathbf{R}^\top\mathbf{R},$ since the columns of the matrix $$\mathbf{Q}$$ form an orthonormal basis and hence $$\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_k$$.

The system of normal equations is then easily solvable by one backward substitution, since $\mathbf{X}^\top\mathbf{X}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y}$ becomes $\mathbf{R}^\top\mathbf{R}\,\mathbf{b} = \mathbf{R}^\top\mathbf{Q}^\top\,\mathbf{Y},$ i.e., $\mathbf{R}\,\mathbf{b} = \mathbf{Q}^\top\,\mathbf{Y}.$ That is, it is only required to solve one linear system with an upper triangular matrix.

If $$r = k$$ then $$\mathbf{X}^\top\mathbf{X}$$ is invertible and $\bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1} = \bigl(\mathbf{R}^\top\mathbf{R}\bigr)^{-1} = \mathbf{R}^{-1}\bigl(\mathbf{R}^{-1}\bigr)^\top.$ That is, it is only needed to invert a triangular matrix (which is easy).

### QR decomposition of a matrix $$\mathbf{X}$$

qr(X)

## $qr
##      intcpt          x         x^2
## [1,]   -2.0  0.0000000 -10.0000000
## [2,]    0.5 -4.4721360   0.0000000
## [3,]    0.5  0.4472136   8.0000000
## [4,]    0.5  0.8944272  -0.9296181
##
## $rank
## [1] 3
##
## $qraux
## [1] 1.500000 1.000000 1.368524
##
## $pivot
## [1] 1 2 3
##
## attr(,"class")
## [1] "qr"

Note that the results are stored and shown in a compact form.

### Content of an object returned by the qr function

qrX <- qr(X)
names(qrX)

## [1] "qr"    "rank"  "qraux" "pivot"


### Rank of the model matrix

qrX$rank

## [1] 3

r <- qrX\$rank
n <- nrow(X)


### $$\mathbf{Q}$$ matrix

In our case, $$r = k = 3$$, so all columns of $$\mathbf{Q}$$ form an orthonormal basis of $$M(\mathbf{X})$$.

qr.Q(qrX)

##      [,1]       [,2] [,3]
## [1,] -0.5  0.6708204  0.5
## [2,] -0.5  0.2236068 -0.5
## [3,] -0.5 -0.2236068 -0.5
## [4,] -0.5 -0.6708204  0.5

Q <- qr.Q(qrX)


### $$\mathbf{R}$$ matrix

qr.R(qrX)

##      intcpt         x x^2
## [1,]     -2  0.000000 -10
## [2,]      0 -4.472136   0
## [3,]      0  0.000000   8

R <- qr.R(qrX)


### Is it really $$\mathbf{X} = \mathbf{Q}\,\mathbf{R}$$?

Q %*% R - X                ## not really zeros

##      intcpt x          x^2
## [1,]      0 0 0.000000e+00
## [2,]      0 0 0.000000e+00
## [3,]      0 0 8.881784e-16
## [4,]      0 0 0.000000e+00

round(Q %*% R - X, 14)     ## numerically yes

##      intcpt x x^2
## [1,]      0 0   0
## [2,]      0 0   0
## [3,]      0 0   0
## [4,]      0 0   0


### $$\mathbf{Q}$$ matrix complemented by $$n - r$$ columns to form an orthonormal basis of $$R^n$$ (in our case $$R^4$$)

The orthonormal basis of $$R^n$$ will be stored in a matrix $$\mathbf{P}$$.

(P <- qr.Q(qrX, complete = TRUE))

##      [,1]       [,2] [,3]       [,4]
## [1,] -0.5  0.6708204  0.5  0.2236068
## [2,] -0.5  0.2236068 -0.5 -0.6708204
## [3,] -0.5 -0.2236068 -0.5  0.6708204
## [4,] -0.5 -0.6708204  0.5 -0.2236068


### Orthonormal basis of the residual space (orthogonal complement to $$M(\mathbf{X})$$)

The last $$n-r$$ columns of the matrix $$\mathbf{P}$$ form an orthonormal basis of the residual space. We store this basis in a matrix $$\mathbf{N}$$.

(N <- P[, (r + 1):n])        ## with n - r = 1 column, R drops the result to a vector

## [1]  0.2236068 -0.6708204  0.6708204 -0.2236068


### Projection matrix $$\mathbf{H}$$ into the regression space $$M(\mathbf{X})$$

(H <- Q %*% t(Q))

##       [,1]  [,2]  [,3]  [,4]
## [1,]  0.95  0.15 -0.15  0.05
## [2,]  0.15  0.55  0.45 -0.15
## [3,] -0.15  0.45  0.55  0.15
## [4,]  0.05 -0.15  0.15  0.95


### Projection matrix $$\mathbf{M}$$ into the residual space (orthogonal complement to $$M(\mathbf{X})$$)

(M <- N %*% t(N))

##       [,1]  [,2]  [,3]  [,4]
## [1,]  0.05 -0.15  0.15 -0.05
## [2,] -0.15  0.45 -0.45  0.15
## [3,]  0.15 -0.45  0.45 -0.15
## [4,] -0.05  0.15 -0.15  0.05
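
Both $$\mathbf{H}$$ and $$\mathbf{M}$$ are symmetric and idempotent, and they project onto mutually orthogonal spaces; a quick numerical check (our addition):

round(H %*% H - H, 13)    ## H is idempotent: H H = H
round(M %*% M - M, 13)    ## M is idempotent: M M = M
round(H %*% M, 13)        ## H M = 0: the two spaces are orthogonal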


### Fitted values and residuals

(yhat <- H %*% y)

##      [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4

(u <- M %*% y)

##      [,1]
## [1,]  0.4
## [2,] -1.2
## [3,]  1.2
## [4,] -0.4


### Fitted values and residuals calculated using the solution to normal equations

(yhat_from_b <- X %*% b)

##      [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4

(u_from_b <- y - yhat_from_b)

##      [,1]
## [1,]  0.4
## [2,] -1.2
## [3,]  1.2
## [4,] -0.4
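
As an aside, base R also provides helper functions that compute these quantities directly from the qr object; a brief sketch:

qr.coef(qrX, y)      ## solution to the normal equations (= b)
qr.fitted(qrX, y)    ## fitted values H Y
qr.resid(qrX, y)     ## residuals M Y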


### Matrix $$\mathbf{X}^\top\mathbf{X}$$ easily calculated from the $$\mathbf{R}$$ matrix

crossprod(R)      ## = R'R = X'X

##        intcpt  x x^2
## intcpt      4  0  20
## x           0 20   0
## x^2        20  0 164

print(XtX)        ## previously calculated X'X

##        intcpt  x x^2
## intcpt      4  0  20
## x           0 20   0
## x^2        20  0 164


### Matrix $$\bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}$$ calculated from its Cholesky decomposition $$\mathbf{R}^\top\mathbf{R}$$

chol2inv(R)       ## inverse of X'X calculated from its Cholesky factor R (R exploits the fact that the matrix R is upper triangular)

##           [,1] [,2]      [,3]
## [1,]  0.640625 0.00 -0.078125
## [2,]  0.000000 0.05  0.000000
## [3,] -0.078125 0.00  0.015625

print(invXtX)     ## compare to previously calculated (X'X)^(-1)

##           intcpt    x       x^2
## intcpt  0.640625 0.00 -0.078125
## x       0.000000 0.05  0.000000
## x^2    -0.078125 0.00  0.015625


### Solution to normal equations by solving $$\mathbf{R}\,\mathbf{b} = \mathbf{Q}^\top\mathbf{Y}$$

(b_from_QR <- backsolve(R, crossprod(Q, y)))

##       [,1]
## [1,] -6.25
## [2,]  4.80
## [3,]  1.25

print(b)          ## previously calculated b for comparison

##         [,1]
## intcpt -6.25
## x       4.80
## x^2     1.25


### Residual sum of squares

The residual sum of squares is also easily calculated using the $$\mathbf{N}$$ matrix (orthonormal basis of the residual space), since $$\mathbf{N}^\top\mathbf{N} = \mathbf{I}_{n-r}$$ and hence $\mathrm{SS}_e = \bigl(\mathbf{M}\mathbf{Y}\bigr)^\top\,\mathbf{M}\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{M}^\top\mathbf{M}\,\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{N}\mathbf{N}^\top\mathbf{N}\mathbf{N}^\top\,\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{N}\mathbf{N}^\top\,\mathbf{Y} = \bigl\|\mathbf{N}^\top\,\mathbf{Y}\bigr\|^2.$

(NtY <- crossprod(N, y))

##          [,1]
## [1,] 1.788854

(SSe <- as.numeric(crossprod(NtY)))

## [1] 3.2
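
As a quick cross-check (our addition), the same value is obtained directly from the residuals and the projection matrix computed earlier:

sum(u^2)                          ## sum of squared residuals, again 3.2
as.numeric(crossprod(M %*% y))    ## Y' M Y via the projection matrix M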


## Conclusions

It should now be clear that when dealing with least squares, it is not necessary to calculate the matrix $$\mathbf{X}^\top\mathbf{X}$$, let alone its inverse, directly. Everything can be obtained from a suitable decomposition of the model matrix $$\mathbf{X}$$, and once this decomposition is available, all the rest is easy (also with respect to computational demands).
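
For completeness, the same numbers can be recovered with the lm function, which, as noted above, internally uses the QR decomposition (fit is an illustrative name; compare with b and SSe computed above):

fit <- lm(y ~ x + x2, data = Data)
coef(fit)                ## matches b: -6.25, 4.80, 1.25
sum(residuals(fit)^2)    ## matches SSe = 3.2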