This page is an expansion of an earlier page that works the same example in ordinary (scalar) algebraic notation. To see that page: Regression with 2 IVs - not matrix
Load the data from the Chevy mechanics example (regression with 2 IVs) into R.
Y <- c(1, 2, 1, 3, 2, 3, 3, 4, 4, 3, 5, 3, 3, 2, 4, 5, 5, 5, 4, 3) # Job Performance ratings
X1 <- c(40, 45, 38, 50, 48, 55, 53, 55, 58, 40, 55, 48, 45, 55, 60, 60, 60, 65, 50, 58) # Mechanical Apt
X2 <- c(25, 20, 30, 30, 28, 30, 34, 36, 32, 34, 38, 28, 30, 36, 34, 38, 42, 38, 34, 38) # Conscientious
Dat.all <- data.frame(cbind(Y, X1, X2)) # combine into a data frame
First we’ll run the regression using the linear model (lm) program in R so you can see the typical results from a regression program. Then we’ll see how the bits are computed behind the scenes.
res1 <- lm(Dat.all$Y ~ Dat.all$X1 + Dat.all$X2) # run linear multiple regression
summary(res1)
##
## Call:
## lm(formula = Dat.all$Y ~ Dat.all$X1 + Dat.all$X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.803 -0.465 0.178 0.524 1.022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.1036 1.2610 -3.25 0.0047 **
## Dat.all$X1 0.0864 0.0314 2.75 0.0137 *
## Dat.all$X2 0.0876 0.0455 1.93 0.0710 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.76 on 17 degrees of freedom
## Multiple R-squared: 0.671, Adjusted R-squared: 0.632
## F-statistic: 17.3 on 2 and 17 DF, p-value: 7.89e-05
Recall that Y consists of job performance ratings, while X1 and X2 are mechanical aptitude and conscientiousness test scores, respectively. We can write the regression equation as \(Y' = a + b_1X_1 + b_2X_2\), or
Predicted Job Performance = -4.10 + .09(Mechanical) + .09(Conscientious).
Mechanics with higher test scores tend to receive better ratings from their supervisors. The intercept is not meaningful in this case. The customary interpretation of the intercept would be the predicted job performance rating for mechanics who score zero on both Mechanical Aptitude and Conscientiousness. Here no one has a zero test score, and the predicted rating at zero (-4.10) is negative, which is impossible on the rating scale. Supervisors wanted to give negative performance scores, but were not allowed to (just kidding).
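To make the equation concrete, here is the prediction for the first mechanic (Mechanical Aptitude = 40, Conscientiousness = 25), computed by hand from the rounded coefficients (a quick illustration added here; it is not part of the original output):
-4.1036 + 0.0864*40 + 0.0876*25 # about 1.5, which matches the first predicted value computed later on this page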
The matrix representation of the problem begins with a design matrix. The design matrix is composed of a column of ones (for the intercept) plus a column for each independent variable, thus:
ones <- rep(1, nrow(Dat.all))
X <- cbind(ones, X1, X2)
X
## ones X1 X2
## [1,] 1 40 25
## [2,] 1 45 20
## [3,] 1 38 30
## [4,] 1 50 30
## [5,] 1 48 28
## [6,] 1 55 30
## [7,] 1 53 34
## [8,] 1 55 36
## [9,] 1 58 32
## [10,] 1 40 34
## [11,] 1 55 38
## [12,] 1 48 28
## [13,] 1 45 30
## [14,] 1 55 36
## [15,] 1 60 34
## [16,] 1 60 38
## [17,] 1 60 42
## [18,] 1 65 38
## [19,] 1 50 34
## [20,] 1 58 38
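(As an aside, base R can build the same design matrix from a formula with model.matrix(); the hand-built version is used here so that each piece is explicit. X.check is just an illustrative name.)
X.check <- model.matrix(~ X1 + X2, data = Dat.all) # same three columns; the intercept column is labeled "(Intercept)" instead of "ones"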
The transpose of a matrix is indicated by the prime symbol (e.g., X’), and the matrix inverse is indicated by an exponent equal to negative one. The matrix equation for finding the regression weights, B, is \[ B = (X'X)^{-1}X'Y \] Let’s take it step by step. The transpose of the design matrix is:
Xp <- t(X)
Xp
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## ones 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## X1 40 45 38 50 48 55 53 55 58 40 55 48 45 55
## X2 25 20 30 30 28 30 34 36 32 34 38 28 30 36
## [,15] [,16] [,17] [,18] [,19] [,20]
## ones 1 1 1 1 1 1
## X1 60 60 60 65 50 58
## X2 34 38 42 38 34 38
The matrix product of the design matrix premultiplied by its transpose is:
XpX <- Xp%*%X
XpX
## ones X1 X2
## ones 20 1038 655
## X1 1038 54964 34510
## X2 655 34510 21973
And the inverse of that matrix is:
XpXInv <- solve(XpX)
XpXInv
## ones X1 X2
## ones 2.761 -0.0336 -0.0296
## X1 -0.034 0.0017 -0.0017
## X2 -0.030 -0.0017 0.0036
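A quick sanity check (an extra step, not in the original walkthrough): a matrix multiplied by its inverse should return the identity matrix, up to rounding error.
round(XpXInv %*% XpX, 8) # should print the 3 x 3 identity matrix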
Now we want to multiply that inverted matrix by X’ and Y, thus:
B <- XpXInv%*%Xp%*%Y
B
## [,1]
## ones -4.104
## X1 0.086
## X2 0.088
Notice that the values of B shown immediately above are the same as the regression coefficients produced by the lm program at the top of the page. This is essentially how statistical packages solve linear regression problems. The beauty of the matrix formulation is that one equation handles any number of independent variables; with more than two, the ordinary scalar equations become unwieldy to write out.
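As a check on the whole chain, the same computation can be written compactly: crossprod(X) computes X'X, crossprod(X, Y) computes X'Y, and solve() with two arguments solves the normal equations in one step (a minimal sketch equivalent to the steps above; B.check is just an illustrative name).
B.check <- solve(crossprod(X), crossprod(X, Y)) # one-step version of B = (X'X)^-1 X'Y
cbind(B.check, coef(res1)) # both columns should match the B computed above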
It is pretty simple to compute the predicted values and residuals in matrix notation. \[\hat{Y} = XB\] And of course, the residuals are the difference between the actual and predicted values (Y - Yhat). It's good to place the observed value first, because then predictions that are too small produce positive errors and predictions that are too large produce negative errors.
Yhat <- X%*%B
Resids <- Y-Yhat
Items <- data.frame(Dat.all, Yhat, Resids)
Items
## Y X1 X2 Yhat Resids
## 1 1 40 25 1.5 -0.543
## 2 2 45 20 1.5 0.463
## 3 1 38 30 1.8 -0.808
## 4 3 50 30 2.8 0.155
## 5 2 48 28 2.5 -0.497
## 6 3 55 30 3.3 -0.277
## 7 3 53 34 3.5 -0.455
## 8 4 55 36 3.8 0.197
## 9 4 58 32 3.7 0.289
## 10 3 40 34 2.3 0.669
## 11 5 55 38 4.0 1.022
## 12 3 48 28 2.5 0.503
## 13 3 45 30 2.4 0.587
## 14 2 55 36 3.8 -1.803
## 15 4 60 34 4.1 -0.059
## 16 5 60 38 4.4 0.590
## 17 5 60 42 4.8 0.240
## 18 5 65 38 4.8 0.158
## 19 4 50 34 3.2 0.805
## 20 3 58 38 4.2 -1.237
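A useful property to verify (an extra check, not in the original walkthrough): least-squares residuals are orthogonal to every column of the design matrix, so X' times the residual vector is zero up to rounding error. In particular, the residuals sum to zero because of the column of ones.
round(t(X) %*% Resids, 8) # should print a column of (near) zeros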
The residuals are the errors of prediction. Their estimated variance is the sum of squared residuals divided by the appropriate degrees of freedom, N - (k+1). In this case, there are two predictors (k = 2) and 20 cases or observations (N = 20), so the degrees of freedom are 17 (note that at the top of the page the lm program computed F with 2 and 17 degrees of freedom). This estimated variance is known by different names depending on the author; one common name is mean squared error, or MSE. Its square root is the estimated standard deviation of the residuals, sometimes called the standard error of prediction. In our example, the computation is (Resids'Resids)/df_error.
dfe <- nrow(Dat.all)-ncol(X)
MSE <- (t(Resids)%*%Resids)/dfe
SEP <- sqrt(MSE)
MSE
## [,1]
## [1,] 0.58
SEP
## [,1]
## [1,] 0.76
Note how the standard error of prediction (SEP) equals the residual standard error noted by the lm program at the top of the page.
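(If you want to pull this value straight from the fitted model, sigma() extracts the residual standard error from an lm object.)
sigma(res1) # residual standard error reported by lm; should equal SEP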
To find the standard errors of the regression estimates, we need to compute the variance-covariance matrix of the sample estimates, denoted C:
\[C = MSE(X'X)^{-1}\]
MSEscalar <- drop(MSE) # MSE was computed as a 1 x 1 matrix; drop() turns it into an ordinary scalar
C <- MSEscalar*XpXInv # multiply each element of XpXInv by MSE
C
## ones X1 X2
## ones 1.590 -0.01932 -0.01706
## X1 -0.019 0.00099 -0.00098
## X2 -0.017 -0.00098 0.00207
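(For comparison, vcov() returns the same variance-covariance matrix of the coefficient estimates from the fitted lm object.)
vcov(res1) # should match C, apart from the row and column labels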
The variances of the estimates lie on the main diagonal of the C matrix; the covariances are contained in the off-diagonal elements. For example, the standard error for the intercept is the square root of C11, and for Mechanical Aptitude (variable X1), the standard error is the square root of element C22.
StdErrs <- sqrt(diag(C))
StdErrs
## ones X1 X2
## 1.261 0.031 0.045
Note that the standard errors shown here match those shown at the top of the page. The covariances can be used as part of a test for the significance of the difference between regression weights (e.g., between b1 and b2). However, such tests are usually not meaningful when the variables are measured on different scales. For example, if we predict people's weight from their height, measuring height in inches (or centimeters) will result in a much smaller value of b than if we measure height in feet (or meters). The regression weight tells us the expected change in Y if X changes by one unit, so if the unit changes, so does the regression weight. For such a test to make sense, the independent variables must be measured on a common scale.
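For completeness, here is a sketch of such a test built from the elements of C (hypothetical for these data, precisely because Mechanical Aptitude and Conscientiousness are not on a common scale; the names b.diff, se.diff, t.diff, and p.diff are just for this sketch).
b.diff <- B[2] - B[3] # difference between the two regression weights
se.diff <- sqrt(C[2, 2] + C[3, 3] - 2*C[2, 3]) # standard error of the difference uses the covariance of the estimates
t.diff <- b.diff/se.diff # t statistic for the difference
p.diff <- 2*(1 - pt(abs(t.diff), dfe)) # 2-tailed p-value with the error degrees of freedom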
The t-values come from dividing the estimates by their standard errors. The p-values come from comparing the obtained values of t to the t-distribution with the degrees of freedom for error (N-(k+1)). The multiplication by 2 (see below) is because of the 2-tailed test.
tvals <- B/StdErrs
pvals <- (1 - pt(abs(tvals), dfe))*2
main.res <- data.frame(B, StdErrs, tvals, pvals)
main.res
## B StdErrs tvals pvals
## ones -4.104 1.261 -3.3 0.0047
## X1 0.086 0.031 2.7 0.0137
## X2 0.088 0.045 1.9 0.0710
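(The same table can be pulled from the fitted model with summary(res1)$coefficients, which is a convenient check on the hand computations.)
summary(res1)$coefficients # Estimate, Std. Error, t value, and Pr(>|t|) from lm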
You don’t need matrix algebra to compute the rest of the lm results at the top of the page. This page has walked you through the computations for linear regression using matrix algebra and shown how they give the same results as the R function lm. The point is not to make you replicate software that already exists and works properly; rather, for some people, this approach leads to a better understanding of, and insight into, what the software is doing.
We can, of course, use matrix algebra to compute other statistics of interest. If we define y as the deviations of the dependent variable from its mean, \(y = Y - m(Y)\), and \(\hat{y}\) as the deviations of the predicted scores from their mean, \(\hat{y} = \hat{Y}-m(\hat{Y})\), then we can define the variance of Y as \[V(Y) = y'y/(N-1)\] and the variance of \(\hat{Y}\) as \[V(\hat{Y}) = \hat{y}'\hat{y}/(N-1)\]
There are many ways to compute R-squared, the proportion of variance accounted for by the regression. We may compute the ratio of the variance of the predicted scores to the variance of the observed scores, thus:
\[R_{1}^{2} = V(\hat{Y})/V(Y)\] or we may compute the correlation between predicted and observed scores and then square that.
\[R_{2}^{2} = r^{2}_{y\hat{y}}\] The result in either case is .67 for these data.
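(Both versions are computed step by step below. As a quick check, base R's cor() gives the second version directly.)
cor(Y, Yhat)^2 # squared correlation between observed and predicted Y; about .67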
My <- mean(Y)
Ydev <- Y-My
Vy <- (t(Ydev)%*%Ydev)/(nrow(Items)-1)
Vy
## [,1]
## [1,] 1.6
Mhat <- mean(Yhat)
Yhatd <- Yhat-Mhat
Vyhat <- (t(Yhatd)%*%Yhatd)/(nrow(Items)-1)
Vyhat
## [,1]
## [1,] 1.1
Rsq1 <- (Vyhat)/Vy
Rsq1
## [,1]
## [1,] 0.67
ryyhat <- (t(Ydev)%*%Yhatd)/((nrow(Items)-1)*sqrt(Vy)*sqrt(Vyhat)) # correlation between observed and predicted Y
ryyhat
## [,1]
## [1,] 0.82
Rsq2 <- ryyhat^2
Rsq2
## [,1]
## [1,] 0.67
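As noted above, the remaining numbers in the lm output do not require any new matrix algebra. The F statistic and adjusted R-squared follow from R-squared and the degrees of freedom (a brief sketch using quantities already defined on this page; the names k, N, Rsq, Fstat, AdjRsq, and pF are just for this sketch).
k <- ncol(X) - 1 # number of predictors
N <- nrow(Dat.all) # number of observations
Rsq <- drop(Rsq1) # convert the 1 x 1 matrix to a plain number
Fstat <- (Rsq/k)/((1 - Rsq)/dfe) # about 17.3 on 2 and 17 df, as in the lm output
AdjRsq <- 1 - (1 - Rsq)*(N - 1)/dfe # about .632, as in the lm output
pF <- 1 - pf(Fstat, k, dfe) # p-value for the overall F test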