Chevy Mechanics Example

In the PowerPoint slides, I introduce the idea of a test validation study for predicting the job performance of mechanics who work for Chevrolet dealers. Suppose we have a study in which supervisors at several dealerships' service and repair shops provided ratings of their mechanics' job performance. Further suppose that we have test scores on conscientiousness and mechanical aptitude for each of the mechanics. In real life, we would need many more people (certainly more than 100), but we use a small sample here to keep the data manageable.

library(psych)
job.perf <- c(1, 2, 1, 3, 2, 3, 3, 4, 4, 3, 5, 3, 3, 2, 4, 5, 5, 5, 4, 3)
mech.apt <- c(40, 45, 38, 50, 48, 55, 53, 55, 58, 40, 55, 48, 45, 55, 60, 60, 60, 65, 50, 58)
consc <-    c(25, 20, 30, 30, 28, 30, 34, 36, 32, 34, 38, 28, 30, 36, 34, 38, 42, 38, 34, 38)
chevy <- data.frame(job.perf, mech.apt, consc)
describe(chevy)
##          vars  n  mean   sd median trimmed  mad min max range  skew
## job.perf    1 20  3.25 1.25      3    3.31 1.48   1   5     4 -0.14
## mech.apt    2 20 51.90 7.58     54   52.19 8.90  38  65    27 -0.28
## consc       3 20 32.75 5.24     34   33.12 5.93  20  42    22 -0.48
##          kurtosis   se
## job.perf    -1.01 0.28
## mech.apt    -1.06 1.70
## consc       -0.22 1.17

We have 20 people in our sample. In the slideshow, I introduce the sums of squares and crossproducts (SSCP) and covariance matrices as useful for finding the regression slopes (b weights). The covariance matrix is an unstandardized correlation matrix (the average crossproduct of deviation scores rather than of z scores). The SSCP matrix contains the sums rather than the averages of the crossproducts. There are R functions for correlation matrices and covariance matrices. To get the SSCP matrix, I take the covariance matrix and multiply it by the degrees of freedom.

cor(chevy, use='complete.obs')   # correlation matrix
##           job.perf  mech.apt     consc
## job.perf 1.0000000 0.7740325 0.7243901
## mech.apt 0.7740325 1.0000000 0.6830082
## consc    0.7243901 0.6830082 1.0000000
cov(chevy, use = 'complete.obs') # covariance matrix                     
##          job.perf  mech.apt    consc
## job.perf 1.565789  7.342105  4.75000
## mech.apt 7.342105 57.463158 27.13158
## consc    4.750000 27.131579 27.46053
sscp <- 19 * cov(chevy)          # sums of squares and crossproducts matrix (df = n - 1 = 19)
sscp
##          job.perf mech.apt  consc
## job.perf    29.75    139.5  90.25
## mech.apt   139.50   1091.8 515.50
## consc       90.25    515.5 521.75

The equations for each of the matrices are shown in the PowerPoint slides. The computation of the regression weights uses all of the elements of the SSCP matrix. I show all of this so that you can see how these quantities are computed, at least for a simple example.
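To make the slide computations concrete, here is a small sketch in R (using the sscp object created above) that solves the normal equations for the slopes and then recovers the intercept from the means. The lm() call below reproduces the same values.

Sxx <- sscp[c('mech.apt', 'consc'), c('mech.apt', 'consc')]  # SSCP among the predictors
Sxy <- sscp[c('mech.apt', 'consc'), 'job.perf']              # crossproducts with the criterion
b   <- solve(Sxx, Sxy)                                       # slopes (b weights)
b0  <- mean(job.perf) - sum(b * c(mean(mech.apt), mean(consc)))  # intercept from the means
b
b0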

However, we don’t actually have to do much work, because we can easily ask the computer to do all the computations for us. Note that the syntax is similar to that for simple regression. For multiple regression we can add predictors using a plus sign.

res1 <- lm(job.perf ~ mech.apt + consc)   # run linear multiple regression
summary(res1)                             # summarize the results
## 
## Call:
## lm(formula = job.perf ~ mech.apt + consc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8026 -0.4651  0.1778  0.5241  1.0222 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -4.10358    1.26103  -3.254  0.00467 **
## mech.apt     0.08641    0.03144   2.748  0.01372 * 
## consc        0.08760    0.04548   1.926  0.07100 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7589 on 17 degrees of freedom
## Multiple R-squared:  0.6709, Adjusted R-squared:  0.6322 
## F-statistic: 17.33 on 2 and 17 DF,  p-value: 7.888e-05

Note that the overall model R-square (.67) is large and statistically significant. The coefficient for mechanical aptitude is significant at the .05 level, but the coefficient for conscientiousness is not. The estimates in the R output agree with those in the PowerPoint slides. The regression equation is Y' = -4.10 + .086(mech.apt) + .088(consc). Note also that the slope for each independent variable tells the expected change in Y (the job performance rating) if that X variable changes by 1 unit, holding the other X variable constant.
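For example, we can plug test scores into the equation and compare the result to predict(); the scores of 50 and 30 below are just illustrative values, not data from the study.

-4.10 + 0.086 * 50 + 0.088 * 30                                 # hand prediction with rounded weights, about 2.84
predict(res1, newdata = data.frame(mech.apt = 50, consc = 30))  # the same prediction from the fitted model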

We should always plot our data. When there are not too many variables, a convenient way to do so is the pairs command.

pairs(chevy)

It is clear from both the correlation matrix and the pairwise plots that not only are the independent variables correlated with the dependent variable, but they are also correlated with each other. There is an additional reason to inspect such plots: they can be helpful in spotting problems with the data, such as outliers and nonlinear relations. These will be described more fully in the next module.

With a single predictor, the scatterplot showing the correlation of the predicted values and the actual values is equivalent to showing the scatterplot of that variable with the dependent variable. Such is not the case with multiple independent variables. Instead, we must plot predicted and actual values where the predicted values come from the set of predictors taken together.

Ypred <- predict(res1)    # predicted job performance from the model
Yresid <- resid(res1)     # residuals (actual minus predicted)
#
plot(Ypred, job.perf, xlab = 'Predicted Performance', ylab='Actual Performance')
abline(lm(job.perf~Ypred)) 

You can run some code that will allow you to create a 3-dimensional graph and rotate it using your computer's mouse. I haven't yet mastered animation well enough to show you that here, but I have included the code to do so. If you download the code and execute it, you will be able to manipulate the output. It is much easier to visualize when it moves.

# library(plotly)
# plot_ly(x = mech.apt, y = consc, z = job.perf, type = 'scatter3d') # movable 3D plot of the original data
#
# plot_ly(x = mech.apt, y = consc, z = Ypred, type = 'scatter3d')    # the predicted values fall on a plane
#

Standardized Regression Weights

Standardized regression weights ('beta weights') come from applying ordinary regression to the variables after each variable has been standardized, that is, had its mean subtracted and then been divided by its standard deviation, so the result has mean zero and standard deviation 1. Then all the regression weights are in units of standard deviations. A simple function can standardize all the variables.

chevy2 <- data.frame(lapply(chevy, scale))
describe(chevy2)
##          vars  n mean sd median trimmed  mad   min  max range  skew
## job.perf    1 20    0  1  -0.20    0.05 1.18 -1.80 1.40  3.20 -0.14
## mech.apt    2 20    0  1   0.28    0.04 1.17 -1.83 1.73  3.56 -0.28
## consc       3 20    0  1   0.24    0.07 1.13 -2.43 1.77  4.20 -0.48
##          kurtosis   se
## job.perf    -1.01 0.22
## mech.apt    -1.06 0.22
## consc       -0.22 0.22
cor(chevy2, use='complete.obs')
##           job.perf  mech.apt     consc
## job.perf 1.0000000 0.7740325 0.7243901
## mech.apt 0.7740325 1.0000000 0.6830082
## consc    0.7243901 0.6830082 1.0000000

Notice that the means for the new variables are all zero and the standard deviations are all 1. Further, notice that the correlations of the new variables are just the same as they were at the beginning of this module. Linear transformations change the mean and standard deviation, but have no effect on the correlation; the correlation is the average crossproduct of z scores, so the mean and standard deviation are removed as part of the calculation of the correlation. To find the betas, just rerun the regression on the new dataset.
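Before rerunning the regression, here is a quick check of that claim (a sketch using the original variables): the correlation can be recovered as the crossproduct of z scores divided by the degrees of freedom, n - 1 = 19.

sum(scale(mech.apt) * scale(job.perf)) / (20 - 1)   # same as cor(mech.apt, job.perf), about .77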

res2 <- lm(job.perf ~ mech.apt + consc, data = chevy2)   # run linear multiple regression
summary(res2)
## 
## Call:
## lm(formula = job.perf ~ mech.apt + consc, data = chevy2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4405 -0.3717  0.1421  0.4189  0.8169 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.408e-16  1.356e-01   0.000   1.0000  
## mech.apt     5.235e-01  1.905e-01   2.748   0.0137 *
## consc        3.669e-01  1.905e-01   1.926   0.0710 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6065 on 17 degrees of freedom
## Multiple R-squared:  0.6709, Adjusted R-squared:  0.6322 
## F-statistic: 17.33 on 2 and 17 DF,  p-value: 7.888e-05

Note that the intercept is zero within rounding error. It should be zero, because the intercept equals the mean of Y minus the sum of the products of the regression slopes and the predictor means, and after standardization those means are all zero. The scale of the estimated regression weights has now changed to standard deviation units.
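A related check (a sketch using the unstandardized model res1 from above): each beta weight equals the corresponding b weight multiplied by the ratio of that predictor's standard deviation to the criterion's standard deviation.

coef(res1)['mech.apt'] * sd(mech.apt) / sd(job.perf)   # about .52, the beta for mech.apt
coef(res1)['consc'] * sd(consc) / sd(job.perf)         # about .37, the beta for consc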

R-square

The regression program (lm) showed an R-square of .6709, which is the proportion of variance in the dependent variable (job performance) accounted for by the set of predictors (conscientiousness and mechanical aptitude).

We can think of R-square in several ways, and it is helpful for our understanding to do so. One way to think of R-square is as the squared correlation between the predicted and actual values.

R <- cor(job.perf, Ypred)
Rsq2 <- R^2
Rsq2
## [1] 0.6709279

A second way to think of R-square is the proportion of variance in the dependent variable that can be attributed to the regression. We can compute the variance of Y and also the variance of Predicted Y (variance of the predicted values). The ratio of these is equal to R-square.

var(job.perf)
## [1] 1.565789
var(Ypred)
## [1] 1.050532
Rsq3 <- var(Ypred)/var(job.perf)
Rsq3
## [1] 0.6709279

A third way to think of R-square is as the sum of the products of each predictor's correlation with the criterion and its standardized regression coefficient. That is, R-square = (r1)(beta1) + (r2)(beta2).

Rsq4 <- .7740325 * .5235 + .7243901 * .3669   # rounded correlations and betas from above
Rsq4
## [1] 0.6709847
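The same quantity can be computed without copying numbers by hand (a sketch using the correlation matrix and the standardized model res2 from above); the tiny discrepancy above comes from rounding the coefficients.

r    <- cor(chevy)[c('mech.apt', 'consc'), 'job.perf']   # correlations of each predictor with Y
beta <- coef(res2)[c('mech.apt', 'consc')]               # beta weights
sum(r * beta)                                            # equals R-square, about .6709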

Increments in R-square

There is a general formula for testing the increment in R-square when comparing two models (see my PowerPoint slides). The smaller model must be a proper subset of the larger model. For example, the smaller model might be Y = a + b1A and the larger model might be Y = a + b1A + b2B + b3C. Because both models contain A, we could test whether adding B and C significantly increases R-square beyond what is obtained using A only. On the other hand, we could not test the model Y = a + b1A + b2D against Y = a + b1A + b2B + b3C, because D is not contained in the larger model. To illustrate this test, I will use the mtcars data on automobiles that comes with R.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

There are many variables in this dataset. I will use mpg (miles per gallon), cyl (number of cylinders in the engine), disp (displacement in cubic inches), and hp (horsepower). There are issues with this choice of data, e.g., the relations between mpg and the other variables are nonlinear, but ignore that for the sake of this illustration of testing incremental R-square.

mpg1 <- lm(mtcars$mpg ~ mtcars$cyl)
summary(mpg1)
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$cyl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9814 -2.1185  0.2217  1.0717  7.5186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
## mtcars$cyl   -2.8758     0.3224   -8.92 6.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7171 
## F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
mpg2 <- lm(mtcars$mpg ~ mtcars$cyl+mtcars$disp+mtcars$hp)
summary(mpg2)
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0889 -2.0845 -0.7745  1.3972  6.9183 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.18492    2.59078  13.195 1.54e-13 ***
## mtcars$cyl  -1.22742    0.79728  -1.540   0.1349    
## mtcars$disp -0.01884    0.01040  -1.811   0.0809 .  
## mtcars$hp   -0.01468    0.01465  -1.002   0.3250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.055 on 28 degrees of freedom
## Multiple R-squared:  0.7679, Adjusted R-squared:  0.743 
## F-statistic: 30.88 on 3 and 28 DF,  p-value: 5.054e-09

There are several interesting aspects to this analysis. Note that the number of cylinders alone is a good predictor of miles per gallon (R-square = .7262), and the regression weight for cylinders is significant.

When the other predictors (displacement and horsepower) are added, R-square increases to .7679, but now none of the predictors shows a significant slope. This is what correlated predictors can do: together they account for a great deal of variance, but each one's unique contribution is small. Is the increment in R-square significant?

anova(mpg1, mpg2)
## Analysis of Variance Table
## 
## Model 1: mtcars$mpg ~ mtcars$cyl
## Model 2: mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     30 308.33                              
## 2     28 261.37  2    46.965 2.5156 0.09891 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the increment in R-square is not significant at alpha = .05 (p = .099).
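To connect this back to the general formula on the slides, here is a sketch that reproduces the F for the change in R-square by hand; the degrees of freedom are 2 (the number of added predictors) and 28 (the residual df of the larger model).

R2.small <- summary(mpg1)$r.squared       # R-square for the cylinders-only model
R2.large <- summary(mpg2)$r.squared       # R-square for the three-predictor model
Fchange  <- ((R2.large - R2.small) / 2) / ((1 - R2.large) / 28)
Fchange                                   # matches the F from anova(), about 2.52
pf(Fchange, 2, 28, lower.tail = FALSE)    # same p-value, about .099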