Detecting nonlinear relations

First, we load the packages we need for this module.

library(car)       # opens the companion to applied regression pkg
library(psych)     # for descriptive stats
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
library(plotly)    # plot package
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)   # another plot package

We begin by asking for univariate descriptive statistics (mean, sd, etc.) and then running a regression with the ‘mtcars’ dataset that comes built into R (in the datasets package). The model shows the regression of miles per gallon on horsepower and weight of automobiles.

options(digits=3)
describe(mtcars$mpg)
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 32 20.1 6.03   19.2    19.7 5.41 10.4 33.9  23.5 0.61    -0.37
##      se
## X1 1.07
describe(mtcars$hp)
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 32  147 68.6    123     141 77.1  52 335   283 0.73    -0.14 12.1
describe(mtcars$wt)
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 32 3.22 0.98   3.33    3.15 0.77 1.51 5.42  3.91 0.42    -0.02
##      se
## X1 0.17
fit <- lm(mpg~hp+wt, data=mtcars)  # run a regression on automobiles (mpg, horsepower, weight)
summary(fit)                       # summary output of the regression
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879   23.28  < 2e-16 ***
## hp          -0.03177    0.00903   -3.52   0.0015 ** 
## wt          -3.87783    0.63273   -6.13  1.1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.59 on 29 degrees of freedom
## Multiple R-squared:  0.827,  Adjusted R-squared:  0.815 
## F-statistic: 69.2 on 2 and 29 DF,  p-value: 9.11e-12

As you can see, we have a handsome R-square, and both horsepower and weight are significant predictors, even though we have only 32 autos in our sample.
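As a sketch of my own (not part of the original module), you can also put confidence intervals around these slopes with confint(); since both predictors are significant, neither interval should include zero.

```r
# Hedged sketch: 95% confidence intervals for the coefficients
# of the same mpg ~ hp + wt model.
fit <- lm(mpg ~ hp + wt, data = mtcars)
confint(fit)  # neither the hp nor the wt interval includes zero
```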

plot(mtcars$hp, mtcars$mpg)
identify(mtcars$hp, mtcars$mpg, labels=row.names(mtcars)) # identify points 
## integer(0)
abline(lm(mtcars$mpg~mtcars$hp))

The above code produces a plot in your plot window. You can click on any of the points you want, and R will record which ones you chose. The plot window will have a ‘Finish’ button; when you click it, labels from your dataset will appear on the plot next to your chosen points, like this:

The labeling on the graph is particularly helpful if you have many data points and you spot an outlier (or if you want to call your reader’s attention to a particular point), because it helps you find that case in your data. The graph should make clear that horsepower and gas mileage are nonlinearly related: mpg is higher than predicted at low horsepower, lower than predicted at moderate horsepower, and higher than predicted again at high horsepower.
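Note that identify() only works in an interactive session. If you are knitting a document or running a script, a non-interactive alternative (my suggestion, not part of the original module) is to label points directly with text():

```r
# Non-interactive labeling: put each car's row name next to its point.
plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Miles per gallon")
abline(lm(mpg ~ hp, data = mtcars))
text(mtcars$hp, mtcars$mpg, labels = rownames(mtcars), pos = 4, cex = 0.6)
```

In practice you would usually label only a subset (e.g., the outliers) to keep the plot readable.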

One way to examine nonlinearity is to plot residuals. Nonlinearity is usually obvious from a scatterplot when there is only one predictor, but harder to see with multiple predictors. To see the relations between the criterion and the set of predictors, we plot the residuals against the predicted (fitted) values from the model.

mtcars$Ypred <- predict(fit)
mtcars$Yres <- resid(fit)
plot(mtcars$Ypred, mtcars$Yres)

As you can see, the residuals form a ‘U’ shape, indicating nonlinear relations. You want to see a swarm of bees (basically a circular point cloud). As with most plots, the nonlinearity would be clearer if we had a larger sample of observations.
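One common remedy for a U-shaped residual plot (a sketch of my own, not part of the original module) is to add a squared term for the offending predictor and test whether it improves the fit:

```r
fit      <- lm(mpg ~ hp + wt, data = mtcars)            # original model
fit_quad <- lm(mpg ~ hp + I(hp^2) + wt, data = mtcars)  # add squared horsepower
anova(fit, fit_quad)  # F-test: does the quadratic term improve the model?
```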

Finding Outliers

As I mentioned earlier, the plot function can help spot outliers visually. There are also diagnostic statistics to quantify the outliers.

# Assessing Outliers
outlierTest(fit) # Bonferroni p-value for most extreme obs
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##                rstudent unadjusted p-value Bonferonni p
## Toyota Corolla     2.61             0.0145        0.465

In this case, there are no particularly deviant observations according to the studentized residuals test (see my PowerPoint slides). However, I want to show you how to delete a deviant observation and rerun the analysis without it.

mtcars2 <- subset(mtcars, rownames(mtcars)!='Toyota Corolla')
fit2 <- lm(mpg ~ hp+wt, data=mtcars2)
outlierTest(fit2) # Bonferroni p-value for most extreme obs
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##          rstudent unadjusted p-value Bonferonni p
## Fiat 128     3.26            0.00301       0.0932

Here I subset the data to exclude the Toyota Corolla and reran the analysis. Now the Fiat 128 is the most deviant observation, but it is not exceptional by the Bonferroni test.

We can also run a QQ plot to see whether the residuals appear to be normally distributed.

qqPlot(fit, main="QQ Plot") #qq plot for studentized resid 

This plot looks pretty good to me. You should expect some wandering away from the line at the extremes, but the middle looks as it should.
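If you want a formal test to complement the visual check (my addition, not part of the original module), the Shapiro-Wilk test can be applied to the residuals:

```r
# Shapiro-Wilk test on the residuals of the mpg ~ hp + wt model.
fit <- lm(mpg ~ hp + wt, data = mtcars)
shapiro.test(resid(fit))  # null hypothesis: residuals are normally distributed
```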

Influential Data Points

One method of finding influential data points is Cook’s D.

cutoff <- 4/(nrow(mtcars) - length(fit$coefficients) - 2)  # common rule of thumb: 4/(n - k - 2)
plot(fit, which=4, cook.levels=cutoff)

As you can see, the Chrysler Imperial is likely to be influential in the model.
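The same cutoff can also be applied programmatically (a sketch of my own, not part of the original module) by extracting the Cook's D values with cooks.distance() and flagging the cases above it:

```r
# Flag observations whose Cook's D exceeds the 4/(n - k - 2) cutoff.
fit    <- lm(mpg ~ hp + wt, data = mtcars)
cutoff <- 4 / (nrow(mtcars) - length(coef(fit)) - 2)
cd     <- cooks.distance(fit)
names(cd[cd > cutoff])  # cars whose Cook's D exceeds the cutoff
```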

There are several measures of influence shown below. One that I like is dfbeta, which examines the impact of removing each observation on the model’s coefficients (intercept and slopes). This is quite useful because it shows the impact directly.

influence.measures(fit)
## Influence measures of
##   lm(formula = mpg ~ hp + wt, data = mtcars) :
## 
##                       dfb.1_    dfb.hp    dfb.wt    dffit cov.r   cook.d
## Mazda RX4           -0.16135  0.032966  0.063930 -0.21849 1.043 1.59e-02
## Mazda RX4 Wag       -0.06932  0.045785 -0.000407 -0.12666 1.112 5.46e-03
## Datsun 710          -0.21120  0.043375  0.097231 -0.24910 1.068 2.07e-02
## Hornet 4 Drive       0.00267 -0.006839  0.004489  0.01170 1.166 4.72e-05
## Hornet Sportabout    0.00178  0.009208 -0.001554  0.02816 1.151 2.74e-04
## Valiant             -0.00599  0.180374 -0.151657 -0.25381 1.084 2.16e-02
## Duster 360           0.00471 -0.159989  0.078103 -0.19162 1.222 1.26e-02
## Merc 240D            0.03426 -0.189553  0.122412  0.22192 1.207 1.68e-02
## Merc 230             0.01979 -0.055076  0.033257  0.07976 1.169 2.19e-03
## Merc 280            -0.00320  0.036709 -0.033730 -0.06722 1.154 1.55e-03
## Merc 280C           -0.00905  0.103810 -0.095385 -0.19010 1.079 1.22e-02
## Merc 450SE          -0.02697 -0.005712  0.035697  0.06428 1.168 1.42e-03
## Merc 450SL          -0.00396  0.003400  0.004930  0.02056 1.158 1.46e-04
## Merc 450SLC          0.03157 -0.016800 -0.040052 -0.13571 1.110 6.27e-03
## Cadillac Fleetwood  -0.00642 -0.002578  0.007550  0.00898 1.364 2.79e-05
## Lincoln Continental -0.16879 -0.058243  0.190313  0.22792 1.375 1.78e-02
## Chrysler Imperial   -0.92406 -0.148010  0.935600  1.23167 0.723 4.24e-01
## Fiat 128             0.60518 -0.311247 -0.167276  0.74915 0.648 1.57e-01
## Honda Civic          0.15639 -0.034027 -0.081914  0.16533 1.241 9.37e-03
## Toyota Corolla       0.80467 -0.170934 -0.411461  0.86599 0.643 2.08e-01
## Toyota Corona       -0.23133  0.066064  0.088214 -0.29201 1.001 2.79e-02
## Dodge Challenger     0.00392  0.049775 -0.088848 -0.25339 0.962 2.09e-02
## AMC Javelin         -0.01961  0.037837 -0.073420 -0.29471 0.888 2.75e-02
## Camaro Z28           0.02992 -0.128670  0.039074 -0.17048 1.206 9.94e-03
## Pontiac Firebird    -0.05881 -0.002278  0.086874  0.20781 1.055 1.44e-02
## Fiat X1-9           -0.03756  0.010209  0.017426 -0.04142 1.222 5.92e-04
## Porsche 914-2       -0.00366  0.000316  0.002059 -0.00405 1.196 5.67e-06
## Lotus Europa         0.42341  0.188397 -0.407234  0.47152 1.154 7.35e-02
## Ford Pantera L      -0.02254 -0.148176  0.099935 -0.16103 1.382 8.92e-03
## Ferrari Dino        -0.06551 -0.085183  0.086980 -0.12940 1.162 5.73e-03
## Maserati Bora       -0.00748  0.865764 -0.499905  0.90752 1.606 2.72e-01
## Volvo 142E          -0.08000  0.038407  0.012754 -0.12823 1.113 5.60e-03
##                        hat inf
## Mazda RX4           0.0443    
## Mazda RX4 Wag       0.0405    
## Datsun 710          0.0602    
## Hornet 4 Drive      0.0475    
## Hornet Sportabout   0.0369    
## Valiant             0.0672    
## Duster 360          0.1170    
## Merc 240D           0.1157    
## Merc 230            0.0600    
## Merc 280            0.0469    
## Merc 280C           0.0469    
## Merc 450SE          0.0562    
## Merc 450SL          0.0412    
## Merc 450SLC         0.0426    
## Cadillac Fleetwood  0.1858   *
## Lincoln Continental 0.2090   *
## Chrysler Imperial   0.1865   *
## Fiat 128            0.0799   *
## Honda Civic         0.1230    
## Toyota Corolla      0.0995   *
## Toyota Corona       0.0530    
## Dodge Challenger    0.0357    
## AMC Javelin         0.0334    
## Camaro Z28          0.1030    
## Pontiac Firebird    0.0445    
## Fiat X1-9           0.0923    
## Porsche 914-2       0.0708    
## Lotus Europa        0.1536    
## Ford Pantera L      0.2044   *
## Ferrari Dino        0.0670    
## Maserati Bora       0.3942   *
## Volvo 142E          0.0414
dfbeta(fit)  # impact of deleting each observation on each coefficient
##                     (Intercept)        hp       wt
## Mazda RX4              -0.25782  2.98e-04  0.04043
## Mazda RX4 Wag          -0.11204  4.18e-04 -0.00026
## Datsun 710             -0.33785  3.92e-04  0.06155
## Hornet 4 Drive          0.00435 -6.28e-05  0.00289
## Hornet Sportabout       0.00290  8.46e-05 -0.00100
## Valiant                -0.00959  1.63e-03 -0.09613
## Duster 360              0.00762 -1.46e-03  0.05005
## Merc 240D               0.05537 -1.73e-03  0.07830
## Merc 230                0.03214 -5.05e-04  0.02138
## Merc 280               -0.00520  3.37e-04 -0.02168
## Merc 280C              -0.01453  9.42e-04 -0.06063
## Merc 450SE             -0.04383 -5.24e-05  0.02296
## Merc 450SL             -0.00644  3.12e-05  0.00317
## Merc 450SLC             0.05100 -1.53e-04 -0.02560
## Cadillac Fleetwood     -0.01045 -2.37e-05  0.00486
## Lincoln Continental    -0.27368 -5.33e-04  0.12212
## Chrysler Imperial      -1.35220 -1.22e-03  0.54183
## Fiat 128                0.88757 -2.58e-03 -0.09709
## Honda Civic             0.25358 -3.12e-04 -0.05256
## Toyota Corolla          1.17463 -1.41e-03 -0.23771
## Toyota Corona          -0.36656  5.91e-04  0.05532
## Dodge Challenger        0.00620  4.44e-04 -0.05552
## AMC Javelin            -0.03056  3.33e-04 -0.04529
## Camaro Z28              0.04846 -1.18e-03  0.02505
## Pontiac Firebird       -0.09414 -2.06e-05  0.05504
## Fiat X1-9              -0.06109  9.38e-05  0.01122
## Porsche 914-2          -0.00595  2.91e-06  0.00133
## Lotus Europa            0.67433  1.69e-03 -0.25668
## Ford Pantera L         -0.03660 -1.36e-03  0.06424
## Ferrari Dino           -0.10615 -7.80e-04  0.05578
## Maserati Bora          -0.01191  7.78e-03 -0.31487
## Volvo 142E             -0.12929  3.51e-04  0.00816

You can see, for example, that the Chrysler Imperial has a major impact on the slope for weight.
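To see concretely what dfbeta means (a check of my own, not part of the original module), refit the model without the Chrysler Imperial and compare coefficients; the difference reproduces that car's row of the dfbeta table:

```r
# dfbeta for observation i is coef(full model) - coef(model without i).
fit      <- lm(mpg ~ hp + wt, data = mtcars)
fit_noCI <- lm(mpg ~ hp + wt,
               data = subset(mtcars, rownames(mtcars) != "Chrysler Imperial"))
coef(fit) - coef(fit_noCI)  # equals dfbeta(fit)["Chrysler Imperial", ]
```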

Collinearity

Recall that earlier we ran a regression relating gas mileage to horsepower and weight. Both predictors were significant. Now we will add a third predictor (number of cylinders) to the equation.

fit2 <- lm(mpg~hp+wt+cyl, data=mtcars)  # add cyl as a third predictor (this overwrites the earlier fit2)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ hp + wt + cyl, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.929 -1.560 -0.531  1.185  5.899 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.7518     1.7869   21.69   <2e-16 ***
## hp           -0.0180     0.0119   -1.52   0.1400    
## wt           -3.1670     0.7406   -4.28   0.0002 ***
## cyl          -0.9416     0.5509   -1.71   0.0985 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.51 on 28 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.826 
## F-statistic: 50.2 on 3 and 28 DF,  p-value: 2.18e-11

As you can see, the model R-square is still significant (and large), but neither horsepower nor cylinders is significant at the .05 level. Horsepower and cylinders are highly correlated.

The standard collinearity diagnostic in R is the variance inflation factor (VIF), which the car package provides through its vif() function.

vif(fit2) 
##   hp   wt  cyl 
## 3.26 2.58 4.76

Although none of these approaches the usual rule of thumb (VIF = 10), they are large enough to cause problems. Consider the correlations among the variables.
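To see where a VIF comes from (my illustration, not part of the original module): the VIF for a predictor is 1/(1 - R²), where R² comes from regressing that predictor on the other predictors.

```r
# Manual VIF for hp: regress hp on the other two predictors,
# then apply 1 / (1 - R^2).
r2_hp <- summary(lm(hp ~ wt + cyl, data = mtcars))$r.squared
1 / (1 - r2_hp)  # matches the hp value reported by vif(), about 3.26
```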

cardat <- data.frame(mtcars2$mpg, mtcars2$cyl, mtcars2$hp, mtcars2$wt)  # note: mtcars2 excludes the Toyota Corolla
cor(cardat)
##             mtcars2.mpg mtcars2.cyl mtcars2.hp mtcars2.wt
## mtcars2.mpg       1.000      -0.857     -0.773     -0.866
## mtcars2.cyl      -0.857       1.000      0.824      0.770
## mtcars2.hp       -0.773       0.824      1.000      0.639
## mtcars2.wt       -0.866       0.770      0.639      1.000

Given the size of the correlations among the predictors, it is no surprise that we have nonsignificant slopes even though each of the predictors is strongly correlated with the criterion.
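When collinear predictors are individually nonsignificant, a nested-model comparison (my sketch, not part of the original module) can show that they still matter jointly:

```r
# Joint F-test: do hp and cyl together improve on a weight-only model?
fit_wt  <- lm(mpg ~ wt, data = mtcars)             # weight only
fit_all <- lm(mpg ~ wt + hp + cyl, data = mtcars)  # add the collinear pair
anova(fit_wt, fit_all)  # tests hp and cyl as a set
```

This is one reason collinearity is not fatal: the set of predictors can carry real information even when no single slope is significant.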