Polynomial Regression for Curve Fitting

For the curve-fitting example, we'll load the car (Companion to Applied Regression) package and use the mtcars data on automobiles. In a previous module, we saw that gas mileage was nonlinearly related to engine horsepower.

library(car)                        # invoke the package
plot(mtcars$hp, mtcars$mpg)         # plot miles per gallon as a function of horsepower
abline(lm(mtcars$mpg~mtcars$hp))    # note curved relations - line is only good in the middle


You should see in the plot that the relations are nonlinear.

Next we examine the linear regression results.

res.lin2 <- lm(mtcars$mpg~mtcars$hp)                  # run linear regression
summary(res.lin2)  
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## mtcars$hp   -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Note that despite the curved relations, there is a strong linear component reflected in the R-square.

Next we compute a squared term for polynomial regression.

mtcars$hp2 <- mtcars$hp^2                  # create a squared term.

You could subtract the mean before squaring. Some people do this to reduce collinearity, but it is not really necessary unless your unsquared numbers are large. If you do subtract the mean, the regression weights will change, but the test of the increment in R-square will not.
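
If you want to verify that claim, here is a minimal sketch of the centered version (the hp.c and hp.c2 names are just illustrative and are not used in the analysis that follows). The weights differ from the uncentered fit below, but the R-square and the test of the squared term match.

mtcars$hp.c  <- mtcars$hp - mean(mtcars$hp)      # center horsepower at its mean
mtcars$hp.c2 <- mtcars$hp.c^2                    # square the centered variable
summary(lm(mpg ~ hp.c + hp.c2, data = mtcars))   # same R-square as the uncentered fit below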

res.squ2 <- lm(mtcars$mpg~mtcars$hp+mtcars$hp2)  # run regression including squared term
summary(res.squ2) 
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp + mtcars$hp2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5512 -1.6027 -0.6977  1.5509  8.7213 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.041e+01  2.741e+00  14.744 5.23e-15 ***
## mtcars$hp   -2.133e-01  3.488e-02  -6.115 1.16e-06 ***
## mtcars$hp2   4.208e-04  9.844e-05   4.275 0.000189 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.077 on 29 degrees of freedom
## Multiple R-squared:  0.7561, Adjusted R-squared:  0.7393 
## F-statistic: 44.95 on 2 and 29 DF,  p-value: 1.301e-09

Notice that the regression weight for the squared term is significant. That indicates a bend or curve in the regression line.

I used the coefficients from the regression with the squared term to create a curve for plotting.

plot(mtcars$hp, mtcars$mpg, xlab='Horsepower', ylab='Miles per Gallon', main='Curvilinear Relations')                   
curve((40.41-.2133*x+.00042*x^2), from=50, to=300, add=TRUE) # this is the curved line, using the coefficients from the output above

The line is fit by least squares. Beware of outliers and extrapolating beyond the observed data.
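
If you would rather not retype the coefficients by hand (and risk a transcription error), you can pull them from the fitted model object instead. A minimal sketch, assuming the scatterplot above is still the active plot (add=TRUE draws on it):

b <- coef(res.squ2)                                                # intercept, hp weight, hp^2 weight
curve(b[1] + b[2]*x + b[3]*x^2, from=50, to=300, add=TRUE, lty=2)  # overlays the hand-typed curve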

Testing for Interactions

In the PowerPoint slides, I showed an example in which performance on a task is regressed on tests of cognitive ability and originality (creativity). The data for that example follow:

library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
productive <- c(50, 35, 40, 50, 55, 35, 45, 55, 50, 40, 45, 50, 60, 65, 55, 50, 55, 55, 60, 65)
original   <- c(40, 45, 50, 55, 60, 40, 45, 50, 55, 60, 40, 45, 50, 55, 60, 40, 45, 50, 60, 65)
cognitive  <- c(100, 80, 90, 105, 110, 95, 100, 105, 95, 90, 110, 115, 120, 125, 105, 110, 95, 115, 120, 140)
nonlin <- data.frame(productive, original, cognitive)

First, let’s look at the descriptive information.

describe(nonlin)
##            vars  n   mean    sd median trimmed   mad min max range  skew
## productive    1 20  50.75  8.78     50   50.94  7.41  35  65    30 -0.22
## original      2 20  50.50  7.93     50   50.31  7.41  40  65    25  0.15
## cognitive     3 20 106.25 14.04    105  105.62 14.83  80 140    60  0.37
##            kurtosis   se
## productive    -0.89 1.96
## original      -1.35 1.77
## cognitive     -0.21 3.14
pairs(nonlin)

cor(nonlin)
##            productive  original cognitive
## productive  1.0000000 0.5047006 0.8356751
## original    0.5047006 1.0000000 0.3841369
## cognitive   0.8356751 0.3841369 1.0000000

As you can see, there are 20 observations (no missing data), all the variables are positively correlated, and although there aren't many data points, the scatterplots do not look pathological.

To test for an interaction, we first need to compute a product term. Some people subtract the mean from each of the constituent variables (here, 'original' and 'cognitive') before forming the product. This won't make a difference unless the numbers in the original variables are large (the issue is one of numerical precision in how the computer represents the numbers). If you do subtract the means, the regression weights will change, but the test for the interaction will not.

nonlin$int <- nonlin$original*nonlin$cognitive
cor(nonlin) # check the resulting correlations
##            productive  original cognitive       int
## productive  1.0000000 0.5047006 0.8356751 0.7846463
## original    0.5047006 1.0000000 0.3841369 0.8495388
## cognitive   0.8356751 0.3841369 1.0000000 0.8073162
## int         0.7846463 0.8495388 0.8073162 1.0000000

Unsurprisingly, the interaction term is correlated with the variables used to compute it.
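
If that collinearity concerns you, the mean-centering mentioned above reduces it considerably. A minimal sketch (the .c names are illustrative only); note that the t-test of the product term is unchanged by centering.

original.c  <- nonlin$original  - mean(nonlin$original)    # center each constituent variable
cognitive.c <- nonlin$cognitive - mean(nonlin$cognitive)
int.c       <- original.c * cognitive.c                    # product of the centered variables
cor(cbind(original.c, cognitive.c, int.c))                 # product is typically much less correlated with its parts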

Now we can test for the significance of the interaction.

res.int <- lm(nonlin$productive ~ nonlin$original + nonlin$cognitive + nonlin$int) # run mult reg with interaction
summary(res.int) 
## 
## Call:
## lm(formula = nonlin$productive ~ nonlin$original + nonlin$cognitive + 
##     nonlin$int)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1400 -2.9400 -0.3366  2.5204 11.2893 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -45.483748  58.312674  -0.780    0.447
## nonlin$original    0.872336   1.078139   0.809    0.430
## nonlin$cognitive   0.790091   0.544515   1.451    0.166
## nonlin$int        -0.005876   0.009895  -0.594    0.561
## 
## Residual standard error: 4.843 on 16 degrees of freedom
## Multiple R-squared:  0.7436, Adjusted R-squared:  0.6955 
## F-statistic: 15.47 on 3 and 16 DF,  p-value: 5.481e-05

As you can see, although the overall model shows a large, significant R-square, none of the regression weights is statistically significant. We would infer that the interaction is not significant (although this test is not very powerful; to detect an interaction reliably, you generally need hundreds of observations).

Let’s rerun the analysis without the interaction term.

res.int2 <- lm(nonlin$productive ~ nonlin$original + nonlin$cognitive) # run mult reg WITHOUT interaction
summary(res.int2) 
## 
## Call:
## lm(formula = nonlin$productive ~ nonlin$original + nonlin$cognitive)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9496 -3.1249  0.1038  3.0403 10.8579 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -11.31387    9.26554  -1.221    0.239    
## nonlin$original    0.23849    0.14883   1.602    0.127    
## nonlin$cognitive   0.47078    0.08409   5.599  3.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.75 on 17 degrees of freedom
## Multiple R-squared:  0.7379, Adjusted R-squared:  0.7071 
## F-statistic: 23.93 on 2 and 17 DF,  p-value: 1.139e-05

As you can see, the R-square is still large, and now the cognitive test slope is significant.
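
The increment in R-square from adding the interaction can also be tested directly by comparing the two nested models. A minimal sketch using the model objects fit above (with a single added term, this F test is equivalent to the t test of the product term in the full model):

anova(res.int2, res.int)   # F test of the change in R-square when the product term is added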

Second Example

Let’s suppose that scores on exams are a function of total study time in minutes (studymins) and the number of breaks taken during study (breaks). We will manufacture some data to fit the idea and then run the analysis.

set.seed(100)
studymins <- round(rnorm(50,60,10),0)
breaks <- round(runif(50)*5)+1
int <- studymins*breaks
examscore <- round((.8*studymins+.3*breaks +.25*int+rnorm(50,0,3))/1.52,0)
examdata <- cbind(examscore, studymins, breaks)

Let’s look at the descriptives.

describe(examdata) 
##           vars  n  mean    sd median trimmed   mad min max range skew
## examscore    1 50 66.38 20.75   65.0   65.78 26.69  32 112    80 0.16
## studymins    2 50 60.80  8.22   60.5   60.62  6.67  41  83    42 0.17
## breaks       3 50  3.36  1.71    3.0    3.33  2.22   1   6     5 0.07
##           kurtosis   se
## examscore    -1.00 2.93
## studymins     0.41 1.16
## breaks       -1.29 0.24
cor(examdata)      
##           examscore studymins    breaks
## examscore 1.0000000 0.5339656 0.9045083
## studymins 0.5339656 1.0000000 0.1401545
## breaks    0.9045083 0.1401545 1.0000000
pairs(examdata)    

So far, it looks like taking lots of breaks is the key to success on the exam. Let’s test for the interaction.

res.int3 <- lm(examscore ~ studymins + breaks + int)
summary(res.int3) 
## 
## Call:
## lm(formula = examscore ~ studymins + breaks + int)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9692 -1.3075 -0.0443  1.0697  5.3873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.97327    4.75079  -2.099  0.04131 *  
## studymins    0.68453    0.07877   8.690 2.90e-11 ***
## breaks       3.63835    1.30351   2.791  0.00762 ** 
## int          0.10915    0.02131   5.122 5.84e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.929 on 46 degrees of freedom
## Multiple R-squared:  0.9919, Adjusted R-squared:  0.9914 
## F-statistic:  1875 on 3 and 46 DF,  p-value: < 2.2e-16

As you can see, the interaction is significant, suggesting that breaks by themselves don't help much; their benefit grows with the amount of time spent studying.

We can plot the interaction like this:

plot(studymins, examscore, ylab = 'Exam Score', xlab = 'Study Minutes')
abline(a=(-9.97327+3.63835*1), b= (.68453+1*.10915))     # values from the regression output, substitute specific numbers for breaks
abline(a=(-9.97327+3.63835*3), b= (.68453+3*.10915))     # many people choose M, +1SD, -1SD as values for lines
abline(a=(-9.97327+3.63835*5), b= (.68453+5*.10915))
text(77, 45, '1 study break')
text(77, 70, '3 study breaks')
text(77, 93, '5 study breaks')

Note that I chose the values of a and b for the regression lines from the output of the regression analysis, substituting 1, 3, and 5 for the number of study breaks. When the two variables are continuous with no obvious break points (e.g., the cognitive ability and originality scores as predictors), it is customary to leave one predictor alone and to draw lines at three values of the other. The three values are typically the mean of that variable, one standard deviation above the mean, and one standard deviation below the mean. So, for example, we would plot productivity against cognitive ability and show separate regression lines for people of average, above-average, and below-average originality (a sketch of this appears below).

As you can see from the graph, the slope becomes steeper as the breaks increase.
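
For the productivity example, here is a minimal sketch of that mean and plus/minus one standard deviation convention, using the coefficients from res.int printed above (the plotting choices are illustrative, not part of the original analysis):

b <- coef(res.int)                                  # intercept, original, cognitive, interaction
plot(nonlin$cognitive, nonlin$productive, xlab='Cognitive Ability', ylab='Productivity')
for (o in mean(nonlin$original) + c(-1, 0, 1)*sd(nonlin$original)) {
  abline(a = b[1] + b[2]*o, b = b[3] + b[4]*o)      # one line per level of originality
}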

There are other ways to produce similar kinds of graphs in R (ggplot is nice). This one is simple and direct.
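
For instance, here is a minimal ggplot2 sketch (this assumes the ggplot2 package is installed; note that geom_smooth fits a separate simple regression for each number of breaks rather than using the multiple regression coefficients):

library(ggplot2)
plotdat <- data.frame(examscore, studymins, breaks = factor(breaks))
ggplot(plotdat, aes(x = studymins, y = examscore, color = breaks)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +          # one fitted line per number of breaks
  labs(x = 'Study Minutes', y = 'Exam Score', color = 'Study Breaks')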