Correlation - Pearson’s r

This page takes its data from the ‘car’ (Companion for Applied Regression) package, because typing data inline makes it hard to include enough observations for a realistic correlation problem.

The Davis dataset contains self-reported height and weight along with measured height and weight for a group of Canadians.

library(psych)
library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

str(Davis)

## 'data.frame':    200 obs. of  5 variables:
##  $ sex   : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 2 2 2 2 ...
##  $ weight: int  77 58 53 68 59 76 76 69 71 65 ...
##  $ height: int  182 161 161 177 157 170 167 186 178 171 ...
##  $ repwt : int  77 51 54 70 59 76 77 73 71 64 ...
##  $ repht : int  180 159 158 175 155 165 165 180 175 170 ...

As the output shows, the data frame holds sex (F or M), measured weight and height, and self-reported weight (repwt) and height (repht). Weight is recorded in kilograms and height in centimetres.

Computing the correlations takes a single call to cor(), but all the variables must be numeric, and sex is coded as a factor. Further, because repwt and repht contain missing values, we need to request ‘complete.obs’ if we want listwise deletion, or else request ‘pairwise.complete.obs’ for pairwise deletion.
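The difference between the two deletion options is easiest to see on a small example. The sketch below is not part of the original analysis: it rebuilds the first ten cases printed by str() above (without sex) in a hypothetical data frame `d` and sets two values to NA purely for illustration, so the resulting numbers say nothing about the full Davis data.

```r
# First ten cases of Davis as printed by str(). The two NAs are inserted
# here for illustration only; they are NOT missing in the real data.
d <- data.frame(
  weight = c(77, 58, 53, 68, 59, 76, 76, 69, 71, 65),
  height = c(182, 161, 161, 177, 157, 170, 167, 186, 178, 171),
  repwt  = c(77, 51, NA, 70, 59, 76, 77, 73, 71, 64),
  repht  = c(180, 159, 158, 175, 155, NA, 165, 180, 175, 170)
)

# Listwise deletion: rows 3 and 6 are dropped for every coefficient.
r_list <- cor(d, use = "complete.obs")

# Pairwise deletion: a row is dropped only for pairs that involve its NA.
r_pair <- cor(d, use = "pairwise.complete.obs")

# Number of cases behind each approach.
n_listwise <- sum(complete.cases(d))   # rows with no NA at all
ok <- !is.na(d)                        # logical matrix: TRUE where observed
n_pairwise <- t(ok) %*% ok             # available n for every pair of variables

r_list["weight", "height"]             # computed from the complete rows only
r_pair["weight", "height"]             # computed from all ten rows
n_pairwise
```

With listwise deletion every coefficient rests on the same set of complete cases; with pairwise deletion the weight-height coefficient uses all ten rows, while coefficients involving repwt or repht use fewer.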

Davis$sex <- as.numeric(Davis$sex)   # factor codes: F = 1, M = 2
cor(Davis, use = 'complete.obs')     # listwise deletion of cases with any NA

##              sex    weight    height     repwt     repht
## sex    1.0000000 0.5753653 0.5824326 0.7178326 0.7381536
## weight 0.5753653 1.0000000 0.1542575 0.8353758 0.6314352
## height 0.5824326 0.1542575 1.0000000 0.6037367 0.7391662
## repwt  0.7178326 0.8353758 0.6037367 1.0000000 0.7618604
## repht  0.7381536 0.6314352 0.7391662 0.7618604 1.0000000

The correlation between height and weight (r = 0.15) seems suspiciously low to me, so I plotted the two variables.

plot(Davis$height, Davis$weight)

As you can see, there is a single outlier that causes the trouble. I had forgotten that one of the observations (case 12) is transposed, so that its height and weight are each punched into the other's column in the data.
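A transposed record of this kind can also be located in code rather than by eye. The sketch below is an illustration only: the hypothetical data frame `d` holds the first ten cases printed by str() above plus an eleventh row that mimics the transposed case, with values taken from the fix that follows (weight 166, height 57). The rule "weight greater than height" is an assumed heuristic that relies on weight being recorded in kilograms and height in centimetres.

```r
# Ten cases from the str() output plus one row (the 11th) that mimics
# the transposed record; this mini data frame is for illustration only.
d <- data.frame(
  weight = c(77, 58, 53, 68, 59, 76, 76, 69, 71, 65, 166),
  height = c(182, 161, 161, 177, 157, 170, 167, 186, 178, 171, 57)
)

r_before <- cor(d$weight, d$height)

# Assumed heuristic: an adult's weight in kg should not exceed height in cm.
bad <- which(d$weight > d$height)
bad                                   # row index of the suspect case(s)

# Swap the two values for the flagged row(s) via a temporary copy.
tmp <- d$weight[bad]
d$weight[bad] <- d$height[bad]
d$height[bad] <- tmp

r_after <- cor(d$weight, d$height)
c(before = r_before, after = r_after)
```

Flagging by rule and then swapping avoids typing the corrected values by hand, but the flagged rows should still be inspected before they are changed.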

So we fix the data problem.

Davis$weight[12] <- 57
Davis$height[12] <- 166
plot(Davis$height, Davis$weight)

That looks better. Let’s recompute the correlations.

cor(Davis, use = 'complete.obs')

##              sex    weight    height     repwt     repht
## sex    1.0000000 0.6983927 0.7394358 0.7178326 0.7381536
## weight 0.6983927 1.0000000 0.7684924 0.9861233 0.7486882
## height 0.7394358 0.7684924 1.0000000 0.7827870 0.9755870
## repwt  0.7178326 0.9861233 0.7827870 1.0000000 0.7618604
## repht  0.7381536 0.7486882 0.9755870 0.7618604 1.0000000

These are all substantial correlations; in particular, height and weight now correlate 0.77 rather than 0.15.

To compute conventional significance tests (H0: rho = 0), use corr.test() from the psych package:

corr.test(Davis, use='complete.obs')

## Call:corr.test(x = Davis, use = "complete.obs")
## Correlation matrix 
##         sex weight height repwt repht
## sex    1.00   0.70   0.74  0.72  0.74
## weight 0.70   1.00   0.77  0.99  0.75
## height 0.74   0.77   1.00  0.78  0.98
## repwt  0.72   0.99   0.78  1.00  0.76
## repht  0.74   0.75   0.98  0.76  1.00
## Sample Size 
##        sex weight height repwt repht
## sex    200    200    200   183   183
## weight 200    200    200   183   183
## height 200    200    200   183   183
## repwt  183    183    183   183   181
## repht  183    183    183   181   183
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##        sex weight height repwt repht
## sex      0      0      0     0     0
## weight   0      0      0     0     0
## height   0      0      0     0     0
## repwt    0      0      0     0     0
## repht    0      0      0     0     0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
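For a single pair of variables, the conventional test of H0: rho = 0 is a t test with n - 2 degrees of freedom, where t = r * sqrt(n - 2) / sqrt(1 - r^2). As a minimal sketch, the base R code below computes this by hand and compares it with cor.test(). It uses only the first ten cases of weight and height printed by str() above, so its numbers will not match the full-sample results; the object names are illustrative.

```r
# First ten cases of Davis as printed by str(); illustration only.
weight <- c(77, 58, 53, 68, 59, 76, 76, 69, 71, 65)
height <- c(182, 161, 161, 177, 157, 170, 167, 186, 178, 171)

n <- length(weight)
r <- cor(weight, height)

# t statistic and two-sided p value for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# Base R's built-in test for one pair of variables
ct <- cor.test(weight, height)

c(by_hand = t_stat, cor.test = unname(ct$statistic))
c(by_hand = p_val,  cor.test = ct$p.value)
```

Unlike the upper triangle of the corr.test() table, these p values carry no adjustment for multiple tests.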

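The corr.test() output is comparable to what SAS produces. Note that the p values are all zero to two decimal places, that is, they are very small. The printout should probably omit the p values on the diagonal and show the listwise sample size, but oh well: the Sample Size table reports pairwise counts, and the listwise n here is 181, the number of cases with both repwt and repht, because the other three variables have no missing values.

We should also note that sex is a binary variable, yet its ordinary Pearson correlation with each other variable (the so-called point-biserial correlation) is computed and printed in this example as well. Many would prefer to use logistic regression in such a case, which treats sex as the outcome. But if sex is considered to be an independent variable (as it probably would be in this case), then the correlation is a direct function of the pooled-variance independent-samples t statistic and has intuitive meaning.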
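The link between the point-biserial correlation and the t-test can be checked numerically. The sketch below again uses only the first ten cases printed by str() above (sex coded 1 = F, 2 = M, as as.numeric() codes the factor), so it illustrates the identity rather than any result about the full sample.

```r
# First ten cases of Davis as printed by str(); illustration only.
sex    <- c(2, 1, 1, 2, 1, 2, 2, 2, 2, 2)   # 1 = F, 2 = M
weight <- c(77, 58, 53, 68, 59, 76, 76, 69, 71, 65)

# Point-biserial correlation: an ordinary Pearson r with a binary variable
ct <- cor.test(sex, weight)

# Independent-samples t-test with pooled variance (var.equal = TRUE),
# males minus females so the sign matches the correlation
tt <- t.test(weight[sex == 2], weight[sex == 1], var.equal = TRUE)

c(t_from_r = unname(ct$statistic), t_test = unname(tt$statistic))
c(p_from_r = ct$p.value,           p_test = tt$p.value)
```

The two t statistics and p values agree, which is why the point-biserial correlation can be read as a standardized description of the mean difference between the two groups. The identity holds for the pooled-variance test only, not for the Welch test that t.test() runs by default.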

We may also want to see the scatterplots for every pair of the continuous variables.

Davis2 <- Davis[,2:5]                    # subset of the 4 continuous variables
pairs(Davis2)

It appears that people in this sample reported their height and weight rather accurately: measured and reported weight correlate 0.99, and measured and reported height correlate 0.98.
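A high correlation between measured and reported values shows that people are ranked consistently, but not by itself that the reports are unbiased, because adding a constant to every report leaves r unchanged. One complementary check is to look at the measured-minus-reported differences directly. It is sketched here on the first ten cases printed by str() above; the names `diff_wt` and `diff_ht` are illustrative, and ten cases are far too few to draw conclusions.

```r
# First ten cases of Davis as printed by str(); illustration only.
weight <- c(77, 58, 53, 68, 59, 76, 76, 69, 71, 65)
height <- c(182, 161, 161, 177, 157, 170, 167, 186, 178, 171)
repwt  <- c(77, 51, 54, 70, 59, 76, 77, 73, 71, 64)
repht  <- c(180, 159, 158, 175, 155, 165, 165, 180, 175, 170)

# Measured minus reported: positive values mean the report was too low
diff_wt <- weight - repwt
diff_ht <- height - repht

c(mean = mean(diff_wt), sd = sd(diff_wt))
c(mean = mean(diff_ht), sd = sd(diff_ht))

# A constant reporting bias does not change the correlation
r_orig    <- cor(height, repht)
r_shifted <- cor(height, repht + 5)
c(original = r_orig, shifted = r_shifted)
```

On the full data the same idea would apply to the Davis columns, for example mean(Davis$weight - Davis$repwt, na.rm = TRUE), since the reported values contain missing cases.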

Michael Brannick

12/1/2017