library(car)           # calling the library lets you use the data and functions in it
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
str(Davis)       # list the variables and their classes
## 'data.frame':    200 obs. of  5 variables:
##  $ sex   : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 2 2 2 2 ...
##  $ weight: int  77 58 53 68 59 76 76 69 71 65 ...
##  $ height: int  182 161 161 177 157 170 167 186 178 171 ...
##  $ repwt : int  77 51 54 70 59 76 77 73 71 64 ...
##  $ repht : int  180 159 158 175 155 165 165 180 175 170 ...
  1. How many men and how many women are in the sample?
table(Davis$sex)
## 
##   F   M 
## 112  88

There are 112 women and 88 men.
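
As a small extension, the counts can be turned into proportions. The sketch below re-enters the two counts from the table output by hand, so it runs on its own without loading the data; the names `counts` and `props` are only illustrative.

```r
# Counts copied from the table(Davis$sex) output
counts <- c(F = 112, M = 88)

# Divide each count by the total to get the sample proportions
props <- counts / sum(counts)
props
```

With the data loaded, `prop.table(table(Davis$sex))` should give the same proportions.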

  2. Combining men and women, what are the means and standard deviations for weight, height, repwt, and repht (measured vs. self-reported weight and height)?
describe(Davis)  # the describe function comes from the 'psych' package
##        vars   n   mean    sd median trimmed   mad min max range  skew
## sex*      1 200   1.44  0.50    1.0    1.43  0.00   1   2     1  0.24
## weight    2 200  65.80 15.10   63.0   64.21 11.86  39 166   127  2.01
## height    3 200 170.02 12.01  169.5  170.32  9.64  57 197   140 -4.00
## repwt     4 183  65.62 13.78   63.0   64.27 11.86  41 124    83  1.03
## repht     5 183 168.50  9.47  168.0  168.19 10.38 148 200    52  0.33
##        kurtosis   se
## sex*      -1.95 0.04
## weight     8.96 1.07
## height    37.07 0.85
## repwt      1.33 1.02
## repht     -0.36 0.70

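For weight, the mean is 65.80 and the standard deviation is 15.10. For height, the mean is 170.02 (SD = 12.01); for reported weight, 65.62 (SD = 13.78); and for reported height, 168.50 (SD = 9.47). Note that the reported values are based on n = 183 rather than 200, because 17 values are missing for each.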
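
The smaller n for the reported variables matters if you compute these statistics with base R instead of `describe`. The sketch below uses the first ten `repwt` values from the `str` output, with one value deliberately replaced by `NA` for illustration (that replacement is an assumption, not part of the data), to show what `na.rm = TRUE` does.

```r
# First ten repwt values from the str() output; the sixth is set to NA
# here purely to illustrate missing data (hypothetical modification)
repwt <- c(77, 51, 54, 70, 59, NA, 77, 73, 71, 64)

m_all   <- mean(repwt)                # NA: one missing value makes the mean missing
m_valid <- mean(repwt, na.rm = TRUE)  # mean of the nine non-missing values
s_valid <- sd(repwt, na.rm = TRUE)    # SD of the nine non-missing values
n_valid <- sum(!is.na(repwt))         # number of non-missing values

c(mean = m_valid, sd = s_valid, n = n_valid)
```

`describe` appears to drop missing values by default, which would explain why it reports n = 183 for `repwt` and `repht`.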

  3. In these data, weight is in kilograms (kg) and height is in centimeters (cm). What is the mean body mass index (BMI) for the total sample? Note: BMI is weight in kilograms divided by the square of height in meters (divide cm by 100 to get meters, then square). A BMI below 18.5 is considered underweight, 18.5 to 24.9 healthy, 25 to 29.9 overweight, and 30 or higher obese.
BodyMassIndex <- Davis$weight/((Davis$height/100)^2)
describe(BodyMassIndex)
##    vars   n mean    sd median trimmed mad   min    max range  skew
## X1    1 200 24.7 34.68  21.84   22.11 2.6 15.82 510.93 495.1 13.77
##    kurtosis   se
## X1    190.1 2.45

The mean BMI is 24.7.
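
To make the unit conversion explicit, the BMI calculation can be wrapped in a small helper. This is only a sketch; the function name `bmi` and its argument names are made up for illustration. It is applied to the first person in the data (77 kg, 182 cm, taken from the `str` output).

```r
# Hypothetical helper: BMI from weight in kg and height in cm
bmi <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100   # convert centimeters to meters
  weight_kg / height_m^2        # kg divided by meters squared
}

# First person in the Davis data: 77 kg, 182 cm
first_bmi <- bmi(77, 182)
round(first_bmi, 2)
```

Because the arithmetic is vectorized, `bmi(Davis$weight, Davis$height)` would return one BMI per person, the same as the one-line computation used here.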

  4. Do you spot a problem in the BMI data? What do you think caused it? Fix it.
stem(BodyMassIndex)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    0 | 67778888888888888899999999999
##    2 | 00000000000000000000000000000000001111111111111111111111111111122222+90
##    4 | 
##    6 | 
##    8 | 
##   10 | 
##   12 | 
##   14 | 
##   16 | 
##   18 | 
##   20 | 
##   22 | 
##   24 | 
##   26 | 
##   28 | 
##   30 | 
##   32 | 
##   34 | 
##   36 | 
##   38 | 
##   40 | 
##   42 | 
##   44 | 
##   46 | 
##   48 | 
##   50 | 1
boxplot(BodyMassIndex)

stem(Davis$height)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    5 | 7
##    6 | 
##    7 | 
##    8 | 
##    9 | 
##   10 | 
##   11 | 
##   12 | 
##   13 | 
##   14 | 8
##   15 | 0234567777788899
##   16 | 00000111111222222223333333333344444445555555555566666666667777777888+2
##   17 | 000000001111122222333333333444445555555566666777778888888888889999
##   18 | 0000011222233333344445555677899
##   19 | 117

There is an outlier: someone has a BMI of about 511, which is not credible. The stem-and-leaf plot of height also shows one person with a height of 57 cm. Looking at the data, it appears that height and weight were transposed for one person when the data were entered: person 12 has a height of 57 and a weight of 166. Assuming these are backwards, the problem can be fixed by swapping the two numbers.
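
One way to find the offending row without scanning the data by eye is to ask R for the position of the extreme value. The sketch below uses a small stand-in data frame: the first five people from the `str` output plus the transposed record appended as a sixth row (it is person 12 in the full data). The 100 cm cutoff for an implausible adult height is an assumption chosen for illustration.

```r
# Stand-in data: first five rows of Davis plus the transposed record (row 6 here)
d <- data.frame(
  weight = c(77, 58, 53, 68, 59, 166),
  height = c(182, 161, 161, 177, 157, 57)
)
d$bmi <- d$weight / (d$height / 100)^2

suspect_bmi    <- which.max(d$bmi)       # row with the largest BMI
suspect_height <- which(d$height < 100)  # rows with an implausibly small height (assumed cutoff)

d[suspect_height, ]                      # inspect the suspicious record
```

On the full data, the same two calls applied to `Davis` should point to person 12.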

Davis$weight[12] <- 57  # swap the numbers
Davis$height[12] <- 166 # yes, both of them
BodyMassIndex <- Davis$weight/((Davis$height/100)^2) #recompute the BMI
stem(BodyMassIndex) # check the plot
## 
##   The decimal point is at the |
## 
##   15 | 8
##   16 | 9
##   17 | 14578899
##   18 | 111122356699
##   19 | 0234555566667789999
##   20 | 01122222222333334444445666677888999
##   21 | 00001111122334555566677889999
##   22 | 00011122444455557778888999
##   23 | 11223333344456778899
##   24 | 0256667788
##   25 | 001222445667889
##   26 | 013333445556
##   27 | 23358
##   28 | 46
##   29 | 78
##   30 | 12
##   31 | 
##   32 | 
##   33 | 
##   34 | 
##   35 | 
##   36 | 7

That’s better.
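
The swap above types the two values in by hand. A variant that avoids retyping numbers is to exchange the two cells through a temporary variable. This is a sketch on a three-row stand-in data frame (two real rows from the `str` output plus the transposed record); the row index is found with the same assumed 100 cm cutoff, and `tmp` is just a scratch name.

```r
# Stand-in data: two ordinary rows plus the transposed record in row 3
d <- data.frame(weight = c(77, 58, 166), height = c(182, 161, 57))

i <- which(d$height < 100)    # locate the transposed row (assumed cutoff)

tmp         <- d$weight[i]    # hold the mis-entered weight (really the height)
d$weight[i] <- d$height[i]    # move the true weight into place
d$height[i] <- tmp            # move the true height into place

d$bmi <- d$weight / (d$height / 100)^2   # recompute BMI after the fix
d
```

Either approach changes only the copy of the data in the current R session, so the correction has to be rerun (or saved) each time the data are reloaded.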

  5. What is the probability that someone randomly drawn from this sample would be underweight?

We could code BMI for under, healthy, over, and obese and then run a table as we did for men and women. Or we can just select (subset) those who are underweight (which is what I did). Then we need to compute the proportion of underweight people in the sample. That gives us the probability of drawing an underweight person at random.

under <- BodyMassIndex[BodyMassIndex < 18.5] # subset the BMI values of those who are underweight
describe(under)    # n in the output gives the count (18); we already know there are 200 people in total
##    vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 18 17.75 0.63  17.88   17.82 0.41 15.82 18.47  2.65 -1.59     2.29
##      se
## X1 0.15
p.sample <- 18/200 # find the proportion.
p.sample           # print the proportion
## [1] 0.09

The probability of drawing someone who is underweight at random from this sample is .09.
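
The other route mentioned above, coding BMI into categories and tabulating, might look like the sketch below. The BMI values are hypothetical, chosen only so that every category appears, and treating each band as closed on the left (for example, healthy as 18.5 up to but not including 25) is an assumption about how to handle values that fall between the published cutoffs.

```r
# Hypothetical BMI values, one or more per category
bmi_values <- c(17.9, 21.8, 24.9, 25.0, 29.9, 30.1, 36.7)

# Code BMI into the four categories; right = FALSE makes intervals [lower, upper)
category <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "healthy", "overweight", "obese"),
  right  = FALSE
)
table(category)

# The mean of a logical vector is the proportion of TRUE values
p_under <- mean(bmi_values < 18.5)
p_under
```

Applied to the real data, `mean(BodyMassIndex < 18.5)` should reproduce the proportion found above without counting by hand.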

  6. What is the probability that someone randomly drawn from a normally distributed population that has this sample’s mean and standard deviation would be underweight? We have to compute a value of z for the BMI of 18.5 given this sample’s mean and SD.
describe(BodyMassIndex) # to find the mean and SD of the sample
##    vars   n  mean sd median trimmed  mad   min   max range skew kurtosis
## X1    1 200 22.25  3   21.8   22.07 2.54 15.82 36.73 20.91 0.92     1.94
##      se
## X1 0.21
z.pop <- (18.5-22.25)/3 # to find z
z.pop                   # print z
## [1] -1.25
p.pop <- pnorm(z.pop)   # to find the probability
p.pop                   # print the probability
## [1] 0.1056498

The probability of drawing someone underweight at random from a normally distributed population with mean = 22.25 and SD = 3 is .11. This is pretty close to the proportion observed in our sample (p = .09).
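
The z-score step can also be folded into `pnorm`, which accepts the mean and standard deviation directly. The sketch below uses the rounded mean and SD from the `describe` output and checks that both routes agree; the variable names are illustrative.

```r
bmi_mean <- 22.25   # sample mean from describe(BodyMassIndex)
bmi_sd   <- 3       # sample SD from describe(BodyMassIndex), as rounded in the output

# Route 1: standardize first, then use the standard normal
z     <- (18.5 - bmi_mean) / bmi_sd
p_via_z <- pnorm(z)

# Route 2: let pnorm do the standardizing
p_direct <- pnorm(18.5, mean = bmi_mean, sd = bmi_sd)

c(z = z, p_via_z = p_via_z, p_direct = p_direct)
```

Both routes give the lower-tail area, that is, the probability of a BMI below 18.5 under the assumed normal distribution.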

  7. Discussion question (ask your partner, then we will discuss as a class): How could you decide whether men or women were more accurate in reporting their height? Weight? (Use numbers and graphs rather than statistical tests at this point. How would you examine the data, and what would you show people to convince them?)

We could compute the difference between reported and measured weight for each person and then see whether the mean difference (or the mean of the absolute differences) was larger for men or for women. (Testing that would be an independent samples t-test.) Or we could create a scatterplot where each person was represented by a point with Y = reported and X = measured values. We could see the relation between measured and reported values for each sex, and whether that relation was similar for males and females. (Testing that would be analysis of covariance.) You will see how to compute and interpret such statistical tests later in the course.
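
The first idea, comparing differences by sex, can be sketched with the first ten people from the `str` output (three women and seven men, far too few to draw conclusions from). The column name `diff_wt` is made up for illustration, and `na.rm = TRUE` is included because the full data have missing reported values.

```r
# First ten rows of Davis, copied from the str() output
d <- data.frame(
  sex    = factor(c("M", "F", "F", "M", "F", "M", "M", "M", "M", "M")),
  weight = c(77, 58, 53, 68, 59, 76, 76, 69, 71, 65),
  repwt  = c(77, 51, 54, 70, 59, 76, 77, 73, 71, 64)
)

# Reported minus measured weight: positive means over-reporting
d$diff_wt <- d$repwt - d$weight

# Mean signed difference (direction of bias) and mean absolute difference (size of error) by sex
mean_diff     <- tapply(d$diff_wt, d$sex, mean, na.rm = TRUE)
mean_abs_diff <- tapply(abs(d$diff_wt), d$sex, mean, na.rm = TRUE)

rbind(mean_diff, mean_abs_diff)
```

For the graphical version, a scatterplot of `repwt` against `weight` with the line `abline(0, 1)` added would show perfectly accurate reports as points on the line, so the scatter around it could be compared for men and women.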