Here we want to test whether a sample was likely drawn from a given population (technically, we are computing a one-sample z test).
We have data from the Davis dataset:
Davis, C. & Cowles, M. (1991). Body image and exercise: A study of relationships and comparisons between physically active men and women. Sex Roles, 25, 33-44.
Invoke the library:
library(car) # calling the library lets you use the data and functions in it
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
Davis$weight[12] <- 57 # one person's data were transposed; this fixed it.
Davis$height[12] <- 166
# we need to find the mean and N (sample size) for the sample.
BodyMassIndex <- Davis$weight/((Davis$height/100)^2) # compute BMI on corrected data
describe(BodyMassIndex)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 200 22.25 3 21.8 22.07 2.54 15.82 36.73 20.91 0.92 1.94
## se
## X1 0.21
You can see that N is 200 and the mean is 22.25. The standard error of the mean is the standard deviation divided by the square root of N = 3/sqrt(200). This is what we need to construct a sampling distribution under the null hypothesis and find the rejection regions.
Aside: by slight of hand, SD in the sample is the same as the assumed SD in the population. Usually we just estimate the population SD with the sample SD, and with 200 people, this will work pretty well.
M <- 23 # hypothesized population mean
SD <- 3 # hypothesized population SD
CanM <- 22.25 # from the output of 'describe'
SEM <- SD/sqrt(200) # computed standard error with 200 people
lo <- M-1.96*SEM # computed lower bound RR
hi <- M+1.96*SEM # computed upper bound RR
lo # print lower bound
## [1] 22.58422
hi # print upper bound
## [1] 23.41578
We have computed the standard error and the edges of the rejection region. Now to graph:
x <- seq(-5,4,length=100)*SEM + M # spots on the normal to be graphed
hx <- dnorm(x,M,SEM) # find the height at each spot using the rare 'd' function
plot(x, hx, xlab="BMI", ylab="Relative Frequency", # run the plot - x, hx is a scatterplot type 'l' is a line
main="BMI in Canada ", type='l', axes=T) # run the plot
abline(v=23, lwd=2) # fat line for the population mean
abline(v=lo, lwd=1) # thin line for lower bound
abline(v=hi, lwd=1) # thin line for upper bound
abline(v=CanM, lwd=1, col='red') # red line for Canadian mean
text(22.5,.01, 'RR') # label for rejection region - placed by trial and error using axes
text(23.5,.01, 'RR') # label for rejection region
text(23, .01, 'N.S.')
We just find the z score for the sample mean and its associated probability from the normal.
z.CanM <- (22.25-23)/SEM # z score for sample mean
p.CanM <- pnorm(z.CanM) # find p value for sample mean
z.CanM # print sample mean z score
## [1] -3.535534
p.CanM # print p value for sample mean
## [1] 0.000203476
Yes, it is significant. It is very unlikely that a mean as low as 22.25 based on 200 people would have been sampled from a population with a mean of 23 and a standard deviation of 3. Notice that the red vertical line representing the Canadian sample in Davis falls within one of the rejection regions, meaning that the result is statstically improbable, i.e., significant.