The notion is that as the sample size increases, the sampling distribution of many statistics becomes approximately normal. In this illustration, you'll first see the distribution of exercise in the Blackmore data. This is one of the most horribly skewed distributions I have ever seen. Then we will use R to sample randomly from this distribution and compute the mean of each sample. That is, we will compute an empirical sampling distribution. You will see that as the sample size increases, the empirical sampling distribution becomes increasingly normal.
First, the original distribution.
library(car)
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
describe(Blackmore$exercise)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 945 2.53 3.5 1.33 1.79 1.63 0 29.96 29.96 2.87 11.13
## se
## X1 0.11
boxplot(Blackmore$exercise, main='Exercise')
Notice that the mean is about 2.5 but the maximum is nearly 30, and the skew is 2.87, which indicates a very pronounced right tail.
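If you want to see that tail directly rather than through summary numbers, a histogram of the raw scores is useful. This is an optional sketch; it uses the same Blackmore$exercise vector loaded above, and the number of breaks is an arbitrary choice.
# Histogram of the raw exercise scores; the long right tail should be obvious
hist(Blackmore$exercise, breaks = 30,
     main = 'Exercise (raw data)', xlab = 'Exercise')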
Now we sample 10 observations from Blackmore$exercise with replacement and compute the mean of those 10 observations. We do this 1000 times and plot the empirical sampling distribution.
# Pre-allocate a vector to hold the 1000 sample means
M_Exer <- numeric(1000)
for (i in 1:1000) {
  # Draw 10 observations with replacement and store their mean
  Exer <- sample(Blackmore$exercise, size = 10, replace = TRUE)
  M_Exer[i] <- mean(Exer)
}
boxplot(M_Exer, main='Blackmore Exercise Means (n=10)')
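If you would like numbers to go with the plot, describe() from psych works on the vector of means as well. Because sample() is random, your exact values will differ from run to run, but the skew should come out far smaller than the raw data's 2.87.
# Numeric summary of the 1000 sample means
describe(M_Exer)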
Note how much more symmetric the distribution of the sample means is compared to the raw data. Now we do the same thing, but instead of sampling 10 observations at a time, we sample 100.
# Repeat the simulation, now averaging 100 observations per sample
M_Exer <- numeric(1000)
for (i in 1:1000) {
  Exer <- sample(Blackmore$exercise, size = 100, replace = TRUE)
  M_Exer[i] <- mean(Exer)
}
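As a side note, the same simulation can be written in one line with base R's replicate(), which evaluates an expression repeatedly and collects the results. This sketch is equivalent to the loop above and produces the same kind of M_Exer vector, so the boxplot below works with either version.
# One-line equivalent of the loop: 1000 means of 100 resampled observations
M_Exer <- replicate(1000, mean(sample(Blackmore$exercise, size = 100, replace = TRUE)))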
boxplot(M_Exer, main='Blackmore Exercise Means (n=100)')
Now the observations in the empirical sampling distribution (the sample means) are nearly balanced around 2.5, and the distribution approaches normality, even though the parent distribution (the raw data) was awful.
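The central limit theorem also makes a quantitative prediction: the standard deviation of the sample means (the standard error) should be close to the raw standard deviation divided by the square root of the sample size, here 3.50/sqrt(100) = 0.35. The following check is a sketch; your values will wobble a bit from run to run because the resampling is random.
# Compare the simulated spread of the means to the CLT prediction sd/sqrt(n)
sd(M_Exer)
sd(Blackmore$exercise) / sqrt(100)
# A normal quantile plot of the means should lie close to a straight line
qqnorm(M_Exer)
qqline(M_Exer)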