Notes on the Normal distribution

Definition

The normal distribution is sometimes called ‘the bell curve’ because the graph of the normal has the shape of a bell. The bell can be wider or narrower, and can move up and down the number line or scale. It always has a bell shape, but the particular form of the bell is determined by its two parameters, the mean (\(\mu\)) and the variance (\(\sigma^2\)). The equation that specifies the normal is \[f(x; \mu, \sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/2\sigma^2}.\]

The symbols for mean and variance were defined above the equation. The symbol \(\pi\) is a constant, about 3.14, and the symbol ‘e’ is the base of the natural logarithm, about 2.718. But you don’t need any of these if you want to work with the normal in R. If you want to draw the normal in R, you can use the following code:

M = 50 # set the mean
SD = 10 # set the standard deviation (square root of the variance)
x <- seq(-4,4,length=100)*SD + M   # spots on the normal to be graphed
hx <- dnorm(x,M,SD)                # find the height at each spot 
plot(x, hx, xlab="Score", ylab="Density",   
     main="Normal Curve ", type='l', axes=T)        # run the plot

The normal distribution is not the only theoretical or mathematical distribution that is used in statistical work. But it plays a very prominent role in many statistical tests. Its relatives, t, \(\chi^2\), and F, are also very commonly used.

You can think of the normal curve as resulting from lots of little, independent stray causes. For example, the height of people is approximately normally distributed. We have lots of bones that vary more or less independently. In tall people, they all tend to be long. In average people, some are longer and some are shorter. If you track your weight every day, the distribution is likely to be approximately normal. You exercise different amounts and eat different things and wear different clothes each day (well, most of us do, anyway). Lots of little, independent stray causes result in a nice, bell-shaped distribution. The ‘Galton Board’ is a pegboard display that shows how dropping marbles over properly spaced pegs results in a distribution of marbles that is approximately normal. See Wolfram MathWorld (or Google the Galton Board). http://mathworld.wolfram.com/GaltonBoard.html
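
A quick way to see this idea is to simulate it. The sketch below (the sample sizes and the coin-flip ‘causes’ are arbitrary choices, not part of the Galton Board page) adds up many small, independent random pushes for each simulated person and plots the resulting scores:

set.seed(1)                                    # arbitrary seed so the picture is reproducible
n_people <- 10000                              # number of simulated people
n_causes <- 40                                 # number of small, independent causes per person
scores <- replicate(n_people, sum(sample(c(-1, 1), n_causes, replace = TRUE)))
hist(scores, breaks = 30, xlab = "Score",
     main = "Sum of many small independent causes")   # roughly bell-shaped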

Because the normal distribution is truly continuous (no space between scores; real numbers) and has infinite tails (no upper or lower boundary), real data will never be truly normally distributed. But we can often get close enough for practical purposes. In other words, even though the math cannot be rigorously defended, it turns out to be useful in practice. Further, many statistical tests rest on the sampling distribution of a statistic such as the mean, which can be shown to become normal as the sample size increases. In such cases, the justification is stronger.

Areas of the Normal

One of the main uses of the normal is to figure probabilities. The probabilities correspond to areas of the normal. So if we know that a distribution is normally distributed, we can compute the percentage of the normal for any given value and thus calculate a probability corresponding to the percentage. For example, 50 percent of the normal distribution lies above the mean, and about 16 percent lies above one standard deviation above the mean. So let’s say we want to know how far apart to space rows of seats in an airplane. We want to space the seats so that passengers’ knees don’t collide with the seat in front of them. Thus we are interested in the length of the upper leg, which we measure from the back to the outside of the knee for 200 seated people. If the measurements are normally distributed, we can calculate the percentage of passengers in general that will bump knees for any given spacing between rows. You may have noticed that airlines’ desire to avoid bumping knees struggled with their desire to fit more people into the aircraft, and thus to increase revenue per flight. As usual, money won.
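
With made-up numbers for upper leg length (the mean of 24 inches and standard deviation of 1.2 inches below are purely hypothetical, not measurements), the calculation would look something like this:

M <- 24        # hypothetical mean upper leg length, in inches
SD <- 1.2      # hypothetical standard deviation
spacing <- 26  # hypothetical room allowed for the upper leg, in inches
1 - pnorm(spacing, mean = M, sd = SD)   # proportion expected to bump knees (about 5 percent here)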

It’s customary to find percentages (and thus probabilities) of the normal by working from the bottom of the distribution up to the point of interest. In other words, we examine cumulative probability. So if we choose the mean, 50% lies from the bottom (minus infinity) to the mean. If we choose one standard deviation above the mean, then about 84% lies from the bottom to that value (remember, about 16% lies above).

To illustrate, consider the following graph, which is based on the unit normal, in which the mean is zero and the standard deviation is one:

library(ggplot2)   # load ggplot2 for this and the later plots
M = 0 # set the mean
SD = 1 # set the standard deviation (square root of the variance)
x <- seq(-3,3,length=100)*SD + M   # spots on the normal to be graphed
hx <- dnorm(x,M,SD)                # find the height at each spot 
plotdat1 <- data.frame(x, hx)
ggplot(data=plotdat1, aes(x=x, y = hx))+ ylim(c(0,.5))+
  geom_line(color='black')+
   geom_area(mapping = aes(x = ifelse(x<=1, x, 0)), fill = "gold")+
 labs(x="Score", y="Density")

If we look up the value of pnorm for x = 1, we find the golden area marked in the graph:

pnorm(1)
## [1] 0.8413447

which is about 84 percent. So returning to our airline seats, if we set the distance between seats to be one standard deviation above the mean upper leg length, we would expect that about 16 percent of passengers would bump knees against the seat in front of them.
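
That 16 percent figure is just the area of the unit normal above z = 1:

1 - pnorm(1)      # area above one standard deviation above the mean
## [1] 0.1586553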

We can also use the normal to estimate probabilities for intervals. Suppose that student evaluations of classroom teaching are approximately normal (M=4.2, SD = .3), and we want to estimate the probability that our new faculty will obtain a rating between 4 and 4.5. First we convert the desired scores into z scores (this was needed for the unit normal table; it isn’t strictly necessary these days because of better software, but it keeps the logic clear). The score 4 converts to z1 = (4 - 4.2)/.3 = -.67, and 4.5 converts to z2 = (4.5 - 4.2)/.3 = 1. The probability corresponds to the area in between. If we subtract the cumulative probability at z1 from the cumulative probability at z2, we have the desired area and thus the desired probability.

z1 <- (4 - 4.2)/.3
z2 <- (4.5 - 4.2)/.3
p1 <- pnorm(z1)
p2 <- pnorm(z2)
prob <- p2-p1
results <- data.frame(z1, z2, p1, p2, prob)
results
##           z1 z2        p1        p2      prob
## 1 -0.6666667  1 0.2524925 0.8413447 0.5888522

About 59 percent of the curve is gold, and thus the probability of finding a teaching evaluation in that range is about .59. The same problem is represented in the graph shown below.

M = 4.2 # set the mean
SD = .3 # set the standard deviation (square root of the variance)
x <- seq(-2, 2,length=100)*SD + M   # spots on the normal to be graphed
hx <- dnorm(x,M,SD)                # find the height at each spot 
plotdat1 <- data.frame(x, hx)
ggplot(data=plotdat1, aes(x=x, y = hx))+ ylim(c(0, 1.5))+xlim(c(3.5, 5))+
  geom_line(color='black')+
  geom_area(mapping = aes(x = ifelse(x >=4 & x<=4.5, x, 0)), fill = "gold")+
  labs(x="Score", y="Density")
## Warning: Removed 58 rows containing missing values (position_stack).

R functions

There are four functions in R for the normal distribution: pnorm, qnorm, dnorm, and rnorm. The function pnorm returns the cumulative area (from the bottom up) or probability for the unit normal, so pnorm(0) = .5, pnorm(-1) is about .16, and pnorm(1) is about .84. The function qnorm is the inverse of pnorm, and thus returns a quantile or value of z that corresponds to a particular area or probability. Thus qnorm(.5) = 0, qnorm(.16) = -1 (about), and qnorm(.84) = 1 (almost). The function dnorm returns a density or height of the curve at a given value of the unit normal; dnorm(0) = .3989. This is not used as often as the other functions, but as you might have noticed in the code that generated the graphs, it is used in plotting the normal (among other things). The final function is rnorm, which is used for random sampling. Unless you set a seed (with set.seed), different random numbers will be sampled each time you call this function. This function is very useful in computer simulations about statistics. Such studies are sometimes called ‘Monte Carlo’ studies.

pnorm

The function ‘pnorm’ returns probabilities. Given a score on the unit normal, it returns the proportion of the normal from the bottom to the score. For example, pnorm(0) = .50 because zero is the mean, and that is halfway up from the bottom to the top. Some examples:

# probabilities from the unit normal distribution (M=0, SD=1)
values <- c(-1.96, -1, 0, 1, 1.96)         # 
pnorm(values)                             # compute & print the output 
## [1] 0.0249979 0.1586553 0.5000000 0.8413447 0.9750021

Note that -1.96 and 1.96 result in values very close to .025 and .975, respectively, and that zero results in .50. Note that -1 and 1 result in about .16 and .84. The bottom 16 percent fall below -1 and the top 16 percent fall above 1. The values returned by pnorm correspond to the area under the normal curve from minus infinity up to the input score.
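
Note also that pnorm accepts mean and sd arguments, so you can skip the conversion to z scores. Using the teaching-evaluation numbers from earlier:

pnorm(4.5, mean = 4.2, sd = .3)    # same as pnorm(1) on the unit normal
## [1] 0.8413447
pnorm(4.5, mean = 4.2, sd = .3) - pnorm(4, mean = 4.2, sd = .3)   # rating between 4 and 4.5
## [1] 0.5888522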

qnorm

The qnorm function returns quantiles - the values of z, the unit normal - from cumulative probabilities (proportions of the normal; areas under the curve from the bottom up). It is the inverse of the pnorm function.

Some examples:

# quantiles (z values) of the unit normal (M=0, SD=1) from cumulative probabilities
probs <- c(.025, .05, .16, .50, .84, .95, .975)         # 
qnorm(probs)                             # compute & print the output 
## [1] -1.9599640 -1.6448536 -0.9944579  0.0000000  0.9944579  1.6448536
## [7]  1.9599640
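
Because qnorm and pnorm are inverses, running a value through one and then the other returns the original value:

qnorm(pnorm(1.5))     # returns the original z value
## [1] 1.5
pnorm(qnorm(.975))    # returns the original probability
## [1] 0.975
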
rnorm

The rnorm function returns random deviates from the unit normal. If we choose a lot of them, we would expect that about 95% of them would fall between -1.96 and 1.96. Here are a few:

# random deviates from the unit normal distribution (M=0, SD=1)
deviates <- rnorm(5)         # 
deviates                     # compute & print the output 
## [1] -0.3468932 -0.4168497  1.0498639  1.2353753 -0.3822176
devs2 <- rnorm(10, 50, 5)   #  normal with M=50, SD=5
devs2
##  [1] 51.45418 55.29422 51.20100 46.53877 51.41547 52.98663 45.33288
##  [8] 40.06940 44.08733 44.01846
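
As a quick check of the claim about 95 percent, we can draw a large sample and compute the proportion that falls between -1.96 and 1.96 (the exact proportion will vary a bit from run to run; the seed below is arbitrary):

set.seed(42)                          # arbitrary seed for reproducibility
z <- rnorm(100000)                    # a large sample from the unit normal
mean(z > -1.96 & z < 1.96)            # proportion inside -1.96 to 1.96; should be close to .95
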
dnorm

The dnorm function returns the height of the curve of the unit normal at user-specified places on the curve. Here are a few:

# densities (heights of the curve) of the unit normal distribution (M=0, SD=1)
heights <- dnorm(c(-1.96, -1, 0, 1, 1.96))         # 
heights
## [1] 0.05844094 0.24197072 0.39894228 0.24197072 0.05844094

Importance of the Normal

Many measures of interest to psychologists are well represented by the normal distribution. Many psychological tests yield distributions that are approximately normal when gathered over large, representative samples of adults (e.g., cognitive ability tests, personality tests). Many physical measurements, such as height, also appear approximately normal.

Distributions of errors are often assumed to be normally distributed. For example, in classical test theory, although the standard error of measurement does not require an assumption about the distribution of errors, setting a confidence interval about an individual’s score on a test does require some assumption about the distribution of errors, and we typically assume that the distribution is normal. The same is true for errors of prediction in regression.
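
As a sketch of that idea (the score, standard deviation, and reliability below are made-up values): the standard error of measurement is commonly computed as \(SD\sqrt{1-r_{xx}}\), where \(r_{xx}\) is the test’s reliability, and a 95 percent confidence interval about an observed score assumes the errors are normal:

score <- 110       # hypothetical observed test score
SDx <- 15          # hypothetical standard deviation of the test
rxx <- .90         # hypothetical reliability of the test
sem <- SDx * sqrt(1 - rxx)              # standard error of measurement
score + c(-1, 1) * qnorm(.975) * sem    # 95 percent confidence interval about the score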

More generally, the two parameters of the normal (the mean and the variance) are independent of one another. This is helpful to mathematical statisticians who want to derive properties of statistics computed on a distribution. For other distributions (e.g., those of the proportion and the correlation coefficient), the mean and the variance are dependent, and so the equations are more complicated.
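
To see what ‘dependent’ means here, recall that the variance of a sample proportion is \(p(1-p)/N\), so it changes as its mean \(p\) changes; for the normal, the variance stays put no matter where the mean goes:

p <- c(.1, .3, .5, .7, .9)   # several values of the proportion (its mean)
N <- 100                     # sample size
p * (1 - p) / N              # the variance of a sample proportion changes with p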

Because of the Central Limit Theorem, which states that the sampling distribution of the mean (and of many other statistics) approaches the normal as the sample size increases (toward infinity), the normal forms the foundation for many statistical tests.
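
A small simulation illustrates the theorem. Even when raw scores come from a skewed distribution (an exponential here; the distribution and sample sizes are arbitrary choices), the distribution of sample means looks more and more normal as N grows:

set.seed(123)                                    # arbitrary seed for reproducibility
means_n5  <- replicate(5000, mean(rexp(5)))      # means of skewed (exponential) samples, N = 5
means_n50 <- replicate(5000, mean(rexp(50)))     # means of the same kind of samples, N = 50
par(mfrow = c(1, 2))                             # two panels side by side
hist(means_n5,  breaks = 30, main = "Means, N = 5",  xlab = "Sample mean")
hist(means_n50, breaks = 30, main = "Means, N = 50", xlab = "Sample mean")
par(mfrow = c(1, 1))                             # reset the layout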

Single Sample Significance Tests of the Mean

We can use the normal distribution to test hypotheses about the mean of a population given a single sample of observations if either we have a large sample of people (say N=200) or we are willing to make assumptions about the population values (in particular, the population mean and standard deviation).

Suppose we have been teaching introductory statistics for years to large classes, and suppose further that we have been administering the same items to the students across the years and the scores have not changed systematically, so we can pool all the data. On the second exam, which contains 75 questions about hypothesis testing, the mean score is 60, with a standard deviation of 4, and a bell-shaped distribution. Suppose we are willing, on the basis of these data, to assume that in the population of examinees, the mean is 60, the standard deviation is 4, and the data are normally distributed. Our population should look like this:

M = 60 # set the mean
SD = 4 # set the standard deviation (square root of the variance)
x <- seq(-3, 3,length=100)*SD + M   # spots on the normal to be graphed
hx <- dnorm(x,M,SD)                # find the height at each spot 
plotdat1 <- data.frame(x, hx)
ggplot(data=plotdat1, aes(x=x, y = hx))+ ylim(c(0, .2))+xlim(c(50, 70))+
  geom_line(color='black')+
  labs(x="Test Score (M=60, SD=4)", y="Relative Frequency")
## Warning: Removed 18 rows containing missing values (geom_path).

We are going to introduce a computer tutor to the next class, and we want to know whether the tutor helps the students to perform better on the exam. (Typically, we would rather design a randomized controlled trial in which half the students got the tutor and half did not, but there are often constraints in study design; for example, the local IRB wouldn’t allow students in the same class to receive different instruction and then be graded on the same test.)

So let’s say we give the same exam to N=100 students who were given the computer tutor in addition to ordinary instruction. It’s our hope that the tutor improved scores, but it’s also possible that it mystified people instead of helping them, and thus resulted in worse performance. So what we can do is set up our expectations analytically. If the tutor has no effect, then we expect the mean score of the new students to fall within the range we would expect if no tutor were available. Because we are talking about the mean and not the raw distribution, the result we expect to see if there is no effect of the tutor looks like this:

M = 60 # set the mean
SD = 4 # set the standard deviation (square root of the variance)
x <- seq(-3, 3,length=100)*SD + M   # spots on the normal to be graphed
hx <- dnorm(x,M,SD)                # find the height at each spot 
plotdat1 <- data.frame(x, hx)

M = 60 # set the mean
SD = .4 # set the standard deviation (square root of the variance)
x <- seq(-3, 3,length=100)*SD + M   # spots on the normal to be graphed
hx <- dnorm(x,M,SD)                # find the height at each spot 
plotdat2 <- data.frame(x, hx)
ggplot(data=plotdat1, aes(x=x, y = hx))+ ylim(c(0, 1.1))+xlim(c(50, 70))+
  geom_line(color='black')+
  geom_line(data=plotdat2, color='green')+
  labs(x="Test Score (M=60, SD=4)", y="Relative Frequency")
## Warning: Removed 18 rows containing missing values (geom_path).

The raw scores on the test are represented in the graph by the black line, and the means of random samples of size 100 are represented by the green line. The distribution of means is much narrower than the distribution of raw scores (differences among individuals tend to cancel out when we take the mean of 100 people).
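
You can check the narrowing by simulation; the standard deviation of many sample means should come out close to .4, the value used for the green curve:

set.seed(7)                             # arbitrary seed for reproducibility
sample_means <- replicate(2000, mean(rnorm(100, mean = 60, sd = 4)))
sd(sample_means)                        # should be close to 4/sqrt(100) = .4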

If we get rid of the raw data so that the means are more easily viewed, we can picture where the mean of the new class is likely to fall if the tutor has no effect. To figure this, we will go up and down 1.96 standard errors (standard deviations of the distribution of means) and mark off the distribution. Aside: the standard error of the mean is computed by \(SEM = \sigma/\sqrt{N}\). Inside the marks we will have 95 percent of the distribution. Thus, 95 times in 100, the mean score would fall in this range if the tutor has no effect. Let’s look:

M = 60 # set the mean
SD = .4 # set the standard deviation (square root of the variance)
upper <- M+1.96*SD
lower <- M-1.96*SD
limits <- c(lower, upper)
x <- seq(-3, 3,length=100)*SD + M   # spots on the normal to be graphed
hx <- dnorm(x,M,SD)                # find the height at each spot 
plotdat2 <- data.frame(x, hx)
ggplot(data=plotdat2, aes(x=x, y = hx))+ ylim(c(0, 1.1))+xlim(c(58, 62))+
  geom_line(color='green')+
  labs(x="Test Score (M=60, SD=4, N=100)", y="Relative Frequency")+
  geom_vline(xintercept = upper, color='green', linetype = 2)+
  geom_vline(xintercept=lower, color='green', linetype = 2)+
    geom_area(mapping = aes(x = ifelse(x >=lower & x<=upper, x, 0)), fill = "gold")+
  geom_text(x=59, y= .15, label='RR')+
geom_text(x=61, y= .15, label='RR')
## Warning: Removed 36 rows containing missing values (position_stack).

limits
## [1] 59.216 60.784

If we find a mean from the new class (using the tutor) that is less than 59.22 or greater than 60.78, then we will conclude that the tutor had an effect. We will conclude this because such a result would be very unlikely if the computer tutor did nothing. If it did nothing, we would very likely find a result between 59.22 and 60.78. The places in the distribution that are unlikely given no effect are called the rejection region in statistics, and are labeled ‘RR’ in the graph.

Note that all this setup for our decision was made before we collected any data. All we have to do now is run the study and compute the mean. Suppose that when we collect the data from the sample of 100 people, their mean score is 61. We would conclude that the tutor had an effect; the tutor appeared to raise the scores by about a point. On the other hand, if the observed mean were 59.5, we could not conclude that the tutor had an effect, because the result falls within the range we expected to see if the tutor had no effect; that is, we would expect such a result even if we had not introduced the tutor.
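
To connect this back to the R functions above, the observed mean of 61 can be converted to a z statistic and a two-tailed p-value; this is just a sketch of the usual large-sample z test for this example:

M_pop <- 60                     # assumed population mean under 'no effect'
SEM <- 4 / sqrt(100)            # standard error of the mean
z_obs <- (61 - M_pop) / SEM     # z statistic for the observed mean of 61
z_obs                           # 2.5, outside the -1.96 to 1.96 range
2 * pnorm(-abs(z_obs))          # two-tailed p-value, about .012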

I’ve glossed over some important aspects of null hypothesis significance testing in presenting this example. The point here is to illustrate one way that the normal distribution can be used in statistical significance testing. We’ll spend more time on null hypothesis testing in several other lectures.