Explore Data 1

Computing Simple Descriptive Statistics

In this module, we will use two packages: car and psych. If you have not already installed them, do so before running the code below; otherwise R will stop with an error when it reaches the ‘library’ function.
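
One way to check for the packages up front is sketched below. It only reports what is missing; the install command is left as a comment so that nothing gets downloaded without your say-so. The object names ‘needed’, ‘installed’, and ‘missing_pkgs’ are my own illustrative choices.

```r
# Packages this module relies on
needed <- c("car", "psych")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # To install them, run: install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```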

library(car)  # car - Companion for Applied Regression - has lots of data pre-loaded

## Loading required package: carData

library(psych) # include the psych package for descriptive stats

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

When I collect data, the first thing I do is to clean it. I check to see if the values of the variables in the data are plausible/legal. For example, if I measured the weight of people in pounds, then values like 999, 1, and -3 would not make sense, so if I find them, I go check how they got there. I want to be sure that the computer sees the same data that I input, and that the variable names are associated with the correct variables. Verifying the data as a first step can save a whole lot of trouble later.

I also want to examine various plots or visual representations of the data. In this module, I will only show you one of them (the stem-leaf diagram). The rest will wait until we show you ggplot and conventions for building nice graphs in R. Then we will show you histograms, boxplots, and density plots.

Normal Data - like really normal

Before examining some real data, let’s have a look at some random numbers drawn from the normal distribution.

# sample some data from the normal distribution (N=100, M=50, SD=10)
Sample1 <- rnorm(100,50,10)         # random normal
Sample1                             # print the data; R is case sensitive

##   [1] 69.69092 48.23872 53.35433 55.39809 40.26705 54.03659 54.89827
##   [8] 48.35580 60.68953 39.26803 60.55599 37.42201 35.62847 66.19980
##  [15] 51.21953 48.19199 26.29395 37.62048 63.81775 55.84073 59.39952
##  [22] 38.82573 45.48527 38.01806 47.16854 38.01388 35.57007 43.51347
##  [29] 54.42072 38.04604 51.72354 38.68013 43.54621 57.77375 53.12246
##  [36] 54.71049 60.93628 62.25070 56.47913 50.48550 34.71060 41.38428
##  [43] 59.14472 57.52426 51.96132 36.93142 58.16203 75.73236 64.07081
##  [50] 56.69094 66.09361 37.97685 56.48466 63.71919 49.34222 40.46277
##  [57] 46.77498 45.09179 44.57753 52.74813 42.08015 53.80406 53.27261
##  [64] 71.71845 49.59182 45.06335 48.49740 49.80885 47.85720 46.61490
##  [71] 43.89037 33.42658 42.31905 45.99708 48.61631 51.84483 53.78780
##  [78] 51.41850 56.95436 69.29667 44.46923 55.69625 44.87581 49.21966
##  [85] 45.09274 54.43527 76.32790 38.65756 40.66105 44.94681 44.72860
##  [92] 46.12108 39.68655 50.79565 60.01282 42.06604 58.65881 45.37919
##  [99] 26.47039 53.67733

describe(Sample1)                   # the function 'describe' comes from the 'psych' package

##    vars   n  mean   sd median trimmed   mad   min   max range skew
## X1    1 100 49.83 9.98  49.28   49.42 10.02 26.29 76.33 50.03 0.28
##    kurtosis se
## X1    -0.04  1

The ‘rnorm’ function creates random observations from a normal distribution with mean and standard deviation supplied by the user (here 50 and 10, respectively). The 100 observations are stored in the object Sample1.
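
Because the draws are random, your numbers will differ from the ones printed here every time you run the code. If you want repeatable results, base R's ‘set.seed’ function fixes the starting point of the random number generator. A minimal sketch follows; the seed value 123 and the object names are arbitrary choices of mine, not part of the example above.

```r
set.seed(123)                               # any fixed integer works; 123 is arbitrary
first_draw  <- rnorm(100, mean = 50, sd = 10)

set.seed(123)                               # reset to the same seed
second_draw <- rnorm(100, mean = 50, sd = 10)

identical(first_draw, second_draw)          # TRUE: same seed, same numbers
length(first_draw)                          # 100 observations
```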

When I execute the command ‘Sample1’, R prints the 100 values, as you should see above. If data are missing, R will print ‘NA’ instead of a value. Notice that the data are printed to 5 decimal places, which is excessive for psychology. Also notice that we can tell that there are 100 observations as there should be, but we can’t tell much about the distribution from just looking at a series of numbers.

The ‘describe’ function comes with the psych package. The output shows the variable number (vars, here 1, because there is only one variable), the number of observations (n, here 100), and the descriptive statistics. We expect the mean to be 50 and the standard deviation to be 10, but because we have only a sample of N = 100, we typically do not get exactly 50 and 10. The skew and kurtosis should be near zero (again, sampled from a normal distribution), and the standard error should be about 1 (SD/sqrt(N) = 10/10 = 1).
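
The standard error arithmetic can be checked by hand with base R. This is only a sketch: the seed and the object names ‘x’, ‘n’, and ‘se’ are my own, so the sample is not the same one printed above.

```r
set.seed(1)                                  # arbitrary seed so the sketch is repeatable
x <- rnorm(100, mean = 50, sd = 10)

n  <- length(x)                              # sample size
se <- sd(x) / sqrt(n)                        # standard error of the mean

se                                           # close to, but not exactly, 1
10 / sqrt(100)                               # value expected from the population SD: 1
```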

When cleaning the data, always look for the sample size (n), mean, standard deviation, minimum and maximum. The minimum and maximum are particularly helpful in detecting errors in the data (e.g., a record indicates that a person weighs -3 pounds).
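
Here is a small sketch of that kind of check in base R. The weights are made up, and the limits of 50 and 500 pounds are assumptions of mine; pick limits that make sense for your own data.

```r
# Hypothetical weights in pounds; 999, 1, and -3 are the implausible entries
weight_lb <- c(150, 182, 999, 135, 1, 210, -3, 164)

# Assumed plausible limits for adult weight in pounds (adjust for your own data)
lower <- 50
upper <- 500

range(weight_lb)                              # minimum and maximum: -3 and 999

# Positions and values of the records that fall outside the limits
suspect <- which(weight_lb < lower | weight_lb > upper)
suspect                                       # 3 5 7
weight_lb[suspect]                            # 999 1 -3
```

Knowing the positions of the suspect records makes it easy to go back to the original data sheets and see how those values got there.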

The simple graph that I show in this module is the stem-leaf diagram. First I will round the data to the nearest whole number, then run ‘describe’ on the rounded data, and finally ask for a stem-leaf plot.

# The stem-leaf plot shows actual data along 
#  with the distribution shape
Sample2 <- round(Sample1,0) # Round to the nearest whole number.
                            #  That is, keep 0 digits after the decimal point.
describe(Sample2)

##    vars   n  mean sd median trimmed   mad min max range skew kurtosis se
## X1    1 100 49.82 10     49   49.41 10.38  26  76    50 0.27    -0.02  1

stem(Sample2)               # stem-leaf plot of the rounded data

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   2 | 66
##   3 | 356677888889999
##   4 | 0001122244445555555556677788888999
##   5 | 0001112223333444444555666677888999
##   6 | 01112444669
##   7 | 0266

# The output shows in the console window 
#   rather than the plot window

The cool thing about the stem-leaf diagram is that it shows the data in the graph. You should see a stem to the left of the vertical bar ( | ) and the leaves to the right of the bar. The diagram tells you that in this case, the decimal point is one digit to the right of the bar, so a stem of 2 with a leaf of 6 stands for 26. If you go back to the ‘describe’ output, you should see that the minimum value found in ‘describe’ is also the minimum value shown in the stem-leaf, and that the maximum values match as well. The stem-leaf diagram is essentially a histogram in which the data appear as numbers.
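
To see how stems and leaves relate to the raw numbers, you can split a few whole numbers by hand. This is a sketch of the idea rather than what ‘stem’ does internally; the scores and object names are made up for illustration.

```r
# A few whole-number scores (made up for illustration)
scores <- c(26, 26, 33, 35, 36, 40, 41, 52)

stems  <- scores %/% 10      # integer division: the tens digit
leaves <- scores %% 10       # remainder: the ones digit

# One entry per stem, with its leaves pasted together
rows <- tapply(leaves, stems, paste, collapse = "")
rows

# Putting stem and leaf back together recovers the data
all(stems * 10 + leaves == scores)   # TRUE
```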

Some Real Data

The Davis study concerns the height and weight of people who exercise. The data are included in the car package. The study was completed in Canada, so height and weight are reported in metric units (kilograms and centimeters) instead of English ones. But they will do for the example.

Davis, C., & Cowles, M. (1991). Body image and exercise: A study of relationships and comparisons between physically active men and women. Sex Roles, 25, 33-44.

Here I request descriptive statistics.

describe(Davis)

##        vars   n   mean    sd median trimmed   mad min max range  skew
## sex*      1 200   1.44  0.50    1.0    1.43  0.00   1   2     1  0.24
## weight    2 200  65.80 15.10   63.0   64.21 11.86  39 166   127  2.01
## height    3 200 170.02 12.01  169.5  170.32  9.64  57 197   140 -4.00
## repwt     4 183  65.62 13.78   63.0   64.27 11.86  41 124    83  1.03
## repht     5 183 168.50  9.47  168.0  168.19 10.38 148 200    52  0.33
##        kurtosis   se
## sex*      -1.95 0.04
## weight     8.96 1.07
## height    37.07 0.85
## repwt      1.33 1.02
## repht     -0.36 0.70

As you should see, there are 5 variables in the dataset. Sex was coded 1 for female, and 2 for male. Weight and height were measured by machine. Each person was also asked to self-report their weight and height. We see 200 people in the sample, and some of them apparently refused to tell the researcher their own weight and height (there are only 183 observations for each of these). As you can see, there is close agreement between the means of the actual values and the self-reported values (65.80 vs. 65.62 for weight, and 170.02 vs. 168.50 for height).
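
The asterisk after ‘sex’ in the output is how ‘describe’ marks a categorical variable that it converted to numbers. The sketch below shows where the 1 and 2 come from, assuming the variable is stored as an R factor with levels F and M; the five values are made up.

```r
# Made-up factor with the same two levels assumed for the sex variable
sex <- factor(c("M", "F", "F", "M", "F"))

levels(sex)                 # "F" "M": levels are sorted alphabetically by default
codes <- as.integer(sex)    # underlying codes: F = 1, M = 2
codes                       # 2 1 1 2 1

# With 1/2 coding, the mean minus 1 is the proportion coded 2
mean(codes) - 1             # 0.4 here
```

By the same logic, a mean of 1.44 for sex corresponds to a sample that is about 44% male.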

Let’s ask to see the data for repwt.

Davis$repwt

##   [1]  77  51  54  70  59  76  77  73  71  64  75  56  52  64  57  66 101
##  [18]  62  75  61 124  61  66  70  59  50  61  60  41 100  71  73  76  52
##  [35]  63  65  54  69  86  67  53  80  59  80  82  55  NA  NA  56  75  85
##  [52]  57  73 107  NA  64  65  74  70  58  69  71  76  75  98  59  63  62
##  [69]  51  76  61  66  54  57  50  NA  55  64  70  70  60  56  52  55  56
##  [86]  61  66  53  59  56  68  56  86  71  87  57 101  50  52  NA  55  63
## [103]  47  45  63  51  51  55  64  55  90  79  57  67  77  62  83  94  76
## [120]  66  77  73  68  55  NA  56  NA  45  68  44  61  89  53  47  84  53
## [137]  62  NA  91  83  68  53  55  55  66  55  55  55  67  86  58  47  45
## [154]  NA  44  58  68  NA  NA  56  50  54  52  58  58  59  62  66  95  50
## [171]  75  NA  61  NA  64  68  NA  67  82  68  78  NA  NA  59  70  56  55
## [188]  54  75  49  93  86  59  51  61  71  80  NA  91  81

You should see ‘NA’ for the missing observations.
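
Counting the missing values is quicker than hunting for them by eye. A minimal sketch in base R, using a short made-up vector (the name ‘repwt_demo’ is hypothetical):

```r
# Made-up self-reported weights with two missing values
repwt_demo <- c(77, 51, NA, 70, 59, NA, 64)

is.na(repwt_demo)                  # TRUE where a value is missing
sum(is.na(repwt_demo))             # number of missing values: 2
sum(!is.na(repwt_demo))            # number of observed values: 5

mean(repwt_demo)                   # NA: one missing value makes the result missing
mean(repwt_demo, na.rm = TRUE)     # 64.2: mean of the observed values only
```

The same call on the real variable, sum(is.na(Davis$repwt)), should give 200 - 183 = 17 if the counts in the ‘describe’ output hold.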

Then let’s examine the stem-leaf diagram for the same data.

stem(Davis$repwt)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    4 | 144
##    4 | 5557779
##    5 | 0000011111222223333344444
##    5 | 5555555555555666666666777778888899999999
##    6 | 0011111111222223333444444
##    6 | 5566666667777888888899
##    7 | 0000001111133334
##    7 | 55555566666777789
##    8 | 000122334
##    8 | 5666679
##    9 | 01134
##    9 | 58
##   10 | 011
##   10 | 7
##   11 | 
##   11 | 
##   12 | 4

Interesting distribution, no? A kilogram is about 2.2 pounds, so the minimum value is 41 x 2.2 = 90.2 pounds, and the maximum is 124 x 2.2 = 272.8 pounds. The latter is pretty heavy. Could be a real weight, but I might want to check the value. Also notice the positive skew of the distribution (the long tail toward the heavier weights), which agrees with the skew of 1.03 in the ‘describe’ output.
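
The conversion is easy to do in R as well. In this sketch the factor 2.2 is the approximation used above, and the object names are my own.

```r
kg_to_lb <- 2.2                              # approximate conversion factor used in the text

repwt_range_kg <- c(min = 41, max = 124)     # minimum and maximum from the describe output
repwt_range_kg * kg_to_lb                    # min 90.2, max 272.8 (pounds)
```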

Michael Brannick

6/2/2019
