Descriptive Stats and Graphs

Data, Descriptive Stats and a Histogram

library(car)                        # install.packages('car') if you don't have this
library(psych)                      # install.packages('psych') if you don't have this

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

Sample1 <- rnorm(100,50,10)         # random normal: n = 100, mean = 50, sd = 10
Sample1                             # print the data; R is case sensitive

##   [1] 33.20278 39.88115 53.06658 49.87895 43.52782 36.20418 30.87053
##   [8] 53.55891 52.44551 51.76206 50.94728 31.47855 42.28203 51.31070
##  [15] 61.18571 55.19413 58.70188 50.29461 37.54355 44.21818 44.95603
##  [22] 28.79619 64.12388 45.43964 51.36496 36.22189 60.03809 48.13543
##  [29] 56.52984 62.92297 41.19259 64.28017 45.79015 44.04187 27.42118
##  [36] 61.53297 55.16395 42.03167 61.35125 46.92814 55.27047 62.51552
##  [43] 58.03451 62.92743 69.12613 51.39214 46.47483 53.25590 36.85786
##  [50] 46.22412 42.03372 42.99251 55.87646 34.16620 56.98732 55.63444
##  [57] 52.80488 39.92735 54.88956 44.76548 32.91087 50.65925 42.02592
##  [64] 46.14878 56.74883 40.71777 49.18179 36.45364 51.00027 48.30197
##  [71] 65.92533 44.66492 57.67173 63.70911 54.32682 27.97523 49.85782
##  [78] 46.24181 48.00197 65.81213 50.95917 30.95114 54.91202 55.72315
##  [85] 59.61961 33.26242 53.42571 50.39045 58.28657 54.43698 67.51934
##  [92] 53.96751 48.77602 40.12537 49.18068 47.75564 54.05945 53.43756
##  [99] 47.17401 46.99979

describe(Sample1)                   # the function 'describe' comes from the 'psych' package

##    vars   n  mean   sd median trimmed  mad   min   max range  skew
## X1    1 100 49.27 9.63  50.34   49.59 8.75 27.42 69.13  41.7 -0.27
##    kurtosis   se
## X1    -0.48 0.96
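Several of the columns that describe() prints can be reproduced in base R, which helps when reading the table. The sketch below assumes the defaults documented for psych::describe (a 10% trimmed mean, and mad() with its default scaling constant). The seed is an arbitrary choice added here for reproducibility, so the numbers will not match the output above.

```r
set.seed(1)                          # arbitrary seed (an assumption, not in the original)
x <- rnorm(100, 50, 10)

n       <- length(x)
se      <- sd(x) / sqrt(n)           # standard error of the mean
trimmed <- mean(x, trim = 0.1)       # drops the lowest and highest 10% before averaging
mad_x   <- mad(x)                    # median absolute deviation, scaled by 1.4826
rng     <- max(x) - min(x)           # the 'range' column is max minus min

round(c(n = n, mean = mean(x), sd = sd(x), median = median(x),
        trimmed = trimmed, mad = mad_x, range = rng, se = se), 2)
```

In the output above, for example, se = 0.96 is just sd / sqrt(n) = 9.63 / 10.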

hist(Sample1)                       # show a simple histogram

Add normal distribution overlay

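x <- Sample1
h <- hist(x, breaks=10,
          main="Histogram with Normal Curve")      # keep the histogram object to get the bin width
xfit <- seq(min(x), max(x), length.out=40)         # 40 evenly spaced x values across the data
yfit <- dnorm(xfit, mean=mean(x), sd=sd(x))        # normal density using the sample mean and sd
yfit <- yfit*diff(h$mids[1:2])*length(x)           # rescale density to counts: density * bin width * n
lines(xfit, yfit, col="blue", lwd=2)               # draw the curve over the histogram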
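The rescaling of yfit works because a bin's count equals its density times the bin width times the sample size. The sketch below checks that relationship and then shows an alternative: draw the histogram on the density scale so that dnorm() can be overlaid without rescaling. The seed and the helper names (binwidth, m, s) are assumptions made for illustration.

```r
set.seed(1)                                 # arbitrary seed, not part of the original
x <- rnorm(100, 50, 10)

h <- hist(x, breaks = 10, plot = FALSE)     # compute the bins without drawing
binwidth <- diff(h$breaks)[1]               # bins are equal width here
# counts and densities differ only by the factor binwidth * n
all.equal(h$counts, h$density * binwidth * length(x))

# Alternative: histogram on the density scale, normal curve overlaid directly
m <- mean(x)                                # computed outside curve(), which reuses the name x
s <- sd(x)
hist(x, breaks = 10, freq = FALSE, main = "Histogram (density scale) with Normal Curve")
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "blue", lwd = 2)
```

The mean and sd are stored first because curve() evaluates its expression with x set to a grid of plotting values, not the data.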

A histogram is not a reliable guide to shape, because the apparent shape depends heavily on the number of bins (columns). A kernel density plot usually shows shape better.
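To see how much the bin count matters, the same sample can be drawn with different values of breaks. This is a small sketch with an arbitrary seed and arbitrary break counts (5, 10, 30). Note that hist() treats a single number for breaks as a suggestion and picks "pretty" cut points near it.

```r
set.seed(1)                                  # arbitrary seed, not part of the original
x <- rnorm(100, 50, 10)

op <- par(mfrow = c(1, 3))                   # three panels side by side
for (b in c(5, 10, 30)) {
  hist(x, breaks = b, main = paste("breaks =", b))
}
par(op)                                      # restore the previous plot layout

# Number of bins hist() actually used for each requested value
bins <- sapply(c(5, 10, 30),
               function(b) length(hist(x, breaks = b, plot = FALSE)$counts))
bins
```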

Kernel Density Plot

d <- density(Sample1) # returns the density data
plot(d) # plots the results
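A kernel density estimate has a smoothing setting of its own, the bandwidth. The sketch below uses the adjust argument of density(), which multiplies the default bandwidth, to show a bumpier and a smoother version of the same curve. The seed and the adjust values are arbitrary choices for illustration.

```r
set.seed(1)                               # arbitrary seed, not part of the original
x <- rnorm(100, 50, 10)

d_default <- density(x)                   # default bandwidth
d_narrow  <- density(x, adjust = 0.5)     # half the default bandwidth: bumpier
d_wide    <- density(x, adjust = 2)       # twice the default bandwidth: smoother

plot(d_narrow, col = "red", main = "Effect of bandwidth")
lines(d_default, col = "black", lwd = 2)
lines(d_wide, col = "blue")
legend("topright", legend = c("adjust = 0.5", "default", "adjust = 2"),
       col = c("red", "black", "blue"), lty = 1)
```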

Stem-leaf plot

A stem-and-leaf plot shows the actual data values along with the shape of the distribution.

stem(Sample1)   # the output shows in the console window rather than the plot window

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   2 | 789
##   3 | 1113334
##   3 | 66678
##   4 | 0001122223444
##   4 | 5555666667778888999
##   5 | 00001111111223333344444
##   5 | 555556667778889
##   6 | 00112333444
##   6 | 6689

Boxplot

The boxplot shows an abstracted distribution, which is very handy for quick inspection and plotting multiple distributions side-by-side.

boxplot(Sample1)
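For the side-by-side use mentioned above, boxplot() accepts a list (or a formula) and draws one box per group. A minimal sketch with two simulated samples follows. The skewed sample is a hypothetical comparison added for illustration, and the seed is arbitrary.

```r
set.seed(1)                                # arbitrary seed, not part of the original
groups <- list(
  normal = rnorm(100, 50, 10),             # roughly symmetric
  skewed = 40 + rexp(100, rate = 0.1)      # right-skewed, hypothetical comparison sample
)

b <- boxplot(groups, main = "Two distributions side by side")
b$stats   # per box: lower whisker, lower hinge, median, upper hinge, upper whisker
```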

Blackmore Data

The Blackmore dataset from the ‘car’ package (loaded above; newer versions of car supply its data sets through the carData package) contains Blackmore and Davis’s data on the exercise histories of 138 girls hospitalized for eating disorders and 98 control subjects. The data frame has 945 rows and 4 columns. Note that there are multiple rows for each participant. We will ignore these dependencies for now. If we were going to analyze these data, we would probably want a summary for each person. For description, though, looking at every row is useful: we want to see all the data, including observations that might be masked by taking an average.
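As a rough illustration of a per-person summary, aggregate() can collapse repeated rows to one row per subject. The sketch below uses a small made-up data frame with the same column names as Blackmore so that it runs without any packages. The values are invented and are not taken from the real data.

```r
# Toy stand-in with the same column names as Blackmore (values are made up)
toy <- data.frame(
  subject  = factor(c("100", "100", "100", "101", "101")),
  age      = c(8, 10, 12, 8, 10),
  exercise = c(2.7, 1.9, 2.0, 0.2, 0.6),
  group    = factor(c("patient", "patient", "patient", "control", "control"))
)

# One row per subject: mean exercise across that subject's rows
per_subject <- aggregate(exercise ~ subject + group, data = toy, FUN = mean)
per_subject
```

With car loaded, replacing toy with Blackmore in the aggregate() call should give one row per participant.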

I’m showing these data because the distribution is so pathological. It is a good contrast to the nice, normal graphs I showed earlier.

describe(Blackmore)

##          vars   n   mean    sd median trimmed   mad min    max  range
## subject*    1 945 114.10 67.84 113.00  113.80 90.44   1 231.00 230.00
## age         2 945  11.44  2.77  12.00   11.25  2.97   8  17.92   9.92
## exercise    3 945   2.53  3.50   1.33    1.79  1.63   0  29.96  29.96
## group*      4 945   1.62  0.49   2.00    1.65  0.00   1   2.00   1.00
##           skew kurtosis   se
## subject*  0.03    -1.25 2.21
## age       0.34    -0.98 0.09
## exercise  2.87    11.13 0.11
## group*   -0.49    -1.76 0.02

hist(Blackmore$age)

age.dens <- density(Blackmore$age)
plot(age.dens)

hist(Blackmore$exercise)

exe.dens <- density(Blackmore$exercise)
plot(exe.dens)

stem(Blackmore$exercise)

## 
##   The decimal point is at the |
## 
##    0 | 00000000000000000000000000000000000000000000000000000000000000000000+507
##    2 | 00000000000000000000111111111112222222223333333333344444455555555555+91
##    4 | 00001112222223333333355555566677777777888889999999990011222344445556
##    6 | 1111122344556666667788889900113333557889999
##    8 | 0133455567889011339
##   10 | 0137777889002345589
##   12 | 024555801234
##   14 | 06881
##   16 | 890
##   18 | 56058
##   20 | 0
##   22 | 7
##   24 | 7
##   26 | 
##   28 | 
##   30 | 0

boxplot(Blackmore$exercise, main='Exercise')

Blackmore scatterplot

cor.blk <- round(cor(Blackmore$age, Blackmore$exercise),2) # find the correlation, round to 2 digits
plot(Blackmore$age, Blackmore$exercise)                    # scatterplot(X, Y)
abline(lm(Blackmore$exercise~Blackmore$age))               # plot the regression line (lm is linear regression)
text(16,25, 'r=')                                          # place 'r=' in the plot
text(17,25, cor.blk)                                       # place the value of the correlation in the plot