GMAT, sex & Law School Grades

These are the data presented in my PowerPoint slides (fictitious data). Imagine we have test scores and grade point averages for males and females. We want to know whether males or females earn better grades, whether there is a relation between the test scores and grades, and whether that relation is similar for the two groups.

Data description - GMAT_Ancova.csv

sex - 1 = female, -1 = male (IV; factor)
MAT - Miller Analogies Test score (IV; covariate)
GPA - law school Grade Point Average (DV)

library(psych)                       # for describe() and describeBy()
sex <- c(rep(1, 20), rep(-1, 20))    # 1 = female, -1 = male
MAT <- c(51, 53, 52, 50, 54, 50, 52, 56, 49, 52, 50, 51, 49, 54, 48, 48, 53, 53, 48, 47, 47, 53, 51, 51, 46, 48, 51, 51, 53, 55, 50, 51, 52, 50, 49, 49, 50, 52, 44, 52)
GPA <- c(3.7, 3.28, 3.79, 3.23, 3.58, 3.34, 3.05, 3.78, 3.23, 3.16, 3.46, 3.47, 3.73, 3.63, 3.09, 3.18, 3.58, 3.26, 3.11, 3.22, 2.72, 3.62, 3.45, 3.78, 3.14, 2.89, 3.36, 3.05, 3.65, 3.61, 3.45, 3.43, 3.56, 3.14, 3.19, 3.32, 3.06, 3.47, 3.07, 3.66)
lawschool <- data.frame(sex, MAT, GPA)
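
If you have the GMAT_Ancova.csv file mentioned above, you could read it in rather than typing the vectors. A minimal sketch, assuming the file sits in the working directory and has columns named sex, MAT, and GPA:

lawschool <- read.csv("GMAT_Ancova.csv")   # assumes columns sex, MAT, GPA
str(lawschool)                             # check that the variables read in as expected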

Now we will ask for some basic descriptive information.

describe(lawschool)
##     vars  n  mean   sd median trimmed  mad   min   max range  skew
## sex    1 40  0.00 1.01   0.00    0.00 1.48 -1.00  1.00  2.00  0.00
## MAT    2 40 50.62 2.50  51.00   50.69 2.97 44.00 56.00 12.00 -0.28
## GPA    3 40  3.36 0.27   3.35    3.37 0.33  2.72  3.79  1.07 -0.16
##     kurtosis   se
## sex    -2.05 0.16
## MAT    -0.05 0.39
## GPA    -0.84 0.04
pairs(lawschool)

cor(lawschool)
##           sex       MAT       GPA
## sex 1.0000000 0.1520279 0.1189388
## MAT 0.1520279 1.0000000 0.6362456
## GPA 0.1189388 0.6362456 1.0000000

Because sex is binary, the plots involving that variable are not terribly informative. The correlations, however, show a slight association of sex with both MAT (r = .15) and GPA (r = .12), and a clear correlation between MAT and GPA (r = .64).
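
One way to see the group differences more directly (not shown in the original output) is to get descriptive statistics separately by sex; a quick sketch using the psych package loaded above:

describeBy(lawschool[, c("MAT", "GPA")], group = lawschool$sex)   # means and SDs for each sex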

We want to test the full model first, to see whether the interaction is significant (a nonsignificant interaction is consistent with a common slope for the regression of GPA on MAT in the two groups).

res1 <- lm(GPA ~ factor(sex) + MAT + MAT:factor(sex))   # full model: sex, MAT, and their interaction
summary(res1)                                           # print the coefficients (intercept & slopes)
## 
## Call:
## lm(formula = GPA ~ factor(sex) + MAT + MAT:factor(sex))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39529 -0.13314  0.03239  0.10112  0.44007 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.75493    0.94304  -0.801 0.428656    
## factor(sex)1      1.50736    1.38475   1.089 0.283588    
## MAT               0.08131    0.01874   4.338 0.000111 ***
## factor(sex)1:MAT -0.02953    0.02731  -1.081 0.286801    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2102 on 36 degrees of freedom
## Multiple R-squared:  0.424,  Adjusted R-squared:  0.376 
## F-statistic: 8.834 on 3 and 36 DF,  p-value: 0.0001603

As you can see, the interaction is not significant, so we refit the model after dropping that term.

res2 <- lm(GPA ~ factor(sex) + MAT)     # refit without the interaction term
summary(res2)
## 
## Call:
## lm(formula = GPA ~ factor(sex) + MAT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41090 -0.09740  0.02767  0.10549  0.47131 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.05599    0.68817  -0.081    0.936    
## factor(sex)1  0.01195    0.06740   0.177    0.860    
## MAT           0.06740    0.01366   4.933 1.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2107 on 37 degrees of freedom
## Multiple R-squared:  0.4053, Adjusted R-squared:  0.3732 
## F-statistic: 12.61 on 2 and 37 DF,  p-value: 6.673e-05
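
An equivalent way to check whether the interaction can be dropped is to compare the two nested models directly with an F test; a quick sketch using the models fit above:

anova(res2, res1)   # F test of the interaction term (full vs. common-slopes model)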

As you can see, the only significant term is for MAT. The inference would be that a single regression line describes both groups.
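
If you want the group means adjusted for MAT (the usual ANCOVA summary), one way is to predict GPA for each sex at the overall mean of MAT. A sketch using res2 from above:

newdat <- data.frame(sex = c(1, -1), MAT = mean(MAT))   # 1 = female, -1 = male, both at the grand mean MAT
predict(res2, newdata = newdat)                         # adjusted (covariate-equated) group means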

Recall that the data are fabricated, but a couple of points are worth noting. First, there are only 20 people in each group. The test of the interaction is not very powerful, so you really need a lot of people (perhaps 250 per group) to take it very seriously. Second, note that sex is correlated with both GPA and MAT. When the independent variables (sex and MAT in this case) are correlated, the power of the tests of the slopes decreases.
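
One way to quantify how much the correlation between the predictors costs you is with variance inflation factors. A sketch, assuming the car package is installed (it is not used elsewhere in these notes):

library(car)   # assumed to be installed; provides vif()
vif(res2)      # values near 1 mean little precision is lost to correlated predictors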

One more thing we can do is to plot the data with the common regression line:

lawschool$sex <- lawschool$sex + 3                        # recode sex (1/-1) to positive values (4/2) for plotting
res3 <- lm(GPA ~ MAT)                                     # common regression line, ignoring sex
plot(MAT, GPA, pch = lawschool$sex, col = lawschool$sex)  # pch/col give different symbols and colors for M and F
abline(res3)

Note how the numeric code for sex was used to set the plotting character (pch) and color (col) for each observation.
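
To report the common line itself, you can pull its intercept and slope out of res3 (a quick check, not part of the original output):

coef(res3)   # intercept and slope of the common regression line of GPA on MAT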

If the lines had differed rather than being the same, we could separate the groups, fit a regression line for each, and plot both on the same graph, as shown below.

# sex was recoded to 2 (men) and 4 (women) by adding 3 earlier so it would be positive for plotting
men <- subset(lawschool, sex == 2)
women <- subset(lawschool, sex == 4)
# run regressions for each group to find the coefficients relating MAT to GPA
res1a <- lm(GPA ~ MAT, data = men)
# summary(res1a)
res1b <- lm(GPA ~ MAT, data = women)
# summary(res1b)
#
plot(MAT, GPA, pch = lawschool$sex, col = lawschool$sex, xlab = "GMAT", ylab = "Law School GPA")
abline(res1a, lty = 2)   # men: dashed line
abline(res1b)            # women: solid line
text(45, 3.2, "Women")   # find which line is which by looking at summary(res1a) and summary(res1b)
text(45, 2.8, "Men")
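
A small refinement (not in the original code): a legend keyed to the same pch and col codes can replace the text() labels, and it does not depend on where the lines happen to fall:

legend("topleft", legend = c("Women", "Men"), pch = c(4, 2), col = c(4, 2),
       lty = c(1, 2))   # women: solid line, pch/col 4; men: dashed line, pch/col 2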