Anova

Purpose
Anova is a concept which is probably very well known to everyone who has taken any elementary course in stats.When I tried racking my brains to remember what I could do with R , relating to ANOVA.

> x <- rnorm(1000, 2, 3)
> y <- rnorm(1000, 2, 9)

I was trying to do this . Anova of 2 random variables with same mean and different variance. Anova is something which is not used on 2 random variables. It is for heaven’s sake , analysis of variance and there is nothing to analyze the variance of x and y I can instead do a t test

> t.test(y - x)
        One Sample t-test

data:  y - x
t = 0.1947, df = 999, p-value = 0.8457
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.5097545  0.6220424
sample estimates:
 mean of x
0.05614391

This talks about the hypothesis of difference between means = 0 against the alternate that it is not 0.

While reading the stuff about anova, I came across stripplot and hence explored a bit on the stripplot function in R.

> par(mfrow = c(1, 1))
> x <- stats::rnorm(50)
> xr <- round(x, 1)
> stripchart(x)
> m <- mean(par("usr")[1:2])
> text(m, 1.04, "stripchart(x, \"overplot\")")
> stripchart(xr, method = "stack", add = TRUE, at = 1.2)
> text(m, 1.35, "stripchart(round(x,1), \"stack\")")
> stripchart(xr, method = "jitter", add = TRUE, at = 0.7)
> text(m, 0.85, "stripchart(round(x,1), \"jitter\")")

Basic funda is to use to check 2 different models.

> library(faraway)
> data(coagulation)
> boxplot(coag ~ diet, coagulation)
> par(mfrow = c(1, 2))
> plot(coag ~ diet, coagulation, ylab = "coagulation time")
> with(coagulation, stripchart(coag ~ diet, vertical = TRUE, method = "stack",
+     xlab = "diet", ylab = "coagulation time"))

Lets fit a model

> g <- lm(coag ~ diet, coagulation)
> summary(g)
Call:
lm(formula = coag ~ diet, data = coagulation)

Residuals:
       Min         1Q     Median         3Q        Max
-5.000e+00 -1.250e+00 -2.141e-17  1.250e+00  5.000e+00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  64.0000     0.4979 128.537  < 2e-16 ***
diet1        -3.0000     0.9736  -3.081 0.005889 **
diet2         2.0000     0.8453   2.366 0.028195 *
diet3         4.0000     0.8453   4.732 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706,     Adjusted R-squared: 0.6212
F-statistic: 13.57 on 3 and 20 DF,  p-value: 4.658e-05
> g <- lm(coag ~ diet - 1, coagulation)
> summary(g)
Call:
lm(formula = coag ~ diet - 1, data = coagulation)

Residuals:
       Min         1Q     Median         3Q        Max
-5.000e+00 -1.250e+00  1.743e-16  1.250e+00  5.000e+00

Coefficients:
      Estimate Std. Error t value Pr(>|t|)
dietA  61.0000     1.1832   51.55   <2e-16 ***
dietB  66.0000     0.9661   68.32   <2e-16 ***
dietC  68.0000     0.9661   70.39   <2e-16 ***
dietD  61.0000     0.8367   72.91   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.9989,     Adjusted R-squared: 0.9986
F-statistic:  4399 on 4 and 20 DF,  p-value: < 2.2e-16

Basically to test the model against the null that none of the coefficients are good, there is a simple anova test

> gi <- lm(coag ~ diet - 1, coagulation)
> gnull <- lm(coag ~ 1, coagulation)
> anova(gnull, gi)
Analysis of Variance Table

Model 1: coag ~ 1
Model 2: coag ~ diet - 1
  Res.Df RSS Df Sum of Sq      F    Pr(>F)
1     23 340
2     20 112  3       228 13.571 4.658e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> options(contrasts = c("contr.sum", "contr.poly"))
> gs <- lm(coag ~ diet, coagulation)
> summary(gs)
Call:
lm(formula = coag ~ diet, data = coagulation)

Residuals:
       Min         1Q     Median         3Q        Max
-5.000e+00 -1.250e+00 -2.141e-17  1.250e+00  5.000e+00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  64.0000     0.4979 128.537  < 2e-16 ***
diet1        -3.0000     0.9736  -3.081 0.005889 **
diet2         2.0000     0.8453   2.366 0.028195 *
diet3         4.0000     0.8453   4.732 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706,     Adjusted R-squared: 0.6212
F-statistic: 13.57 on 3 and 20 DF,  p-value: 4.658e-05
> par(mfrow = c(1, 1))
> qqnorm(residuals(g))
> plot(jitter(fitted(g)), residuals(g), xlab = "Fitted", ylab = "Residuals")

There are bunch of options to produce the contrast matrix

contr.helmert(n, contrasts = TRUE) contr.poly(n, scores = 1:n, contrasts = TRUE) contr.sum(n, contrasts = TRUE) contr.treatment(n, base = 1, contrasts = TRUE) contr.SAS(n, contrasts = TRUE)

> med <- with(coagulation, tapply(coag, diet, median))
> ar <- with(coagulation, abs(coag - med[diet]))
> anova(lm(ar ~ diet, coagulation))
Analysis of Variance Table

Response: ar
          Df Sum Sq Mean Sq F value Pr(>F)
diet       3  4.333   1.444  0.6492 0.5926
Residuals 20 44.500   2.225
> qt(0.975, 20)
[1] 2.085963
> c(5 - 2.086 * 1.53, 5 + 2.086 * 1.53)
[1] 1.80842 8.19158
> qtukey(0.95, 4, 20)/sqrt(2)
[1] 2.798936
> c(5 - 2.8 * 1.53, 5 + 2.8 * 1.53)
[1] 0.716 9.284
> TukeyHSD(aov(coag ~ diet, coagulation))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = coag ~ diet, data = coagulation)

$diet
             diff         lwr       upr     p adj
B-A  5.000000e+00   0.7245544  9.275446 0.0183283
C-A  7.000000e+00   2.7245544 11.275446 0.0009577
D-A -1.421085e-14  -4.0560438  4.056044 1.0000000
C-B  2.000000e+00  -1.8240748  5.824075 0.4766005
D-B -5.000000e+00  -8.5770944 -1.422906 0.0044114
D-C -7.000000e+00 -10.5770944 -3.422906 0.0001268
> qt(1 - 0.05/12, 20)
[1] 2.927119

I started going over lady tasting tea to understood why in the hell does one want anova in the first place ? By analyzing anova, what are we getting at ? why is it called analysis of variance ?

> y <- rnorm(1000)
> x <- rep(letters[1:2], 500)
> summary(lm(y ~ x))
Call:
lm(formula = y ~ x)

Residuals:
      Min        1Q    Median        3Q       Max
-3.157256 -0.724636 -0.008918  0.680274  3.430869

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.05189    0.03298   1.573    0.116
x1          -0.04674    0.03298  -1.417    0.157

Residual standard error: 1.043 on 998 degrees of freedom
Multiple R-squared: 0.002008,   Adjusted R-squared: 0.001008
F-statistic: 2.008 on 1 and 998 DF,  p-value: 0.1568
> anova(lm(y ~ x))
Analysis of Variance Table

Response: y
           Df  Sum Sq Mean Sq F value Pr(>F)
x           1    2.18    2.18  2.0083 0.1568
Residuals 998 1085.63    1.09

What is the historical importance of anova ? is something I intend to explore soon.