Anova
Purpose
Anova is a concept which is probably very well known to everyone who has taken any elementary course in stats.When I tried racking my brains to remember what I could do with R , relating to ANOVA.
> x <- rnorm(1000, 2, 3) > y <- rnorm(1000, 2, 9) |
I was trying to do this . Anova of 2 random variables with same mean and different variance. Anova is something which is not used on 2 random variables. It is for heaven’s sake , analysis of variance and there is nothing to analyze the variance of x and y I can instead do a t test
> t.test(y - x)
One Sample t-test
data: y - x
t = 0.1947, df = 999, p-value = 0.8457
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.5097545 0.6220424
sample estimates:
mean of x
0.05614391 |
This talks about the hypothesis of difference between means = 0 against the alternate that it is not 0.
While reading the stuff about anova, I came across stripplot and hence explored a bit on the stripplot function in R.
> par(mfrow = c(1, 1))
> x <- stats::rnorm(50)
> xr <- round(x, 1)
> stripchart(x)
> m <- mean(par("usr")[1:2])
> text(m, 1.04, "stripchart(x, \"overplot\")")
> stripchart(xr, method = "stack", add = TRUE, at = 1.2)
> text(m, 1.35, "stripchart(round(x,1), \"stack\")")
> stripchart(xr, method = "jitter", add = TRUE, at = 0.7)
> text(m, 0.85, "stripchart(round(x,1), \"jitter\")") |

Basic funda is to use to check 2 different models.
> library(faraway) > data(coagulation) > boxplot(coag ~ diet, coagulation) > par(mfrow = c(1, 2)) > plot(coag ~ diet, coagulation, ylab = "coagulation time") > with(coagulation, stripchart(coag ~ diet, vertical = TRUE, method = "stack", + xlab = "diet", ylab = "coagulation time")) |
Lets fit a model
> g <- lm(coag ~ diet, coagulation)
> summary(g)
Call:
lm(formula = coag ~ diet, data = coagulation)
Residuals:
Min 1Q Median 3Q Max
-5.000e+00 -1.250e+00 -2.141e-17 1.250e+00 5.000e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.0000 0.4979 128.537 < 2e-16 ***
diet1 -3.0000 0.9736 -3.081 0.005889 **
diet2 2.0000 0.8453 2.366 0.028195 *
diet3 4.0000 0.8453 4.732 0.000128 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212
F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05
> g <- lm(coag ~ diet - 1, coagulation)
> summary(g)
Call:
lm(formula = coag ~ diet - 1, data = coagulation)
Residuals:
Min 1Q Median 3Q Max
-5.000e+00 -1.250e+00 1.743e-16 1.250e+00 5.000e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
dietA 61.0000 1.1832 51.55 <2e-16 ***
dietB 66.0000 0.9661 68.32 <2e-16 ***
dietC 68.0000 0.9661 70.39 <2e-16 ***
dietD 61.0000 0.8367 72.91 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.9989, Adjusted R-squared: 0.9986
F-statistic: 4399 on 4 and 20 DF, p-value: < 2.2e-16 |
Basically to test the model against the null that none of the coefficients are good, there is a simple anova test
> gi <- lm(coag ~ diet - 1, coagulation) > gnull <- lm(coag ~ 1, coagulation) > anova(gnull, gi) Analysis of Variance Table Model 1: coag ~ 1 Model 2: coag ~ diet - 1 Res.Df RSS Df Sum of Sq F Pr(>F) 1 23 340 2 20 112 3 228 13.571 4.658e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |
> options(contrasts = c("contr.sum", "contr.poly"))
> gs <- lm(coag ~ diet, coagulation)
> summary(gs)
Call:
lm(formula = coag ~ diet, data = coagulation)
Residuals:
Min 1Q Median 3Q Max
-5.000e+00 -1.250e+00 -2.141e-17 1.250e+00 5.000e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.0000 0.4979 128.537 < 2e-16 ***
diet1 -3.0000 0.9736 -3.081 0.005889 **
diet2 2.0000 0.8453 2.366 0.028195 *
diet3 4.0000 0.8453 4.732 0.000128 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212
F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05
> par(mfrow = c(1, 1))
> qqnorm(residuals(g))
> plot(jitter(fitted(g)), residuals(g), xlab = "Fitted", ylab = "Residuals") |
There are bunch of options to produce the contrast matrix
contr.helmert(n, contrasts = TRUE) contr.poly(n, scores = 1:n, contrasts = TRUE) contr.sum(n, contrasts = TRUE) contr.treatment(n, base = 1, contrasts = TRUE) contr.SAS(n, contrasts = TRUE)
> med <- with(coagulation, tapply(coag, diet, median))
> ar <- with(coagulation, abs(coag - med[diet]))
> anova(lm(ar ~ diet, coagulation))
Analysis of Variance Table
Response: ar
Df Sum Sq Mean Sq F value Pr(>F)
diet 3 4.333 1.444 0.6492 0.5926
Residuals 20 44.500 2.225
> qt(0.975, 20)
[1] 2.085963
> c(5 - 2.086 * 1.53, 5 + 2.086 * 1.53)
[1] 1.80842 8.19158
> qtukey(0.95, 4, 20)/sqrt(2)
[1] 2.798936
> c(5 - 2.8 * 1.53, 5 + 2.8 * 1.53)
[1] 0.716 9.284
> TukeyHSD(aov(coag ~ diet, coagulation))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = coag ~ diet, data = coagulation)
$diet
diff lwr upr p adj
B-A 5.000000e+00 0.7245544 9.275446 0.0183283
C-A 7.000000e+00 2.7245544 11.275446 0.0009577
D-A -1.421085e-14 -4.0560438 4.056044 1.0000000
C-B 2.000000e+00 -1.8240748 5.824075 0.4766005
D-B -5.000000e+00 -8.5770944 -1.422906 0.0044114
D-C -7.000000e+00 -10.5770944 -3.422906 0.0001268
> qt(1 - 0.05/12, 20)
[1] 2.927119 |
I started going over lady tasting tea to understood why in the hell does one want anova in the first place ? By analyzing anova, what are we getting at ? why is it called analysis of variance ?
> y <- rnorm(1000)
> x <- rep(letters[1:2], 500)
> summary(lm(y ~ x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.157256 -0.724636 -0.008918 0.680274 3.430869
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05189 0.03298 1.573 0.116
x1 -0.04674 0.03298 -1.417 0.157
Residual standard error: 1.043 on 998 degrees of freedom
Multiple R-squared: 0.002008, Adjusted R-squared: 0.001008
F-statistic: 2.008 on 1 and 998 DF, p-value: 0.1568
> anova(lm(y ~ x))
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 2.18 2.18 2.0083 0.1568
Residuals 998 1085.63 1.09 |
What is the historical importance of anova ? is something I intend to explore soon.