Faraway-2-Intro
Purpose
To understand various aspects of extending a linear model.
> library(faraway)
> library(car)
> data(gavote)
> head(gavote)
equip econ perAA rural atlanta gore bush other votes ballots
APPLING LEVER poor 0.182 rural notAtlanta 2093 3940 66 6099 6617
ATKINSON LEVER poor 0.230 rural notAtlanta 821 1228 22 2071 2149
BACON LEVER poor 0.131 rural notAtlanta 956 2010 29 2995 3347
BAKER OS-CC poor 0.476 rural notAtlanta 893 615 11 1519 1607
BALDWIN LEVER middle 0.359 rural notAtlanta 5893 6041 192 12126 12785
BANKS LEVER middle 0.024 rural notAtlanta 1220 3202 111 4533 4773
The variables in the dataset are:
- equip - type of voting equipment
- econ - economic level of the county
- perAA - proportion of African Americans in the county
- rural - whether the county is rural or urban
- atlanta - whether the county is part of the Atlanta metropolitan area
- gore - votes for Gore
- bush - votes for Bush
- other - votes for other candidates
- votes - total votes
- ballots - ballots used
> summary(gavote)
equip econ perAA rural atlanta
LEVER:74 middle:69 Min. :0.0000 rural:117 Atlanta : 15
OS-CC:44 poor :72 1st Qu.:0.1115 urban: 42 notAtlanta:144
OS-PC:22 rich :18 Median :0.2330
PAPER: 2 Mean :0.2430
PUNCH:17 3rd Qu.:0.3480
Max. :0.7650
gore bush other votes
Min. : 249 Min. : 271 Min. : 5.0 Min. : 832
1st Qu.: 1386 1st Qu.: 1804 1st Qu.: 30.0 1st Qu.: 3506
Median : 2326 Median : 3597 Median : 86.0 Median : 6299
Mean : 7020 Mean : 8929 Mean : 381.7 Mean : 16331
3rd Qu.: 4430 3rd Qu.: 7468 3rd Qu.: 210.0 3rd Qu.: 11846
Max. :154509 Max. :140494 Max. :7920.0 Max. :263211
ballots
Min. : 881
1st Qu.: 3694
Median : 6712
Mean : 16927
3rd Qu.: 12251
Max. :280975
> gavote$undercount = (gavote$ballots - gavote$votes)/gavote$ballots
> boxplot(gavote$undercount)
One important lesson is that you should always look at the magnitude of the possible y values and transform the response accordingly. In this case, the relative undercount (a proportion of the ballots) makes far more sense than the raw counts, because the counties differ enormously in size (ballots range from 881 to 280975).
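To make the point about scale concrete, here is a minimal sketch on two invented counties. The data frame `toy` and its numbers are hypothetical and are not taken from gavote.

```r
# Hypothetical toy counties, invented for illustration only
toy <- data.frame(
  county  = c("small", "large"),
  ballots = c(1000, 100000),
  votes   = c(950, 97000)
)

# Raw undercount is dominated by the size of the county
toy$raw_undercount <- toy$ballots - toy$votes

# Relative undercount puts both counties on the same 0 to 1 scale
toy$undercount <- toy$raw_undercount / toy$ballots

print(toy)
```

The large county loses far more ballots in absolute terms, yet its undercount proportion is the lower of the two, which is the comparison that matters here.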
> hist(gavote$undercount, n.bins(gavote$undercount))
> plot(density(gavote$undercount))
> rug(gavote$undercount)
> pie(table(gavote$equip))
> barplot(sort(table(gavote$equip), decreasing = TRUE), las = 2)
> gavote$pergore <- gavote$gore/gavote$votes
> plot(pergore ~ perAA, gavote)
> plot(undercount ~ equip, gavote)
> xtabs(~atlanta + rural, gavote)
rural
atlanta rural urban
Atlanta 1 14
notAtlanta 116 28

Basic Modeling
> gavote$cpergore <- gavote$pergore - mean(gavote$pergore)
> gavote$cperAA <- gavote$perAA - mean(gavote$perAA)
> lmodi <- lm(undercount ~ cperAA + cpergore * rural + equip, gavote)
> summary(lmodi)
Call:
lm(formula = undercount ~ cperAA + cpergore * rural + equip,
data = gavote)
Residuals:
Min 1Q Median 3Q Max
-0.059530 -0.012904 -0.002180 0.009013 0.127496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.043297 0.002839 15.253 < 2e-16 ***
cperAA 0.028264 0.031092 0.909 0.3648
cpergore 0.008237 0.051156 0.161 0.8723
ruralurban -0.018637 0.004648 -4.009 9.56e-05 ***
equipOS-CC 0.006482 0.004680 1.385 0.1681
equipOS-PC 0.015640 0.005827 2.684 0.0081 **
equipPAPER -0.009092 0.016926 -0.537 0.5920
equipPUNCH 0.014150 0.006783 2.086 0.0387 *
cpergore:ruralurban -0.008799 0.038716 -0.227 0.8205
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02335 on 150 degrees of freedom
Multiple R-squared: 0.1696, Adjusted R-squared: 0.1253
F-statistic: 3.829 on 8 and 150 DF, p-value: 0.0004001
> par(mfrow = c(2, 2))
> plot(lmodi)
Robust regression
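Robust regression is available through the rlm function in the MASS package. The example below uses a separate data set of stock prices and compares the rlm fit with the ordinary lm fit.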
> library(MASS)
> x <- security.db1[, c("UNIONBANK", "PNB")]
> head(x)
UNIONBANK PNB
1 148.95 404.75
2 152.35 408.90
3 149.80 406.50
4 149.00 402.45
5 149.05 408.55
6 140.85 391.25
> rlm(UNIONBANK ~ PNB, x)
Call:
rlm(formula = UNIONBANK ~ PNB, data = x)
Converged in 6 iterations
Coefficients:
(Intercept) PNB
50.5755479 0.2426357
Degrees of freedom: 236 total; 234 residual
Scale estimate: 9.94
> lm(UNIONBANK ~ PNB, x)
Call:
lm(formula = UNIONBANK ~ PNB, data = x)
Coefficients:
(Intercept) PNB
52.5559 0.2419
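In the output above the two fits are nearly identical, which suggests that data set has no gross outliers. The difference between lm and rlm shows up more clearly on contaminated data. The following is a sketch on simulated data (not security.db1); the true line, the seed and the size of the contamination are all assumptions chosen for illustration.

```r
library(MASS)  # provides rlm()

set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated data: true line is y = 2 + 0.5 * x
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.5)

# Contaminate the first five observations with gross outliers
bad <- 1:5
y[bad] <- y[bad] + 20

dat <- data.frame(x = x, y = y)

fit_lm  <- lm(y ~ x, data = dat)   # ordinary least squares
fit_rlm <- rlm(y ~ x, data = dat)  # M-estimation with the default Huber psi

# Coefficients side by side
print(rbind(lm = coef(fit_lm), rlm = coef(fit_rlm)))

# Final IWLS weights of the contaminated points; they should be well below 1
print(round(fit_rlm$w[bad], 3))
```

The weights stored in `fit_rlm$w` are a quick way to see which observations the robust fit has downweighted.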
Other lessons from chapter 1 of the book are
- You can use step to prune a big model down into a smaller one
- You can prune manually by comparing nested models with anova and checking the F values
- regsubsets from the leaps package is again very useful
- The hierarchy rule says that lower order terms should not be removed while higher order interactions involving them remain in the model
- Weighted least squares is useful when the error variance differs across observations in a known way, for example when each response is a proportion or average based on a different number of observations
- Robust regression from the MASS Package
- You can have a point which has high leverage but low influence
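The last point can be illustrated with a small simulation. This is a sketch under assumptions: the data are simulated, and the extra point at x = 10 is placed by hand, once exactly on the line fitted to the other points and once far from it. Leverage is measured with hatvalues and influence with cooks.distance, both from base R.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Thirty ordinary points with x between 0 and 1
x <- runif(30, 0, 1)
y <- 1 + 2 * x + rnorm(30, sd = 0.2)
base_fit <- lm(y ~ x)

# One extra point far out in x
x_new <- 10
y_on  <- sum(coef(base_fit) * c(1, x_new))  # sits on the trend of the other points
y_off <- y_on + 5                           # same x, response far from the trend

fit_on  <- lm(y ~ x, data = data.frame(x = c(x, x_new), y = c(y, y_on)))
fit_off <- lm(y ~ x, data = data.frame(x = c(x, x_new), y = c(y, y_off)))

# Leverage depends only on x, so it is the same in both fits
lev_on   <- hatvalues(fit_on)[[31]]
lev_off  <- hatvalues(fit_off)[[31]]

# Influence depends on the response as well
cook_on  <- cooks.distance(fit_on)[[31]]
cook_off <- cooks.distance(fit_off)[[31]]

print(data.frame(
  point          = c("on trend", "off trend"),
  leverage       = c(lev_on, lev_off),
  cooks_distance = c(cook_on, cook_off)
))
```

Both versions of the extra point have the same high leverage, but only the one that departs from the trend has a large Cook's distance.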