Purpose
Omitted Variable Bias

Simulating a dataset with two independent variables and no correlation between them: Y = alpha + beta_1 X_1 + beta_2 X_2 + epsilon

> N <- 10000
> x <- cbind(1, runif(N), runif(N))
> beta.true <- c(2, 3, 5)
> error.var <- 3
> indep <- as.matrix(x)
> dep <- indep %*% beta.true + sqrt(error.var) * rnorm(N)
> fit1 <- lm(dep ~ indep[, c(2, 3)])
> summary(fit1)
Call:
lm(formula = dep ~ indep[, c(2, 3)])
Residuals:
      Min        1Q    Median        3Q       Max 
-6.681732 -1.174800  0.005124  1.160683  7.613576 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.04134    0.04606   44.32   <2e-16 ***
indep[, c(2, 3)]1  2.98309    0.06077   49.09   <2e-16 ***
indep[, c(2, 3)]2  4.94039    0.06078   81.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.749 on 9997 degrees of freedom
Multiple R-squared:  0.4762,	Adjusted R-squared:  0.4761 
F-statistic:  4544 on 2 and 9997 DF,  p-value: < 2.2e-16

> fit2 <- lm(dep ~ indep[, c(2)])
> summary(fit2)

Call:
lm(formula = dep ~ indep[, c(2)])

Residuals:
       Min         1Q     Median         3Q        Max 
-7.5494252 -1.5544859  0.0004559  1.5569615  8.7648413 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.49838    0.04479  100.44   <2e-16 ***
indep[, c(2)]  3.02685    0.07831   38.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.254 on 9998 degrees of freedom
Multiple R-squared:  0.13,	Adjusted R-squared:  0.1299 
F-statistic:  1494 on 1 and 9998 DF,  p-value: < 2.2e-16

> fit3 <- lm(dep ~ indep[, c(3)])
> summary(fit3)

Call:
lm(formula = dep ~ indep[, c(3)])

Residuals:
     Min       1Q   Median       3Q      Max 
-7.08114 -1.31229 -0.02188  1.29275  8.40231 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.50248    0.03916   89.44   <2e-16 ***
indep[, c(3)]  4.96683    0.06770   73.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.948 on 9998 degrees of freedom
Multiple R-squared:  0.3499,	Adjusted R-squared:  0.3499 
F-statistic:  5382 on 1 and 9998 DF,  p-value: < 2.2e-16

As you can see, beta_1 and beta_2 can each be estimated without any noticeable bias even in the shorter regressions: fit2 and fit3 recover slopes close to the true values of 3 and 5.
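Why is the short regression safe here? Dropping a regressor shifts the remaining slope by beta_2 * Cov(X_1, X_2) / Var(X_1), and with independent runif() draws the sample covariance is essentially zero. A quick check (a sketch, reusing the indep matrix and beta.true from the simulation above):

> # sample correlation between the two regressors (columns 2 and 3 of indep);
> # with independent runif() draws this should be close to 0
> cor(indep[, 2], indep[, 3])
> # implied shift in the short-regression slope for X_1 when X_2 is dropped:
> # beta_2 * Cov(X_1, X_2) / Var(X_1) -- essentially zero here
> beta.true[3] * cov(indep[, 2], indep[, 3]) / var(indep[, 2])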

Now let's say the two regressors are correlated. Let the correlation between X_1 and X_2 be 0.8 (with unit variances, the off-diagonal covariance entries below equal the correlation).

> library(mnormt)
> sample.cov <- matrix(c(1, 0.8, 0.8, 1), nrow = 2, ncol = 2)
> x <- rmnorm(N, mean = c(0, 0), varcov = sample.cov)
> x <- cbind(1, x)
> beta.true <- c(2, 3, 3)
> error.var <- 3
> indep <- as.matrix(x)
> dep <- indep %*% beta.true + sqrt(error.var) * rnorm(N)
> fit1 <- lm(dep ~ indep[, c(2, 3)])
> summary(fit1)
Call:
lm(formula = dep ~ indep[, c(2, 3)])
Residuals:
     Min       1Q   Median       3Q      Max 
-6.57254 -1.16425  0.01150  1.15195  6.17648 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.98840    0.01725   115.3   <2e-16 ***
indep[, c(2, 3)]1  3.00990    0.02880   104.5   <2e-16 ***
indep[, c(2, 3)]2  2.99156    0.02860   104.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.725 on 9997 degrees of freedom
Multiple R-squared:  0.918,	Adjusted R-squared:  0.918 
F-statistic: 5.596e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

> fit2 <- lm(dep ~ indep[, c(2)])
> summary(fit2)

Call:
lm(formula = dep ~ indep[, c(2)])

Residuals:
      Min        1Q    Median        3Q       Max 
 -9.20541  -1.66141   0.03458   1.70168  10.26445 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.96109    0.02496   78.58   <2e-16 ***
indep[, c(2)]  5.43369    0.02474  219.60   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.495 on 9998 degrees of freedom
Multiple R-squared:  0.8283,	Adjusted R-squared:  0.8283 
F-statistic: 4.823e+04 on 1 and 9998 DF,  p-value: < 2.2e-16

> fit3 <- lm(dep ~ indep[, c(3)])
> summary(fit3)

Call:
lm(formula = dep ~ indep[, c(3)])

Residuals:
       Min         1Q     Median         3Q        Max 
-11.749998  -1.669863   0.008223   1.679512   9.647873 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.99631    0.02495   80.01   <2e-16 ***
indep[, c(3)]  5.39703    0.02456  219.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.495 on 9998 degrees of freedom
Multiple R-squared:  0.8284,	Adjusted R-squared:  0.8284 
F-statistic: 4.827e+04 on 1 and 9998 DF,  p-value: < 2.2e-16

One can see that the coefficient estimates go badly wrong if you omit a variable from the regression: each short regression reports a slope of about 5.4 instead of the true 3. Omitted variables hurt you whenever the omitted variable is correlated with an included regressor, and in practice this is hard to rule out: in any regression there is always the possibility that a left-out variable is correlated with the explanatory variables you kept.
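The size of the distortion matches the textbook omitted-variable-bias formula: omitting X_2 shifts the X_1 slope toward beta_1 + beta_2 * Cov(X_1, X_2) / Var(X_1). A quick check (a sketch, reusing indep and beta.true from the correlated simulation above):

> # predicted short-regression slope for X_1 when X_2 is omitted:
> # beta_1 + beta_2 * Cov(X_1, X_2) / Var(X_1)
> delta <- cov(indep[, 2], indep[, 3]) / var(indep[, 2])
> beta.true[2] + beta.true[3] * delta
> # with unit variances and correlation 0.8 this is about 3 + 3 * 0.8 = 5.4,
> # matching the slope of roughly 5.43 estimated in fit2 above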

For some vague reason, I had never simulated, tested, and checked for myself what is said about omitted variable bias.