Purpose
This script investigates how omitting the intercept from a model affects R-squared.

Let me simulate an OLS regression.

> set.seed(1977)
> n <- 15000
> beta.actual <- matrix(c(-20, 3), ncol = 1)  # true intercept and slope
> beta.sample <- cbind(rnorm(n, beta.actual[1]), rnorm(n, beta.actual[2]))  # noisy per-observation coefficients
> error <- rnorm(n)
> x <- cbind(rep(1, n), seq(from = 1, to = 5, length.out = n))  # explicit column of 1s plus a trend over [1, 5]
> y <- x[, 1] * beta.sample[, 1] + x[, 2] * beta.sample[, 2] +
+     error
> plot(x[, 2], y)
> summary(lm(y ~ x + 0))
Call:
lm(formula = y ~ x + 0)
Residuals:
       Min         1Q     Median         3Q        Max
-18.107422  -2.126044   0.006147   2.108803  15.303521

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
x1 -19.96603    0.07997  -249.7   <2e-16 ***
x2   2.99550    0.02488   120.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.518 on 14998 degrees of freedom
Multiple R-squared:  0.9146,  Adjusted R-squared:  0.9146
F-statistic: 8.029e+04 on 2 and 14998 DF,  p-value: < 2.2e-16

> summary(lm(y ~ x[, 1] + 0))

Call:
lm(formula = y ~ x[, 1] + 0)

Residuals:
     Min       1Q   Median       3Q      Max
-13.0710  -3.6950  -0.6678   3.1540  20.6075

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
x[, 1] -10.97952    0.04029  -272.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.934 on 14999 degrees of freedom
Multiple R-squared:  0.832,  Adjusted R-squared:  0.832
F-statistic: 7.428e+04 on 1 and 14999 DF,  p-value: < 2.2e-16
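
Note how the constant-only model still reports an R-squared of 0.832, even though a lone constant explains none of the variation about the mean. The reason: when the intercept is omitted, summary.lm measures the total sum of squares about zero rather than about mean(y). A minimal sketch of the two formulas, reconstructed by hand rather than taken from summary.lm itself:

> fit <- lm(y ~ x[, 1] + 0)        # constant-only fit, no intercept term
> rss <- sum(residuals(fit)^2)     # residual sum of squares
> 1 - rss / sum((y - mean(y))^2)   # intercept-style R-squared: exactly 0, since the fitted constant is just mean(y)
> 1 - rss / sum(y^2)               # no-intercept R-squared: ~0.832, as reported above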

I stumbled onto the following statement in a blog post and immediately realized how to get a negative R-squared in a model:

Now I should confess that in order to get models that behaved so badly, I had to ratchet up the error variation and ratchet down the coefficient scale. My error standard deviation was 50 times the standard deviation of the coefficients, but that is somewhat unrealistic. Most models are not going to have negative R-Squares, even if they are overfit.
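
That confession makes sense once you write the population R-squared as signal variance over total variance: the noise has to swamp the signal before the sample estimate can dip below zero. A rough back-of-the-envelope check for the simulation below, my own illustration using the error standard deviation of 10000 used there:

> var.signal <- var(3 * seq(from = 1, to = 5, length.out = 15000))  # slope term only, ~12
> var.noise <- 10000^2                                              # error variance
> var.signal / (var.signal + var.noise)                             # ~1.2e-07: essentially no fit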

> set.seed(1977)
> n <- 15000
> beta.actual <- matrix(c(2, 3), ncol = 1)
> beta.sample <- cbind(rnorm(n, beta.actual[1]), rnorm(n, beta.actual[2]))
> error <- rnorm(n, 0, 10000)  # error sd ramped way up, per the quote above
> x <- cbind(rep(1, n), seq(from = 1, to = 5, length.out = n))
> y <- x[, 1] * beta.sample[, 1] + x[, 2] * beta.sample[, 2] +
+     error
> summary(lm(y ~ x[, 2] + 0))
Call:
lm(formula = y ~ x[, 2] + 0)
Residuals:
      Min        1Q    Median        3Q       Max
-38566.92  -6667.73    -33.89   6785.59  37467.89

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
x[, 2]   -7.041     25.385  -0.277    0.782

Residual standard error: 9994 on 14999 degrees of freedom
Multiple R-squared:  5.129e-06,  Adjusted R-squared:  -6.154e-05
F-statistic: 0.07693 on 1 and 14999 DF,  p-value: 0.7815

The adjusted R-squared is negative (-6.154e-05). This is exactly what I wanted to simulate a few hours ago.
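
As a sanity check, the adjusted value can be recovered from the reported multiple R-squared. For a no-intercept fit, summary.lm adjusts with n rather than n - 1 in the numerator; a reconstruction of that arithmetic, my own sketch rather than part of the original script:

> r2 <- 5.129e-06             # Multiple R-squared reported above
> n <- 15000; p <- 1          # observations; one coefficient, no intercept
> 1 - (1 - r2) * n / (n - p)  # -6.154e-05, matching the summary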

Cool!!

Lesson learnt: if the error variance is far higher than the coefficient variance, then the adjusted R-squared can go negative.