Purpose
This script investigates how omitting the intercept from a model affects R-squared.

Let me simulate an OLS regression.

> set.seed(1977)
> n <- 15000
> beta.actual <- matrix(c(-20, 3), ncol = 1)  # true intercept and slope
> beta.sample <- cbind(rnorm(n, beta.actual[1]), rnorm(n, beta.actual[2]))  # noisy per-observation coefficients
> error <- rnorm(n)
> x <- cbind(rep(1, n), seq(from = 1, to = 5, length.out = n))  # explicit column of 1s plus a trend over [1, 5]
> y <- x[, 1] * beta.sample[, 1] + x[, 2] * beta.sample[, 2] +
+     error
> plot(x[, 2], y)
> summary(lm(y ~ x + 0))
Call:
lm(formula = y ~ x + 0)
Residuals:
       Min         1Q     Median         3Q        Max
-18.107422  -2.126044   0.006147   2.108803  15.303521

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
x1 -19.96603    0.07997  -249.7   <2e-16 ***
x2   2.99550    0.02488   120.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.518 on 14998 degrees of freedom
Multiple R-squared:  0.9146,  Adjusted R-squared:  0.9146
F-statistic: 8.029e+04 on 2 and 14998 DF,  p-value: < 2.2e-16

> summary(lm(y ~ x[, 1] + 0))

Call:
lm(formula = y ~ x[, 1] + 0)

Residuals:
     Min       1Q   Median       3Q      Max
-13.0710  -3.6950  -0.6678   3.1540  20.6075

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
x[, 1] -10.97952    0.04029  -272.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.934 on 14999 degrees of freedom
Multiple R-squared:  0.832,  Adjusted R-squared:  0.832
F-statistic: 7.428e+04 on 1 and 14999 DF,  p-value: < 2.2e-16
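
Note how the constant-only model still reports an R-squared of 0.832, even though a lone constant explains none of the variation about the mean. The reason: when the intercept is omitted, summary.lm measures the total sum of squares about zero rather than about mean(y). A minimal sketch of the two formulas, reconstructed by hand rather than taken from summary.lm itself:

> fit <- lm(y ~ x[, 1] + 0)        # constant-only fit, no intercept term
> rss <- sum(residuals(fit)^2)     # residual sum of squares
> 1 - rss / sum((y - mean(y))^2)   # intercept-style R-squared: exactly 0, since the fitted constant is just mean(y)
> 1 - rss / sum(y^2)               # no-intercept R-squared: ~0.832, as reported above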

I stumbled onto the following statement in a blog post and immediately realized how to get a negative R-squared in a model:

Now I should confess that in order to get models that behaved so badly, I had to ratchet up the error variation and ratchet down the coefficient scale. My error standard deviation was 50 times the standard deviation of the coefficients, but that is somewhat unrealistic. Most models are not going to have negative R-Squares, even if they are overfit.
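
That confession makes sense once you write the population R-squared as signal variance over total variance: the noise has to swamp the signal before the sample estimate can dip below zero. A rough back-of-the-envelope check for the simulation below, my own illustration using the error standard deviation of 10000 used there:

> var.signal <- var(3 * seq(from = 1, to = 5, length.out = 15000))  # slope term only, ~12
> var.noise <- 10000^2                                              # error variance
> var.signal / (var.signal + var.noise)                             # ~1.2e-07: essentially no fit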

> set.seed(1977)
> n <- 15000
> beta.actual <- matrix(c(2, 3), ncol = 1)
> beta.sample <- cbind(rnorm(n, beta.actual[1]), rnorm(n, beta.actual[2]))
> error <- rnorm(n, 0, 10000)  # error sd ramped way up, per the quote above
> x <- cbind(rep(1, n), seq(from = 1, to = 5, length.out = n))
> y <- x[, 1] * beta.sample[, 1] + x[, 2] * beta.sample[, 2] +
+     error
> summary(lm(y ~ x[, 2] + 0))
Call:
lm(formula = y ~ x[, 2] + 0)
Residuals:
      Min        1Q    Median        3Q       Max
-38566.92  -6667.73    -33.89   6785.59  37467.89

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
x[, 2]   -7.041     25.385  -0.277    0.782

Residual standard error: 9994 on 14999 degrees of freedom
Multiple R-squared:  5.129e-06,  Adjusted R-squared:  -6.154e-05
F-statistic: 0.07693 on 1 and 14999 DF,  p-value: 0.7815

The adjusted R-squared is negative (-6.154e-05). This is exactly what I wanted to simulate a few hours ago.
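
As a sanity check, the adjusted value can be recovered from the reported multiple R-squared. For a no-intercept fit, summary.lm adjusts with n rather than n - 1 in the numerator; a reconstruction of that arithmetic, my own sketch rather than part of the original script:

> r2 <- 5.129e-06             # Multiple R-squared reported above
> n <- 15000; p <- 1          # observations; one coefficient, no intercept
> 1 - (1 - r2) * n / (n - p)  # -6.154e-05, matching the summary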

Cool!!

Lesson learnt: if the error variance is far higher than the coefficient variance, then the adjusted R-squared can go negative.