Linear Models with R
I strongly believe that you can learn stats by running two processes in parallel: working through the theory AND simulating data to understand how the theory is implemented. So while learning about linear models, I used this book to learn the R commands for fitting them. The book takes you through nearly all the nuances of a linear model. Let me summarize it.
Estimation
Estimation of various parameters in a linear model from scratch
Lack of identifiability arises when the model matrix X is not of full rank, so that X'X is not invertible.
Check the eigenvalues of X'X, where X is the design matrix. If any eigenvalue is 0 or close to 0, you have a problem of identifiability or collinearity.
A clear lack of identifiability is the better case, because the software throws an error or a clear warning. A situation that is close to unidentifiability is the bigger problem: the fit goes through, and it is the analyst's responsibility to spot the trouble in the inflated standard errors of the model
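A minimal sketch of the eigenvalue check, on simulated data. The variable names (x1 to x4) and the noise level are my own assumptions for illustration, not taken from the book.

```r
# Simulated data: x3 is almost an exact linear combination of x1 and x2.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 1e-4)   # near-collinear column
X  <- cbind(1, x1, x2, x3)            # design matrix with an intercept column

# Eigenvalues of X'X; a value close to zero flags near-unidentifiability.
ev <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values
print(ev)

# Exact collinearity: lm() still fits, but reports NA for the aliased coefficient.
x4  <- x1 + x2
y   <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x4)
print(coef(fit))
```

In the exact case R tells you something is wrong through the NA; in the near-collinear case the only hint is the tiny eigenvalue and, after fitting, very large standard errors.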
If I have two models, M1 and M2, and the only difference between them is one additional explanatory variable, then anova(M1, M2) does an F test for the significance of that variable
When a single parameter is tested this way, |t statistic| = square root(F statistic)
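Here is a small sketch of the model comparison on simulated data (the coefficients 2, 1.5 and 0.5 are arbitrary choices of mine). It prints the F statistic from anova() next to the squared t statistic from the larger model.

```r
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 + 0.5 * x2 + rnorm(n)

M1 <- lm(y ~ x1)         # smaller model
M2 <- lm(y ~ x1 + x2)    # same model plus one extra variable

cmp    <- anova(M1, M2)                                  # F test for x2
F_stat <- cmp$F[2]
t_stat <- summary(M2)$coefficients["x2", "t value"]      # t test for x2 in M2

print(c(F = F_stat, t_squared = t_stat^2))
```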
Inference
Distribution of Beta, mu and variance of beta
Use of I() for hypothesis testing
Use of offset() for hypothesis testing
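A hedged sketch of how I() and offset() are typically used to test hypotheses about coefficients. The data is simulated and the two hypotheses (equal slopes, slope fixed at 0.5) are examples I picked, not ones from the book.

```r
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2)

# H0: the coefficients of x1 and x2 are equal.
# I() makes R treat x1 + x2 as a single regressor.
equal_slopes <- lm(y ~ I(x1 + x2))
print(anova(equal_slopes, full))

# H0: the coefficient of x2 equals 0.5.
# offset() adds 0.5 * x2 to the fit with its coefficient fixed at 1.
fixed_slope <- lm(y ~ x1 + offset(0.5 * x2))
cmp <- anova(fixed_slope, full)
print(cmp)

# The same test done by hand as a t statistic on the full model.
est <- coef(summary(full))["x2", "Estimate"]
se  <- coef(summary(full))["x2", "Std. Error"]
t_manual <- (est - 0.5) / se
print(c(F = cmp$F[2], t_squared = t_manual^2))
```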
From the estimate and standard error behind the t statistic of a standard regression model, one can get a confidence interval for the parameter. One can get the same test by doing the following:
Fit a model with all the variables
Fit a model with all the variables except the one that you want to test
Do an anova of the two models; you get an F statistic that is nothing but the square of the t statistic reported in the regression summary
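To make the confidence interval part concrete, here is a sketch that rebuilds the interval by hand from the estimate, its standard error and a t quantile, and compares it with confint(). It uses the built-in cars dataset, which is my choice of example.

```r
fit <- lm(dist ~ speed, data = cars)

est   <- coef(summary(fit))["speed", "Estimate"]
se    <- coef(summary(fit))["speed", "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))    # two-sided 95% critical value

manual_ci <- c(est - tcrit * se, est + tcrit * se)
print(manual_ci)
print(confint(fit, "speed", level = 0.95))   # should agree with manual_ci
```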
Suppose a bookstore offers a discount that depends on the genre of the book. If I pick up a non-fiction finance book, I can ask two questions
Given that I have chosen a non-fiction finance book, what is the average discount that I can expect?
Given that I have chosen a non-fiction finance book, what is the range within which the discount on this particular book is likely to fall?
The thing to note is that for the former question the variance of the beta estimates suffices, but the latter question needs to account for the error variance too
predict with interval = "confidence" gives an interval for the mean of the response variable at a specific value of the independent variable
predict with interval = "prediction" gives an interval for a new observation of the response variable at a specific value of the independent variable
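A short sketch of the two kinds of interval, using the built-in cars data and an arbitrary speed of 15 (both are my assumptions for illustration). The prediction interval is always the wider one because it adds the error variance.

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)   # the point at which we want the intervals

# Interval for the mean response at speed = 15
conf_band <- predict(fit, newdata = new, interval = "confidence", level = 0.95)

# Interval for a single new observation at speed = 15
pred_band <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

print(conf_band)
print(pred_band)
```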
Diagnostics
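Cook's distance – What is it? How to compute it?
Hat values measure leverage – What is it? How to compute it?
Added variable plots – How to draw one?
Durbin-Watson test – How to compute it?
Leverage measures how far a point's independent variables lie from their mean, so it depends only on the spread of the independent variables and not on the response
Cook's distance measures the influence of a specific point on the fitted coefficients (slope and intercept) of the model
To draw an added variable plot for a predictor, regress the response on all the other predictors, regress the predictor on all the other predictors, and plot the first set of residuals against the second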
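The diagnostics above can be sketched in base R as follows. The mtcars model is my own choice of example; the "by hand" versions are there only to show what the built-in functions compute.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
p   <- length(coef(fit))

# Leverage: diagonal of the hat matrix X (X'X)^{-1} X'
h        <- hatvalues(fit)
X        <- model.matrix(fit)
h_manual <- diag(X %*% solve(crossprod(X)) %*% t(X))

# Cook's distance: squared standardized residual scaled by leverage
d        <- cooks.distance(fit)
d_manual <- rstandard(fit)^2 / p * h / (1 - h)

# Added variable plot for hp: residuals of mpg ~ wt against residuals of hp ~ wt
res_y <- resid(lm(mpg ~ wt, data = mtcars))
res_x <- resid(lm(hp ~ wt, data = mtcars))
plot(res_x, res_y, xlab = "hp | wt", ylab = "mpg | wt")
av_slope <- coef(lm(res_y ~ res_x))[["res_x"]]   # equals the hp coefficient in fit

# Durbin-Watson statistic computed from the residuals (rows taken in data order)
e  <- resid(fit)
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)
```

The slope of the line through the added variable plot is the coefficient of that predictor in the full model, which is what makes the plot useful for spotting points that drive a single coefficient.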
Problem with Predictors
Measurement error in the independent variable biases the estimates; in simple regression the slope is pulled toward zero
Changing the scale of a variable changes the parameter estimates
Change the scale of X: t, F, R-squared and sigma-squared remain the same, whereas beta gets rescaled.
Change the scale of Y: t, F and R-squared remain the same, whereas beta and sigma-squared get rescaled
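A quick sketch to check the scaling rules on the built-in mtcars data. The scale factors 1000 and 10 and the helper name model_stats are hypothetical choices of mine.

```r
fit   <- lm(mpg ~ wt, data = mtcars)
fit_x <- lm(mpg ~ I(wt * 1000), data = mtcars)   # X rescaled by 1000
fit_y <- lm(I(mpg * 10) ~ wt, data = mtcars)     # Y rescaled by 10

# Collect the slope, its t statistic, the overall F, R-squared and sigma.
model_stats <- function(m) {
  s <- summary(m)
  c(beta  = unname(coef(m)[2]),
    t     = s$coefficients[2, "t value"],
    F     = unname(s$fstatistic["value"]),
    R2    = s$r.squared,
    sigma = s$sigma)
}

tab <- rbind(original     = model_stats(fit),
             x_times_1000 = model_stats(fit_x),
             y_times_10   = model_stats(fit_y))
print(tab)
```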
Collinearity
Condition Index
Variance Inflation Factor
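Both collinearity measures can be computed from scratch in base R. The sketch below assumes three mtcars predictors that I picked because they are known to be correlated; it is an illustration, not the book's example.

```r
# Predictor matrix without the intercept column
X <- model.matrix(~ wt + hp + disp, data = mtcars)[, -1]

# Variance inflation factor: VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on the other predictors.
vif <- sapply(colnames(X), function(j) {
  others <- X[, colnames(X) != j, drop = FALSE]
  r2 <- summary(lm(X[, j] ~ others))$r.squared
  1 / (1 - r2)
})
print(vif)

# Condition indices: sqrt(largest eigenvalue / each eigenvalue) of X'X
ev         <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values
cond_index <- sqrt(ev[1] / ev)
print(cond_index)
```

Large values of either measure point to collinearity; where to draw the line is a judgment call.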
Problems with Error
Bootstrap regression
Robust regression
Weighted Least Squares
Generalized Least Squares
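Two of these remedies can be sketched with base R alone. I simulate errors whose standard deviation grows with x and assume the variance structure is known, which is rarely true in practice. Note that the bootstrap shown resamples cases (pairs), which is one of several bootstrap schemes. Robust regression and GLS usually rely on add-on packages, so I leave them out here.

```r
set.seed(4)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)     # error sd proportional to x

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)   # weights proportional to 1 / variance

# WLS by hand: solve (X'WX) b = X'Wy
Xm <- cbind(1, x)
w  <- 1 / x^2
b_manual <- solve(t(Xm) %*% (w * Xm), t(Xm) %*% (w * y))

# Case-resampling bootstrap of the OLS slope
B <- 500
boot_slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))[[2]]
})
print(quantile(boot_slopes, c(0.025, 0.975)))   # percentile interval for the slope
```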
Transformation
In reality, there will be non constant error variance
Log transformation is one of the easiest
Log-log is also good when the relationship is multiplicative, as it turns the nonlinear relationship into a linear one. Yvonne Bishop is credited with the development of log-linear models
Box-Cox transformation of the response variable, so that the transformed response is more tractable analytically
Build a confidence interval for the Box-Cox parameter lambda. Based on the interval you can decide whether you really need a transformation: if it contains 1, you do not
Logit transformation in cases where the y variable is a proportion
Fisher z transformation in cases where the y variable is a correlation
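A sketch of the Box-Cox interval using boxcox() from the MASS package, which ships with standard R installations. The data is simulated so that the log of y is linear in x; the grid of lambda values is my own choice.

```r
library(MASS)   # provides boxcox()

set.seed(5)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.2))   # log(y) is linear in x

fit <- lm(y ~ x)
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: all lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
inside    <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[inside])

print(lambda_hat)
print(lambda_ci)
```

If the interval excludes 1, a transformation is worth considering; an interval around 0 suggests the log.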
Hockey Stick regression
Spline regression
Polynomial regression
Orthogonal regression
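The first three of these fit inside lm() with a suitable basis. A minimal sketch on simulated data, assuming the knot location (5) is known; the pmax() basis is one common way to write a broken-line fit, and bs() comes from the splines package bundled with R.

```r
library(splines)   # provides bs()

set.seed(6)
x    <- seq(0, 10, length.out = 120)
knot <- 5          # assumed known for this illustration
y    <- 1 + 0.2 * x + 1.5 * pmax(x - knot, 0) + rnorm(length(x), sd = 0.5)

straight <- lm(y ~ x)

# Hockey stick: slope is allowed to change at the knot, line stays continuous
hockey <- lm(y ~ x + pmax(x - knot, 0))

# Polynomial regression with orthogonal polynomial terms
quad <- lm(y ~ poly(x, 2))

# Cubic B-spline regression with 6 basis functions
spl <- lm(y ~ bs(x, df = 6))

print(sapply(list(straight = straight, hockey = hockey, quad = quad, spline = spl),
             deviance))   # residual sum of squares of each fit
```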
Variable Selection
Forward + Backward + Stepwise regression
Information criteria such as AIC and BIC
Mallows' Cp criterion
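A sketch of these selection tools using step() from base R on mtcars. The candidate predictors are my own pick, and the cp() helper is a hypothetical function that implements the usual Mallows' Cp formula, RSS / sigma^2 + 2p - n, with sigma^2 taken from the full model.

```r
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)
n    <- nrow(mtcars)

# Backward elimination by AIC (default penalty k = 2)
back_aic <- step(full, direction = "backward", trace = 0)

# Same search with the BIC penalty k = log(n)
back_bic <- step(full, direction = "backward", trace = 0, k = log(n))

# Forward selection starting from the intercept-only model
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Mallows' Cp relative to the full model
cp <- function(m) deviance(m) / summary(full)$sigma^2 + 2 * length(coef(m)) - n

print(formula(back_aic))
print(formula(back_bic))
print(formula(fwd))
print(c(cp_full = cp(full), cp_selected = cp(back_aic)))
```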
Shrinkage Methods
PCA based regression
Partial Least Squares
Ridge Regression
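Ridge and principal components regression are simple enough to write out by hand. In this sketch the penalty lambda = 10, the number of components k = 2 and the helper name ridge_coef are arbitrary or hypothetical choices of mine; in practice both tuning values would be chosen by something like cross-validation.

```r
# Standardized predictors and centred response, so no intercept is needed
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Ridge estimate: (X'X + lambda I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
b_ols   <- ridge_coef(X, y, 0)    # lambda = 0 reproduces least squares
b_ridge <- ridge_coef(X, y, 10)   # shrunken coefficients
print(cbind(ols = as.numeric(b_ols), ridge = as.numeric(b_ridge)))

# PCA-based regression: regress y on the first k principal component scores
pc     <- prcomp(X)
k      <- 2
scores <- pc$x[, 1:k, drop = FALSE]
pcr    <- lm(y ~ scores - 1)
print(coef(pcr))
```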
Missing Treatment
Removal of outliers
Imputing mean
Regression fill in method
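A small sketch of mean imputation and regression fill-in, next to the default behaviour of lm(), which drops incomplete rows. I delete six hp values from mtcars at random purely for illustration.

```r
dat <- mtcars[, c("mpg", "wt", "hp")]
set.seed(7)
miss <- sample(nrow(dat), 6)
dat$hp[miss] <- NA                      # artificially create missing values

# Default: lm() drops the rows with missing values
cc <- lm(mpg ~ wt + hp, data = dat)

# Mean imputation: replace each missing hp by the mean of the observed hp
mean_fill <- dat
mean_fill$hp[miss] <- mean(dat$hp, na.rm = TRUE)

# Regression fill-in: predict the missing hp from the other predictor
imp_model <- lm(hp ~ wt, data = dat)
reg_fill  <- dat
reg_fill$hp[miss] <- predict(imp_model, newdata = dat[miss, ])

print(rbind(complete_case = coef(cc),
            mean_imputed  = coef(lm(mpg ~ wt + hp, data = mean_fill)),
            reg_imputed   = coef(lm(mpg ~ wt + hp, data = reg_fill))))
```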
ANCOVA + ANOVA
One of the best ways to check for seasonality is to use ANOVA with the season as a factor and build confidence bands for the parameters, each of which is a difference of means between 2 levels
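A sketch of the seasonality check on simulated quarterly data. The quarter effects are made up, and TukeyHSD() is one way (my choice) to get simultaneous bands for every pairwise difference of means.

```r
set.seed(8)
quarter <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), times = 10))
effect  <- c(Q1 = 0, Q2 = 1, Q3 = 3, Q4 = 1)          # hypothetical seasonal effects
sales   <- 10 + unname(effect[as.character(quarter)]) + rnorm(length(quarter))

fit <- aov(sales ~ quarter)
print(summary(fit))      # overall F test for a quarter effect

# Confidence bands for the difference of means between each pair of quarters
bands <- TukeyHSD(fit, "quarter", conf.level = 0.95)
print(bands)
```

A band that excludes 0 points to a real difference between those two quarters.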
Two Way ANOVA and Fractional Design
Insurance Redlining Case
Aggregation bias (the ecological fallacy): conclusions drawn at the group level do not necessarily extend to the individual level
Steps
Diagnostics
Skewness
Variation of the independent variable
Stripcharts to get an idea about the variation
Pairs to see the cross correlation
Fit an MLR
Hat values describe the variation in the independent variables
Cook's distance describes the influence of each point on the slope and intercept of the model.
Use partial regression plots and partial residual plots to check for the need of transformation
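The steps above can be strung together roughly as follows. This is only a sketch on mtcars, which is not the dataset of the case study; every function used is from base R.

```r
vars <- mtcars[, c("mpg", "wt", "hp")]

# Look at the variation of each variable and at the pairwise relationships
stripchart(vars$wt, method = "jitter", xlab = "wt")
pairs(vars)

# Fit a multiple linear regression
fit <- lm(mpg ~ wt + hp, data = vars)

# Leverage and influence
h <- hatvalues(fit)
d <- cooks.distance(fit)
print(head(sort(d, decreasing = TRUE), 3))   # three most influential cases

# Partial residual plots, one per predictor, to judge the need for a transformation
pr <- residuals(fit, type = "partial")
termplot(fit, partial.resid = TRUE)
```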