Statistics Done Wrong

cover

This book is vastly different from the books that try to warn us against incorrect statistical arguments present in media and other mundane places. Instead of targeting newspaper articles, politicians, journalists who make errors in their reasoning, the author investigates research papers, where one assumes that scientists and researchers make flawless arguments, at least from stats point of view. The author points a few statistical errors, even in the pop science book, “How to lie with statistics?’’. This book takes the reader through the kind of statistics that one comes across in research papers and shows various types of flawed arguments. The flaws could arise because of several reasons such as eagerness to publish a new finding without thoroughly vetting the findings, not enough sample size, not enough statistical power in the test, inference from multiple comparisons etc. The tone of the author isn’t deprecatory. Instead he explains the errors in simple words. There is minimal math in the book and the writing makes the concepts abundantly clear even to a statistics novice. That in itself should serve as a good motivation for a wider audience to go over this 130 page book.

In the first chapter, the author introduces the basic concept of statistical significance. The basic idea of frequentist hypothesis testing is that it is dependent on p value that measure Probability(data|Hypothesis). In a way, p value measures the amount of surprise that you find in the data given that you have a specific null hypothesis in mind. If the p value turns out to be too less, then you start doubting your null and reject the null. The procedure at the outset looks perfectly logical. However one needs to keep in mind, the things that do not form a part of p value such as,

It does not per se measure the size of the effect.
Two experiments with identical data can give different p values. This is disturbing as one assumes that p value somehow knows the intention of the person doing the experiment.
It does not say anything about the false positive rate.

By the end of the first chapter, the author convincingly rips apart p value and makes a case for using confidence intervals. He also says that many people do not report confidence intervals because they are often embarrassingly wide and might make their effort a fruitless exercise.

The second chapter talks about statistical power, a concept that many introductory stats courses do not delve in to, appropriately. The statistical power of a study is the probability that it will distinguish an effect of a certain size from pure luck. The power depends on three factors

size of the bias you are looking for
sample size
measurement error

If an experiment is trying to test a subtle bias, then there needs to be far more data to even detect it. Usually the accepted power for an experiment is 80%. This means that the probability of bias detection is close to 80%. In many of the tests that have negative results, i.e the alternate is rejected, it is likely that the power of test is compromised. Why do researchers fail to take care of power in their calculations? The author guesses that it could be because the researcher’s intuitive feeling about samples is quite different from the results of power calculations. The author also ascribes to the not so straightforward math required to compute the power of study.

The problems with power also plague the other side of experimental results. Instead of detecting the true bias, the results often show inflation of true result, called M errors, where M stands for magnitude. One of the suggestions given by the author is : Instead of computing the power of a study for a certain bias detection and certain statistical significance, the researchers should instead look for power that gives narrower confidence intervals. Since there is no readily available term to describe this statistic, the author calls it assurance, which determines how often the confidence intervals must beat a specific target width. The takeaway from this chapter is that whenever you see a report of significant effect, your reaction should not be “Wow, they found something remarkable", but it needs to be, “Is the test underpowered ?”. Also just because alternate was rejected doesn’t mean that alternate is absolute crap.

The third chapter talks about pseudo replication, a practice where the researcher uses the same set of patients/animals/ whatever to create repeated measurements. Instead of bigger sample sizes, the researcher creates a bigger sample size by repeated measurements. Naturally the data is not going to be independent as the original experiment might warrant. Knowing that there is a pseudo replication of the data, one must be careful while drawing inferences. The author gives some broad suggestions to address this issue

The fourth chapter is about the famous base rate fallacy where one ascribes the p value to the probability of alternate being true. Frequentist procedures that give p values merely talk about the surprise element. In no way do they actually talk about the probability of alternate treatment in a treatment control experiment. The best way to get a good estimate of probability that a result is false positive, is by considering prior estimates. The author also talks about Benjamini-Hochberg procedure, a simple yet effective procedure to control for false positive rate. I remember reading this procedure in an article by Brad Efron titled, “The future of indirect evidence”, in which Efron highlights some of the issues related to hypothesis testing in high dimensional data.

The fifth chapter talks about the often found procedure of testing two drugs with a placebo and using the results to compare the efficiency of two drugs. Various statistical errors can creep in. These are thoroughly discussed. The sixth chapter talks about double dipping, i.e. using the same data to do exploratory analysis and hypothesis testing. It is the classic case of using in-sample statistics to extrapolate out-of-sample statistics. The author talks about arbitrary stopping rules that many researchers employ for cutting short an elaborate experiment when they find statistically significant findings at the initial stage. Instead of having a mindset which says, “I might have been lucky in the initial stage”, the researchers over enthusiastically stops the experiment and reports truth inflated result. The seventh chapter talks about the dangers of dichotomizing continuous data. In many research papers, there is a tendency to divide the data in to two groups and run significance tests or ANOVA based tests, thus reducing the information available from the dataset. The author gives a few examples where dichotomization can lead to grave statistical errors.

The eighth chapter talks about basic errors that one does in doing regression analysis. The errors highlighted are

over reliance on stepwise regression methods like forward selection or backward elimination methods
confusing correlation and causation
confounding variables and Simpson’s paradox

The last few chapters gives general guidelines to improve research efforts, one of them being “reproducible research”.

Takeaway

Even though this book is a compilation of various statistical errors committed by researchers in various scientific fields, it can be read by anyone whose day job is data analysis and model building. In our age of data explosion, where there are far more people employed in analyzing data and who need not necessarily publish papers, this book would be useful to a wider audience. If one wants to go beyond the simple conceptual errors present in the book, one might have to seriously think about all the errors mentioned in the book and understand the math behind them.