Statistical Modeling: The Two Cultures
The paper “Statistical Modeling: The Two Cultures”, by Leo Breiman (the man behind CART), was published in 2001, a time when machine learning techniques were just starting to become popular. In the last decade or so, the field has exploded. Many graduate and PhD programs now offer full-fledged courses on the subject, and there are dedicated journals that publish ML research from a wide range of fields. Larry Wasserman attributes the quick rise of the field to its “conference culture”, an environment in which an ML researcher is valued based on his or her conference publications. The thought process and the activities that precede a conference publication are different from those behind the usual journal publication, and this lets an ML researcher work on many more interesting problems, usually from a more diverse set of areas, than a typical statistician.
Those who acknowledge that the field of statistics has a significant competitor have quickly changed their ways of research by learning and incorporating ML techniques. This paper is a wake-up call to all those statisticians who have not yet made the transition.
The two cultures being referred to are the “data modeling culture” and the “algorithmic modeling culture”. In the former, the analysis starts by assuming a stochastic model for the data-generating process, and the model is evaluated on how well it fits the data. In the latter, the model is evaluated on how well it predicts.
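To make the first culture concrete, here is a minimal sketch of the data-modeling workflow. The linear-Gaussian setup, the simulated data, and the NumPy code are illustrative assumptions on my part, not an example from the paper: a stochastic model is posited for how nature produces y, it is fit to the data, and it is then judged by how well it fits rather than by how well it predicts.

```python
import numpy as np

# Data-modeling culture: posit a stochastic model for how nature produces y,
# here y = b0 + b1*x1 + b2*x2 + Gaussian noise (an assumed mechanism).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Fit the assumed linear model by least squares.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# The model is judged by how well it fits the data it was estimated on:
# R^2 and a crude residual check, not predictive accuracy on unseen data.
residuals = y - X_design @ beta
r2 = 1.0 - residuals.var() / y.var()
print("estimated coefficients:", np.round(beta, 2))
print("in-sample R^2:", round(r2, 3))
print("residual std (compare to the assumed noise scale of 1.0):", round(residuals.std(), 3))
```

The algorithmic counterpart, where no such stochastic model is assumed, is sketched further down, after the paper's description of that culture.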
The author, a statistician by training, spends 13 years in full-time consulting. His experience across consulting projects shapes his opinions about modeling and its uses, which he summarizes in the paper:
- Focus on finding a good solution; that is what consultants get paid for.
- Live with the data before you plunge into modeling.
- Search for a model that gives a good solution, whether algorithmic or a data model.
- Predictive accuracy on test sets is the criterion for how good the model is.
- Computers are an indispensable partner.
After this long stint in consulting, the author returns to Berkeley and finds the usual way of dealing with problems, building data models, extremely limiting. Why so?
- The conclusions from data models are about the model’s mechanism, not about nature’s mechanism. If the model is a poor emulation of nature, the conclusions may be wrong.
- Models are often built because they are attractive from a mathematical point of view; that need not be, and often is not, how reality works.
- Should a data model be used at all if the sample is the entire population?
- There are many problems with using goodness-of-fit tests and residual analysis to check a data model; they lack power in more than a few dimensions.
- Usually multiple models fit the data equally well. One reason this multiplicity goes unnoticed is that goodness-of-fit and other checks give only a yes-no answer, which provides no way to compare the models that pass (see the sketch after this list).
- Standard models such as linear regression, the Cox model, and logistic regression are doing to statistics what p-values have done: they are more misused than used.
- With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression for classification, and multiple linear regression for regression. Nobody really believes that multivariate data is multivariate normal, yet that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.
- As the author puts it: “If all a man has is a hammer, then every problem looks like a nail.” The trouble for statisticians is that recently some of the problems have stopped looking like nails. He conjectures that the result of hitting this wall is that more complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature’s mechanism.
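The multiplicity point is easy to reproduce. Here is a minimal sketch, with a correlated-predictor setup that is my own illustrative assumption rather than an example from the paper: when two predictors are near-copies of each other, a model built on either one clears the same goodness-of-fit bar, yet the two models tell different stories about the mechanism.

```python
import numpy as np

# Two highly correlated predictors; "nature" actually uses only x1.
rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # near-copy of x1
y = 3.0 * x1 + rng.normal(size=n)

def r_squared(x, y):
    """Fit y ~ a + b*x by least squares and return R^2."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

# Both single-predictor models fit almost equally well, so a yes-no
# goodness-of-fit check cannot distinguish them, even though they
# attribute the effect to different variables.
print("R^2 using x1:", round(r_squared(x1, y), 3))
print("R^2 using x2:", round(r_squared(x2, y), 3))
```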
The paper states the theory behind algorithmic modeling as follows:
Data models are rarely used in this community. The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x’s that go in and a subsequent set of y’s that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y. The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their “strength” as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn i.i.d. from an unknown multivariate distribution.
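Here is the algorithmic half of the contrast, again only a sketch under assumptions of my own (simulated data, scikit-learn, and a random forest, which happens to be Breiman’s own algorithm but is not code from the paper): no stochastic model is assumed for the black box; an f(x) is learned and judged purely by its predictive accuracy on a held-out test set drawn i.i.d. from the same distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Nature's black box: some unknown mechanism producing i.i.d. (x, y) pairs.
# Here it is simulated with a nonlinear function the model never sees.
rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(-3, 3, size=(n, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)

# No model of the data-generating process is assumed; we only look for an
# f(x) that predicts well on future draws from the same distribution.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# The criterion is predictive accuracy on the held-out test set.
mse = mean_squared_error(y_test, forest.predict(X_test))
print("test-set mean squared error:", round(mse, 3))
```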
The author then describes three projects from his consulting career that give a flavor of algorithmic modeling.
Toward the end of the paper, the author dispels a common myth about ML models: that they cannot be interpreted or used as descriptive models, i.e. that they are good for prediction but useless for understanding. He says that the goal of any modeling project should be to obtain useful information about the relation between the response and the predictor variables. Interpretability is one way of getting that information, but a model does not have to be simple to provide reliable information about the relation between the predictors and the response; neither does it have to be a data model. The justification for this kind of thinking is provided via three examples, and the takeaways from these examples are:
- Higher predictive accuracy is associated with more reliable information about the underlying data mechanism; weak predictive accuracy can lead to questionable conclusions.
- Algorithmic models can give better predictive accuracy than data models, and can also provide better information about the underlying mechanism (see the sketch after this list).
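One concrete way an algorithmic model yields information about the mechanism is variable importance, which the paper discusses in the context of random forests. The sketch below is an illustrative assumption of mine (simulated data, scikit-learn’s impurity-based importances), not one of the paper’s examples: a black-box forest still tells us which predictors drive the response.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated mechanism: y depends strongly on x0, weakly on x1, not at all on x2-x4.
rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 5))
y = 4.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

forest = RandomForestRegressor(n_estimators=300, random_state=0)
forest.fit(X, y)

# Impurity-based importances: information about which predictors matter,
# extracted from a black-box model rather than from a fitted data model.
for name, imp in zip(["x0", "x1", "x2", "x3", "x4"], forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
```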
The author urges statisticians to move away from “data models” and towards “algorithmic models”, so that the field of statistics does not end up as a marginal spectator to the ML revolution that is happening everywhere.