
The entire book is organized as 20 small, bite-sized chapters. Each chapter focuses on one specific idea and explains it via visuals (as is obvious from the title).

The author starts off by explaining the basic idea of Random Forests, i.e. a collection of decision trees generated via randomization. The randomness comes from two sources: each tree is trained on a random subset of the training dataset, and a random set of attributes is considered when splitting the data. Before understanding random forests, though, one first needs to understand decision trees.

To make the idea concrete, the book uses a simple example of classifying a set of fruits based on length, width and color. A decision tree for classifying the fruits is first drawn by hand using these three attributes. If the data is sufficiently well separated across the various dimensions, one can use basic intuition to construct such a tree, and the result of this manual process is shown over several iterations. There are a few immediate takeaways from the exercise. First, there are regions of the decision space where training samples from several classes overlap, which makes it difficult to say anything conclusive about those regions except in a probabilistic sense. Second, the natural boundaries between classes need not be horizontal or vertical, whereas a decision tree can only carve up the space with axis-aligned (horizontal or vertical) splits. To generate a decision tree in Python, all it takes is a few lines of code. Thanks to the massive effort behind the sklearn library, there is a standardized template one can use to build many kinds of models.
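The book's actual fruit dataset is not reproduced here, but a minimal sketch along the same lines, with a handful of made-up fruit measurements, shows just how few lines sklearn needs:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the book's fruit data:
# each row is (length_cm, width_cm, color_code), where 0 = green, 1 = yellow, 2 = red
X = [[9.0, 3.0, 1],   # banana
     [7.5, 7.0, 2],   # apple
     [7.0, 6.8, 0],   # apple
     [9.5, 3.2, 1],   # banana
     [4.0, 4.0, 2]]   # plum
y = ["banana", "apple", "apple", "banana", "plum"]

tree = DecisionTreeClassifier()
tree.fit(X, y)

print(tree.predict([[8.8, 3.1, 1]]))  # classify a new fruit measurement
```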

One of the disadvantages of decision trees is overfitting. This limitation can be mitigated by using Random Forests. The good thing about this book is that all the data used for illustration is available on the web, so one can replicate each and every figure in the book with a few lines of Python code. Generating a Random Forest via sklearn is likewise just a few lines of code. Once you have fit a Random Forest, the first thing you might want to do is predict. There are two ways to obtain predictions from a Random Forest: one is to pick the class predicted by the most trees, and the other is to obtain class probabilities. One of the biggest disadvantages of decision trees and Random Forest models is that they cannot extrapolate beyond the training data.
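As a rough sketch (using synthetic data rather than the book's dataset), fitting a forest and getting both kinds of predictions in sklearn looks like this:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the book's dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:3]))        # predicted class labels
print(forest.predict_proba(X[:3]))  # class probabilities estimated by the forest
```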

The randomness in a random forest also comes from the way attributes are chosen for the splitting criterion. The general rule is to pick a random set of attributes and build each decision tree based on these randomly selected attributes. Since every tree is built from a different randomized set of attributes, there is less scope for overfitting.
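In sklearn this randomization is exposed through the max_features parameter; note that sklearn draws the random attribute subset afresh at every split rather than once per tree, which is a slight variation on the per-tree description above:

```python
from sklearn.ensemble import RandomForestClassifier

# max_features controls how many attributes are considered at each split.
# "sqrt" (square root of the feature count) is a common choice for classification;
# "log2", a float fraction such as 0.5, or an explicit integer also work.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)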

One of the ways to check the validity of a model is to check its performance on a test set; usually one employs cross-validation. In the case of random forests, a convenient alternative is the Out-of-Bag (OOB) estimate. The basic idea of the OOB error estimate is this: each tree is grown on a bootstrap sample of the N data points, which on average contains only about 2/3 of them. The remaining 1/3 of the dataset is therefore available as test data for that tree. One can compute the error of each tree on its out-of-bag samples and average across all the trees, giving rise to the Out-of-Bag error estimate.
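sklearn can compute this for you during fitting; a minimal sketch, again on synthetic stand-in data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)  # synthetic stand-in

# oob_score=True scores each sample only on the trees
# that did not see it in their bootstrap draw
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print(forest.oob_score_)  # out-of-bag accuracy estimate
```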

The author also mentions two criteria by which a decision tree split can be chosen, the Gini criterion and the entropy criterion; both are impurity measures. One of the interesting features of the Random Forest algorithm is that it gives you relative importance scores for the various attributes. Even though these are relative measures and there are some caveats in interpreting them, they can be useful in certain situations.
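Both ideas map directly onto sklearn; a small sketch on synthetic data (not the book's):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

# criterion can be "gini" (the default) or "entropy"
forest = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
forest.fit(X, y)

# Relative importance of each attribute; the scores sum to 1
for i, score in enumerate(forest.feature_importances_):
    print(f"feature {i}: {score:.3f}")
```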

The author ends the book with a broad set of guidelines for using Random Forests:

  • Use cross-validation or out of bag error to judge how good the fit is
  • Better data features are the most powerful tool you have to improve your results. They trump everything else
  • Use feature importances to determine where to spend your time
  • Increase the number of trees until the benefit levels off
  • Set the random seed before generating your Random Forest so that you get reproducible results each time
  • Use predict() or predict_proba() as appropriate, depending on whether you want just the most likely answer or the probability of every answer
  • Investigate limiting branching by
    • Number of splits
    • Number of data points in a branch to split
    • Number of data points in the final resulting leaves
  • Investigate different numbers of features for each split to find the optimum. The default in many implementations of Random Forest is the square root of the number of features, but other choices, such as a percentage of the features or the base-2 logarithm of the number of features, can also work well (a small tuning sketch follows this list)
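Several of these guidelines (limiting branching, varying the features per split, setting a seed) come down to hyperparameter tuning. A rough sketch of how one might search over them with sklearn's GridSearchCV, on synthetic stand-in data and with an illustrative rather than recommended grid:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # synthetic stand-in

# Illustrative grid covering the guidelines above
param_grid = {
    "n_estimators": [100, 200],            # increase until the benefit levels off
    "max_features": ["sqrt", "log2", 0.5],  # sqrt, log2, or a fraction of the features
    "max_depth": [None, 5],                # limit the number of splits
    "min_samples_split": [2, 10],          # data points needed to split a branch
    "min_samples_leaf": [1, 5],            # data points needed in a leaf
}

# Fixing random_state keeps the results reproducible from run to run
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```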