Quantitative investing

Random forest

Random forest (RF) is a popular machine learning algorithm.¹ Its simplicity and versatility make it one of the most widely used learning algorithms for both regression and classification. It is used in many applications, including tasks as diverse as object recognition, credit risk assessment or purchase recommendations based on prior customer behavior.

In practice, an RF builds a large number of individual decision trees. A decision tree models possible options and their respective outcomes as a tree-shaped structure. It is a graphical way to represent an algorithm that contains only conditional control statements. Each tree is built from a random sample of the observations in the broader dataset, typically drawn with replacement, and usually considers only a random subset of the input variables at each split.
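To make the two ideas in this paragraph concrete, here is a minimal sketch in Python. The function `credit_risk_tree`, its thresholds and the toy dataset are hypothetical and chosen purely for illustration. Drawing the sample with replacement (a bootstrap sample) is one common way to draw it, assumed here.

```python
import random

def credit_risk_tree(debt_to_income: float, missed_payments: int) -> str:
    """Toy decision tree: every branch is a plain conditional test.

    The variables and thresholds are made up for illustration only.
    """
    if missed_payments > 2:
        return "high"
    if debt_to_income > 0.4:
        return "medium" if missed_payments == 0 else "high"
    return "low"

def bootstrap_sample(rows, rng):
    """Draw len(rows) observations at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical observations: (debt_to_income, missed_payments)
dataset = [(0.1, 0), (0.5, 1), (0.3, 4), (0.6, 0), (0.2, 1)]

# Each tree in a forest would be fitted on its own sample like this one.
sample = bootstrap_sample(dataset, rng)
```

Because the sample is drawn with replacement, some observations may appear several times and others not at all, which is what makes the individual trees differ from one another.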

The RF then aggregates the predictions of the individual trees to get a more accurate and stable prediction. Training each tree on its own random sample and then combining the results is called ‘bagging’ (short for bootstrap aggregating). When the outcome is a number (for example, the expected return of a given stock), the tree predictions are averaged. When the outcome is a class variable (for example, ‘true’ or ‘false’, or a type of object), the forest takes a majority vote.
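The two aggregation rules can be sketched in a few lines of Python. The function names and the per-tree predictions below are hypothetical. A real forest would obtain them from fitted trees.

```python
from collections import Counter
from statistics import mean

def aggregate_regression(tree_predictions):
    """Numeric outcome: average the predictions of the individual trees."""
    return mean(tree_predictions)

def aggregate_classification(tree_votes):
    """Class outcome: return the label predicted by the most trees.

    Ties go to the label that was seen first among the tied ones.
    """
    return Counter(tree_votes).most_common(1)[0][0]

# Hypothetical per-tree outputs, for illustration only.
expected_return = aggregate_regression([0.04, 0.06, 0.05, 0.03])
direction = aggregate_classification(["up", "down", "up", "up", "down"])
```

In this toy case the averaged return is about 0.045 and the majority vote is "up", since three of the five trees predicted it.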

To use a simple analogy, let’s imagine someone wants to buy a car and seeks advice from friends. The first friend may ask about the type of powertrain the person may be interested in, depending on the type of intended use (long vs. short distances, daily use vs. holidays only, city vs. countryside) and may come up with a recommendation based on the answers given to these possible choices.

The second friend may ask about the desired driving experience and come up with a very different decision tree (high vs. low driving position, quiet vs. sporty). The third friend may have more of an affinity for design and would therefore ask a series of questions about the desired shape of the vehicle. And so on. In the end, the person will choose the car that was most frequently recommended.

The advantages of RFs include a lower risk of overfitting than with a single decision tree, improved prediction accuracy and results that tend to remain relatively stable as datasets grow. Their main drawback is that a large number of trees can make the algorithm too slow, and therefore ineffective, for real-time predictions.

In the asset management industry, random forest algorithms are increasingly used for a number of machine learning applications, such as forecasting stock returns² or predicting distress risk.³

Footnotes

¹ Breiman, L., 2001, “Random forests”, Machine learning, Vol. 45, No. 1, pp. 5–32.
² See for example: Dixon, M., Klabjan, D. and Bang, J. H., 2017, “Classification-based financial markets prediction using deep neural networks”, Algorithmic Finance. See also: Khaidem, L., Saha, S. and Dey, S. R., 2016, “Predicting the direction of stock market prices using random forest”, working paper.
³ See for example: Shen, F., Liu, Y., Lan, D. and Li, Z., 2019, “A dynamic financial distress forecast model with time-weighting based on random forest”. In: Xu, J., Cooke, F., Gen, M. and Ahmed, S. (eds), “Proceedings of the twelfth international conference on management science and engineering management”.