Data mining is a threat to empirical research. Campbell R. Harvey is Professor of Finance at Duke University and Investment Strategy Advisor at Man Group plc. In recent years, he has warned that academic journals have a strong bias towards publishing papers with positive results, and that this incentivizes quantitative researchers to engage in ‘p-hacking’. We talked to him about this issue and its serious consequences for investors.
“Editors want their journals to have the highest possible impact factor. This is based on the number of citations of the articles they publish, and studies that support the hypothesis being tested tend to receive more citations. Authors understand this and want to produce papers with positive results. It is also more enjoyable to work on research that supports the hypothesis being tested, which is why researchers engage in data dredging to find results that exceed traditional levels of significance. As a result, I estimate that over 50% of all empirical studies in finance are unlikely to hold up in the future.”
‘Over 50% of all empirical studies in finance are unlikely to hold up in the future’
“In my 2017 presidential address1 to the American Finance Association (AFA), I detailed many of the ways that researchers engage in ‘p-hacking’ – trying to achieve the lowest possible p-value, meaning the highest level of significance. Some examples of the tools found in the p-hacker’s bag of tricks are: selective reporting of results; selective sample size; arbitrary transformations of data; arbitrary ‘winsorization’ and outlier exclusion rules; and arbitrary selection of statistical tests. P-hacking reduces the chance that any result will hold up ‘out-of-sample’.”
“No, not really, because people won’t want to publish in ‘The Journal of Non-Results’, nor are they likely to be rewarded for publishing in such a journal. Instead, I have advocated a concept called ‘Registered Reports’. Here, a researcher pitches an idea to an editor. The idea is peer reviewed. If the reviews are positive, the editor makes a commitment to publish a paper, no matter what the results are. This solves a few problems. First, it allows researchers to pursue risky research that often requires very costly data collection, which they might not otherwise embark on if they believe there is a large probability the result will be negative. Second, researchers still get to publish articles in the top journals even if they might have a negative result.”
“I have put forward four different ideas in my research. First, three papers2 I co-authored argued that we need to deal head on with the issue of multiple tests. That is, if you test 20 random factors, one will show up as ‘significant’ by chance. So these papers argue that the cut-off for significance needs to be raised from two standard errors to three. Other fields, such as particle physics and genome association studies, have even higher thresholds.”
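The multiple-testing problem described above is easy to simulate. The sketch below is a hypothetical Monte Carlo with simulated, zero-mean factor returns; the sample sizes and trial counts are illustrative choices, not taken from the papers. It shows how often at least one of 20 genuinely useless factors clears a two-standard-error bar versus a three-standard-error bar.

```python
import numpy as np

rng = np.random.default_rng(0)

n_factors = 20    # random factors whose true mean return is zero
n_periods = 240   # e.g. 20 years of monthly returns (illustrative)
n_trials = 2000   # Monte Carlo repetitions

false_2sigma = 0
false_3sigma = 0
for _ in range(n_trials):
    # Simulated returns with no real effect in any factor
    returns = rng.standard_normal((n_periods, n_factors))
    t_stats = returns.mean(axis=0) / (
        returns.std(axis=0, ddof=1) / np.sqrt(n_periods)
    )
    # Did at least one useless factor look 'significant'?
    false_2sigma += (np.abs(t_stats) > 2.0).any()
    false_3sigma += (np.abs(t_stats) > 3.0).any()

print(f"P(>=1 'significant' factor) at 2 sigma: {false_2sigma / n_trials:.2f}")
print(f"P(>=1 'significant' factor) at 3 sigma: {false_3sigma / n_trials:.2f}")
```

With 20 tests, the two-sigma cut-off produces at least one false discovery most of the time; the three-sigma cut-off keeps the family-wise error rate near conventional levels.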
“Second, in another paper3, I advocated a bootstrapping-based approach. Place all the factors in a spreadsheet and then strip out the mean for each one, so the average return is exactly zero. Now we have a universe of factors where each factor is false because we have hardwired a zero average return. Then create a new history using random sampling by replacing different rows and, when finished, calculate the average returns of each of the factors. They will not be zero in this new history. Save the factor return that is the highest: this is what you get with pure chance. Repeat the exercise a thousand times and, each time, save the best return you get by chance. Look at the distribution of these best factor returns, generated from a universe of factors that we have hardwired so none are true factors, and pick off the 95th percentile. The real best factor needs to beat this 95th percentile of what you can get purely by luck.”
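The bootstrap procedure just described can be sketched in a few lines. Everything below is illustrative: the factor panel is simulated, and its dimensions and return scale are arbitrary choices rather than anything taken from the ‘Lucky Factors’ paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical factor-return panel: 240 months x 15 factors
T, K = 240, 15
factors = rng.standard_normal((T, K)) * 0.02 + 0.003

# Step 1: demean each factor so every average return is exactly zero --
# by construction, none of these factors is 'true'.
demeaned = factors - factors.mean(axis=0)

# Steps 2-4: resample rows with replacement to create a new history,
# and record the best average return achieved purely by chance.
n_boot = 1000
best_by_chance = np.empty(n_boot)
for b in range(n_boot):
    rows = rng.integers(0, T, size=T)   # sample rows with replacement
    best_by_chance[b] = demeaned[rows].mean(axis=0).max()

# Step 5: the hurdle is the 95th percentile of the purely-lucky best returns
hurdle = np.percentile(best_by_chance, 95)
real_best = factors.mean(axis=0).max()
print(f"Luck hurdle (95th pct): {hurdle:.4f}")
print(f"Best real factor mean:  {real_best:.4f}")
```

A candidate factor's average return only counts as evidence if it beats the hurdle, i.e. if it exceeds what the best of K hardwired-false factors delivers by luck alone.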
“Third, I propose a shrinkage-based approach. In another recent paper I co-authored, we devised a model to select factors by considering both time-series information (factor by factor) and cross-sectional information (looking across many factors). This allowed us to reduce some of the noise that inevitably forms part of realized factor returns.”
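The shrinkage idea can be illustrated with a simple James-Stein-style estimator that pulls each factor's noisy time-series mean toward the cross-sectional average. This is a generic sketch of the shrinkage principle, not the model in the paper, and all the numbers are simulated.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: 50 factors, each with a small unobserved true premium
n_factors, T, vol = 50, 120, 0.02
true_means = rng.normal(0.002, 0.001, size=n_factors)
# Observed sample means = truth + sampling noise over T periods
sample = true_means + rng.normal(0, vol / np.sqrt(T), size=n_factors)

# Shrink each time-series estimate toward the cross-sectional mean.
grand_mean = sample.mean()
noise_var = vol**2 / T                              # variance of a sample mean
signal_var = max(sample.var(ddof=1) - noise_var, 1e-12)
w = noise_var / (noise_var + signal_var)            # noisier -> shrink harder
shrunk = w * grand_mean + (1 - w) * sample

mse_raw = np.mean((sample - true_means) ** 2)
mse_shrunk = np.mean((shrunk - true_means) ** 2)
print(f"MSE raw: {mse_raw:.3e}, MSE shrunk: {mse_shrunk:.3e}")
```

Because realized factor returns mix a small signal with a lot of noise, blending time-series and cross-sectional information typically produces estimates closer to the true premia than the raw sample means.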
“Finally, my address4 to the AFA argued that it does not make any sense to continue carrying out inference in the traditional way. For example, we might have two factors with identical Sharpe ratios: one is a value factor and the other is some convolution of sun spot data. You can’t just use the Sharpe ratios, you need to add economic priors. I propose a method to haircut Sharpe ratios based on prior information.”
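One simple way to operationalize a Sharpe-ratio haircut is a multiple-testing adjustment: translate the Sharpe ratio into a t-statistic, penalize its p-value for the number of strategies tried, and translate back. The sketch below uses a plain Bonferroni adjustment; it is a simplification of the approach in the papers, which also brings in economic priors, and the function name and inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def haircut_sharpe(sr_annual: float, n_years: float, n_tests: int) -> float:
    """Bonferroni-style haircut of an annualized Sharpe ratio (sketch)."""
    nd = NormalDist()
    t_stat = sr_annual * sqrt(n_years)        # Sharpe ratio -> t-statistic
    p_val = 2 * (1 - nd.cdf(t_stat))          # two-sided p-value
    p_adj = min(p_val * n_tests, 1.0)         # penalize for strategies tried
    if p_adj >= 1.0:
        return 0.0                            # nothing survives the haircut
    t_adj = nd.inv_cdf(1 - p_adj / 2)         # adjusted p-value -> t-statistic
    return t_adj / sqrt(n_years)              # back to a Sharpe ratio

# A Sharpe ratio of 1.0 over 10 years looks strong in isolation...
print(haircut_sharpe(1.0, 10, 1))
# ...but is heavily discounted if it was the best of 100 tries.
print(haircut_sharpe(1.0, 10, 100))
```

The value-factor-versus-sun-spots example needs more than this: two strategies with identical Sharpe ratios should receive different haircuts once economic plausibility is added as a prior, which a purely statistical adjustment like this one cannot capture.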
“This phenomenon is prevalent in both academia and financial practice. In investment management, the worst instances of p-hacking are when someone produces a ‘good’ backtest on a smart beta strategy and some ETF provider decides to launch a product based on this. As with the academic papers, more than 50% of these so-called smart beta products will fail.”
“Firms need to be very careful to foster the right research culture to reduce the number of false positives. For example, suppose two highly qualified researchers, A and B, propose investigating two different potential strategies. A review committee judges both ideas to be of high quality, and A and B carry out research of equally high quality. But the data supporting A’s strategy fails to hold up, while B’s strategy works well and goes live. It would be a big mistake to reward B and/or punish A: that could encourage other researchers to engage in p-hacking. This is why the research culture is crucial to the success of an asset management firm.”
This article was initially published in our Quant Quarterly magazine.
1 C. R. Harvey, ‘The Scientific Outlook in Financial Economics’.
2 C. R. Harvey and Y. Liu, ‘Backtesting’; C. R. Harvey and Y. Liu, ‘Evaluating Trading Strategies’; and C. R. Harvey, Y. Liu and H. Zhu, ‘… and the Cross-Section of Expected Returns’.
3 C. R. Harvey and Y. Liu, ‘Lucky Factors’.
4 C. R. Harvey and Y. Liu, ‘Detecting Repeatable Performance’.