30 days of backtesting

Guest Post by Nils-Bertil Wallin

It is a pleasure to host other data science and quantitative trading enthusiasts here at Prognostikon. Please read this very interesting first post on backtesting by Nils-Bertil Wallin and visit his blog Option Stock Machines to follow along with the 30 days of backtesting series and find more ideas for your strategies. Check the link below and keep up with Nils' work - comments will follow.

Day 1. We begin our 30 days of backtesting by first establishing a baseline. Typically, you’d compare a single-asset strategy like the one we plan to use to buy-and-hold, but that assumes you bought on day one of the test, which is a bit unrealistic. Just as Marcos López de Prado warns of the randomness in backtests, there is also randomness in when a benchmark starts. We’ll use buy-and-hold as a baseline, but also look at portfolios composed of the SPY and IEF (bond) ETFs. We show the 60/40 and 50/50 portfolios for the SPY and IEF with and without rebalancing. Here is the link.
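
For readers following along in code, here is a minimal sketch of the rebalanced vs. drifting benchmark comparison. The synthetic prices, column names, and weekly frequency are stand-ins, not taken from the post:

```python
# A minimal sketch of the benchmark portfolios on synthetic weekly prices.
# All names and numbers here are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2000-01-07", periods=1000, freq="W-FRI")
prices = pd.DataFrame({
    "SPY": 100 * np.exp(np.cumsum(rng.normal(0.0010, 0.020, 1000))),
    "IEF": 100 * np.exp(np.cumsum(rng.normal(0.0005, 0.005, 1000))),
}, index=dates)
rets = prices.pct_change().dropna()

def portfolio_growth(rets: pd.DataFrame, weights: pd.Series, rebalance: bool) -> pd.Series:
    """Grow $1 at the given weights; without rebalancing, the weights drift."""
    if rebalance:
        # Rebalanced every period: the portfolio return is the weighted average.
        return (1 + (rets * weights).sum(axis=1)).cumprod()
    # No rebalancing: track each sleeve's dollar value separately, then sum.
    return (weights * (1 + rets).cumprod()).sum(axis=1)

w = pd.Series({"SPY": 0.6, "IEF": 0.4})
print("60/40 rebalanced:", portfolio_growth(rets, w, True).iloc[-1])
print("60/40 drifting:  ", portfolio_growth(rets, w, False).iloc[-1])
```

Swapping in a 50/50 weight vector reproduces the other benchmark; quarter-end rebalancing (used later in the series) would reset the sleeve values on quarter boundaries rather than every period.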

Day 2. We examine the Hello World of quant strategies: the 200-day moving average, and contrast it with the Buy-and-Hold benchmark. We submit the 200SMA as a separate benchmark because we believe very few investors actually buy and hold. Hence, a simple, rules-based strategy should offer a more realistic comparison. We then compare the 200SMA with Buy-and-Hold using different weights and rebalancing schemes. Our summary findings show that the 200SMA underperforms on a cumulative basis, but outperforms Buy-and-Hold on a risk-adjusted one. Here is the link.
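
A minimal sketch of the 200SMA rule, again on synthetic data: long when the close is above its 200-day moving average, in cash otherwise. The one-day signal lag keeps the rule from trading on information not yet available:

```python
# The 200SMA rule on a synthetic daily price series (stand-in for SPY).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-03", periods=5000, freq="B")
spy = pd.Series(100 * np.exp(np.cumsum(rng.normal(3e-4, 0.012, 5000))), index=idx)

# Lag the signal one day: trade today on yesterday's close vs. SMA.
signal = (spy > spy.rolling(200).mean()).shift(1, fill_value=False)
strat = spy.pct_change().where(signal, 0.0)   # in cash when the signal is False
bh = spy.pct_change()

print("200SMA cumulative:", (1 + strat.fillna(0)).prod() - 1)
print("B&H cumulative:   ", (1 + bh.fillna(0)).prod() - 1)
```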

Day 3. In Day 3 of 30 days of backtesting, we catalog different performance metrics like cumulative returns, Sharpe ratios, and max drawdowns. The goal is to find a good balance between useful, real-world insights and in-depth analysis. We also reduce the number of benchmarks we plan to use going forward to buy-and-hold, the 200-day moving average strategy, and a 60-40 portfolio rebalanced at quarter end. We finish by graphing our chosen metrics in a tearsheet format, which is completely reproducible from the code provided. We'll examine the metrics in the next post. Here is the link.
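
As a sketch of the three headline metrics (assuming weekly returns and a zero risk-free rate, neither of which is confirmed in the post):

```python
# Minimal implementations of the headline metrics for a Series of
# periodic returns. Annualization assumes weekly data (52 periods/year).
import numpy as np
import pandas as pd

def cumulative_return(rets: pd.Series) -> float:
    return (1 + rets).prod() - 1

def sharpe_ratio(rets: pd.Series, periods_per_year: int = 52) -> float:
    # Zero risk-free rate assumed for simplicity.
    return rets.mean() / rets.std() * np.sqrt(periods_per_year)

def max_drawdown(rets: pd.Series) -> float:
    wealth = (1 + rets).cumprod()
    return (wealth / wealth.cummax() - 1).min()   # most negative peak-to-trough
```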

Day 4. In Day 4, we resist the urge to jump into backtesting and focus on building a solid foundation instead. Using the 200-day simple moving average (200SMA) as our strategy benchmarked against a 60/40 SPY/IEF allocation, we find that while the strategy kept us out of the market in 2022, it faltered significantly in 2020 and underperformed overall. The Sharpe ratio is decent, but the rolling information ratio is persistently negative. These results underscore the challenges of outperforming a simple buy-and-hold benchmark but also offer worthwhile comparisons to rules-based investing. Next up: crafting our trading hypothesis. Here is the link.

Day 5. We start backtesting today, but not with a stew of moving averages and namesake indicators. Instead, we use the famous Fama-French Factors as a launchpad to develop a hypothesis on what factors might predict market returns. Our cursory analysis suggests Momentum and Profitability stand out as promising. We opt to forgo Profitability -- too error-prone, in our view -- in favor of Momentum. In our next post, we’ll develop a hypothesis and start to test how Momentum might predict forward returns. Here is the link.

Day 6. In Day 6 of our series, we explore momentum for predictive power. While the market risk premium proved significant in earlier analysis, we now turn to momentum. Originally highlighted by Jegadeesh and Titman and later added by Carhart to the Fama-French model, this is the factor we look to as a predictor of superior returns. We run 16 different combinations of weekly lookback and forward periods of 3, 6, 9, and 12 weeks, excluding data post-2019 to avoid snooping. Tomorrow, we’ll analyze these initial findings in greater detail. Spoiler alert: we find modest reversion in more than half of the models. Here is the link.
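
The post does not spell out its variable construction, but a sketch of the 16-model grid might look like this: regress the N-week forward return on the M-week trailing return for M, N in {3, 6, 9, 12}, here with statsmodels on synthetic weekly returns:

```python
# A hedged sketch of the 16 lookback/look-forward regressions. The exact
# variable construction is an assumption, not taken from the post.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
wk_rets = pd.Series(rng.normal(0.001, 0.02, 1000))

results = {}
for lookback, lookfwd in itertools.product([3, 6, 9, 12], repeat=2):
    mom = wk_rets.rolling(lookback).sum()                 # trailing momentum
    fwd = wk_rets.rolling(lookfwd).sum().shift(-lookfwd)  # next lookfwd weeks
    df = pd.DataFrame({"mom": mom, "fwd": fwd}).dropna()
    fit = sm.OLS(df["fwd"], sm.add_constant(df["mom"])).fit()
    results[(lookback, lookfwd)] = (fit.params["mom"], fit.pvalues["mom"])

for (lb, lf), (beta, p) in sorted(results.items()):
    print(f"{lb:>2}x{lf:<2}  beta={beta:+.3f}  p={p:.3f}")
```

Negative betas in this grid would correspond to the reversion the post finds; the intercepts are the baseline effects discussed on Day 8.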

Day 7. In Day 7 of our 30-days-of-backtesting series, we examine the results of our lookback/look forward momentum combinations in greater detail. We discuss size effects, as represented by the coefficient on the lookback variable, and find that about 75% of the values are negative, suggesting the models are modestly better at finding reversals than at forecasting trend continuation. Only the 12-by-12-week lookback and look forward period exhibits a positive effect, an observation we might use when building a trading strategy. Most of the size effects are not statistically significant, apart from the 12-by-12 and two others mentioned in the post. Our next post will explore baseline effects. Here is the link.

Day 8. In Day 8, we delve into baseline effects using the regression models we ran on the 16 different lookback and look forward momentum periods. This should prepare us to analyze "alpha" in the future. Analyzing weekly data from 2000 to 2018, we observe that baseline effects, while small (averaging under 1%), are relatively stable across the same look forward period regardless of the lookback period used. Such consistency amid different market regimes could be the result of an upwardly trending market during the period of analysis, but would warrant a more detailed investigation that is beyond the scope of these blog posts. Tomorrow, we transition into the next stage: forecasting. Here is the link.

Day 9. In today’s post, we discuss our reasoning behind the long lead-up to forecasting forward returns. That is, we want to establish a testable market thesis that has a basis in logic rather than p-hacking indicators. We then apply walk-forward analysis using the 12-by-12 lookback/look forward momentum model to forecast returns. We train the model on 13 weeks and then predict the subsequent 12-week look forward momentum using the next week in the time series. We then repeat this process for our entire data set. We begin our analysis with the canonical graph of actual vs. predicted values, which we'll delve into in more detail in our next post. Here is the link.
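
A minimal walk-forward sketch under the windowing described above: refit a one-variable regression on a rolling 13-week training window, then predict the next observation. The data construction reuses the momentum sketch from Day 6 and is an assumption, not the post's code:

```python
# Walk-forward loop: rolling 13-week training window, one-step prediction.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
wk_rets = pd.Series(rng.normal(0.001, 0.02, 600))
mom = wk_rets.rolling(12).sum()
fwd = wk_rets.rolling(12).sum().shift(-12)
data = pd.DataFrame({"mom": mom, "fwd": fwd}).dropna().reset_index(drop=True)

train_n = 13
preds, actuals = [], []
for t in range(train_n, len(data)):
    window = data.iloc[t - train_n:t]
    slope, intercept = np.polyfit(window["mom"], window["fwd"], 1)
    preds.append(intercept + slope * data["mom"].iloc[t])
    actuals.append(data["fwd"].iloc[t])

preds, actuals = np.array(preds), np.array(actuals)
print("RMSE:", np.sqrt(np.mean((actuals - preds) ** 2)))
```

Plotting `actuals` against `preds` gives the canonical graph the post opens with.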

Day 10. In our walk-forward analysis of the 12-week lookback/look forward model, we assess residuals to gauge model performance. The actual vs. predicted scatterplot suggests limited bias: predicted values exceed actuals about 53% of the time. However, when we plot residuals against predictions, we notice that error variance increases for extreme predictions, especially during market stress, such as the Global Financial Crisis. While the model seems to perform well in the -10% to 10% return range, the residual analysis calls for deeper inspection. Our next post will examine residual autocorrelation. Here is the link.

Day 11. We delve into the 12-by-12 model's residuals further, finding significant autocorrelation at lags 1-7. Rather than diving down a rabbit hole to identify time dependencies or try different models, we shift to iterating various train/forecast split combinations to find a model with the lowest forecast error. Using root mean-squared error to judge performance, we observe that the errors seem to bottom out in the 5-by-1 to 13-by-4 range. We'll look at this approach in more detail in our next post. Here is the link.

Day 12. In today's post, we extend our analysis by iterating 320 different combinations of training and forecasting windows across the 16 momentum models we've built in our preceding updates. Assessing performance with the root mean-squared error (RMSE), we discover that the lowest-error models typically use a 12-week forecast and a 5-week training period. Shorter lookback periods seem linked to higher RMSEs, except for the 3-by-12 lookback/look forward model with 5 training steps and one forecast step. This warrants further analysis, but could be due to noise. We'll use these results to generate trading signals in our upcoming posts. Here is the link.
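
A hedged sketch of the split search for a single model: vary the training window and the number of steps forecast before refitting, recording an RMSE for each pair. The post runs this across all 16 momentum models; here `data` is the momentum/forward-return frame from the walk-forward sketch above, and the grid bounds are illustrative:

```python
# RMSE over (training window, forecast steps) pairs for one momentum model.
import numpy as np

def walk_forward_rmse(data, train_n: int, fcast_n: int) -> float:
    errs = []
    t = train_n
    while t < len(data):
        window = data.iloc[t - train_n:t]
        slope, intercept = np.polyfit(window["mom"], window["fwd"], 1)
        step = data.iloc[t:t + fcast_n]          # forecast fcast_n steps ahead
        errs.extend(step["fwd"] - (intercept + slope * step["mom"]))
        t += fcast_n                             # then refit and roll forward
    return float(np.sqrt(np.mean(np.square(errs))))

grid = {(tr, fc): walk_forward_rmse(data, tr, fc)
        for tr in range(5, 14) for fc in (1, 2, 4)}
best = min(grid, key=grid.get)
print("lowest-RMSE (train, forecast) split:", best, round(grid[best], 4))
```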

Day 13. We conduct our first backtest! But on day 13 of our 30 days of backtesting, are we tempting fate or banishing superstitions? We use the 12-week forecast paired with a 5-week training period model from [Day 12](https://www.optionstocksmachines.com/post/2024-11-04-day-12-iteration/) to build a simple strategy, going long when the forecast implies a positive return and out of the market, or short (as the case may be), otherwise. The results are great. Too great, in fact. We plan to explain why in our next post. Spoiler alert: we intentionally introduced a critical error to highlight a common pitfall in backtests. Here is the link.
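
A minimal sketch of the signal logic. Here `fcast` is a stand-in noisy forecast of next week's return (the real series would come from the walk-forward model), and the final shift is the causal alignment that, as Day 14 shows, the first backtest deliberately omits:

```python
# Forecasts into positions: long-only goes to cash on a negative forecast;
# long-short flips sign. All series here are illustrative stand-ins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
wk_rets = pd.Series(rng.normal(0.001, 0.02, 500))     # realized weekly returns
fcast = wk_rets.shift(-1) + rng.normal(0, 0.02, 500)  # noisy forecast of week
                                                      # t+1, made at week t
position_long = (fcast > 0).astype(int)               # 1 = long, 0 = cash
position_ls = np.sign(fcast).fillna(0)                # +1 long, -1 short

# Trade week t+1 on the signal available at the end of week t.
strat_long = wk_rets * position_long.shift(1).fillna(0)
strat_ls = wk_rets * position_ls.shift(1).fillna(0)
print("long-only: ", (1 + strat_long).prod() - 1)
print("long-short:", (1 + strat_ls).prod() - 1)
```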

Day 14. In Day 14 of our series, we address a common pitfall in model design—data snooping—using our momentum model as an example. Our setup initially trained on past returns, predicting future 12-week returns, but we deliberately used future data to generate trading signals, creating unrealistic results. To rectify this, we aligned forecasts to avoid looking ahead, which degraded performance compared to the snooped model. Nonetheless, the long-only strategy still enjoyed a better Sharpe ratio than buy-and-hold. This exercise underscores the trade-off between data availability and up-to-date models, which we'll discuss in our next post. Here is the link.

Day 15. After showing how the model was snooping and correcting for it, we look in detail at using more up-to-date data to improve the trading signals. By inputting the most recent weekly data into the model trained on lagging information, we see the long-only strategy outperform buy-and-hold by 10% points, with a Sharpe ratio over 20% points higher. The long-short approach delivers stronger results, too. Tomorrow, we’ll dive deeper into these promising findings and benchmark comparisons. Here is the link.

Day 16. We assess the improved 12-by-12 strategy's performance against several benchmarks, including buy-and-hold, the 60-40 portfolio, and the 200-day SMA. The strategy demonstrates resilience, avoiding major declines during the 2002-2003 and 2008 financial crises while capturing overall market trends. It outperforms the 60-40 benchmark by 22% points on a cumulative return basis. It performs less well against the 200-day SMA, and most of that gap comes from the 2015-2018 period. We'll drill down into comparisons with the 200-day in our next post. Here is the link.

Day 17. On Day 17 we compare the 12-by-12 strategy to the 200-day SMA in terms of drawdowns. Notably, the 12-by-12 strategy suffers a deeper and longer drawdown after the Tech Bubble, but its average drawdown depth is about the same as the 200-day's. The overall under- and outperformance of the 12-by-12 relative to the 200-day appears driven by two periods -- post-Tech Bubble and pre-Covid. This could speak to a model more sensitive to specific market regimes that might not repeat in the future. Interestingly, if we only looked at the middle period, the 12-by-12 would have outperformed by 30% points! In our next post, we look at ways to offset historical idiosyncrasies to estimate future performance. Here is the link.

Day 18. On Day 18, we discuss methods for simulating future returns, like sampling from distributions or historical data, and the limitations of each approach. We favor historical data with block sampling, but first need to find a suitable block size to capture realistic market behavior. Using autocorrelation plots, we present different lag periods with significance bands to determine block sizes. We discuss these plots in more detail in our next post. Here is the link.
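
A minimal sketch of the block-size diagnostic: an autocorrelation plot of weekly returns with 95% significance bands, using statsmodels' `plot_acf`. The post's exact plotting code may differ:

```python
# Autocorrelation plot with significance bands to pick a block size.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(4)
wk_rets = pd.Series(rng.normal(0.001, 0.02, 1000))  # stand-in weekly returns
plot_acf(wk_rets, lags=20, alpha=0.05)              # bands mark significance
plt.show()
```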

Day 19. We dive into circular block sampling to simulate the probability of outperformance. We first revisit the autocorrelation plots for the strategy and benchmarks—buy-and-hold, 60-40, and the 200-day SMA—and select buy-and-hold's 3- and 7-week lags. We then compare the 200-day with buy-and-hold to establish baseline code. We employ a 3-week-block circular sampling method to simulate five-year return series and repeat this process 1,000 times. Our findings reveal that the 200-day SMA underperforms buy-and-hold by approximately 13 percentage points on average, with only a 25% chance of outperformance. That said, the strategy's Sharpe ratio surpasses buy-and-hold's over 30% of the time. In our next post, we'll build on our circular block sampling algorithm to test the 12-by-12 strategy. Here is the link.
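
A minimal sketch of circular block sampling: draw fixed-length blocks whose starting points can wrap around the end of the sample, and string them together into a simulated five-year (260-week) path, repeated 1,000 times. The return series is a stand-in:

```python
# Circular block bootstrap of a weekly return series.
import numpy as np

def circular_block_sample(rets: np.ndarray, horizon: int, block: int,
                          rng: np.random.Generator) -> np.ndarray:
    n = len(rets)
    out = []
    while len(out) < horizon:
        start = rng.integers(n)                 # any start; wrapping is allowed
        out.extend(rets[(start + np.arange(block)) % n])
    return np.array(out[:horizon])

rng = np.random.default_rng(5)
rets = rng.normal(0.001, 0.02, 1000)            # stand-in weekly strategy returns
sims = np.array([(1 + circular_block_sample(rets, 260, 3, rng)).prod() - 1
                 for _ in range(1000)])
print("mean five-year return:", sims.mean())
```

To estimate a probability of outperformance, one natural design is to reuse the same block indices for strategy and benchmark so each simulated path stays paired.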

Day 20. In our ongoing analysis, we apply circular 3-block sampling to evaluate the 12-by-12 strategy's performance against buy-and-hold and compare it with the 200-day over a five-year horizon. The 12-by-12 strategy underperforms buy-and-hold on average by about 11% points, with only a 29% likelihood of outperformance. The strategy's Sharpe ratio is slightly better, surpassing buy-and-hold approximately 36% of the time. Compared to the 200-day, the 12-by-12 fares better, but not significantly so. We also apply 7-block sampling and find similar results, though the 12-by-12 does see some improvement overall. Our conclusion: the 12-by-12 is probably not a viable strategy in its present form. But all is not lost, as we discuss in our next post. Stay tuned! Here is the link.

Day 21. In our latest analysis, we take stock of our progress and acknowledge the 12-by-12 strategy is more illustrative than persuasive. Despite these flaws, we're not ready to give up and propose an enhancement: incorporating an error correction term to improve trading signals. Our first pass presents compelling evidence that this approach merits further investigation. Our next post will delve into the results and seek to understand the cause of the improvement. Here is the link.

Day 22. In our previous post, we faced the dilemma of our 12-by-12 model's lackluster simulated performance. Instead of starting anew or forcing the data to fit our expectations, we opted for a third approach: incorporating an error correction term inspired by traditional machine learning techniques and insights from the Prognostikon blog. By comparing our model's predictions against actual outcomes and adjusting forecasts accordingly, we observed a significant improvement in cumulative performance. Notably, this modest modification transformed an underperforming strategy into an outperforming one, raising the question of whether this success stems from logic or luck—a topic we'll explore in our next post. Here is the link.
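
A hedged sketch of the error-correction idea: the miss on last week's forecast is observable with a one-week delay (per Day 23), and this week's forecast is tilted by it. The post's exact adjustment rule isn't spelled out here; a simple additive correction on stand-in series is assumed:

```python
# Error correction with a one-week delay on stand-in forecast/return series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
realized = pd.Series(rng.normal(0.001, 0.02, 500))  # stand-in realized returns
fcast = realized + rng.normal(0, 0.02, 500)         # stand-in raw forecasts

error = (realized - fcast).shift(1)                 # known one week later
adjusted = fcast + error.fillna(0)                  # tilt by the lagged miss
```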

Day 23. In our latest analysis, we begin to explore whether our enhanced strategy outperforms due to logic or luck. Recall that the updated strategy incorporates a one-week delay to quantify model error, subsequently adjusting predictions based on this error term. This refinement surpasses buy-and-hold by 20% points and the original strategy by 10% points. To tease out the logic vs. luck debate, we first need to understand what the enhancement is doing and draw parallels to machine learning's use of loss functions and error correction. Our findings reveal that the correlation between the error term's sign and forward returns, along with its influence on the prediction-return relationship, underpins the strategy's success. That this works is a bit counterintuitive at first glance, which we'll address in our next post. Here is the link.

Day 24. In our latest post, we extend the analysis of our error correction method, which significantly enhanced trading performance. The choice of method was admittedly hacky, but we look to find some logic in our luck. We hypothesize that our error correction adjusts the directional biases of the original walk-forward models, particularly if those models are mean-reverting, which many of them appear to be. While the jury is still out on the ultimate driver of success, we shift to comparing the prediction accuracy of the adjusted and unadjusted models using confusion matrices. Our first-pass analysis reveals that the adjusted model slightly underperforms in predicting positive directions but excels in accurately forecasting negative ones. This reminds us of the first rule of trading: Don't Lose Money! Here is the link.
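
A minimal sketch of the directional confusion matrix, reusing the stand-in `fcast` and `realized` series from the error-correction sketch above:

```python
# Classify each week by the sign of the forecast vs. the realized return.
import pandas as pd

pred_up = fcast > 0
actual_up = realized > 0
confusion = pd.crosstab(pred_up, actual_up,
                        rownames=["predicted up"], colnames=["actual up"])
print(confusion)

# Average realized return in each cell (true/false positives and negatives),
# which is what Days 25-26 compare across strategies.
print(realized.groupby([pred_up, actual_up]).mean())
```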

Day 25. In our latest post, we analyze the effects of false positives and false negatives on the adjusted and unadjusted strategies. The adjusted strategy, which incorporates the error correction, predicts fewer positive returns correctly, but excels at identifying negative returns compared with the unadjusted strategy. It also generates fewer false positives, but suffers more false negatives. Importantly, the adjusted strategy generates higher average returns on its true positives and lower average returns on its false negatives vs. the unadjusted strategy. However, the unadjusted strategy appears to generate better returns on true negatives, although the significance of the result remains to be analyzed. We'll analyze the results further and compare with the original strategy in our next post. Here is the link.

Day 26. Only five more days! We compare the adjusted strategy to the original 12-by-12, focusing on predictive accuracy by scenario and the corresponding returns. The adjusted strategy demonstrates fewer true positives but also fewer false positives compared to the original. Notably, it achieves better performance on true positives and false negatives than its counterpart. Statistical tests reveal the differences in mean returns for these categories are significant. These findings suggest that the adjusted strategy's strength lies in getting more of the big moves right and giving up less on the big ones it gets wrong. We've got one more refinement in store, which could yield even further performance improvements. Stay tuned for our next post! Here is the link.

Day 27. We enhance our error correction method, yielding attractive results. The cumulative return of the new adjusted strategy surpasses buy-and-hold by 36% points and the original strategy by 26% points. Notably, the Sharpe ratio improves to 0.6, almost twice that of buy-and-hold and 0.08 points higher than the original strategy's. While drawdown periods remain similar, the new strategy appears to avoid significant downturns, staying out of the market approximately 5% more of the time than the original approach. Our remaining posts will present the rest of the metrics we've used in our analysis. Then we'll end with the pièce de résistance -- testing the finished model on out-of-sample data. Stay tuned! Here is the link.

Day 28. Our refined strategy generated an additional 16% points of outperformance vs. both buy-and-hold and the original 12-by-12 strategy. In this post, we analyze the prediction scenarios and find the enhancement increases true positives by 2.5% and reduces false negatives by 3.6%. T-tests confirm the statistical significance of these improvements. Next, we use circular block sampling to simulate 1,000 five-year periods. Here, our adjusted strategy exceeds the original strategy's performance, but still trails buy-and-hold. These findings suggest that while the enhancement yields notable gains, the true test will be out-of-sample performance. We'll tackle that in our next post. Stay tuned! Here is the link.

Day 29. In our penultimate post, we evaluate the out-of-sample performance of four strategies we've been analyzing against buy-and-hold, the 200-day simple moving average (200SMA), and the 60%-40% SPY-IEF rebalanced portfolio. We rank the performance on cumulative return, Sharpe ratio, and maximum drawdown. While the unadjusted strategy performs the best, it is not viable due to its poor results in the training period. Coming in second and more tenable is the adjusted strategy with error correction. Were we to consider implementing this strategy, we'd need to analyze its tax drag and where it might fit in our overall asset allocation process. Discussions ripe for another series! In our next and final post of the series, we sum up lessons learned and next steps. Stay tuned! Here is the link.

Day 30. As our 30-day backtesting journey comes to an end, we sum up lessons learned. First, sometimes even simple strategies can be improved by adding targeted refinements. Second, multiple metrics help to analyze multiple strategies against benchmarks, but they too need a systematic approach to implementation. Finally, simulation offers valuable insights, but should fit logically in the overall analysis. As we conclude this series, we invite our readers to share what they'd like to see in future posts. Stay tuned! Here is the link.