We Backtested Our Market Regime Detector. Two of Five Labels Were Wrong.

Most fintech tools launch with marketing copy and never get tested. We took the opposite approach. After building the Trailing Stop Loss Regime Detector through three iterations, we ran its entire playbook against 15.3 years of historical data — 3,841 trading days from February 2011 to May 2026 — to see whether the regime labels actually predict when each strategy has edge.

The results were uncomfortable. Two of five strategies came back with the opposite of the predicted result. One came back with a massive validated edge. One was a coin flip. One turned inconclusive once the test window was extended. And the single best overlay we tested wasn't even a strategy: it was a risk filter that cut maximum drawdown by roughly a third while preserving Sharpe ratio.

This is the full writeup of what we tested, what survived, and what we changed in v4 of the tool as a result. If you want the bottom line up front: the tool itself now wears its backtest results on its sleeve, with edge badges and p-values displayed next to each strategy recommendation. No more theoretical claims — every label is tagged with whether the data backed it up.

What we tested and how

The regime detector classifies every trading day across five axes: risk sentiment, volatility regime, trend versus mean-reversion, sector correlation, and yield curve direction. From these classifications, the tool's playbook generates strategy recommendations — for example, "premium selling has edge in low-vol regimes" or "long volatility has edge in high-vol regimes."

The backtest question is simple: are those recommendations right? For each strategy, we compared the forward 5-day SPY return on days the detector flagged as "EDGE" days versus days it flagged as "NO EDGE." A real edge means the EDGE days actually outperform with statistical significance. The test is a Welch's t-test, which doesn't assume equal variances between the two groups.
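As a rough illustration of the comparison described above, the sketch below computes forward 5-day returns and Welch's t statistic using only Python's standard library. The function names are ours for illustration and are not taken from the backtest repository. The p-value here uses a normal approximation to the t distribution, which is an assumption that is only reasonable for samples as large as the ones in this test.

```python
import math
from statistics import NormalDist, mean, variance


def forward_returns(closes, horizon=5):
    """Forward return over `horizon` days for every day with enough future data."""
    return [closes[i + horizon] / closes[i] - 1.0
            for i in range(len(closes) - horizon)]


def welch_t_test(edge, no_edge):
    """Return (t, degrees of freedom, two-sided p) for two samples.

    Welch's test does not pool the variances of the two groups.
    The p-value uses a normal approximation (assumption: large samples).
    """
    n1, n2 = len(edge), len(no_edge)
    v1, v2 = variance(edge), variance(no_edge)  # sample variances (n - 1)
    se2 = v1 / n1 + v2 / n2                     # squared standard error
    t = (mean(edge) - mean(no_edge)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, df, p
```

In use, `edge` would hold the forward returns on EDGE days and `no_edge` the forward returns on NO EDGE days. Note that forward 5-day returns on consecutive days overlap, so the observations are not fully independent.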

The full methodology is documented in our open-source backtest repository. Anyone can run it themselves and reproduce these numbers. That openness matters because the alternative — "trust us, the tool works" — is exactly what we built this tool to push back against.

Honest limitations. Forward returns use SPY as a strategy proxy, not actual trades with stops and position sizing. Premium-selling P&L is approximated by the vol risk premium gap (VIX minus realized), not actual options. Long-vol is proxied by VIX changes, not VXX (real VXX is worse due to roll decay). No transaction costs are modeled. The window is 2011–2026 — one regime epoch — and results may not generalize cleanly to earlier eras. p-values are not Bonferroni-corrected.
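To make the last limitation concrete, here is a minimal sketch of what a Bonferroni correction would do to the five p-values reported in this writeup. Entering "<0.001" as 0.001 is our simplifying assumption (an upper bound), and the helper name is illustrative.

```python
ALPHA = 0.05

# p-values as reported in the results; "<0.001" is entered as its upper bound.
reported_p = {
    "Premium selling": 0.001,
    "Trend following": 0.570,
    "Buy & Hold (risk-on bias)": 0.088,
    "Mean reversion in chop": 0.003,
    "Long volatility in high vol": 0.001,
}


def bonferroni_survivors(p_values, alpha=ALPHA):
    """Divide alpha by the number of tests and keep the results below it."""
    threshold = alpha / len(p_values)
    survivors = [name for name, p in p_values.items() if p < threshold]
    return threshold, survivors


threshold, survivors = bonferroni_survivors(reported_p)
print(f"Corrected threshold: {threshold:.3f}")
print("Still significant:", survivors)
```

With five tests the corrected threshold is 0.01, so under these inputs the premium-selling, mean-reversion, and long-volatility results would still clear it.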

The results, ranked from best to worst

Figure: Bar chart showing edge in basis points by strategy.
Edge (basis points, forward 5-day SPY return on EDGE days minus NO EDGE days) by strategy. Green bars are statistically significant positive edges. Red bars are negative or non-significant.

| Strategy | Edge return | No-edge return | Difference | p-value | n | Verdict |
|---|---|---|---|---|---|---|
| Premium selling | 36.48% | 15.91% | +2,057 bps | <0.001 | 1,256 | ✓ Strong edge |
| Trend following | 0.22% | 0.26% | −4 bps | 0.570 | 745 | ~ Inconclusive |
| Buy & Hold (risk-on bias) | 0.23% | 0.52% | −30 bps | 0.088 | 2,449 | ~ Inconclusive (was disputed) |
| Mean reversion in chop | 0.34% | 1.31% | −97 bps | 0.003 | 213 | ✗ Backwards |
| Long volatility in high vol | −2.15% | +4.53% | −667 bps | <0.001 | 1,395 | ✗ Backwards |

Three of five strategies came back meaningfully negative: two with overwhelming statistical significance, and the third (buy-and-hold risk-on bias) directionally but no longer conclusively after extending the test window. Trend following's 4-basis-point gap is indistinguishable from zero. That's not a soft "the model needs tuning" result; that's the data emphatically rejecting parts of the theory. Let's walk through what each one actually means and what we did about it.

What worked: premium selling in low vol

The premium-selling result was the biggest single finding in the backtest. On the 1,256 days the detector flagged as low-volatility, contango, non-risk-off conditions, the volatility risk premium (VIX minus realized 5-day vol) captured 36% of available premium versus 16% on days when those conditions weren't met. The difference was 2,057 basis points with a p-value below 0.001 — about as statistically clean as financial data gets.
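For readers who want to see the shape of the proxy, the sketch below computes a volatility risk premium gap as VIX minus annualized realized volatility. The annualization with 252 trading days, the use of log returns, and the function names are our assumptions; the repository may define the 5-day realized window and the "captured premium" percentage differently.

```python
import math
from statistics import stdev


def realized_vol(closes, trading_days=252):
    """Annualized realized volatility, in percent, from daily log returns."""
    log_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    return stdev(log_returns) * math.sqrt(trading_days) * 100.0


def vol_risk_premium(vix_level, closes):
    """Gap in volatility points between implied (VIX) and realized volatility.

    `closes` should hold six closes to cover a 5-day window (assumption).
    """
    return vix_level - realized_vol(closes)
```

A positive gap means options were pricing in more movement than the index actually delivered over the window, which is the condition a premium seller is paid for.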

This isn't surprising in retrospect. The volatility risk premium — the gap between what options imply and what stocks actually realize — is one of the most well-documented anomalies in finance. AQR Capital Management research has shown the premium persists across decades and asset classes. What the backtest confirmed is that the detector's timing of when to capture that premium — low vol, contango, no risk-off signals — is correct.

Practical takeaway. When the detector shows VIX in the bottom quartile of its trailing year, VIX3M above spot (contango), and no risk-off signals firing — that's the historically validated window for short-premium trades. The edge is real, the conditions are real, and the sample size of 1,256 days is large enough to take seriously.
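The three conditions in that takeaway can be sketched as a single check. This is an illustration under stated assumptions, not the detector's actual code: we treat "trailing year" as the last 252 closes, define the quartile with a simple percentile rank, and take the risk-off state as a boolean supplied by the caller.

```python
def percentile_rank(history, value):
    """Fraction of observations in `history` that are <= value."""
    return sum(1 for x in history if x <= value) / len(history)


def premium_selling_edge(vix_history, vix3m, risk_off, lookback=252):
    """True when all three conditions hold on the latest day.

    vix_history: daily VIX closes, oldest first, latest last.
    vix3m:       latest VIX3M close.
    risk_off:    True if any risk-off signal is firing (caller-supplied).
    """
    window = vix_history[-lookback:]
    spot = window[-1]
    low_vol = percentile_rank(window, spot) <= 0.25  # bottom quartile
    contango = vix3m > spot                          # VIX3M above spot VIX
    return low_vol and contango and not risk_off
```

For example, `premium_selling_edge(vix_closes, vix3m=14.0, risk_off=False)` returns `True` only when the latest VIX close sits in the bottom quartile of the window and below 14.0.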

What we got backwards: long volatility in high-vol regimes

This was the most uncomfortable finding. The v3 playbook said long volatility has edge when VIX is elevated or in backwardation — the intuitive idea being "if vol is spiking, ride the spike." The data said the opposite. On the 1,395 days the detector flagged as high-vol or backwardation, long-vol exposure lost an average 215 basis points over the next 5 days. On the no-edge days, it gained 453 basis points. The difference was 667 basis points against the v3 playbook with a p-value below 0.001.

The explanation is one of the most robust findings in volatility research: VIX mean-reverts. Once VIX is already elevated, the highest-probability move is back down, not further up. CBOE's methodology documentation describes the index as a measure of 30-day implied volatility; elevated readings reflect concentrated market fear, which historically resolves downward more often than upward. Buying volatility when it's already high is, more often than not, buying the top.

The v4 playbook reverses this label. In high-vol regimes, the recommendation now reads "long-vol does NOT work here (VIX mean-reverts)." Buying long-vol exposure makes more sense when vol is cheap, not when it's already moved.

What we got backwards: mean reversion in chop

The v3 logic said RSI(2) < 10 dip-buy signals should work better in choppy, range-bound regimes than in trending markets. The reasoning was textbook: range-bound markets fade extremes, trending markets respect them. The data disagreed. On the 213 days RSI fired in choppy regimes, the forward 5-day return averaged 0.34%. On days it fired in trending regimes, the average was 1.31%. A difference of 97 basis points against the v3 playbook, with a p-value of 0.003.
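For reference, an RSI(2) dip-buy signal can be sketched as follows. We assume Wilder's smoothing, which is the common textbook definition; the detector's exact RSI variant is not stated here, so treat this as an approximation with illustrative function names.

```python
def rsi(closes, period=2):
    """Wilder's RSI at the last close; needs at least period + 1 closes."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    changes = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(c, 0.0) for c in changes]
    losses = [max(-c, 0.0) for c in changes]
    # Seed with simple averages over the first `period` changes.
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    # Wilder smoothing over the remaining changes.
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window (convention)
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)


def dip_buy_signal(closes, threshold=10.0):
    """True when RSI(2) is below the oversold threshold."""
    return rsi(closes, period=2) < threshold
```

The backtest then splits the days on which this signal fires by regime (choppy versus trending) and compares the forward 5-day returns of the two groups.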

The likely explanation requires thinking carefully about what an oversold reading actually means in each context. In a true trend, an oversold reading is a pullback within a strong move — and pullbacks within uptrends tend to resolve sharply higher. In genuine chop, an oversold reading is just one swing of an oscillating range, and the next move is as likely to grind lower as to bounce. The dip-buying edge lives in trends, not in ranges. Counterintuitive, but well-supported by 15.3 years of data and consistent with what serious discipline-focused traders have always said about dip-buying: do it in confirmed uptrends, not in markets going sideways.

The honest update: risk-on bias for buy and hold (weakened result)

This is the finding the extended out-of-sample window changed. In the original 2011–2024 backtest, risk-off days outperformed risk-on days on 5-day forward returns with statistical significance — flipping the v3 logic. Extending the window through May 2026 added 204 more days and the result softened to p=0.088, no longer significant at the conventional 5% threshold. Across 2,449 testable days, risk-off days averaged 52 basis points of forward 5-day return versus 23 basis points on risk-on days. The 30-basis-point difference is still directionally interesting but no longer statistically conclusive.

The intuition remains the "buy the dip" effect at the macro level. Days when risk signals are flashing red — high yield underperforming, dollar strong, equities down 20-day — tend to be exactly the days when forward returns are best, because those signals fire near short-term lows. But with weaker statistical support in the longer window, this is a directional observation rather than a hard rule. The v4 detector still removes the risk-on bias from buy-and-hold timing recommendations, but the case for that change is now "no clear edge either way" rather than "the opposite of v3 is correct."

An important caveat. This 5-day forward result is about short-term timing, not about regime-aware risk management writ large. Over the full 15.3-year window, the "cash when risk-off" overlay had a Sharpe ratio of 0.69, worse than buy-and-hold's 0.74, but did cut maximum drawdown from −34.10% to −29.97%. Risk-off signals are too slow to time SPY entries but can still inform broader portfolio sizing.

The coin flip: trend following

On the 745 days the detector flagged as trending-and-uptrending, the forward 5-day SPY return averaged 0.22%. On the rest of the days, it averaged 0.26%. The difference was 4 basis points, with a p-value of 0.570: pure noise. The trend regime label is not, by itself, doing useful work for forward SPY returns.

This finding has a softer interpretation. The test was directional only — long SPY when the label fires — without the entries, stops, and position sizing that real trend strategies use. Trend-following P&L typically comes from sizing up in confirmed moves and surviving the whipsaws, not from being correct about direction on any given day. So the label not predicting day-by-day SPY direction doesn't necessarily mean a properly-implemented trend strategy would fail in those regimes; it just means the regime label alone isn't enough.

The v4 playbook downgrades trend-following from "EDGE in trending regimes" to "NEUTRAL with caveats." Honest, but with the door open for proper trend infrastructure to extract value the label test couldn't see.

The unexpected winner: the high-vol risk filter

Figure: Cumulative return comparison of regime-filtered overlays vs buy-and-hold SPY.
Cumulative returns of three regime-filtered overlays versus always-long SPY (2011–2026, log scale). The high-vol filter (red) maintains buy-and-hold's Sharpe while cutting maximum drawdown.

The most useful finding wasn't a strategy at all; it was a risk filter. We tested three overlays against buy-and-hold SPY: long only when trending, cash when risk-off, and cash when high vol or backwardation. The results across the full 15.3-year window:

| Strategy | CAGR | Vol | Sharpe | Max drawdown |
|---|---|---|---|---|
| Always long SPY (benchmark) | 11.95% | 17.19% | 0.74 | −34.10% |
| Cash when high vol or backwardation | 9.21% | 12.18% | 0.78 | −23.54% |
| Cash when risk-off | 8.54% | 13.20% | 0.69 | −29.97% |
| Long only when trending | 1.76% | 4.88% | 0.38 | −8.80% |

The high-vol filter now beats buy-and-hold on risk-adjusted return (Sharpe 0.78 versus 0.74) while cutting maximum drawdown by 10.6 percentage points, from −34.10% to −23.54%. The cost is 2.74 points of CAGR (9.21% vs 11.95%). That's the trade: you give up some CAGR to get both a better Sharpe and a meaningfully smoother ride. For anyone who actually sat through the 34% drawdown in 2020, or the 2022 bear market, and questioned why they were doing this to themselves, the smoother ride at higher risk-adjusted return is a clear improvement. The CAGR gap means buy-and-hold still wins on absolute terminal wealth for investors who stayed fully invested through the full window.
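The four columns in the overlay comparison can be computed from a series of daily returns roughly as follows. This is a sketch under assumptions of ours: cash earns 0%, the risk-free rate is 0, a year has 252 trading days, and Sharpe is the annualized arithmetic mean over annualized volatility. The repository's definitions may differ, so the output need not match the published figures exactly.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252


def overlay_returns(daily_returns, in_market):
    """Zero out returns on days the filter holds cash (cash assumed at 0%).

    in_market[i] must be decided before day i's return is known,
    otherwise the overlay has lookahead bias.
    """
    return [r if held else 0.0 for r, held in zip(daily_returns, in_market)]


def performance(daily_returns):
    """CAGR, annualized volatility, Sharpe (rf = 0) and maximum drawdown."""
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_dd = min(max_dd, equity / peak - 1.0)  # most negative dip from a peak
    years = len(daily_returns) / TRADING_DAYS
    cagr = equity ** (1.0 / years) - 1.0
    vol = stdev(daily_returns) * math.sqrt(TRADING_DAYS)
    sharpe = mean(daily_returns) * TRADING_DAYS / vol
    return {"cagr": cagr, "vol": vol, "sharpe": sharpe, "max_drawdown": max_dd}
```

Comparing `performance(spy_returns)` with `performance(overlay_returns(spy_returns, in_market))` gives one row of the benchmark and one row of an overlay.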

The filter is not a free lunch and we won't pretend it is. What it does offer is a different point on the risk-return frontier, with a real Sharpe-preserving drawdown reduction backing it. It's now built into v4 of the detector as a position-sizing input — when the tool detects high-vol or backwardation conditions, the recommended position size automatically drops to 0.4× baseline. Whether you actually take that cut is up to you and your risk tolerance.
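The sizing rule itself is simple enough to state in a few lines. The function below is a hypothetical illustration of the 0.4× rule described above; how the detector decides that a day is high-vol or in backwardation is outside this sketch and is passed in as two booleans.

```python
HIGH_VOL_SCALE = 0.4  # multiplier quoted in the text


def position_size(baseline, high_vol, backwardation, scale=HIGH_VOL_SCALE):
    """Scale the baseline size down when either risk condition is active."""
    if baseline < 0:
        raise ValueError("baseline size must be non-negative")
    return baseline * scale if (high_vol or backwardation) else baseline
```

For example, a baseline of 100 shares becomes 40 when either flag is set and stays at 100 otherwise.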

What we changed in v4

Four concrete changes from v3 based on what the data showed:

  1. Long volatility playbook flipped (strongest finding). The label "long vol has edge in high-vol regimes" is gone. Replaced with "long vol does NOT work in high-vol regimes — VIX mean-reverts." This held with p<0.001 across 1,395 days in both the original and extended windows. Long-vol exposure is now recommended cautiously in low-vol regimes only, as cheap hedges, not as a momentum play.
  2. Mean reversion playbook revised (strong finding). The "RSI fades work in chop" claim is removed. Replaced with "RSI fades work better in trends than in chop (counterintuitive but data-supported)." Held at p=0.003 across both test windows.
  3. Risk-on bias removed from buy-and-hold timing. The tool no longer suggests reducing SPY exposure during short-term risk-off readings. Note: the evidence for this change weakened in the extended window (p=0.088, no longer significant) — call this a "no clear edge either way" finding rather than a strict reversal.
  4. High-vol filter promoted to first-class position-sizing input (improved finding). When the detector identifies high-vol or backwardation conditions, position size automatically scales to 0.4× — the only overlay our backtest validated. With the extended window, this overlay now beats buy-and-hold on Sharpe (0.78 vs 0.74), not just matches it.

Each strategy in the v4 playbook now displays its actual backtest result — edge in basis points, p-value, and sample size — next to the recommendation. If you want to trust the tool, you can see exactly why. If you want to disagree with the tool, you can see exactly what evidence to argue with.

Why this matters

There's a kind of fintech content that promises a tool and never tests it. There's another kind that runs a test, finds the tool half-broken, quietly fixes the marketing copy, and moves on. We're trying to be a third thing: a site that builds tools, tests them publicly, and tells you when the tests embarrass the tools.

The reason this matters isn't moral, it's practical. Trading education content is overwhelmingly written by people who have never tested their own claims. If a tool's playbook says "long vol has edge when VIX is elevated" and that claim has never met data, the trader who follows it is the one who finds out it doesn't work — typically by losing money. The backtest doesn't make the tool perfect. It makes the tool honest about which parts of it you can rely on.

If you want to use v4 of the regime detector, it's at trailingstoploss.com/regime-detector/. Every recommendation is tagged with its backtest result. If you want to run the backtest yourself and find different numbers, the code is open and reproducible: Python, well-commented, ~700 lines, and it runs in five minutes on any laptop with a Yahoo Finance connection.

The next piece of research we're doing is whether per-asset regime classification — same logic but running on futures markets and crypto rather than SPY — produces materially different labels. If a trader's primary market is gold or oil or Bitcoin, the SPY-derived regime probably misses things that matter. We'll publish those results too. Including the inconvenient ones.

FAQ

What is a market regime detector?
A regime detector classifies current market conditions into states — for example, trending versus choppy, high volatility versus low, risk-on versus risk-off — to help traders choose strategies that historically work in similar conditions. Institutional desks have used regime classification for decades. Our tool brings the same idea to retail traders using free public data.
What does it mean that two labels were "backwards" in the backtest?
Two of v3's strategy recommendations had the opposite sign from what historical data showed. The long-volatility playbook said to buy vol in high-vol regimes; the data showed that doing so lost an average of 215 basis points over 5 days, 667 basis points worse than on the remaining days. The mean-reversion playbook said RSI dip-buys work in chop; the data showed they work better in trends. v4 reverses both labels.
Can I trust the backtest results?
Trust them as exploratory evidence over the 2011–2026 window, not as proof of future performance. The test has real limitations: it uses SPY as a proxy for strategy P&L (not actual trades with stops), it doesn't include transaction costs, and one regime epoch isn't enough to claim universal validity. But the methodology is open-source and the results are reproducible.
Why does VIX mean-revert?
VIX measures 30-day implied volatility, and elevated readings reflect concentrated market fear that tends to resolve as conditions normalize. Historically, the highest-probability move when VIX is already high is back down toward its long-term mean, not further up. This is well-documented in academic and practitioner research.
Why does mean reversion work better in trends than in chop?
In a confirmed uptrend, an oversold reading is a pullback within a strong move — pullbacks within trends tend to resolve sharply higher. In genuine chop, an oversold reading is just one swing of an oscillating range, and the next move is roughly equally likely in either direction. The dip-buying edge lives in trends, not ranges.
How can I use v4 of the regime detector?
The tool is free to use at trailingstoploss.com/regime-detector/. It pulls end-of-day data from public sources and runs entirely in your browser. Each strategy recommendation now displays its backtest edge, p-value, and sample size so you can see exactly which labels to trust.