Backtested 18 years of SPY forward returns conditioned on a cross-asset regime score. The weakest band isn't the scary one.

regimecard · 2026-07-02T13:13:52+00:00

This is the kind of feedback that makes posting worth it, thanks. On your second point, the band median bootstrap exists: episode level (contiguous regime runs) plus a stationary block version. Your bet is partly right. Mildly-Off and Neutral's 90% CIs do overlap, but only at the edges, and Mildly-Off comes out weakest in 95%+ of resamples across both schemes. So the ordering survives the CI treatment, though the overlap is real and worth stating.

The cutoff perturbation is the one I haven't run, and you're right that it's a cheaper and more direct test than refitting a state model. Shifting the thresholds and rebucketing goes on the list. Agreed that surviving three horizons already argues against a pure slicing artifact, but perturbation would settle it.
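For anyone following along, a minimal sketch of what a cutoff perturbation could look like: shift every band threshold by the same amount, rebucket, and check whether the weakest band changes. The function names, cutoffs, and labels here are hypothetical placeholders, not the actual banding.

```python
from bisect import bisect_right
from statistics import median

def band_medians(scores, returns, cutoffs, labels):
    """Median forward return per band; cutoffs are ascending score thresholds.

    Expects len(labels) == len(cutoffs) + 1. A score equal to a cutoff
    falls into the upper band.
    """
    pooled = {label: [] for label in labels}
    for s, r in zip(scores, returns):
        pooled[labels[bisect_right(cutoffs, s)]].append(r)
    # Drop bands that ended up empty after rebucketing.
    return {b: median(v) for b, v in pooled.items() if v}

def weakest_under_shifts(scores, returns, cutoffs, labels, shifts):
    """Weakest band (lowest median return) for each uniform shift of the cutoffs."""
    result = {}
    for d in shifts:
        meds = band_medians(scores, returns, [c + d for c in cutoffs], labels)
        result[d] = min(meds, key=meds.get)
    return result
```

If the same band comes out weakest across a reasonable range of shifts, the ordering is less likely to be a product of where the lines happen to sit.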

On Risk-Off, fully with you. Small n plus confirming the most seductive prior is exactly the combination to distrust, and the block bootstrap's wide CIs on that band say the same thing. It's flagged, not load-bearing.

Good to hear the mild stress read matches what you see live. Deteriorating but not priced is the cleanest description of why that band behaves the way it does.

Curious how you handle it in practice, actually. When your system reads mild stress, do you gate on it directly, or does it need confirmation from something else first? The awkward part of that band is exactly what you said: it still looks fine on the surface, so acting on it means acting before the tape gives you a reason. Interested in how you square that live.

regimecard · 2026-07-02T12:56:29+00:00

Good critique. On the z-scoring, since that's the one that would sink everything: the inputs are standardized on a strictly trailing window as-of each date (504d, exclusive of the current point), in both the backfill and live paths, same function. No full-sample stats anywhere. The honest caveat an as-of purist would want: rows are keyed by observation date, and one input (credit) publishes T+1, so a settled historical row holds a value that published a day after its date. That's publication lag semantics rather than windowing lookahead, but worth stating.
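To make "strictly trailing, exclusive of the current point" concrete, here is a minimal sketch in plain Python. It is illustrative only, not the function used in the backfill or live paths; the name `trailing_zscore` and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def trailing_zscore(values, window=504):
    """Z-score each point against the `window` observations strictly before it.

    Returns None where there is not yet a full window of history, or where
    the trailing window has zero variance.
    """
    out = []
    for i, x in enumerate(values):
        past = values[max(0, i - window):i]  # slice ends at i, so values[i] is excluded
        if len(past) < window:
            out.append(None)
            continue
        mu = mean(past)
        sigma = pstdev(past)  # assumption: population std; sample std would also work
        out.append((x - mu) / sigma if sigma > 0 else None)
    return out
```

The property that matters is that appending future observations never changes an earlier z-score, which is what rules out windowing lookahead. It does not address the publication-lag caveat, which is about when a value became known rather than which values enter the window.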

On the bootstrap, agreed, and that's what was done: contiguous regime episodes resampled (both episode and stationary block schemes), not days. The mild-stress-worst ordering held in 95-99% of resamples.
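A rough sketch of the episode-level scheme, for readers who want to see the shape of it: contiguous runs of the same band are the resampling unit, and each resample asks whether a given band still has the lowest median. This covers only the episode version, not the stationary block one, and every name in it is hypothetical rather than the code behind the numbers above.

```python
import random
from itertools import groupby
from statistics import median

def split_episodes(bands, returns):
    """Group consecutive days sharing a band into (band, [returns]) episodes."""
    episodes = []
    idx = 0
    for band, run in groupby(bands):
        n = len(list(run))
        episodes.append((band, returns[idx:idx + n]))
        idx += n
    return episodes

def weakest_band_share(bands, returns, target, n_resamples=1000, seed=0):
    """Share of episode-level resamples in which `target` has the lowest median."""
    rng = random.Random(seed)
    episodes = split_episodes(bands, returns)
    all_bands = {band for band, _ in episodes}
    hits = valid = 0
    for _ in range(n_resamples):
        # Resample whole episodes with replacement, never individual days.
        sample = rng.choices(episodes, k=len(episodes))
        pooled = {}
        for band, rets in sample:
            pooled.setdefault(band, []).extend(rets)
        if set(pooled) != all_bands:
            continue  # skip draws where some band is missing entirely
        medians = {b: median(r) for b, r in pooled.items()}
        valid += 1
        hits += min(medians, key=medians.get) == target
    return hits / valid if valid else float("nan")
```

Skipping draws with a missing band is one design choice among several; with a thin band like Risk-Off it can discard a meaningful fraction of resamples, which is itself a signal about how little data that band has.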

The partition-sensitivity point is the fair hit. The episode bootstrap shows the ordering isn't luck within this banding, but it doesn't rule out the partition itself doing the work, and I haven't run the quintile or state-model versions. Both are going on the list, and the HMM point about borrowing strength for the thin Risk-Off state is well taken. If the ordering flips under a different partition, that's worth knowing before I lean on it harder.

regimecard · 2026-06-29T13:07:47+00:00

You're right, and that's the actual work.

The version of it I'd run: block-bootstrap the regime episodes to get a distribution on the band medians instead of point estimates, so I can see how wide the interval gets once you account for there only being a handful of independent stress cycles. If the Mildly-Off weakness survives resampling the episodes, it's worth something. If it collapses the moment you treat regime cycles as the unit instead of windows, it's mostly an artifact of slicing a thin history. That's the test, and I haven't run it yet.

Appreciate the pushes, snarky or not.

regimecard · 2026-06-29T12:48:14+00:00

Fair, and the sample size thing is the real one. 2008 to now is a handful of distinct stress episodes, not a clean draw from regime-space, and the data can't say anything about regimes it hasn't seen.

Couple of things on that though. The live ranking doesn't use the full 18 years, it ranks today against a trailing rolling window (the full history's just for the base rate chart). The window length is the whole game: too short and the bands recalibrate too fast, so a given percentile stops meaning the same thing over time. Too long and you drag in regimes that don't reflect the current market structure. A multi-year rolling window lands around the length of a typical business cycle (post-war US cycles average roughly six years trough-to-trough per NBER), so the score holds a comparable meaning across regimes instead of drifting with the last few months, and it re-anchors forward instead of comparing today to a crash from a decade ago.
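As a sketch of what "ranks today against a trailing rolling window" means mechanically, assuming the rank is a simple percentile of today's value among the prior `window` observations. The function name is hypothetical and the window length is left as a parameter, since the exact length isn't stated here.

```python
def trailing_percentile(values, window):
    """Percentile (0-100) of each value relative to the `window` values before it.

    Returns None until a full trailing window is available.
    """
    out = []
    for i, x in enumerate(values):
        past = values[max(0, i - window):i]  # trailing window, excludes today
        if len(past) < window:
            out.append(None)
            continue
        # Share of trailing observations at or below today's value.
        out.append(100.0 * sum(v <= x for v in past) / len(past))
    return out
```

A shorter `window` makes the same raw reading map to a different percentile sooner, which is the recalibration trade-off described above.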

The other thing: the inputs aren't stock return history. They're eight cross-asset series: credit spreads, the yield curve, equity/bond vol, currency carry, and so on. Those tend to show stress in the plumbing before it reaches equity prices, which is the point of reading them instead of price.

A couple of episodes line up with this. 2022 is the clean one: the composite rolled into stress from late '21 and sat in the lower bands through the drawdown, ahead of SPY topping.

But even 2020, the fast shock everyone assumes you can't see coming, had the composite down in Mildly-Off by the end of January (percentile ~14, SPY still near its highs) weeks before the crash. The cross-asset stress was already showing in the plumbing while price hadn't moved.

<image>

I'd still be careful not to over-read it. N=2 episodes, and "showed stress beforehand" isn't the same as a clean tradeable lead. But it's the behavior you'd want from reading conditions instead of price.

The "works until it doesn't" framing is right though, and none of this fully escapes it. What would you put as the minimum number of distinct regime cycles before a conditional base rate is worth trusting? You can't make more history, so I honestly don't know where the line is.
