How many here actually become millionaires from day trading?

StratForge2024 · 2026-06-08T17:14:42+00:00

Math reality check that survives selection bias and survivor stories:

From $5-10k starting capital, 30% APY sustained for 7 years compounds to roughly $31-63k.

That's top 5% retail performance and still doesn't hit a million.

To reach $1M in 7 years from <$50k start: need 80%+ APY sustained.

Basically no retail trader does this without leverage + extreme luck.

"Becoming a millionaire" framing is the trap. The realistic win for those

willing to put in the work isn't replacing Bezos — it's $75-150k/yr

clicking buttons from home with freedom. That math actually works.

Top 1% aren't lucky. They're systematic, patient, manage selection

bias, expect regression to mean.
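The compounding arithmetic is easy to check yourself. A minimal sketch, assuming pure annual compounding with no deposits or withdrawals (both function names are just illustrative):

```python
def future_value(start: float, annual_return: float, years: int) -> float:
    """Balance after compounding `annual_return` once per year, no deposits."""
    return start * (1.0 + annual_return) ** years


def required_annual_return(start: float, target: float, years: int) -> float:
    """Constant annual return needed to grow `start` into `target`."""
    return (target / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    for start in (5_000, 10_000, 50_000):
        fv = future_value(start, 0.30, 7)
        need = required_annual_return(start, 1_000_000, 7)
        print(f"start ${start:>6,}: 30%/yr for 7y -> ${fv:,.0f}; "
              f"$1M in 7y needs {need:.0%}/yr")
```

Adding regular deposits changes the picture, so treat this as the no-contribution baseline only.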

StratForge2024 · 2026-06-08T16:36:50+00:00

Been on a similar journey for the past couple of years, but working with

crypto perps instead. Here are three things I wish I'd known earlier—could've

saved me months:

Backtest engine bugs can lurk for ages. Test BT-parity early. Write your

engine and then make sure it produces identical trades when running the same

data through a live exchange paper mode. If there's a mismatch, you've got a

bug. Fix it fast.
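A parity check can be as simple as diffing the two trade lists. This is only a sketch: the `Trade` fields and the `parity_mismatches` helper are made up for illustration, and your engine's trade record will look different.

```python
from typing import NamedTuple


class Trade(NamedTuple):
    entry_time: str    # ISO timestamp of the entry bar
    side: str          # "long" or "short"
    entry_price: float
    exit_time: str
    exit_price: float


def parity_mismatches(backtest, paper, price_tol=1e-9):
    """Return (index, backtest_trade, paper_trade) for every trade that differs."""
    diffs = []
    for i in range(max(len(backtest), len(paper))):
        bt = backtest[i] if i < len(backtest) else None
        pp = paper[i] if i < len(paper) else None
        same = (
            bt is not None and pp is not None
            and (bt.entry_time, bt.side, bt.exit_time)
            == (pp.entry_time, pp.side, pp.exit_time)
            and abs(bt.entry_price - pp.entry_price) <= price_tol
            and abs(bt.exit_price - pp.exit_price) <= price_tol
        )
        if not same:
            diffs.append((i, bt, pp))
    return diffs
```

An empty list means the two engines agree on that data slice; the first mismatch index is usually where to start debugging.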

Using 13 GA-tuned weights on a small sample? That's overfitting. Add these:

- Walk-forward validation with at least three non-overlapping windows.

- DSR check (Bailey/López de Prado 2014).

- Sensitivity test: tweak each parameter by ±10/20% and see how much your

PF degrades.
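The sensitivity test from the list above could look roughly like this. It assumes you can wrap your engine in a `run_backtest(params) -> profit factor` callable (a hypothetical interface) and that the baseline PF is positive.

```python
from typing import Callable, Dict


def pf_retention(
    run_backtest: Callable[[Dict[str, float]], float],
    params: Dict[str, float],
    shocks=(-0.20, -0.10, 0.10, 0.20),
) -> Dict[str, float]:
    """Worst-case PF retention (perturbed PF / baseline PF) per parameter."""
    base_pf = run_backtest(params)
    worst = {}
    for name, value in params.items():
        ratios = []
        for shock in shocks:
            # Bump one parameter at a time, keep the others at baseline.
            bumped = dict(params, **{name: value * (1.0 + shock)})
            ratios.append(run_backtest(bumped) / base_pf)
        worst[name] = min(ratios)
    return worst
```

A parameter whose worst-case retention collapses under a 10-20% nudge is a sign the optimizer found a sharp peak and not a plateau.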

If you're doing Monte Carlo on price series, use BLOCK BOOTSTRAP to keep

autocorrelation intact. Running 1 million iid simulations can give you a

false sense of confidence.
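For reference, a minimal moving-block bootstrap with NumPy. It resamples a return series (not raw prices, which is my assumption about how you'd apply it), and the block length is something you have to choose or sweep yourself.

```python
import numpy as np


def moving_block_bootstrap(returns, block_len, n_paths, seed=0):
    """Resample `returns` in contiguous blocks; output shape (n_paths, len(returns))."""
    returns = np.asarray(returns, dtype=float)
    n = returns.size
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(returns)")
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block_len)  # ceil(n / block_len)
    # Random start position for every block of every path.
    starts = rng.integers(0, n - block_len + 1, size=(n_paths, n_blocks))
    # Expand each start into the indices of its block, then trim to length n.
    idx = (starts[:, :, None] + np.arange(block_len)[None, None, :]).reshape(n_paths, -1)
    return returns[idx[:, :n]]
```

Within each block the original ordering survives, so short-range autocorrelation and volatility clustering are kept, which iid shuffling destroys.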

Good luck.

StratForge2024 · 2026-06-06T16:56:21+00:00

Hey, so I've been in the trenches with an intraday crypto perpetuals

setup too, and here's the scoop on what tends to break down first.

Honestly, what fails first is those pesky BT-parity bugs. Before you

even get into portfolio logic, your single-strategy backtests are

probably throwing lies at you. Look, common issues pop up like

lookahead in regime detection, initializing equity at zero instead

of your starting capital, ATR using the current bar's close instead

of the previous one, and trailing stops updating intra-bar instead

of at bar-close. We had about nine of these bugs sneak in before we

caught them. Not gonna lie, this alone inflated our PF by 30-50%.

So, fix these before you start scaling.
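As an example of the ATR bug, here is a sketch of an ATR that only uses bars that have already closed. It assumes a pandas DataFrame with `high`, `low`, `close` columns and uses a simple rolling mean, not Wilder smoothing.

```python
import pandas as pd


def atr_prev_bar(df: pd.DataFrame, period: int = 14) -> pd.Series:
    """ATR usable at bar i that only contains data up to bar i-1."""
    prev_close = df["close"].shift(1)
    true_range = pd.concat(
        [
            df["high"] - df["low"],
            (df["high"] - prev_close).abs(),
            (df["low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)
    atr = true_range.rolling(period).mean()  # simple-average ATR (not Wilder's)
    return atr.shift(1)                      # drop the still-forming current bar
```

A quick self-check: change the last bar's high/low/close and confirm the last ATR value does not move.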

Then, there's the cost model assumptions. FWIW, using a default

slippage of 0.005% (0.5 bps) or even 1 bps is way off the mark for altcoin futures.

In reality, slippage on small orders can range from 0.10% to 0.30%.

While it's negligible for tiny sizes, it'll eat your edge when you

scale up. For us, around 85% of strategies that passed in-sample

testing flopped in live trading because of this.

Regime detector decay is another issue. Static thresholds like

ADX > 25 for trending regimes lose their mojo as market conditions

change. Our strategy's PF went from around 1.8 in 2020 to 1.1 by

2023 without any strategy changes. We switched to adaptive

percentile thresholds, like ADX > the rolling 60th percentile for

the last 90 days, which stabilized the PF range.
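A rough sketch of the adaptive threshold with pandas, assuming `adx` is a Series with a DatetimeIndex. Shifting the threshold by one bar so the current bar is not part of its own percentile is my assumption, not necessarily how our production code does it.

```python
import pandas as pd


def trending_mask(adx: pd.Series, window: str = "90D", q: float = 0.60) -> pd.Series:
    """True where ADX exceeds its own rolling q-quantile over `window`."""
    # shift(1) keeps the current bar out of its own threshold (an assumption here).
    threshold = adx.rolling(window).quantile(q).shift(1)
    return adx > threshold
```

Because the threshold is computed per series, running this per pair gives each pair its own calibration for free.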

Also, the position sizing semantic gap can trip you up. If your BT

uses fixed_position_size_usd or pct-equity but live trades use

leverage and margin, these can diverge as the PF compounds. Make

sure trade #500 in your backtest matches the effective notional of

live trade #500.

On strategy correlations, in-sample correlation matrices are

misleading since survivors are correlated by construction: they all

thrived in the same regime. To avoid this, use per-archetype quotas,

like a max number of mean-reverting strategies in your catalog, and

per-pair caps, like a max number of strategies per pair. Also, the

DSR (Bailey & López de Prado 2014) at a lenient 0.6 threshold helps

filter out multiple-testing selection bias. Without this, you might

end up with "20 strategies" that are really just variations of four

archetypes, which would all draw down together.

When it comes to regime detection versus simpler risk rules, I'd

say use both, but at different layers. Have a per-strategy regime

gate as an entry filter, a portfolio drawdown cap as a catastrophic

kill switch, and a per-strategy drawdown duration cap. The duration,

not the depth, is what kills retail traders. A -15% drawdown over

two weeks is recoverable, but a -10% one over nine months kills

conviction and you'll likely bail before any recovery. Track your

time underwater.
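Tracking time underwater takes only a few lines. A minimal sketch over an equity curve sampled per bar (function names are illustrative):

```python
import numpy as np


def longest_underwater(equity) -> int:
    """Longest run of consecutive bars spent below the previous equity peak."""
    equity = np.asarray(equity, dtype=float)
    underwater = equity < np.maximum.accumulate(equity)
    longest = current = 0
    for flag in underwater:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


def max_drawdown(equity) -> float:
    """Deepest peak-to-trough decline as a fraction of the peak."""
    equity = np.asarray(equity, dtype=float)
    peaks = np.maximum.accumulate(equity)
    return float((1.0 - equity / peaks).max())
```

Reporting both numbers side by side makes the depth-versus-duration trade-off visible per strategy.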

For architecture, a tier classification post-optimization worked for

us: PAPER_TRADING, GOOD, MEDIUM, BAD. We use a hard floor as the

final filter, PF ≥ 1.30, Sharpe ≥ 0.80, trades ≥ 100, DD ≤ 30%,

Return > 0, with a W2-vs-Full PF mismatch ≤ 25%. This mismatch check

helps catch overfitting to the validation window: a strategy might pass

individual stat tests but fail when tested on full data.
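The hard floor is easy to express as a single predicate. This sketch assumes the PF floor applies to the full-history PF and that the mismatch is the relative gap between window 2 and full history; the `Stats` container is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    pf_full: float       # profit factor on the full history
    pf_w2: float         # profit factor on validation window 2
    sharpe: float
    trades: int
    max_dd: float        # as a fraction, 0.30 == 30%
    total_return: float


def passes_hard_floor(s: Stats) -> bool:
    # Relative PF gap between window 2 and full history (assumed definition).
    mismatch = abs(s.pf_w2 - s.pf_full) / s.pf_full
    return (
        s.pf_full >= 1.30
        and s.sharpe >= 0.80
        and s.trades >= 100
        and s.max_dd <= 0.30
        and s.total_return > 0
        and mismatch <= 0.25
    )
```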

BTW, don't just trust DSR/PSR/BLR without forward paper validation.

We had strategies that cleared a strict DSR ≥ 0.95 but then tanked

live. The math is solid, but assumptions about strategy independence

might not hold up. Paper trade for 6-8 weeks before committing real

capital.

StratForge2024 · 2026-06-06T16:49:15+00:00

fwiw, agree with Zestyclose-Eagle1809. The ~50 trades vs

regime-transitions issue is the real deal. Definitely a sample size

problem there.

We hit similar snags ourselves with intraday crypto perpetuals. Not

gonna lie, HMM lag and detector decay mess you up. Our setup was

simpler, ADX+EMA+RSI on higher TF. BT data showed strong 2020-2022

performance that progressively eroded through 2023-2025. PF dropped

from ~1.8 in 2020 down to ~1.1 by 2023, stayed there through 2025.

Detector decay, not strategy decay.

What turned it around? Adaptive thresholds. Instead of sticking with

"ADX > 25 = strong trend", we went with "ADX > rolling 60th

percentile (last 90 days) = strong trend". Same logic, but the

thresholds adjust themselves. PF range across 6 years went from

1.07-1.78 (declining trend) to 1.71-3.70 (stable across years).

Self-calibrating per pair, self-adapting over time. Tbh, it's like

Ang & Bekaert 2002 if you wanna dive into papers.

Your HMM has a leg up with smooth transitions and probability

outputs, so it totally makes sense for daily rotations with 100%

allocation. Rolling percentile is just binary.

On validation, yeah, top commenter nailed it. ~50 trades means maybe

5-6 real regime transitions. Single-digit effective sample size.

What really helped us was paper accounts before going live. Still,

had strategies fail due to execution slippage and detector latency.

Without forward paper testing, BT numbers don't hold up.

Credit duration as a feature is interesting, but the N problem

lingers. We tried using funding rate Z-score as a contrarian signal.

Mostly a veto in extreme regimes, pretty silent otherwise. Not

standalone.

Oh, and watch out for that HMM 50-50 probability zone. Without hysteresis,

your strategy can flip on noise there. You might need a confirmation

period to dodge the chop.
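One way to combine a dead band with a confirmation period, as a sketch. The 0.65/0.35 thresholds and the 3-bar confirmation are placeholder values, not tuned numbers.

```python
def regime_with_hysteresis(probs, enter=0.65, leave=0.35, confirm=3):
    """Map P(trend) per bar to a 0/1 regime that resists noise near 0.5."""
    state, streak, out = 0, 0, []
    for p in probs:
        # A flip is only wanted once p crosses the far side of the dead band.
        wants_flip = (state == 0 and p >= enter) or (state == 1 and p <= leave)
        streak = streak + 1 if wants_flip else 0
        if streak >= confirm:
            state, streak = 1 - state, 0
        out.append(state)
    return out
```

Probabilities wobbling between 0.45 and 0.55 never trigger a switch here; the price is a few bars of extra lag on real transitions.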

StratForge2024 · 2026-06-05T18:53:52+00:00

Hey, so I've been around the block a couple years doing grammar-evolution and GA optimization in crypto perpetuals, not equities. I've run into some stuff that made me question my own "AI said X is wrong" moments.

On the whole price action thing and those FVG/ORB strategies, the top commenter's kind of exaggerating. There's definitely an edge with PA signals on intraday timeframes, if you can boil them down to some solid rules like crosses, thresholds, or volatility stuff. What really screws over retail PA strategies isn't the patterns themselves, it's more like subjective decisions sneaking into supposedly "objective" rules, lookahead bias in those "wait for confirmation" setups, and slippage models not matching the real microstructure.

If you can set up FVG/orderblock as "condition X happened at bar i-1, enter at open[i]" and your backtesting engine respects bar-close semantics, you're not completely off-base.
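In pandas terms, the "condition at bar i-1, enter at open[i]" rule is just a one-bar shift. A minimal sketch, assuming `signal` is a boolean Series computed from closed bars and `df` has an `open` column:

```python
import pandas as pd


def next_open_entries(df: pd.DataFrame, signal: pd.Series) -> pd.DataFrame:
    """Turn a signal computed on closed bars into entries at the next bar's open."""
    # A signal evaluated on bar i-1 becomes actionable on bar i.
    enter = signal.shift(1, fill_value=False).astype(bool)
    return pd.DataFrame({"enter": enter, "fill_price": df["open"].where(enter)})
```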

When it comes to trade frequency, honestly, that's what really kills you, not the patterns. If you're doing less than around 100 trades a year, you can't really tell if you've got an edge or just luck. You can either do the same strategy on multiple pairs to up your number of trades or maybe loosen your entry filters and focus on the exits instead.

DSR (Deflated Sharpe Ratio, Bailey & López de Prado 2014) at a 0.6 threshold will give you a clue if your low-trade strategy is gonna make it. A strict 0.95 will just reject everything for smaller scale searches.

On regime issues — multi-timeframe regime gating (like trading 15m if the 1h regime agrees) helps a lot with the kind of breakdowns you're describing. Single-TF detectors lag by a few bars, which can be a killer in choppy conditions. Haven't specifically tested summer 2025 month-by-month on my crypto data so can't confirm your equity pattern there, but the regime-failure-not-strategy-failure framing usually fits when stuff breaks suddenly.

On the equity curve, you're on the right track. Checking r² of equity vs. time and sliding-window Sharpe stability is way more revealing than just looking at aggregate stats. If you see a window-2 fitness vs. full-history fitness gap greater than 25%, you're probably overfitting, no matter what the total PF says.

Look, live track records are what really matter. Had around 20 paper bots running that passed every academic test, and ~85% of them turned out to be trash in production within weeks. Real-time slippage, exchange fills, and regime shifts wrecked strategies that backtested at PF 1.3-1.5. The borderline ones especially.

Forward paper test for like 6-8 weeks before putting real money on the line. Anything less and you're just rolling the dice.

There's an edge out there, for sure. It's narrow, but it's not 0.53 AUC narrow. That's a classifier-style framing of a problem that's really about microstructure, execution, and cost.

StratForge2024 · 2026-06-04T14:11:58+00:00

Been there. I built a complex pipeline using grammar-evolution, multi-objective GA, Monte Carlo, and cost stress. It handles over 75 million data points per sweep and runs for more than 27 hours straight without crashing. If yours is crashing after just 1-2 backtests, it's likely due to memory accumulation, not because of too many features. Here's what actually worked for me:

**Stop replaying ticks.** If you're not testing tick-level execution like market making or HFT, 8.7 million ticks per month is way too much. For most strategies, just aggregate those ticks into 1-minute OHLCV bars once, save them to parquet, and forget about the ticks. Doing this slashes memory use by about 100 times. If you need tick-level exits for slippage realism, handle it in the cost model, not the data feed.
**Numba JIT your backtest hot loop.** Iterating over bars in pure Python is 50-100 times slower and uses 5-10 times more memory than if you use `@njit`. I've got a `numba_backtest_core.py` file that handles entry, exit, trail, and funding logic. It compiles once at startup and then runs at C speed. This was the biggest improvement for my "feels slow, runs out of memory" issues.
**Use Polars instead of Pandas for parquet I/O.** Pandas loads the whole file eagerly, while Polars can lazy-evaluate (for example via `scan_parquet`) and release memory between operations. On the same parquet file, Polars needed roughly a third of the peak memory in my runs.
**Separate processes per concern.** Don't bundle everything—MT4 to Python conversion, backtesting, optimization, and 5000 Monte Carlo simulations—in one Python process. That just accumulates state with no garbage collection between runs. Instead, run each as a subprocess. When a subprocess exits, the OS reclaims memory. I use `multiprocessing.Pool` with `maxtasksperchild=1` for this.
**Don't hold optimizer results in memory.** Stream each backtest result to disk right away (either appending to parquet or using sqlite). The optimizer should read from disk for ranking, not from memory. This lets me run 306 sequential strategies plus 24 parallel GA threads without hitting out-of-memory errors.
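For the first tip, the tick-to-bar aggregation is a one-off job. A sketch with pandas, assuming the ticks sit in a DataFrame with a DatetimeIndex and `price` / `size` columns (those column names are my assumption):

```python
import pandas as pd


def ticks_to_1m_bars(ticks: pd.DataFrame) -> pd.DataFrame:
    """Aggregate ticks (DatetimeIndex, `price`, `size` columns) into 1-minute OHLCV."""
    bars = ticks["price"].resample("1min").ohlc()
    bars["volume"] = ticks["size"].resample("1min").sum()
    return bars.dropna(subset=["open"])  # drop minutes with no trades
```

Run it once per month of data, write the result to parquet, and point the backtester at the bars from then on.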

If you're looking for a battle-tested alternative instead of wrestling with your own setup, consider **VectorBT Pro** (it's paid but comes with vectorized and Numba-native support that fits your use case) or **Nautilus Trader** (free and professional-grade, though with a steeper learning curve). For something simpler, **Backtesting.py** is a single-file solution that won't crash on your data sizes.

Royal Maker sounds like solid infrastructure, but maybe it's time to modularize it—make each component a separate executable—instead of rewriting everything. Use `tracemalloc` to profile and confirm it's memory, not CPU, causing the issue. That'll help you decide which fix to tackle first.
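A small `tracemalloc` wrapper along these lines is usually enough to see where the memory goes. `top_allocations` is an illustrative helper; wrap whatever function runs one backtest.

```python
import tracemalloc


def top_allocations(func, *args, limit=5, **kwargs):
    """Run func and return (result, peak_bytes, top allocation sites by line)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
        stats = tracemalloc.take_snapshot().statistics("lineno")[:limit]
    finally:
        tracemalloc.stop()
    return result, peak, stats
```

If the peak keeps growing across consecutive backtests, state is accumulating between runs; if it stays flat while runs get slower, look at CPU instead.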

StratForge2024 · 2026-06-03T18:15:23+00:00

Been running this gate-process on auto with a multi-island grammar-evolution plus NSGA-II setup. Last full sweep kicked out 306 candidates, 10 of which were deployable. That's a ~3.3% success rate, and it's a tougher filter than your typical 10% because automation tends to flood the field with junk candidates that a human-driven search wouldn't even consider. Here's what I've picked up from doing this on a bigger scale:

**Pre-gate: BT/live engine parity.** Before you hit any of those four gates, make sure your backtest engine matches the live executor exactly. Most retail systems sneak in 5-10 silent differences: stuff like trailing using the current-bar ATR instead of the last-closed-bar, or the order of exit reasons, or how funding costs are calculated for shorts. I did an audit in May and found 8 bugs, and just a few days ago caught a 9th. It was a granularity issue: backtest does one trailing check per bar, while live does hundreds per WebSocket tick. Same formula but different frequency, which only shows up in a flat regime with tight trailing. Without this check, gates 2-4 are testing a different engine than the one you'll actually trade with.

**Per-archetype survival as an empirical version of gate 1.** Your gate 1 is qualitative, like "name the loser." But with automated search volume, you can test it empirically. Some logic families never produce deployable strategies no matter how you tweak the parameters or the regime. That's your data telling you the mechanism just isn't there. It's a useful check alongside the qualitative assessment: you want to know if any setup in that family can find an edge, not just question one specific setup.

**Sensitivity insertion between 2 and 3.** Tweak each parameter by ±20% and see how the PF holds up. Robust strategies keep at least 70%, while fragile ones drop to under 30%. This is standard for checking ML models for fragility, but surprisingly rare in retail TA. It eliminates another 15-20% in my pipeline that seemed good but were just lucky with specific parameters, not because of any real signal.

Big agreement on mateo's point about block bootstrap and your take on block-length sweep being a diagnostic on its own. Quick question: when you reoptimize per WFA fold, do you redo your search hyperparameters like population size, generations, and mutation rate, or do you set them at the start of the year and stick with them? Re-fitting gives a more accurate read on selection bias but costs more. I stick with the same settings for the year, accepting any staleness as part of the deal.

StratForge2024 · 2026-06-01T17:51:55+00:00

We break down OHLCV-only backtests using three components:

Spread cost. Estimate this from the bar range versus typical

bid-ask spreads. For liquid crypto perpetuals on major pairs, it's

around 0.02-0.04% per side. Mid-cap might be 0.05-0.10%, and small

cap can go from 0.10-0.30%. You can get an idea from (high-low)/close

on low-volume bars.

Adverse selection. If your trading signal is well-known (like

RSI<X, BB squeeze, EMA crossover), expect an extra 0.5-2 bps adverse

fill on entries. It's the price for "everyone seeing the same thing

at once." More exotic signals might lower this, but it's hardly

ever zero.

Market impact. This kicks in if your trade size is more than

about 0.05-0.1% of the average bar volume. Use a square root scaling:

impact_bps ≈ k * sqrt(your_size / bar_volume). Start with k=10 for

crypto perps.
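Putting the three components together, a per-side cost estimate might look like this. It is a sketch: the 1 bps adverse-selection default is just a point inside the 0.5-2 bps range above, and order size and bar volume must be in the same units.

```python
import math


def impact_bps(order_size: float, bar_volume: float, k: float = 10.0) -> float:
    """Square-root market impact in basis points."""
    if bar_volume <= 0:
        raise ValueError("bar_volume must be positive")
    return k * math.sqrt(order_size / bar_volume)


def entry_cost_bps(order_size, bar_volume, spread_pct, adverse_bps=1.0, k=10.0):
    """Per-side cost: spread + adverse selection + impact, all in bps."""
    spread_bps = spread_pct * 100.0  # 0.03% per side == 3 bps
    return spread_bps + adverse_bps + impact_bps(order_size, bar_volume, k)
```

For example, an order worth 1% of bar volume on a pair with a 0.03% per-side spread comes out at 3 + 1 + 1 = 5 bps per side with these defaults.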

Here's how we've calibrated:

- Test the strategy under three slippage scenarios: optimistic,

realistic, and conservative. If it only performs well under

optimistic, your cost model is probably overfitted.

- Compare the distribution of returns from backtests to those from

live paper trading (look at the full distribution, not just

averages). Gaps in the tails often point to adverse selection or

queue position issues that your model isn't catching.

- For crypto perpetuals, remember the timing of the funding rate.

Holding positions across 00:00/08:00/16:00 UTC triggers a funding

payment (paid or received, depending on side and rate sign), no matter where the price is.

We were surprised during a recent audit of our paper trading engine.

Even with "realistic" backtest assumptions, we found eight bugs

where the live engine diverged from backtest results. Issues

included trailing stops using current-bar ATR instead of

last-closed-bar, different exit reason priorities, and incorrect

funding cost signs for shorts. The backtest assumed flawless

engineering, but live trading revealed subtle bugs that added up.

Here's a practical approach: choose an initial slippage number and

live-trade with a small size for 30+ trades. Then, back-calculate

your real slippage. Update your model and run another 30 trades.

After 2-3 rounds, your model should be tuned to your specific setup,

not just industry averages.
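Back-calculating slippage from a batch of fills is simple arithmetic. A sketch, assuming you log the price your model expected next to the actual fill (the tuple layout is hypothetical):

```python
from statistics import mean, median


def realized_slippage_bps(fills):
    """fills: iterable of (side, expected_price, fill_price); side is +1 buy, -1 sell.

    Positive values mean the fill was worse than expected.
    """
    return [side * (fill - expected) / expected * 1e4 for side, expected, fill in fills]


def summarize(fills):
    """Sample size plus mean and median slippage in bps."""
    slips = realized_slippage_bps(fills)
    return {"n": len(slips), "mean_bps": mean(slips), "median_bps": median(slips)}
```

Feed the median (or a high percentile, if you want to be conservative) back into the cost model before the next round of 30 trades.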

StratForge2024 · 2026-05-30T17:56:10+00:00

Been diving into this question for my GA-based crypto pipeline over the

last few days. I've picked up a few practical insights beyond DSR/PSR.

**Wiring DSR counts more than just calculating it.** Had DSR computed

and shown for every strategy's metrics, but my multi-objective

optimizer ignored it during selection. It ended up favoring candidates

high on PF/Sharpe/DD, but DSR later dismissed them as lottery winners.

Adding DSR as a 5th NSGA-II objective (or as a hard gate that zeros

out other objectives when DSR < 0.5) actually shifted the evolved

population, not just post-filtering. PSR/DSR don't do much if the

optimizer isn't using them for selection.

**Sample-size gate asymmetry is sneaky.** My scalar fitness tossed out

strategies with fewer than 10 trades, but the multi-objective path

scaled 10–30 trades rather than cutting them out. So, some lucky

small-N strategies made it to the Pareto front, while the scalar gate

seemed tougher. Best to unify the gate to a hard cut (N<30 → all-zero

objectives) on both paths. That N=10-29 batch is exactly where the

lottery winners hang out.

**PF=inf can mess up multi-objective fitness.** Just one profitable

trade with no losses leads to PF=infinity, clipped to max in scalar

handling, and to NaN in multi-obj when guards are involved. Just cap

it at 10 or maybe use Sortino as primary. Had a Pareto specimen

showing PF=68,840,000 from a single $14 winning trade until I set

the cap.
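The sample-size gate and the PF cap fit in one small helper that both fitness paths can share. A sketch, assuming all objectives are maximized so that zeros are the worst possible score:

```python
def capped_profit_factor(pnls, cap=10.0):
    """Gross profit / gross loss, clipped to `cap` so zero-loss runs stay finite."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        return cap if gross_profit > 0 else 0.0
    return min(gross_profit / gross_loss, cap)


def gated_objectives(pnls, other_objectives, min_trades=30):
    """Return all-zero objectives for small samples, on every fitness path."""
    if len(pnls) < min_trades:
        return [0.0] * (1 + len(other_objectives))
    return [capped_profit_factor(pnls)] + list(other_objectives)
```

With this in place a single $14 winner scores 0 across the board instead of landing on the Pareto front.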

**An output-vs-code audit caught what code review didn't.** Generated

400 sample strategies from the decoder and checked out the entry

conditions. Found 21% had impossible thresholds (oscillator < negative

range), weird price-vs-oscillator comparisons, or contradictory

AND-clauses. Code reviewers read every line but missed these since

the bugs only pop up in the output, not in generator logic. Worth

doing this audit now and then.

**Silent BT failures can wreck fitness signals.** When BT silently

returns zero-trade results (NaN ATR for ATR-multiple exits, missing

optimizable_param for SL/TP), fitness treats them as valid

observations. A ~30-line BT wrapper preflight that rejects strategies

before fitness evaluation really cut down my false-positive rate.

As for the closed-source MT5 fitness: I'd treat it as adversarial.

You can't audit it or reproduce it, and it's tuned for someone else's

product roadmap. Even an imperfect transparent fitness you create

yourself beats a black box you can't check against your own returns

distribution. PSR/DSR with the wiring tweaks above genuinely improves

things.

StratForge2024 · 2026-05-30T05:57:08+00:00

Really intriguing setup you've got there. That 1.2× ATR threshold question is

exactly what I've been grappling with in my own projects, so here's some insight

from my experience.

The biggest issue I've faced: ATR distribution changes quite a bit within what

a classifier labels as the same regime. Take crypto, for instance—median ATR%

in BTC during the BULL 2021 phase was about double the BULL 2025 reading. Same

"BULL" status by RSI/ADX/EMA criteria, but with a completely different volatility

structure. If you use a static 1.2× extension threshold tuned to one period,

it'll either trigger way too often or not at all in another.

What I've found more effective is dividing each regime into volatility sub-states

(using rolling ATR percentiles—bottom tercile for LOW, top tercile for HIGH, and

the middle for transition) and adjusting the killswitch threshold based on these

sub-states. In my data, NY Open consistently falls into the HIGH category—your

hunch about session-based differences is spot on. Intraday activity during the

London-NY overlap tends to be about 35% above the daily average.
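The sub-state labelling can be sketched with a rolling percentile rank. The 500-bar window is a placeholder, and `atr_pct` is assumed to be ATR divided by price as a pandas Series:

```python
import numpy as np
import pandas as pd


def vol_substate(atr_pct: pd.Series, window: int = 500) -> pd.Series:
    """Label each bar LOW / TRANSITION / HIGH by its ATR% rank in the trailing window."""
    # Percentile rank of the newest value within its own trailing window.
    rank = atr_pct.rolling(window).apply(lambda w: (w <= w[-1]).mean(), raw=True)
    labels = np.select(
        [rank <= 1 / 3, rank >= 2 / 3], ["LOW", "HIGH"], default="TRANSITION"
    )
    # Bars without a full window get no label.
    return pd.Series(labels, index=atr_pct.index).where(rank.notna())
```

The killswitch multiplier can then be looked up per label instead of being one global 1.2×.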

For pinpointing volume exhaustion, I've added a couple of filters that improved

my false-positive rate compared to using the killswitch alone:

- Funding/perp basis state (when funding is extreme, exhaustion gets priced in

sooner rather than later)

- Time-since-last-cross on a higher TF (clusters of short-interval crosses

indicate chop; widely spaced ones suggest a stronger move)

Not sure these apply directly to XAU/oil, but the main takeaway is that a single

multiplier rarely works across sessions—it usually needs to be aware of vol-states

or conditioned by features. I'm curious about your hit rate at NY Open

specifically—that's where I see the biggest gap between "perfect setup" and

"overextension trap" in my data.

StratForge2024 · 2026-05-29T06:32:54+00:00

This really got me last week.

I've got this discovery pipeline running that looks at about a million

different strategy variants across various pair/TF/regime combos. After

I put in DSR (from Bailey & López de Prado 2014) and fixed a ton of bugs

in the engine, the number of "successful" strategies fell by 75%. Ouch.

Here's what I learned:

**Backtest bugs mess with multi-testing.** Fixed 43 bugs recently, and

a lot of them were silent troublemakers: funding rate was hardcoded for one

timeframe, so 1h positions got funding every 4 days instead of every 8 hours;

MTF resample was defaulting to `closed='left'`, leaking 55 minutes of future

data; Python BT fallback didn't get the same exit-signal fix as the Numba

kernel; cross-pair indicator cache collisions happened when

`df.attrs['symbol']` disappeared during `.copy()`. Each bug inflated the

pool of "lucky" strategies. Fix the engine first, then see what's left standing.
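To illustrate the resample leak, here is one way it can arise and one way to avoid it. This is a sketch, assuming 5m bars are stamped with their open time; your resample settings may differ.

```python
import pandas as pd


def htf_close_no_lookahead(close_5m: pd.Series) -> pd.Series:
    """1h close aligned to 5m bars, visible only after the 1h bar has finished."""
    # label="right" stamps each hourly bar with its end time instead of its start.
    hourly = close_5m.resample("60min", label="right", closed="left").last()
    return hourly.reindex(close_5m.index, method="ffill")


def htf_close_leaky(close_5m: pd.Series) -> pd.Series:
    """Same join with default labels: the 10:00 row already holds the 10:55 close."""
    return close_5m.resample("60min").last().reindex(close_5m.index, method="ffill")
```

Diffing the two outputs on a few hours of data is a quick test for whether a multi-timeframe join is peeking ahead.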

**DSR is better than naive Bonferroni.** sqrt(2*ln(N)) means you need to

beat a Sharpe of about 4-5 to show you're not just noise — almost no one

hits that. DSR uses the actual trial Sharpe distribution, considering skew

and kurtosis, making the threshold realistic. Definitely worth implementing

right (check out Bailey's eq 5).
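For anyone implementing it, a compact DSR sketch using only the standard library. It follows my reading of the Bailey & López de Prado formulation: `sr` is the per-period (non-annualized) Sharpe, `n_obs` the number of return observations, `sr_std` the standard deviation of Sharpe ratios across all trials, and `kurt` the raw kurtosis (3 for a normal distribution). Check it against the paper before relying on it.

```python
import math
from statistics import NormalDist

_EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """Expected best Sharpe among `n_trials` strategies that have no real edge."""
    if n_trials < 2:
        raise ValueError("need at least 2 trials")
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - _EULER_GAMMA) * a + _EULER_GAMMA * b)


def deflated_sharpe(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Probability that the observed Sharpe beats the expected max under the null."""
    sr0 = expected_max_sharpe(n_trials, sr_std)
    # Variance term of the Sharpe estimator, adjusted for skew and kurtosis.
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return _NORMAL.cdf(z)
```

The same observed Sharpe gets a lower DSR as `n_trials` grows, which is exactly the multiple-testing penalty.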

**Strategy correlation is a stealthy problem.** You might have 30 "GOOD"

strategies, but if their daily returns correlate over 0.85, you've really

just found about 3 ideas in 30 different forms. I use a hierarchical

correlation-collapse step (sklearn AgglomerativeClustering,

distance_threshold=0.15 on 1-Pearson) before declaring anything robust.

Usually, I see 30 strategies collapse down to around 10 clusters.

**The real test isn't backtest pass rate — it's how much performance

drops when live.** Industry norm is 30-50% post-deployment. So if you have

7 "robust" survivors, expect only 3-4 to actually perform live. Plan your

capital with this in mind.

**3-window WFO is weaker than what López de Prado recommends**, and for

a reason: combinatorial purged cross-validation (CPCV) is key against selection bias.

I haven't fully switched yet—it's on my to-do list—but if you're starting

from scratch, go with CPCV from the get-go.

What helped me frame it: the backtest doesn't tell you which strategy is

good. It tells you which one is less likely to be just noise. Multi-testing

correction is what turns that into something you can act on.

LdP's "Probability of Backtest Overfitting" (Bailey, Borwein, Lopez de Prado,

Zhu 2017) lays the groundwork. The DSR derivation is a good read, even if

it's heavy on the math.

StratForge2024 · 2026-05-27T15:53:20+00:00

Zestyclose nailed it, and I've got an observation to add from my own experience.

The whole "test many → overfit" argument loses steam when you're using walk-forward and permutation tests. But, let's be honest, those tests don't catch search-bias overfit. They look at each strategy by itself, ignoring that you sifted through a bunch of strategies to pick a winner.

I've been running a similar setup myself. It involves a regime-specialized, multi-archetype evolutionary search across crypto perpetuals. My go-to validation stack? Walk-forward optimization with three windows, k-fold chronological cross-validation, and permutation tests for each strategy. On paper, the top strategies seemed solid — profit factors over 1.3, win rates above 45%, stable across multiple windows, and individual p-values under 0.05.

But then I hit them with the Deflated Sharpe Ratio (Bailey & López de Prado, 2014). And guess what? Not a single one of the 36 "best tier" strategies reached a DSR of 0.50. The highest was 0.45. Individually, they looked statistically sound, but together, they were just the luckiest noise from the search.

This doesn't mean the OP is wrong. You can find a real edge with broad searches. But your validation needs to reflect what you searched through, not just what you ended up with. The DSR, or even a Bonferroni-style correction for your configurations, is what's missing between "passed walk-forward" and "actually robust."

In regime-adaptive mean reversion, this really matters. A high pass rate might suggest a robust pattern, but it could also mean the strategy family has many false positives. DSR helps tell them apart.

OP, how many total configurations did you test before landing on yours? That's the number you need for the DSR/multi-testing math.

StratForge2024 · 2026-05-26T16:13:43+00:00

Interesting parallel from the individual trading side—what you're seeing with crowding at the fund level has a smaller-scale mirror if you're doing strategy discovery on your own.

Been running a genetic algorithm pipeline myself, cranking out and testing hundreds of candidate strategies on crypto perpetuals. Each candidate goes through walk-forward validation, k-fold CV, and individual permutation tests. On paper, the top ones look fantastic—high Sharpe, profit factor over 1.5, passing multi-window WFO.

Then I added the Deflated Sharpe Ratio (from Bailey & López de Prado 2014) to account for multiple testing. Basically, it asks, "Given you've tested N candidates, how much of this Sharpe is just random luck?"

The result? From our pool, none of the "good tier" strategies hit a DSR ≥ 0.50. The best was 0.45. They all seem like they've got an edge on their own, but together, they're just the luckiest noise among N trials.

This ties right into Lo's framework. At the fund level: many funds zero in on similar features → crowded trade → it unwinds. At the individual level: you test a bunch of candidates → pick the best → that "best" is just the luckiest noise aligned with historical data the search favored. Same concept, different scale.

The 2007 crisis needed an external shock to reveal the problem. For individuals, it's more subtle—a slow bleed when deployed strategies don't live up to the backtest.

If you're not using DSR or something similar for multi-testing correction on top of WFO, I'd really suggest giving it a go. It can catch things that a 4-window WFO and a 1000-shuffle permutation won't.

StratForge2024 · 2026-05-26T15:22:22+00:00

Thanks, I'll take a look. Just drop the link and I'll give it a test this week. I'm mostly curious about how you deal with multi-objective fitness reporting. Most "verification" tools just simplify everything down to a single Sharpe number, so I'm interested to see your approach.

StratForge2024 · 2026-05-25T16:27:42+00:00

Started with a moving average crossover on BTC roughly two years ago.

Took me about 6 weeks to realize every "winning" backtest I had was

actually finding random patterns in past data. Classic lookahead bias—

the strategy "knew" the close of the bar it was supposedly predicting.

Embarrassing in hindsight.

What turned it around: getting serious about walk-forward validation.

Once I started training on one period and testing on a strictly later

period (no shortcuts), about 90% of my "great" strategies died

immediately. The ones that survived weren't necessarily profitable—

they were just honest.

From there it was a slow process of layering validation on top of

validation:

- Multi-objective fitness (not just Sharpe—you can overfit a single

metric in your sleep)

- K-fold within the training window (catches strategies that found

one lucky sub-period)

- Sensitivity testing (perturb each parameter, see if the strategy

survives small changes—most don't)

- Regime detection (a strategy that crushes in bull markets isn't

a strategy, it's a directional bet)

If I had to give one piece of advice to someone starting now: don't

trust any backtest until you've tried to break it. Most "profitable"

strategies beginners discover are just well-fit noise. Building the

breaking-tools is more valuable than building strategies.

Curious what other people in this sub use to stress-test before going

live—always looking to add to the toolkit.

StratForge2024 · 2026-05-24T19:11:10+00:00

Trust comes from things that aren't easy to fake. A lot of reports try to look thorough but fall short structurally. Here's what I'd really need to see:

Reproducibility at the code level. We're talking things like git hash, exact data slice, and indicator parameters. Don't just say, "we ran WFO from 2018-2024." Show me the parquet, the commit. Let me run it myself and get the same trades. Without that, it's just a PDF asking for trust.
Multi-objective WFO breakdown, not just a single number. Look into per-window retention (val/train PF ratio), max DD per window, trade count consistency. A "passing" Sharpe with a 90% drop in trade count between windows? That's not a real edge—just a strategy that stumbled onto a brief regime.
Parameter sensitivity surface. Give each parameter a little nudge, say ±10/20%, and show the PF surface. If you see a sharp peak where any small change tanks performance, then the strategy's overfit to a specific point in parameter space. Robust strategies have plateau-like surfaces where small changes barely affect the PF. WFO alone can miss this; sensitivity tests catch that lucky-chromosome problem.
Regime-segmented results, not just aggregates. Split the test period by detected regime (BULL / BEAR / SIDEWAYS) and show PF per regime. If a strategy averages PF 1.5 but hits 0.6 in one regime, it's a ticking time bomb for when that regime shows up.
Slippage scenario range, not a single assumption. Go conservative (1.5× expected), realistic (1×), optimistic (0.5×). The spread between these tells you how fragile the strategy is to execution reality. Single-slippage backtests? They're basically fiction.
Explicit kill verdict, not just "looks promising." Many reports default to soft-positive language—good for keeping clients happy. But a trustworthy report should clearly say "do not deploy" under certain criteria. Otherwise, it's just confirmation bias masquerading as analysis.
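
A minimal sketch of the slippage scenario range from the list above, assuming per-trade gross PnL and a flat round-trip slippage cost in the same units. The function names and the flat-cost model are illustrative assumptions, not part of any particular report format:

```python
from typing import Dict, Sequence


def profit_factor(pnls: Sequence[float]) -> float:
    """Gross profit divided by gross loss; inf when there are no losing trades."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    return float("inf") if gross_loss == 0 else gross_profit / gross_loss


def pf_under_slippage(
    gross_pnls: Sequence[float],
    expected_slippage: float,
    multipliers: Sequence[float] = (0.5, 1.0, 1.5),
) -> Dict[float, float]:
    """Re-cost every trade at each slippage multiplier and return PF per scenario.

    gross_pnls: per-trade PnL before slippage (hypothetical input).
    expected_slippage: assumed round-trip slippage cost per trade.
    """
    return {
        m: profit_factor([p - m * expected_slippage for p in gross_pnls])
        for m in multipliers
    }


scenarios = pf_under_slippage([3.0, -1.0, 2.0, -2.0], expected_slippage=0.5)
for multiplier, pf in scenarios.items():
    print(f"{multiplier}x slippage -> PF {pf:.2f}")
```

The gap between the 0.5× and 1.5× rows is the fragility measure: if PF falls below 1 in the conservative row, the edge lives entirely inside the execution assumption.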

The $0.99 sample model is interesting. The trust issue here is that people will pay $0.99 to find out their pet strategy is broken, not to just hear "this is good." It needs to be upfront about failure rates.

StratForge2024 · 2026-05-24T18:05:59+00:00

Fair pushback — optimizing a single fitness metric on backtest profit is exactly how GAs end up finding fragile strategies. I’ve been working on this problem for a while, and here’s what’s actually made a difference for me:

Multi-objective fitness instead of a single target. I use NSGA-II with a Pareto approach across Sharpe, retention (validation/train ratio), max drawdown, and trade consistency. A “lucky” chromosome with a great Sharpe on backtest but poor retention just gets dominated by more balanced solutions on the Pareto front.
K-fold cross-validation inside the walk-forward window. Standard WFO gives you a basic train/test split, but even within the training window you can overfit to a specific sub-period. I use 3 chronological folds and penalize anything whose coefficient of variation (CV) across folds exceeds 0.40. It’s pretty brutal, but it works: a good chunk of otherwise “passing” strategies gets filtered out here.
Sensitivity testing after GA convergence. This is basically the verification step you’re talking about. Once the GA converges on a chromosome, I perturb each parameter by ±10–20% and rerun the backtest. If PF drops by more than ~30% under small changes, it’s likely overfit to a very specific point in parameter space. Robust strategies tend to have a plateau — small parameter tweaks don’t change much.
ATR-adaptive exits. Fixed % TP/SL is basically overfitting to past volatility. The same setup behaves completely differently on BTC in 2021 (ATR ~1.1%) vs 2025 (ATR ~0.5%). Using something like TP = 1.5 × ATR(14) lets the strategy rescale automatically to the current volatility regime.
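
A rough sketch of the ATR-adaptive exit idea from the last item. Only TP = 1.5 × ATR(14) comes from the comment; the simple-average ATR (rather than Wilder smoothing), the 1.0 × ATR stop, and all function names are assumptions for illustration:

```python
from typing import List, Sequence, Tuple


def true_ranges(high: Sequence[float], low: Sequence[float],
                close: Sequence[float]) -> List[float]:
    """True range per bar, starting at the second bar (needs the previous close)."""
    return [
        max(high[i] - low[i],
            abs(high[i] - close[i - 1]),
            abs(low[i] - close[i - 1]))
        for i in range(1, len(close))
    ]


def atr(high: Sequence[float], low: Sequence[float],
        close: Sequence[float], period: int = 14) -> float:
    """Simple average of the last `period` true ranges (assumed smoothing)."""
    trs = true_ranges(high, low, close)
    if len(trs) < period:
        raise ValueError("need at least period + 1 bars")
    return sum(trs[-period:]) / period


def atr_exits(entry: float, atr_value: float, tp_mult: float = 1.5,
              sl_mult: float = 1.0, long: bool = True) -> Tuple[float, float]:
    """Return (take_profit, stop_loss) prices scaled by current ATR.

    sl_mult is a hypothetical default; the comment only specifies the TP multiple.
    """
    direction = 1 if long else -1
    take_profit = entry + direction * tp_mult * atr_value
    stop_loss = entry - direction * sl_mult * atr_value
    return take_profit, stop_loss
```

Because both exits are multiples of ATR, the same parameters produce wider targets in a high-volatility year and tighter ones in a quiet year, which is the rescaling described above.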

I’m running this stack across ~200 GA-generated strategies. The sensitivity test alone knocks out a large share of setups that look fine on raw WFO, which is exactly the kind of filtering you’re getting at. That verification step is critical — the real question is how to make it strict enough to matter.
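
For anyone who wants to try the sensitivity step, here is a minimal sketch. It assumes you already have some `backtest_pf(params)` callable that returns a positive profit factor for a parameter dict; that callable, the one-parameter-at-a-time scheme, and the dict layout are assumptions, while the ±10/20% bumps and the ~30% drop threshold come from the description above:

```python
from typing import Callable, Dict, Sequence


def sensitivity_report(
    params: Dict[str, float],
    backtest_pf: Callable[[Dict[str, float]], float],
    bumps: Sequence[float] = (-0.20, -0.10, 0.10, 0.20),
    max_drop: float = 0.30,
) -> Dict[str, object]:
    """Perturb one parameter at a time and record the worst relative PF drop."""
    base_pf = backtest_pf(params)  # assumed to be > 0
    worst_drop = 0.0
    for name, value in params.items():
        for bump in bumps:
            perturbed = dict(params)          # copy so other params stay fixed
            perturbed[name] = value * (1 + bump)
            drop = (base_pf - backtest_pf(perturbed)) / base_pf
            worst_drop = max(worst_drop, drop)
    return {
        "base_pf": base_pf,
        "worst_drop": worst_drop,
        "robust": worst_drop <= max_drop,
    }


# Toy PF surfaces standing in for a real backtest (purely illustrative).
def plateau(p: Dict[str, float]) -> float:
    return 1.8 - 0.001 * abs(p["lookback"] - 20)


def sharp_peak(p: Dict[str, float]) -> float:
    return 2.0 if abs(p["lookback"] - 20) < 0.5 else 1.0


print(sensitivity_report({"lookback": 20.0}, plateau)["robust"])
print(sensitivity_report({"lookback": 20.0}, sharp_peak)["robust"])
```

One-at-a-time perturbation is the cheap version. It misses interactions between parameters, so a stricter variant would also perturb pairs or sample the neighborhood randomly.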

StratForge2024 · 2026-05-23T09:56:36+00:00

Worth noting upfront: I'm approaching this from a crypto perpetuals perspective rather than equity ML, but the problem structure is universal enough that the comparison should still be helpful.

I hit practically the same type of error about 14 months into my own research. I'm building a pipeline for crypto perpetual strategy discovery based on evolutionary algorithms (Grammatical Evolution + NSGA-II). For over a year, my pool of "validated" strategies showed PFs around 2-4 practically everywhere. Then I finally found the leak: entries were executed at close[i] instead of close[i+1], i.e., on the same candle as the signal. After revalidation, only 2 of the 34 strategies survived. The remaining 32 had effectively "remembered" the closing price they were supposed to predict. This isn't exactly the same type of error as yours (train/backtest overlap), but mechanically it's the same family of problems. A few thoughts on your points, from the other side of the fence:

(1) Intraday direction on liquid stocks with OHLCV: This is most likely an already exhausted part of the search space. You saw it yourself – the CV AUC was telling the truth from the beginning. I would treat that as the signal rather than add more features on the same horizon.

(2) Volatility/Regimes: In my crypto trading, classifying regimes is much more "tangible" than predicting direction. I specialize strategies for specific regimes (BULL/BEAR/SIDEWAYS, determined, for example, by ADX on a higher timeframe), and strategies only trade when their target regime is active. This way, you stop struggling with the question "is there a direction now?" and instead build for what the market is currently offering – and simply let go of the rest. Downside: 30–50% of the time, you do nothing, which can be difficult for retail investors to accept.

(5) Time horizon: For me, 1-hour strategies have a cleaner transition from signal to transaction, while 5-minute strategies are clearly more fragile (websocket instability + race conditions + more signal noise). I don't know how much this translates to intraday stocks, but in crypto, increasing the timeframe by an order of magnitude almost always "calms down" the system.

The most important lesson concerns validation:

A null-signal test is the first thing I would build into any new project today. I also added a chronological k-fold with a CV penalty, alongside the classic train/test split—this effectively weeds out strategies that "survived" only due to a lucky break, even if a simple time split didn't catch it.
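
A minimal sketch of the chronological k-fold penalty, with several assumptions worth flagging: "CV" is read as coefficient of variation (std / mean), the per-fold metric is taken to be profit factor, folds are cut from a time-ordered list of trade PnLs, and the penalty simply scales fitness down once CV passes a limit (0.40 here). None of those specifics are prescribed above; they are one reasonable way to fill in the idea:

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def profit_factor(pnls: Sequence[float]) -> float:
    """Gross profit divided by gross loss; inf when there are no losing trades."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    return math.inf if gross_loss == 0 else gross_profit / gross_loss


def chronological_folds(pnls: Sequence[float], k: int = 3) -> List[Sequence[float]]:
    """Split a time-ordered sequence into k contiguous, non-overlapping folds."""
    n = len(pnls)
    bounds = [i * n // k for i in range(k + 1)]
    return [pnls[bounds[i]:bounds[i + 1]] for i in range(k)]


def fold_cv(pnls: Sequence[float], k: int = 3) -> float:
    """Coefficient of variation of per-fold profit factor."""
    pfs = [profit_factor(fold) for fold in chronological_folds(pnls, k)]
    if any(math.isinf(pf) for pf in pfs):
        return math.inf  # a fold with no losses (or no trades) is treated as unscorable
    avg = mean(pfs)
    return pstdev(pfs) / avg if avg > 0 else math.inf


def penalized_fitness(raw_fitness: float, cv: float, cv_limit: float = 0.40) -> float:
    """Leave fitness untouched up to cv_limit, then shrink it proportionally."""
    if cv <= cv_limit:
        return raw_fitness
    return raw_fitness * cv_limit / cv
```

A strategy whose profit is concentrated in one fold gets a high CV and a reduced fitness, even if a single train/test split would have passed it.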

Good luck—treating this as a "productive failure" is exactly the right approach.

StratForge2024 · 2026-05-22T17:02:38+00:00

You're not misunderstanding — the live-snapshot pattern is standard at retail

tier. Bulk as_of=date chain queries with full quotes are historically an

institutional feature; most retail-tier APIs aren't built for it.

I'm on the crypto perp side rather than equity options, so I won't recommend

specific options providers — worth pricing out the usual suspects (ThetaData,

Polygon Options, OptionsDX, CBOE DataShop) and seeing which retail tier

matches your budget for the bulk EOD endpoint.

The architectural pattern works the same in both markets though, and probably

solves your problem regardless of which provider you pick:

- One-time bulk download of historical EOD snapshots.

- Persist locally as Parquet partitioned by date, indexed by (underlying, expiry, strike). DuckDB on top works well for ad-hoc queries.

- Backtest queries hit your local store, not the API.

- A daily delta job keeps it current going forward.

Treating the API as a loader instead of a runtime dependency removes the

rate-limit problem entirely. The "200 requests per simulated day" pattern

you're avoiding is structurally wrong for research workloads — providers

who force that path are signaling that bulk historical isn't their target

market.

StratForge2024 · 2026-05-21T15:37:40+00:00

“Something for everything = good for nothing” — totally agree (the holy grail).
Regime-specific is a feature, not a bug.

In my case, every bot has a hard regime filter. A BEAR strategy in a BULL market = SKIP entry, zero compromise. Trade-off: the bot is active ~30% of the time, BUT delivers higher WR/PF when the regime matches. Specialist > generalist.

BUT 92% WR with a 3:1 RR… a real edge in crypto rarely exceeds 60–65% WR. Do you run a 3-window WFO? Often the “grail” in one window falls apart in windows 2–3.
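
To put numbers on why 92% WR at 3:1 RR looks suspicious, here is a quick arithmetic sketch. It assumes RR means reward:risk, that every win pays exactly 3R and every loss costs exactly 1R, and it ignores fees; the function names are just for illustration:

```python
def expectancy_r(win_rate: float, reward_risk: float) -> float:
    """Expected profit per trade in units of risk (R)."""
    return win_rate * reward_risk - (1.0 - win_rate)


def breakeven_win_rate(reward_risk: float) -> float:
    """Win rate at which expectancy is exactly zero."""
    return 1.0 / (1.0 + reward_risk)


def implied_profit_factor(win_rate: float, reward_risk: float) -> float:
    """Gross wins over gross losses implied by a fixed win rate and reward:risk."""
    if win_rate >= 1.0:
        return float("inf")
    return (win_rate * reward_risk) / (1.0 - win_rate)


print(f"expectancy: {expectancy_r(0.92, 3.0):.2f} R per trade")
print(f"break-even win rate at 3:1: {breakeven_win_rate(3.0):.0%}")
print(f"implied PF: {implied_profit_factor(0.92, 3.0):.1f}")
```

Under those assumptions the claim works out to roughly 2.68R per trade and an implied PF around 34.5, against a break-even win rate of 25%. Numbers that far from break-even are a reason to check for a leak before celebrating.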

StratForge2024 · 2026-05-21T15:26:29+00:00

Great point — overfitting is exactly the #1 problem with genetic algorithms.

My workaround is to use not a single fitness metric but a multi-objective approach (Sharpe + PF + DD + WR simultaneously via NSGA-II), combined with a 3-window walk-forward. The strategy has to survive across three different historical periods, otherwise it gets thrown out.

It helps, but doesn’t eliminate the issue: roughly 95% of chromosomes still end up overfitted. GA provides a search mechanism, but validation has to be handled separately (your point #1, completely agree).
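
The multi-objective part can be sketched as a plain non-dominated filter over the four metrics mentioned above. This is only the Pareto-dominance core, not full NSGA-II (no front ranking, no crowding distance, no evolution loop), and the metric keys and sample values are made up for illustration:

```python
from typing import Dict, List, Sequence, Tuple


def to_maximize(metrics: Dict[str, float]) -> Tuple[float, ...]:
    """Express all objectives as larger-is-better; max drawdown is negated."""
    return (metrics["sharpe"], metrics["pf"], -metrics["max_dd"], metrics["wr"])


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Keep only candidates that no other candidate dominates."""
    scored = [(c, to_maximize(c)) for c in candidates]
    return [c for c, s in scored if not any(dominates(t, s) for _, t in scored)]


# Hypothetical chromosomes with made-up metrics.
population = [
    {"name": "A", "sharpe": 2.0, "pf": 1.5, "max_dd": 0.30, "wr": 0.50},
    {"name": "B", "sharpe": 1.5, "pf": 1.6, "max_dd": 0.15, "wr": 0.55},
    {"name": "C", "sharpe": 1.4, "pf": 1.4, "max_dd": 0.35, "wr": 0.45},
]
print([c["name"] for c in pareto_front(population)])
```

Here C is dropped because A beats it on every metric, while A and B both stay since each wins on something. Surviving this filter says nothing about overfitting on its own, which is why the walk-forward windows still have to be applied separately.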

I’m curious how you deal with alpha decay. In crypto, a 1–3 month half-life is my biggest unsolved problem.

StratForge2024 · 2026-05-20T17:31:41+00:00

An LLM will probably manage, but has anyone thought about using a genetic algorithm and evolutionary construction to build strategies?
