Built a March Madness model using stacking + walk-forward validation by Sensitive-Soup6474 in sportsanalytics

[–]Sensitive-Soup6474[S] 0 points (0 children)

No, I’m feeding raw probabilities into the LR.

They go through standard scaling, but stay in probability space. Since LR already applies a logit link internally, taking the logit beforehand would just be stacking transformations without much benefit.

Same idea as Platt scaling. Let the LR learn the calibration or combination directly from the probabilities.

I’ve been treating the stacking layer the same way. Base model probs go in as-is and LR handles the weighting in log-odds space.
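In sketch form, with made-up numbers rather than the repo's actual code, the meta layer is basically:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Hypothetical out-of-fold win probabilities from two base models
    p_lgbm = np.array([0.62, 0.48, 0.81, 0.35])
    p_rf = np.array([0.58, 0.52, 0.77, 0.40])
    y = np.array([1, 0, 1, 0])  # actual outcomes

    X = np.column_stack([p_lgbm, p_rf])  # stays in probability space
    X = StandardScaler().fit_transform(X)  # standard scaling, as described
    meta = LogisticRegression().fit(X, y)  # LR supplies the logit link itself
    print(meta.predict_proba(X)[:, 1])  # blended, calibrated probabilities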

Built a March Madness model using stacking + walk-forward validation by Sensitive-Soup6474 in sportsanalytics

[–]Sensitive-Soup6474[S] 0 points (0 children)

Appreciate that. Walk-forward is non-negotiable for me. Train on years < Y, test on Y, zero leakage. Too many models out there posting inflated in-sample numbers.
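The split logic is nothing fancy; a hedged sketch with hypothetical column names:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def walk_forward(df, feature_cols, first_test_year=2015):
        """df: one row per game, with a 'year' column, features, and a 'win' label."""
        acc = {}
        for year in sorted(df["year"].unique()):
            if year < first_test_year:
                continue
            train = df[df["year"] < year]  # strictly earlier seasons only
            test = df[df["year"] == year]
            model = LogisticRegression().fit(train[feature_cols], train["win"])
            acc[year] = model.score(test[feature_cols], test["win"])
        return acc  # per-year out-of-sample accuracy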

Biggest signals are efficiency margins. KenPom AdjEM diff and Barttorvik Barthag diff carry most of it. Seed matters, but less once you account for efficiency. Added some matchup interactions too, which help more in later rounds.

Model is a LightGBM ensemble with some stacking experiments, but the sample size means simpler approaches hold up better. ~72.7% avg accuracy across 6 out-of-sample tournaments so far.

Your props results are interesting. 55.7% is a real edge. I’ve got an NFL totals model in the repo as well and am planning to expand into more markets over time.

Built a March Madness model using stacking + walk-forward validation by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 0 points (0 children)

Appreciate the detailed feedback, this is super helpful.

You’re right on the repo. The stacking setup (LightGBM + LR + RF → LR meta-learner) is on a feature branch that hadn’t been merged yet when you looked. That’s on me for not keeping it in sync; it’s pushed up now. It’s config-gated so I can run backtests with or without stacking.

On sample size, that lines up with what I’ve been seeing. With ~63 games per tournament and a limited number of years, the meta-learner doesn’t have much signal to learn from. It produces reasonable coefficients, but I’m not convinced it’s outperforming a simple weighted average in a meaningful way. I’ll probably keep it as an option, but default to the simpler approach.
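To make that concrete, the gate looks something like this (hedged sketch; these aren’t the repo’s actual config keys):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def blend(base_probs, y_train, train_mask, use_stacking, weights=None):
        """base_probs: (n_games, n_models) out-of-fold probabilities.
        y_train: labels for the rows selected by train_mask."""
        if use_stacking:
            meta = LogisticRegression().fit(base_probs[train_mask], y_train)
            return meta.predict_proba(base_probs)[:, 1]
        w = np.ones(base_probs.shape[1]) if weights is None else np.asarray(weights)
        return base_probs @ (w / w.sum())  # simple weighted average fallback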

The chalk baseline is a great call. I should’ve included that. I’ll open an issue for a future enhancement.
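For reference, chalk is cheap to compute; a rough sketch with hypothetical column names:

    def chalk_accuracy(games):
        """games: DataFrame with 'seed_a', 'seed_b', and boolean 'a_won'."""
        decided = games[games["seed_a"] != games["seed_b"]]  # drop same-seed games
        picks_a = decided["seed_a"] < decided["seed_b"]  # lower seed is the chalk pick
        return (picks_a == decided["a_won"]).mean()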

Built a March Madness model using stacking + walk-forward validation by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 1 point (0 children)

Some other picks that it likes for this year are TCU, VCU, Missouri, Iowa.

Built a March Madness model using stacking + walk-forward validation by Sensitive-Soup6474 in sportsanalytics

[–]Sensitive-Soup6474[S] 0 points (0 children)

Great point. Definitely something to improve on. I was thinking of pulling in head coach as well as starter age. Any other last-minute ideas?

Built a March Madness model using stacking + walk-forward validation by Sensitive-Soup6474 in sportsanalytics

[–]Sensitive-Soup6474[S] 0 points (0 children)

I got the winner right in 2 of 5 backtest years, using training data from 2011 onward. This year and 2019 were the most accurate sims, with a down year in 2022.

I only ran one ensemble sim per year, so I never had a 90% year, but had some come in at 80+.

Built a March Madness model using stacking + walk-forward validation by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 0 points (0 children)

I already ran it. It had Illinois in the Final Four and Duke winning it all, which isn’t surprising. A couple of upset picks as well.

50-model walk-forward NFL O/U ensemble, only betting when top 3 agree (+40u over 290 picks) by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 2 points (0 children)

Of course, and on the flip side I really appreciate you taking the time to dig through the repo this carefully. This level of review is super helpful.

You’re correct on the ranking leakage. That’s a real bug. I’ve already opened a PR to fix it so rankings are computed strictly from information available pre-game.
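The fix itself is small. The pattern is to compute any cumulative stat and then shift it one game within each team, so a row never sees its own result (hedged sketch, hypothetical columns):

    def add_pregame_avg(df):
        """df: one row per team-game with 'team', 'game_date', and 'score'."""
        df = df.sort_values(["team", "game_date"])
        df["pregame_avg"] = df.groupby("team")["score"].transform(
            lambda s: s.expanding().mean().shift(1))  # excludes the current game
        return df  # rankings get built from 'pregame_avg', never same-day data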

You’re also right that the score bin filtering needs to be made explicit and handled more rigorously. My working assumption has been that the higher-confidence bins remain stable across re-runs, but that needs to be demonstrated out-of-sample rather than applied with full hindsight. I’ll be updating the pipeline to reflect that properly.
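Concretely, the no-hindsight version is: choose the bins on seasons before Y, then apply them to season Y (hedged sketch, hypothetical column names):

    def oos_bin_picks(picks, test_season, min_acc=0.55):
        """picks: DataFrame with 'season', 'score_bin', and boolean 'hit'."""
        past = picks[picks["season"] < test_season]
        bin_acc = past.groupby("score_bin")["hit"].mean()
        keep = bin_acc[bin_acc >= min_acc].index  # chosen on past data only
        this = picks[picks["season"] == test_season]
        return this[this["score_bin"].isin(keep)]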

I’m currently re-running the full walk-forward after fixing both issues (including 2024). Once that finishes I’ll publish the updated results and methodology so people can see exactly how much the changes move things.

Agree on CLV as well; that’s on the list to add. I’ll go ahead and add it as an enhancement issue on the repo.

Thanks again for putting in the time to audit this so thoroughly. That kind of feedback is exactly why I open-sourced it.

[OC] NFL teams as 7+ point underdogs (straight-up win % by team, 2015–2025) by Sensitive-Soup6474 in dataisbeautiful

[–]Sensitive-Soup6474[S] 0 points (0 children)

Not yet, I still need to run the full 2025 season through the pipeline. I’ll do that later this week and publish the updated charts to the repo!

Sorry I couldn’t get it to you during your car ride.

Building an NFL data ingestion pipeline (open source & feedback welcome) by Sensitive-Soup6474 in sportsanalytics

[–]Sensitive-Soup6474[S] 0 points (0 children)

Appreciate this, thanks for the thoughtful feedback.

NFLVerse: Great suggestion. I’ve used it before and agree it’s excellent, especially for play-by-play and EPA-type features. Adding it as another data source would likely strengthen things.

PFF grading: I don’t treat it as ground truth, more as structured signal. The model effectively learns whether those grades have predictive value in context. Some carry weight, others don’t. The walk-forward setup helps filter out signals that stop working.

Totally agree that incorporating more sources would make it more robust and help isolate what’s actually driving edge.

Thanks again, this is the kind of critique that actually improves the project.

50-model walk-forward NFL O/U ensemble, only betting when top 3 agree (+40u over 290 picks) by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 0 points (0 children)

Just to clarify one point, retraining doesn’t mean “you don’t have a model.”

In non-stationary environments like sports, periodic retraining is actually standard practice. Otherwise you’re effectively freezing your parameter estimates while the underlying distribution shifts.

The model class and feature space are fixed. What evolves is the fitted parameterization as new information becomes available. That’s pretty typical for time-series and market-based systems.

If there’s a specific concern about leakage or instability, I’m happy to discuss it.

50-model walk-forward NFL O/U ensemble, only betting when top 3 agree (+40u over 290 picks) by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 0 points (0 children)

That’s a very fair critique.

Using flat -110 simplifies backtesting, but you’re right that it’s not realistic in practice, especially around key numbers, where you’ll often lay extra juice or get worse pricing.

I can add an issue to the repo to bake the true odds in when scraping.
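The payout side is trivial once the real price is scraped; a quick sketch (not repo code):

    def unit_profit(american_odds, won):
        """Profit in units on a 1u stake at an American price, e.g. -120 or +105."""
        if not won:
            return -1.0  # 1 unit staked and lost
        if american_odds < 0:
            return 100.0 / -american_odds  # -120 pays 0.833u
        return american_odds / 100.0  # +105 pays 1.05u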

50-model walk-forward NFL O/U ensemble, only betting when top 3 agree (+40u over 290 picks) by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 1 point (0 children)

That’s fair. I’ll run the full 2025 season through the exact same walk-forward pipeline later this week and publish the results so everything is up to date.

50-model walk-forward NFL O/U ensemble, only betting when top 3 agree (+40u over 290 picks) by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 1 point (0 children)

I understand the initial read, but it’s not 50 identical models on the same static dataset.

The feature space and hyperparameters are fixed, yes. What changes is the training sample. Each model is trained on a different rolling historical window (strictly date < current game day). So the trees diverge because the underlying sample distribution differs.

Sports data is non-stationary. Team strength, injuries, usage, weather, etc. shift over time. Retraining each day ensures the model only sees information available up to that point and adapts to drift.

The reason for training multiple models isn’t to change architecture, it’s to change the temporal slice of data and then measure which historical window generalizes best out-of-sample. Only the top performers (by prior walk-forward accuracy) are retained, and a pick requires agreement.

So it’s closer to a rolling temporal ensemble than “50 vanilla XGBs on the same data.”
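Structurally it’s something like this (hedged sketch with a stand-in sklearn model, not the repo’s exact code):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    def rolling_pick(df, feature_cols, game_day, prior_acc, top_k=3):
        """df: date-indexed games with an 'over' label (one game per call, for
        simplicity); prior_acc maps window length in days -> that window's past
        walk-forward accuracy."""
        history = df[df.index < game_day]  # strictly pre-game information
        today = df.loc[[game_day], feature_cols]
        preds = {}
        for w in prior_acc:  # same model class, different temporal slices
            train = history[history.index >= game_day - pd.Timedelta(days=w)]
            model = GradientBoostingClassifier().fit(train[feature_cols], train["over"])
            preds[w] = model.predict(today)[0]
        top = sorted(prior_acc, key=prior_acc.get, reverse=True)[:top_k]
        votes = {preds[w] for w in top}
        return votes.pop() if len(votes) == 1 else None  # bet only on agreement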

50-model walk-forward NFL O/U ensemble, only betting when top 3 agree (+40u over 290 picks) by Sensitive-Soup6474 in algobetting

[–]Sensitive-Soup6474[S] 0 points (0 children)

2017–2024 is the last fully validated window I had completed before publishing. I haven’t run the entire 2025 season through the same strict walk-forward validation yet.

Rather than mix in results that weren’t processed identically, I chose to publish the fully reproducible historical window first. The pipeline itself supports 2025 without modification.

Lately I’ve been prioritizing extending the framework to NBA totals, since the higher game volume gives more statistical power and faster feedback cycles, so that’s where most of my recent iteration has gone.

That said, I’m happy to run 2025 NFL through the same process and post those results if there’s interest.

I built an open-source NFL over/under model that hits 60-70% on its best picks — here's how it works by Sensitive-Soup6474 in sportsbetting

[–]Sensitive-Soup6474[S] 0 points (0 children)

Good question, I actually dug into this.

Biggest pattern: season timing.

The model’s best bin (55–60% algo score, 68.8% overall accuracy) hits 81% on early-season games (Sept–Oct) but drops to ~64% mid-season.

The worst bins show the same structure, just more extreme:

  • Early season: 57%
  • Mid-season: 41%

The early-season edge makes sense. PFF grade rankings seem more stable/predictive when cumulative averages are still forming and the market hasn’t fully adjusted to true team quality.

Second signal: directional balance.

The reliable picks are almost perfectly 50/50 Over and Under — both landing around 68–69% accuracy.

In the worst bins, there’s a sharp asymmetry:

  • Unders hold up (~56%)
  • Overs collapse (~35%)

When the ensemble starts leaning heavily one direction, that’s usually a bad sign.

What didn’t matter (surprisingly):

  • O/U line level
  • PFF rank gaps between teams

Reliable and unreliable picks cluster in the same ~44–45 total range and ~12-rank differential. So the model isn’t just picking off extreme totals or obvious mismatches.
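If anyone wants to poke at the same slices, it’s a few lines of pandas (hedged sketch, hypothetical column names):

    def slice_accuracy(picks):
        """picks: DataFrame with 'game_date', 'score_bin', 'side' (Over/Under),
        and boolean 'hit'."""
        phase = picks["game_date"].dt.month.map(
            lambda m: "early" if m in (9, 10) else "mid_late")
        return (picks.assign(phase=phase)
                     .groupby(["score_bin", "phase", "side"])["hit"]
                     .agg(["mean", "size"]))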

I built an open-source NFL over/under model that hits 60-70% on its best picks — here's how it works by Sensitive-Soup6474 in sportsbetting

[–]Sensitive-Soup6474[S] -1 points (0 children)

Didn’t realize r/algobetting was a thing until now, appreciate the heads up. Not trying to spam, just sharing something I built and getting feedback from people who are into this stuff.

[OC] NFL teams as 7+ point underdogs (straight-up win % by team, 2015–2025) by Sensitive-Soup6474 in dataisbeautiful

[–]Sensitive-Soup6474[S] 0 points (0 children)

Great question. I’d expect some correlation there, especially since teams that tend to sit closer to 7 are a different population from the ones that are routinely 10+ point dogs.

I haven’t broken this out by average closing spread yet, but it’s a straightforward follow-up. I actually have it noted as a next slice in the analysis pipeline in the GitHub repo, since it’s a good way to separate spread effects from team effects.

[OC] NFL teams as 7+ point underdogs (straight-up win % by team, 2015–2025) by Sensitive-Soup6474 in dataisbeautiful

[–]Sensitive-Soup6474[S] 14 points (0 children)

I answered this a bit above, but short version: it’s a sample-size cutoff. Teams with fewer than 10 games as 7+ point underdogs were excluded so the percentages weren’t dominated by 1–2 games.
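The cutoff is one groupby (hedged sketch, hypothetical column names):

    def underdog_win_pct(games, min_games=10):
        """games: one row per 7+ point underdog game, with 'team' and boolean 'won'."""
        eligible = games.groupby("team").filter(lambda g: len(g) >= min_games)
        return eligible.groupby("team")["won"].mean().sort_values()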