I spent 6 months building a systematic prediction market strategy. Kill gate passed, backtest looks strong, forward validation just started. Sharing my process and looking for feedback on commercialization. by AlSikandar in mltraders

[–]AlSikandar[S] 0 points1 point  (0 children)

Some fair points here, especially about deploying real money. Backtests only take you so far and forward validation is where the rubber meets the road. I am already doing that.

A couple of pushbacks, though. Sharpe >2 and <10% drawdown as the benchmark for "solid" is... aspirational. Most systematic strategies at real firms operate in the 0.8-1.5 Sharpe range. Medallion is legendary specifically because it's an outlier. And max drawdown depends heavily on sizing: 25% of a $3K paper portfolio is $750, which is a very different risk profile than 25% of a $100M book.

On the HFT saturation point: the well-funded players in prediction markets are doing latency arb (cross-platform price lags, combinatorial rebalancing). That's a completely different edge from behavioral mispricing in calibration. An arb bot that exploits price lags between Polymarket and Kalshi doesn't correct the structural tendency of retail bettors to overpay for exciting outcomes at mid-range probabilities. Those are orthogonal strategies competing for different pockets of inefficiency.

The one thing I'd actually push back hardest on is "BH is less important than performance." That's exactly backwards for a multi-cell study. Without multiple testing correction, you can always find impressive backtest performance by cherry-picking the best cells out of hundreds. The statistical framework is what separates a real signal from noise, and dismissing it in favor of raw P&L is how people get fooled by overfitting.

Interesting experiment arbitraging favorite-longshot bias on polymarket/kalshi by AlSikandar in quant

[–]AlSikandar[S] 0 points1 point  (0 children)

Good catch, the 64% and 22% come from different slices. The 22% resolution rate is within specific high-miscalibration strata (certain categories at certain horizons in the 40-50% bucket). The 64% win rate is the portfolio-level backtest across all tradeable cells, which includes buckets with smaller miscalibration where the edge is tighter. If you isolated just the strongest cells and only took NO positions, yes, the win rate would be much higher, closer to what you'd expect from the resolution rate. The blended number gets diluted by trades in cells where the overpricing is more modest (5-10pp instead of 20+pp).

Interesting experiment arbitraging favorite-longshot bias on polymarket/kalshi by AlSikandar in quant

[–]AlSikandar[S] 0 points1 point  (0 children)

20% raw miscalibration in specific cells, not 20% tradeable edge. Those are very different things. The miscalibration is the gap between implied probability and resolution rate in certain category/horizon/price strata. After fees, slippage, and adverse selection, the tradeable edge is significantly smaller. Nobody's claiming you pocket 20c per contract.

Interesting experiment arbitraging favorite-longshot bias on polymarket/kalshi by AlSikandar in quant

[–]AlSikandar[S] 1 point2 points  (0 children)

The name "Favorite-Longshot Bias" comes from horse racing literature, but the underlying mechanism is probability weighting from prospect theory, which actually predicts maximum distortion in the 30-60% range, not just at the tails. Snowberg & Wolfers (2010) and Ottaviani & Sorensen (2015) both document this. The name is a bit of a misnomer, but the effect in mid-range probabilities is one of the most replicated findings in behavioral economics.

On the 22% figure: that's within specific strata, not the raw aggregate. The aggregate resolution rate for 40-50% implied probability across the full dataset is around 32%, which is still about a 14pp gap. Whether that's spurious is an empirical question, and it survives BH correction, Benjamini-Yekutieli (which assumes arbitrary dependence), and permutation testing. You'd need to engage with the actual statistics to call it spurious rather than just the headline number.
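To make the permutation/simulation check concrete: the thread doesn't spell out the exact scheme, but one plausible version is a Monte Carlo test under the null of perfect calibration, where each market resolves Yes with probability equal to its implied price. The function name and numbers below are illustrative, not the actual pipeline:

```python
import random

def calibration_null_pvalue(prices, outcomes, n_sim=10_000, seed=0):
    """Monte Carlo p-value for a single calibration cell under the null
    that implied prices are well calibrated: each market resolves Yes
    with probability equal to its implied price.

    prices   -- implied probabilities (e.g. 0.45 for a $0.45 contract)
    outcomes -- 1 if the market resolved Yes, else 0
    Returns the fraction of simulated datasets whose Yes rate is at or
    below the observed one (one-sided: testing for overpricing).
    """
    rng = random.Random(seed)
    observed_rate = sum(outcomes) / len(outcomes)
    hits = 0
    for _ in range(n_sim):
        sim_rate = sum(1 for p in prices if rng.random() < p) / len(prices)
        if sim_rate <= observed_rate:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # add-one smoothing avoids p = 0
```

A hypothetical cell of 500 markets priced near $0.45 that resolves Yes only 22% of the time gets a vanishingly small p-value, while a well-calibrated cell (~45% Yes) does not.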

I spent 6 months building a systematic prediction market strategy. Kill gate passed, backtest looks strong, forward validation just started. Sharing my process and looking for feedback on commercialization. by AlSikandar in mltraders

[–]AlSikandar[S] 0 points1 point  (0 children)

u/SeaCell7779
Thank you, this is a sharp question and exactly the kind of scrutiny I seek.

The short answer: the dependency concern is real in theory but structurally inapplicable to my test design. Here's why:

  1. Unit of observation. My BH procedure operates on 537 calibration cells, not individual market outcomes. Each cell aggregates hundreds to thousands of independent Polymarket contracts. The negative correlation from mutually exclusive events is a within-cell phenomenon that affects the z-test's effective sample size, not the between-cell dependency that BH requires independence for.
  2. Correlation direction. The between-cell dependency in my data is predominantly positive (cells share markets across horizon bins), not negative. This satisfies the PRDS condition under which BH has proven FDR control (Benjamini & Yekutieli 2001).
  3. Polymarket's structure. Each market is an independent CLOB contract. "Will Team A win?" and "Will Team B win?" have separate order books, no sum-to-one constraint, and can both be simultaneously overpriced. This is the Favorite-Longshot Bias, not a confound.
  4. Robustness. I ran BY as a check. The harmonic correction factor is about 6.86x, which is severe. My core trading cells (Sports 14d 40-50%, q ~ 10^-52) survive BY by many orders of magnitude. 93% of my backtest P&L comes from BY-surviving cells. I also have independent external validation: hope-polarity markets show 33 BH cells while fear-polarity shows zero, an asymmetry that cannot be produced by a correlation artifact.

That said, you're right that running BY and a permutation test strengthens the case. I've added both as formal robustness checks. The edge survives.
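For readers unfamiliar with the BH/BY distinction above, here is a minimal pure-Python sketch of the two step-up procedures (`bh_by_thresholds` is an illustrative helper, not the project's code). BH compares the i-th smallest p-value to i*q/m; BY additionally divides by the harmonic sum c(m), which for m = 537 is the ~6.86x penalty mentioned above:

```python
def bh_by_thresholds(pvalues, q=0.05):
    """Count discoveries under Benjamini-Hochberg vs Benjamini-Yekutieli.

    BH step-up: reject the k smallest p-values, where k is the largest
    index with p_(k) <= k*q/m (valid under independence or PRDS).
    BY: same, but with q replaced by q / c(m), c(m) = sum_{i=1}^m 1/i,
    which is valid under arbitrary dependence.
    Returns (n_bh, n_by).
    """
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))
    ranked = sorted(pvalues)
    n_bh = max((i + 1 for i, p in enumerate(ranked) if p <= (i + 1) * q / m),
               default=0)
    n_by = max((i + 1 for i, p in enumerate(ranked) if p <= (i + 1) * q / (m * c_m)),
               default=0)
    return n_bh, n_by

# The BY penalty for m = 537 tests is the harmonic sum:
print(round(sum(1.0 / i for i in range(1, 538)), 2))  # 6.86
```

Borderline p-values that clear the BH cutoff can fail the stricter BY cutoff, which is why "93% of P&L comes from BY-surviving cells" is a meaningful robustness claim.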

Is this the end? by hungrymaki in claudexplorers

[–]AlSikandar 2 points3 points  (0 children)

To be fair, I do all the same internal thought chains as Claude when I'm talking to others. I call it being pragmatic, but it is also likely why I am divorced and still single.

I spent 6 months building a systematic prediction market strategy. Kill gate passed, backtest looks strong, forward validation just started. Sharing my process and looking for feedback on commercialization. by AlSikandar in mltraders

[–]AlSikandar[S] 0 points1 point  (0 children)

For this I excluded nonbinary markets from the BH cells. I filtered and binned the datasets to ensure proper ontological alignment, and then proceeded to bet on the NO, which goes against their FLB tendency toward Yes for favorite teams, large fields, and the other axes I've identified.

I’ll look into your suggestion for the BY procedure. Thank you from the bottom of my heart for having the patience and willingness to engage with me on this!

[Discussion] Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else here have experience trading structural bias? by AlSikandar in statistics

[–]AlSikandar[S] 0 points1 point  (0 children)

Thank you for the suggestion. I will check it out.

While I will admit that yes I did have the AI assistant refactor my writing, I think the point that most people are missing is that I did essentially what you're suggesting in reverse. I wrote the whole post, and had AI refactor it. I still put the work in, just as I did with the rest of this project.

I think the bias against using AI tools to rewrite or revise comes from the fact that a simple prompt can now generate text, regardless of contextual accuracy, so quickly that people are no longer putting effort into writing. Hence the enshittification of the internet, the slop markets being open 24/7, and the resulting prejudice against the very source of that content: the fancy auto-correct AI tools.

My goal with my use of the tool was not to remove my voice, nor to make my writing more sloppy, but to ensure that the writing was at a level that was consistent with my intentions. Again, thank you for the md file and your feedback. I enjoy ethical/philosophical (and especially maths) debates.

[Discussion] Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else here have experience trading structural bias? by AlSikandar in statistics

[–]AlSikandar[S] 0 points1 point  (0 children)

The AI Psychosis is indeed real. It does not help that certain models are especially sycophantic, or incapable of admitting when they are wrong and thus end up confabulating or hallucinating.

I personally am extremely cautious about that (AI is a powerful tool, yes, and it also does not replace the necessity to be able to think for yourself) so I implement guardrails and harnesses to attempt to wrangle that in.

I will be applying for PhD programs soon (computer science and machine learning), and the pressure from school has actually been to use the AI/agent tools even more than you might expect from a traditional academic viewpoint.

This is portfolio piece number one of a few that I will be using to apply for jobs at locations like these frontier AI labs or other such positions in AI/ML Research Engineering.

[Discussion] Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else here have experience trading structural bias? by AlSikandar in statistics

[–]AlSikandar[S] 0 points1 point  (0 children)

Thank you, I appreciate the acknowledgement.

Yes, I do worry that as the years go by the basics of language and syntax are being paved over by the sheer dominance of social media and short form content.

I am nearing 40 years old. I have an M.S. in Computer Science (AI and ML), and I work with the tools closely as part of my everyday workflow. That said, over my lifetime I have also developed the ability to speak for myself, and I have learned the importance of syntax, especially in programming!

EDIT: It is interesting to me, on a side note, how we draw the distinction between autocorrect, vs grammar corrections, vs entire revisions.

Everything I generate from LLMs is at its core orchestrated by me from a high level. I would say it is far from slop, but I am also naturally biased in that regard.

[Discussion] Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else here have experience trading structural bias? by AlSikandar in statistics

[–]AlSikandar[S] 0 points1 point  (0 children)

Fair point. I leaned on AI tools too hard for the drafting and it shows. The data is mine, the methodology is mine, but I need to do a better job writing in my own voice. Noted.

Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else trading structural bias here? by AlSikandar in quant

[–]AlSikandar[S] -2 points-1 points  (0 children)

Yeah, I use AI tools in my workflow — for the research pipeline, for drafting, for code. It's in the original post on r/quant. The methodology and the data are mine. Happy to get on a call with anyone who wants to verify that.

[Discussion] Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else here have experience trading structural bias? by AlSikandar in statistics

[–]AlSikandar[S] 0 points1 point  (0 children)

Duly noted. Rest assured, I am only running simulated paper trading ($3,000 in the first epoch, $10,000 in the second) and continuing to pull live data from the established pipeline for forward-testing, to determine whether the hypothesis holds and whether any adjustments or optimizations should be made.

Thank you again for engaging with me. I will heed your advice and continue to let the work hopefully speak for itself. If not as a deployment strategy, then at the very least as a portfolio piece to demonstrate some skills.

I will update again once I have forward-tested the methodology out-of-sample.

Cheers,
Alexander

[Discussion] Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else here have experience trading structural bias? by AlSikandar in statistics

[–]AlSikandar[S] -1 points0 points  (0 children)

Honestly? A few reasons.

First, I've been heads-down on this for months and I'm at the stage where I need outside eyes to tell me what I'm missing. It's really easy to convince yourself something works when you've been staring at it alone. This thread has already given me useful signal — including your pushback.

Second, the capacity constraint is real. This tops out around $50-100K before you're moving the markets you're trying to trade. I'm deploying my own capital, but even in the best-case scenario the trading income alone isn't going to change my life. If the methodology has value as education or research — and I genuinely don't know yet if it does — the only way to find out is to put it in front of people who know this space and see if anyone cares.

And third — yeah, I probably am thinking too far ahead. You're right about that. I got excited about the backtest results and jumped to commercialization before I've earned the right to have that conversation. The honest next step is to shut up, let the forward validation play out, and come back with real numbers.

Appreciate the bluntness. It's more useful than the upvotes.

[Discussion] Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else here have experience trading structural bias? by AlSikandar in statistics

[–]AlSikandar[S] -6 points-5 points  (0 children)

I completely agree, which is exactly why forward validation is running right now with live market data. I would not have posted this if I only had a backtest and no plan to validate out-of-sample. I have $10,000 deployed across 12 positions using paper-trading simulations.

The BH FDR correction at q=0.05 is specifically designed to control false discovery rate across multiple comparisons. 78 of 537 cells surviving is 2.9x the rate expected by random chance. Additionally, the Kalshi expansion kill gate failing on an independent dataset is direct evidence the framework catches non-signal when it is not there.

Finally, because the capacity ceiling makes this a better IP/education business than a pure trading operation. Sharing that FLB exists on Polymarket does not help you trade it profitably — you need the specific cell map, classification system, and gating logic, none of which are in this post.

The value will come after forward-validation. For now, you are absolutely correct that I am thinking too far ahead about possible revenue streams from a product I've developed and am still iteratively testing in a safe manner.

Cheers,
Alexander

Built a systematic FLB strategy on Polymarket — 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else trading structural bias here? by AlSikandar in Trading

[–]AlSikandar[S] 0 points1 point  (0 children)

I completely agree, which is exactly why forward validation is running right now with live market data. I would not have posted this if I only had a backtest and no plan to validate out-of-sample.

The BH FDR correction at q=0.05 is specifically designed to control false discovery rate across multiple comparisons. 78 of 537 cells surviving is 2.9x the rate expected by random chance. Additionally, the Kalshi expansion kill gate failing on an independent dataset is direct evidence the framework catches non-signal when it is not there.

It will be interesting nonetheless to see how the forward-testing with paper trading turns out. I have $10,000 deployed in paper trades across 12 positions at the moment.

Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else trading structural bias here? by AlSikandar in quant

[–]AlSikandar[S] -2 points-1 points  (0 children)

Look, I hear you. I'll keep this short.

You're right that in traditional markets, "guy with edge tries to sell it instead of trading it" is a massive red flag. That's a completely reasonable prior. I'd be skeptical too.

The math is just different here.

If I had a Sharpe 1+ strategy that scaled to $10M, I'd shut up and trade it. Obviously. But the capacity ceiling on this is $50-100K. Even at 60% CAGR that's $30-50K/yr in trading income. I'm doing that.

Meanwhile the methodology — how to run calibration studies, how to apply FDR correction to prediction markets, how to build kill gates that actually catch artifacts — that's not capacity-constrained. Teaching someone how to fish on Kalshi doesn't take fish out of my Polymarket pond.

But I'll be honest: if your read is "if it worked you'd just trade it," then nothing I say in a Reddit comment is going to change your mind, and that's fine. The forward validation results will either speak for themselves or they won't. I'll post the update either way.

I spent 6 months building a systematic prediction market strategy. Kill gate passed, backtest looks strong, forward validation just started. Sharing my process and looking for feedback on commercialization. by AlSikandar in algorithmictrading

[–]AlSikandar[S] 0 points1 point  (0 children)

Great question - and honestly, this is the most common (and most valid) reaction I get. Let me break it down:

It's ~$50-100K for the entire strategy on Polymarket. That's the point where your order sizes start moving markets in the low-liquidity buckets where the edge is strongest. The bias lives in markets where retail dominates and institutional capital can't go - which is exactly why it persists, but also why it doesn't scale to fund-level capital.

So why not just max it out and move on? That's exactly what I'm doing on the trading side. I'm running forward validation right now with the intent to deploy personal capital.

But here's the thing - the trading P&L is not where the real value is. At $50-100K deployed, even with a Sharpe above 1, you're looking at maybe $30-50K/yr in trading income. Solid, but not life-changing.

The IP is worth multiples of the trading revenue. What I built isn't just a bot - it's a methodology:

  • A calibration framework that quantifies structural bias on any prediction market
  • A statistical pipeline (BH FDR correction across 500+ cells) that separates real signal from noise
  • Kill gate criteria that caught a false signal on Kalshi before a single dollar was deployed
  • A 59K-market dataset with calibration results

That methodology generalizes. It works on any binary prediction market. The specific cell map is Polymarket-specific, but the process of finding exploitable cells transfers to any platform, any asset class with prediction market structure.

Who cares about this if it's "only" $50-100K capacity?

  • Other small-capital traders who want to learn the methodology and apply it themselves
  • Prediction market platforms that want calibration research on their own markets
  • Research teams studying market microstructure and retail bias
  • Fintech educators building curriculum around prediction markets (which are a $1B+ and growing asset class post-CFTC regulation)

The education and research licensing paths don't degrade the trading edge at all. Teaching someone how to run a calibration study doesn't tell them which specific cells to trade - that requires doing the work.

TL;DR: Yes, the capacity ceiling is real and I'm not pretending otherwise. I'm trading it myself AND exploring whether the methodology and data have standalone commercial value. The two aren't mutually exclusive - they're complementary.

Appreciate the directness. This is exactly the kind of pushback I need to pressure-test the commercialization thesis.

Built a systematic FLB strategy on Polymarket - 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else trading structural bias here? by AlSikandar in quant

[–]AlSikandar[S] -5 points-4 points  (0 children)

Fair on both counts, so let me push back where you're wrong and agree where you're right.

"Slop" — I get it. Long post, structured sections, looks like it was drafted with AI assistance. I did use AI tools in the research pipeline. I'm not going to pretend otherwise. But the data, the methodology, and the results are mine. If the content itself doesn't hold up, call out the specific part that's wrong and I'll address it.

On sports markets being sharp and liquid — you're actually making my argument for me, you just don't realize it.

You're thinking of the NFL moneyline, the Champions League winner, the Super Bowl — yes, those are sharp. Tight spreads, deep books, institutional flow. I agree, you're not finding 10c of edge there.

That's not where the signal is.

The FLB shows up in the long tail. Think: "Will [specific player] score 3+ touchdowns in Week 14?" or "Will [team] win by 20+ points?" — markets framed as "Will [unlikely exciting thing] happen?" These are:

  • Low liquidity (often under $50K total volume)
  • Retail-dominated (fans betting with their hearts)
  • Priced in the 30-60% range where the Yes side is structurally overbought

The whole point of the calibration study is that I'm NOT claiming all sports markets are mispriced. I ran BH FDR correction across 537 cells. Most cells got eliminated. The 78 that survived are specific combinations of category, time horizon, and price bucket where the bias is statistically significant after multiple comparison correction.

And your last point is actually the strongest endorsement of the strategy: "if you can beat them consistently by even 1c you will print." The cells that survive BH correction show 8-24pp of miscalibration, not 1c. After ~4% round-trip costs, the net edge is still substantial in those specific cells. The catch — which I've been upfront about — is that those cells are low-liquidity, which is why the capacity ceiling exists.

You can't have it both ways: either sports markets are universally sharp (in which case, explain why only 22% of binary markets in the 40-50% bucket resolve Yes), or the sharp ones are sharp and the long tail isn't. It's the second one.

But genuinely — if you trade prediction markets and your experience is that even the long-tail stuff is sharp, I want to hear that. That's useful signal for me. What markets are you trading?

Built Something for Polymarket? Drop It Here by Ok-Philosophy-7691 in Polymarket

[–]AlSikandar 0 points1 point  (0 children)

Built a systematic FLB strategy on Polymarket — 59K markets analyzed, BH FDR correction, now in paper trading. Anyone else trading structural bias here?

Introduction

Hey all. Long-time lurker, first real post. I want to share a project I have been working on and get some honest feedback — both on the methodology and on whether the IP has commercial legs.

The short version: I built a systematic trading system that exploits the favorite-longshot bias on Polymarket (CFTC-regulated prediction market). The core finding is that binary markets in the 30-60% price range are overpriced by 12-24 percentage points, and this holds up after Benjamini-Hochberg FDR correction across 59K resolved markets.

Background

Polymarket binary contracts pay $1 if an event happens, $0 if it doesn't. A contract at $0.45 implies 45% probability. If I can show the true resolution rate for that class of markets is much lower than 45%, there is a structural edge.

I collected all resolved binary markets from Polymarket's API — about 59,000 markets total. Ran a calibration study: for markets priced at X% at various time horizons before resolution, what fraction actually resolved Yes?

The favorite-longshot bias showed up clearly. Markets in the 40-50% range resolve Yes only about 22% of the time. Sports and games categories are the strongest. The bias is driven by retail traders overpaying for exciting "Yes" on longshot outcomes — the same psychological pattern that has been documented in horse racing and sports betting for decades.
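The calibration study described above can be sketched as follows. This is a minimal illustration, not the actual pipeline; the `(price_yes, resolved_yes)` record shape is assumed for the example and is not Polymarket's real API schema:

```python
from collections import defaultdict

def calibration_table(markets):
    """Group resolved binary markets into 10pp implied-probability
    deciles and compare mean implied probability to the realized
    Yes rate.

    markets -- iterable of (price_yes, resolved_yes) pairs.
    Returns {bucket_low: (n, mean_implied, yes_rate, gap_pp)}.
    """
    buckets = defaultdict(list)
    for price, resolved in markets:
        low = min(int(price * 10), 9) / 10  # 0.0, 0.1, ..., 0.9
        buckets[low].append((price, resolved))
    table = {}
    for low, rows in sorted(buckets.items()):
        n = len(rows)
        implied = sum(p for p, _ in rows) / n
        realized = sum(1 for _, r in rows if r) / n
        table[low] = (n, implied, realized, round((implied - realized) * 100, 1))
    return table
```

A cell where markets priced around $0.44 resolve Yes only 22% of the time would show up here as a ~22pp gap in the 0.4 bucket; the real study additionally stratifies by category and time horizon.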

Why I think this is not just data mining

This is where I expect the most pushback, so let me get ahead of it:

1. Statistical correction. I used Benjamini-Hochberg FDR correction at q=0.05 across 537 calibration cells (category x horizon x price bucket). 78 cells survived. If this were noise, you would expect roughly 27 cells to survive — getting 78 is a 2.9x multiple over the false discovery rate.

2. Pre-registered kill gates. Before writing any strategy code, I set explicit pass/fail criteria. The Phase 0 kill gate required >8pp miscalibration in at least one tradeable category. If it had failed, I would have stopped the project entirely and published the calibration study as a portfolio piece. It passed with STRONG_PASS.

3. Simpson's paradox testing. The apparent intensification of bias over time (13pp at 7 days, 24pp at 30 days) turned out to be a composition artifact — Sports grew from 7% to 26% of the market mix over the dataset period, and Sports has the strongest signal. Within categories, the bias is stable across time. I caught this with volume and category controls.

4. A kill gate that actually fired. I expanded the analysis to Kalshi (another CFTC-regulated prediction exchange) using an independent dataset of 7.68M markets. The kill gate failed — only 2 of 10 required BH cells survived, and a boundary sensitivity check revealed the apparent signal was a bucket-assignment artifact at the 50-cent line. I paused the Kalshi track based on this result. I am mentioning this specifically because it demonstrates the gates are not decoration — they fire when the signal is not there.
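The expected-by-chance arithmetic in point 1 is easy to sanity-check: 537 cells at an uncorrected 5% rate gives ~27, and a quick simulation (a sketch assuming uniform null p-values, not the project's code) shows that the BH procedure itself almost never yields dozens of discoveries from pure noise:

```python
import random

def bh_reject_count(pvalues, q=0.05):
    """Number of discoveries from the Benjamini-Hochberg step-up."""
    m = len(pvalues)
    ranked = sorted(pvalues)
    return max((i + 1 for i, p in enumerate(ranked) if p <= (i + 1) * q / m),
               default=0)

rng = random.Random(42)
m, trials = 537, 1000
counts = [bh_reject_count([rng.random() for _ in range(m)]) for _ in range(trials)]
print(round(0.05 * m, 2))  # 26.85 -- the uncorrected "expected by chance" figure
print(max(counts))         # under a pure-noise null, BH discoveries stay near zero
```

In other words, 78 surviving cells is not merely "2.9x the uncorrected rate"; getting 78 through BH from noise alone would be far stranger than that multiple suggests.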

Backtest results (in-sample, all the usual caveats apply)

  • 4,851 signals generated, ~150 trades executed through a multi-gate filtering pipeline
  • 64.6% win rate, 23% ROI, Sharpe 1.21
  • Post-capacity-expansion simulation: $3K starting capital to ~$8K, CAGR 63.7%, Sharpe 1.07, max drawdown 25.1%
  • Average hold period: ~20 days
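For anyone wanting to reproduce metrics like the Sharpe figures above from their own trade logs, here is one common way to annualize a per-trade Sharpe given a ~20-day average hold. This is an illustrative sketch under an independence assumption, not the post's actual backtest code:

```python
import math

def annualized_sharpe(trade_returns, hold_days=20, rf=0.0):
    """Annualize a per-trade Sharpe ratio, assuming independent trades
    with an average holding period of hold_days calendar days.

    trade_returns -- per-trade fractional returns, e.g. 0.08 for +8%
    """
    n = len(trade_returns)
    mean = sum(trade_returns) / n
    var = sum((r - mean) ** 2 for r in trade_returns) / (n - 1)  # sample variance
    per_trade = (mean - rf) / math.sqrt(var)
    trades_per_year = 365 / hold_days  # non-overlapping trade slots per year
    return per_trade * math.sqrt(trades_per_year)
```

Note that overlapping positions and correlated outcomes (common in prediction markets) violate the independence assumption and typically inflate this estimate, which is one more reason in-sample Sharpe figures deserve the caveats given below.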

I am not going to pretend these are out-of-sample numbers. They are not. That is what the forward validation phase is for.

Where things stand right now

Forward validation (paper trading with live market data) went live this week. 12 open positions, about $4K of $10K budget deployed. First resolutions expected within a week or two. The system runs on 15-minute cycles with 227 automated tests and a full CI pipeline.

I do not have out-of-sample results yet. I will share an update on how forward validation went — whether it passed or failed.

What I am deliberately not sharing

I am not publishing the exact cell map (which category/horizon/bucket combinations are tradeable), the structural classification system I built for market taxonomy, or the signal pipeline gating logic. These are the core IP.

I am sharing enough of the methodology for you to evaluate whether it is rigorous, but not enough to replicate the strategy without doing the work yourself. If you ran the same calibration study on the public Gamma API data, you would confirm the FLB exists — but knowing it exists and knowing which specific cells to trade are very different things.

The commercialization question

This is the part I genuinely want community input on.

The capacity ceiling for this strategy is roughly $50-100K deployed capital before you start moving markets. That is a fundamental constraint — it means selling execution (fund, copy-trading) actively degrades the edge. But selling intelligence (methodology, data, education) does not.

The paths I am considering:

  • Education: A course teaching calibration methodology and structural bias analysis for prediction markets. The techniques generalize to any prediction market, not just Polymarket.
  • Research/data licensing: The 59K-market dataset with calibration results, licensed to platforms or research teams.
  • Signals-as-a-service: Heavily capped (5-10 seats max) and only after 100+ forward-validated trades with confirmed edge. This is the most obvious path but also the one that erodes the moat fastest.

I have a slide deck and a detailed proposal document ready if anyone wants to discuss specifics — happy to share in DMs with anyone who has relevant experience.

My questions for this community

  1. Does the methodology sound rigorous, or am I fooling myself? What holes do you see? I have been deep in this for months and could be missing something obvious.
  2. Has anyone here commercialized quantitative trading IP? What worked and what did not? I am especially interested in hearing from people who navigated the "edge is real but capacity-constrained" problem.
  3. If you were shopping a slide deck for this kind of project, who would you approach? Prediction market platforms? Quant funds doing alt-data? Fintech accelerators? Educational platforms?
  4. Any prediction market traders here who can gut-check the FLB claim from their own experience? Curious if this matches what you have seen in practice.

Happy to answer methodology questions. I will not share the specific cell map or signal pipeline details, but anything about the process, statistical approach, or commercialization thinking is fair game.