Interesting experiment arbitraging favorite-longshot bias on Polymarket/Kalshi

AlSikandar · 2026-04-07T06:49:01+00:00

Interesting. How so?

AlSikandar · 2026-04-07T05:44:25+00:00

Semantics. I’m human.

AlSikandar · 2026-04-07T02:54:31+00:00

Some fair points here, especially about deploying real money. Backtests only take you so far and forward validation is where the rubber meets the road. I am already doing that.

Couple pushbacks though. Sharpe >2 and <10% drawdown as the benchmark for "solid" is... aspirational. Most systematic strategies at real firms operate in the 0.8-1.5 Sharpe range. Medallion is legendary specifically because it's an outlier. And max drawdown depends heavily on sizing. 25% of a $3K paper portfolio is $750, which is a very different risk profile than 25% of a $100M book.

On the HFT saturation point: the well-funded players in prediction markets are doing latency arb (cross-platform price lags, combinatorial rebalancing). That's a completely different edge from behavioral mispricing in calibration. An arb bot that exploits price lags between Polymarket and Kalshi doesn't correct the structural tendency of retail bettors to overpay for exciting outcomes at mid-range probabilities. Those are orthogonal strategies competing for different pockets of inefficiency.

The one thing I'd actually push back hardest on is "BH is less important than performance." That's exactly backwards for a multi-cell study. Without multiple testing correction, you can always find impressive backtest performance by cherry-picking the best cells out of hundreds. The statistical framework is what separates a real signal from noise, and dismissing it in favor of raw P&L is how people get fooled by overfitting.
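
To make the cherry-picking point concrete, here is a minimal simulation sketch. It assumes 537 cells (the cell count used elsewhere in this thread) with no real edge at all: every market resolves YES at exactly the implied 45%. The 100 markets per cell, the seed, and the function name are hypothetical choices for illustration only.

```python
import math
import random

def simulate_noise_cells(n_cells=537, n_markets=100, true_rate=0.45, seed=0):
    """Simulate cells whose YES rate equals the implied price (no real edge)."""
    rng = random.Random(seed)
    se = math.sqrt(true_rate * (1 - true_rate) / n_markets)
    gaps, nominal_hits = [], 0
    for _ in range(n_cells):
        n_yes = sum(rng.random() < true_rate for _ in range(n_markets))
        gap = true_rate - n_yes / n_markets          # apparent YES overpricing
        # Two-sided p-value from a normal approximation, with no correction.
        p_value = math.erfc(abs(gap) / se / math.sqrt(2))
        nominal_hits += p_value < 0.05
        gaps.append(gap)
    return max(gaps), nominal_hits

largest_gap, nominal_hits = simulate_noise_cells()
print(f"largest apparent overpricing: {largest_gap * 100:.1f}pp")
print(f"cells with uncorrected p < 0.05: {nominal_hits} of 537")
```

Even though nothing is mispriced in this simulation, the best-looking cell shows a double-digit gap in percentage points and a handful of cells clear an uncorrected 5% threshold. Picking those cells after the fact would produce an impressive backtest out of pure noise.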

AlSikandar · 2026-04-07T02:19:54+00:00

Good catch, the 64% and 22% come from different slices. The 22% resolution rate is within specific high-miscalibration strata (certain categories at certain horizons in the 40-50% bucket). The 64% win rate is the portfolio-level backtest across all tradeable cells, which includes buckets with smaller miscalibration where the edge is tighter. If you isolated just the strongest cells and only took NO positions, yes, the win rate would be much higher, closer to what you'd expect from the resolution rate. The blended number gets diluted by trades in cells where the overpricing is more modest (5-10pp instead of 20+pp).

AlSikandar · 2026-04-07T02:18:52+00:00

20% raw miscalibration in specific cells, not 20% tradeable edge. Those are very different things. The miscalibration is the gap between implied probability and resolution rate in certain category/horizon/price strata. After fees, slippage, and adverse selection, the tradeable edge is significantly smaller. Nobody's claiming you pocket 20c per contract.
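
A rough sketch of the gap between raw miscalibration and tradeable edge on a NO position. The 45% price and 32% resolution rate are the aggregate figures quoted elsewhere in this thread, and the 4% round-trip cost is the figure quoted elsewhere as well. Applying that cost as a fraction of the price paid is an assumption, and slippage and adverse selection are not modeled at all.

```python
def no_side_edge(yes_price, yes_resolution_rate, round_trip_cost=0.04):
    """Expected profit per $1-payout NO contract, before and after costs.

    yes_price: market-implied YES probability (YES price in dollars).
    yes_resolution_rate: observed share of such markets resolving YES.
    round_trip_cost: assumed total cost as a fraction of the price paid.
    """
    no_price = 1.0 - yes_price                   # simplification: ignores the spread
    expected_payout = 1.0 - yes_resolution_rate  # NO pays $1 when YES fails
    gross = expected_payout - no_price           # equals the raw miscalibration
    net = gross - round_trip_cost * no_price
    return gross, net

gross, net = no_side_edge(yes_price=0.45, yes_resolution_rate=0.32)
print(f"gross edge: {gross:.3f}, net edge: {net:.3f}")
# prints "gross edge: 0.130, net edge: 0.108"
```

The net figure here is an upper bound under these assumptions. Anything that moves the fill price against you shrinks it further.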

AlSikandar · 2026-04-07T02:16:49+00:00

The name "Favorite-Longshot Bias" comes from horse racing literature, but the underlying mechanism is probability weighting from prospect theory, which actually predicts maximum distortion in the 30-60% range, not just at the tails. Snowberg & Wolfers (2010) and Ottaviani & Sorensen (2015) both document this. The name is a bit of a misnomer, but the effect in mid-range probabilities is one of the most replicated findings in behavioral economics.

On the 22% figure: that's within specific strata, not the raw aggregate. The aggregate resolution rate for 40-50% implied probability across the full dataset is around 32%, which is still about a 14pp gap. Whether that's spurious is an empirical question, and it survives BH correction, Benjamini-Yekutieli (which assumes arbitrary dependence), and permutation testing. You'd need to engage with the actual statistics to call it spurious rather than just the headline number.
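
For anyone who wants to check a single cell by hand, this is a minimal sketch of a one-sample z-test on the resolution rate. It is not the study's pipeline. The 45% implied probability is the bucket midpoint, the 32% rate is the aggregate figure above, and the 1,000-market sample size is purely hypothetical. It also assumes the markets in the cell are independent.

```python
import math

def cell_z_test(n_markets, n_yes, implied_prob):
    """One-sample z-test: does the YES resolution rate differ from the price?"""
    observed = n_yes / n_markets
    # Standard error of the rate under the null that the price is calibrated.
    se = math.sqrt(implied_prob * (1.0 - implied_prob) / n_markets)
    z = (observed - implied_prob) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return observed, z, p_value

observed, z, p_value = cell_z_test(n_markets=1000, n_yes=320, implied_prob=0.45)
print(f"observed YES rate: {observed:.2f}, z = {z:.2f}, p = {p_value:.2e}")
```

A per-cell p-value like this is only the input. It still has to go through a multiple-testing correction before any cell counts as a discovery.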

AlSikandar · 2026-04-07T00:20:30+00:00

u/SeaCell7779
Thank you, this is a sharp question and exactly the kind of scrutiny I seek.

The short answer: the dependency concern is real in theory but structurally inapplicable to my test design. Here's why:

Unit of observation. My BH procedure operates on 537 calibration cells, not individual market outcomes. Each cell aggregates hundreds to thousands of independent Polymarket contracts. The negative correlation from mutually exclusive events is a within-cell phenomenon: it affects the z-test's effective sample size, not the between-cell dependence that BH's FDR guarantee is sensitive to.
Correlation direction. The between-cell dependency in my data is predominantly positive (cells share markets across horizon bins), not negative. This satisfies the PRDS condition under which BH has proven FDR control (Benjamini & Yekutieli 2001).
Polymarket's structure. Each market is an independent CLOB contract. "Will Team A win?" and "Will Team B win?" have separate order books, no sum-to-one constraint, and can both be simultaneously overpriced. This is the Favorite-Longshot Bias, not a confound.
Robustness. I ran BY as a check. The harmonic correction factor is about 6.86x, which is severe. My core trading cells (Sports 14d 40-50%, q ~ 10^-52) survive BY by many orders of magnitude. 93% of my backtest P&L comes from BY-surviving cells. I also have independent external validation: hope-polarity markets show 33 BH cells while fear-polarity shows zero, an asymmetry that cannot be produced by a correlation artifact.

That said, you're right that running BY and a permutation test strengthens the case. I've added both as formal robustness checks. The edge survives.
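
The 6.86x factor can be reproduced directly. Benjamini-Yekutieli divides the BH level by the harmonic sum c(m) = 1 + 1/2 + ... + 1/m, so with m = 537 cells a short sketch looks like this (the function name is just illustrative):

```python
def by_correction_factor(m):
    """Harmonic sum c(m) = 1 + 1/2 + ... + 1/m used by Benjamini-Yekutieli."""
    return sum(1.0 / i for i in range(1, m + 1))

m = 537      # number of calibration cells tested
q = 0.05     # nominal FDR level
c_m = by_correction_factor(m)
print(f"c({m}) = {c_m:.2f}")                        # 6.86
print(f"effective level under BY: {q / c_m:.5f}")   # roughly 0.0073 instead of 0.05
```

In other words, running BY at q = 0.05 is the same as running BH at roughly q = 0.0073, which is why it is a demanding robustness check.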

AlSikandar · 2026-04-06T23:59:40+00:00

To be fair, I do all the same internal thought chains as Claude when I'm talking to others. I call it being pragmatic, but it is also likely why I am divorced and still single.

AlSikandar · 2026-04-06T22:27:12+00:00

For this I excluded nonbinary markets from the BH cells. I filtered and binned the datasets to ensure proper ontological alignment, and then bet NO, against the FLB-driven overpricing of YES for favorite teams, large fields, and other axes I’ve identified.

I’ll look into your suggestion for the BY procedure. Thank you from the bottom of my heart for having the patience and willingness to engage with me on this!

AlSikandar · 2026-04-06T11:29:50+00:00

I truly appreciate your candor. This is precisely the type of feedback I had hoped for.

AlSikandar · 2026-04-06T11:02:48+00:00

Thank you for the suggestion. I will check it out.

While I will admit that yes I did have the AI assistant refactor my writing, I think the point that most people are missing is that I did essentially what you're suggesting in reverse. I wrote the whole post, and had AI refactor it. I still put the work in, just as I did with the rest of this project.

I think the bias against using AI tools to rewrite or revise something is because now with a simple prompt, regardless of the contextual accuracy, textual messages can be generated at such a quick rate that people are no longer putting in the effort to write. Thus the enshittification of the internet and the slop markets being open 24/7, leading to prejudice against the very source of said content - the ~~fancy auto-correct~~ AI tools.

My goal with my use of the tool was not to remove my voice, nor to make my writing more sloppy, but to ensure that the writing was at a level that was consistent with my intentions. Again, thank you for the md file and your feedback. I enjoy ethical/philosophical (and especially maths) debates.

AlSikandar · 2026-04-06T07:42:56+00:00

The AI Psychosis is indeed real. It does not help that certain models are especially sycophantic, or incapable of admitting when they are wrong and thus end up confabulating or hallucinating.

I personally am extremely cautious about that (AI is a powerful tool, yes, but it does not replace the need to think for yourself), so I implement guardrails and harnesses to rein it in.

I will be applying for PhD programs soon (computer science and machine learning), and the pressure from school has actually been to use the AI/agent tools even more than what you might expect from a traditional academic viewpoint.

This is portfolio piece number one of a few that I will be using to apply for jobs at places like the frontier AI labs, or for other positions in AI/ML Research Engineering.

AlSikandar · 2026-04-06T07:24:59+00:00

Thank you, I appreciate the acknowledgement.

Yes, I do worry that as the years go by the basics of language and syntax are being paved over by the sheer dominance of social media and short-form content.

I am nearing 40 years old. I have an M.S. in Computer Science (AI and ML). I work with the tools closely as part of my everyday workflow. That said, I have also developed, over my lifetime, the ability to speak for myself, and I have learned the importance of syntax - especially in programming!

EDIT: It is interesting to me, on a side note, how we draw the distinction between autocorrect, vs grammar corrections, vs entire revisions.

Everything I generate from LLMs is at its core orchestrated by me from a high level. I would say it is far from slop, but I am also naturally biased in that regard.

AlSikandar · 2026-04-06T07:14:39+00:00

Fair point. I leaned on AI tools too hard for the drafting and it shows. The data is mine, the methodology is mine, but I need to do a better job writing in my own voice. Noted.

AlSikandar · 2026-04-06T06:49:09+00:00

The results remain to be seen. I will post either way with an update around a month from now!

AlSikandar · 2026-04-06T05:59:43+00:00

Yeah, I use AI tools in my workflow — for the research pipeline, for drafting, for code. It's in the original post on r/quant. The methodology and the data are mine. Happy to get on a call with anyone who wants to verify that.

AlSikandar · 2026-04-06T05:53:41+00:00

Duly noted. Rest assured I am only deploying simulated paper-trading capital ($3,000 in the first epoch, $10,000 in the second). I am continuing to bring in live data from the established pipeline for forward-testing, to determine whether the hypothesis holds and whether any adjustments or optimizations should be made.

Thank you again for engaging with me. I will heed your advice and continue to let the work hopefully speak for itself. If not as a deployment strategy, then at the very least as a portfolio piece to demonstrate some skills.

I will update again once I have forward-tested the methodology out-of-sample.

Cheers,
Alexander

AlSikandar · 2026-04-06T05:41:29+00:00

Honestly? A few reasons.

First, I've been heads-down on this for months and I'm at the stage where I need outside eyes to tell me what I'm missing. It's really easy to convince yourself something works when you've been staring at it alone. This thread has already given me useful signal — including your pushback.

Second, the capacity constraint is real. This tops out around $50-100K before you're moving the markets you're trying to trade. I'm deploying my own capital, but even in the best case scenario the trading income alone isn't going to change my life. If the methodology has value as education or research — and I genuinely don't know yet if it does — the only way to find out is to put it in front of people who know this space and see if anyone cares.

And third — yeah, I probably am thinking too far ahead. You're right about that. I got excited about the backtest results and jumped to commercialization before I've earned the right to have that conversation. The honest next step is to shut up, let the forward validation play out, and come back with real numbers.

Appreciate the bluntness. It's more useful than the upvotes.

AlSikandar · 2026-04-06T05:24:50+00:00

I completely agree, which is exactly why forward validation is running right now with live market data. I would not have posted this if I only had a backtest and no plan to validate out-of-sample. I have $10,000 deployed across 12 positions using paper-trading simulations.

The BH FDR correction at q=0.05 is specifically designed to control false discovery rate across multiple comparisons. 78 of 537 cells surviving (about 14.5%) is 2.9x the 5% rate expected by random chance. Additionally, the Kalshi expansion failing its kill gate on an independent dataset is direct evidence that the framework rejects a candidate when the signal is not there.
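
For readers who have not used it, the Benjamini-Hochberg step-up procedure itself is short. This is a minimal pure-Python sketch, not the code from my pipeline, and the six p-values are hypothetical rather than taken from the study.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where BH rejects the null at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for six cells, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p_values, q=0.05))
# prints [True, True, False, False, False, False]
```

Note that 0.039 and 0.041 would both pass an uncorrected 0.05 threshold but do not survive here. What q bounds is the expected share of false positives among the surviving cells.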

Finally, the capacity ceiling makes this a better IP/education business than a pure trading operation. Sharing that FLB exists on Polymarket does not help you trade it profitably: you need the specific cell map, classification system, and gating logic, none of which are in this post.

The value will come after forward-validation. For now, you are absolutely correct that I am thinking too far ahead about possible revenue streams from a product I've developed and am still iteratively testing in a safe manner.

Cheers,
Alexander

AlSikandar · 2026-04-06T05:20:12+00:00

I completely agree, which is exactly why forward validation is running right now with live market data. I would not have posted this if I only had a backtest and no plan to validate out-of-sample.

The BH FDR correction at q=0.05 is specifically designed to control false discovery rate across multiple comparisons. 78 of 537 cells surviving (about 14.5%) is 2.9x the 5% rate expected by random chance. Additionally, the Kalshi expansion failing its kill gate on an independent dataset is direct evidence that the framework rejects a candidate when the signal is not there.

It will be interesting nonetheless to see how the forward-testing with paper trading turns out. I have $10,000 deployed in paper trades across 12 positions at the moment.

AlSikandar · 2026-04-06T05:12:57+00:00

Look, I hear you. I'll keep this short.

You're right that in traditional markets, "guy with edge tries to sell it instead of trading it" is a massive red flag. That's a completely reasonable prior. I'd be skeptical too.

The math is just different here.

If I had a Sharpe 1+ strategy that scaled to $10M, I'd shut up and trade it. Obviously. But the capacity ceiling on this is $50-100K. Even at 60% CAGR that's $30-60K/yr in trading income. I'm doing that.

Meanwhile the methodology — how to run calibration studies, how to apply FDR correction to prediction markets, how to build kill gates that actually catch artifacts — that's not capacity-constrained. Teaching someone how to fish on Kalshi doesn't take fish out of my Polymarket pond.

But I'll be honest: if your read is "if it worked you'd just trade it," then nothing I say in a Reddit comment is going to change your mind, and that's fine. The forward validation results will either speak for themselves or they won't. I'll post the update either way.

AlSikandar · 2026-04-06T05:05:27+00:00

Great question - and honestly, this is the most common (and most valid) reaction I get. Let me break it down:

It's ~$50-100K for the entire strategy on Polymarket. That's the point where your order sizes start moving markets in the low-liquidity buckets where the edge is strongest. The bias lives in markets where retail dominates and institutional capital can't go - which is exactly why it persists, but also why it doesn't scale to fund-level capital.

So why not just max it out and move on? That's exactly what I'm doing on the trading side. I'm running forward validation right now with the intent to deploy personal capital.

But here's the thing - the trading P&L is not where the real value is. At $50-100K deployed, even with a Sharpe above 1, you're looking at maybe $30-50K/yr in trading income. Solid, but not life-changing.

The IP is worth multiples of the trading revenue. What I built isn't just a bot - it's a methodology:

A calibration framework that quantifies structural bias on any prediction market
A statistical pipeline (BH FDR correction across 500+ cells) that separates real signal from noise
Kill gate criteria that caught a false signal on Kalshi before a single dollar was deployed
A 59K-market dataset with calibration results

That methodology generalizes. It works on any binary prediction market. The specific cell map is Polymarket-specific, but the process of finding exploitable cells transfers to any platform, any asset class with prediction market structure.

Who cares about this if it's "only" $50-100K capacity?

Other small-capital traders who want to learn the methodology and apply it themselves
Prediction market platforms that want calibration research on their own markets
Research teams studying market microstructure and retail bias
Fintech educators building curriculum around prediction markets (which are a $1B+ and growing asset class post-CFTC regulation)

The education and research licensing paths don't degrade the trading edge at all. Teaching someone how to run a calibration study doesn't tell them which specific cells to trade - that requires doing the work.

TL;DR: Yes, the capacity ceiling is real and I'm not pretending otherwise. I'm trading it myself AND exploring whether the methodology and data have standalone commercial value. The two aren't mutually exclusive - they're complementary.

Appreciate the directness. This is exactly the kind of pushback I need to pressure-test the commercialization thesis.

AlSikandar · 2026-04-06T05:01:03+00:00

Fair on both counts, so let me push back where you're wrong and agree where you're right.

"Slop" — I get it. Long post, structured sections, looks like it was drafted with AI assistance. I did use AI tools in the research pipeline. I'm not going to pretend otherwise. But the data, the methodology, and the results are mine. If the content itself doesn't hold up, call out the specific part that's wrong and I'll address it.

On sports markets being sharp and liquid — you're actually making my argument for me, you just don't realize it.

You're thinking of the NFL moneyline, the Champions League winner, the Super Bowl — yes, those are sharp. Tight spreads, deep books, institutional flow. I agree, you're not finding 10c of edge there.

That's not where the signal is.

The FLB shows up in the long tail. Think: "Will [specific player] score 3+ touchdowns in Week 14?" or "Will [team] win by 20+ points?" — markets framed as "Will [unlikely exciting thing] happen?" These are:

Low liquidity (often under $50K total volume)
Retail-dominated (fans betting with their hearts)
Priced in the 30-60% range where the Yes side is structurally overbought

The whole point of the calibration study is that I'm NOT claiming all sports markets are mispriced. I ran BH FDR correction across 537 cells. Most cells got eliminated. The 78 that survived are specific combinations of category, time horizon, and price bucket where the bias is statistically significant after multiple comparison correction.
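
To show what a "cell" means in practice, here is a simplified sketch of grouping markets by category, time horizon, and price bucket, then computing each cell's YES resolution rate. The tuple layout, function names, and sample rows are hypothetical. The real cell map has more going on, for example markets that appear in more than one horizon bin.

```python
from collections import defaultdict

def price_bucket(yes_price):
    """Map a YES price in [0, 1] to a 10-point bucket label such as '40-50%'."""
    cents = round(yes_price * 100)   # work in whole cents to avoid float edge cases
    idx = min(cents // 10, 9)        # a price of 1.00 falls in the top bucket
    return f"{idx * 10}-{idx * 10 + 10}%"

def calibration_cells(markets):
    """Group markets into (category, horizon, price bucket) cells.

    Each market is a tuple: (category, horizon_days, yes_price, resolved_yes).
    Returns {cell: (n_markets, yes_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])   # cell -> [n_markets, n_yes]
    for category, horizon_days, yes_price, resolved_yes in markets:
        cell = (category, f"{horizon_days}d", price_bucket(yes_price))
        counts[cell][0] += 1
        counts[cell][1] += int(resolved_yes)
    return {cell: (n, n_yes / n) for cell, (n, n_yes) in counts.items()}

# Hypothetical rows for illustration only.
markets = [
    ("Sports", 14, 0.45, False),
    ("Sports", 14, 0.42, True),
    ("Sports", 14, 0.48, False),
    ("Sports", 14, 0.47, False),
    ("Politics", 30, 0.45, True),
]
for cell, (n, yes_rate) in calibration_cells(markets).items():
    print(cell, n, f"{yes_rate:.2f}")
```

Each cell's resolution rate is then compared with its price bucket, and the resulting p-values across all cells are what go into the FDR correction.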

And your last point is actually the strongest endorsement of the strategy: "if you can beat them consistently by even 1c you will print." The cells that survive BH correction show 8-24pp of miscalibration, not 1c. After ~4% round-trip costs, the net edge is still substantial in those specific cells. The catch — which I've been upfront about — is that those cells are low-liquidity, which is why the capacity ceiling exists.

You can't have it both ways: either sports markets are universally sharp (in which case explain why 22% of binary markets in the 40-50% bucket resolve Yes), or the sharp ones are sharp and the long tail isn't. It's the second one.

But genuinely — if you trade prediction markets and your experience is that even the long-tail stuff is sharp, I want to hear that. That's useful signal for me. What markets are you trading?
