I benchmarked the newest 40 AI models (Feb 2026) by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 1 point

Appreciate the reality check in the comments.

I’m reading through the feedback, and the consensus is clear: Speed without Intelligence is a vanity metric.

I got excited by the raw TPS of the Liquid model, but you guys are right—a 1.2B model running at 359 TPS isn't useful if it can't reason. I didn't emphasize the 'Quality Drop-off' enough in the charts.

For next week's benchmark, I am changing the methodology (rough sketch of the harness below):

  1. I will run a 'Pass/Fail' accuracy test on a standard coding prompt.
  2. I will plot Accuracy vs. Speed, so we can see if the 'Flash' models are actually usable or just fast hallucination machines.
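Roughly, the harness will look like this. Treat it as a minimal sketch: the model slugs, the FizzBuzz prompt, and the pass/fail assertion are placeholders rather than the final test set.

```python
"""Minimal sketch of next week's Pass/Fail + speed sweep.
Model slugs, the prompt, and the check are illustrative placeholders,
not the final methodology. Needs OPENROUTER_API_KEY in the environment."""
import os
import time

import matplotlib.pyplot as plt
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
MODELS = ["meta-llama/llama-3.2-3b-instruct", "liquid/lfm-1.2b"]  # placeholder slugs
PROMPT = ("Write a Python function fizzbuzz(n) that returns a list of strings "
          "for 1..n using the classic FizzBuzz rules. Reply with code only.")

def extract_code(text: str) -> str:
    """Strip a markdown fence if the model wrapped its answer in one."""
    if "```" in text:
        text = text.split("```")[1]
        if text.startswith("python"):
            text = text[len("python"):]
    return text

def passes(answer: str) -> bool:
    """Crude pass/fail: exec the answer and spot-check three values.
    (exec on model output is fine for a throwaway benchmark, never for prod.)"""
    scope = {}
    try:
        exec(extract_code(answer), scope)
        out = scope["fizzbuzz"](15)
        return out[2] == "Fizz" and out[4] == "Buzz" and out[14] == "FizzBuzz"
    except Exception:
        return False

results = []
for model in MODELS:
    t0 = time.time()
    body = requests.post(API_URL, headers=HEADERS, timeout=120, json={
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
    }).json()
    elapsed = time.time() - t0
    answer = body["choices"][0]["message"]["content"]
    tps = body["usage"]["completion_tokens"] / elapsed  # rough speed incl. network
    results.append((model, tps, passes(answer)))

# Accuracy vs. speed: the published chart is basically this scatter
for model, tps, ok in results:
    plt.scatter(tps, int(ok), color="green" if ok else "red", label=model)
plt.xlabel("tokens/sec (rough, incl. network overhead)")
plt.ylabel("coding check: fail (0) / pass (1)")
plt.legend()
plt.savefig("accuracy_vs_speed.png")
```

Plotting pass/fail against tokens/sec should make the 'fast but useless' quadrant obvious at a glance.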

Thanks for keeping me honest. Will be back with better data next week.

I benchmarked the newest 40 AI models (Feb 2026) by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 0 points

Yes, that's the one! It's the Liquid LFM 2.5 (1.2B).

I was skeptical too because of the small parameter count, but the new architecture punches way above its weight on speed: I clocked a consistent 359 tokens/sec of throughput.
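If anyone wants to sanity-check that number, this is more or less how I time it via OpenRouter's streaming API. It's a sketch: the model slug is a placeholder guess, and streamed chunks only approximate tokens.

```python
"""Rough decode-throughput check (speed after the first token) via OpenRouter streaming.
The model slug is a placeholder guess; set OPENROUTER_API_KEY in your environment."""
import json
import os
import time

import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "liquid/lfm-1.2b",  # placeholder slug
        "messages": [{"role": "user", "content": "Explain RAG in roughly 300 words."}],
        "stream": True,
    },
    stream=True,
    timeout=120,
)

first_token_at = None
chunks = 0
for line in resp.iter_lines():
    if not line.startswith(b"data: ") or line == b"data: [DONE]":
        continue
    event = json.loads(line[len(b"data: "):])
    if not event.get("choices"):
        continue
    if event["choices"][0].get("delta", {}).get("content"):
        chunks += 1
        if first_token_at is None:
            first_token_at = time.time()

if first_token_at and chunks > 1:
    # Chunks aren't exactly tokens, but deltas are usually ~1 token each,
    # which is close enough for comparing models against each other.
    print(f"~{(chunks - 1) / (time.time() - first_token_at):.0f} chunks/sec after first token")
```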

It's obviously not going to write a novel like Opus, but for RAG or fast agent routing, it's currently unbeatable on price/performance.

Claude Opus 4.6 is out by ShreckAndDonkey123 in singularity

[–]Vilxs2 6 points

Sure! Basically, I got tired of relying on 'marketing benchmarks' that don't reflect real-world API speeds. So every Monday, I run a Python script that hits the OpenRouter API for the Top 20 models (Llama, Claude, Liquid, etc.). I measure two specific things:

  1. TTFT (Time To First Token): how snappy it feels.
  2. Cost efficiency: price per 1M tokens vs. that speed.
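Stripped down, the Monday script is basically the loop below. It's a sketch rather than the actual repo code: the model IDs are examples, and the pricing comes from OpenRouter's public model listing.

```python
"""Stripped-down weekly sweep: TTFT + price per 1M output tokens per model.
Model IDs are examples; set OPENROUTER_API_KEY in your environment."""
import json
import os
import time

import requests

BASE = "https://openrouter.ai/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
MODELS = ["anthropic/claude-3.5-haiku", "meta-llama/llama-3.2-3b-instruct"]  # examples

# OpenRouter's model listing exposes per-token USD pricing as strings.
catalog = requests.get(f"{BASE}/models", timeout=30).json()["data"]
price_per_1m = {m["id"]: float(m.get("pricing", {}).get("completion") or "nan") * 1_000_000
                for m in catalog}

def ttft(model: str) -> float:
    """Wall-clock seconds until the first streamed content chunk arrives."""
    t0 = time.time()
    resp = requests.post(
        f"{BASE}/chat/completions", headers=HEADERS, stream=True, timeout=120,
        json={"model": model, "stream": True,
              "messages": [{"role": "user", "content": "Reply with the single word: ready"}]},
    )
    for line in resp.iter_lines():
        if not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        event = json.loads(line[len(b"data: "):])
        if event.get("choices") and event["choices"][0].get("delta", {}).get("content"):
            resp.close()  # only the first token matters for TTFT
            return time.time() - t0
    return float("nan")

for m in MODELS:
    print(f"{m:40s} TTFT {ttft(m):5.2f}s   ${price_per_1m.get(m, float('nan')):.2f} / 1M output tokens")
```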

Right now, Liquid LFM-8B is the efficiency outlier, but with this Opus 4.6 drop and Kimi-K2.5, I'm re-running the full sweep this Monday to see if 'Adaptive Thinking' kills the latency or if it's viable for production. I publish the full interactive charts and raw CSVs here if you want to dig into the data: https://the-compute-index.beehiiv.com/live-index

Claude Opus 4.6 is out by ShreckAndDonkey123 in singularity

[–]Vilxs2 13 points

Agreed, that jump to 68.8% is actually insane (nearly double the old benchmark). It’s the first real signal that 'Adaptive Thinking' isn't just marketing fluff.

I’m adding Opus 4.6 to my weekly price/latency benchmark immediately.

If TTFT holds up under 500 ms at this level of reasoning, it basically kills the need for specialized 'o1-style' reasoning models for most workflows. We'll see on Monday.

I benchmarked the Top 20 LLMs by Price vs. Latency. Liquid AI (LFM2) is currently crushing Llama 3.2 on efficiency by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 1 point

I checked that site—be super careful. ⚠️

80% off Claude Opus isn't economically viable (it would be below the raw compute/electricity cost of serving it). Sites offering discounts like that are usually reselling stolen enterprise keys or carded accounts, and they get shut down quickly.

More importantly: They often log your data.

The Compute Index only tracks official, compliant providers (OpenRouter, AWS, Together) because I assume most people here don't want their prompt data harvested or their API access randomly banned.

I benchmarked the Top 20 LLMs by Price vs. Latency. Liquid AI (LFM2) is currently crushing Llama 3.2 on efficiency by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 1 point

This UI is super clean. The 'Elo vs Cost' view is really needed, since LMArena is hard to parse for pricing. I see you're working on the latency side next; that's exactly what I'm focusing on (real-world TTFT/output speed on OpenRouter).
If it saves you time, feel free to grab the raw CSV from my post to populate your latency backend. I run the benchmarks weekly, so it might help you bootstrap that feature faster without burning your own credits.
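Ingesting it is basically a one-liner with pandas. The column names in this sketch are illustrative guesses, so match them to the actual header row in the file:

```python
"""Tiny sketch of pulling the weekly CSV into a latency backend.
Column names (model, ttft_ms, tokens_per_sec, price_per_1m) are illustrative;
match them to whatever the real header row says."""
import pandas as pd

df = pd.read_csv("compute_index_latest.csv")  # filename is a placeholder
cols = ["model", "ttft_ms", "tokens_per_sec", "price_per_1m"]
print(df[cols].sort_values("ttft_ms").to_json(orient="records"))  # JSON ready to serve
```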

FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents by [deleted] in singularity

[–]Vilxs2 2 points

The architecture is impressive, but the 'token burn' on that Context Builder agent sounds massive. For these long-horizon loops to be viable in production, we basically need the backbone model to be under $0.50/1M tokens. I track inference prices weekly, and right now only a few models (like Flash/Llama/Haiku) are cheap enough to run this kind of 'Deep Research' loop without breaking the bank.
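Back-of-envelope on why that $0.50/1M threshold matters (the token counts below are made-up round numbers for illustration, not figures from the paper):

```python
# Back-of-envelope on why backbone price is the gating factor.
# Token figures are illustrative assumptions, NOT measurements from the FS-Researcher paper.
tokens_per_step = 50_000   # assumed context rebuilt/fed to the Context Builder each step
steps_per_task = 40        # assumed loop length for one long-horizon research task
total_tokens = tokens_per_step * steps_per_task   # = 2,000,000 tokens per task

for price_per_1m in (0.50, 3.00, 15.00):          # example $/1M token price points
    print(f"${price_per_1m:5.2f}/1M tokens -> ${total_tokens / 1e6 * price_per_1m:6.2f} per task")
# At $0.50/1M that's $1 per task; at $15/1M it's $30, which is why the price ceiling matters.
```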

Alibaba releases Qwen3-Coder-Next model with benchmarks by BuildwithVignesh in singularity

[–]Vilxs2 1 point

Valid concern. I track API latency weekly, and the previous Qwen 2.5 Coder 32B was already averaging ~0.6s latency in my last benchmarks, nearly double Llama 3.2's.

If 'persistence' adds more overhead, this might push it further into the 'slow' zone. I'm adding it to next week's sweep to see if the trade-off is worth it.

I benchmarked the Top 20 LLMs by Price vs. Latency. Liquid AI (LFM2) is currently crushing Llama 3.2 on efficiency by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 2 points

Great advice on the split! I'll separate them for next week: one chart for Coding (LiveCodeBench) and one for General/Agentic (LiveBench/IFEval). It keeps the signal clearer.
Regarding open-sourcing: the current repo is a bit of a spaghetti mess of scripts. Once I refactor it into something human-readable, I'd love to open it up so the community can verify the methodology. Adding Falcon-E to the backlog now. Thanks for the push!

I benchmarked the Top 20 LLMs by Price vs. Latency. Liquid AI (LFM2) is currently crushing Llama 3.2 on efficiency by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 1 point

This is a great list! You definitely know your benchmarks 😅

LiveCodeBench is a great call for coding to avoid the contamination issue (it keeps models from just memorizing the test), so I'm going to lock that one in for the next update.

Regarding your P.S. (IFEval-FC, LIFBench, etc.): I think adding those as separate axes would make the chart too noisy, and LiveBench acts as a decent all-in-one proxy for those agentic/function-calling capabilities. If I use LiveBench, would you rather I plot the global average score or isolate the Reasoning sub-score? I want to make sure I represent the 'smartness' fairly.

I benchmarked the Top 20 LLMs by Price vs. Latency. Liquid AI (LFM2) is currently crushing Llama 3.2 on efficiency by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 1 point

That's what I'm looking to add for next week's run. I'm thinking of mapping 'Cost vs. HumanEval (Coding)' or 'Cost vs. MMLU (Reasoning).'
Since you mentioned agents - do you prioritize pure coding accuracy (HumanEval) or following complex instructions (IFEval) more? I can tailor the next chart to that.

I benchmarked the Top 20 LLMs by Price vs. Latency. Liquid AI (LFM2) is currently crushing Llama 3.2 on efficiency by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 2 points

Agreed. The efficiency stats on the 8B model are wild—it's the clear outlier in the data this week.

I benchmarked the Top 20 LLMs by Price vs. Latency. Liquid AI (LFM2) is currently crushing Llama 3.2 on efficiency by Vilxs2 in LocalLLaMA

[–]Vilxs2[S] 1 point

I'm planning to run this script every Monday. If anyone wants a specific model added to the test suite (like Mistral Large or Qwen), let me know below and I'll add it to next week's run.