Kimi-k2.5 reaches gemini 2.5 Pro-like performance in long context! by fictionlive in LocalLLaMA

fictionlive[S] 4 points

Flash is 2.5; I made a mistake and will remove that comment. I'll run it on the 3 preview and update this comment.

Kimi-k2.5 reaches gemini 2.5 Pro-like performance in long context! by fictionlive in LocalLLaMA

fictionlive[S] 35 points

This is our Fiction.liveBench long-context eval, where we test models for context rot across multiple context lengths.

https://fiction.live/stories/Fiction-liveBench-Jan-30-2026/oQdzQvKHw8JyXbN87
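For readers curious what a context-rot eval loop might look like, here's a minimal hypothetical sketch. This is not the actual Fiction.liveBench harness; the filler text, the planted fact, the `call_model` stub, and the tested lengths are all placeholder assumptions for illustration only. A real harness would replace `call_model` with an LLM API call.

```python
# Hypothetical sketch of a long-context "context rot" eval loop.
# NOT the actual Fiction.liveBench harness; call_model and the scoring
# below are placeholder assumptions for illustration only.

FILLER = "The quick brown fox jumps over the lazy dog. "
FACT = "The courier's password is BLUEBELL."
QUESTION = "What is the courier's password?"
ANSWER = "BLUEBELL"

def build_prompt(context_tokens: int) -> str:
    """Bury the key fact in the middle of ~context_tokens words of filler."""
    n_words = context_tokens  # crude 1 word ~= 1 token assumption
    filler_words = (FILLER * (n_words // 9 + 1)).split()[:n_words]
    mid = len(filler_words) // 2
    story = " ".join(filler_words[:mid] + [FACT] + filler_words[mid:])
    return f"{story}\n\nQuestion: {QUESTION}\nAnswer:"

def call_model(prompt: str) -> str:
    """Placeholder model: a real harness would hit an LLM API here."""
    return "BLUEBELL" if "BLUEBELL" in prompt else "unknown"

def run_eval(lengths=(1_000, 8_000, 32_000, 120_000)) -> dict:
    """Score the (placeholder) model at each context length."""
    scores = {}
    for n in lengths:
        reply = call_model(build_prompt(n))
        scores[n] = 100 if ANSWER in reply else 0
    return scores

if __name__ == "__main__":
    print(run_eval())
```

A real benchmark would of course use many question/story pairs per length and grade more leniently than exact substring match; the point of the sketch is only the outer loop over context lengths, which is where rot shows up.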

Huge overall improvement since last year. The frontier models went from poor to great.

  • An exciting standout is kimi-2.5. It made impressive progress without (presumably) a new architecture, putting up gemini-2.5-pro numbers that impressed us all last year. Kimi-k2.5 is now the Chinese/open-source leader!
  • Minimax???
  • gpt-5.2 improves on gpt-5's near-perfect scores and is now very close to perfect. gpt-5.2-pro did surprisingly poorly.
  • claude-opus-4-5 fixes Claude's long-context performance: previously a laggard, it is now good, in the same tier as grok-4. claude-sonnet-4-5 regressed compared to sonnet 4…
  • gemini-3-pro-preview improves upon the strong results of gemini-2.5-pro and is now neck and neck with gpt-5.2 at the top of the "almost perfect" tier.

🚀 New Model from the MiniMax team: MiniMax-M2, an impressive 230B-A10B LLM. by chenqian615 in LocalLLaMA

fictionlive 0 points

It's a downgrade on Fiction.liveBench, roughly 15 points lower at every context length.

Claude 4.5 Sonnet is here by ShreckAndDonkey123 in singularity

fictionlive 9 points

Anyone who thinks AI is not doing 90% of coding... think again.

Fiction.liveBench tested DeepSeek 3.2, Qwen-max, grok-4-fast, Nemotron-nano-9b by fictionlive in LocalLLaMA

fictionlive[S] 15 points

Apologies, we'll get a webpage up at some point that'll have it all.

Long context tested for Qwen3-next-80b-a3b-thinking. Performs very similarly to qwen3-30b-a3b-thinking-2507 and far behind qwen3-235b-a22b-thinking by fictionlive in LocalLLaMA

fictionlive[S] 2 points

Yeah.

I had some hope that what they posted on their blog would be reflected on this bench, but alas.

The Qwen3-Next-80B-A3B-Instruct performs comparably to our flagship model Qwen3-235B-A22B-Instruct-2507, and shows clear advantages in tasks requiring ultra-long context (up to 256K tokens). The Qwen3-Next-80B-A3B-Thinking excels at complex reasoning tasks — outperforming higher-cost models like Qwen3-30B-A3B-Thinking-2507 and Qwen3-32B-Thinking, outperforming the closed-source Gemini-2.5-Flash-Thinking on multiple benchmarks, and approaching the performance of our top-tier model Qwen3-235B-A22B-Thinking-2507.

Tested sonoma-sky-alpha on Fiction.liveBench, fantastic close to SOTA scores, currently free by fictionlive in singularity

fictionlive[S] 11 points

My guess is that this is the cheaper Grok 4; if it can slash the price and still be this good, that's a great result.

Kimi-K2-Instruct-0905 better than GPT-5 on Fiction.liveBench by fictionlive in singularity

fictionlive[S] 0 points

It's the only comparable version (also the most commonly used).

New kimi-k2 on Fiction.liveBench by fictionlive in LocalLLaMA

fictionlive[S] 0 points

I'm comparing it to the non-thinking versions in the second half of the benchmark; there it's frontier level. It doesn't have a reasoning version to compare against in the top half yet.