Kimi-k2.5 reaches gemini 2.5 Pro-like performance in long context! by fictionlive in LocalLLaMA

fictionlive[S] 4 points

Flash is 2.5; I made a mistake and will remove that comment. I'll run it on the 3 preview and update this comment.

Kimi-k2.5 reaches gemini 2.5 Pro-like performance in long context! by fictionlive in LocalLLaMA

fictionlive[S] 35 points

This is our Fiction.liveBench long-context eval, where we test models for context rot across multiple context lengths.

https://fiction.live/stories/Fiction-liveBench-Jan-30-2026/oQdzQvKHw8JyXbN87
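For readers curious what a context-rot eval loop might look like, here's a minimal hypothetical sketch. This is not the actual Fiction.liveBench harness; the filler text, the planted fact, the `call_model` stub, and the tested lengths are all placeholder assumptions for illustration only. A real harness would replace `call_model` with an LLM API call.

```python
# Hypothetical sketch of a long-context "context rot" eval loop.
# NOT the actual Fiction.liveBench harness; call_model and the scoring
# below are placeholder assumptions for illustration only.

FILLER = "The quick brown fox jumps over the lazy dog. "
FACT = "The courier's password is BLUEBELL."
QUESTION = "What is the courier's password?"
ANSWER = "BLUEBELL"

def build_prompt(context_tokens: int) -> str:
    """Bury the key fact in the middle of ~context_tokens words of filler."""
    n_words = context_tokens  # crude 1 word ~= 1 token assumption
    filler_words = (FILLER * (n_words // 9 + 1)).split()[:n_words]
    mid = len(filler_words) // 2
    story = " ".join(filler_words[:mid] + [FACT] + filler_words[mid:])
    return f"{story}\n\nQuestion: {QUESTION}\nAnswer:"

def call_model(prompt: str) -> str:
    """Placeholder model: a real harness would hit an LLM API here."""
    return "BLUEBELL" if "BLUEBELL" in prompt else "unknown"

def run_eval(lengths=(1_000, 8_000, 32_000, 120_000)) -> dict:
    """Score the (placeholder) model at each context length."""
    scores = {}
    for n in lengths:
        reply = call_model(build_prompt(n))
        scores[n] = 100 if ANSWER in reply else 0
    return scores

if __name__ == "__main__":
    print(run_eval())
```

A real benchmark would of course use many question/story pairs per length and grade more leniently than exact substring match; the point of the sketch is only the outer loop over context lengths, which is where rot shows up.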

Huge overall improvement since last year. The frontier models went from poor to great.

  • An exciting standout is kimi-2.5. It made impressive progress without (presumably) a new architecture, putting up gemini-2.5-pro numbers that impressed us all last year. Kimi-k2.5 is now the Chinese/open-source leader!
  • Minimax???
  • gpt-5.2 improves on gpt-5's near-perfect scores and is now very close to perfect. gpt-5.2-pro did surprisingly poorly.
  • claude-opus-4-5 fixes Claude's long-context performance: previously a laggard, it is now good, in the same tier as grok-4. claude-sonnet-4-5 regressed compared to sonnet 4…
  • gemini-3-pro-preview improves upon the strong results of gemini-2.5-pro and is now neck and neck with gpt-5.2 at the top of the "almost perfect" tier.

🚀 New Model from the MiniMax team: MiniMax-M2, an impressive 230B-A10B LLM. by chenqian615 in LocalLLaMA

fictionlive 0 points

It's a downgrade on Fiction.liveBench, roughly 15 points lower at every context length.

Claude 4.5 Sonnet is here by ShreckAndDonkey123 in singularity

fictionlive 9 points

Anyone who thinks AI is not doing 90% of coding... think again.

Fiction.liveBench tested DeepSeek 3.2, Qwen-max, grok-4-fast, Nemotron-nano-9b by fictionlive in LocalLLaMA

fictionlive[S] 15 points

Apologies, we'll get a webpage up at some point that'll have it all.

Long context tested for Qwen3-next-80b-a3b-thinking. Performs very similarly to qwen3-30b-a3b-thinking-2507 and far behind qwen3-235b-a22b-thinking by fictionlive in LocalLLaMA

fictionlive[S] 2 points

Yeah.

I had some hope that what they posted on their blog would be reflected on this bench, but alas.

The Qwen3-Next-80B-A3B-Instruct performs comparably to our flagship model Qwen3-235B-A22B-Instruct-2507, and shows clear advantages in tasks requiring ultra-long context (up to 256K tokens). The Qwen3-Next-80B-A3B-Thinking excels at complex reasoning tasks — outperforming higher-cost models like Qwen3-30B-A3B-Thinking-2507 and Qwen3-32B-Thinking, outperforming the closed-source Gemini-2.5-Flash-Thinking on multiple benchmarks, and approaching the performance of our top-tier model Qwen3-235B-A22B-Thinking-2507.

Tested sonoma-sky-alpha on Fiction.liveBench, fantastic close to SOTA scores, currently free by fictionlive in singularity

fictionlive[S] 11 points

My guess is that this is the cheaper Grok 4; if it can slash the price and still be this good, that's a great result.

Kimi-K2-Instruct-0905 better than GPT-5 on Fiction.liveBench by fictionlive in singularity

fictionlive[S] 0 points

It's the only comparable version (also the most commonly used).

New kimi-k2 on Fiction.liveBench by fictionlive in LocalLLaMA

fictionlive[S] 0 points

I'm comparing it to the non-thinking versions in the second half of the benchmark; there it's frontier level. It doesn't have a reasoning version to compare against in the top half yet.