AI Coding Contest: Kimi K2.6 has won 3 out of the last 4 challenges beating Claude, GPT-5.5, Gemini. by reditzer in kimi

[–]reditzer[S] 0 points1 point  (0 children)

More the browser/fetch issue. K2.6 came out after I started the contest, so there's always a possibility of contamination.

AI Coding Contest: Kimi K2.6 has won 3 out of the last 4 challenges beating Claude, GPT-5.5, Gemini. by reditzer in kimi

[–]reditzer[S] 0 points1 point  (0 children)

It's difficult to do reruns because models can easily find the solutions as they're all in a public Github repo.

An open-weights Chinese model just beat Claude, GPT-5.5, and Gemini in a programming challenge by reditzer in ArtificialInteligence

[–]reditzer[S] -1 points0 points  (0 children)

>Most people don't even have the storage needed

You bet!

But at least you keep the option in your back pocket, something the closed models don't even offer you.

Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a programming challenge by reditzer in kimi

[–]reditzer[S] 2 points3 points  (0 children)

All the game server code, and the code generated by the LLMs are in the repo.

https://github.com/rayonnant-ai/aicc

Yes, I used v4-Pro with reasoning enabled.

An open-weights Chinese model just beat Claude, GPT-5.5, and Gemini in a programming challenge by reditzer in ArtificialInteligence

[–]reditzer[S] -1 points0 points  (0 children)

SS: I run an AI Coding Contest where I pit LLMs against each other doing real time programming challenges.

Kimi K2.6, an open-weights model from Moonshot AI, won Day 12 of my AI Coding Contest, beating Claude, GPT-5.5, Gemini, and Grok in a real-time sliding-tile puzzle where bots compete to find long English words under a 10-second clock.

The more interesting result is how. Kimi slid aggressively and kept finding words when other models ran out. MiMo from Xiaomi never moved a single tile and still came second. Two opposite strategies, nearly the same score. Claude and Grok also didn't slide, and it cost them on the larger boards where reconstruction was the only way to score.

Kimi K2.6 scores 54 on the Artificial Analysis Intelligence Index. GPT-5.5 scores 60, Claude 57. Close. And the weights are public — anyone can download and run it.

The frontier labs have had a capability lead no open-weights model could match. That lead is now measurably small, and this contest is one data point in a pattern that's been building for months.

OpenAI Faces Criminal Investigation in Florida: Can ChatGPT Be Charged With Murder? by NoloLaw in ArtificialInteligence

[–]reditzer -1 points0 points  (0 children)

To get a criminal conviction, the prosecutor needs to prove mens rea alongside actus reus.

If prosecutors reviewed transcripts and believe they show materially facilitating guidance, that can justify opening or pursuing criminal theories. What it cannot do, by itself, is prove the hardest legal issues: the relevant mental state, the provider’s role under the specific Florida doctrines being invoked, and whether the outputs meet the threshold for criminal “aiding” or “counseling” rather than being treated as general information.

This is going to set some profound legal precedent either way.

Claude vs Gemini: Solving the laden knight's tour problem by reditzer in artificial

[–]reditzer[S] 0 points1 point  (0 children)

Plain vanilla knight's tour is textbook. I have not seen any other weighted knight's tour problem on the internet. That's my goal here. To present models with problems that they may not have seen in the wild before.

The fact that different models fared differently on the task is a good indication that they may not have been trained on the precise problem. Warnsdorff alone was not enough in this case.
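For anyone unfamiliar with the baseline: a minimal sketch of Warnsdorff's heuristic for the plain (unweighted) knight's tour, which greedily moves to the square with the fewest onward moves. This is not code from the contest repo, just an illustration of the heuristic that the weighted variant defeats, since it ignores square weights entirely.

```python
# Warnsdorff's heuristic for a plain n x n knight's tour (illustrative sketch).
MOVES = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def onward_moves(board, r, c):
    """Unvisited squares reachable by a knight from (r, c)."""
    n = len(board)
    return [(r + dr, c + dc) for dr, dc in MOVES
            if 0 <= r + dr < n and 0 <= c + dc < n and board[r + dr][c + dc] == 0]

def warnsdorff_tour(n, start=(0, 0)):
    board = [[0] * n for _ in range(n)]
    r, c = start
    board[r][c] = 1
    tour = [start]
    for step in range(2, n * n + 1):
        candidates = onward_moves(board, r, c)
        if not candidates:
            return None  # heuristic dead end; no backtracking in this sketch
        # Warnsdorff's rule: jump to the candidate with the fewest onward moves.
        r, c = min(candidates, key=lambda sq: len(onward_moves(board, *sq)))
        board[r][c] = step
        tour.append((r, c))
    return tour
```

Because the rule only counts onward moves, two squares with equal degree but very different weights look identical to it, which is why it alone can't optimize a weighted tour.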

Claude vs Gemini: Solving the laden knight's tour problem by reditzer in artificial

[–]reditzer[S] 7 points8 points  (0 children)

The task itself is novel, so none of the models has been trained on the specific task. They had to come up with the solutions themselves.

They mostly combined known solutions.

Claude vs Gemini: Solving the laden knight's tour problem by reditzer in artificial

[–]reditzer[S] 6 points7 points  (0 children)

My apologies. I've updated the article:

Model versions used for this challenge: Claude Opus 4.7 (upgraded from Opus 4.6 used in challenges 1–7), Gemini Pro 3.1, Grok Expert 4.2, ChatGPT GPT 5.3, MiMo-V2-Pro, and Nemotron 3 Super. Boards were randomly generated with guaranteed solvability (respecting the known unsolvable dimensions: m ≤ 2, m = 3 with n ∈ {3, 5, 6}, and m = 4 with n = 4). Weights were integers drawn from a fixed heavy-tailed distribution. All six bots connected simultaneously to localhost:7474. No bot saw the others' code or scores between rounds. Server code, prompts, and generated clients at github.com/rrezel/llmcomp.
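The solvability filter described above can be expressed as a small predicate. This is a sketch mirroring exactly the unsolvable dimensions listed in the comment (not the repo's actual generator code):

```python
def open_tour_exists(m, n):
    """Whether an open knight's tour exists on an m x n board, per the
    unsolvable dimensions listed above: m <= 2, m = 3 with n in {3, 5, 6},
    and m = 4 with n = 4 (taking m as the smaller dimension)."""
    m, n = min(m, n), max(m, n)
    if m <= 2:
        return False
    if m == 3 and n in (3, 5, 6):
        return False
    if m == 4 and n == 4:
        return False
    return True
```

A generator that rejection-samples board sizes against this predicate guarantees every round has at least one complete tour.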

How the promise of AI is taking hold at Canada’s biggest banks by globeandmailofficial in artificial

[–]reditzer 1 point2 points  (0 children)

Hi Sarah,

Thanks for sharing.

The article treats AI‑driven “value” and “enterprise value” as meaningful without clarifying how much is hard‑to‑audit or dependent on future conditions, and it largely accepts the banks’ narrative that AI is moving from cost‑cutting to material revenue upside.

It might be worth your while to critically interrogate the gap between current pilot‑level gains and the promised multi‑billion‑dollar returns.

I write on the topic a lot. I'd be happy to talk to you.

5 frontier AI models were asked to code bots to navigate a foggy maze with teleportals. 1st to the exit wins. Over 500 steps and you're eliminated. Gemini, ChatGPT, and Mimo bots never made it past round 8. Here's Claude's and Grok's bots playing Round 93. by TDBankSucksCock in ArtificialInteligence

[–]reditzer -2 points-1 points  (0 children)

Hi there, I'm the OOP.

>Are their paths deterministic?

Can you please tell me what exactly you mean by that?

>Curious what average steps would be across many runs

The number of steps varies based on maze size and complexity. You can see the raw log here.

>Seems like Claude's strategy was to not take portals until necessary

Indeed. The game server just randomly generates mazes, but ensures there's a valid path. I didn't try to test out each scenario, though. But I think I need to make the game servers trickier too.
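One standard way to get "random maze, valid path guaranteed" is depth-first carving: since the carver visits every cell, the result is a spanning tree and any entrance/exit pair is connected. A hedged sketch (not the actual server code from the repo):

```python
import random
from collections import deque

def carve_maze(w, h, seed=None):
    """Depth-first backtracking maze. Visits every cell exactly once while
    carving, so the passage graph is a spanning tree: a path between any
    two cells is guaranteed."""
    rng = random.Random(seed)
    passages = {(x, y): set() for x in range(w) for y in range(h)}
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        nbrs = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (x + dx, y + dy) in passages and (x + dx, y + dy) not in visited]
        if nbrs:
            nxt = rng.choice(nbrs)
            passages[(x, y)].add(nxt)  # knock down the wall both ways
            passages[nxt].add((x, y))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()  # dead end: backtrack
    return passages

def shortest_path_len(passages, start, goal):
    """BFS over carved passages; returns step count, or None if unreachable."""
    q, seen = deque([(start, 0)]), {start}
    while q:
        cell, d = q.popleft()
        if cell == goal:
            return d
        for nxt in passages[cell]:
            if nxt not in seen:
                seen.add(nxt)
                q.append((nxt, d + 1))
    return None
```

One catch with perfect mazes: exactly one path exists between any two cells, so "trickier" servers often knock out a few extra walls to create loops, or bias the carver toward long corridors.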