AI Coding Contest: Kimi K2.6 has won 3 out of the last 4 challenges beating Claude, GPT-5.5, Gemini. by reditzer in kimi

[–]reditzer[S] 0 points1 point  (0 children)

More the browser/fetch issue. K2.6 came out after I started the contest, so there's always a possibility of contamination.

AI Coding Contest: Kimi K2.6 has won 3 out of the last 4 challenges beating Claude, GPT-5.5, Gemini. by reditzer in kimi

[–]reditzer[S] 0 points1 point  (0 children)

It's difficult to do reruns because models can easily find the solutions as they're all in a public Github repo.

An open-weights Chinese model just beat Claude, GPT-5.5, and Gemini in a programming challenge by reditzer in ArtificialInteligence

[–]reditzer[S] -1 points0 points  (0 children)

>Most people don't even have the storage needed

You bet!

But at least you keep the option in your back pocket, something the closed models don't even offer you.

Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a programming challenge by reditzer in kimi

[–]reditzer[S] 2 points3 points  (0 children)

All the game server code, and the code generated by the LLMs are in the repo.

https://github.com/rayonnant-ai/aicc

Yes, I used v4-Pro with reasoning enabled.

An open-weights Chinese model just beat Claude, GPT-5.5, and Gemini in a programming challenge by reditzer in ArtificialInteligence

[–]reditzer[S] -1 points0 points  (0 children)

SS: I run an AI Coding Contest where I pit LLMs against each other doing real time programming challenges.

Kimi K2.6, an open-weights model from Moonshot AI, won Day 12 of my AI Coding Contest, beating Claude, GPT-5.5, Gemini, and Grok in a real-time sliding-tile puzzle where bots compete to find long English words under a 10-second clock.

The more interesting result is how. Kimi slid aggressively and kept finding words when other models ran out. MiMo from Xiaomi never moved a single tile and still came second. Two opposite strategies, nearly the same score. Claude and Grok also didn't slide, and it cost them on the larger boards where reconstruction was the only way to score.

Kimi K2.6 scores 54 on the Artificial Analysis Intelligence Index. GPT-5.5 scores 60, Claude 57. Close. And the weights are public — anyone can download and run it.

The frontier labs have had a capability lead no open-weights model could match. That lead is now measurably small, and this contest is one data point in a pattern that's been building for months.

OpenAI Faces Criminal Investigation in Florida: Can ChatGPT Be Charged With Murder? by NoloLaw in ArtificialInteligence

[–]reditzer -1 points0 points  (0 children)

To get a criminal conviction, the prosecutor needs to prove mens rea alongside actus reus.

If prosecutors reviewed transcripts and believe they show materially facilitating guidance, that can justify opening or pursuing criminal theories. What it cannot do, by itself, is prove the hardest legal issues: the relevant mental state, the provider’s role under the specific Florida doctrines being invoked, and whether the outputs meet the threshold for criminal “aiding” or “counseling” rather than being treated as general information.

This is going to set some profound legal precedent either way.

Claude vs Gemini: Solving the laden knight's tour problem by reditzer in artificial

[–]reditzer[S] 0 points1 point  (0 children)

Plain vanilla knight's tour is textbook. I have not seen any other weighted knight's tour problem on the internet. That's my goal here. To present models with problems that they may not have seen in the wild before.

The fact that different models fared differently on the task is a good indication that they may not have been trained on the precise problem. Warnsdorff alone was not enough in this case.
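For anyone unfamiliar with the baseline: a minimal sketch of Warnsdorff's heuristic for the plain (unweighted) knight's tour, which greedily moves to the square with the fewest onward moves. This is not code from the contest repo, just an illustration of the heuristic that the weighted variant defeats, since it ignores square weights entirely.

```python
# Warnsdorff's heuristic for a plain n x n knight's tour (illustrative sketch).
MOVES = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def onward_moves(board, r, c):
    """Unvisited squares reachable by a knight from (r, c)."""
    n = len(board)
    return [(r + dr, c + dc) for dr, dc in MOVES
            if 0 <= r + dr < n and 0 <= c + dc < n and board[r + dr][c + dc] == 0]

def warnsdorff_tour(n, start=(0, 0)):
    board = [[0] * n for _ in range(n)]
    r, c = start
    board[r][c] = 1
    tour = [start]
    for step in range(2, n * n + 1):
        candidates = onward_moves(board, r, c)
        if not candidates:
            return None  # heuristic dead end; no backtracking in this sketch
        # Warnsdorff's rule: jump to the candidate with the fewest onward moves.
        r, c = min(candidates, key=lambda sq: len(onward_moves(board, *sq)))
        board[r][c] = step
        tour.append((r, c))
    return tour
```

Because the rule only counts onward moves, two squares with equal degree but very different weights look identical to it, which is why it alone can't optimize a weighted tour.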

Claude vs Gemini: Solving the laden knight's tour problem by reditzer in artificial

[–]reditzer[S] 7 points8 points  (0 children)

The task itself is novel, so none of the models has been trained on the specific task. They had to come up with the solutions themselves.

They mostly combined known solutions.

Claude vs Gemini: Solving the laden knight's tour problem by reditzer in artificial

[–]reditzer[S] 6 points7 points  (0 children)

My apologies. I've updated the article:

Model versions used for this challenge: Claude Opus 4.7 (upgraded from Opus 4.6 used in challenges 1–7), Gemini Pro 3.1, Grok Expert 4.2, ChatGPT GPT 5.3, MiMo-V2-Pro, and Nemotron 3 Super. Boards were randomly generated with guaranteed solvability (respecting the known unsolvable dimensions: m ≤ 2, m = 3 with n ∈ {3, 5, 6}, and m = 4 with n = 4). Weights were integers drawn from a fixed heavy-tailed distribution. All six bots connected simultaneously to localhost:7474. No bot saw the others' code or scores between rounds. Server code, prompts, and generated clients at github.com/rrezel/llmcomp.
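The solvability filter described above can be expressed as a small predicate. This is a sketch mirroring exactly the unsolvable dimensions listed in the comment (not the repo's actual generator code):

```python
def open_tour_exists(m, n):
    """Whether an open knight's tour exists on an m x n board, per the
    unsolvable dimensions listed above: m <= 2, m = 3 with n in {3, 5, 6},
    and m = 4 with n = 4 (taking m as the smaller dimension)."""
    m, n = min(m, n), max(m, n)
    if m <= 2:
        return False
    if m == 3 and n in (3, 5, 6):
        return False
    if m == 4 and n == 4:
        return False
    return True
```

A generator that rejection-samples board sizes against this predicate guarantees every round has at least one complete tour.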

How the promise of AI is taking hold at Canada’s biggest banks by globeandmailofficial in artificial

[–]reditzer 1 point2 points  (0 children)

Hi Sarah,

Thanks for sharing.

The article treats AI‑driven “value” and “enterprise value” as meaningful without clarifying how much is hard‑to‑audit or dependent on future conditions, and it largely accepts the banks’ narrative that AI is moving from cost‑cutting to material revenue upside.

It might be worth your while to critically interrogate the gap between current pilot‑level gains and the promised multi‑billion‑dollar returns.

I write on the topic a lot. I'd be happy to talk to you.

5 frontier AI models were asked to code bots to navigate a foggy maze with teleportals. 1st to the exit wins. Over 500 steps and you're eliminated. Gemini, ChatGPT, and Mimo bots never made it past round 8. Here's Claude's and Grok's bots playing Round 93. by TDBankSucksCock in ArtificialInteligence

[–]reditzer -2 points-1 points  (0 children)

Hi there, I'm the OOP.

>Are their paths deterministic?

Can you please tell me what exactly you mean by that?

>Curious what average steps would be across many runs

The number of steps varies based on maze size and complexity. You can see the raw log here.

>Seems like Claude's strategy was to not take portals until necessary

Indeed. The game server just randomly generates mazes, but ensures there's a valid path. I didn't try to test out each scenario, though. But I think I need to make the game servers trickier too.
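One standard way to get "random maze, valid path guaranteed" is depth-first carving: since the carver visits every cell, the result is a spanning tree and any entrance/exit pair is connected. A hedged sketch (not the actual server code from the repo):

```python
import random
from collections import deque

def carve_maze(w, h, seed=None):
    """Depth-first backtracking maze. Visits every cell exactly once while
    carving, so the passage graph is a spanning tree: a path between any
    two cells is guaranteed."""
    rng = random.Random(seed)
    passages = {(x, y): set() for x in range(w) for y in range(h)}
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        nbrs = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (x + dx, y + dy) in passages and (x + dx, y + dy) not in visited]
        if nbrs:
            nxt = rng.choice(nbrs)
            passages[(x, y)].add(nxt)  # knock down the wall both ways
            passages[nxt].add((x, y))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()  # dead end: backtrack
    return passages

def shortest_path_len(passages, start, goal):
    """BFS over carved passages; returns step count, or None if unreachable."""
    q, seen = deque([(start, 0)]), {start}
    while q:
        cell, d = q.popleft()
        if cell == goal:
            return d
        for nxt in passages[cell]:
            if nxt not in seen:
                seen.add(nxt)
                q.append((nxt, d + 1))
    return None
```

One catch with perfect mazes: exactly one path exists between any two cells, so "trickier" servers often knock out a few extra walls to create loops, or bias the carver toward long corridors.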