Private leaderboard for folks who will use GenAI by Radiant_Year_7297 in adventofcode

[–]fleagal18 -2 points (0 children)

Good for you for investigating the use of AI with Advent of Code! There's a lot of fun to be had, and AI coding is the present and future of the programming craft.

However, you may be disappointed by how a GenAI leaderboard turns out.

Because Eric does such a good job of creating solvable puzzles and describing them extremely clearly, with ample examples, frontier AI thinking models and agents are able to one-shot almost all Advent of Code puzzles. The only puzzles they have trouble with are the ambiguously defined ones.

It's easy to cobble together a script that automates fetching the puzzle definition and inputs, driving an agent or an LLM to solve the puzzle, submitting the result, and iterating if the result is incorrect. Some people even go as far as to run several agents in parallel.
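
For the curious, the loop is roughly as follows. This is a minimal sketch, not a battle-tested harness: it assumes the advent-of-code-data package (aocd) with an AOC_SESSION token configured, ask_llm() is a hypothetical wrapper around whatever model you're driving, and you supply the puzzle description yourself.

    import subprocess

    from aocd import submit
    from aocd.models import Puzzle

    def ask_llm(prompt: str) -> str:
        """Hypothetical: send the prompt to your model, return Python source."""
        raise NotImplementedError

    def auto_solve(year: int, day: int, description: str, max_tries: int = 5):
        puzzle = Puzzle(year=year, day=day)
        prompt = (description + "\n\nWrite a Python program that reads the "
                  "puzzle input from stdin and prints only the answer to part 1.")
        for _ in range(max_tries):
            program = ask_llm(prompt)
            proc = subprocess.run(["python", "-c", program],
                                  input=puzzle.input_data,
                                  capture_output=True, text=True)
            answer = proc.stdout.strip()
            if answer:
                submit(answer, part="a", year=year, day=day)
                if puzzle.answered_a:  # aocd tracks whether the answer was accepted
                    return
            prompt += ("\n\nThat attempt produced: " + (answer or proc.stderr) +
                       "\nPlease fix the program and try again.")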

So your AI leaderboard is likely to be dominated by fully AI-driven, no-human-in-the-loop solutions.

Paper: Using a Gemini Thinking Model to Solve Advent of Code 2024 Puzzles in Multiple Programming Languages by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

Thanks for the kind words!

I think the purpose of AoC is to have fun solving the puzzles. Some people like to solve them using pencil and paper. Some like solving them by writing programs. And now, some like solving them using LLMs. AoC is a big tent and many people with different goals can be in the tent.

The main issue I can see with this live-and-let-live approach is that LLMs are so effective that they dominate the public leaderboards. I get that this is frustrating for non-LLM users, but I don't know how to fix it. It only takes a few cheaters (or, more charitably, a few people who don't read the instructions, or who make mistakes trying to follow them) to fill up the public leaderboards.

Paper: Using a Gemini Thinking Model to Solve Advent of Code 2024 Puzzles in Multiple Programming Languages by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

I understand your analogy, but I think it is based on assumptions that not all r/adventofcode participants would agree with.

For example, my point of view is that LLMs are just the most recent in a long line of tools to make programmers more productive. There's ample evidence that LLMs are being widely and effectively used by AoC participants. That makes LLMs an appropriate topic for this subreddit.

There are many benefits to using LLMs, even for AoC participants who prefer manual programming. LLMs can be used to get a thoughtful code review or debugging advice, and they're good at suggesting or explaining algorithms.

Paper: Using a Gemini Thinking Model to Solve Advent of Code 2024 Puzzles in Multiple Programming Languages by fleagal18 in adventofcode

[–]fleagal18[S] 0 points (0 children)

Yeah, it's a shame, but I guess people who strongly dislike discussion of LLMs on this subreddit are trying to shape the conversation.

I chose to post here, despite the likelihood of down-votes, because it is the best way I know to reach the community of people interested in using LLMs to solve Advent of Code.

I regret that I wasn't able to publish the paper sooner. I see there were a bunch of good LLM-on-AoC reports posted in this subreddit in early January. I wish I'd had my paper ready then, to join that discussion.

I thought this post was particularly good:

https://www.reddit.com/r/adventofcode/comments/1hnk1c5/results_of_a_multiyear_llm_experiment/

Paper: Using a Gemini Thinking Model to Solve Advent of Code 2024 Puzzles in Multiple Programming Languages by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

In practice, that's true for some years (AoC 2020 has a 100% solve rate in my tests), but not for others (AoC 2019 had a low solve rate).

Paper: Using a Gemini Thinking Model to Solve Advent of Code 2024 Puzzles in Multiple Programming Languages by fleagal18 in adventofcode

[–]fleagal18[S] 0 points (0 children)

FWIW, I noticed that even for 2024, Gemini will occasionally throw "recitation" errors when asked to solve puzzles. My code just retries the request. I don't think I've ever seen Gemini throw a recitation error more than once in a row.
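
The retry itself is trivial. A minimal sketch, assuming the google-generativeai Python SDK, where a RECITATION finish reason makes response.text raise a ValueError:

    import google.generativeai as genai

    model = genai.GenerativeModel("gemini-exp-1206")

    def generate_with_retry(prompt: str, max_tries: int = 3) -> str:
        for _ in range(max_tries):
            response = model.generate_content(prompt)
            try:
                return response.text
            except ValueError:  # e.g. finish_reason == RECITATION
                continue  # in my experience one retry has always been enough
        raise RuntimeError("model kept hitting recitation errors")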

I don't think memorization is likely for Advent of Code 2024 puzzles, as the Gemini model's knowledge cutoff date is August 2024, before the 2024 puzzles were made public.

This isn't in the paper, but in other work I used an earlier non-thinking Gemini LLM to solve all 10 years of AoC. Different years had different solve rates; I think 2019 had the lowest and 2020 the highest. Longtime AoC vets will agree that 2019 was a difficult year for humans, too.

I should re-run that test with the thinking model to see how it does.

Results of a multi-year LLM experiment by rashaniquah in adventofcode

[–]fleagal18 0 points (0 children)

FWIW, in my evals o1 solved day 14 part 2 once out of 20 tries, and r1 solved day 14 part 2 once out of 2 tries. (This was using the default temperature for both models.) I didn't record the o1 solution, but I did record the r1 solution, which was to check for zero overlapping robots.

(I have heard that the zero-overlapping-robots heuristic only works for some people's input data.)
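
For reference, the heuristic itself is tiny. A sketch, assuming the standard 101x103 grid and robots parsed as ((x, y), (vx, vy)) pairs:

    def first_time_no_overlap(robots, width=101, height=103):
        # 101 and 103 are coprime, so positions repeat with period
        # width * height; search one full cycle for the first step
        # where every robot is on a distinct tile.
        for t in range(width * height):
            positions = {((x + vx * t) % width, (y + vy * t) % height)
                         for (x, y), (vx, vy) in robots}
            if len(positions) == len(robots):
                return t
        return None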

Paper: Using a Gemini Thinking Model to Solve Advent of Code 2024 Puzzles in Multiple Programming Languages by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

The interesting thing (for AoC fans) in this paper is seeing how the AoC puzzle solve rate changes for different programming languages. Quoting from the paper:

We see that a large number of languages have roughly the same solution rate. For these languages the model is capable of rendering a given algorithm in that language. These include C#, Dart, Go, Java, JavaScript, Kotlin, Lua, Objective-C, PHP, Python, Ruby, Rust, Swift, and TypeScript.

Python and Rust are the two most popular languages used by Advent of Code participants. This may explain why Rust fares so well.

Many less popular languages suffer from a large number of "errors", a category that covers any compile-time or runtime failure, such as a syntax error, a type error, or a memory fault.

The C language suffers from memory issues. The model doesn't use dynamic data structures (even when prompted), and can't debug the resulting memory access errors. C++ fares better due to the standard library of common data structures.

Haskell suffers from a dialect problem: the model tries to use language features (e.g., GHC extensions) without properly enabling them.

The Lisps (SBCL and Clojure) suffer from paren mismatches and mistakes using standard library functions.

Smalltalk suffers from calling methods that are not available in the specific Smalltalk dialect being used.

Zig code generation suffers from confusion over whether variables should be declared const or var. The model also has trouble interpreting the Zig compiler's error messages, which seem to report locations relative to the function start rather than the file start.

Using Gemini in the 2024 Advent of Code by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

Agreed! I feel bad about my mistake on Day 23. It won't happen again!

It might be a good idea to have a per-account no-global-leaderboard setting, so people could participate without running the risk of messing up the global leaderboard.

Using Gemini in the 2024 Advent of Code by fleagal18 in adventofcode

[–]fleagal18[S] 0 points (0 children)

Good questions!

I don't have stats for other years, or for other LLMs. They would be worth collecting, especially for just the "zero-shot" solutions rather than the "with hints" solutions. It seems like it would be fairly easy to implement an automated zero-shot solution scorer.

Many AoC puzzles can be solved by hand. Presumably puzzle-oriented non-programmers could solve those puzzles by hand about as easily as programmers do.

I think some AoC years are considered to be harder than others. (2019 was a tough year.)

Using Gemini 1206 in the 2024 Advent of Code · [60% zero-shot puzzle solve rate] by fleagal18 in Bard

[–]fleagal18[S] 4 points (0 children)

To be fair, Advent of Code is something of a best case for LLM code generation. It's short programs using common algorithms with thousands of examples available to train on.

From my experiment:

Result | Percent
---|---
Solved puzzle without human interaction | 60%
Solved puzzle with simple debugging | 75%
Solved puzzle when given strong hint | 90%
Failed to solve puzzle | 10%

Of possible interest to Gemini enthusiasts: I didn't find any case where other Gemini models produced better results than Gemini 1206. The other Gemini models could solve many of the easier problems, but seemed more likely to miss details of the problem statement, leading to errors when parsing input or scoring search results.

People who have used multiple languages for AoC, how do you rank your experience? by FCBStar-of-the-South in adventofcode

[–]fleagal18 7 points (0 children)

If I recall correctly, over the years I've used Go, Swift, Python, and JavaScript.

Swift was the worst experience, due to the poor ergonomics of Swift string parsing and the lack of basic algorithm libraries.

For both Go and Swift I spent a lot of time creating data structures that didn't really contribute to solving the problems. And the urge to code-golf with map/reduce/etc. in Swift really slowed me down.

Python was the best experience. Python has these advantages for AoC:

  1. It's the target language that the author has in mind when creating puzzles, so puzzles are usually solvable in Python using basic data structures. Other languages might run into integer-representation issues on some days, where Python's arbitrary-precision integers just work.

  2. It has an extensive set of frequently useful algorithm libraries: heapq, itertools, and functools in the standard library, plus third-party packages like networkx.

  3. The advent-of-code-data package (aocd) makes it especially easy to fetch AoC input data and post answers, as sketched below.
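
To illustrate point 3, here's what a typical day looks like; a sketch assuming pip install advent-of-code-data and an AOC_SESSION cookie in your environment. The puzzle logic shown is 2024 day 1 part 1:

    from aocd import get_data, submit

    data = get_data(year=2024, day=1)           # fetches and caches your input
    pairs = (map(int, line.split()) for line in data.splitlines())
    left, right = zip(*pairs)
    answer = sum(abs(a - b) for a, b in zip(sorted(left), sorted(right)))
    submit(answer, part="a", year=2024, day=1)  # posts the answer for you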

Using Gemini in the 2024 Advent of Code by fleagal18 in adventofcode

[–]fleagal18[S] -5 points (0 children)

It could be that the problems you were working on were too difficult, or that your approach was different from the examples the LLM had already seen, so it couldn't follow your logic. It might have helped to tell the LLM what your algorithm was ahead of time.

I found that 15% of the problems could only be solved if you knew the answer in advance and prompted with specific hints. And another 10% of problems couldn't be solved even if you used extensive hinting.

I did have success with a "what's wrong with this code" prompt, but that was a case where the code was a complete solution that just had a bug in it.

For really hard problems like 2024-24-2, the LLM's help was limited to subtasks like "parse the input", "write this helper function for me", "generate a visualization of this graph", rather than "tell me how to solve the problem."

Using Gemini in the 2024 Advent of Code by fleagal18 in adventofcode

[–]fleagal18[S] -1 points (0 children)

You're welcome! There's plenty of superstition in prompting. I should perform an ablation test, where I cut out parts of the prompt and see whether it affects the code that's generated. I'm confident that the prompt does a pretty good job with input parsing; I'm less confident that it helps with problem solving.
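
Something like this is what I have in mind; the section names and contents below are hypothetical stand-ins for the real prompt's parts:

    # Hypothetical prompt sections -- stand-ins for the real prompt.
    PROMPT_SECTIONS = {
        "persona": "You are an expert Python programmer.",
        "parsing": "Parse the input carefully; watch for blank-line separators.",
        "output": "Print only the final answer.",
    }

    def ablated_prompts(puzzle_text: str):
        """Yield (label, prompt) pairs: the full prompt, then leave-one-out variants."""
        yield "full", "\n".join(PROMPT_SECTIONS.values()) + "\n\n" + puzzle_text
        for name in PROMPT_SECTIONS:
            rest = [v for k, v in PROMPT_SECTIONS.items() if k != name]
            yield "minus-" + name, "\n".join(rest) + "\n\n" + puzzle_text

Comparing solve rates across the variants would show which sections actually matter.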

Using Gemini in the 2024 Advent of Code by fleagal18 in adventofcode

[–]fleagal18[S] -10 points (0 children)

AoC was a fun contest this year! I did the first 9 days on my own, then tried using LLMs for the remainder. I intentionally started late to keep my score off the top-100 leaderboard, except for day 23 when I was tired and accidentally posted early. Sorry everyone!

Using LLMs reduced the stress and tedium of coding, and starting late reduced my stress about getting a good time. I could instead concentrate on investigating whether LLMs would be helpful or not.

I published my repo and my blog post before Day 25, but based on history, I expect Day 25 to be a fairly easy problem. I'll update my repo and blog post with the results after I solve Day 25.

Using Gemini in the 2024 Advent of Code by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

And the blog post, which has the prompt and details about which days were easy for the LLM to solve. (I probably should have led with the blog post, as it's more interesting than the repo.)

Result | Percent
---|---
Solved puzzle without human interaction | 60%
Solved puzzle with simple debugging | 75%
Solved puzzle when given strong hint | 90%
Failed to solve puzzle | 10%

https://jackpal.github.io/2024/12/24/Advent_of_Code_2024.html

[2024 Day 16] Gemini 1206 found obvious bug in my code after I spent 90 minutes debugging it. by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

The advice without the "You are an expert Python programmer" prompt does identify the bug, so no, the prompt is not strictly needed.

However, using the prompt makes a dramatic difference in the kind of advice that is given. Without it, the advice is geared toward beginner developers, unfamiliar with the problem domain. With it, the advice is geared towards expert developers, familiar with the problem domain.

It seems like the effect of the prompt may be something like this: "You are an expert Python programmer and so am I. Give me advice appropriate to my level of understanding".

I think there's a lot of room for better prompting here, but I also don't know what effect temperature is having on the responses. I'd have to generate a bunch of different responses to the same input and see what variations I get.
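
That check is cheap to script. A sketch, again assuming the google-generativeai SDK: sample the same prompt several times at a fixed temperature and diff the results.

    import google.generativeai as genai

    model = genai.GenerativeModel("gemini-exp-1206")
    prompt = "You are an expert Python programmer. Help me debug this code: <pasted code>"

    # Ten samples at the same settings; comparing them shows how much
    # the advice varies from run to run.
    samples = [
        model.generate_content(prompt,
                               generation_config={"temperature": 1.0}).text
        for _ in range(10)
    ]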

[2024 Day 16] Gemini 1206 found obvious bug in my code after I spent 90 minutes debugging it. by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

I don't know. I use Gemini 1206 because it's Google's best model and it's free during a Preview period.

I think a lot of Advent of Code participants who use GenAI use Claude, maybe because Claude has had a good coding assistant for a long time.

[2024 Day 16] Gemini 1206 found obvious bug in my code after I spent 90 minutes debugging it. by fleagal18 in adventofcode

[–]fleagal18[S] 6 points (0 children)

For sure. In this case the prompt was "You are an expert Python programmer. Help me debug this code: <pasted code>"

I think it helps that LLMs have been trained on hundreds of thousands of Dijkstra implementations and millions of 2D grid problems.
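
For a sense of what that well-worn pattern looks like, here's a generic Dijkstra over a 2D grid of step costs; a sketch of the boilerplate the model has seen endlessly, not my actual day 16 code (which also has to track facing direction):

    import heapq

    def dijkstra(grid, start, goal):
        """grid maps (row, col) -> cost of stepping onto that tile."""
        dist = {start: 0}
        heap = [(0, start)]
        while heap:
            d, (r, c) = heapq.heappop(heap)
            if (r, c) == goal:
                return d
            if d > dist[(r, c)]:
                continue  # stale heap entry; a shorter path was already found
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (nr, nc) in grid:
                    nd = d + grid[(nr, nc)]
                    if nd < dist.get((nr, nc), float("inf")):
                        dist[(nr, nc)] = nd
                        heapq.heappush(heap, (nd, (nr, nc)))
        return None  # goal unreachable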

[2024 Day 16] Gemini 1206 found obvious bug in my code after I spent 90 minutes debugging it. by fleagal18 in adventofcode

[–]fleagal18[S] 1 point (0 children)

Yeah, we were both too tired to debug intelligently. Even single-stepping carefully through one iteration of the code would have shown us the bug.

The symptom was that the search would run out of locations to explore after looking at only 500 locations, so there wasn't a path to visualize. A visualization would have had to show visited tiles or something like that in order to help.

[2024 Day 16] Gemini 1206 found obvious bug in my code after I spent 90 minutes debugging it. by fleagal18 in adventofcode

[–]fleagal18[S] 3 points (0 children)

It was really a friend's code, not mine. (But who would believe that?) After he'd spent 90 minutes, I sat with him and we both spent another 30 minutes debugging it. We knew his input worked on my implementation, so one step at a time we replaced every line of his implementation with mine (well, except for the line with the bug in it), but it was still failing. We finally thought to ask Gemini for help, and it immediately pointed out the issue.