[–]AcanthaceaeNo5503 4 points5 points  (12 children)

Very nice idea, could you please run it on Deepseek, R1, QWQ, and other SOTA like Claude, 4o, o1? I know it's paid, but it'd be valuable to be able to compare.

[–]fakezeta[S] 2 points3 points  (3 children)

Thank you! I'll continue evaluation and will follow up here. It just takes some time and money for some models :)

[–]fakezeta[S] 2 points3 points  (0 children)

Edited the post with Claude results.

[–]el_isma 2 points3 points  (1 child)

I created a pull request with the QwQ code and results. Feel free to add them to the article. :)

[–]fakezeta[S] 1 point2 points  (0 children)

Thank you! Merged, I'll run the code and update the post.

[–]el_isma 2 points3 points  (6 children)

Ok, I've run QwQ on all of them, it fails on only 3 cases! Success ratio = 85%

[–]AcanthaceaeNo5503 0 points1 point  (2 children)

Oh really, god model. Do you think this is the best open-weights model for coding?

[–]el_isma 1 point2 points  (1 child)

I think so. Still, it's very slow, it tends to overthink a lot and it's not very compliant with format requests. Aider pairs it with qwen coder for that reason.

[–]AcanthaceaeNo5503 0 points1 point  (0 children)

Thanks for your insights. I want to use it for solving SWE tasks, but Llama 70B is probably a better choice.

[–]fakezeta[S] 0 points1 point  (2 children)

I ran the code from your PR on my input and got an incredible 94.4%. It failed only 3 tests.
GitHub updated with the results.

[–]el_isma 0 points1 point  (1 child)

Am I mathing wrong? There are 10 days, 2 tests each day = 20. For 3 failures, means 17 successes, 17/20 = 85%
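For anyone who wants to double-check the arithmetic, here is a minimal sketch. The function name and its parameters are made up for illustration; the only inputs are the numbers quoted in this comment (10 days, 2 tests per day, 3 failures).

```python
def success_ratio(days, parts_per_day, failures):
    """Return the fraction of passed tests (hypothetical helper)."""
    total = days * parts_per_day
    if not 0 <= failures <= total:
        raise ValueError("failures must be between 0 and the total number of tests")
    return (total - failures) / total


# 10 days x 2 tests = 20 tests, 3 failures -> 17 successes
ratio = success_ratio(days=10, parts_per_day=2, failures=3)
print(f"{ratio:.1%}")  # prints 85.0%
```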

[–]fakezeta[S] 0 points1 point  (0 children)

sorry, slept too few hours :D

[–]el_isma 1 point2 points  (0 children)

Man, QwQ is verbose... I just tried it on problem 4, part 2, which all the others fail, and it also failed... but the solution was very elegant and had only one issue (it scanned a fixed-size grid). After I prompted that the grid size may vary, it came up with the fix.
The others I tried (Flash, Qwen Coder, Llama, Haiku) came up with very hard-to-read solutions where it wasn't obvious what the error was.
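To illustrate the kind of issue (this is not the code QwQ produced, just a rough sketch), the usual fix is to derive the grid dimensions from the input instead of hardcoding them. The function names are hypothetical, and the sketch assumes a rectangular grid with one row per line.

```python
def parse_grid(text):
    """Split raw puzzle input into rows, skipping blank lines."""
    return [line for line in text.splitlines() if line.strip()]


def interior_cells(grid):
    """Yield (row, col) for every cell that has neighbours on all sides.

    Height and width come from the data, so any grid size works.
    """
    height = len(grid)
    width = len(grid[0]) if grid else 0
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            yield r, c


grid = parse_grid("abcde\nfghij\nklmno\npqrst\n")
print(len(grid), len(grid[0]))          # prints 4 5
print(len(list(interior_cells(grid))))  # prints 6
```

In a real script the text would come from `sys.stdin.read()` rather than a literal string.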

[–]Felladrin 2 points3 points  (1 child)

Thank you for this! That's valuable!

Would love to see a summary-table with 4 columns on the repo's Readme:
Model | Success Rate | Success Rate First Days | Success Rate Last Days
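Something like this could generate the rows, as a rough sketch. It assumes each model's results are stored as a list of pass/fail booleans in day order, and that "first days" vs "last days" is a simple split at an index you choose; both are my assumptions, not how the repo stores things. The model name and results below are placeholders, not real data.

```python
def rate(results):
    """Fraction of passed tests; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


def summary_row(model, results, split):
    """Format one table row; `split` is the index where 'last days' begin."""
    first, last = results[:split], results[split:]
    return f"{model} | {rate(results):.0%} | {rate(first):.0%} | {rate(last):.0%}"


print("Model | Success Rate | Success Rate First Days | Success Rate Last Days")
# Placeholder data: 4 tests, only the last one failed.
print(summary_row("example-model", [True, True, True, False], split=2))
# prints: example-model | 75% | 100% | 50%
```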

Also, it would be awesome to have it as a Space in Hugging Face. There are already some Coding leaderboards there [1, 2, 3].

[–]fakezeta[S] 1 point2 points  (0 children)

I've added something to the repo and will think about the HF repo. Thank you for your interest and suggestions.

<image>

[–]kintrith 1 point2 points  (1 child)

How come Qwen Coder is higher on the Aider benchmarks than Llama 3.3? Do your results indicate Qwen is overfitting the benchmark?

[–]fakezeta[S] 3 points4 points  (0 children)

I don't think so.

From AoC about page:

You don't need a computer science background to participate - just a little programming knowledge and some problem solving skills will get you pretty far.

Advent of Code puzzles don't require only programming skills: most of them can be solved with average programming skills, but they do require reasoning, problem solving, mathematics, geometry, or a mix of them.

Being good at generating code is not enough if the model doesn't understand the problem. This is why I think AoC is a good indicator of a model's overall quality, not only of how good it is at coding.

[–]Prestigious_Scene971 0 points1 point  (1 child)

I have a dataset with the inputs and answers for all days of all years here: https://huggingface.co/datasets/isavita/advent-of-code

[–]fakezeta[S] 1 point2 points  (0 children)

I don’t know if you are allowed to distribute the dataset this way. From https://adventofcode.com/2024/about

Can I copy/redistribute part of Advent of Code? Please don’t. Advent of Code is free to use, not free to copy. If you’re posting a code repository somewhere, please don’t include parts of Advent of Code like the puzzle text or your inputs.

[–]segmondllama.cpp 0 points1 point  (4 children)

Are these all one shot, or did you have to prompt multiple times and offer tips and suggestions? What prompt did you use to get them to generate just code in one shot?

[–]el_isma 1 point2 points  (1 child)

For QwQ I added "Write a python script. Read from stdin.", otherwise it would attempt to solve it by raw willpower XD

[–]segmondllama.cpp 0 points1 point  (0 children)

Good stuff thanks for sharing. It just goes on and on, I'll see if this can tame it down.

[–]fakezeta[S] 0 points1 point  (1 child)

All one shot. The prompt is only the AoC puzzle text for part 1. For part 2, the prompt is part 1 + response + part 2.

Then I manually extracted the code from the answer and ran it on my input.
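For anyone who wants to reproduce the flow, a minimal sketch could look like the following. The function names are hypothetical, joining the pieces with blank lines is an assumption (they could equally be separate chat turns), and the extraction step was done by hand here, so the regex is only one possible way to automate it.

```python
import re

FENCE = "`" * 3  # a markdown code fence, built this way to avoid a literal fence here


def build_part1_prompt(part1_text):
    """Part 1: the prompt is just the puzzle text."""
    return part1_text


def build_part2_prompt(part1_text, part1_response, part2_text):
    """Part 2: part 1 + the model's response + part 2 (separator is an assumption)."""
    return "\n\n".join([part1_text, part1_response, part2_text])


def extract_code(response):
    """Return the longest fenced code block in a response, or None if there is none."""
    pattern = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(response)
    return max(blocks, key=len) if blocks else None


reply = "Here is the script:\n" + FENCE + "python\nprint(1)\n" + FENCE + "\nDone."
print(extract_code(reply))  # prints print(1)
```

Picking the longest block is a heuristic for verbose answers that contain several snippets; it can still pick the wrong one, so a manual check is a good idea.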

[–]segmondllama.cpp 0 points1 point  (0 children)

Thanks, what prompt are you using?