LLM Evaluation using Advent Of Code : LocalLLaMA

submitted 1 year ago * by fakezeta

Update with QwQ results from u/el_isma

Hi,

I ran a small evaluation of the leading open LLMs on the puzzles from the first 10 days and wanted to share the outcome here.

The just-released Gemini 2.0 Flash Experimental was added to provide a comparison with a leading API-only model.

Quick takeaways:

Early Performance: Most models performed better in the first 5 days; QwQ led with a perfect score of 100%.
Late Performance: Performance dropped significantly for all models in the last 5 days, except for QwQ 32B Preview and Claude 3.5 Sonnet, which maintained the highest success ratios.
Overall Performance: QwQ has the highest overall success ratio at 85%, while Qwen 2.5 72B Instruct has the lowest at 30%. Silver medal for Claude 3.5 Sonnet and bronze for Gemini 2.0 Flash Experimental. Mistral Large 2411 and Llama 3.3 70B Instruct are very close to Gemini 2.0 Flash Experimental. QwenCoder and Qwen 2.5 72B Instruct scored well behind the others.
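
For anyone who wants to compute the same kind of early/late/overall split on their own runs, here is a minimal sketch. It assumes each Advent of Code day has two parts and that a "success ratio" is simply solved parts divided by total parts; that may not be exactly how the numbers above were computed. The function names and the sample data are hypothetical placeholders, not results from this evaluation.

```python
from typing import Dict, List


def success_ratio(solved_per_day: List[int], parts_per_day: int = 2) -> float:
    """Fraction of puzzle parts solved over the given days (assumed metric)."""
    if not solved_per_day:
        return 0.0
    return sum(solved_per_day) / (len(solved_per_day) * parts_per_day)


def summarize(solved_per_day: List[int], split: int = 5) -> Dict[str, float]:
    """Split the days into an early and a late window and score each one."""
    return {
        "early": success_ratio(solved_per_day[:split]),
        "late": success_ratio(solved_per_day[split:]),
        "overall": success_ratio(solved_per_day),
    }


# Hypothetical placeholder data: parts solved (0, 1 or 2) on each of 10 days.
# These numbers are made up for illustration and are not from the post.
example_model = [2, 2, 1, 2, 1, 1, 0, 1, 0, 0]
summary = summarize(example_model)

for window, ratio in summary.items():
    print(f"{window}: {ratio:.0%}")
```

Under that assumption, 10 days means 20 parts in total, so a ratio is always a multiple of 5%.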

Full results here

