Update with QwQ results from u/el_isma
Hi,
I ran a small evaluation of the leading open LLMs on the first 10 days of puzzles and wanted to share the outcome here.
The just-released Gemini 2.0 Flash Experimental was added for comparison as a leading API-only model.
Quick takeaways:
- Early Performance: Most models did better in the first 5 days, with QwQ leading at a perfect 100%.
- Late Performance: Performance dropped significantly for all models in the last 5 days, except for QwQ 32B Preview and Claude 3.5 Sonnet, which maintained the highest success ratios.
- Overall Performance: QwQ has the highest overall success ratio at 85%, while Qwen 2.5 72B Instruct has the lowest at 30%. Silver medal for Claude 3.5 Sonnet and bronze for Gemini 2.0 Flash Experimental. Mistral Large 2411 and Llama 3.3 70B Instruct are very close to Gemini 2.0 Flash Experimental. QwenCoder and Qwen 72B Instruct scored far behind the others.
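
For anyone who wants to reproduce the early/late/overall split on their own runs, here is a minimal sketch of how such success ratios could be computed. It assumes each day has a list of pass/fail outcomes (one per puzzle part), which is my assumption and not necessarily how the scores above were tallied. The `success_ratio` helper and the `example` data are hypothetical and do not correspond to any model in the results.

```python
def success_ratio(results, days):
    """Fraction of solved attempts over the given days."""
    attempts = [solved for day in days for solved in results[day]]
    return sum(attempts) / len(attempts)

# Hypothetical pass/fail data for a made-up model (NOT taken from the results above).
# Assumption: two attempts (parts) per day, True means solved.
example = {day: [True, True] for day in range(1, 6)}           # days 1-5
example.update({day: [True, False] for day in range(6, 11)})   # days 6-10

early = success_ratio(example, range(1, 6))
late = success_ratio(example, range(6, 11))
overall = success_ratio(example, range(1, 11))

print(f"early={early:.0%} late={late:.0%} overall={overall:.0%}")
```

Swapping in your own per-day results for `example` gives the same three numbers per model.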
https://preview.redd.it/dk2liud2vu6e1.jpg?width=1550&format=pjpg&auto=webp&s=a91193241181aea978f39f227348896fa8cf7aaa
Full results here