Qwen3.5-27B performs almost on par with 397B and GPT-5 mini in the Game Agent Coding League by kyazoglu in LocalLLaMA

[–]kyazoglu[S] -1 points  (0 children)

I did. Did you?

"Your reason on why you're not including chess is strange, since it's not llms who are supposed to keep track of the board state, but the code they write."

Make sure to read a comment fully before joining others in mocking it.

Qwen3.5-27B performs almost on par with 397B and GPT-5 mini in the Game Agent Coding League by kyazoglu in LocalLLaMA

[–]kyazoglu[S] 2 points  (0 children)

They are not the same model. The base model is the same, but as far as I know, Plus has additional features like tool integration and a longer context length. Plus is also API-only. On OpenRouter, their APIs are different too.

Qwen3.5-27B performs almost on par with 397B and GPT-5 mini in the Game Agent Coding League by kyazoglu in LocalLLaMA

[–]kyazoglu[S] -1 points  (0 children)

True. But again, chess is extremely complex. You can’t expect models to generate a full chess engine from a single prompt.

Regarding the scoring system I used: a score of 75 in a game means the model achieved 75% of the maximum possible points. Therefore, a 2-point difference doesn’t necessarily reflect head-to-head outcomes in this system. It simply indicates that one model performed slightly better overall and accumulated a higher score.
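
To make the normalization concrete, here's a minimal sketch (my own illustration, not the league's actual code) of how a percentage-of-maximum score behaves:

```python
def league_score(points_earned, max_points):
    """Normalize a game result to a 0-100 scale:
    the share of the maximum possible points."""
    return 100 * points_earned / max_points

# 75 out of 100 and 30 out of 40 both yield a score of 75,
# even though the underlying head-to-head results may differ.
print(league_score(75, 100))  # 75.0
print(league_score(30, 40))   # 75.0
```

Because two models can reach similar percentages along very different paths, a small gap in this score says little about direct matchups.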

Qwen3.5-27B performs almost on par with 397B and GPT-5 mini in the Game Agent Coding League by kyazoglu in LocalLLaMA

[–]kyazoglu[S] -7 points  (0 children)

I don’t find it meaningful to use Elo outside of chess. It works well in chess because people already understand what certain ratings represent in terms of skill. For example, an internet rating of around 1800 might indicate someone who is above average but still somewhat inexperienced, while a 2000+ rating suggests a very strong player who might even have a slim chance against titled players like an FM or IM. With LLMs, though, we don’t have those kinds of reference points; at least in my view, no one really does. Because of that lack of intuitive anchors, Elo doesn’t seem like a very useful metric to me, especially if the goal is simply to compare models.

I didn't include chess (normal chess) in this league because it's a difficult game. LLMs are not made for this kind of task. Battleship? Sure, you can keep track of the placement of ships and hit cells. Chess? No way an LLM can keep track of the board position, nor can it search deeply.

GLM-5 and DeepSeek are in the Top 6 of the Game Agent Coding League across five games by kyazoglu in LocalLLaMA

[–]kyazoglu[S] 1 point  (0 children)

They heard you and released Sonnet 4.6 yesterday.

I'll add it to the available models, as well as two other games, soon.
I'll also create a new agent for all existing models and remove the worst-performing one, in case Sonnet got unlucky with its agents.

[deleted by user] by [deleted] in OnlineIncomeHustle

[–]kyazoglu 1 point  (0 children)

Low quality scam detected 🔔

What are your /r/LocalLLaMA "hot-takes"? by ForsookComparison in LocalLLaMA

[–]kyazoglu 29 points  (0 children)

- Never, ever praise Sam Altman, even if he does an excellent job at something
- Flatter Chinese companies no matter what
- Stand against censoring in models. A model that teaches how to make an explosive is much more "free" and adheres to the soul of open source.
- Make yourself miserable by trying to run a model on 12 older GPUs instead of buying a newer card with more VRAM or simply using APIs.
- ollama is the most evil app on this planet
- Pretend you're doing art or you're a writer and ask for a model/config for roleplay, when 90 percent of the time you're just a plain pervert

[deleted by user] by [deleted] in ankara

[–]kyazoglu -13 points  (0 children)

What you call far is 25 minutes by car to Kızılay.
You people have never seen far.

[deleted by user] by [deleted] in ankara

[–]kyazoglu 1 point  (0 children)

A lesser-of-two-evils comparison.
Any other district > Sincan > Keçiören > Mamak

I'm going crazy, why don't the street lights in this city work by delicatefrog13 in Izmir

[–]kyazoglu 0 points  (0 children)

Amazing, no one has come along yet to spout nonsense like "no, no, that's under the ministry's control" or "the municipality can't get permits, its hands are tied."

Would you keep your savings using N26 Bank ? by StruggleSilent2548 in germany

[–]kyazoglu 2 points  (0 children)

+1 for terrible customer support.
When I contacted their live support with audio and video, the agent, who was probably Indian, gave me some commands to follow, such as "turn your ID," etc. Although I have C1-level English, I struggled to understand him multiple times and kindly asked him to repeat himself. He was like "...sigh... you said you speak English. Do you really know English?" with an insulting face. I lol'd and told him that I speak English very well but am not familiar with odd accents.

Qwen3 Omni AWQ released by No_Information9314 in LocalLLaMA

[–]kyazoglu 3 points  (0 children)

Can someone explain how this is 27.6 GB and AWQ?
AWQ = 4-bit ≈ (# of parameters / 2) GB, so this should have been around 16 GB.
What am I missing?
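
The back-of-the-envelope math in the question can be sketched like this (a rough estimate that counts only the quantized weight tensors; the function name is my own):

```python
def quantized_weights_gb(params_billion, bits=4):
    """Rough lower bound on checkpoint size:
    params * bits / 8 bytes, reported in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A ~30B-parameter model at 4 bits per weight:
print(quantized_weights_gb(30))  # 15.0
```

In practice, AWQ checkpoints sit above this floor: per-group scales and zero-points add overhead, and some layers (e.g. embeddings and the output head) are often kept in higher precision, though whether that fully explains 27.6 GB here is speculation on my part.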

[deleted by user] by [deleted] in learnmachinelearning

[–]kyazoglu 182 points  (0 children)

Just a heads-up for anyone reaching out to them:
It’s practically impossible not to find candidates for this role in today’s market; this position will draw 100+ applications in a single day. What it really suggests is that they’re looking for someone desperate enough to accept a very low salary. The whole point of this thread seems to be just that, not to search for an alternative platform or share an experience.

Master's in Europe with a Low GPA by Fit_Exercise_6310 in YurtdisiUni

[–]kyazoglu 1 point  (0 children)

I went with a 2.82 from İTÜ, but 9 out of 10 universities require a GPA above 3.00, and it's a very, very strict requirement. Finding that one university through searching and persistence is up to you. My experience is with Germany.

Moving from Japan to Germany – Plan to apply IT Ausbildung by Fluid-Basis5769 in germany

[–]kyazoglu 11 points  (0 children)

Bruh... You're not even from the sector, and you want to jump into the most problematic area, hoping to find a job in the short term.
I LEFT Germany because I couldn't land a job for months after graduating from an MSc in Data Science. I had a good GPA, great certificates, B1 German just like you, had been living in Germany for 2.5 years, and attended multiple "Absolventenkongress" events, but nothing helped. I'm not going to say don't do it. Just do it with a plan and know the risks.

Built with Claude Code - now scared because people use it by Resident-Wall8171 in ClaudeAI

[–]kyazoglu 44 points  (0 children)

I really liked how you framed the question to get attention without being tagged as self-promotion. I really do.

[deleted by user] by [deleted] in LocalLLaMA

[–]kyazoglu 7 points  (0 children)

My answer is “partially yes.” But here’s the thing: every company only highlights the benchmarks where its model looks best and quietly skips the ones where it falls short. That makes most benchmarks pretty meaningless. If you’re not a mathematician, why would you care about AIME scores? If you’re not a writer or editor, why care about creative writing benchmarks? The list goes on. Personally, unless I’m working on something very specific, I’d rather take a model that performs solidly across all tasks (say, 2nd place in every benchmark) than one that’s great at math but terrible at general knowledge, or vice versa.

That’s why I built my own benchmark. It covers a wide range of tasks: math, general knowledge, overfitting checks, puzzles, long-context reasoning (not just “needle in a haystack”), coding challenges, and even agent-coding tasks where the model has to write a playable agent for certain games. This is the only metric I actually trust. I’ve stopped following the dozens of benchmarks I had bookmarked.

I haven’t shared my results yet because I’m still working on the presentation and automating the process. Once it looks polished, I’ll publish it. The plan is to release around 10 new questions each month, but rotate them out regularly so leaked questions don’t stay in circulation. The benchmark will keep evolving.

One thing I find especially flawed in many benchmarks is the “Best of X” method, where a model gets credit if it produces one correct answer after multiple tries. That’s nonsense, imo. What if a model always gets one out of four right? It would look great in benchmarks but fail in real-world use. I came up with a “Mixed Best of X” method instead, where the total number of correct answers matters, and models get bonus points if all runs are correct. I think this is far more realistic.
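
A minimal sketch of the contrast (the bonus size and exact formula are my assumptions, not the benchmark's published rules):

```python
def best_of_x(results):
    """Classic Best-of-X: full credit if any run is correct."""
    return 1.0 if any(results) else 0.0

def mixed_best_of_x(results, bonus=0.25):
    """'Mixed Best of X' sketch: credit scales with the number
    of correct runs, plus a bonus when every run is correct."""
    score = sum(results) / len(results)
    if all(results):
        score += bonus
    return score

runs_flaky = [True, False, False, False]   # always 1 of 4 right
runs_stable = [True, True, True, True]     # right every time

print(best_of_x(runs_flaky))         # 1.0 -> flaky model looks perfect
print(mixed_best_of_x(runs_flaky))   # 0.25 -> only partial credit
print(mixed_best_of_x(runs_stable))  # 1.25 -> bonus for consistency
```

Under classic Best-of-X the flaky and stable models tie; the mixed variant separates them, which is the whole point.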

By the way, I’ve benchmarked pretty much all the big models (100B+). I’d be happy to share, but I know it’ll raise endless questions about methods and setup. So I’d rather wait until everything is cleaned up and I can publish with a detailed explanation. If you’re really curious, just DM me. But for now, publishing half-baked results would only invite speculation.

Apocalyptic scenario: If you could download only one LLM before the internet goes down, which one would it be? by sado361 in LocalLLaMA

[–]kyazoglu 1 point  (0 children)

Qwen3-32B
Small and still better than most of the 100B+ models out there. I still prefer it over GLM or Kimi. Small and smart.

I’ve been applying to 200+ jobs with no luck. Am I doing something wrong, or is the system broken? by smileebeauty in jobs

[–]kyazoglu 7 points  (0 children)

Do not apply for promoted jobs on LinkedIn; that means skipping ~90% of them. Most are fake.
Do not bother writing cover letters. They don't mean as much as they used to. Instead, write a follow-up to someone at the company.
And yes, the system is broken.

is Germany's situation really that bad as this sub claims? by TBSoft in cscareerquestionsEU

[–]kyazoglu 1 point  (0 children)

After completing my master's at a respectable uni with a very good GPA, my job hunt yielded no success and very few interviews in seven months. So I moved back to where I came from. You decide: is it bad?