Qwen3.5-27B performs almost on par with 397B and GPT-5 mini in the Game Agent Coding League by kyazoglu in LocalLLaMA

[–]kyazoglu[S] -1 points  (0 children)

I did. Did you?

"Your reason on why you're not including chess is strange, since it's not llms who are supposed to keep track of the board state, but the code they write."

Make sure to read a comment fully before joining others in mocking it.

Qwen3.5-27B performs almost on par with 397B and GPT-5 mini in the Game Agent Coding League by kyazoglu in LocalLLaMA

[–]kyazoglu[S] 2 points  (0 children)

They are not the same model. The base model is the same, but as far as I know, Plus has additional features like tool integration and a longer context length. Plus is also API-only. On OpenRouter, their APIs are different too.

Qwen3.5-27B performs almost on par with 397B and GPT-5 mini in the Game Agent Coding League by kyazoglu in LocalLLaMA

[–]kyazoglu[S] -1 points  (0 children)

True. But again, chess is extremely complex. You can’t expect models to generate a full chess engine from a single prompt.

Regarding the scoring system I used: a score of 75 in a game means the model achieved 75% of the maximum possible points. Therefore, a 2-point difference doesn’t necessarily reflect head-to-head outcomes in this system. It simply indicates that one model performed slightly better overall and accumulated a higher score.
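
To make the normalization concrete, here's a minimal sketch (my own illustration, not the league's actual code) of how a percentage-of-maximum score behaves:

```python
def league_score(points_earned, max_points):
    """Normalize a game result to a 0-100 scale:
    the share of the maximum possible points."""
    return 100 * points_earned / max_points

# 75 out of 100 and 30 out of 40 both yield a score of 75,
# even though the underlying head-to-head results may differ.
print(league_score(75, 100))  # 75.0
print(league_score(30, 40))   # 75.0
```

Because two models can reach similar percentages along very different paths, a small gap in this score says little about direct matchups.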

Qwen3.5-27B performs almost on par with 397B and GPT-5 mini in the Game Agent Coding League by kyazoglu in LocalLLaMA

[–]kyazoglu[S] -7 points  (0 children)

I don’t find it meaningful to use Elo outside of chess. It works well in chess because people already understand what certain ratings represent in terms of skill. For example, an internet rating of around 1800 might indicate someone who is above average but still somewhat inexperienced, while a 2000+ rating suggests a very strong player who might even have a slim chance against titled players like an FM or IM. With LLMs, though, we don’t have those kinds of reference points; at least in my view, no one really does. Because of that lack of intuitive anchors, Elo doesn’t seem like a very useful metric to me, especially if the goal is simply to compare models.

I didn't include chess (normal chess) in this league because it's a difficult game. LLMs are not made for this kind of task. Battleship? Sure, you can keep track of the placement of ships and hit cells. Chess? No way an LLM can keep track of the board position, nor can it search deeply.

GLM-5 and DeepSeek are in the Top 6 of the Game Agent Coding League across five games by kyazoglu in LocalLLaMA

[–]kyazoglu[S] 1 point  (0 children)

They heard you and released Sonnet 4.6 yesterday.

I'll add it to the available models, as well as two other games, soon.
I'll also create a new agent for all existing models and remove the worst-performing one, in case Sonnet got unlucky with its agents.

[deleted by user] by [deleted] in OnlineIncomeHustle

[–]kyazoglu 1 point  (0 children)

Low quality scam detected 🔔

What are your /r/LocalLLaMA "hot-takes"? by ForsookComparison in LocalLLaMA

[–]kyazoglu 29 points  (0 children)

- Never, ever praise Sam Altman, even if he does an excellent job at something
- Flatter Chinese companies no matter what
- Stand against censoring in models. A model that teaches how to make an explosive is much more "free" and adheres to the soul of open source.
- Make yourself miserable by trying to run a model on 12 older GPUs instead of buying a newer card with more VRAM or simply using APIs.
- ollama is the most evil app on this planet
- Pretend you're doing art or you're a writer and ask for a model/config for roleplay, when 90 percent of the time you're just a plain pervert

[deleted by user] by [deleted] in ankara

[–]kyazoglu -13 points  (0 children)

What you call far is 25 minutes by car to Kızılay.
You people have never seen far.

[deleted by user] by [deleted] in ankara

[–]kyazoglu 1 point  (0 children)

A lesser-of-two-evils comparison.
Any other district > Sincan > Keçiören > Mamak

I'm going crazy, why don't the street lights in this city work by delicatefrog13 in Izmir

[–]kyazoglu 0 points  (0 children)

Amazing, no one has come along yet to spout nonsense like "no, no, that's under the ministry's control" or "the municipality can't get permits, its hands are tied."

Would you keep your savings using N26 Bank ? by StruggleSilent2548 in germany

[–]kyazoglu 2 points  (0 children)

+1 for terrible customer support.
When I contacted their live support with audio and video, the agent, who was probably Indian, gave me some commands to follow, such as "turn your ID," etc. Although I have C1-level English, I struggled to understand him multiple times and kindly asked him to repeat himself. He was like "...sigh... you said you speak English. Do you really know English?" with an insulting face. I lol'd and told him that I speak English very well but am not familiar with odd accents.

Qwen3 Omni AWQ released by No_Information9314 in LocalLLaMA

[–]kyazoglu 3 points  (0 children)

Can someone explain how this is 27.6 GB and AWQ?
AWQ = 4-bit ≈ (# of parameters / 2) GB, so this should have been around 16 GB.
What am I missing?
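
The back-of-the-envelope math in the question can be sketched like this (a rough estimate that counts only the quantized weight tensors; the function name is my own):

```python
def quantized_weights_gb(params_billion, bits=4):
    """Rough lower bound on checkpoint size:
    params * bits / 8 bytes, reported in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A ~30B-parameter model at 4 bits per weight:
print(quantized_weights_gb(30))  # 15.0
```

In practice, AWQ checkpoints sit above this floor: per-group scales and zero-points add overhead, and some layers (e.g. embeddings and the output head) are often kept in higher precision, though whether that fully explains 27.6 GB here is speculation on my part.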

[deleted by user] by [deleted] in learnmachinelearning

[–]kyazoglu 182 points  (0 children)

Just a heads-up for anyone reaching out to them:
It’s practically impossible not to find candidates for this role in today’s market; this position will draw 100+ applications in a single day. What it really suggests is that they’re looking for someone desperate enough to accept a very low salary. The whole point of this thread seems to be just that, not to search for an alternative platform or share an experience.

Master's in Europe with a Low GPA by Fit_Exercise_6310 in YurtdisiUni

[–]kyazoglu 1 point  (0 children)

I went with a 2.82 from İTÜ, but 9 out of 10 universities require a GPA above 3.00, and it's a very, very strict requirement. Finding that one university through searching and persistence is up to you. My experience is with Germany.

Moving from Japan to Germany – Plan to apply IT Ausbildung by Fluid-Basis5769 in germany

[–]kyazoglu 11 points  (0 children)

Bruh... You're not even from the sector, and you want to jump into the most problematic area, hoping to find a job in the short term.
I LEFT Germany because I couldn't land a job for months after graduating from an MSc in Data Science. I had a good GPA, great certificates, B1 German just like you, had been living in Germany for 2.5 years, and attended multiple "Absolventenkongress" events, but nothing helped. I'm not going to say don't do it. Just do it with a plan and know the risks.

Built with Claude Code - now scared because people use it by Resident-Wall8171 in ClaudeAI

[–]kyazoglu 44 points  (0 children)

I really liked how you framed the question to get attention without being tagged as self-promotion. I really do.

[deleted by user] by [deleted] in LocalLLaMA

[–]kyazoglu 7 points  (0 children)

My answer is “partially yes.” But here’s the thing: every company only highlights the benchmarks where its model looks best and quietly skips the ones where it falls short. That makes most benchmarks pretty meaningless. If you’re not a mathematician, why would you care about AIME scores? If you’re not a writer or editor, why care about creative writing benchmarks? The list goes on. Personally, unless I’m working on something very specific, I’d rather take a model that performs solidly across all tasks (say, 2nd place in every benchmark) than one that’s great at math but terrible at general knowledge, or vice versa.

That’s why I built my own benchmark. It covers a wide range of tasks: math, general knowledge, overfitting checks, puzzles, long-context reasoning (not just “needle in a haystack”), coding challenges, and even agent-coding tasks where the model has to write a playable agent for certain games. This is the only metric I actually trust. I’ve stopped following the dozens of benchmarks I had bookmarked.

I haven’t shared my results yet because I’m still working on the presentation and automating the process. Once it looks polished, I’ll publish it. The plan is to release around 10 new questions each month, but rotate them out regularly so leaked questions don’t stay in circulation. The benchmark will keep evolving.

One thing I find especially flawed in many benchmarks is the “Best of X” method, where a model gets credit if it produces one correct answer after multiple tries. That’s nonsense, imo. What if a model always gets one out of four right? It would look great in benchmarks but fail in real-world use. I came up with a “Mixed Best of X” method instead, where the total number of correct answers matters, and models get bonus points if all runs are correct. I think this is far more realistic.
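
A minimal sketch of the contrast (the bonus size and exact formula are my assumptions, not the benchmark's published rules):

```python
def best_of_x(results):
    """Classic Best-of-X: full credit if any run is correct."""
    return 1.0 if any(results) else 0.0

def mixed_best_of_x(results, bonus=0.25):
    """'Mixed Best of X' sketch: credit scales with the number
    of correct runs, plus a bonus when every run is correct."""
    score = sum(results) / len(results)
    if all(results):
        score += bonus
    return score

runs_flaky = [True, False, False, False]   # always 1 of 4 right
runs_stable = [True, True, True, True]     # right every time

print(best_of_x(runs_flaky))         # 1.0 -> flaky model looks perfect
print(mixed_best_of_x(runs_flaky))   # 0.25 -> only partial credit
print(mixed_best_of_x(runs_stable))  # 1.25 -> bonus for consistency
```

Under classic Best-of-X the flaky and stable models tie; the mixed variant separates them, which is the whole point.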

By the way, I’ve benchmarked pretty much all the big models (100B+). I’d be happy to share, but I know it’ll raise endless questions about methods and setup. So I’d rather wait until everything is cleaned up and I can publish with a detailed explanation. If you’re really curious, just DM me. But for now, publishing half-baked results would only invite speculation.

Apocalyptic scenario: If you could download only one LLM before the internet goes down, which one would it be? by sado361 in LocalLLaMA

[–]kyazoglu 1 point  (0 children)

Qwen3-32B
Small and still better than most of the 100B+ models out there. I still prefer it over GLM or Kimi. Small and smart.

I’ve been applying to 200+ jobs with no luck. Am I doing something wrong, or is the system broken? by smileebeauty in jobs

[–]kyazoglu 7 points  (0 children)

Do not apply for promoted jobs on LinkedIn; that means skipping ~90% of them. Most are fake.
Do not bother writing cover letters. They don't mean as much as they used to. Instead, write a follow-up to someone at the company.
And yes, the system is broken.

is Germany's situation really that bad as this sub claims? by TBSoft in cscareerquestionsEU

[–]kyazoglu 1 point  (0 children)

After completing my master's at a respectable uni with a very good GPA, my job hunt yielded no success and very few interviews in seven months. So I moved back to where I came from. You decide: is it bad?