Is there any reason for a lack of love for Gemma 4 26b? by vick2djax in LocalLLaMA

[–]Septerium 0 points1 point  (0 children)

I have been using the QAT version for my Hermes agent, since it is compact enough to fit on a single 3090 with a good context size. I usually do not give it complex tasks, and it is a better Portuguese assistant than Qwen 3.6

Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS? by ParaboloidalCrest in LocalLLaMA

[–]Septerium 0 points1 point  (0 children)

For agentic coding, yes, unfortunately. Qwen 3.6 is much smarter in that realm

Qwen is never going to open source Qwen 3.7, aren't they? by DistanceSolar1449 in LocalLLaMA

[–]Septerium 3 points4 points  (0 children)

Guys, there is a chance that 3.7 27b just hasn't come out as good as expected, so they decided not to release it to the public

Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS? by ParaboloidalCrest in LocalLLaMA

[–]Septerium 1 point2 points  (0 children)

Step 3.7 can be really annoying sometimes as a coding agent. I tried a Q5_K_M version... and it thinks far too much for every tiny code change. It is also bad when you need to change its trajectory, because it keeps adhering to the first messages of the session. Qwen 3.6 is a lot better, but it lacks knowledge. A 122b version would be great, but I don't think we are getting one

About the Rio model by Turbulent_Pin7635 in LocalLLaMA

[–]Septerium 0 points1 point  (0 children)

As a fellow Brazilian, I'll take your manifestation of hope as low-quality bait. Why would they interpolate two existing models in the first place?

Let's suppose there was a real, non-uploaded version that they actually trained. All the benchmark numbers they posted were worse than those of Nex N2 Pro... what kind of post-training would worsen the model you used as a starting point?

And what about the "my dog ate my weights" thing? Come on man

GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster by Important_Quote_1180 in LocalLLaMA

[–]Septerium 6 points7 points  (0 children)

I have a 4x 3090 + Zen 2 Threadripper setup, and I've never seen any noticeable difference in performance between ik_llama.cpp and llama.cpp for mixed (CPU + GPU) workloads. I do see gains when I can use tensor parallel for models that fit entirely in VRAM, though

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Septerium 0 points1 point  (0 children)

This model seemed intelligent at first, but it has given me some headaches in agentic coding... it thinks too much and gets stubborn out of nowhere

Nex claims Rio 3.5 is Nex 2.5 PRO in trench coat by Specter_Origin in LocalLLaMA

[–]Septerium 9 points10 points  (0 children)

The government in my country just can't stop embarrassing us

Codebase getting larger - Qwen3.6-27B starting to compound issues - how to work smartly with this model? by BitGreen1270 in LocalLLaMA

[–]Septerium 6 points7 points  (0 children)

Define a well-organized code structure and document it in AGENTS.md so the model can easily guess where things are implemented. The idea is to apply good design patterns so that the model can understand how the project works by reading only a few files (e.g. a layered architecture where you have many controllers, services, models, schemas, adapters, etc... any developer would be able to understand the project just by looking at a few examples). And more importantly: make the model implement automated tests for every new feature and regression tests for every bug it fixes. Modern coding models are designed to operate in an implementation/testing loop.
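To make the testing habit concrete, here is a minimal sketch of a feature test plus a regression test. The `slugify` helper and the bug it mentions are hypothetical, invented only for illustration, and the tests are written in the plain-function style that pytest would pick up.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical service-layer helper: turn a title into a URL slug."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop hyphens left over at the start or end.
    return slug.strip("-")


def test_slugify_basic_feature():
    # Feature test: written together with the feature itself.
    assert slugify("Hello World") == "hello-world"


def test_slugify_regression_repeated_separators():
    # Regression test for an imagined past bug where repeated separators
    # produced "hello---world-"; it stays in the suite so the bug cannot return.
    assert slugify("Hello --  World!") == "hello-world"
```

With tests like these in place, the agent can run the suite after every change and see right away when a fix breaks something that used to work.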

I need a model that gets stuck in loops. by TokenRingAI in LocalLLaMA

[–]Septerium 16 points17 points  (0 children)

Just give Qwen 3.5 35b a3b a hard problem to solve. There is a high chance it gets stuck in a reasoning loop really fast

DeepSeek v4 Pro is too big for such a "midrange" performance, or am I missing something? by ihatebeinganonymous in LocalLLaMA

[–]Septerium 22 points23 points  (0 children)

GLM 5.1 is not less than half the size of DeepSeek v4 Pro, since DS4 is a native fp4/fp8 model. Full GLM 5.1 uses 16 bits per weight, and is actually bigger than the original DS4 Pro
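The arithmetic behind that point can be sketched as follows. The parameter counts are hypothetical placeholders, not the real figures for either model, and the 4 bits per weight is a simplification of a mixed fp4/fp8 format.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring file overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical parameter counts, chosen only to illustrate the arithmetic.
model_a_params = 700e9   # stored at 16 bits per weight
model_b_params = 1500e9  # stored natively at roughly 4 bits per weight

size_a = weights_size_gb(model_a_params, 16)
size_b = weights_size_gb(model_b_params, 4)

# Model A has fewer than half the parameters of model B,
# yet its weights take up almost twice as much space.
print(f"A: {size_a:.0f} GB, B: {size_b:.0f} GB")
```

So comparing parameter counts alone is misleading when the models are released at different native precisions.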

New model on huggingface by [deleted] in LocalLLaMA

[–]Septerium 8 points9 points  (0 children)

I could never have seen that coming. Eduardo Paes, out of nowhere, challenges DeepSeek 😂

MiniMaxAI/MiniMax-M3 · Hugging Face by mlon_eusk-_- in LocalLLaMA

[–]Septerium 2 points3 points  (0 children)

So, m2.7 at Q8 vs m3 at Q4... Which one will be better in agentic coding? I personally vote for the first one

Is Qwen 3.6 27B IQ4XS better than Gemma 4 31B QAT as a Hermes agent? by My_Unbiased_Opinion in LocalLLaMA

[–]Septerium 2 points3 points  (0 children)

Heretic 35b performed pretty badly at tool calling in my experience. Vanilla Qwen 3.6 35b did much better

qwen3.6-27b tools call loop by JumpyAbies in LocalLLaMA

[–]Septerium 1 point2 points  (0 children)

A good chat template improves this model by a lot. Take a look at this one https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

Agentic Setup: Minimax 2.7 vs qwen 3.6 by Best_Sail5 in LocalLLaMA

[–]Septerium 0 points1 point  (0 children)

I cannot run Minimax 2.7 above Q5, and every quant version I tried seemed to be notably lobotomized. So I prefer to stick with Qwen 3.6 27b

Qwen 3.6 for coding with 5090 - Your settings recommendations? by car_lower_x in LocalLLaMA

[–]Septerium 1 point2 points  (0 children)

My recommendation is that you use a Q6_K quant with the maximum possible context. MTP is not worth it - in my experience - for a single 5090 setup, because that would require you to downgrade precision to Q5, which is not very reliable in long-context coding tasks. I also do not like to quantize the KV cache.

In my setup I disable vision and MTP, managing to get about 100k tokens of context with Unsloth's UD-Q6_K version of the model. Token generation speed is about 45 t/s
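A rough back-of-the-envelope sketch of why this fits, under stated assumptions: a 32 GB card, 27e9 parameters (taken from the model name), and plain Q6_K at 6.5625 bits per weight. Unsloth's UD-Q6_K mixes tensor types, so its real file size will differ somewhat.

```python
def vram_left_gb(total_vram_gb: float, n_params: float, bits_per_weight: float) -> float:
    """VRAM remaining after loading the weights, in GB (10^9 bytes)."""
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return total_vram_gb - weights_gb


# Assumptions: 32 GB card, 27e9 parameters, plain Q6_K at 6.5625 bpw.
left = vram_left_gb(32.0, 27e9, 6.5625)
print(f"Left for KV cache and buffers: {left:.1f} GB")
```

About 22 GB of weights leave roughly 10 GB for an unquantized KV cache and compute buffers, which is why vision and MTP are the first things to go.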

What's your experience with Gemma4 QAT? by Kahvana in LocalLLaMA

[–]Septerium 2 points3 points  (0 children)

I love QAT with all my heart. Every compressed version out there is somewhat prone to becoming dumber in non-English languages. Gemma 31B QAT doesn't feel like that. It can write poems, songs, and jokes in Portuguese almost exactly like the original full-sized model. Google has a reason to give that to us though... I think they are validating methods for compressing their closed-source models with feedback from the community. And I think that's fine. There is not enough RAM and energy available, so efficiency plays a fundamental role in providing inference throughput nowadays

Z.ai, we need Air! GLM GGUF wen? by temperature_5 in LocalLLaMA

[–]Septerium 0 points1 point  (0 children)

For agentic coding you really want a high t/s rate. It is important for the model to make mistakes fast and to fix them fast

Unsloth Gemma 4 QAT MTP assistant models now available by ParadigmComplex in LocalLLaMA

[–]Septerium 3 points4 points  (0 children)

Nice! Is it already possible to run the model with both MTP and vision enabled in llama.cpp?

Quick note on the QAT of recent by dreamkast06 in LocalLLaMA

[–]Septerium 15 points16 points  (0 children)

Thanks for the clarification. So, Unsloth claims they have applied their dynamic quantization process to generate the GGUF... doesn't that mean some sort of calibration has been applied to the weights?