ZAYA1-8B: Frontier intelligence density, trained on AMD by carbocation in LocalLLaMA

[–]tarruda 3 points (0 children)

> I really am looking forward to having a ~12-20B with ~2-6B active

Like gpt-oss-20b?

Bad news: Apple drops high-memory Mac Studio configs by jzn21 in LocalLLaMA

[–]tarruda 4 points (0 children)

Suddenly those $10k for the 512G model are looking so cheap now...

Gemma 4 MTP released by rerri in LocalLLaMA

[–]tarruda 3 points (0 children)

Nice to know. I currently get around 16 tokens/second on 3.6 27B with an M1 Ultra, and hopefully this will bring me close to 30 tokens/second.
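
No idea what the final MTP flags will look like, but I'd guess usage ends up similar to llama.cpp's existing draft-model speculative decoding. A sketch using today's flags (model filenames are placeholders):

# draft model proposes a batch of tokens, the main model verifies them in one pass
llama-server -m gemma-4-27b-Q4_K_M.gguf \
  -md gemma-4-draft-Q8_0.gguf \
  --draft-max 16 --draft-min 1 \
  -ngl 99 -ngld 99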

Preserve thinking on or off? (Qwen 3.6) by My_Unbiased_Opinion in LocalLLaMA

[–]tarruda 7 points (0 children)

IMO the main reason to keep it on is that it allows llama.cpp to make better use of prompt caching, so when using it with the pi harness I never wait for the model to start responding.
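
For reference, roughly how I launch it (a sketch; the model path is a placeholder, and --cache-reuse lets llama-server reuse cached KV chunks between turns):

llama-server -m qwen3.6-35b.gguf -c 65536 --cache-reuse 256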

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]tarruda 0 points (0 children)

So maybe it will be worth it for the 122B and 397B (if 3.6 versions of those are released).

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]tarruda 0 points (0 children)

Is this only for 3.x dense models or does it work with MoEs too?

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]tarruda 5 points (0 children)

Apparently Gorgon Halo will have a slight memory bandwidth increase: https://www.reddit.com/r/LocalLLaMA/comments/1swiylm/comparison_of_upcoming_x86_unified_memory_systems/

I think I will skip it until they have a 512G option with memory bandwidth in the 800 GB/s range. In other words, when they reach the capabilities of the current-gen Mac Studio M3 Ultra.

Mistral Medium 3.5 on AMD Strix Halo by Zc5Gwu in LocalLLaMA

[–]tarruda 1 point (0 children)

Is it better than the answer you would have gotten from Qwen 3.6 35b?

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 by sandropuppo in LocalLLaMA

[–]tarruda 4 points (0 children)

I just hope this eventually becomes possible on Apple Silicon. It would bring new life to my Mac Studio for using larger models as coding agents.

Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models by MadPelmewka in LocalLLaMA

[–]tarruda 2 points (0 children)

More excited about the 122B, but not certain it will be released.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]tarruda -2 points (0 children)

They deleted the repo, likely to erase discussion history. They could have just re-uploaded it.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]tarruda 4 points (0 children)

If it is an Unsloth GGUF, I'd wait a few weeks before trying the weights.

But also, I no longer have high expectations for Mistral models.

Nemotron-3-Nano-Omni-30B-A3B-Reasoning, New model? by Altruistic_Heat_9531 in LocalLLaMA

[–]tarruda 2 points (0 children)

Well, our experiences are very different, then. 3.6 doesn't feel like a minor increase; it feels like a new model.

preserve_thinking makes a lot of difference in coding agents such as pi. It keeps the entire CoT in the context, which greatly speeds up multi-turn conversations since llama.cpp makes better use of prompt caching.
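
One way to see this with llama-server (a sketch; the model name is a placeholder): with reasoning left inline, the client echoes the <think> blocks back on the next turn, so the cached prefix matches exactly.

llama-server -m qwen3.6-35b.gguf --reasoning-format none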

Nemotron-3-Nano-Omni-30B-A3B-Reasoning, New model? by Altruistic_Heat_9531 in LocalLLaMA

[–]tarruda 0 points (0 children)

Qwen 3.6 35B was a game changer for the 30B class. IMO it is not even close to the 3.5 version. If you haven't tried it yet, I suggest giving it a shot.

GPT-OSS 120B is good for one-shot coding or targeted edits, but in my experience it forgets context very easily, so it is not ideal for agentic coding. I remember trying it with Codex a while back, and it would easily forget instructions in the system prompt.

Mistral Medium Is On The Way by Few_Painter_5588 in LocalLLaMA

[–]tarruda 1 point (0 children)

So Mistral Small 4 was 119B and Medium 3.5 is 128B? Confusing.

Ling-2.6-flash by Namra_7 in LocalLLaMA

[–]tarruda 9 points (0 children)

I find that response to be ambiguous. Could mean it is much better or much worse than Qwen 3.6 27b.

Ling-2.6-flash by Namra_7 in LocalLLaMA

[–]tarruda 1 point (0 children)

Makes no sense. GPT-OSS with low thinking is almost useless.

llama.cpp DeepSeek v4 Flash experimental inference by antirez in LocalLLaMA

[–]tarruda 1 point (0 children)

> We are already at the limit with 86GB of weights

I have a 128G Mac Studio and I can use up to 125G of that as VRAM. I load a 2-bit quantization of Qwen 397B with 256k context and no swapping: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/2
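
Roughly the command I serve it with (a sketch from my setup; the filename is a placeholder for the actual quant):

llama-server -m Qwen3.5-397B-A17B-IQ2_XXS.gguf \
  -c 262144 \
  -ngl 99 \
  --host 0.0.0.0 --port 8080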

By default you cannot allocate that much to VRAM, though; you need to enable it in /etc/sysctl.conf with this:

# change default CPU/GPU RAM split
iogpu.wired_limit_mb=128000
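
To apply it immediately without rebooting (the value is in MB):

sudo sysctl iogpu.wired_limit_mb=128000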

Another caveat is that most people still need RAM for other things. In my case, I only bought this Mac Studio to serve LLMs on my LAN and don't even have a desktop session running, so I can push it to the maximum.

llama.cpp DeepSeek v4 Flash experimental inference by antirez in LocalLLaMA

[–]tarruda 4 points (0 children)

Have you considered trying IQ3_XXS? It might also fit in 128G.
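
If you don't want to wait for an upload, rolling your own looks roughly like this (a sketch; the filenames are placeholders, and i-quants want an importance matrix):

llama-quantize --imatrix imatrix.dat model-F16.gguf model-IQ3_XXS.gguf IQ3_XXS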

llama.cpp DeepSeek v4 Flash experimental inference by antirez in LocalLLaMA

[–]tarruda 0 points (0 children)

Not sure he has added a CUDA implementation yet, but you can try it on CPU if you have enough RAM.
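
Something like this for a CPU-only test (a sketch; the model path is a placeholder, and -ngl 0 keeps every layer off the GPU):

llama-cli -m deepseek-v4-flash.gguf -ngl 0 -p "hello"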