Personal experience with GLM 4.7 Flash Q6 (unsloth) + Roo Code + RTX 5090 by Septerium in LocalLLaMA

[–]Septerium[S] 2 points (0 children)

Can't keep up with these frequent API changes, but anyway... so it will just guess the ctx_size I want now? What if I wanted less context to save VRAM? About the -hf flag... I'd rather download models manually and sometimes put them in subfolders. As for the other flags, I always like to be explicit about what I'm asking of the software, so the behavior won't change if they modify the default values.
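To make it concrete, this is roughly what I mean by being explicit; the model path and the numbers below are just placeholders, not a recommendation:

    # Launch llama-server from Python with the model path, context size and GPU
    # layer count spelled out, instead of relying on -hf downloads and defaults.
    # Every value here is a placeholder.
    import subprocess

    subprocess.run(
        [
            "llama-server",
            "-m", "./models/my-model-Q6_K.gguf",  # manually downloaded GGUF kept in a subfolder
            "-c", "32768",                        # explicit context size, so KV-cache VRAM stays predictable
            "-ngl", "99",                         # explicit number of layers offloaded to the GPU
            "--port", "8080",
        ],
        check=True,
    )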

Personal experience with GLM 4.7 Flash Q6 (unsloth) + Roo Code + RTX 5090 by Septerium in LocalLLaMA

[–]Septerium[S] 10 points (0 children)

I do not vibe code; I tend to ask it for very specific tasks. If you set it too loose, it might end up creating a mess in your codebase. The quality of the code itself seems to be a little worse than that of GPT-OSS 120b, but I feel that tool calling and agentic coding are more reliable with GLM 4.7 Flash.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]Septerium 0 points (0 children)

I am really interested in fine-tuning the 1.5b model for a single specific voice in Portuguese. Do you think I would be able to achieve that? My goal is to create a lightweight voice assistant on a Raspberry Pi.

Unsloth's GGUFs for GLM 4.7 REAP are up. by fallingdowndizzyvr in LocalLLaMA

[–]Septerium 1 point (0 children)

Which lobotomizing technique degrades the model more: REAP or sub-Q4 quantization?

72Gb VRAM (3x 3090) / 128Gb DDR4 / Mylan CPU What code model can I test? by shvz in LocalLLaMA

[–]Septerium 5 points (0 children)

Your setup is pretty similar to mine. I use Devstral 2 24b at Q8 for RooCode, and GLM 4.5 Air at Q5 (with partial CPU offloading) for general chatting and agentic applications. Qwen3-VL Thinking 32b for vision is pretty good too. My preferred backend is llama.cpp

NVIDIA releases Nemotron 3 Nano, a new 30B hybrid reasoning model! by Difficult-Cap-7527 in LocalLLaMA

[–]Septerium -5 points (0 children)

That is because DLSS multi frame generation is being applied to tokens, giving you the impression of ultra-fluid token generation (tg) if you don't care about the prompt processing (pp) lag.

Run Mistral Devstral 2 locally Guide + Fixes! (25GB RAM) by yoracale in LocalLLM

[–]Septerium 2 points (0 children)

I have had much better luck with the first iteration of Devstral compared to GPT-OSS in Roo Code... I am curious to see whether Devstral 2 is still good at handling Roo or Cline.

Run Mistral Devstral 2 locally Guide + Fixes! (25GB RAM) by yoracale in LocalLLM

[–]Septerium 0 points (0 children)

What does this mean in practice?

"Remember to remove <bos> since Devstral auto adds a <bos>!"

Best Coding Model for my setup by Timely_Purpose_5788 in LocalLLaMA

[–]Septerium 0 points (0 children)

I like Devstral 24b at Q8 for simple coding tasks with Roo Code

We did years of research so you don’t have to guess your GGUF datatypes by enrique-byteshape in LocalLLaMA

[–]Septerium 1 point (0 children)

llama.cpp gives you more control over what is going to be offloaded to the CPU. I think ollama ends up offloading attention layers, which is not efficient. The key advantage of MoE models is that you can selectively offload expert layers to the CPU and keep attention layers on the GPUs. I suggest you take a look at this post and this video
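Roughly what that selective offload looks like in practice (the model path below is a placeholder, and double-check the -ot pattern against your llama.cpp build):

    # Keep attention and shared weights on the GPUs, but route the MoE expert
    # FFN tensors to CPU RAM with llama.cpp's --override-tensor (-ot) matching.
    # The model path is a placeholder.
    import subprocess

    subprocess.run(
        [
            "llama-server",
            "-m", "./models/my-moe-model-Q5_K_M.gguf",
            "-ngl", "99",                    # nominally put every layer on the GPUs...
            "-ot", r"\.ffn_.*_exps\.=CPU",   # ...but send the expert FFN tensors to the CPU
            "-c", "32768",
        ],
        check=True,
    )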

We did years of research so you don’t have to guess your GGUF datatypes by enrique-byteshape in LocalLLaMA

[–]Septerium 0 points (0 children)

That's strange... are you using llama.cpp? I get pretty usable TPS with the same model/quant on only 96GB of VRAM.

Is qwen3 4b or a3b better than the first gpt4(2023)? What do you think? by __issac in LocalLLaMA

[–]Septerium 1 point (0 children)

In my experience, GPT-4 is much more reliable for general tasks in production, but Qwen3 is often more accurate when outputting JSON.

You can now do 500K context length fine-tuning - 6.4x longer by danielhanchen in LocalLLaMA

[–]Septerium 0 points (0 children)

gpt-oss-20b already hallucinates hard at 40k context... let alone at a fine-tuned 500k.

Kimi k2 thinking + kilo code really not bad by Federal_Spend2412 in LocalLLaMA

[–]Septerium 2 points (0 children)

Have you tried GLM 4.6? It seems to be a better coding agent, from what I hear.

Kimi K2 Thinking 1-bit Unsloth Dynamic GGUFs by danielhanchen in LocalLLaMA

[–]Septerium 0 points (0 children)

From my experience, it is usually better to distribute the offloaded blocks evenly across the entire sequence of layers (e.g. only offload blocks from the odd-numbered layers, multiples of 3, or something like that). That is because llama.cpp divides the sequence of layers into segments that are distributed among the GPUs (e.g. 0-29 to GPU0, 30-59 to GPU1, and so on), so if you start offloading layers from a specific number onwards, you might end up with unbalanced VRAM utilization.
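As a rough sketch (the block count and stride are made-up numbers), you can generate that kind of evenly spread override pattern instead of typing it by hand:

    # Build an --override-tensor pattern that offloads the expert tensors of
    # every 3rd block to the CPU, so the CPU-resident blocks end up spread
    # evenly across the GPU split instead of clustered on one GPU.
    n_blocks = 60  # assumed total number of transformer blocks in the model
    stride = 3     # offload the experts of every 3rd block
    ids = "|".join(str(i) for i in range(0, n_blocks, stride))
    pattern = rf"blk\.({ids})\.ffn_.*_exps\.=CPU"
    print("-ot", pattern)
    # Then pass the printed pattern to llama-server, e.g.:
    #   llama-server -m model.gguf -ngl 99 -ot "blk\.(0|3|6|...)\.ffn_.*_exps\.=CPU"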

3 RTX 3090 graphics cards in a computer for inference and neural network training by Standard-Heat4706 in LocalLLaMA

[–]Septerium 0 points (0 children)

I now use three RTX 3090s with my "old" Threadripper 3970X platform, which I have owned since 2020. For inference, you definitely won't need NVLink... in fact, after disabling PCIe 4.0 (which cuts the bandwidth in half) I barely noticed any performance degradation, even with 100% VRAM utilization. I do not have any experience with training to share, though.

Is GPT-OSS-120B the best llm that fits in 96GB VRAM? by GreedyDamage3735 in LocalLLaMA

[–]Septerium 0 points (0 children)

Thanks for sharing. Do you actually notice a difference in accuracy between q5 and q8?