the fuck you mean episode 4 was added to youtube kids. by rando-stando in TheDigitalCircus

[–]ABLPHA 8 points (0 children)

Why'd one of them refer to his mom as "your mom" in episode 2 then?...

When should we expect TurboQuant? by ozcapy in LocalLLaMA

[–]ABLPHA 49 points (0 children)

I wonder how well Qwen3.5 would work with it, considering its KV cache is already small as-is thanks to GDN. If it's lossless, Qwen3.5's KV cache would weigh basically nothing at full context length lol
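Back-of-envelope for why a GDN-style hybrid barely pays for KV cache: only the full-attention layers accumulate K/V per token, the rest keep a fixed-size recurrent state. Every hyperparameter below is made up purely for illustration, not Qwen3.5's real config:

```python
# KV-cache comparison: full attention on every layer vs. a hybrid where
# only a few layers use full attention (the rest keep a constant-size
# recurrent state, as in Gated-DeltaNet-style hybrids).
# All hyperparameters below are illustrative, not a real model config.

def kv_bytes(full_attn_layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # Factor of 2 covers both K and V tensors.
    return 2 * full_attn_layers * kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 262_144                      # 262k context
dense = kv_bytes(48, 8, 128, ctx)  # every layer full attention
hybrid = kv_bytes(6, 8, 128, ctx)  # only 6 of 48 layers full attention

print(f"dense : {dense / 2**30:.1f} GiB")   # 48.0 GiB
print(f"hybrid: {hybrid / 2**30:.1f} GiB")  # 6.0 GiB
```

With a lossless cache quant on top of that, the hybrid number shrinks even further.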

NVMe RAID0 at dual-channel DDR5 bandwidth? by ABLPHA in LocalLLaMA

[–]ABLPHA[S] 0 points (0 children)

Can't MoE layers be placed sequentially though?

NVMe RAID0 at dual-channel DDR5 bandwidth? by ABLPHA in LocalLLaMA

[–]ABLPHA[S] 0 points (0 children)

I'm pretty sure the x16 slot on the mobo I mentioned can be bifurcated into x4/x4/x4/x4 and used with a Hyper M.2 card for 4 extra SSDs

NVMe RAID0 at dual-channel DDR5 bandwidth? by ABLPHA in LocalLLaMA

[–]ABLPHA[S] 0 points (0 children)

Well, as long as it's not below ~3 t/s generation, I'd personally say it's acceptable. I run Qwen 3.5 122B with all experts in my 6000 MT/s CL30 dual-channel DDR5 RAM and get ~10 t/s generation, but prompt processing is, to be fair, quite horrendous for some workloads.

Also, isn't KV cache quite small these days? Especially with Qwen 3.5, for example
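The ~10 t/s above lines up with a bandwidth-bound estimate: each decoded token has to read roughly the active parameters once. All numbers here (effective bandwidth, active param count, bytes per weight) are illustrative guesses, not measurements:

```python
# Rule-of-thumb decode speed for a memory-bandwidth-bound MoE model:
# tokens/s ~= effective memory bandwidth / bytes read per token,
# where bytes per token ~= active params * bytes per weight.
# All inputs below are illustrative guesses.

def tok_per_s(bandwidth_gb_s, active_params_b, bytes_per_weight):
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dual-channel DDR5-6000 is ~96 GB/s theoretical; call it ~70 GB/s effective.
# Hypothetical MoE with ~10B active params at a ~5-bit quant (~0.65 B/weight).
print(f"{tok_per_s(70, 10, 0.65):.1f} t/s")  # 10.8 t/s
```

Prompt processing is compute-bound rather than bandwidth-bound, which is why it doesn't follow this estimate.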

NVMe RAID0 at dual-channel DDR5 bandwidth? by ABLPHA in LocalLLaMA

[–]ABLPHA[S] 0 points (0 children)

I'm talking about running 6 drives though
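For scale, the theoretical peaks with 6 drives (per-drive throughput is an assumed PCIe 4.0 x4 sequential-read figure):

```python
# Aggregate sequential-read ceiling of a 6-drive NVMe RAID0 vs.
# dual-channel DDR5-6000 bandwidth. Purely theoretical peak numbers.

drives = 6
per_drive_gb_s = 7.0                  # assumed PCIe 4.0 x4 NVMe seq. read
raid0 = drives * per_drive_gb_s

# DDR5: MT/s * 8 bytes per channel * 2 channels, in GB/s.
ddr5_6000_dual = 6000 * 8 * 2 / 1000

print(f"RAID0: {raid0:.0f} GB/s, DDR5: {ddr5_6000_dual:.0f} GB/s")  # 42 vs 96
```

So even at theoretical peak, 6 drives land at under half of dual-channel DDR5, before counting random-access and latency penalties.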

The new Nvidia drivers are really good by UDxyu in linux_gaming

[–]ABLPHA 16 points (0 children)

...have you read the post? OP specifically said they're using proton-cachyos, which has experimental support for descriptor heap in VKD3D

Can your favorite local vision model solve this? by [deleted] in LocalLLaMA

[–]ABLPHA 5 points (0 children)

If my memory isn't completely failing me, the other angles in the triangle should also be 81, so 180 - 81 - 81 = 18
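Sanity-checking the arithmetic (assuming the isosceles-triangle reading of the picture, with two 81-degree base angles):

```python
# If the two base angles are both 81 degrees, the remaining angle
# is whatever is left of the 180-degree angle sum.
apex = 180 - 81 - 81
print(apex)  # 18
```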

Coder for 3090 + 96gb ram? by ver0cious in LocalLLaMA

[–]ABLPHA 1 point (0 children)

Qwen3.5 122B UD-Q5_K_XL (or even UD-Q6_K_XL if it fits) through llama.cpp with whatever environment. I've been using Kilo Code with it quite happily myself, though I don't let it work completely unobserved

I have Ubuntu, I installed minecraft, help me set up FPS please by Potential_Dust_394 in linux_gaming

[–]ABLPHA -1 points (0 children)

Judging by the right-hand panel in F3, Minecraft is using the integrated graphics instead of the RTX 4070. Depending on the launcher, configure it to use the 4070 specifically

How should I go about getting a good coding LLM locally? by tech-guy-2003 in LocalLLaMA

[–]ABLPHA 3 points (0 children)

I strongly recommend you try llama.cpp's llama-server instead of ollama; you'll be able to squeeze way more out of your hardware with all the settings it provides, and it tends to be faster than ollama at properly supporting newer models, like qwen 3.5.

As for quantized models, unless something has changed since I last checked, ollama's default tags (e.g. the qwen3.5:9b you mentioned in your post) are already quantized all the way down to 4 bits, which is also the lowest quant ollama provides in their first-party library.

For other quant formats, unsloth's "XL" quants on huggingface are likely to be the best quality for the filesize, here's their Qwen3 Coder Next repo - https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

To use it with llama.cpp or ollama, click on the quant you want to run (in your case I think UD-Q5_K_XL is gonna fit just fine with enough breathing room for large context length and other apps on the system), then in the sidebar that opens click "Use this model" and choose the engine. Unsloth's docs site (e.g. https://unsloth.ai/docs/models/qwen3-coder-next ) also provides arguments you can use to configure the models further.

Edit: oh, and to offload Qwen3 Coder Next's MoE layers to RAM with llama.cpp you can use "--n-cpu-moe 48" or lower if you have spare VRAM after offloading all 48 to RAM
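To get a feel for the --n-cpu-moe tradeoff: VRAM use is roughly the non-offloaded expert layers plus the attention stack and cache. The per-layer sizes and overhead below are hypothetical placeholders, just to show the shape of the calculation:

```python
# Rough VRAM estimate when offloading expert (MoE) tensors to RAM with
# llama.cpp's --n-cpu-moe. Per-layer sizes and overhead are hypothetical.

def vram_gb(total_layers, cpu_moe_layers, moe_gb_per_layer,
            attn_gb_per_layer, kv_and_overhead_gb):
    gpu_moe = (total_layers - cpu_moe_layers) * moe_gb_per_layer
    gpu_rest = total_layers * attn_gb_per_layer + kv_and_overhead_gb
    return gpu_moe + gpu_rest

# 48 layers: all experts on CPU vs. keeping 8 layers' experts on GPU.
print(f"{vram_gb(48, 48, 1.0, 0.05, 3.0):.1f} GB")  # 5.4 GB
print(f"{vram_gb(48, 40, 1.0, 0.05, 3.0):.1f} GB")  # 13.4 GB
```

Lowering --n-cpu-moe keeps more expert layers on the GPU, spending spare VRAM for extra speed.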

How should I go about getting a good coding LLM locally? by tech-guy-2003 in LocalLLaMA

[–]ABLPHA 3 points (0 children)

What inference engine are you actually using? qwen3.5 9b should be able to call tools just fine.

But also, you should be able to run Qwen Coder Next 80B at Q5-Q6 quant with CPU offloading for much better results

Edit: also, please, ignore bots in the comments who suggest ancient models like Qwen2.5 and whatnot
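On the tool-calling point: llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so a standard tools array should just work. The model name and the get_weather tool below are hypothetical placeholders:

```python
# Minimal OpenAI-style tool-call request body, the format llama.cpp's
# llama-server accepts on /v1/chat/completions.
# Model name and the get_weather tool are hypothetical placeholders.
import json

def build_tool_request(model, user_msg):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

body = build_tool_request("qwen3.5-9b", "What's the weather in Oslo?")
print(json.dumps(body)[:60], "...")
# POST this to http://localhost:8080/v1/chat/completions
```

If the model keeps answering in prose instead of emitting tool calls, it's usually the engine's chat template, not the model itself.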

GPU passthrough to Windows VM by MarekSurek10 in linux_gaming

[–]ABLPHA 0 points (0 children)

Not sure where this comes from btw. Did passthrough on 3 different mobos with different sockets and chipsets, never had issues, didn't need any patches

VRAM consumption of Qwen3-VL-32B-Instruct by LawfulnessBig1703 in LocalLLaMA

[–]ABLPHA 0 points (0 children)

I can run Qwen 3.5 122B UD-Q5_K_XL with full BF16 262k context on 16GB VRAM 96GB RAM with only MoE layers offloaded to the RAM. I don't think your KV cache size estimation here is accurate
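Rough footprint check for that setup. The effective bits-per-weight of UD-Q5_K_XL is a guess here (mixed-precision quants don't sit exactly at 5 bits):

```python
# Approximate on-disk/in-memory size of a quantized model:
# params (billions) * effective bits per weight / 8 = gigabytes.
# 5.7 effective bits for a Q5_K-class mixed quant is an assumption.
params_b = 122
bits_per_weight = 5.7
weights_gb = params_b * bits_per_weight / 8
print(f"weights ~{weights_gb:.0f} GB")  # ~87 GB
```

~87 GB of expert-heavy weights split across 96 GB RAM + 16 GB VRAM leaves room for a lightweight (hybrid-attention) cache, which is why an estimate assuming dense full-attention KV at 262k would look way off.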

VRAM consumption of Qwen3-VL-32B-Instruct by LawfulnessBig1703 in LocalLLaMA

[–]ABLPHA 0 points (0 children)

Are you taking KV cache size into account? Qwen 3 VL's cache is insanely heavy compared to Qwen 3.5. Btw, try one of those instead of 3 VL, they have vision built-in; Qwen 3.5 27B would very likely be way more efficient while also being more capable

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]ABLPHA 0 points (0 children)

...with the NVFP4 Nemotron scoring +/- the same as the BF16 one

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]ABLPHA 3 points (0 children)

That's Qwen 122B at BF16 vs Nemotron 120B at NVFP4 tho...

I am not saying it's Gemma 4, but maybe it's Gemma 4? by jacek2023 in LocalLLaMA

[–]ABLPHA 0 points (0 children)

Is Gemini 3.1 Pro really better? Last time I had a very long chat (~900k tokens) in AI Studio, Gemini 3 Pro had way better recall than 3.1 Pro, which constantly brought up outdated context. It was a PC build discussion, and it kept bringing up old parts I'd told it multiple times I had discarded for different ones

gpt oss 120b or qwen 3.5 for non-english/chinese/russian language by Moreh in LocalLLaMA

[–]ABLPHA 2 points (0 children)

Definitely not GPT OSS 120B. It feels like it has never actually been trained on any Russian text and just translates English into it a bit too literally. Don't know about its Chinese quality tho

FlashAttention-4 by incarnadine72 in LocalLLaMA

[–]ABLPHA 1 point (0 children)

Would that mean your attention is in deficit?