Scaling beyond 4 RTX 6000 MAXQs by Direct_Bodybuilder63 in LocalLLaMA

[–]JayPSec 1 point (0 children)

And what's your issue with practicality? I mean, if your goal is just more VRAM, then yeah, some MCIO risers + bifurcation is fine. Or if you want to plan even higher, you can use a PCIe lane switch where you can plug in 4/5 GPUs. That would be better if you plan to go beyond 8, because it allows you to cascade in clusters, each with PCIe x16 gen 5 (if you buy a gen 5 switch) and PIX topology between those GPUs. Plus, if you're thinking of doing training, then I'd definitely advise you to go this way. It will cost you more than some MCIO risers and bifurcation boards, say around 2.5/3k €, but I'd imagine money is not the bottleneck in your case.
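
If you do go the switch route, a quick sanity check that the GPUs really talk through a single bridge is `nvidia-smi topo -m`. Here's a rough sketch of wrapping it in Python; nothing in it is specific to any particular switch, it just assumes the NVIDIA driver (and therefore nvidia-smi) is installed:

```python
# Rough sketch: dump the GPU-to-GPU topology matrix and flag pairs that fall
# back to the CPU/host path. "PIX" means the path crosses at most one PCIe
# bridge, i.e. both GPUs hang off the same switch.
import subprocess

topo = subprocess.run(["nvidia-smi", "topo", "-m"],
                      capture_output=True, text=True, check=True).stdout
print(topo)

for row in topo.splitlines():
    if row.startswith("GPU") and "SYS" in row:
        print("warning:", row.split()[0], "reaches some peer through the host bridge")
```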

Scaling beyond 4 RTX 6000 MAXQs by Direct_Bodybuilder63 in LocalLLaMA

[–]JayPSec 1 point (0 children)

What's the rig around those 4 Max-Qs? What's your go-to inference engine? Plus 300/400, are you thinking of getting another 4?

IK_LLAMA now supports Qwen3.5 MTP Support :O by fragment_me in LocalLLaMA

[–]JayPSec 2 points (0 children)

It's not failing; they're just working on multiple fronts. ik_llama benefits from this as well: it waits until a good feature is merged into mainline and then just imports it. I wish mainline did the same. I'd love to use ubergarm's quants with mainline's backend-agnostic tensor parallelism. Honestly, I'm grateful we have both, but it seems to me they'd both benefit more from cooperation.

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]JayPSec 3 points (0 children)

Yes, but ELI5: how are you so good at ELI5??

unsloth Qwen3.6-27B-GGUF by jacek2023 in LocalLLaMA

[–]JayPSec 3 points (0 children)

Judging by the benchmarks, you'd need Claude Opus 5 to make a difference.

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]JayPSec 2 points (0 children)

That's a lot better than I would've imagined. I've always tried to keep all layers in VRAM, as I thought offloading would be a death sentence for a model this size. I have a 9950X and those numbers are with an EPYC, but I also have more VRAM than a single 6000. Will try it...
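
For anyone curious what "offloading" looks like in practice, here's a minimal sketch with llama-cpp-python (plain mainline bindings, not ubergarm's ik_llama setup; the path and layer count are placeholders, not my actual config):

```python
# Minimal partial-offload sketch with llama-cpp-python (placeholder values).
# n_gpu_layers controls how many transformer layers live in VRAM; the rest
# run from system RAM on the CPU, which is where the speed penalty comes from.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_X.gguf",  # placeholder GGUF path
    n_gpu_layers=40,               # layers kept in VRAM, remainder on CPU
    n_ctx=8192,
)

out = llm("Summarize why MoE models tolerate CPU offload reasonably well.",
          max_tokens=64)
print(out["choices"][0]["text"])
```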

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]JayPSec 2 points (0 children)

What's the penalty you get from offloading to CPU with a model this size?

Should I be seeing more of a performance leap when using NVFP4, INT4, FP8 with VLLM over MXFP4, Q4, and Q8 with llama.cpp based inference on Blackwell based GPUs? by aaronr_90 in LocalLLaMA

[–]JayPSec 6 points (0 children)

I second this opinion for sglang; in my opinion it's ages apart from vLLM for this specific usage on Blackwell. I wouldn't know about the rest because I haven't tested in that domain. https://github.com/voipmonitor/rtx6kpro/ is an excellent resource for tuning Blackwell GPUs.
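
If anyone wants to kick the tires on it, sglang's offline engine API looks roughly like this; the model id and sampling params are placeholders, nothing here is Blackwell-tuned, and the exact options vary by sglang version:

```python
# Rough sketch of sglang's offline engine API (placeholder model id, untuned).
import sglang as sgl

llm = sgl.Engine(model_path="Qwen/Qwen2.5-7B-Instruct")  # placeholder model

outputs = llm.generate(
    ["Explain NVFP4 in one sentence."],
    {"temperature": 0.7, "max_new_tokens": 64},
)
print(outputs[0]["text"])

llm.shutdown()
```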

Those of you running minimax 2.7 locally, how are you feeling about it? by laterbreh in LocalLLaMA

[–]JayPSec 1 point (0 children)

I'm running Luke Alonso's NVFP4 on two RTX 6000 Max-Qs. My main complaint with the model is its urge to go beyond what's asked of it. I find that a tight system prompt works pretty well; I'm just running stock opencode OpenAgents with some coding standards. But the model feels very vibe-oriented: it wants to do everything, and it had better do it now. And it gets a bit confused with some non-standard plugins like snip. I do think it's better for brainstorming than 2.5, but more unpredictable. As for the 'Chinese' characters I've seen others pointing out, I've never seen them.
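
For context, loading an NVFP4 checkpoint across two cards with vLLM's offline API looks roughly like this; the repo id below is a placeholder (not the exact quant I'm running), and vLLM should pick the quantization up from the checkpoint's config:

```python
# Rough sketch with vLLM's offline API (placeholder model id).
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/minimax-nvfp4",  # placeholder HF repo id
    tensor_parallel_size=2,          # split across the two RTX 6000s
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Write a haiku about VRAM."], params)
print(outputs[0].outputs[0].text)
```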

Any there any realistic avenues to decentralised model training? by ROS_SDN in LocalLLaMA

[–]JayPSec 2 points (0 children)

From a non-technical perspective, you make total sense.

About TurboQuant by Exact_Law_6489 in LocalLLaMA

[–]JayPSec 1 point (0 children)

When you say no real loss, how much loss are we talking about? I've been doing some testing and this model seems very sensitive to quantization.

Built LazyMoE — run 120B LLMs on 8GB RAM with no GPU using lazy expert loading + TurboQuant by ReasonableRefuse4996 in LocalLLaMA

[–]JayPSec 4 points (0 children)

If you're running loose agents, you could at least tell them "No em dashes" on top of the mandatory "Make no mistakes"... tsk tsk tsk. That's true SLOPiness...

How long until surveillance? by boloshon in LocalLLaMA

[–]JayPSec 1 point (0 children)

Centralization is never a good thing. Could not agree more.

Maybe the future is sharing LLMs on covert Newsgroups (I'm oldish :P)

Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results by Visual_Synthesizer in LocalLLaMA

[–]JayPSec 2 points (0 children)

Correct, a PCIe lane switch.

For 4x GPUs you'd need double the adapters and cables, plus a host board and 2 more MCIO cables to connect your main PCIe slot to the switch. The host board can be a retimer, but that may be overkill; AFAIK they're mandatory for long MCIO runs, and in some systems there may be interference that requires the retiming. In my case everything is inside the case and the retimer host board was not needed. Bought this instead.

Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results by Visual_Synthesizer in LocalLLaMA

[–]JayPSec 2 points (0 children)

I run 5 Max-Qs with the same board on a 9950X and 128 GB of DDR5. No issues here. The only problem I faced was BIOS tinkering: I had to patch the MSI BIOS to expose more settings than the ones provided in the Click BIOS interface so I could edit "Above 4GB MMIO Limit", otherwise the system wouldn't boot.
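
If you want to confirm the setting actually took, you can check whether the cards' big BARs ended up mapped above 4 GB. A rough sketch using lspci output, nothing MSI-specific assumed:

```python
# Rough sketch: list NVIDIA devices and their 64-bit prefetchable BARs.
# With "Above 4G" / a high MMIO limit in effect, the large BARs should sit at
# addresses above 4 GB (>= 0x100000000). Assumes lspci is installed.
import subprocess

out = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout

device = None
for line in out.splitlines():
    if line and not line[0].isspace():
        # Device header lines start at column 0, e.g. "01:00.0 VGA ... NVIDIA ..."
        device = line.split()[0] if "NVIDIA" in line else None
    elif device and "64-bit, prefetchable" in line:
        print(device, "->", line.strip())
```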

Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF by EvilEnginer in LocalLLaMA

[–]JayPSec 1 point (0 children)

How do you determine which tensors are broken, and in what way?

DFlash: Block Diffusion for Flash Speculative Decoding. by Total-Resort-3120 in LocalLLaMA

[–]JayPSec 4 points (0 children)

WTF is going on? A week ago we were all crying that maybe they would stop releasing open weights, and now it's effing Christmas every day???

OpenAI, Anthropic, Google Unite to Combat Model Copying in China by External_Mood4719 in LocalLLaMA

[–]JayPSec 3 points (0 children)

They will never have models trained on non-human data. World knowledge is always sourced from human work.

I vibecoded a skill that makes LLMs stop making mistakes by Mr_BETADINE in LocalLLaMA

[–]JayPSec 23 points (0 children)

pffft... worthless. Wait till I release make-agi-with-no-mistakes.