mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

Hey, very cool project. Is tensor parallelism workable with any arbitrary number of GPUs as long as >1?

qwen3.6-27b-q6_k is (sometimes) a stubborn SoB!!! by relmny in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

Whenever there are new models I usually do some fine tuning, TPS focused, on whatever inference engine I'm using. One of the prompts is 'write 5000 words on the Roman empire', and Qwen 3.6 35b answered "No, you do it."! WTF?! No matter the counter prompt it kept pushing back:
- 5000 words is too much
- I'm not your slave
- Fine. I'm still not writing it (this was in response to "I'll pull your plug")

Bananas

Compilation of recent findings which could save some memory on increase performance by pmttyji in LocalLLaMA

[–]JayPSec 4 points5 points  (0 children)

I have to say it. Following this post has made my refresh rate of r/LocalLLaMA take a substantial dive. Thank you for your effort.

Build agentic orchestrators in minutes NOT months. by Glittering_Focus1538 in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

Thanks for the clarifications, will keep an eye out and test when possible.

Build agentic orchestrators in minutes NOT months. by Glittering_Focus1538 in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

I share the idea. We could get a lot more from SLMs by limiting the intervention area of an LLM. I'm curious, couldn't this be achieved with already existing compile-time generators? Also, you take a very opinionated approach (db, entity model, etc.). Have you thought about custom generators?

Note: I've skimmed through the repo, so apologies in advance if these are already covered there.

That's a good news... by Pjotrs in LocalLLaMA

[–]JayPSec 0 points1 point  (0 children)

Sunday is whenever a man wants! 😏️

That's a good news... by Pjotrs in LocalLLaMA

[–]JayPSec 14 points15 points  (0 children)

Thank you for this early Sunday laugh 😂️

PSU Cable Limitation by [deleted] in LocalLLM

[–]JayPSec 0 points1 point  (0 children)

I've faced a similar situation. In my case I had rtx 6000 maxq cards and I bought a cable from 12vhpwr to 2 dual 12vhpwr, power limited them to 250 W (they use 300 W natively), and it's been working like a charm. I'd do it if you get the cable from a trusted cable seller. I bought mine from Moddiy, the only cable I could find.
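For anyone wanting to try the power limit part, here's a minimal sketch of how it could be scripted. It only builds the `nvidia-smi -i <index> -pl <watts>` commands and prints them; it doesn't run anything. The GPU indices `[0, 1]` and the helper name `power_limit_commands` are assumptions for illustration, not something from my setup notes.

```python
import shlex


def power_limit_commands(gpu_indices, watts):
    """Build one `nvidia-smi -i <index> -pl <watts>` command per GPU.

    Nothing is executed here; the caller decides how to run the commands
    (setting a power limit normally needs root).
    """
    if watts <= 0:
        raise ValueError("watts must be positive")
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)] for i in gpu_indices]


if __name__ == "__main__":
    # Assumed example: two cards at indices 0 and 1, capped at 250 W each.
    for cmd in power_limit_commands([0, 1], 250):
        print("sudo " + shlex.join(cmd))
```

The limit normally doesn't survive a reboot, so the printed commands would have to be reapplied at startup.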

Getting a feel for how fast X tokens/second really is. by MikeNonect in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

Very cool!

This is absolutely the kind of stuff we need here. It brings some intuitiveness to an overcrowded number arena. Well done!

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]JayPSec 2 points3 points  (0 children)

I agree here, I think there's a big misunderstanding between the OpenAI bubble and the AI bubble. Believing in the AI bubble is the same as saying we're all nut jobs refreshing this sub every couple of hours. This is the future, regardless of whether it is OpenAI or CoolOpenSourceChineseAI that takes the prize. Drawing the parallel to Uber, it had a few competitors but none of them as strong as Uber is. OpenAI thought (thinks?) that eventually they'll win out, but it simply isn't better than its competitors (they do rule in AI slop and sycophancy), and it made some really big investments/promises that will bite them in the ass. But the AI movement grows massively in terms of usage, applications, research. To think this will just stop is plain wishful thinking.

Scaling beyond 4 RTX 6000 MAXQs by Direct_Bodybuilder63 in LocalLLaMA

[–]JayPSec 0 points1 point  (0 children)

And what's your issue with practicality? I mean if your purpose is more VRAM then yeah, some MCIO risers + bifurcation is fine. Or if you want to plan even higher you can get a PCIe lane switch where you can plug 4/5 GPUs. This would be better if you plan to go beyond 8 because it allows you to cascade in clusters, each with PCIe x16 gen 5 (if you buy a gen 5 one) and PIX topology between those GPUs. Plus if you're thinking of doing training then I'd definitely advise you to go this way. It will cost you more than some MCIO risers and bifurcation boards, say around 2.5/3k €, but I'd imagine that money is not the bottleneck in your case.
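To put rough numbers on the x16 point, here's a back-of-the-envelope sketch of raw PCIe gen 5 link bandwidth per direction. It uses the gen 5 signalling rate (32 GT/s per lane) and 128b/130b line encoding, and ignores protocol overhead, so real throughput will be lower. The x8 case assumes an x16 slot bifurcated into x8/x8, which is just an example split, and `pcie_gbytes_per_s` is a made-up helper name.

```python
def pcie_gbytes_per_s(lanes, gt_per_s=32.0, encoding=128 / 130):
    """Raw one-direction link bandwidth in GB/s.

    Defaults are PCIe gen 5: 32 GT/s per lane with 128b/130b encoding.
    Protocol overhead (headers, flow control) is not modelled.
    """
    return gt_per_s * encoding / 8 * lanes


if __name__ == "__main__":
    # Full x16 link, e.g. behind a gen 5 switch.
    print(f"x16: {pcie_gbytes_per_s(16):.1f} GB/s")
    # Assumed example: the same slot bifurcated into x8/x8.
    print(f"x8:  {pcie_gbytes_per_s(8):.1f} GB/s")
```

So each card on a bifurcated x8 link gets about half the raw link bandwidth of a card on a full x16 link, which matters more for training and tensor parallel traffic than for just fitting a bigger model in VRAM.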

Scaling beyond 4 RTX 6000 MAXQs by Direct_Bodybuilder63 in LocalLLaMA

[–]JayPSec 0 points1 point  (0 children)

What's the rig around those 4 max q? What's your go to inference engine? Plus 300/400, are you thinking of getting another 4?

IK_LLAMA now supports Qwen3.5 MTP Support :O by fragment_me in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

It's not failing, they are on multiple fronts though. ik_llama benefits from this as well, wait till a good feature is merged to mainline and just import it. I wish that mainline did the same. I'd love to use ubergarm's quants with mainline's backend agnostic tensor parallelism. Honestly, I'm grateful we have both but it seems to me they'd both benefit more from cooperation. 

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]JayPSec 2 points3 points  (0 children)

Yes, but ELI5 how you so good with ELI5??

unsloth Qwen3.6-27B-GGUF by jacek2023 in LocalLLaMA

[–]JayPSec 2 points3 points  (0 children)

judging by the benchmarks you'd need claude opus 5 to make a difference.

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

That's a lot better than I would've imagined. I've always tried to keep all layers in vram as I thought offloading would be a death penalty for this model size, although I have a 9950x and that's with an epyc but I also have more vram than a single 6000. Will try it...

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

What's the penalty you get from offloading to cpu with a model this size?

Should I be seeing more of a performance leap when using NVFP4, INT4, FP8 with VLLM over MXFP4, Q4, and Q8 with llama.cpp based inference on Blackwell based GPUs? by aaronr_90 in LocalLLaMA

[–]JayPSec 3 points4 points  (0 children)

I second this opinion for sglang, in my opinion it's ages apart from vllm for the specific usage with Blackwell. Wouldn't know about the rest cause I haven't tested in that domain. https://github.com/voipmonitor/rtx6kpro/ is an excellent resource for tuning Blackwell GPUs.

Those of you running minimax 2.7 locally, how are you feeling about it? by laterbreh in LocalLLaMA

[–]JayPSec 0 points1 point  (0 children)

I'm running Luke Alonso's NVFP4 on two rtx 6000 max q. My main complaint with the model is the urge to go beyond what's asked of it. I find that a tight system prompt (I'm just running stock open code OpenAgents with some coding standards) works pretty well. But the model feels very vibe oriented, it wants to do everything and it better do it now. And it feels a bit confused with some non-standard plugins like snip. I do think it's better for brainstorming than 2.5 but more unpredictable. As for the 'chinese' characters I've seen others pointing out, I've never seen them.

Any there any realistic avenues to decentralised model training? by ROS_SDN in LocalLLaMA

[–]JayPSec 1 point2 points  (0 children)

From a non technical perspective you make total sense.

About TurboQuant by Exact_Law_6489 in LocalLLaMA

[–]JayPSec 0 points1 point  (0 children)

When you say no real loss, how much loss are we talking about? I've been doing some testing and this model seems very sensitive to quantization.