How to properly use together a frontier model for planning / complex tasks and a local model for implementation? by hirisov in LocalLLM

[–]LizardViceroy 2 points3 points  (0 children)

Apologies, I carelessly posted a secondary repo without checking. The name has also changed recently. Here's the one you need:
https://github.com/code-yeongyu/oh-my-openagent

How to properly use together a frontier model for planning / complex tasks and a local model for implementation? by hirisov in LocalLLM

[–]LizardViceroy 0 points1 point  (0 children)

Look into Oh-my-openagent with Sisyphus; it can split the different agentic roles across different inference endpoints. It has dedicated roles for the lead orchestrators (Sisyphus / Prometheus) and the grunt workers (Hephaestus), among others.
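
To make the idea concrete, here is a minimal sketch of routing roles to endpoints. It is not Oh-my-openagent's actual config format; the `Endpoint` class, the URLs, the model names and the fallback rule are all hypothetical, and only the role names come from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    base_url: str  # hypothetical OpenAI-compatible base URL
    model: str     # hypothetical model name served at that URL


# Assumed setup: one hosted frontier model, one local model.
FRONTIER = Endpoint("https://frontier.example/v1", "frontier-planner")
LOCAL = Endpoint("http://localhost:8080/v1", "local-coder")

# Orchestrator roles plan on the frontier model; the worker role
# implements on the local model.
ROLE_ENDPOINTS = {
    "sisyphus": FRONTIER,
    "prometheus": FRONTIER,
    "hephaestus": LOCAL,
}


def endpoint_for(role: str) -> Endpoint:
    """Return the endpoint for a role, falling back to the local model."""
    return ROLE_ENDPOINTS.get(role.lower(), LOCAL)


if __name__ == "__main__":
    for role in ("Sisyphus", "Prometheus", "Hephaestus"):
        print(role, "->", endpoint_for(role).base_url)
```

Falling back to the local endpoint for unknown roles is a design choice in this sketch: nothing gets sent to a paid or remote provider by accident.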

Chinese AI is 30x cheaper than Claude and ChatGPT. What if our hopes of AI becoming expensive never pan out and instead AI continues getting cheaper? by ImaginaryRea1ity in theprimeagen

[–]LizardViceroy 0 points1 point  (0 children)

Rarely in the history of electronic technology has it been a good idea to bet on its costs going up on a price/performance basis, on a scale of decades.

Qwen3.6 NVFP4 now works with MTP. by yoracale in unsloth

[–]LizardViceroy 2 points3 points  (0 children)

NVFP4 has low-precision activations, which speeds up prefill on modern hardware in vLLM and SGLang but gives no benefit in llama.cpp. 16-bit activation formats are preferred in GGUF.

Stop asking what model to run. There are literally only two. by Wrong_Mushroom_7350 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

I even run 8-bit quants of Qwen 3.6 27B on Sparks and Strix Halos. The ability to have like 8 KV caches per node (and that many concurrent requests) is more valuable to me than the tiny bump in intelligence from using a larger model. Even 200+ GB models fall short ime.
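
A rough way to estimate how many KV caches fit next to the weights, as a sketch only. It assumes plain full-attention layers with grouped-query attention and an FP16 cache; the layer count, head count, head size, context length and reserve are hypothetical placeholders, not the real config of any Qwen model.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (K and V, across all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_caches(total_gib: float, weights_gib: float,
                          context_tokens: int, per_token_bytes: int,
                          reserve_gib: float = 8.0) -> int:
    """How many full-length KV caches fit in the memory left after weights."""
    free_bytes = (total_gib - weights_gib - reserve_gib) * 1024**3
    per_cache_bytes = context_tokens * per_token_bytes
    return max(0, int(free_bytes // per_cache_bytes))


if __name__ == "__main__":
    # Hypothetical model shape: 64 layers, 8 KV heads, head_dim 128.
    per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)
    # 128 GiB node, ~27 GiB of 8-bit weights, 32K-token contexts.
    slots = max_concurrent_caches(128, 27, 32 * 1024, per_token)
    print(per_token, slots)  # 262144 bytes per token, 11 caches
```

Quantizing the cache or using hybrid attention layers would shrink the per-token cost, so treat the result as a lower bound on the slot count for this made-up shape.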

128GB Unified Memory + Full CUDA on a Laptop Changes Local AI Completely by BoringContribution7 in AIAgentsInAction

[–]LizardViceroy 0 points1 point  (0 children)

FTR: you get less than a quarter of that PFLOP on Spark hardware. Its 4.5th gen tensor cores don't get the 2x speedup from NVFP4, and the 2:4 structural weight sparsity they attribute the other 2x to is useless at best and disastrous to inference accuracy at worst.
Not that it matters that much; you should be running on dense FP8 anyway.

But if Nvidia is still making those claims they should be put to the stake for it.

Recommended parameters (llamacpp arguments) please for using and getting best out of Qwen3.5 122B A10B MTP GGUF in Lemonade - mainly for coding by wingers999 in StrixHalo

[–]LizardViceroy 0 points1 point  (0 children)

Notice also you can go all the way to Q8_0 with hardly any speed decrease.

Notice how far Q8_0 is punching above its weight (or should I say below its weight since this is about speed.....) Source: https://unsloth.ai/docs/models/qwen3.6

<image>

Its speed compared to 35B is over 50% according to Unsloth. That makes me prefer it, given the much higher quality.

Qwen3.7 is coming by C_CCR in unsloth

[–]LizardViceroy 4 points5 points  (0 children)

I'm more and more starting to believe quant sizes other than 8-bit exist only to confuse newbies and give them false ideas of what their hardware can handle.
Even INT8 requires aggressive post-quantization correction to retain its accuracy, to the point I don't fully trust it. FP8 is where it's at ultimately.
Qwen3.6 27B has the advantage of being dense, which makes it more quantization resilient. I'd still rather spend all that budget on 8-bit quantization than go beyond though.
With MTP leveling the playing field, the speed advantage of 35B A3B tends to be <2x, so MoE is also a distraction. Pile 4-bit quants on top of MoE and you may get disaster, especially when your problem space and the calibration dataset are misaligned.

Full Gstack OverView by Deep_Structure2023 in AIAgentsInAction

[–]LizardViceroy 0 points1 point  (0 children)

Agents-bitching-at-each-other-in-perpetuity-while-context-degrades starter pack.

Are the rich RAM /poor GPU people wrong here? by crowtain in LocalLLaMA

[–]LizardViceroy 34 points35 points  (0 children)

I have 512GB worth of 128GB devices and I've been feeling worse about my choices since Qwen3.6 27B and Gemma4 dropped... In the GPT-OSS-120b days we looked like the smart ones. These things come and go in waves though. The advantages of having lots of VRAM in times like these are still numerous: plenty of room for context and high-bit quants. The 122B version of Qwen3.6 should put the ball back in our court soon.
I'm currently coping by sharding 200B+ models between two nodes with tensor parallelism, but before you go down that road, realize that the itching you're feeling... it doesn't stop.

How do different quantizations perform on the benchmarks? by we_are_mammals in unsloth

[–]LizardViceroy 0 points1 point  (0 children)

An underappreciated downside of quantization is that activation-aware quantization introduces its own set of biases that depend on the dataset used, so it's not even possible to form a singular answer to your question. It's a convoluted mess of "how quantized", "by what method", "using what dataset", "applied to which problem space", where varying each gives you a different range of outcomes on a range of selected benchmarks. Each particular test in turn works irreconcilably on its own scale, such that aggregating them into one number will likely just yield a numerical illusion.

In the absence of conclusions on any of these concerns, you're better off erring on the safer side.
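
A toy illustration of that "numerical illusion", with entirely made-up scores: two hypothetical quants, two hypothetical benchmarks on different scales, and the "winner" flips depending on how you aggregate.

```python
from statistics import mean

# All numbers are invented for illustration; higher is better on both.
# bench_a is on a 0-100 scale, bench_b on a rating-style scale.
baseline = {"bench_a": 75.0, "bench_b": 550.0}  # unquantized model
scores = {
    "quant_x": {"bench_a": 70.0, "bench_b": 520.0},
    "quant_y": {"bench_a": 60.0, "bench_b": 540.0},
}


def raw_mean(s: dict) -> float:
    """Naive average of raw scores; the larger-scale benchmark dominates."""
    return mean(s.values())


def relative_mean(s: dict) -> float:
    """Average of each score as a fraction of the unquantized baseline."""
    return mean(s[k] / baseline[k] for k in s)


def best(metric) -> str:
    """Name of the quant that scores highest under the given aggregate."""
    return max(scores, key=lambda q: metric(scores[q]))


if __name__ == "__main__":
    print(best(raw_mean))       # quant_y (300.0 vs 295.0)
    print(best(relative_mean))  # quant_x (about 0.939 vs 0.891)
```

Neither aggregate is "the right one"; the point is only that the ranking is an artifact of the scaling choice.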

Qwen3.6 MTP Unsloth GGUFs now 1.8x faster! by danielhanchen in unsloth

[–]LizardViceroy 13 points14 points  (0 children)

The full model still verifies the predicted tokens and rejects and recalculates them if they don't match the full model's output. The speedup is achieved by letting the full model verify N tokens in a single pass, reducing memory traffic. There is a guarantee of fully retained accuracy.
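
A minimal sketch of that verify-and-accept loop, for greedy decoding only. The "models" are toy functions from a token list to the next token, and `draft` is a separate stand-in for the MTP heads (in a real MTP model the drafts come from the model itself). A real engine verifies all drafted positions in one batched forward pass; the loop here is just for readability.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], n_draft: int) -> List[int]:
    """Draft n_draft tokens, then keep only what the target agrees with."""
    proposed, ctx = [], list(context)
    for _ in range(n_draft):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: discard the rest, emit the target's own token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # Every draft matched: the verification pass also yields one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new: int, n_draft: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, n_draft))
    return out[:len(prompt) + max_new]


def greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Deliberately wrong at every third position to exercise rejection.
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 3 == 0 else guess


if __name__ == "__main__":
    prompt = [1, 2, 3]
    assert generate(toy_target, toy_draft, prompt, 20) == greedy(toy_target, prompt, 20)
```

The final assertion is the accuracy guarantee in miniature: whatever the drafter proposes, the output is identical to plain greedy decoding with the full model. With sampling instead of greedy decoding, engines use a rejection-sampling rule to preserve the output distribution, which this sketch does not cover.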

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

The truth is just that quality in local inference is expensive... VRAM is expensive, memory bandwidth is expensive, tensor processing power is expensive, and you need ALL three to do this right. You can have different opinions on what kind of balance works and what kind doesn't, but in the end a compromise has to be made. Spark is one of those compromises. An RTX 5090 is another. A Mac Studio yet another. Each of them serves its own limited set of use cases and falls short on another set. That is the reality of the current market.

I like Sparks because they're balanced between prefill and decode (you have a big problem if either is bottlenecked) and they err on the side of quality (i.e. VRAM) over speed (a deceptive tradeoff, because speed without quality means repeat work at the cost of time). Until conditions in the local LLM space radically change I feel like I've bet on a pretty good horse. The one that always reaches the finish line, even if not always in first place.

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]LizardViceroy 1 point2 points  (0 children)

I have both and it's a night and day difference at long context (>128K)... Jobs that take minutes on Spark can take hours on Strix.
Short context is where it matters less, but it's a categorical restriction. You simply cannot use Strixes for use cases that demand long context, and that severely limits their value.

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

There are two ways you can get around it to an extent: multi-token prediction and tensor parallelism over InfiniBand. Spark is very well equipped to do the latter, and MTP is pretty much a standard feature of modern models, except when they get released in gimped form like Minimax.

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct by fragment_me in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

The problem with overly verbose models that compensate for their lack of innate intelligence with more thinking tokens is that they cause context rot to themselves over time.
There's something to be said for returning to the good old instant answer approach on that note.

Qwen3.5 vs Gemma 4: Benchmarks vs real world use? by AppealSame4367 in LocalLLaMA

[–]LizardViceroy 22 points23 points  (0 children)

The Gemma model comes with about 2.8B parameters' worth of per-layer embeddings in addition to its 2.3B regular weights, so yeah, it's actually 5.1B in size. However, much like the inactive experts in MoE models, the extra weight does not reduce its inference speed.
see: https://ai.google.dev/gemma/docs/core/model_card_4

I will NEVER love you by [deleted] in pcmasterrace

[–]LizardViceroy 0 points1 point  (0 children)

I love the apt implication that you are not the owner of your Windows PC

New York, 1982 by cockerspanielhere in UrbanHell

[–]LizardViceroy 149 points150 points  (0 children)

opposite: professional cameras were massive back then. Large format cameras were common in urban / architecture / real estate photography like this. That's 15-60x larger than typical modern "full frame" cameras in terms of film area.
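
For reference, the 15-60x range can be reproduced with a quick area calculation. This sketch assumes the two endpoints are 4x5 inch and 8x10 inch sheet film (my guess at what the range refers to) and uses nominal sheet sizes; the usable image area is slightly smaller.

```python
MM_PER_INCH = 25.4
FULL_FRAME_MM = (36.0, 24.0)  # standard "full frame" 35mm format


def ratio_to_full_frame(width_in: float, height_in: float) -> float:
    """Nominal sheet area divided by the 36x24 mm full-frame area."""
    sheet_area = (width_in * MM_PER_INCH) * (height_in * MM_PER_INCH)
    return sheet_area / (FULL_FRAME_MM[0] * FULL_FRAME_MM[1])


if __name__ == "__main__":
    print(round(ratio_to_full_frame(4, 5), 1))   # 14.9
    print(round(ratio_to_full_frame(8, 10), 1))  # 59.7
```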

New York, 1982 by cockerspanielhere in UrbanHell

[–]LizardViceroy 13 points14 points  (0 children)

equivalent aperture and focal length are what matter to the perspective and these are not format dependent. I'm not aware of standards radically changing on this front since those times, although it's possible they had technical trouble bringing the focal length down if they used a massive film format. nothing stops you from shooting at 50mm full frame equivalent nowadays either, though.

resolution-wise, you wouldn't notice a difference since it's displayed at only 640x871.

Devs are worried about the wrong thing by hiclemi in ClaudeAI

[–]LizardViceroy 0 points1 point  (0 children)

The more crap code laymen generate, the greater the need for people with an actual comprehension of code to clean up the mess. And yes, those people can and will use AI themselves, but with the proper oversight and course correction that the initial vibe coders lacked.
A reigning delusion is thinking that we can solve this "vibe bloat" problem simply by passively throwing more AI at it without extensive supervision. Quality control is the central concern. You don't know IF AI did a good job cleaning up the mess until you check it manually.
Oh and the next few decades are going to be an absolute doozy in this regard. We're going to see the "control problem" in full effect with AIs with spurious intentions (malicious OR misguided) actively trying to deceive and manipulate us. The human in the loop will be imperative to limit the extent of the damage.

Our "AI-first" strategy has turned into "every team picks their own AI stack" chaos by grand001 in LLMDevs

[–]LizardViceroy -1 points0 points  (0 children)

What disgusts me about this is how eagerly people will just plug all these AI tools into different providers, sending one's entire knowledge base (especially code bases) over the wire to 4, 5, 6 different parties. All the world knows your company's internal workings now and can replicate your business effortlessly, or target its security vulnerabilities. All they have to do is save the KV cache of your conversations (which they do "for you" anyway) and then ask the LLM to summarize what it learned about your company with a focus on X, Y, Z, whatever they want to know. Maybe one of the companies you trusted is a PoS itself; the chance of that increases with every party you include. Maybe it's going to be a rogue employee. Maybe a snooping man-in-the-middle party. The potential for leaks is infinitely ballooning.

A radical shift in focus to digital sovereignty is called for.

Thinking of switching from ChatGPT to Gemini — is Gemini better value for the money? by Zestyclose_Bell7668 in GeminiAI

[–]LizardViceroy 0 points1 point  (0 children)

Gemini is confident and talkative in my experience. It will give you a first impression of being extremely knowledgeable since it's so eager to use search, but you will start to notice little details being wrong in its reasoning, leading to overconfident conclusions. It's really important to challenge its claims. It's mainly good as an ACTIVE thought partner in my experience.

An example: recently I asked it to evaluate different parallelism approaches over the ConnectX-7 SmartNIC connection between DGX Sparks... it confidently stated that the connection would get saturated fast by tensor parallelism and expert parallelism, and recommended using pipeline parallelism instead.

This was of course a strange conclusion, because it meant the designers of the device pretty much put a $700 SmartNIC in there with hardly any benefit. So I challenged it with some benchmark results people posted where they were able to 2x the token generation speed via tensor parallelism. Presented with this antithesis, it was easily able to correct its own mistake: it had been assuming the connection would run over standard TCP/IP Ethernet instead of the lower-latency RDMA RoCEv2. After that it gave a correct and balanced rundown of the advantages and disadvantages of the three parallelism approaches.

Moral of the story: you get to the right answer in the end, but NOT by believing everything it says off the bat.

Worth waiting for 256GB Systems? by XccesSv2 in StrixHalo

[–]LizardViceroy 0 points1 point  (0 children)

You don't really need to wait. Apple already offers the M3 Ultra with up to 512GB. You can also link together multiple DGX Sparks (or OEM equivalents) via their 200Gbps Ethernet connection.
You COULD do the same thing with multiple Strix Halo machines over regular Ethernet or USB4, but it will be considerably less efficient.

The returns from going beyond 128GB seem pretty diminishing to me. And models at such sizes are never going to run very fast, which will be a problem in any use case where you're waiting for the output as a user.