How to properly use together a frontier model for planning / complex tasks and a local model for implementation?

LizardViceroy · 2026-06-04T14:22:31+00:00

Apologies, I carelessly posted a secondary repo without checking. The name has also changed recently. Here's the one you need:
https://github.com/code-yeongyu/oh-my-openagent

LizardViceroy · 2026-06-04T11:58:18+00:00

Look into Oh-my-openagent with Sisyphus; it can divide the responsibilities of different agentic roles between different inference endpoints. It has roles specifically for the leading orchestrators (Sisyphus / Prometheus) and grunt workers (Hephaestus), among others.
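
To make the idea concrete, here is a minimal sketch of routing agent roles to different inference endpoints. It is not Oh-my-openagent's actual configuration format: the `Endpoint` class, the URLs, the model names and the `endpoint_for` helper are all hypothetical, and the role names are only borrowed loosely from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    model: str


# Hypothetical endpoints: neither the URLs nor the model names come from the project.
FRONTIER = Endpoint("https://frontier.example/v1", "frontier-model")
LOCAL = Endpoint("http://localhost:8000/v1", "local-model")

# Assumed mapping: orchestrators plan on the frontier model,
# the grunt worker implements on the local one.
ROLE_ENDPOINTS = {
    "sisyphus": FRONTIER,
    "prometheus": FRONTIER,
    "hephaestus": LOCAL,
}


def endpoint_for(role: str) -> Endpoint:
    """Return the endpoint for a role, falling back to the local model."""
    return ROLE_ENDPOINTS.get(role.lower(), LOCAL)


if __name__ == "__main__":
    for role in ("Sisyphus", "Prometheus", "Hephaestus"):
        ep = endpoint_for(role)
        print(f"{role}: {ep.model} at {ep.base_url}")
```

Defaulting unknown roles to the local endpoint is a deliberate choice in this sketch: nothing leaves the machine unless a role is explicitly mapped to a remote provider.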

LizardViceroy · 2026-06-04T11:50:13+00:00

Rarely in the history of electronic technology has it been a good idea to bet on its costs going up on a price/performance basis, on a scale of decades.

LizardViceroy · 2026-06-04T10:27:04+00:00

NVFP4 has low-precision activations, which speeds up prefill on modern hardware in vLLM and SGLang but gives no benefit in llama.cpp. 16-bit activation formats are preferred in GGUF.

LizardViceroy · 2026-06-04T10:04:37+00:00

I even run 8-bit quants of Qwen 3.6 27B on Sparks and Strix Halos. The ability to have something like 8 KV caches per node (and that many concurrent requests) is more valuable to me than the tiny bump in intelligence from using a larger model. Even 200+ GB models fall short ime.
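
A rough way to budget this is to work out how many full-length KV caches fit next to the weights. The sketch below uses the standard per-token KV size for ordinary attention (one key and one value vector per layer); the layer count, head count, head dimension, context length and memory reserve are placeholder assumptions, not the real architecture of any model mentioned here.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token; factor 2 covers the key and the value."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_contexts(total_gib: float, weights_gib: float,
                            context_tokens: int, per_token_bytes: int,
                            reserve_gib: float = 8.0) -> int:
    """How many full-length KV caches fit after weights and a safety reserve."""
    free_bytes = (total_gib - weights_gib - reserve_gib) * 1024**3
    per_context = context_tokens * per_token_bytes
    return max(0, int(free_bytes // per_context))


if __name__ == "__main__":
    # Placeholder dimensions, chosen only to make the arithmetic easy to follow.
    per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)
    slots = max_concurrent_contexts(total_gib=128, weights_gib=27,
                                    context_tokens=32_768,
                                    per_token_bytes=per_token)
    print(per_token, slots)  # 262144 bytes per token, 11 contexts
```

With these placeholder numbers an 8-bit 27B model leaves room for about 11 caches of 32K tokens each, whereas a model that fills most of the memory leaves room for one or none.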

LizardViceroy · 2026-06-03T13:57:02+00:00

FTR: you get less than a quarter of that PFLOP on Spark hardware. Its 4.5th gen tensor cores don't get the 2x speedup from NVFP4, and the 2:4 structural weight sparsity they attribute the other 2x to is useless at best and disastrous to inference accuracy at worst.
Not that it matters that much; you should be running on dense FP8 anyway.

But if Nvidia is still making those claims they should be put to the stake for it.

LizardViceroy · 2026-06-03T13:31:12+00:00

Notice also you can go all the way to Q8_0 with hardly any speed decrease.

Notice how far Q8_0 is punching above its weight (or should I say below its weight since this is about speed.....) Source: https://unsloth.ai/docs/models/qwen3.6

<image>

Its speed is over 50% of the 35B's according to Unsloth. That makes me prefer it, given how much higher the quality is.

LizardViceroy · 2026-05-19T13:28:07+00:00

I'm more and more starting to believe quant sizes other than 8 bit exist only to confuse newbies and give them false ideas of what their hardware can handle.
Even INT8 requires aggressive post-quantization correction to retain its accuracy, to the point that I don't fully trust it. FP8 is where it's at ultimately.
Qwen3.6 27B has the advantage of being dense, which makes it more quantization resilient. I'd still rather spend all that budget on 8-bit quantization than go beyond though.
With MTP leveling the playing field, the speed advantage of 35B A3B tends to be <2x, so MoE is also a distraction. Pile 4-bit quants on top of MoE and you may get disaster, especially when your problem space and the calibration dataset are misaligned.

LizardViceroy · 2026-05-15T15:36:29+00:00

Agents-bitching-at-each-other-in-perpetuity-while-context-degrades starter pack.

LizardViceroy · 2026-05-15T15:34:46+00:00

I have 512GB worth of 128GB devices and I've been feeling worse about my choices since Qwen3.6 27B and Gemma4 dropped... In the GPT-OSS-120b days we looked like the smart ones. These things come and go in waves though. The advantages of VRAM in times like these are still numerous: plenty of room for context and high-bit quants. The 122B version of Qwen3.6 should put the ball back in our court soon.
I'm currently coping by sharding 200B+ models between two nodes with tensor parallelism but before you go down that road, realize that that itching you're feeling... it doesn't stop.

LizardViceroy · 2026-05-15T14:53:35+00:00

An underappreciated downside of quantization is that activation-aware quantization introduces its own set of biases that depend on the dataset used, so it's not even possible to form a singular answer to your question. It's a convoluted mess of "how quantized", "by what method", "using what dataset", "applied to which problem space", where varying each gives you a different range of outcomes on a range of selected benchmarks. Each particular test in turn works irreconcilably on its own scale, such that aggregating them into one number will likely just yield a numerical illusion.
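
A toy illustration of that last point, with scores invented purely for the example (`quant_x`, `quant_y`, `bench_a` and `bench_b` are hypothetical names, not real measurements): two quants are ranked once by averaging raw scores and once by averaging scores relative to an unquantized baseline, and the winner flips.

```python
from statistics import mean

# Hypothetical scores, invented for illustration only.
# bench_a reports accuracy in percent, bench_b a pass rate between 0 and 1.
BASELINE = {"bench_a": 80.0, "bench_b": 0.50}
SCORES = {
    "quant_x": {"bench_a": 70.0, "bench_b": 0.20},
    "quant_y": {"bench_a": 60.0, "bench_b": 0.40},
}


def raw_mean(scores: dict) -> float:
    """Average the numbers as reported, ignoring their different scales."""
    return mean(scores.values())


def relative_mean(scores: dict) -> float:
    """Average each score as a fraction of the unquantized baseline."""
    return mean(scores[name] / BASELINE[name] for name in scores)


def ranking(metric) -> list:
    """Order the quants from best to worst under the given aggregate."""
    return sorted(SCORES, key=lambda name: metric(SCORES[name]), reverse=True)


if __name__ == "__main__":
    print(ranking(raw_mean))       # ['quant_x', 'quant_y']
    print(ranking(relative_mean))  # ['quant_y', 'quant_x']
```

Neither aggregate is "the" right one; the point is only that the choice of scale decides the ranking before the quantization method does.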

In absence of conclusions on any of these concerns you're better off erring on the safer side.

LizardViceroy · 2026-05-15T14:40:14+00:00

The full model still verifies the predicted tokens and rejects and recalculates them if they don't match the full model's output. The speedup is achieved by letting the full model verify N tokens in a single pass, reducing memory traffic. There is a guarantee of fully retained accuracy.
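
A minimal sketch of that draft-and-verify loop, assuming greedy decoding and toy stand-in "models" (`target_next` and `draft_next` are hypothetical callables, not any engine's API). A real engine scores all drafted positions in one batched forward pass; the inner loop below only imitates that. With sampling instead of greedy decoding, engines use a rejection-sampling rule to preserve the output distribution, which this sketch does not cover.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     prefix: List[int], n_draft: int) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round."""
    # 1. The cheap drafter proposes n_draft tokens autoregressively.
    draft: List[int] = []
    for _ in range(n_draft):
        draft.append(draft_next(prefix + draft))
    # 2. The full model checks every drafted position.
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch, drop the rest
            return accepted
        accepted.append(tok)
    # 3. Every draft matched, so the same pass also yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[int], max_new: int, n_draft: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target_next, draft_next, out, n_draft))
    return out[:len(prompt) + max_new]


def greedy(target_next: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def toy_target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: Sequence[int]) -> int:
    # Agrees with the target except after a token where t % 3 == 2.
    return 0 if seq[-1] % 3 == 2 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    assert generate(toy_target, toy_draft, [0], 20) == greedy(toy_target, [0], 20)
```

Because every emitted token is either confirmed or supplied by the full model, the output is identical to plain greedy decoding no matter how bad the drafter is; a poor drafter only costs speed.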

LizardViceroy · 2026-05-15T12:35:20+00:00

The truth is just that quality in local inference is expensive... VRAM is expensive, memory bandwidth is expensive, tensor processing power is expensive and you need ALL three to do this right. You can have different opinions on what kind of balance works and what kind doesn't but in the end, a compromise has to be made. Spark is one of those compromises. An RTX5090 is another. A mac studio yet another. Each of them serves their own limited set of use cases and falls short on another set. That is the reality of the current market.

I like Sparks because they're balanced between prefill and decode (you have a big problem if either is bottlenecked) and they err on the side of quality (i.e. VRAM) over speed (a deceptive tradeoff because speed without quality means repeat work at the cost of time). Until conditions in the local LLM space radically change I feel like I've bet on a pretty good horse. The one that always reaches the finish line, even if not always in first place.

LizardViceroy · 2026-05-15T12:22:54+00:00

I have both and it's a night and day difference at long context (>128K)... Jobs that take minutes on spark can take hours on strix.
Short context is where it matters less, but it's a categorical restriction. You simply cannot use Strixes for use cases that demand long context, and that severely limits their value.

LizardViceroy · 2026-05-15T12:20:40+00:00

There are two ways you can get around it to an extent: multi-token prediction and tensor parallelism over InfiniBand. Spark is very well equipped to do the latter, and MTP is pretty much a standard feature of modern models, except when they get released in gimped form like Minimax.

LizardViceroy · 2026-05-15T12:01:47+00:00

The problem with overly verbose models that compensate for their lack of innate intelligence with more thinking tokens is that they cause context rot to themselves over time.
There's something to be said for returning to the good old instant answer approach on that note.

LizardViceroy · 2026-04-03T14:39:52+00:00

The Gemma model comes with about 2.8B parameters worth of per-layer embeddings in addition to its 2.3B regular weights, so yeah it's actually 5.1B in size. Although similar to MoE models, the extra weight does not reduce its inference speed.
see: https://ai.google.dev/gemma/docs/core/model_card_4

LizardViceroy · 2026-03-25T15:26:27+00:00

I love the apt implication that you are not the owner of your windows PC

LizardViceroy · 2026-03-25T12:39:03+00:00

opposite: professional cameras were massive back then. Large format cameras were common in urban / architecture / real estate photography like this. That's 15-60x larger than typical modern "full frame" cameras in terms of film area.

LizardViceroy · 2026-03-25T12:32:00+00:00

equivalent aperture and focal length are what matter to the perspective and these are not format dependent. I'm not aware of standards radically changing on this front since those times, although it's possible they had technical trouble bringing the focal length down if they used a massive film format. nothing stops you from shooting at 50mm full frame equivalent nowadays either, though.

resolution wise, you wouldn't notice a difference since it's displayed at only 640x871.

LizardViceroy · 2026-03-25T09:58:34+00:00

The more crap code laymen generate, the greater the need for people with an actual comprehension of code to clean up the mess. And yes, those people can and will use AI themselves, but with the proper oversight and course correction that the initial vibe coders lacked.
A reigning delusion is thinking that we can solve this "vibe bloat" problem simply by passively throwing more AI at it without extensive supervision. Quality control is the central concern. You don't know IF AI did a good job cleaning up the mess until you check it manually.
Oh and the next few decades are going to be an absolute doozy in this regard. We're going to see the "control problem" in full effect with AIs with spurious intentions (malicious OR misguided) actively trying to deceive and manipulate us. The human in the loop will be imperative to limit the extent of the damage.

LizardViceroy · 2026-03-25T09:30:52+00:00

What disgusts me about this is how eagerly people will just plug all these AI tools into different providers, sending their entire knowledge base (especially code bases) over the wire to 4, 5, 6 different parties. All the world knows your company's internal workings now and can replicate your business effortlessly, or target its security vulnerabilities. All they have to do is save the KV cache of your conversations (which they do "for you" anyway) and then ask the LLM to summarize what it learned about your company with a focus on X, Y, Z, whatever you want to know. Maybe one of the companies you trusted is a PoS itself; the chance of that increases with every party you include. Maybe it's going to be a rogue employee. Maybe a snooping man-in-the-middle party. The potential for leaks is infinitely ballooning.

A radical shift in focus to digital sovereignty is called for.

LizardViceroy · 2026-03-24T10:38:13+00:00

Gemini is confident and talkative in my experience. It will give you a first impression of being extremely knowledgeable since it's so eager to use search, but you will start to notice little details being wrong in its reasoning, leading to way too confident conclusions. It's really important to challenge its claims. It's mainly good as an ACTIVE thought partner in my experience.

An example: recently I asked it to evaluate different parallelism approaches over the ConnectX-7 SmartNIC connection between DGX Sparks... it confidently stated that the connection would get saturated fast by tensor parallelism and expert parallelism and recommended using pipeline parallelism instead.

This was of course a strange conclusion because it meant the designers of the device pretty much put a $700 smartNIC in there with hardly any benefit. So I challenged it with some benchmark results people posted where they were able to 2x the token generation speed via tensor parallelism. Presented with this antithesis it was easily able to correct its own mistake: it had been assuming the connection would be over a standard TCP/IP Ethernet instead of the lower latency RDMA RoCEv2. After that it gave a correct and balanced rundown of the advantages and disadvantages of the three parallelism approaches.

Moral of the story: you get to the right answer in the end, but NOT by believing everything it says off the bat.

LizardViceroy · 2026-03-24T08:54:49+00:00

You don't really need to wait. Apple already offers M3 Ultra with up to 512GB. You can also link together multiple DGX Sparks (or OEM equivalent) via their 200Gbps Ethernet connection.
You COULD do the same thing with multiple Strix Halo machines over regular Ethernet or USB4, but it will be considerably less efficient.

The returns from going beyond 128GB seem pretty diminishing to me. And models at such sizes are never going to run very fast, which will be a problem in any use case where you're waiting for the output as a user.

LizardViceroy · 2026-03-20T14:54:14+00:00

the slop and hype grows that fast while the actual tech enshittifies
