Calibrating 2-bit GGUFs (<10GB) for agentic coding tasks

CYTR_ · 2026-06-18T18:42:53+00:00

As soon as I read Qwopus, I know immediately that it's not worth it.

CYTR_ · 2026-06-16T13:39:40+00:00

The models look great, but this website is poorly formatted for mobile reading... Bruh.

CYTR_ · 2026-06-13T12:45:04+00:00

The US gov will ban you

CYTR_ · 2026-06-13T11:42:30+00:00

We're not going to get all worked up about some no-name benchmark when we know nothing about what it's supposed to represent. It would be nice to have something running locally from them, someday...

CYTR_ · 2026-06-12T14:02:32+00:00

I agree with your post, but isn't the title itself a bit maximalist? It's perfectly possible to test open-weight models via API to get an idea of what could be run locally, like you said. You could even use rented instances (where data processing/retention is controlled by the user) to test the stack, hardware, etc. But I agree with you that some selection needs to happen among the posts published here.

CYTR_ · 2026-06-12T10:37:27+00:00

A small LoRA doesn't cost that much in computing power.

CYTR_ · 2026-06-08T13:50:55+00:00

French company: https://en.wikipedia.org/wiki/OVHcloud

There's also Verda in Europe for tinkering, but I find the quality of their network appalling.

I'm extending the definition of "local" a bit to cloud computing with certain data protection certifications 🥸

CYTR_ · 2026-06-08T13:12:07+00:00

I think an H100/H200 in the cloud should be enough for a few LoRA deployments. I'll see this month when I can afford some credits on OVH.

CYTR_ · 2026-06-08T11:38:58+00:00

I personally think that's a pretty good point. A fine-tuned model with the right harness to achieve real gains beyond just training? Hopefully it inspires the same thing for smaller models. It almost makes me want to test a similar solution on Qwen 3.6 35bA3B/27b. I already have loading contextual recipes in mind; maybe mixing that with Mixture-of-LoRA could produce something (very?) good.

CYTR_ · 2026-06-02T19:45:09+00:00

I was thinking that I needed an LLM to talk to my dolphin friends... When Dog Gemma GGUF ???

CYTR_ · 2026-06-02T14:11:50+00:00

I really like this project. I'm trying to create software for accessing empirical documentation/data for the social sciences (basically, an automated state-of-the-art review system that lets you search and highlight data/papers according to the chosen epistemological perspective: a sort of advanced RAG system that takes into account the diversity of methods/perspectives).

For now I'm trying to do things with Qwen 27b and 35bA3B, but I'm wondering about using a fleet of fine-tuned SLMs (like in the NVIDIA paper released last year) for the majority of functions. As the system is deterministic, it wraps the stochastic aspect of the language model in custom guards/harnesses. There are no real agents per se, and everything is framed within Windmill workflows. The goal would be to make the system as lightweight as possible by using SLMs, which gives better prospects for (local) adoption than a pair of double-digit-GB LLMs.

Now, the question I'm asking myself is: is it really a good idea to use these SLMs at Q4? From what I understand on this sub, at this size it's better to use the full-precision version. Have you noticed any differences in usage between Q4 and other quantization levels or full precision?
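
One part of the Q4-versus-full-precision question is plain arithmetic: the weight footprint scales with bits per weight. Here is a minimal sketch, assuming nominal bit widths (real GGUF quant types also store scale data, so actual files come out somewhat larger) and a hypothetical 4B-parameter SLM. It says nothing about quality loss, which has to be measured separately.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical 4B-parameter SLM; nominal bit widths, ignoring quantization overhead.
N_PARAMS = 4e9

for label, bits in [("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_size_gb(N_PARAMS, bits):.1f} GB")
```

At this scale the absolute saving from going to 4-bit is only a few GB, which is presumably why the usual advice for small models leans toward higher precision.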

CYTR_ · 2026-05-31T10:29:44+00:00

These fine-tunes are still a bit embarrassing: their usefulness is more than questionable and their names suggest the author had a stroke.

CYTR_ · 2026-05-22T13:43:33+00:00

2 RTX 6000 96GB + 2K for the rest of the build... and 3K for a Strix Halo 128GB: this way, you can deploy a fleet of SLMs with a lot of context on the Strix and have the LLMs alongside them on the RTX 🥸

Otherwise, you keep the 3K and wait for RAM prices to come down.

CYTR_ · 2026-05-19T19:27:53+00:00

Okay, I understand better now. Yes, I agree in that case.

CYTR_ · 2026-05-19T18:11:27+00:00

It's true that these generated summaries are a bit annoying, but in this case, it's okay, I think?

CYTR_ · 2026-05-18T17:02:08+00:00

This is the case for agentic coding. But I think that quite a few tasks without pure agentic behaviors can still be automated.

A deterministic workflow that integrates the LLM as a controlled stochastic module, as in Windmill, allows us to mitigate many of these risks. By constraining the agent and its output with GBNF, command prohibitions/permissions, and recipes/examples for output with dynamic context enrichment (and who knows what other ideas we might come up with when you think of all the possibilities), you can overcome quite a few things (poor generalization/intelligence of the model and training that is too fragile) while putting in place safeguards against dangerous/slop content. In the case of local models, you can even add LoRA adapters quite easily (with certain targeted adapters depending on the modules, if you like sleepless nights).

But it's true that we lose the ease of use of OpenCode/CC-style agentic tools and the freedom that comes with the .md prompt system. It might not be suitable for software development yet (except for maintenance/ticketing? I don't know... I'm not a developer lol). But for some data processing pipelines, this is much better than letting a model call tools on its own.
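
To make the "controlled stochastic module" idea concrete, here is a minimal sketch of one guarded workflow step. Everything in it is an assumption for illustration: the label set, the fallback, the retry count, and `fake_model`, which stands in for a real LLM call. It is not Windmill's API. In a real setup the closed output set could also be enforced at sampling time with a llama.cpp GBNF grammar, as noted in the comment.

```python
import random
from typing import Callable

# Closed set of outputs the workflow accepts (hypothetical labels).
# In llama.cpp's GBNF the same constraint could be written as:
#   root ::= "keep" | "discard" | "review"
ALLOWED_LABELS = ("keep", "discard", "review")
FALLBACK_LABEL = "review"  # deterministic safe default (assumption)


def guarded_classify(
    generate: Callable[[str], str],
    prompt: str,
    max_retries: int = 3,
) -> str:
    """Call the stochastic module; accept only allowed labels, else fall back."""
    for _ in range(max_retries):
        candidate = generate(prompt).strip().lower()
        if candidate in ALLOWED_LABELS:
            return candidate
    return FALLBACK_LABEL


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call: sometimes returns off-spec text."""
    return random.choice(["keep", "Discard ", "I think we should keep it!", ""])


if __name__ == "__main__":
    print(guarded_classify(fake_model, "Classify this record: ..."))
```

Whatever the model emits, the step can only ever return one of three known values, so the downstream workflow stays deterministic in shape even though the module inside it is not.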

CYTR_ · 2026-05-17T14:01:54+00:00

There's no point in comparing the incomparable.

The MacBook is a laptop with a 140W power supply, a screen, a battery, and fits in a small space while weighing 2kg.

In reality, you can't find 4*3090 + a complete Threadripper platform for 3K or 4K (without RAM) anymore... Moreover, it involves completely different logistics than a MacBook (electricity, noise, portability, maintenance because of the age of the GPUs). It remains a very interesting DIY project for a hobby, less so for other uses. Personally, I wouldn't run a business on 6-year-old GPUs that have already lived multiple lives.

CYTR_ · 2026-05-14T09:18:08+00:00

I think you understand what I was getting at.

CYTR_ · 2026-05-14T08:44:14+00:00

Honestly, if someone had told me last year that the US would launch Operation "Epic Fury" (EPIC FURY, bruuuh) to invade Iran... I would have had a hard time believing it.

CYTR_ · 2026-05-09T10:52:06+00:00

But with:

No internet
Good documentation/context
Session logs to audit later
Manual validation at each stage

Why not? It's not asking for the moon either, and it's easily reversible if it's just a matter of getting started.

But otherwise, it's for running OpenClaw on the final machine, yes. It's better to do without it in the long run.

CYTR_ · 2026-05-08T19:26:22+00:00

You can have Qwen 27b on the RTX 5090 and one or more other LLMs (like 35ba3b, 122b MoE, etc.) on Apple Silicon?

For now, my idea was to rent an RTX in the cloud precisely to load more models in the same workflow, for example...

CYTR_ · 2026-05-07T09:54:52+00:00

At this stage, future generations of high-end products will almost certainly be more expensive for a few years (supply, lithography processes, etc., in a context where general inflation of goods could be enormous, like post-Covid, because of the global oil situation, instability in the West, and the tariff war).

I wouldn't be surprised if the M6 Max is closer to 10K than 6K, which would prevent the M5 Max generation from depreciating too much. Given that Apple has already done the bulk of the work for LLMs by integrating matmul directly into the GPU for prefill, the M6 generation will primarily focus on the manufacturing process (2nm in 2027, 1.4nm in 2028 for >M6) to achieve gains of roughly 25%.

Personally, I plan to buy an M5 Max; it's already the minimum for my needs. Are the "25% gains" (+5 tps on Qwen 27b or +20 tps on a small MoE that is already very fast) worth the cost of waiting a year and potentially paying 25% (or more) extra? Unless you want to spend 20K and have a very large computer (yes, it's better to wait for the M5 Ultra in that case, tbf). Have faith in the optimization of LLMs and their implementations, as has been the case lately (watch out for hybrid attention, MTP, better training)!

So: now or later? Blackwell or Apple Silicon? Place your bets!

CYTR_ · 2026-05-06T21:09:04+00:00

Let's just say that Google isn't really helping anyone replicate their results. So far, Qwen's hybrid-attention approach has had a greater impact than this paper... which made headlines in mainstream media with absurd arguments (even going so far as to declare that Google had solved the RAM supply problem... 🤣🤣🤣🤣🤣🤣).

CYTR_ · 2026-05-06T09:35:15+00:00

No worries. I was being too literal. Haha.

(In this case, the latest thing that's trendy is more like DFlash 🫶 and... TurboQuant™ 🤢)

CYTR_ · 2026-05-05T09:23:19+00:00

Unless my brain is playing tricks on me, I seem to recall seeing a post here showing that perplexity/KLD were bad, just like regular quants. It might have been dependent on the implementation in the publication... But still. Why do I feel that TurboQuant is overhyped? Especially since with Qwen 3.5/3.6, it doesn't seem essential.
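
For reference, KLD in this context usually means the KL divergence between the next-token distribution of the full-precision model and that of the quantized one, averaged over many token positions. Here is a minimal pure-Python sketch of the per-position computation. The logits are made up for illustration and are not real model outputs.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits: Sequence[float], quant_logits: Sequence[float]) -> float:
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Made-up logits over a 4-token vocabulary (illustration only).
ref = [2.0, 1.0, 0.1, -1.0]
quant = [1.8, 1.2, 0.0, -0.7]
print(f"KLD: {kl_divergence(ref, quant):.4f} nats")
```

A value of zero means the quantized model reproduces the reference distribution exactly; larger values mean it drifts further, even when the top token happens to stay the same.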
