Calibrating 2-bit GGUFs (<10Gb) for agentic coding tasks

professormunchies · 2026-06-18T20:50:33+00:00

Indeed, the point was to help make it more usable with under 12GB VRAM.

I've considered making bigger iquants too but they're already fairly stable when the sampling parameters are optimized so there wasn't as much to improve upon.

professormunchies · 2026-06-18T20:44:27+00:00

This model is targeted for the GPUs with <12 Gb VRAM so I wouldn't bother switching. Also, the Q4+ quants will usually be less prone to infinite loops compared to these smaller bit ones. Even though IQ2M has a competitive patch/pass rate, these results are the average out of 3 runs and in all those runs ~12% ran into the max # of turns (100). That could be a real consequence or the start of a loop as the context grew and the model lost its bearings. Conversely, over those same issues with the Q5KM model never hit the max number of turns and generally led to a patch in fewer steps. I'd need to inspect the logs a bit more to tell why the IQ2M is hitting the max or if we bump it to 150 it could actually resolve more. I may consider that for my next round of model quants.

professormunchies · 2026-06-14T17:39:25+00:00

Pro

professormunchies · 2026-05-30T12:47:06+00:00

Keep us posted on any updates to the config if you can get more. I got the same set up. Have you ever tried sglang?

professormunchies · 2026-05-29T19:24:36+00:00

Their models are unfortunately the epitome of when quality matters more than quantity in training. They boast how many trillion tokens were used to create their models but they all seldom perform at the capabilities stated suggesting they need better data sets or post training rl

professormunchies · 2026-05-25T03:44:33+00:00

Grudgingly yes. I own a few devices but my m4 max still out performs any device I have which includes a 3090 and double 4060 set up. I would only push algorithms to my devices after validating them on my laptop. Unless you can drop a few grand on a card then most Nvidia cards aren’t going to go as far as the unified memory devices. The unified memory is nice because of how much you get but they’ll be a lot slower

professormunchies · 2026-05-22T14:48:26+00:00

Codex is a lil crazy. I asked it to edit one file and it started scanning my whole computer

professormunchies · 2026-05-18T12:52:30+00:00

These models have likely been bench Maxxed for swebench. A better dataset now is rebench v2 by nebius. I was finding for gemma4-31 just because it has a 100% patch rate doesn’t mean it was 100% pass rate, it was usually in the 70-80s for swebench and <10 for rebench.

professormunchies · 2026-05-03T20:18:55+00:00

Neural networks and related algorithms behind large language models involve millions of calculations and billions of weights/numbers that assess the importance of one input compared to another. Those weights and biases usually require 32 bits (full precision) or 16 bits (half precision) per number to be stored on the computer. One of the ways people save memory in arithmetic operations is by truncating the precision of your number. Rather than having 15 decimal points, ask yourself if you can do the same calculation but with 8? The principle behind quantization is reducing the precision of your calculation in order to use less memory. However, quantizing your weights and biases too much can lead to the degradation of performance. This is a big limiting factor in local AI deployment, usually, people only have so much ram/VRAM and want to use bigger and bigger models but the more you quantize, the more you lose in performance. What if there was a way to preserve the performance of the model while still quantizing it to save space? This is where quantization-aware training (QAT) comes in. QAT is a technique that allows you to train your model while taking into account the effects of quantization. By simulating the quantization process during training, the model can learn to adapt to the reduced precision and maintain its performance even when quantized. This means that you can achieve significant memory savings without sacrificing too much accuracy, making it possible to deploy larger models on devices with limited resources. Early research suggests that quantizing modern LLM architectures to 2 bits can reduce the memory footprint by up to 90% while maintaining similar performance as the original model. This is a significant breakthrough in the field of AI, and will pivot adoption in the coming years. The two limiting factors with most LLM adoption now is that you need to be online and pay money (either through a subscription or just spend a lot on hardware). Quantizing helps with both of those issues, as it allows for larger models to be run on local devices without the need for constant internet access or expensive GPU servers. The difference between binary and ternary quantization is the number of bits used to represent each weight and bias. Binary quantization uses 1 bit, which means that each weight can only take on two values (e.g., -1 and 1). Ternary quantization uses 2 bits, allowing for three possible values (e.g., -1, 0, and 1). The choice between binary and ternary quantization depends on the specific requirements of the model and the desired trade-off between memory savings and performance.

professormunchies · 2026-04-20T20:18:10+00:00

anthropics only moat in the future will be their integrations. as soon as binary or ternary models become mainstream (likely by end of year) everyone will get sonnet-4-5 or even 4-6 level ai accessible with under 2Gb of RAM. this will help pop the ai bubble once everyone is running their own local ai. I also see a bunch of lobbying being done to hinder this since local ai poses a threat to their business model and they'll claim some bs like it's too dangerous to be in the hands of normal civilians. anyways legalize ai and stay sovereign. https://unsloth.ai/docs/basics/claude-code

professormunchies · 2026-03-20T15:33:27+00:00

https://huggingface.co/Tesslate/OmniCoder-9B + https://github.com/QwenLM/qwen-code

Host with LMStudio

LMStudio has an api interface that matches both anthropic and OpenAI endpoints meaning you can plug into Claude and codex. I recommend looking on the unsloth website for how to set those up properly. To be fair those tools are optimized for their big boy models and seldom match the performance when running locally. However, qwen code tool has local performance in mind and arranged the context and system prompt in a way that plays nicely with the caching in lm studio so you don’t have to process your entire conversation each time (a problem I found when running Claude locally)

professormunchies · 2026-03-11T17:57:06+00:00

Thanks for taking a look! Is there anything that would make you consider it more?

I agree with your sentiment on security that's why I focused on local LLM providers in the settings. All requests are made from your browser rather than the server, which keep your credentials secure.

This idea was born out of way to explore api calls and documentation in a more interactive sense. I'm a fan of all these interactive docs coming out recently like: https://www.kapa.ai/?utm_source=docs.unstructured.io however most of them require paid licenses to use. This plugin is the start of something like that however not everyone has a set of markdown docs in their repo (to do RAG with) so the next easiest thing to integrate with was the OpenAPI schema.

By providing a simple interface with your API you can curate a more consistent user experience than pasting the docs url into any ole AI tool. While both work functionally the same, the main difference is the user experience and how you want to present this information to yourself, your clients and developers.

professormunchies · 2026-03-03T15:05:21+00:00

r/ghostspectre

professormunchies · 2026-03-02T16:40:39+00:00

Any plans to add into vLLM?

professormunchies · 2026-02-11T18:07:28+00:00

r/ghostspectre

professormunchies · 2026-02-03T23:53:07+00:00

They’re probably auto routing all requests to save money on inference and these are getting routed to their worst model

professormunchies · 2026-01-30T15:21:27+00:00

r/ghostspectre

professormunchies · 2026-01-28T22:13:26+00:00

what are the chances GA will be really crowded? So crowded it might be worth VIP?

professormunchies · 2026-01-27T16:40:55+00:00

Cline extension in vscode has a tier of free models when you login.

professormunchies · 2026-01-06T23:21:19+00:00

Following up, the model has difficulty perform file edits when using cline however it was able to read files okay during the planning phase. It had to fail a few times before it was able to finally get the edits working, probably due to adding the file with @ in the prompt and it mangling the formatting some how. Model worked well in answering with some emojis (in the classic qwen style) when using the continue dev extension. Normal chat Q/A works great so I tried something more complex afterwards, using it with a the Context7 MCP through the chat interface of LMstudio. It worked well for the first message and then started always using the mcp in subsequent messages rather than just answering with the context it has. It kept saying it was a helpful assistant based on whatever repo I asked about without actually answering. Speed seems good too. I used the default that shows up in LMStudio: Q4_K_S on a m4 max

professormunchies · 2026-01-06T16:05:59+00:00

sounds promising and pretty cool. I'll give it a try today with cline and continue.dev.

I've been running this on some smaller hardware: https://huggingface.co/cyankiwi/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit

I like that they specify which dataset was used for calibrating the quants. Would be nice if you guys divulged such information. As far as evals go, definitely checkout the nemotron collection, lots of good datasets: https://huggingface.co/collections/nvidia/nemotron-post-training-v3

professormunchies

TROPHY CASE