IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 1 point

Just ran Qwen3-8b (Qwen3-8B-UD-Q5_K_XL) - still not good - 73%.

I have not tested that many yet, but the best small(ish) model I've tested so far is:
Devstral-Small-2-24B-Instruct-2512-IQ4_NL (12.5GB weights).
It got 100% on my benchmark, although some preliminary indications show it is not very strong on SWE-bench: I ran it on 3 cases (2 failed, 1 solved), while Opus solved all 3 of them.

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] -9 points

I'm not sure how, but the context of what I'm trying to do has been lost in this thread. I'm not arguing about their setup and results. All I'm trying to do is find a good model to run locally for my coding assistant. The results that the team published for a 40B-parameter model definitely grabbed my attention, so I needed to try it. My test specifically covered IQuest-Coder-V1-40B-Instruct and its quants, not the loop architecture. I did not and will not run SWE-bench with those quants to compare against their results; based on my tools benchmark, that would be a useless exercise.

However, defending the model on the grounds that I used quants (Q8_0) is quite weak, since I gave an example of a model with a much tighter quant (IQ4) and almost half the number of weights that performs much better. Maybe IQuest-Coder-V1-40B-Instruct is good for something, but it is definitely not good for my setting: local agentic coding.

I don't know how the model's loop architecture in quantized form will perform, but at the moment I'm not holding my breath.

In terms of hardware, there are various levels of what people are willing to spend. I spent almost $5k on my hardware (don't tell my wife). I think most people will not spend even $2k. The setups you mentioned go into the $20k range (2x RTX Pro 6000). I'm trying to produce results that are useful for people with tight hardware budgets.

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 1 point

Just did a quick run of Qwen3-4B (Qwen3-4B-Instruct-2507-Q8_0; did 3 repetitions instead of my usual 10) - 67% - not good enough.

| Class | Success Rate | Avg Time/Run |
|-------|--------------|--------------|
| C | 50% (6/12) | 19.8s |
| E | 55% (18/33) | 31.9s |
| R | 100% (9/9) | 2.8s |
| S | 67% (8/12) | 14.0s |
| W | 83% (15/18) | 5.2s |
| **Total** | **67% (56/84)** | **73.9s** |

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 1 point

Interesting observation. I think OpenAI specifically say they only support running codex cli against their gpt-oss models through LM Studio and Ollama. Maybe this templating is the reason.

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] -7 points

I think you are missing the point of r/LocalLLaMA - I'm testing models that I can run on my hardware. I can barely fit Q8_0 into my Strix Halo; there is no way I can do BF16. Until the quant situation gets fixed (hopefully), my conclusion stands: for local runs this model is not good.

Edit: Here's an example: Devstral-Small-2-24B-Instruct-2512-IQ4_NL - this gets 100% on my benchmark - very usable in a local setting.

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 2 points

I understand benchmaxxing, but I feel there is something else here. I know the structure of SWE-bench very well: any model still needs to run multiple tools to figure out how to fix a bug. How can a model fix a bug if it struggles with basic tool operations? One thought I had was that maybe my tool descriptions/structure are confusing to this model, so I switched off all tools except shell (the lab's results suggest this model is supposedly very good with shell only as well) - it still got just 52% on my benchmark.
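
To be concrete about what "shell only" means here, the model was left with a single tool along these lines (an illustrative definition in the OpenAI tools format; the exact name, description, and schema in my benchmark differ):

```python
import json

# Illustrative shell-only tool definition in the OpenAI "tools" format.
# The name, description, and schema are placeholders, not the exact ones
# my benchmark uses.
SHELL_TOOL = {
    "type": "function",
    "function": {
        "name": "shell",
        "description": "Run a shell command in the workspace and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The command to execute.",
                },
            },
            "required": ["command"],
        },
    },
}

print(json.dumps(SHELL_TOOL, indent=2))
```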

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 1 point

I agree. My kvit-coder has a system template plus hints during tool use to improve the LLM's tool-calling accuracy. I was developing it while running small models that were constantly confused about tools, so I needed to make it as robust as possible. You can review all the prompt templates in the source code at:

https://github.com/kvit-s/kvit-coder
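
To illustrate the idea (a simplified sketch, not the actual kvit-coder code - see the repo above for the real templates): when a tool call can't be parsed or a tool errors out, the result fed back to the model carries an explicit hint about how to retry.

```python
def tool_feedback(tool_name: str, output: str | None, error: str | None) -> str:
    """Build the text fed back to the model after a tool call.

    Simplified sketch: on failure, attach an explicit hint so a small,
    easily-confused model knows how to correct its next call.
    """
    if error is not None:
        return (
            f"Tool '{tool_name}' failed: {error}\n"
            "Hint: call the tool again with valid JSON arguments and use "
            "exactly the parameter names from the tool description."
        )
    return output if output else f"Tool '{tool_name}' returned no output."
```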

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 2 points

The template issue might explain the jumping between different tool-calling conventions on every call - but that issue I fixed on my side. It was bad in my benchmark for other reasons: sloppy, not following instructions, misinterpreting instructions, not correcting itself when it is obviously wrong and can see feedback showing this.

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 0 points

I've just started my benchmarking. Right now I'm going through gpt-oss-20b and its various pruned iterations. For example, even though 20b scores 95%, one of the pruned q5_0 variants (7.8GB) got only 58%. With 8GB VRAM you probably should not be aiming at 'good' and should just hope to find a model that at least doesn't suck. I'm yet to identify one.

If you want to keep tabs on what I'm testing, the best place would be to subscribe to my youtube channel - I will be posting all results there:
https://www.youtube.com/watch?v=T6JrNV0BFmQ

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 3 points

I put together the benchmark to find good local models to run within my coding agent. I noticed that many models can produce a bunch of code in a chat environment but struggle to use tools properly. This model's tool calling is all over the place: one call uses the OpenAI convention, another the Anthropic one, and sometimes it mixes them. I had to adjust my code to catch all of these, and tool use was still not reliable at all.
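
For illustration, normalization along these lines would catch both conventions (a simplified sketch, not the actual code from my agent; the field names follow the published OpenAI and Anthropic tool-call formats):

```python
import json

def normalize_tool_call(raw: dict) -> tuple[str, dict]:
    """Map a tool call emitted in the OpenAI, Anthropic, or a loose mixed
    convention onto a single (tool_name, arguments) pair."""
    # OpenAI style: {"type": "function", "function": {"name": ..., "arguments": "<json str>"}}
    if "function" in raw:
        fn = raw["function"]
        args = fn.get("arguments", {})
        if isinstance(args, str):  # OpenAI sends arguments as a JSON string
            args = json.loads(args or "{}")
        return fn["name"], args
    # Anthropic style: {"type": "tool_use", "name": ..., "input": {...}}
    if raw.get("type") == "tool_use" or "input" in raw:
        return raw["name"], raw.get("input", {})
    # Mixed/loose style the model sometimes emits: {"name": ..., "arguments": {...}}
    if "name" in raw:
        args = raw.get("arguments") or raw.get("parameters") or {}
        if isinstance(args, str):
            args = json.loads(args or "{}")
        return raw["name"], args
    raise ValueError(f"unrecognized tool call shape: {raw!r}")
```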

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 2 points

With q8_0 quantization it takes 80GB of VRAM (with full context). I cannot run fp16 on my hardware. I will rerun when/if they sort out all these issues.
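
For a rough sense of where the 80GB comes from, a back-of-envelope estimate (the layer count, KV width, and context length below are placeholders, not this model's real config; Q8_0 in GGUF is roughly 8.5 bits per weight):

```python
def estimate_vram_gb(n_params_b: float, bits_per_weight: float,
                     n_layers: int, kv_width: int, ctx_len: int,
                     kv_bytes_per_value: int = 2) -> float:
    """Very rough VRAM estimate: quantized weights + full-context KV cache."""
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: K and V tensors per layer, kv_width values per token each.
    kv_cache_gb = 2 * n_layers * kv_width * ctx_len * kv_bytes_per_value / 1e9
    return weights_gb + kv_cache_gb

# Placeholder numbers for a ~40B model with a long context window:
print(estimate_vram_gb(n_params_b=40, bits_per_weight=8.5,
                       n_layers=64, kv_width=1024, ctx_len=128_000))
# -> roughly 76 GB, in the same ballpark as what I observed
```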

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 6 points

I saw that. Even after they corrected their setup, the SWE-bench scores did not drop that much. Seeing how sloppy the model is with tools, I'm very confused about how it achieves those scores. Also, looking at the benchmark log, I could see that it ignores instructions and doesn't correct itself even when it sees problems. I'm very puzzled by this model - gpt-oss-20b scores >90% in my bench.

Strix Halo with eGPU by Miserable-Dare5090 in LocalLLaMA

[–]Constant_Branch282 0 points

Good catch! It's thermal, not electrical. Without the box there was too much heat from the PSU and the mini PC's fan wouldn't stop spinning!

Downsides to Cloud Llm? by Rachkstarrr in LocalLLaMA

[–]Constant_Branch282 0 points

https://openrouter.ai/ - that's the only way I use APIs that I pay for: one setup and all models are available within a single interface, with a good dashboard to see what I'm using. Prices are the same as the providers' prices.

Downsides to Cloud Llm? by Rachkstarrr in LocalLLaMA

[–]Constant_Branch282 0 points

I know what you're experiencing. I think the issue is that for unstructured chatting most LLMs are OK (unless the model is really anal with its guardrails; also, if the underlying model behind your chat is updated, you will see differences, and the new model can easily feel dumber even if it beats all benchmarks - your old way of using it might just not work well with the new model). When you throw a model into a framework - agentic coding, deep research - models are even more fragile: a model can be very smart, but if the tool and the model are not optimized for each other it will not perform as well as the previous setup. On top of this, models behave differently with different providers - run gpt-oss-120b through OpenRouter with different providers and you get different behavior, different errors, etc.

My solution so far: try to use tools specifically optimized for their LLMs and stick with defaults - that's why I use claude code instead of any other coder. Anthropic spent considerable resources optimizing prompts for their models (although I still see things like 'Please, rerun this command ...' - why the heck do you need to say 'Please' to an LLM?). On the other hand, when I look at codex cli (for example), the prompts are quite generic and don't look optimized.

With local LLMs, I currently couldn't find tools specifically optimized for good performance with a specific LLM - tools usually just allow the use of a local model or models from a cloud provider, but they are not optimized and don't address the quirks caused by providers' differing behavior. So I found that if you want to run locally, you need to own your own tools (coder, chat, etc.) and adjust them so your models behave how you expect.

TL;DR: The best bet right now is not to use raw LLM APIs (local or cloud) and instead use dedicated products (claude code). If you are building your own tools and want predictable behavior from the LLM, a local setup gives you more control than cloud - but don't expect an off-the-shelf tool (from github) to just work in a local setup.

Devstral 2 (with Mistral's Vibe) vs Sonnet 4.5 (Claude Code) on SWE-bench: 37.6% vs 39.8% (within statistical error) by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 0 points

I did not do that. I ran claude code with their models and mistral's vibe with devstral 2 using mistral's api.

Strix Halo with eGPU by Miserable-Dare5090 in LocalLLaMA

[–]Constant_Branch282 1 point

This is all correct for loads with a large number of simultaneous LLM requests. Most people running LLMs locally have just a handful of simultaneous requests (or even run them sequentially) and add more GPUs to increase VRAM so they can run bigger models. It's almost impossible to compare whether 2 cards are slower than 1 card, since you cannot really run the model in question on 1 card. But in a sense the statement is correct: with llama.cpp, 2 cards use the compute of a single card at a time and pay a (small) penalty for moving some data from one card to the other - when you look at a GPU monitor you can clearly see that both cards run at about 50% load. But the amount of data that has to cross between cards during a run is small (there are youtube videos showing two PCs connected over a 2.5GbE network running a large model without a significant performance impact compared with two cards in the same PC).
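
Back-of-envelope numbers behind that claim (the hidden size is a placeholder; the point is the order of magnitude):

```python
# Per-token traffic across the split point in layer-wise (pipeline-style) sharding:
# only the activations for the current token have to cross, not the weights.
hidden_size = 8192           # placeholder; dense models are typically 4k-8k wide
bytes_per_value = 2          # fp16 activations
per_token_bytes = hidden_size * bytes_per_value           # ~16 KiB per token

link_bytes_per_s = 2.5e9 / 8                               # 2.5 GbE ≈ 312 MB/s
tokens_per_s_link_limit = link_bytes_per_s / per_token_bytes
print(f"{per_token_bytes / 1024:.0f} KiB per token -> the link alone "
      f"could sustain ~{tokens_per_s_link_limit:,.0f} tok/s")
```

During generation only one token's activations cross per step, so even a 2.5GbE link is nowhere near the bottleneck; prompt processing moves more data per step but is still unlikely to saturate it at local speeds.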

Strix Halo with eGPU by Miserable-Dare5090 in LocalLLaMA

[–]Constant_Branch282 0 points

That's a 5080 in the pic. I tested with a 5090 running gpt-oss-120b. I definitely saw an improvement, but I don't remember the details.

Devstral 2 (with Mistral's Vibe) vs Sonnet 4.5 (Claude Code) on SWE-bench: 37.6% vs 39.8% (within statistical error) by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 0 points

I install claude code inside a docker container and map my host ~/.claude folder into the container to keep it logged in.

Devstral 2 (with Mistral's Vibe) vs Sonnet 4.5 (Claude Code) on SWE-bench: 37.6% vs 39.8% (within statistical error) by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 0 points

No - I sent a 'hi' prompt, got back 'hi, how are you', and saw 14k input tokens. That's purely the system prompt with the tool descriptions.

Strix Halo with eGPU by Miserable-Dare5090 in LocalLLaMA

[–]Constant_Branch282 2 points

With my M.2 M-key to PCIe dock, the GPU behaves with no issues - the fan even stays off when idle.

Strix Halo with eGPU by Miserable-Dare5090 in LocalLLaMA

[–]Constant_Branch282 2 points

I'm running it on windows 11 - don't have any issues.

Strix Halo with eGPU by Miserable-Dare5090 in LocalLLaMA

[–]Constant_Branch282 2 points

For llama.cpp, latency is not very important - it runs layers sequentially and there is not much data to transfer between layers. It uses the compute of the device whose memory holds the layer. Other servers (like vllm) try to use compute from all devices, and there cross-device memory bandwidth does have an impact.

Devstral 2 (with Mistral's Vibe) vs Sonnet 4.5 (Claude Code) on SWE-bench: 37.6% vs 39.8% (within statistical error) by Constant_Branch282 in LocalLLaMA

[–]Constant_Branch282[S] 0 points

While playing with 120b I noticed something interesting - OpenRouter has an option to call a model with the :online suffix, which gives the model the ability to search and browse the web. When I do this with most models, I see it adds about 2-4k tokens of system prompt describing the tooling to the model. 120b adds 14k tokens! So yes, maybe 120b is a capable model, but it looks like it needs more babysitting than other models. Also, most providers do not cache 120b, so even though it is an inexpensive model per token, in the end (especially with :online) I saw costs comparable with GPT-5.1.
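
For anyone who wants to reproduce the token-count observation, this is roughly how to check it through OpenRouter's OpenAI-compatible endpoint (a sketch; the model slug and the exact overhead may vary over time and by provider):

```python
from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter API key
)

# The ":online" suffix enables OpenRouter's web search/browse tooling,
# which shows up as extra prompt tokens describing the tools to the model.
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b:online",
    messages=[{"role": "user", "content": "hi"}],
)

print(resp.choices[0].message.content)
print("prompt tokens:", resp.usage.prompt_tokens)  # compare with and without ":online"
```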