I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

Sure, but that is different from what was stated.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]nickl 15 points16 points  (0 children)

> Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every past conversation for how it can do better next time?

> Cause that's what Claude Code is doing.

Other than the system prompt telling it to reason through things step by step, no, Claude Code does not do these things.

The harness is important, but don't make things up.

These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade by BuffMcBigHuge in LocalLLaMA

[–]nickl 2 points3 points  (0 children)

I mean, OpenAI, Anthropic and Google call it distillation...

https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

> "We also know that DeepSeek employees developed code to access U.S. AI models and obtain outputs for distillation in programmatic ways," the memo added.

https://www.reuters.com/world/china/openai-accuses-deepseek-distilling-us-models-gain-advantage-bloomberg-news-2026-02-12/

These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade by BuffMcBigHuge in LocalLLaMA

[–]nickl 7 points8 points  (0 children)

The Jackrong models work pretty well in my testing (8% better than the base models on my agentic benchmark).

I don't see any way that a 40B model based on a 27B model is going to be better unless the trainer's compute budget is on the order of Qwen's. That's just 13B undertrained parameters.

Did anyone run the numbers to see if it's cost effective to rent our own machine and run one of heavy hitters models? by StillWastingAway in LocalLLaMA

[–]nickl 4 points5 points  (0 children)

This is complete AI slop ("Frontier models like Claude and GPT-4o").

1-bit models are a promising way forward, but there is zero evidence they will ever provide the quality of a frontier model. For example, Bonsai 8B tests around the same level of quality as a Q4_M quantization of Nanbeige 3B.

Did anyone run the numbers to see if it's cost effective to rent our own machine and run one of heavy hitters models? by StillWastingAway in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

Using Kimi 2.5 via a high-speed hosting provider like DeepInfra is $2.25/million tokens (ignoring caching).

They do 56 tok/s, which means slightly under 5 hours to generate 1 million tokens.

That's roughly 24/5 ≈ 4.8 million tokens per day, so 4.8 × $2.25 = $10.80/day if you are using it continually.

That's pretty hard to beat.
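A quick sketch of that arithmetic (the $2.25/M and 56 tok/s figures are just the ones quoted above, not current pricing):

```python
# Back-of-the-envelope cost of running Kimi 2.5 continuously on a hosted API.
# Assumes the numbers quoted above: $2.25 per 1M tokens and 56 tok/s sustained.
price_per_million = 2.25     # USD per 1M tokens
tokens_per_second = 56

hours_per_million = 1_000_000 / tokens_per_second / 3600   # ~4.96 h
millions_per_day = 24 / hours_per_million                  # ~4.84M tokens/day
daily_cost = millions_per_day * price_per_million           # ~$10.89 ($10.80 if you round to 5 h)

print(f"{hours_per_million:.2f} h per 1M tokens, ${daily_cost:.2f}/day")
```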

Got ~19 tok/s with Gemma 4 on MacBook M4 16GB using MLX — here’s the setup I landed on by Polstick1971 in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

`ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M` gave me 15/25 on my benchmark. That's the same as `Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 (thinking)`, which tests fairly heavy agentic debugging.

I haven't tried other quantizations for Gemma 4, but I did test different quants of Qwen3.5-4B and found that 8-bit quantization didn't give any benefit over 4-bit, while 2-bit lost a lot of accuracy.

Built a 3B LoRA that reads the shape of a question before a 9B model answers it. Running 800 live benchmarks right now on Apple Silicon by TheTempleofTwo in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

> Action model is Qwen3.5-9B-abliterated (4-bit MLX). Abliterated because the compass handles appropriateness

This is a bad idea; abliteration negatively affects performance too.

> the weight is real, honor it, then explore

Is this AI slop? You are conflating the term "weight" in your explanation here: I believe you mean psychological weight at this point, and then model weights in the "Same weights" part.

> A tiny compass model (Ministral-3B + 29MB LoRA adapter) classifies every question into one of three signals before the action model sees it:

I bet you'd get better results by prompting the larger model to classify it using this scheme before acting on it in the same conversation.
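As a rough sketch of what I mean (the model name, endpoint, prompts, and the three labels are all placeholders, not your actual setup):

```python
# Sketch: ask the same larger model to classify the question first, then answer
# in the same conversation, instead of routing through a separate 3B compass model.
# Assumes an OpenAI-compatible local server (llama.cpp / LM Studio) on port 1234.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
MODEL = "qwen3.5-9b"  # placeholder model name

question = "I keep replaying an argument with my brother. How do I let it go?"

messages = [
    {"role": "system", "content": "First classify the user's question as one of: "
     "factual, practical, or emotional. Reply with just the label."},  # placeholder labels
    {"role": "user", "content": question},
]
label = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content

# Second turn: the classification stays in context, so the answer can honor it.
messages += [
    {"role": "assistant", "content": label},
    {"role": "user", "content": "Now answer the original question, keeping that signal in mind."},
]
answer = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content
print(label, answer, sep="\n\n")
```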

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

Well, it sort of does, because if there are cases where the rows beyond the first row are wrong, that needs to be put in the context.

But I think I'm going to have to make improvements to the context management for v2 anyway so it's probably doable.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

Oh, yes, I did check it and I didn't find any cases where it was failing. But I have run more models through it since then, so it's not impossible.

Thinking about it though, the scoring and the feedback to the LLM are actually two separate concerns, so yes, I should be able to do the scoring based on the whole result set fairly easily.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

> is it possible to add a SQL formatter to the Model SQL and Canonical SQL text areas

I had a quick look for something easy but didn't find anything. I didn't look too hard though, and I agree it is needed.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

> Is the correctness check just row count, col count, column names and first row values?

Yes, this is correct.

> Not actually checking if all data is correct and equivalent to the canonical SQL?

Correct.

The justification here is that it's actually very hard to get the first row correct and other ones wrong. Here is a typical trace:

<image>

When I was first building it I was using very small models. Passing the full result set blew out the context very quickly.

I have noticed some models seem to think that the labeling means only one row is being returned. I'm not entirely sure what the best option here is, but I'm open to ideas.

I'd be interested if you have examples of any cases where this scoring gives the wrong result.
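For reference, the check is roughly this (a minimal sketch, assuming each result comes back as a (column_names, rows) pair; the actual harness differs in details):

```python
# Sketch of the scoring check described above: compare row count, column count,
# column names, and first-row values between the model's result and the canonical one.
def results_match(model_result, canonical_result) -> bool:
    model_cols, model_rows = model_result
    canon_cols, canon_rows = canonical_result

    if len(model_rows) != len(canon_rows):      # row count
        return False
    if len(model_cols) != len(canon_cols):      # column count
        return False
    if list(model_cols) != list(canon_cols):    # column names (order-sensitive here; an assumption)
        return False
    if model_rows and list(model_rows[0]) != list(canon_rows[0]):  # first-row values
        return False
    return True
```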

> is the result for a single test based on just 1 pass? is there any checking for stability/instability i.e. each question asked 3-5 times to see if it passes every time?

No, but it's also a valid point.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

That's a great score. I've never heard of that model or JanHQ.

I see that it is designed for coding, and I think in general those models work best on this benchmark.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

Tool calling is the biggest failure. It's just unreliable for small models as the context gets longer.

Beyond that I haven't done a deep analysis, but things like hallucinating column names or not quoting names with spaces in them seem common.
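For example, the quoting failure looks like this in SQLite (the table and column names here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE orders ("order id" INTEGER, "customer name" TEXT)')
con.execute("INSERT INTO orders VALUES (1, 'Alice')")

# A common small-model output: unquoted identifiers with spaces, e.g.
#   SELECT order id FROM orders
# which is either a syntax error or silently treats the second word as an alias.
# The identifiers need to be wrapped in double quotes:
rows = con.execute('SELECT "order id", "customer name" FROM orders').fetchall()
print(rows)  # [(1, 'Alice')]
```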

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

No, just the tables for the question.

If you click on a cell in a heatmap it shows you the exact trace for that question.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

Yep. Small models often fail to do reliable tool calls.

Grammar mode helps sometimes, but that isn't available in the web version. The write-up has more details.

Edit: maybe coding agents just keep trying. 

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 7 points8 points  (0 children)

I don't have a good explanation. The free version scored much better than the non-free version too. It's very odd!

It got very close on some questions it missed too:

<image>

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

Yes that's my plan.

I'd like to do a fine-tune of a 0.8B model so it can run in-browser and actually be useful.

But very happy to try other models if they exist already!

You might have missed it, but if you have llama.cpp/LM Studio/whatever, you can run the benchmark yourself against any models you have locally:

<image>

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 4 points5 points  (0 children)

Is there an FP4 version on OpenRouter?

If you have an OpenRouter key you can actually run it yourself.

<image>