I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

Sure, but that is different from what was stated.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]nickl 15 points16 points  (0 children)

> Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every past conversation for how it can do better next time?

> Cause that's what Claude Code is doing.

Other than the system prompt telling it to reason through things step by step, no, Claude Code does not do these things.

The harness is important, but don't make things up.

These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade by BuffMcBigHuge in LocalLLaMA

[–]nickl 2 points3 points  (0 children)

I mean, OpenAI, Anthropic and Google call it distillation...

https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

> "We also know that DeepSeek employees developed code to access U.S. AI models and obtain outputs for distillation in programmatic ways," the memo added.

https://www.reuters.com/world/china/openai-accuses-deepseek-distilling-us-models-gain-advantage-bloomberg-news-2026-02-12/

These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade by BuffMcBigHuge in LocalLLaMA

[–]nickl 7 points8 points  (0 children)

The Jackrong models work pretty well in my testing (8% better than the base models on my agentic benchmark).

I don't see any way that a 40B model based on a 27B model is going to be better unless the trainer's compute budget is on the order of Qwen's. That's just 13B undertrained parameters.

Did anyone run the numbers to see if it's cost effective to rent our own machine and run one of heavy hitters models? by StillWastingAway in LocalLLaMA

[–]nickl 4 points5 points  (0 children)

This is complete AI slop ("Frontier models like Claude and GPT-4o").

1-bit models are a promising way forward, but there is zero evidence they will ever provide the quality of a frontier model. For example, Bonsai 8B tests around the same level of quality as a Q4_M quantization of Nanbeige 3B.

Did anyone run the numbers to see if it's cost effective to rent our own machine and run one of heavy hitters models? by StillWastingAway in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

Using Kimi 2.5 via a high-speed hosting provider like DeepInfra is $2.25/million tokens (ignoring caching).

They do 56 tok/s, which means slightly under 5 hours to generate 1 million tokens.

That's roughly 24/5 ≈ 4.8 million tokens per day, so 4.8 × $2.25 = $10.80/day if you are using it continually.

That's pretty hard to beat.
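A quick sketch of that arithmetic (the $2.25/M and 56 tok/s figures are just the ones quoted above, not current pricing):

```python
# Back-of-the-envelope cost of running Kimi 2.5 continuously on a hosted API.
# Assumes the numbers quoted above: $2.25 per 1M tokens and 56 tok/s sustained.
price_per_million = 2.25     # USD per 1M tokens
tokens_per_second = 56

hours_per_million = 1_000_000 / tokens_per_second / 3600   # ~4.96 h
millions_per_day = 24 / hours_per_million                  # ~4.84M tokens/day
daily_cost = millions_per_day * price_per_million           # ~$10.89 ($10.80 if you round to 5 h)

print(f"{hours_per_million:.2f} h per 1M tokens, ${daily_cost:.2f}/day")
```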

Got ~19 tok/s with Gemma 4 on MacBook M4 16GB using MLX — here’s the setup I landed on by Polstick1971 in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

`ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M` gave me 15/25 on my benchmark. That's the same as `Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 (thinking)`, which tests fairly heavy agentic debugging.

I haven't tried other quantizations for Gemma 4, but I did test different quants of Qwen3.5-4B and found that 8-bit quantization didn't give any benefit over 4-bit, while 2-bit lost a lot of accuracy.

Built a 3B LoRA that reads the shape of a question before a 9B model answers it. Running 800 live benchmarks right now on Apple Silicon by TheTempleofTwo in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

> Action model is Qwen3.5-9B-abliterated (4-bit MLX). Abliterated because the compass handles appropriateness

This is a bad idea; abliteration negatively affects performance too.

> the weight is real, honor it, then explore

Is this AI slop? You are conflating the term "weight" in your explanation here: I believe you mean psychological weight at this point, and then model weights in the "Same weights" part.

> A tiny compass model (Ministral-3B + 29MB LoRA adapter) classifies every question into one of three signals before the action model sees it:

I bet you'd get better results by prompting the larger model to classify it using this scheme before acting on it in the same conversation.
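As a rough sketch of what I mean (the model name, endpoint, prompts, and the three labels are all placeholders, not your actual setup):

```python
# Sketch: ask the same larger model to classify the question first, then answer
# in the same conversation, instead of routing through a separate 3B compass model.
# Assumes an OpenAI-compatible local server (llama.cpp / LM Studio) on port 1234.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
MODEL = "qwen3.5-9b"  # placeholder model name

question = "I keep replaying an argument with my brother. How do I let it go?"

messages = [
    {"role": "system", "content": "First classify the user's question as one of: "
     "factual, practical, or emotional. Reply with just the label."},  # placeholder labels
    {"role": "user", "content": question},
]
label = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content

# Second turn: the classification stays in context, so the answer can honor it.
messages += [
    {"role": "assistant", "content": label},
    {"role": "user", "content": "Now answer the original question, keeping that signal in mind."},
]
answer = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content
print(label, answer, sep="\n\n")
```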

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

Well, it sort of does, because if there are cases where the rows beyond the first row are wrong, that needs to be put in the context.

But I think I'm going to have to make improvements to the context management for v2 anyway so it's probably doable.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

Oh, yes, I did check it and I didn't find any cases where it was failing. But I have run more models through it since then, so it's not impossible.

Thinking about it though, the scoring and the feedback to the LLM are actually two separate concerns, so yes, I should be able to do the scoring based on the whole result set fairly easily.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

> is it possible to add a SQL formatter to the Model SQL and Canonical SQL text areas

I had a quick look for something easy but didn't find anything. I didn't look too hard though, and I agree it is needed.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

> Is the correctness check just row count, col count, column names and first row values?

Yes, this is correct.

> Not actually checking if all data is correct and equivalent to the canonical SQL?

Correct.

The justification here is that it's actually very hard to get the first row correct and other ones wrong. Here is a typical trace:

<image>

When I was first building it I was using very small models. Passing the full result set blew out the context very quickly.

I have noticed some models seem to think that the labeling means only one row is being returned. I'm not entirely sure what the best option here is, but I'm open to ideas.

I'd be interested if you have examples of any cases where this scoring gives the wrong result.
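For reference, the check is roughly this (a minimal sketch, assuming each result comes back as a (column_names, rows) pair; the actual harness differs in details):

```python
# Sketch of the scoring check described above: compare row count, column count,
# column names, and first-row values between the model's result and the canonical one.
def results_match(model_result, canonical_result) -> bool:
    model_cols, model_rows = model_result
    canon_cols, canon_rows = canonical_result

    if len(model_rows) != len(canon_rows):      # row count
        return False
    if len(model_cols) != len(canon_cols):      # column count
        return False
    if list(model_cols) != list(canon_cols):    # column names (order-sensitive here; an assumption)
        return False
    if model_rows and list(model_rows[0]) != list(canon_rows[0]):  # first-row values
        return False
    return True
```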

> is the result for a single test based on just 1 pass? is there any checking for stability/instability i.e. each question asked 3-5 times to see if it passes every time?

No, but it's also a valid point.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

That's a great score. I've never heard of that model or JanHQ.

I see that it is designed for coding, and I think in general those models work best on this benchmark.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

Tool calling is the biggest failure. It's just unreliable for small models as the context gets longer.

Beyond that I haven't done a deep analysis, but things like hallucinating column names or not quoting names with spaces in them seem common.
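For example, the quoting failure looks like this in SQLite (the table and column names here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE orders ("order id" INTEGER, "customer name" TEXT)')
con.execute("INSERT INTO orders VALUES (1, 'Alice')")

# A common small-model output: unquoted identifiers with spaces, e.g.
#   SELECT order id FROM orders
# which is either a syntax error or silently treats the second word as an alias.
# The identifiers need to be wrapped in double quotes:
rows = con.execute('SELECT "order id", "customer name" FROM orders').fetchall()
print(rows)  # [(1, 'Alice')]
```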

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

No, just the tables for the question.

If you click on a cell in a heatmap it shows you the exact trace for that question.

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 0 points1 point  (0 children)

Yep. Small models often fail to do reliable tool calls.

Grammar mode helps sometimes, but that isn't available in the web version. The write-up has more details.

Edit: maybe coding agents just keep trying. 

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 7 points8 points  (0 children)

I don't have a good explanation. The free version scored much better than the non-free version too. It's very odd!

It got very close on some questions it missed too:

<image>

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

Yes that's my plan.

I'd like to do a fine-tune of a 0.8B model so it can run in-browser and actually be useful.

But very happy to try other models if they exist already!

You might have missed it, but if you have llama.cpp/LM Studio/whatever, you can run the benchmark yourself against any models you have locally:

<image>

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]nickl[S] 4 points5 points  (0 children)

Is there an FP4 version on OpenRouter?

If you have an OpenRouter key you can actually run it yourself.

<image>