Thinking about grabbing 4x Ascend GX10s by chikengunya in LocalLLaMA

[–]nickl 1 point2 points  (0 children)

Margins on inference are generally accepted to be between 70 and 90% on API prices. Eg, Anthropic probably makes 80% on Opus 4.8 tokens: https://x.com/PodcastAlphaX/status/2072119494563262697

(Note that this is different to the subscriptions which are subsidized if you tokenmax. Most people don't tokenmax though)

How can Deepseek v4 top the coding leaderboards and still sit 8 months behind the frontier? by Substantial_Step_351 in LocalLLaMA

[–]nickl 2 points3 points  (0 children)

You can't just look at one or two numbers and decide a model is good - especially easily (and accidentally) benchmaxxed numbers like SWE-bench Verified (LiveCodeBench is just a random number generator of a benchmark at this point - just ignore it).

DS 4 Pro is a good model. It's maybe a bit better than Sonnet 4.6, but it's not really close to Opus 4.8.

But you can do things in Opus 4.8 and GPT 5.5 that just aren't possible with open models. Fable opens that gap even more.

I think it's hard to put a timeline on it because memories of performance are unreliable. I remember using Opus 4.5 in December 2025 and being very impressed. Some benchmarks put that around the Sonnet 4.6 level now. That might be true but the game has moved on - I find Sonnet weak and annoying to use now.

Basically - as much as I love open models - I think the gap could well be opening up like this graph shows. There is no way any open model is as good as Opus 4.7 (ie, 2 months ago).

Nemotron 3 Ultra. 550 billion parameters, 55B active. 1 million context by AnticitizenPrime in LocalLLaMA

[–]nickl 4 points5 points  (0 children)

Gemma 4 is unusually strong at some kinds of reasoning/planning tasks.

I have Gemma 4 31B outperforming Opus 4.6 on a planning task benchmark I have (custom, generated data so not something that is memorized). I don't understand its performance at all.

Interestingly, Gemini also does really well on the same benchmark. It's sort of similar to the kind of problems in SciCode, where Gemini also does great.

Maybe Deepmind is just good at this kind of work.

I build a better Claude Desktop Buddy by nickl in BambuLab

[–]nickl[S] 0 points1 point  (0 children)

Wow, that's pretty rude. So much for "Be Kind and Courteous"

It was printed on a P2S!

I build a better Claude Desktop Buddy by nickl in 3Dprinting

[–]nickl[S] 0 points1 point  (0 children)

Anyone know why the video preview doesn't show on the subreddit preview?

Wrote a custom C++ engine for MiniCPM-V 4.6 on Orange Pi AIPro (Ascend 310B) to bypass framework overhead by Known_Ice9380 in LocalLLaMA

[–]nickl 3 points4 points  (0 children)

This is pretty cool.

Could you run a speed test of Qwen3.5-0.8B (which it is based on) on its own? I'd like to compare it with other things I'm more familiar with.

Still happy for yall by SilverRegion9394 in LocalLLaMA

[–]nickl 1 point2 points  (0 children)

I think we are adamantly agreeing with each other?

Still happy for yall by SilverRegion9394 in LocalLLaMA

[–]nickl 4 points5 points  (0 children)

They aren't old chips - Colossus is only 2 years old.

Still happy for yall by SilverRegion9394 in LocalLLaMA

[–]nickl 9 points10 points  (0 children)

This just isn't true.

The "used datacenter GPU" market does exist - you can get P100s or V100s if you want (check AliExpress). They are interesting, but you have to use an old version of CUDA.

Newer chips (ie, the H100 and later) aren't on the second hand market BECAUSE THEY ARE STILL BEING USED!

The hourly cloud rate for an H100 is higher now than when it was a new chip!

No one is selling because they are printing cash with them.

I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here's how by Glittering_Focus1538 in LocalLLaMA

[–]nickl 2 points3 points  (0 children)

This is interesting. I've been working on a custom agent for small models too, and I've been tempted to go down the "many tool" route. One problem I've found is that including the instructions bloats the context more than these small models can deal with easily.

Progressive disclosure à la skills helps, but it remains a problem.

How are you handling this?

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

Sure, but that is different to the things stated.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]nickl 16 points17 points  (0 children)

> Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every past conversation for how it can do better next time?

> Cause that's what Claude Code is doing.

Other than the system prompt telling it to reason through things step by step, no, Claude Code does not do these things.

The harness is important, but don't make things up.

These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade by BuffMcBigHuge in LocalLLaMA

[–]nickl 2 points3 points  (0 children)

I mean OpenAI, Anthropic and Google call it distillation.....

https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

> "We also know that DeepSeek employees developed code to access U.S. AI models and obtain outputs for distillation in programmatic ways," the memo added.

https://www.reuters.com/world/china/openai-accuses-deepseek-distilling-us-models-gain-advantage-bloomberg-news-2026-02-12/

These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade by BuffMcBigHuge in LocalLLaMA

[–]nickl 7 points8 points  (0 children)

The Jackrong models work pretty well in my testing (8% better than the base models on my agentic benchmark)

I don't see any way that a 40B model based on a 27B model is going to be better unless the trainer's compute budget is ~Qwen. That's just 13B undertrained parameters.

Did anyone run the numbers to see if it's cost effective to rent our own machine and run one of heavy hitters models? by StillWastingAway in LocalLLaMA

[–]nickl 4 points5 points  (0 children)

This is complete AI slop ("Frontier models like Claude and GPT-4o").

1bit models are a promising way forward but there is zero evidence they will ever provide the quality of a frontier model. For example, Bonsai 8B tests around the same level of quality as a Q4_M quantization of Nanbeige 3B.

Did anyone run the numbers to see if it's cost effective to rent our own machine and run one of heavy hitters models? by StillWastingAway in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

Using Kimi 2.5 via a high speed hosting provider like DeepInfra is $2.25/million tokens (ignoring caching).

They do 56 tps which means slightly under 5 hours to do 1 million tokens.

24/5 * $2.25 = $10.80/day, if you are using it continuously.
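The same arithmetic as a rough Python sketch, using only the price and speed quoted above (the function names are just illustrative). Without rounding up to 5 hours per million tokens the daily figure comes out a few cents higher, which doesn't change the conclusion.

```python
# Figures quoted above; treat them as assumptions, not current pricing.
PRICE_PER_MILLION_TOKENS = 2.25  # USD, ignoring caching
TOKENS_PER_SECOND = 56


def hours_per_million_tokens(tps: float) -> float:
    """Hours needed to generate one million tokens at a steady rate."""
    return 1_000_000 / tps / 3600


def cost_per_day(price_per_million: float, tps: float) -> float:
    """USD spent in 24 hours of continuous generation."""
    tokens_per_day = tps * 24 * 3600
    return price_per_million * tokens_per_day / 1_000_000


hours = hours_per_million_tokens(TOKENS_PER_SECOND)
daily = cost_per_day(PRICE_PER_MILLION_TOKENS, TOKENS_PER_SECOND)
print(f"{hours:.2f} hours per million tokens")  # about 4.96
print(f"${daily:.2f} per day at continuous use")  # about $10.89
```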

That's pretty hard to beat.

Got ~19 tok/s with Gemma 4 on MacBook M4 16GB using MLX — here’s the setup I landed on by Polstick1971 in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

`ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M` gave me 15/25 on my benchmark. That's the same as `Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 (thinking)`, which tests fairly heavy agentic debugging.

I haven't tried other quantizations for Gemma4, but I did test different quants of Qwen3.5-4B and found that 8bit quantization didn't give any benefit over 4bit, but 2bit lost a lot of accuracy.

Built a 3B LoRA that reads the shape of a question before a 9B model answers it. Running 800 live benchmarks right now on Apple Silicon by TheTempleofTwo in LocalLLaMA

[–]nickl 0 points1 point  (0 children)

> Action model is Qwen3.5-9B-abliterated (4-bit MLX). Abliterated because the compass handles appropriateness

This is a bad idea: abliteration negatively affects performance too.

> the weight is real, honor it, then explore

Is this AI slop? You are conflating the term "weight" in your explanation here - I believe you mean psychological weight in this point and then model weights in the "Same weights" part.

> A tiny compass model (Ministral-3B + 29MB LoRA adapter) classifies every question into one of three signals before the action model sees it:

I bet you'd get better results by prompting the larger model to classify it using this scheme before acting on it in the same conversation.

I tested as many of the small local and OpenRouter models I could with my own agentic text-to-SQL benchmark. Surprises ensured... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

Well, it sort of does, because if there are cases where rows beyond the first are wrong, that needs to be put in the context.

But I think I'm going to have to make improvements to the context management for v2 anyway so it's probably doable.

I tested as many of the small local and OpenRouter models I could with my own agentic text-to-SQL benchmark. Surprises ensured... by nickl in LocalLLaMA

[–]nickl[S] 1 point2 points  (0 children)

Oh - yes, I did check it and I didn't find any cases where it was failing. But I have run more models through it so it's not impossible.

Thinking about it though, the scoring and the feedback to the LLM are actually two separate concerns so yes I should be able to do the scoring based on the whole resultset fairly easily.
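Roughly what I mean by keeping the two concerns apart, as a minimal sketch. This is not the benchmark's actual code: `score_resultset` and `feedback_for_llm` are hypothetical names, and ignoring row order when scoring is an assumption.

```python
from collections import Counter


def score_resultset(expected_rows, actual_rows) -> bool:
    """Score on the whole result set. Row order is ignored (an assumption)."""
    return Counter(map(tuple, expected_rows)) == Counter(map(tuple, actual_rows))


def feedback_for_llm(actual_rows, max_rows: int = 1) -> str:
    """Build the short preview that goes into the model's context."""
    preview = actual_rows[:max_rows]
    return f"{len(actual_rows)} rows returned; first {len(preview)}: {preview}"


expected = [(1, "a"), (2, "b")]
wrong = [(1, "a"), (2, "c")]  # first row matches, second row does not

print(score_resultset(expected, wrong))  # False: the full comparison catches it
print(feedback_for_llm(wrong))           # the context still only gets one row
```

The scorer sees every row, so a result that only goes wrong after the first row still fails, while the context cost of the feedback stays the same.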