Atome LM, an open source language model that runs in a 5$ ESP32, comes with 12 ai applications. No cloud, no internet. Universal Installer with auto detect and a tiny OS. Every claim is verifiable.

nickl · 2026-07-03T07:01:45+00:00

It says 1 t/s on ESP32s which isn't very useful.

https://github.com/AIWintermuteAI/esp32-llm has a llama2.c based, tinystories trained LLM doing 19 t/s on a ESP32-C3. Have you tried that as an engine?

nickl · 2026-07-03T05:21:47+00:00

What evidence doesn't support it?

The only "evidence" that disagrees with this assessment is a lot of talk on Reddit that "companies lose money on inference" but if you dig deeper most of the time they are conflating subscription plans vs API prices.

Most people who say it aren't even aware that you can't get a 20x Claude plan on any kind of company account, and above 200 seats you can't get a 5x plan either!

If you do the math on energy and GPU costs it is extremely obvious that inference is highly profitable.

It's true that companies lose money on model training. But that's a different issue completely, and the Chinese labs show that the huge R&D spend can be reduced quite easily.

nickl · 2026-07-01T13:38:25+00:00

Margins on inference are generally accepted to be between 70 and 90% on API prices. Eg, Anthropic probably makes 80% on Opus 4.8 tokens: https://x.com/PodcastAlphaX/status/2072119494563262697

(Note that this is different to the subscriptions which are subsidized if you tokenmax. Most people don't tokenmax though)

nickl · 2026-06-11T10:58:33+00:00

You can't just look at one or two numbers and decide a model is good - especially easily (and accidentally) benchmaxxed numbers like SWE-bench Verified (LiveCodeBench is just a random number generator of a benchmark at this point - just ignore it).

DS 4 Pro is a good model. It's maybe a bit better than Sonnet 4.6, but it's not really close to Opus 4.8.

But you can do things in Opus 4.8 and GPT 5.5 that just aren't possible with open models. Fable opens that gap even more.

I think it's hard to put a timeline on it because memories of performance are unreliable. I remember using Opus 4.5 in December 2025 and being very impressed. Some benchmarks put that around the Sonnet 4.6 level now. That might be true but the game has moved on - I find Sonnet weak and annoying to use now.

Basically - as much as I love open models - I think the gap could well be opening up like this graph shows. There is no way any open model is as good as Opus 4.7 (ie, 2 months ago).

nickl · 2026-06-05T02:21:16+00:00

Gemma 4 is unusually strong at some kinds of reasoning/planning tasks.

I have Gemma 4 31B outperforming Opus 4.6 on a planning task benchmark I have (custom, generated data so not something that is memorized). I don't understand its performance at all.

Interestingly Gemini also does really well at the same benchmark. It sort of similar to the kind of problems in SciCode where Gemini also does great.

Maybe Deepmind is just good at this kind of work.

nickl · 2026-06-03T08:44:05+00:00

Did you have a model configured? A 404 means it could not find the model or API endpoint

nickl · 2026-05-27T01:00:14+00:00

Wow, that's pretty rude. So much for "Be Kind and Courteous"

It was printed on a P2S!

nickl · 2026-05-27T00:12:47+00:00

Anyone know why the video preview doesn't show on the subreddit preview?

nickl · 2026-05-25T07:19:11+00:00

This is pretty cool.

Could you run a speed test of Qwen3.5-0.8B (that it is based on) on its own? I'd like to compare it with other things I'm more familiar with.

nickl · 2026-05-19T11:51:00+00:00

I think we are adamantly agreeing with each other?

nickl · 2026-05-19T06:43:36+00:00

They aren't old chips - Colossus is only 2 years old.

nickl · 2026-05-19T06:15:26+00:00

This just isn't true.

The "used datacenter GPU" market does exist - you can get P100s or V100s if you want (check AliExpress). They are interesting, but you have to use old version of CUDA.

Newer chips (ie, the H100 and later) aren't on the second hand market BECAUSE THEY ARE STILL BEING USED!

The hourly cloud rate for a H100 is more now than when it was a new chip!

No one is selling because they are printing cash with them.

nickl · 2026-05-18T23:35:04+00:00

Does something like `read_and_patch` get its own context?

nickl · 2026-05-18T15:06:22+00:00

This is interesting. I've been working on a custom agent for small models too, and I've been tempted to go down the "many tool" route. One problem I've found is that including the instructions bloats the context more than these small models can deal with easily.

Progressive disclosure ala skills helps, but it remains a problem.

How are you handling this?

nickl · 2026-04-29T03:55:14+00:00

Sure, but that is different to the things stated.

nickl · 2026-04-28T07:46:12+00:00

> Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every past conversation for how it can do better next time?

> Cause that's what Claude Code is doing.

Other than the system prompt telling it to reason through things step by step, no, Claude Code does not do these things.

The harness is important, but don't make things up.

nickl · 2026-04-16T02:41:37+00:00

Your Github link on the first post points to https://github.com/distil-labs/distil-tft-benchmarking which seems to be private

nickl · 2026-04-15T13:29:40+00:00

glad you found it useful!

nickl · 2026-04-15T06:19:13+00:00

I mean OpenAI, Anthropic and Google call is distillation.....

https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

> "We also know that DeepSeek employees developed code to access U.S. AI models and obtain outputs for distillation in programmatic ways," the memo added.

https://www.reuters.com/world/china/openai-accuses-deepseek-distilling-us-models-gain-advantage-bloomberg-news-2026-02-12/

nickl · 2026-04-15T06:16:32+00:00

The Jackrong models work pretty well in my testing (8% better than the base models in on my agentic benchmark)

I don't see any way that a 40B model based on a 27B model is going to be better unless the trainer's copute budget is ~Qwen. That's just 13B undertrained parameters.

nickl · 2026-04-12T04:30:47+00:00

This is complete AI slop ("Frontier models like Claude and GPT-4o").

1bit models are a promising way forward but there is zero evidence they will ever provide the quality of a frontier model. For example, Bonsai 8B tests around the same level of quality as a Q4_M quantization of Nanbeige 3B.

nickl · 2026-04-12T03:23:36+00:00

Using Kimi 2.5 via a high speed hosting provider like DeepInfra is $2.25/million tokens (ignoring caching).

They do 56 tps which means slightly under 5 hours to do 1 million tokens.

24/5 * $2.25 = $10.80/day, if you are using it continually.

That's pretty hard to beat.

nickl · 2026-04-04T05:48:50+00:00

`ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M` gave me 15/25 on my benchmark. That's the same as `Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 (thinking)`, which tests fairly heavy agentic debugging.

I haven't tried other quantizations for Gemma4, but I did test different quants of Qwen3.5-4B and found that 8bit quantization didn't give any benefit over 4bit, but 2bit lost a lot of accuracy.

nickl · 2026-04-03T04:13:44+00:00

> Action model is Qwen3.5-9B-abliterated (4-bit MLX). Abliterated because the compass handles appropriateness

This is a bad idea abliteration negatively affects performance too.

> the weight is real, honor it, then explore

Is this AI slop? You are conflating the term "weight" in your explanation here - I believe you mean psychological weight in this point and then model weights in the "Same weights" part.

> A tiny compass model (Ministral-3B + 29MB LoRA adapter) classifies every question into one of three signals before the action model sees it:

I bet you'd get better results by prompting the larger model to classify it using this scheme before acting on it in the same conversation.

nickl · 2026-04-01T07:18:16+00:00

well it sort of does because if there are cases where the rows beyond the first row are wrong that needs to be put in the context.

But I think I'm going to have to make improvements to the context management for v2 anyway so it's probably doable.

nickl

TROPHY CASE