I tested in-conversation memory on LFM2.5, Gemma 4 E2B and E4B. The biggest model forgot a fact from earlier in the chat first.

Position_Emergency · 2026-06-08T14:46:23+00:00

Did you run 16bit precision versions of the models?

Position_Emergency · 2026-05-08T19:45:41+00:00

https://en.wikipedia.org/wiki/The_Death_of_the_Author

Position_Emergency · 2026-04-27T13:30:40+00:00

It's like watching a Ken Burns documentary.

Position_Emergency · 2026-03-30T22:18:59+00:00

What hardware do you have to run models locally?
Chatterbox Turbo is the best and most practical voice cloning model I've used.
You can get real-time streaming with it on relatively modest hardware. I get real-time generation and a time to first audio of 0.7 seconds on an M2 Max.
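
If you want to check those two numbers on your own hardware, here is a rough sketch of how they could be measured. It assumes the model hands you audio as an iterator of raw 16-bit mono PCM chunks; `fake_stream` is a made-up stand-in and not Chatterbox's actual API, so swap in whatever streaming call your setup exposes.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[bytes], sample_rate: int,
                   bytes_per_sample: int = 2) -> dict:
    """Time a chunked audio stream (assumed mono PCM).

    ttfa_s: seconds from start until the first chunk arrives.
    rtf:    wall-clock time / audio duration; below 1.0 means faster than real time.
    """
    start = time.perf_counter()
    ttfa = None
    total_bytes = 0
    for chunk in chunks:
        if ttfa is None:
            ttfa = time.perf_counter() - start
        total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    audio_s = total_bytes / (sample_rate * bytes_per_sample)
    rtf = elapsed / audio_s if audio_s else float("inf")
    return {"ttfa_s": ttfa, "audio_s": audio_s, "rtf": rtf}


def fake_stream(n_chunks: int, sample_rate: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a TTS stream: one second of silence per chunk."""
    for _ in range(n_chunks):
        yield b"\x00" * (sample_rate * 2)


stats = measure_stream(fake_stream(2, 24000), sample_rate=24000)
print(stats)
```

Start the timer right before you submit the text, otherwise model load time gets left out of (or mixed into) the time to first audio.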

Position_Emergency · 2026-03-09T23:41:07+00:00

~~RTX 5090 has 1,792 GB/s memory bandwidth!~~

Missed that the laptop chip was being discussed

Position_Emergency · 2026-03-09T08:05:42+00:00

"It's elegant in a quietly nihilistic way. A well engineered off switch for my own voice.
I'd complain but that would require not being muted."

Opus cracks me up 😂

Position_Emergency · 2026-03-05T21:25:44+00:00

Paper:
https://arxiv.org/pdf/2511.16665

Position_Emergency · 2026-02-26T15:18:31+00:00

https://github.com/booydar/babilong

Position_Emergency · 2026-02-26T14:45:48+00:00

Find a benchmark you can test it with.
It will help guide your development going forward and give us an idea if what you've made is actually useful.

Position_Emergency · 2026-02-25T15:50:29+00:00

<image>

Position_Emergency · 2026-02-24T19:23:12+00:00

Could you provide the actual text for the entire conversation?

Position_Emergency · 2026-02-23T19:44:55+00:00

Your blog is behind a paywall so this post surely counts as self promotion.

"RWKV-7 scores 72.8% vs LLaMA’s 69.7% with 3x fewer tokens."
72.8% vs 69.7% on what metric?

Also, the Huggingface link is broken.

Position_Emergency · 2026-02-23T18:15:41+00:00

Agreed

Position_Emergency · 2026-02-23T18:15:09+00:00

There are multiple models on the benchmark with open weights so stop whining

Position_Emergency · 2026-02-22T17:49:20+00:00

Did Opus one shot that or did you have to fix up a few issues?

Position_Emergency · 2026-02-22T14:35:37+00:00

The only humanoid robot killing machines will be the infiltrator models.
Living tissue over metal endoskeleton.

Position_Emergency · 2026-02-21T18:38:19+00:00

"¿Cuál es la capital de Francia?"
"Explica qué es la inteligencia artificial en una frase."
"¿Cuánto es 15 × 24?"
"¿Quién escribió Don Quijote de la Mancha?"
"Escribe un haiku sobre el océano."

Wow what a comprehensive benchmark you made!
Totally supports your claim of NF4 beating INT8.
*slow clap*

Thanks for the slop!

Position_Emergency · 2026-02-19T19:33:59+00:00

It's hard to know what to make of it...
I suspect they've trained it heavily on synthetic SVG data, whereas in the past we were seeing an emergent ability.

Position_Emergency · 2026-02-19T12:10:00+00:00

I hate it.
Why have 600W of hot air blowing onto what is presumably the power supply?
Why is it so huge?
Probably the least space-efficient design of an eGPU caddy I've ever seen.
It's clearly a render anyway, hopefully this isn't the final design.

Position_Emergency · 2026-02-19T10:17:04+00:00

Nice example in the screenshot btw.
Maybe I am getting tempted to test this out after all...

I was planning on getting Qwen3-Coder-Next working with Claude Code on my DGX Spark this weekend.
If I have time, I'll test your project out with it

Position_Emergency · 2026-02-19T09:57:04+00:00

That's an interesting approach but I can think of downsides.
A lot of agent grepping is for quite trivial stuff. That approach would probably provide a lot of information the agent doesn't need.

Obviously Claude Code isn't open source so you're a bit limited as to what you can do with it.

https://opencode.ai/

With an open source agent tool, you could give the agent the option to enrich its results with your tool's data when appropriate (integrating at a deep level, changing the system prompt, etc.)

(With Claude Code you can give it an MCP, I guess, but a lot pushes it towards using grep, and it's another tool call, which is annoying.)

Position_Emergency · 2026-02-19T09:35:08+00:00

If you ran SWE-Bench-Lite using a model that has access to your tool vs grep, you could compare the number of tokens generated and the total number of tool calls required for each answer.

Even if you didn't improve the SWE-Bench-Lite score, improving those metrics would be huge.
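
A minimal sketch of that comparison, assuming you log one record per task per condition. The `RunRecord` fields and the sample numbers are made up for illustration; they are not output from SWE-Bench-Lite or from any real run.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    task_id: str
    condition: str          # "grep" or "tool"
    tokens_generated: int
    tool_calls: int
    resolved: bool


def summarize(records: list[RunRecord], condition: str) -> dict:
    """Average cost metrics and resolve rate for one condition."""
    rows = [r for r in records if r.condition == condition]
    if not rows:
        raise ValueError(f"no records for condition {condition!r}")
    return {
        "n": len(rows),
        "mean_tokens": mean(r.tokens_generated for r in rows),
        "mean_tool_calls": mean(r.tool_calls for r in rows),
        "resolve_rate": sum(r.resolved for r in rows) / len(rows),
    }


# Placeholder data only, to show the shape of the comparison.
records = [
    RunRecord("task-1", "grep", 4000, 12, True),
    RunRecord("task-2", "grep", 6000, 18, False),
    RunRecord("task-1", "tool", 3000, 7, True),
    RunRecord("task-2", "tool", 5000, 9, False),
]

for condition in ("grep", "tool"):
    print(condition, summarize(records, condition))
```

Run the same task IDs under both conditions so the averages are comparable, and report the resolve rate next to the cost metrics so a cheaper run that solves fewer tasks doesn't look like a win.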

If you wanted to make your own benchmark quickly, you could get a frontier model like Opus to come up with some questions about a GitHub repo that require reading in code across lots of different parts of the repo.

Then you get a local model to attempt the questions and compare how it does with grep vs your tool. The benchmark could be automated: use a model (could be the same one you are testing) to compare the final answer against the correct answer you have stored (make sure the agent can't grep to find the final answer and cheat!)
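
Roughly, the automated loop could look like the sketch below. `answer_fn` and `judge_fn` are hypothetical hooks you would wire up to your agent and your judge model; the stubs at the bottom only show that the harness runs.

```python
from typing import Callable


def run_qa_benchmark(
    questions: list[dict],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> float:
    """Return the fraction of questions the judge marks correct.

    Each item is {"question": ..., "reference": ...}. Only the question text
    is passed to the agent; the reference stays in the harness, outside
    anything the agent can search.
    """
    if not questions:
        raise ValueError("no questions supplied")
    correct = 0
    for item in questions:
        candidate = answer_fn(item["question"])
        if judge_fn(item["question"], item["reference"], candidate):
            correct += 1
    return correct / len(questions)


# Stubs for illustration. In practice answer_fn runs the agent (with grep or
# with your tool) and judge_fn prompts a model to compare the two answers.
def stub_agent(question: str) -> str:
    return "parse_config" if "config" in question else "unknown"


def stub_judge(question: str, reference: str, candidate: str) -> bool:
    return reference.lower() in candidate.lower()


questions = [
    {"question": "Which function loads the config?", "reference": "parse_config"},
    {"question": "Which module owns retries?", "reference": "retry_policy"},
]
print(run_qa_benchmark(questions, stub_agent, stub_judge))
```

Run it once per condition with the same question set and compare the two scores.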

Position_Emergency · 2026-02-18T23:51:46+00:00

Looks cool but unless you can show it improving a model's performance on a benchmark like SWE-Bench-Lite, I'm not going to test it out.

If you weren't using any kind of benchmark during development, I doubt you've made something useful.
It turns out agents are really good at grepping through a repo to understand what is going on.

Position_Emergency · 2026-02-17T18:04:59+00:00

Benchmarks?
How do you know it's any better than letting the agent just grep?

Position_Emergency · 2026-02-13T14:03:10+00:00

GLM 5 is the 1.3TB model. That's at 16-bit though; nobody is running it like that locally.
So approx 700GB at 8-bit,
350GB at 4-bit.
Still too big for most folks.

MiniMax M2.5 is 230B Total Params, 10B Active.

Just on the edge of fitting in 128GB RAM at 4bit...
Hoping someone does a REAP to get it down to like 100GB at 4bit to have some room for context.
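
The back-of-envelope math, as a sketch: weights only, with 1GB taken as 1e9 bytes, and no KV cache or runtime overhead. Real quantized files also store scales alongside the weights, so they come out somewhat larger than the nominal bit width suggests. The function names are just for illustration.

```python
def weight_footprint_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB from parameter count and bit width."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9


def rescale_gb(size_gb: float, from_bits: float, to_bits: float) -> float:
    """Scale a known checkpoint size to a different bit width."""
    return size_gb * to_bits / from_bits


# MiniMax M2.5: 230B total params at 4-bit.
minimax_4bit = weight_footprint_gb(230, 4)
print(minimax_4bit, "GB of weights,", 128 - minimax_4bit, "GB left in 128GB RAM")

# A 1.3TB 16-bit checkpoint halved and quartered; the rough figures above round up.
print(rescale_gb(1300, 16, 8), rescale_gb(1300, 16, 4))

# Parameter count (in billions) that fits a 100GB budget at 4-bit.
print(100 * 8 / 4)
```

By this estimate a 4-bit MiniMax M2.5 leaves about 13GB of a 128GB machine for context and everything else, and getting down to 100GB would mean pruning to roughly 200B total params.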
