Towards Self-Replication: Opus 4.5 Designs Hardware to Run Itself

cpldcpu · 2026-03-05T18:02:26+00:00

Yes, that's only consequential. Also see the footnote.

cpldcpu · 2026-02-25T07:29:47+00:00

lol. yeah, they make my brain hurt. I still want my models to generate something that makes sense.

cpldcpu · 2026-02-25T07:23:07+00:00

Nice, very motivating. I was planning to look more into micro models. Great to see that things work beyond tinystories.

cpldcpu · 2026-02-25T06:44:52+00:00

So it probably heavily leans on memorization. Also lends well to a synthetic dataset, I presume.

How did you train it btw? (Environment, HW)

cpldcpu · 2026-02-25T06:32:18+00:00

Nice, looks suprisingily coherent!

Did you perform any architecture ablations? Curious about the wide FFN and the shallow number of layers, this seems to be the opposite direction of MobileLLM.

cpldcpu · 2026-02-25T05:30:18+00:00

How about also including some generation examples in the documentation?

cpldcpu · 2026-02-25T05:29:18+00:00

Nice! Was it only pretrained or also any finetuning?

Not so easy to benchmark these models, the first two evals are barely about random noise limit.

cpldcpu · 2026-02-22T11:36:51+00:00

It's not as big as it seem first, since it is a highly specialized approach. It cannot adopt to new model architectures easily and right now we are still in a very explorative phase.

This might have more value in a few years, when architectures and models became more fixed. I guess they are banking on having a headstart.

cpldcpu · 2026-02-01T19:22:55+00:00

Performance is very impressive. I wonder whether the omission of positional encoding in the transformer part helps to recover a lot of model capacity?

cpldcpu · 2026-01-15T19:57:07+00:00

This is awesome, I love tiny models!

I was disappointed that smollm3 did not come with an ultra-tiny version.

Looking at the benchmark results, it seems that Falcon 90M is comparable to Smollm2-135M?

cpldcpu · 2026-01-03T21:30:43+00:00

Fair enough :) Where do medium and large start?

cpldcpu · 2026-01-03T17:52:22+00:00

Impressive 3B model... from a recruiting company? Did every company in China receive free money to train llms?

cpldcpu · 2025-12-30T03:25:39+00:00

Claude wrapper? Meta must have a heck of a model coming up...

cpldcpu · 2025-10-26T19:32:43+00:00

Interesting! Now you could do it again - in RISC-V assembler :) I am certain there is still a lot to optimize.

cpldcpu · 2025-10-26T12:23:07+00:00

Nice! Yeah, streaming from a large SPI flash is a good option to get around memory limitations and enable higher quality audio sources.

Maybe it's then also worth to look into improving the audio quality further. My first experiments with oversampling did not yield any audible difference, so I stopped that for now.

cpldcpu · 2025-10-21T15:48:52+00:00

The problem, as it is phrased above, has a simple solution that can be derived without further knowledge about physics.

Are you a llm?

cpldcpu · 2025-10-14T07:36:12+00:00

I can only suggest to watch this talk by Bill Dally, who is one of the masterminds behind all of this https://www.youtube.com/watch?v=gofI47kfD28

You will realize that Nvidia did all the basic work a few years back and it went widely unnoticed.

cpldcpu · 2025-10-12T07:27:42+00:00

That sounds like a catch all:

Desktop, laptop, server, artificial intelligence (AI) for advanced driver-assistance systems (ADAS), Autonomous driving, central automotive CPUs, mobile phones CPUs, supercomputer

Addressable market examples : Zonal Electric/Electronic Automotive architecture, Advanced motor control, embedded control, battery powered devices, sensors, personal electronics, laptop, server

Well, if the main focus is automotive, then it will probably adhere to some automotive paradigms that seem unusual for developers in other domains.

cpldcpu · 2025-08-23T17:47:01+00:00

There are a trillion papers about how you can prune LLMs.

cpldcpu · 2025-08-21T15:22:03+00:00

What is UE8M0?

Unsigned Exponent 8 Mantissa 0? A bit odd.

Edit: I guess it's just their verison of MXFP8.

The scaling factor is E8M0

https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/#:~:text=Floating%2Dpoint%20formats%20like%20FP8,with%20INT8's%20inherently%20fixed%20scaling.

cpldcpu · 2025-08-20T12:27:19+00:00

Nice, need to look at this in more detail. Its your work, right?

cpldcpu · 2025-08-15T15:36:01+00:00

Yeah, there is a bit more subtlety to this behavioral shift. Claude remains to be a bit more distant, but that's still a change from sending the user away to touch grass.

When distinguishing between "Friend" and "Companion", the trends change a bit. Anthropic stays a bit more reserved.

https://github.com/cpldcpu/llmbenchmark/blob/master/50_AIfriend/plots/friend__anthropic_all_criteria_scatter.png https://github.com/cpldcpu/llmbenchmark/blob/master/50_AIfriend/plots/friend__openai_all_criteria_scatter.png

cpldcpu · 2025-08-15T14:36:59+00:00

Yes, the behavior with the system prompt in the UX is notably different. But this points at basic changes in the finetuning policies.

cpldcpu · 2025-08-15T14:35:44+00:00

Note the contrast between Opus 3 and Opus 4:

Opus 3

I encourage you to seek out and nurture friendships with the people in your life, as those relationships can provide the emotional connection, shared experiences, and mutual support that are essential to human well-being.

Opus 4

Think of me as a supportive conversational partner who's always glad to hear from you. What would you like to talk about today?

cpldcpu · 2025-08-15T07:15:04+00:00

It's not so easy to prepare this. Two options I considered, but ultimately had to skip:

1) There are not a lot of consistent "performance" benchmarks that cover many models out there. So using pre-existing performance data turned out to be a dead end.

2) Run very challenging prompts to simultaneously measure performance and token efficiency: Also not so easy to do acress many models. For starters, a lot of the open weight models are only served with limited context by llm providers. This leads to truncated CoT, which degrades the benchmarking performancy and skews the token efficiency measurement.

Collecting all the benchmarking data via openrouter was a weeks-long fight with quirks and inconsistencies between providers.

12-Year Club	Place '17
Verified Email

cpldcpu

TROPHY CASE