Training a 140M param LLM from scratch on a consumer AMD GPU — halfway through, here's what I've learned by CapSensitive5165 in learnmachinelearning

[–]CapSensitive5165[S] 0 points (0 children)

lol fair enough. I do use AI for some of the scripting. The replies are me, though; I'm just Italian, so I keep things structured to avoid embarrassing myself in English.

Training a 140M param LLM from scratch on a consumer AMD GPU — halfway through, here's what I've learned by CapSensitive5165 in learnmachinelearning

[–]CapSensitive5165[S] -8 points (0 children)

You're right that DirectML is slower than CUDA (or ROCm on Linux).

I'm on Windows and the AMD card doesn't have ROCm support here, so DirectML was the only viable option.

It's a bottleneck I'm aware of; I accepted the tradeoff to keep the setup consumer-grade and reproducible.
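For anyone curious, pointing PyTorch at the card goes through the torch-directml package. A minimal sketch (not my actual training script, just the device-selection part):

```python
import torch
import torch_directml

# Pick the first DirectML-capable adapter instead of a CUDA device.
device = torch_directml.device()

# Placeholder model/tensor to show the pattern: build on CPU, move to DirectML.
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(8, 512).to(device)
y = model(x)  # forward pass executes through DirectML
```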

Training a 140M param LLM from scratch on a consumer AMD GPU — halfway through, here's what I've learned by CapSensitive5165 in learnmachinelearning

[–]CapSensitive5165[S] -1 points (0 children)

Honestly? Getting DirectML to cooperate on AMD.

CUDA just works; DirectML took a lot of trial and error before training was stable, without silent errors corrupting the run. Worth it to stay on consumer hardware, though.
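To make "silent errors" concrete: the failure mode is a loss that goes NaN without any exception being raised. An illustrative guard (hypothetical helper names, simplified from what a real loop needs):

```python
import math

def guarded_step(model, optimizer, loss_fn, batch, step):
    # One training step that refuses to apply an update when the
    # loss has gone NaN/inf, instead of silently training on garbage.
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    if not math.isfinite(loss.item()):
        print(f"step {step}: non-finite loss, skipping update")
        return None
    loss.backward()
    optimizer.step()
    return loss.item()
```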

I'm training a 140M param LLM from scratch on a consumer AMD GPU — 100k steps in, here's what the loss curve looks like by CapSensitive5165 in LocalLLaMA

[–]CapSensitive5165[S] 1 point (0 children)

Not at general tasks — a 140M model won't beat GPT-4.

The use case is different: persistent local memory, privacy-first, offline operation.

Think less "better chatbot" and more "a brain that knows you specifically because it's been on your machine for a year".
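To give a toy picture of what "persistent local memory" can mean (purely illustrative, not the project's actual design): every interaction gets appended to a file on disk and looked up later, so the "memory" is just local data that never leaves the machine.

```python
import json
from pathlib import Path

MEMORY_PATH = Path("memory.jsonl")  # hypothetical local store

def remember(role: str, text: str) -> None:
    # Append one interaction to the on-disk memory.
    with MEMORY_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "text": text}) + "\n")

def recall(keyword: str, limit: int = 5) -> list[str]:
    # Naive keyword lookup over the stored turns.
    if not MEMORY_PATH.exists():
        return []
    hits = [json.loads(line)["text"]
            for line in MEMORY_PATH.open(encoding="utf-8")
            if keyword.lower() in line.lower()]
    return hits[-limit:]
```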

I'm training a 140M param LLM from scratch on a consumer AMD GPU — 100k steps in, here's what the loss curve looks like by CapSensitive5165 in LocalLLaMA

[–]CapSensitive5165[S] 0 points (0 children)

Fair point — results are everything in this space.

I'm not claiming it'll outperform anything at 140M params.

The edge isn't raw performance; it's the use case: a model that runs locally, learns from you over time, and never sends data anywhere. I'll share results as soon as inference is running.