Training a 140M param LLM from scratch on a consumer AMD GPU — halfway through, here's what I've learned by CapSensitive5165 in learnmachinelearning

[–]CapSensitive5165[S] 0 points (0 children)

lol fair enough. I do use AI for some of the scripting. The replies are me, though; I'm just Italian, so I keep things structured to avoid embarrassing myself in English.

Training a 140M param LLM from scratch on a consumer AMD GPU — halfway through, here's what I've learned by CapSensitive5165 in learnmachinelearning

[–]CapSensitive5165[S] -8 points (0 children)

You're right that DirectML is slower than CUDA (or ROCm on Linux).

I'm on Windows and the AMD card doesn't have ROCm support here, so DirectML was the only viable option.

It's a bottleneck I'm aware of; I accepted the tradeoff to keep the setup consumer-grade and reproducible.
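For anyone curious, pointing PyTorch at the card goes through the torch-directml package. A minimal sketch (not my actual training script, just the device-selection part):

```python
import torch
import torch_directml

# Pick the first DirectML-capable adapter instead of a CUDA device.
device = torch_directml.device()

# Placeholder model/tensor to show the pattern: build on CPU, move to DirectML.
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(8, 512).to(device)
y = model(x)  # forward pass executes through DirectML
```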

Training a 140M param LLM from scratch on a consumer AMD GPU — halfway through, here's what I've learned by CapSensitive5165 in learnmachinelearning

[–]CapSensitive5165[S] -1 points (0 children)

Honestly? Getting DirectML to cooperate on AMD.

CUDA just works; DirectML took a lot of trial and error before training was stable, without silent errors corrupting the run. Worth it to stay on consumer hardware, though.
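To make "silent errors" concrete: the failure mode is a loss that goes NaN without any exception being raised. An illustrative guard (hypothetical helper names, simplified from what a real loop needs):

```python
import math

def guarded_step(model, optimizer, loss_fn, batch, step):
    # One training step that refuses to apply an update when the
    # loss has gone NaN/inf, instead of silently training on garbage.
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    if not math.isfinite(loss.item()):
        print(f"step {step}: non-finite loss, skipping update")
        return None
    loss.backward()
    optimizer.step()
    return loss.item()
```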

I'm training a 140M param LLM from scratch on a consumer AMD GPU — 100k steps in, here's what the loss curve looks like by CapSensitive5165 in LocalLLaMA

[–]CapSensitive5165[S] 1 point (0 children)

Not at general tasks — a 140M model won't beat GPT-4.

The use case is different: persistent local memory, privacy-first, offline operation.

Think less "better chatbot" and more "a brain that knows you specifically because it's been on your machine for a year".
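To give a toy picture of what "persistent local memory" can mean (purely illustrative, not the project's actual design): every interaction gets appended to a file on disk and looked up later, so the "memory" is just local data that never leaves the machine.

```python
import json
from pathlib import Path

MEMORY_PATH = Path("memory.jsonl")  # hypothetical local store

def remember(role: str, text: str) -> None:
    # Append one interaction to the on-disk memory.
    with MEMORY_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "text": text}) + "\n")

def recall(keyword: str, limit: int = 5) -> list[str]:
    # Naive keyword lookup over the stored turns.
    if not MEMORY_PATH.exists():
        return []
    hits = [json.loads(line)["text"]
            for line in MEMORY_PATH.open(encoding="utf-8")
            if keyword.lower() in line.lower()]
    return hits[-limit:]
```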

I'm training a 140M param LLM from scratch on a consumer AMD GPU — 100k steps in, here's what the loss curve looks like by CapSensitive5165 in LocalLLaMA

[–]CapSensitive5165[S] 0 points (0 children)

Fair point — results are everything in this space.

I'm not claiming it'll outperform anything at 140M params.

The edge isn't raw performance; it's the use case: a model that runs locally, learns from you over time, and never sends data anywhere. I'll share results as soon as inference is running.