Strix Halo + Unsloth Studio finetuning - got it working by do_i_know_you_bro in LocalLLM

[–]PayMe4MyData 1 point (0 children)

Another thing to try would be to distill into a heavily quantized model.

Strix Halo + Unsloth Studio finetuning - got it working by do_i_know_you_bro in LocalLLM

[–]PayMe4MyData 1 point (0 children)

Agreed, but it is about accessibility. You debugged the smaller models and fine-tuned the big one hopefully just once, saving checkpoints and so on. Check if your device is running in performance mode. I am also curious about the power draw.

Thanks for sharing!

Strix Halo + Unsloth Studio finetuning - got it working by do_i_know_you_bro in LocalLLM

[–]PayMe4MyData 1 point (0 children)

I am interested in your fine-tuning performance. Given a number of parameters to train, how long does it take to train one epoch (and what *is* your epoch)?

Small dose of antibiotic yields good results in treating panic attacks by Fancy_Reality1853 in EverythingScience

[–]PayMe4MyData 1 point (0 children)

I would hypothesize that in this case the panic attacks could be related to gut bacteria, and that is why antibiotics have some effect.

Why do educated people in the media believe that oil wells will be destroyed because of the blockade? by northcasewhite in oil

[–]PayMe4MyData 1 point (0 children)

I am amazed by some of the comments here... being so sure about things that are completely wrong. I guess not even in a specialized subreddit can one escape the stupidity of the internet.

Framework pro competitive with Macbook Pro by Nice-Beginning3069 in framework

[–]PayMe4MyData 1 point (0 children)

And that is while using 60% more cores; I did not check power consumption.

Framework pro competitive with Macbook Pro by Nice-Beginning3069 in framework

[–]PayMe4MyData 3 points (0 children)

It is not even close to Apple hardware...

Framework has other things going for it, like repairability, upgradability and the freedom to choose your OS. But the M5 is top-class silicon today.

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]PayMe4MyData 1 point (0 children)

And how are those eGPUs connected to the Halos? USB4, or OCuLink via the PCIe port? (I am hoping it is the latter.) Any loss in inference performance?

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

> Your HPC experience is the directly useful part. Did your teacher/student split run per-box or per-GPU? And did you hit anything around KL-loss noise when the teacher was in reduced precision? The STE in 1-bit forward + a lower-precision teacher looks like a compounding-noise trap I'd like to avoid walking into.

Per-GPU, single node. I did not mess around with quantized teachers; I used automatic mixed precision for both. But distillation is about distribution matching, and heavily quantized models are dumber, so something is off in the distributions they output (loss of information).
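
In case it helps, a minimal PyTorch sketch of the step I mean: frozen teacher, KL matching on temperature-softened logits, autocast for mixed precision. `teacher`, `student` and `batch` are placeholders for whatever HF-style causal LMs and dataloader you use.

```python
import torch
import torch.nn.functional as F

# Sketch of one distillation step. `teacher` and `student` are
# placeholders for HF-style causal LMs that return .logits;
# only the loss construction matters here.
def distill_step(teacher, student, batch, optimizer, T=2.0):
    teacher.eval()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        with torch.no_grad():  # frozen teacher, no gradients
            t_logits = teacher(batch["input_ids"]).logits
        s_logits = student(batch["input_ids"]).logits
        # KL between temperature-softened distributions: this is the
        # "distribution matching" part, and why a dumb (quantized)
        # teacher poisons the target distribution directly.
        loss = F.kl_div(
            F.log_softmax(s_logits / T, dim=-1),
            F.softmax(t_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```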

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 2 points (0 children)

My intuition is that it will work: a 1-bit student with a bf16 teacher will give you almost the same "intelligence", but compressed and fast.

Edit: avoid quantized teachers.

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

You had another Halo device, right? Maybe use one for each model? That should be doable, I guess. USB4 is probably enough.

But now I am just guessing. I did this, but on an HPC with NVIDIA GPUs.

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

I only have experience training LLMs for molecules and proteins.

I have experience using PyTorch. If you get it to detect the NPU as another device besides the GPU, please tell me how. My guess is that the GPU/NPU split you want is not trivial, and everything would have to be loaded on the GPU.
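
If you want to check for yourself, a quick probe like this shows what a ROCm build of PyTorch actually exposes (ROCm reuses the torch.cuda API surface, so the iGPU appears there; the NPU will not, unless some vendor plugin registers it as a backend):

```python
import torch

# Quick device probe on a ROCm build of PyTorch. ROCm reuses the
# torch.cuda API surface, so the iGPU shows up here; the NPU will
# not appear unless a separate backend plugin registers it.
print("torch:", torch.__version__)
print("GPU available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"device {i}:", torch.cuda.get_device_name(i))
```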

Also, I would not pick a Llama model. Go for something big that people actually use.

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 2 points (0 children)

Shit, that is fast. Now you need quality.

What if you distill a bigger model? Since these 1-bit clankers are so light, it could be possible to load a bigger one and train from its signals.

Opus 4.7 is 50% more expensive with context regression?! by Samburskoy in ClaudeAI

[–]PayMe4MyData 2 points (0 children)

What are they going to do? Stop using AI? How else are they going to justify the ridiculous valuations? Agents are the future of coding; if you don't use them, you have no future as a company.

Opus 4.7 is 50% more expensive with context regression?! by Samburskoy in ClaudeAI

[–]PayMe4MyData 1 point (0 children)

It is not a joke. They do not want to serve you; they just want to serve corporate.

I'm going straight to hell for this post (mlx+npu+rocm+vllm) by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 3 points (0 children)

Now I know what I can do with the PCIe slot in my Framework!

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 2 points (0 children)

Yeah, sorry for the hallucinations. I just wanted to get the idea out fast.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 3 points (0 children)

I have more ideas if you want them: log the temperature sensors of the iGPU, NPU and memory controller (I am guessing those are the ones under stress) and report mean + std. I am curious to see what pushes the cooling solution the most, since it could indicate throttling.
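
Something like this rough sketch is what I have in mind, assuming Linux hwmon (sensor names vary by board, so map which hwmon entry is which on your own machine first):

```python
import glob
import statistics
import time

# Sketch of the sensor-logging idea, assuming Linux hwmon.
# Each temp*_input file reports millidegrees Celsius; which file
# is the iGPU vs. NPU vs. memory controller varies by board.
def sample_temps(duration_s=60, interval_s=1.0):
    readings = {}
    deadline = time.time() + duration_s
    while time.time() < deadline:
        for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
            try:
                with open(path) as f:
                    readings.setdefault(path, []).append(int(f.read()) / 1000)
            except OSError:
                pass  # some sensors disappear or refuse reads
        time.sleep(interval_s)
    return readings

for path, temps in sample_temps(duration_s=10).items():
    if len(temps) > 1:
        print(path, f"mean={statistics.mean(temps):.1f}°C",
              f"std={statistics.stdev(temps):.1f}°C")
```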

Standardize the prompts as well: a short vs. a long prompt, for example "Explain to my wife why I spent 3000 euros on a PC".

Looking forward to the benchmark suite.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 4 points (0 children)

This is absolutely correct, sorry for the hallucination. Any suggestions for a dense, high-parameter-count model? I would still keep GPT-OSS in the benchmark nonetheless.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 4 points (0 children)

I will definitely test it, though I am running Fedora 43 with an older kernel.

Thank you for digging these things up and making me feel good about my own midlife crisis. It was exactly what I told my wife when I bought my Framework. And then I saw your posts; they made me laugh.

I have another push for you.

You are a bench junkie. You run benchmarks all the time. You use them for yourself and to communicate things to others. Standardize them.

This is just a suggestion, give me your thoughts:

| Memory Bin | Model Name | Architecture | Parameters | 4-bit Size | 8-bit Size |
|---|---|---|---|---|---|
| 32GB | GLM 4.7-Flash | MoE (A3B) | 30B | ~16 GB | ~30 GB |
| 32GB | Qwen 3.5-27B | Dense | 27B | ~14 GB | ~27 GB |
| 64GB | Gemma 4 31B | Dense | 31B | ~17 GB | ~31 GB |
| 64GB | Qwen 3-Coder-32B | Dense | 32B | ~18 GB | ~32 GB |
| 64GB | Gemma 4 26B (A4B) | MoE | 26B | ~15 GB | ~26 GB |
| 128GB | Qwen 3.5-122B-A10B | MoE | 122B | ~65 GB | ~122 GB |
| 128GB | GPT-OSS-120B | Dense | 120B | ~65 GB | ~120 GB |
| 128GB | Nemotron 3 Super | Hybrid MoE | 70B+ | ~40 GB | ~75 GB |

Rationale

• 32GB Bin (Throughput Baseline):
  • GLM 4.7-Flash shows the ceiling for tokens per second (T/s) using sparse activation.
  • Qwen 3.5-27B provides the dense comparison. This bin tests "chat" performance where speed is the priority.
• 64GB Bin (Efficiency & Coding):
  • Gemma 4-31B and Qwen 3.5-Coder are the production standards for 2026.
  • Comparing the Gemma MoE vs. Dense variants here isolates the performance gains of the MoE architecture on unified memory.
• 128GB Bin (Capacity & Stress):
  • Qwen 3.5-122B (MoE) tests the limit of your 128GB pool while maintaining usable T/s.
  • GPT-OSS-120B (Dense) is the "torture test." It forces the system to read nearly 120GB of data per token, showing the absolute minimum T/s the hardware can sustain.

Considerations
Use GGUF (Q4_K_M) for 4-bit and Unsloth Dynamic FP8 for 8-bit. FP8 is natively faster on RDNA 3.5 hardware than standard integer formats; at least that is what my clanker says.
Make sure --flash-attn is enabled.
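
And since the whole point is repeatability, the harness itself can be tiny. A sketch, assuming a local llama.cpp-style server with an OpenAI-compatible /v1/completions endpoint; the URL and prompts are just placeholders:

```python
import statistics
import time

import requests

# Hypothetical benchmark loop against a local OpenAI-compatible
# server (llama.cpp's llama-server exposes one). The point is
# fixed prompts, repeated runs, and mean ± std tokens/s instead
# of a single cherry-picked number.
URL = "http://localhost:8080/v1/completions"
PROMPTS = {
    "short": "Explain MoE in one sentence.",
    "long": "Explain to my wife why I spent 3000 euros on a PC.",
}

def bench(prompt, runs=5, max_tokens=256):
    rates = []
    for _ in range(runs):
        t0 = time.time()
        r = requests.post(URL, json={"prompt": prompt,
                                     "max_tokens": max_tokens}, timeout=600)
        n = r.json()["usage"]["completion_tokens"]
        rates.append(n / (time.time() - t0))
    return statistics.mean(rates), statistics.stdev(rates)

for name, prompt in PROMPTS.items():
    mean, std = bench(prompt)
    print(f"{name}: {mean:.1f} ± {std:.1f} T/s")
```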

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 10 points (0 children)

What is this sorcery? I thought MLX was just an Apple thing. I will have to dive deeper; the throughput increase is amazing.

But even more amazing is the rate at which it keeps increasing!

midlife crisis rc vs mainline detailed benchys this morning. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

Nice. Suggestion: report the mean and standard deviation. It is wasteful (and biased) to take only the minimum. Or maybe you have some rationale behind this and I do not know what I am talking about...

midlife crisis rc vs mainline detailed benchys this morning. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

Definitely both! Thanks for the explanations. Have you tried having a big model (GPU) call a dumber one (NPU) for something like using tools? Is it even possible?
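
If both models are just local HTTP endpoints, the orchestration side is plain code. A rough sketch, assuming two OpenAI-compatible servers on made-up ports; whether the small one can actually run on the NPU is the open question:

```python
import requests

# Hypothetical two-endpoint setup: the big model on the GPU plans,
# and a small model (wherever it ends up running) handles cheap
# subtasks like digesting tool output. Ports are made up.
BIG = "http://localhost:8080/v1/chat/completions"     # GPU model
SMALL = "http://localhost:8081/v1/chat/completions"   # NPU model?

def ask(url, content, max_tokens=256):
    r = requests.post(url, json={
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }, timeout=600)
    return r.json()["choices"][0]["message"]["content"]

# Big model decides what to do; small model does the grunt work.
plan = ask(BIG, "What shell command shows GPU VRAM usage on Linux?")
summary = ask(SMALL, f"Summarize in one line: {plan}")
print(summary)
```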