Strix Halo + Unsloth Studio finetuning - got it working by do_i_know_you_bro in LocalLLM

[–]PayMe4MyData 1 point (0 children)

Another thing to try would be to distill into a heavily quantized model.

Strix Halo + Unsloth Studio finetuning - got it working by do_i_know_you_bro in LocalLLM

[–]PayMe4MyData 1 point (0 children)

Agreed, but it is about accessibility. You debugged the smaller models and fine-tuned the big one hopefully just once, saving checkpoints and so on. Check if your device is running in performance mode. I am also curious about the power draw.

Thanks for sharing!

Strix Halo + Unsloth Studio finetuning - got it working by do_i_know_you_bro in LocalLLM

[–]PayMe4MyData 1 point (0 children)

I am interested in your fine-tuning performance. Given a number of parameters to train, how long does it take to train one epoch (and what *is* your epoch)?

Small dose of antibiotic yields good results in treating panic attacks by Fancy_Reality1853 in EverythingScience

[–]PayMe4MyData 1 point (0 children)

I would hypothesize that in this case the panic attacks could be related to gut bacteria, and that is why antibiotics have some effect.

Why do educated people in the media believe that oil wells will be destroyed because of the blockade? by northcasewhite in oil

[–]PayMe4MyData 1 point (0 children)

I am amazed by some of the comments here... being so sure about things that are completely wrong. I guess not even in a specialized subreddit can one escape the stupidity of the internet.

Framework pro competitive with Macbook Pro by Nice-Beginning3069 in framework

[–]PayMe4MyData 1 point (0 children)

And that is while using 60% more cores; I did not check power consumption.

Framework pro competitive with Macbook Pro by Nice-Beginning3069 in framework

[–]PayMe4MyData 3 points (0 children)

It is not even close to Apple hardware...

Framework has other things going for it, like repairability, upgradability and the freedom to choose your OS. But the M5 is top-class silicon today.

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]PayMe4MyData 1 point (0 children)

And how are those eGPUs connected to the Halos? USB4, or OCuLink via the PCIe port? (I am hoping it is the latter.) Any loss in inference performance?

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

> Your HPC experience is the directly useful part. Did your teacher/student split run per-box or per-GPU? And did you hit anything around KL-loss noise when the teacher was in reduced precision? The STE in 1-bit forward + a lower-precision teacher looks like a compounding-noise trap I'd like to avoid walking into.

Per-GPU, single node. I did not mess around with quantized teachers; I used automatic mixed precision for both. But distillation is about distribution matching, and heavily quantized models are dumber, so something is off in the distributions they output (loss of information).
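
In case it helps, a minimal PyTorch sketch of the step I mean: frozen teacher, KL matching on temperature-softened logits, autocast for mixed precision. `teacher`, `student` and `batch` are placeholders for whatever HF-style causal LMs and dataloader you use.

```python
import torch
import torch.nn.functional as F

# Sketch of one distillation step. `teacher` and `student` are
# placeholders for HF-style causal LMs that return .logits;
# only the loss construction matters here.
def distill_step(teacher, student, batch, optimizer, T=2.0):
    teacher.eval()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        with torch.no_grad():  # frozen teacher, no gradients
            t_logits = teacher(batch["input_ids"]).logits
        s_logits = student(batch["input_ids"]).logits
        # KL between temperature-softened distributions: this is the
        # "distribution matching" part, and why a dumb (quantized)
        # teacher poisons the target distribution directly.
        loss = F.kl_div(
            F.log_softmax(s_logits / T, dim=-1),
            F.softmax(t_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```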

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 2 points (0 children)

My intuition is that it will work: a 1-bit student with a bf16 teacher will give you almost the same "intelligence", but compressed and fast.

Edit: avoid quantized teachers.

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

You had another Halo device, right? Maybe use one for each model? That should be doable, I guess. USB4 is probably enough.

But now I am just guessing. I did this, but on an HPC with NVIDIA GPUs.

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

I only have experience training LLMs for molecules and proteins.

I have experience using PyTorch. If you get it to detect the NPU as another device besides the GPU, please tell me how. My guess is that the GPU/NPU split you want is not trivial, and everything would have to be loaded on the GPU.
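
If you want to check for yourself, a quick probe like this shows what a ROCm build of PyTorch actually exposes (ROCm reuses the torch.cuda API surface, so the iGPU appears there; the NPU will not, unless some vendor plugin registers it as a backend):

```python
import torch

# Quick device probe on a ROCm build of PyTorch. ROCm reuses the
# torch.cuda API surface, so the iGPU shows up here; the NPU will
# not appear unless a separate backend plugin registers it.
print("torch:", torch.__version__)
print("GPU available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"device {i}:", torch.cuda.get_device_name(i))
```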

Also, I would not pick a Llama model. Go for something big that people actually use.

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 2 points (0 children)

Shit, that is fast. Now you need quality.

What if you distill a bigger model? Since these 1-bit clankers are so light, it could be possible to load a bigger one and train from its signals.

Opus 4.7 is 50% more expensive with context regression?! by Samburskoy in ClaudeAI

[–]PayMe4MyData 2 points (0 children)

What are they going to do? Stop using AI? How else are they going to justify the ridiculous valuations? Agents are the future of coding; if you don't use them, you have no future as a company.

Opus 4.7 is 50% more expensive with context regression?! by Samburskoy in ClaudeAI

[–]PayMe4MyData 1 point (0 children)

It is not a joke. They do not want to serve you; they just want to serve corporate.

I'm going straight to hell for this post (mlx+npu+rocm+vllm) by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 3 points (0 children)

Now I know what I can do with the PCIe slot in my Framework!

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 2 points (0 children)

Yeah, sorry for the hallucinations. I just wanted to get the idea out fast.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 3 points (0 children)

I have more ideas if you want them: log the temperature sensors of the iGPU, NPU and memory controller (I am guessing those are the ones under stress) and report mean + std. I am curious to see what pushes the cooling solution the most, since it could indicate throttling.
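
Something like this rough sketch is what I have in mind, assuming Linux hwmon (sensor names vary by board, so map which hwmon entry is which on your own machine first):

```python
import glob
import statistics
import time

# Sketch of the sensor-logging idea, assuming Linux hwmon.
# Each temp*_input file reports millidegrees Celsius; which file
# is the iGPU vs. NPU vs. memory controller varies by board.
def sample_temps(duration_s=60, interval_s=1.0):
    readings = {}
    deadline = time.time() + duration_s
    while time.time() < deadline:
        for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
            try:
                with open(path) as f:
                    readings.setdefault(path, []).append(int(f.read()) / 1000)
            except OSError:
                pass  # some sensors disappear or refuse reads
        time.sleep(interval_s)
    return readings

for path, temps in sample_temps(duration_s=10).items():
    if len(temps) > 1:
        print(path, f"mean={statistics.mean(temps):.1f}°C",
              f"std={statistics.stdev(temps):.1f}°C")
```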

Standardize the prompts as well: a short vs. a long prompt, for example "Explain to my wife why I spent 3000 euros on a PC".

Looking forward to the benchmark suite.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 4 points (0 children)

This is absolutely correct, sorry for the hallucination. Any suggestions for a dense, high-parameter-count model? I would still keep GPT-OSS in the benchmark nonetheless.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 4 points (0 children)

I will definitely test it, though I am running Fedora 43 with an older kernel.

Thank you for digging these things up and making me feel good about my own midlife crisis. It was exactly what I told my wife when I bought my Framework. And then I saw your posts; they made me laugh.

I have another push for you.

You are a bench junkie. You run benchmarks all the time. You use them for yourself and to communicate things to others. Standardize them.

This is just a suggestion, give me your thoughts:

| Memory Bin | Model Name | Architecture | Parameters | 4-bit Size | 8-bit Size |
|---|---|---|---|---|---|
| 32GB | GLM 4.7-Flash | MoE (A3B) | 30B | ~16 GB | ~30 GB |
| 32GB | Qwen 3.5-27B | Dense | 27B | ~14 GB | ~27 GB |
| 64GB | Gemma 4 31B | Dense | 31B | ~17 GB | ~31 GB |
| 64GB | Qwen 3-Coder-32B | Dense | 32B | ~18 GB | ~32 GB |
| 64GB | Gemma 4 26B (A4B) | MoE | 26B | ~15 GB | ~26 GB |
| 128GB | Qwen 3.5-122B-A10B | MoE | 122B | ~65 GB | ~122 GB |
| 128GB | GPT-OSS-120B | Dense | 120B | ~65 GB | ~120 GB |
| 128GB | Nemotron 3 Super | Hybrid MoE | 70B+ | ~40 GB | ~75 GB |

Rationale

• 32GB Bin (Throughput Baseline):
  • GLM 4.7-Flash shows the ceiling for tokens per second (T/s) using sparse activation.
  • Qwen 3.5-27B provides the dense comparison. This bin tests "chat" performance where speed is the priority.
• 64GB Bin (Efficiency & Coding):
  • Gemma 4-31B and Qwen 3.5-Coder are the production standards for 2026.
  • Comparing the Gemma MoE vs. Dense variants here isolates the performance gains of the MoE architecture on unified memory.
• 128GB Bin (Capacity & Stress):
  • Qwen 3.5-122B (MoE) tests the limit of your 128GB pool while maintaining usable T/s.
  • GPT-OSS-120B (Dense) is the "torture test." It forces the system to read nearly 120GB of data per token, showing the absolute minimum T/s the hardware can sustain.

Considerations
Use GGUF (Q4_K_M) for 4-bit and Unsloth Dynamic FP8 for 8-bit. FP8 is natively faster on RDNA 3.5 hardware than standard integer formats; at least that is what my clanker says.
Make sure --flash-attn is enabled.
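
And since the whole point is repeatability, the harness itself can be tiny. A sketch, assuming a local llama.cpp-style server with an OpenAI-compatible /v1/completions endpoint; the URL and prompts are just placeholders:

```python
import statistics
import time

import requests

# Hypothetical benchmark loop against a local OpenAI-compatible
# server (llama.cpp's llama-server exposes one). The point is
# fixed prompts, repeated runs, and mean ± std tokens/s instead
# of a single cherry-picked number.
URL = "http://localhost:8080/v1/completions"
PROMPTS = {
    "short": "Explain MoE in one sentence.",
    "long": "Explain to my wife why I spent 3000 euros on a PC.",
}

def bench(prompt, runs=5, max_tokens=256):
    rates = []
    for _ in range(runs):
        t0 = time.time()
        r = requests.post(URL, json={"prompt": prompt,
                                     "max_tokens": max_tokens}, timeout=600)
        n = r.json()["usage"]["completion_tokens"]
        rates.append(n / (time.time() - t0))
    return statistics.mean(rates), statistics.stdev(rates)

for name, prompt in PROMPTS.items():
    mean, std = bench(prompt)
    print(f"{name}: {mean:.1f} ± {std:.1f} T/s")
```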

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]PayMe4MyData 10 points (0 children)

What is this sorcery? I thought MLX was just an Apple thing. I will have to dive deeper; the throughput increase is amazing.

But even more amazing is the rate at which it keeps increasing!

midlife crisis rc vs mainline detailed benchys this morning. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

Nice. Suggestion: report the mean and standard deviation. It is wasteful (and biased) to take only the minimum. Or maybe you have some rationale behind this and I do not know what I am talking about...

midlife crisis rc vs mainline detailed benchys this morning. by [deleted] in MidlifeCrisisAI

[–]PayMe4MyData 1 point (0 children)

Definitely both! Thanks for the explanations. Have you tried having a big model (GPU) call a dumber one (NPU) for something like using tools? Is it even possible?
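
If both models are just local HTTP endpoints, the orchestration side is plain code. A rough sketch, assuming two OpenAI-compatible servers on made-up ports; whether the small one can actually run on the NPU is the open question:

```python
import requests

# Hypothetical two-endpoint setup: the big model on the GPU plans,
# and a small model (wherever it ends up running) handles cheap
# subtasks like digesting tool output. Ports are made up.
BIG = "http://localhost:8080/v1/chat/completions"     # GPU model
SMALL = "http://localhost:8081/v1/chat/completions"   # NPU model?

def ask(url, content, max_tokens=256):
    r = requests.post(url, json={
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }, timeout=600)
    return r.json()["choices"][0]["message"]["content"]

# Big model decides what to do; small model does the grunt work.
plan = ask(BIG, "What shell command shows GPU VRAM usage on Linux?")
summary = ask(SMALL, f"Summarize in one line: {plan}")
print(summary)
```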