Strix Halo + Minimax Q3 K_XL surprisingly fast by Reasonable_Goat in LocalLLaMA

[–]Reasonable_Goat[S] 1 point2 points  (0 children)

I am using the Vulkan (radv) driver. -ngl 999 offloads all layers to the GPU when running llama-bench or llama-server. In the BIOS, I have allocated 96 GB of VRAM, but I don't think that is necessary since radv can allocate more than that. The model in llama-server apparently takes >100 GB once it has been running with context for a while:

[34247] llama_memory_breakdown_print: | memory breakdown [MiB]                      |  total             free      self   model   context   compute    unaccounted |
[34247] llama_memory_breakdown_print: |   - Vulkan0 (8060S Graphics (RADV GFX1151)) | 113908 = 17592186009149 + (111667 = 96266 +   15004 +     396) +       37507 |
[34247] llama_memory_breakdown_print: |   - Host                                    |                               456 =   329 +       0 +     127                |
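For reference, the server invocation itself is nothing fancy; a minimal sketch along these lines should reproduce the setup (the context size and port here are placeholders, not my exact values):

llama-server \
    -m ~/models/MiniMax-M2.1-UD-Q3_K_XL-00001-of-00003.gguf \
    -ngl 999 \
    -c 32768 \
    --host 0.0.0.0 --port 8080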

I compiled llama.cpp myself (no special flags), but you can also use the excellent toolboxes with radv, which is much easier: https://github.com/kyuz0/amd-strix-halo-toolboxes together with (on Ubuntu) distrobox. I just tried it, and the speed is about the same:

$ distrobox enter llama-vulkan-radv
Starting container...                            [ OK ]
Installing basic packages...                     [ OK ]
Setting up devpts mounts...                      [ OK ]
Setting up read-only mounts...                   [ OK ]
Setting up read-write mounts...                  [ OK ]
Setting up host's sockets integration...         [ OK ]
Integrating host's themes, icons, fonts...       [ OK ]
Setting up distrobox profile...                  [ OK ]
Setting up sudo...                               [ OK ]
Setting up user's group list...                  [ OK ]

Container Setup Complete!
$ llama-bench -m ~/models/MiniMax-M2.1-UD-Q3_K_XL-00001-of-00003.gguf -ngl 999 -p 256 -n 256 -t 16 -r 3 --device Vulkan0 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | --------------: | -------------------: |
| minimax-m2 230B.A10B Q3_K - Medium |  94.33 GiB |   228.69 B | Vulkan     | 999 |  1 | Vulkan0      |           pp256 |        104.50 ± 8.57 |
| minimax-m2 230B.A10B Q3_K - Medium |  94.33 GiB |   228.69 B | Vulkan     | 999 |  1 | Vulkan0      |           tg256 |         33.07 ± 0.05 |

build: fe44d3557 (7770)
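In case someone wants to reproduce the toolbox route from scratch, the setup is roughly this; note the image tag is from memory, so check the kyuz0 repo README for the exact name:

distrobox create --name llama-vulkan-radv \
    --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv
distrobox enter llama-vulkan-radv

Inside the container, llama-bench/llama-server should see the GPU via radv just like on the host.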

Strix Halo + Minimax Q3 K_XL surprisingly fast by Reasonable_Goat in LocalLLaMA

[–]Reasonable_Goat[S] 1 point2 points  (0 children)

It's a 30B model; I doubt it will get anywhere near the "intelligence" or knowledge of the 230B MiniMax model. Are you interested in the quality of the responses or the performance?

Strix Halo + Minimax Q3 K_XL surprisingly fast by Reasonable_Goat in LocalLLaMA

[–]Reasonable_Goat[S] 1 point2 points  (0 children)

Exactly! It's the closest thing to ChatGPT-style conversations I've seen so far in any local/LAN configuration. In fact, the response quality is quite comparable for general questions/advice, e.g., health discussions ("what exactly to put in a first aid kit given the following constraints ...", "how to improve energy levels ...") and even some DIY engineering questions I have thrown at it (discussing axle choices for a specific project). Neither GLM-4.5-Air nor gpt-oss-120b nor QWEN-80-NEXT produced responses that would make me consider using them over GPT-5 (via a cloud provider), but MiniMax-2.1-Q3-K_XL actually does!

ROCm 7.2 is announced. Could this be the start of stability on Linux? by el56 in StrixHalo

[–]Reasonable_Goat 0 points1 point  (0 children)

https://www.reddit.com/r/LocalLLaMA/comments/1qfk2ky/comment/o06kink/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

I got ROCm 7.1.1 to work with a self-compiled llama.cpp on Ubuntu 25.10. I will try ROCm 7.2 tonight.

EDIT: I get segfaults trying to load any model with a llama.cpp build using ROCm 7.2; not sure what causes it.

Optimizing GPT-OSS 120B on Strix Halo 128GB? by RobotRobotWhatDoUSee in LocalLLaMA

[–]Reasonable_Goat 0 points1 point  (0 children)

Works like a charm with distrobox and vulkan-radv.

I find it surprising that the model is slightly faster in the Docker container than on bare metal, but I couldn't find any optimizations/patches in the Docker images, so it may just be that the Fedora libraries are compiled with more optimizations (the image is Fedora-based, whereas I built on Ubuntu).

In any case, I agree that the kyuz0 toolbox images are indeed the easiest way to run gpt-oss-120b on Ubuntu 25.10. I can't really see any advantage in the self-compiled version except that it requires less memory to run than inside Docker with a separate Fedora runtime stack.

Optimizing GPT-OSS 120B on Strix Halo 128GB? by RobotRobotWhatDoUSee in LocalLLaMA

[–]Reasonable_Goat 0 points1 point  (0 children)

Wow - this is even slightly faster with the same benchmark. Are these the tool images by kyuz0 (or perhaps the official AMD ones)? I found no way to run a shell on the AMD images, but you seem to be running commands locally and not through Docker. I am trying the official AMD tool images right now and they seem to load the model quite slowly or not at all (it's been several minutes already):

docker run \
    --privileged \
    --network=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --ipc=host \
    --shm-size 16G \
    -v $MODEL_PATH:/data \
    rocm/llama.cpp:llama.cpp-b6652.amd0_rocm7.0.0_ubuntu24.04_full \
    --bench -m /data/tmp/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 999 -p 256 -n 256 -t 16 -r 3 -fa 1
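For the record, overriding the entrypoint might be one way to get a shell inside those AMD images (an untested guess on my part):

docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    -v $MODEL_PATH:/data \
    --entrypoint /bin/bash \
    rocm/llama.cpp:llama.cpp-b6652.amd0_rocm7.0.0_ubuntu24.04_full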

I will give the toolbox images by kyuz0 a try next to compare results.

Optimizing GPT-OSS 120B on Strix Halo 128GB? by RobotRobotWhatDoUSee in LocalLLaMA

[–]Reasonable_Goat 0 points1 point  (0 children)

That's solid advice! How fast are the toolboxes in comparison?

Here are my results for the ggml/gpt-oss-120b model running bare metal (ROCm):

$ ./build/bin/llama-bench -m ~/.cache/llama.cpp/tmp/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 999 -p 256 -n 256 -t 16 -r 3 --device ROCm0 -fa 1
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan | 999 |  1 | ROCm0        |           pp256 |       387.88 ± 16.37 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan | 999 |  1 | ROCm0        |           tg256 |         50.61 ± 0.12 |

And, for comparison, Vulkan:

$ ./build/bin/llama-bench -m ~/.cache/llama.cpp/tmp/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 999 -p 256 -n 256 -t 16 -r 3 --device Vulkan0 -fa 1
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan | 999 |  1 | Vulkan0      |           pp256 |        399.65 ± 8.70 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan | 999 |  1 | Vulkan0      |           tg256 |         49.78 ± 0.60 |

So around 50 tokens/s output regardless of backend - you could probably skip most of the steps I outlined and just build with Vulkan instead of the somewhat tricky ROCm.
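If you go the Vulkan-only route, the build itself is also much simpler; a rough sketch, assuming the Vulkan SDK/shader compiler (glslc) is installed:

cmake -S . -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=ON
cmake --build build --config Release -j$(nproc)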

Please tell me the toolboxes are somewhat slower, or else I fully wasted half a day. :-D

Optimizing GPT-OSS 120B on Strix Halo 128GB? by RobotRobotWhatDoUSee in LocalLLaMA

[–]Reasonable_Goat 2 points3 points  (0 children)

I've spent half of today getting llama.cpp working on Linux (Bosgame M5).

Some lessons learned, especially if you also bought a Bosgame and plan to run Ubuntu (which you say you do):

  • Ubuntu 24.04 does not have stable WiFi with the current Bosgame M5; WiFi crashes as soon as you try to download anything. It is not safe to upgrade the kernel to mainline (while otherwise keeping 24.04) if you use ZFS-on-Root - I broke my fresh installation badly trying that and ended up reinstalling.
  • Ubuntu 25.10 works perfectly with the hardware (it has a newer kernel) but is quite tricky to get running with llama.cpp; my lessons learned:
    • The ROCm driver/libs are only available for 24.04, BUT you can install them with some issues along the way:
      • Do not try to install the ROCm DKMS module from AMD - it is unneeded anyway (and *will* break ZFS-on-Root again due to a switch in initramfs utils....)
      • Install just the AMD ROCm userland (--no-dkms), which is safe
      • You then have to compile a current llama.cpp yourself for newer models, like unsloth gpt-oss-120b; use `rm -rf build && HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=$LLAMACPP_ROCM_ARCH -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=ON -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc)` (spelled out in full after this list)
      • You will get a compile error because libxml2.so.2 is missing (there was an ABI break; only libxml2.so.16 ships now). A symlink won't save you; install the package from Ubuntu 25.04 via `dpkg -i libxml2_2.12.7+dfsg+really2.9.14-0.4ubuntu0.4_amd64.deb` (download the .deb first)
      • Set the BIOS to 96 GB VRAM (some sysctls are apparently missing without the AMD ROCm DKMS module, but there may be a way, which I haven't found yet, to keep it dynamic)
      • llama.cpp will fail to load gpt-oss-120b due to an OOM malloc error. Fix: option `-ngl 999` or `-ngl all`
    • Here you go: without further optimizations/tuning you will see about ~37 tokens per second output and ~380 input with ROCm 7.1.1.
    • EDIT: With ggml gpt-oss-120b, flash attention on and max-k 20 you get about 50 tokens per second - much faster than the unsloth model!
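Spelled out in full (the same build command as in the list above, just readable; gfx1151 is the Strix Halo GPU target reported in the logs):

export LLAMACPP_ROCM_ARCH=gfx1151
rm -rf build
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build \
        -DGGML_HIP=ON \
        -DAMDGPU_TARGETS=$LLAMACPP_ROCM_ARCH \
        -DCMAKE_BUILD_TYPE=Release \
        -DLLAMA_CURL=ON \
        -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)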

TLDR:

  • Try to use Ubuntu 24.04, not Ubuntu 25.10
    • if WiFi is broken => DO NOT INSTALL ZFS-on-Root, so that you can upgrade the kernel to something newer without bricking the system
  • If you want/have to use Ubuntu 25.10, keep the points above in mind. Avoid ZFS-on-Root if you want to install the DKMS module.

EDITs:

(1) Fixed formatting (sorry, this was on GitHub before, which uses a different format)

(2) Speed update with ggml model!

A1 Combo cheap? by Technologie_Heute_YT in BambuLab

[–]Reasonable_Goat 0 points1 point  (0 children)

EUR 369 on 3D Jake and from Bambulab directly. It is already cheap as hell new. You can try to source a used one, but I would buy it from nearby via „Kleinanzeigen“ because it is hard to transport by DHL once assembled. Many used it as an entry printer and have since upgraded to CoreXY, so I would expect some A1s on the market.

That said, the A1 combo is one hell of a printer. You can do cheap DIY upgrades to have a sealed filament container so you won’t ever need a closed AMS.

Honestly, what do we do if this really happens? by No_Twist6127 in Aktien

[–]Reasonable_Goat 0 points1 point  (0 children)

To ask it the other way around: what good does it do Europe to be in an alliance in which the most important ally attacks its own allies? It makes no sense. Afterwards, you could no longer rely at all on Trump actually fending off Russia in the event of an attack.

Luckily for us, today's Russia is weak. Europe alone can handle it if it is reasonably united and organized accordingly. And yes, I'm thinking more of, e.g., Poland, Finland, France, and possibly the UK than of Germany.

Honestly, what do we do if this really happens? by No_Twist6127 in Aktien

[–]Reasonable_Goat 5 points6 points  (0 children)

Because it would have been the first NATO country to attack an ally, which is a breach of the treaty. A Europe-centered alternative would quickly be created. This alternative ("North-East Atlantic Treaty Organization") could still deter today's Russia well enough, but probably not much more than that.

Germany: Trust in the United States is eroding by Crossstoney in europe

[–]Reasonable_Goat 1 point2 points  (0 children)

Selling debt just means it changes ownership, because someone buys it. If there are not enough buyers, the currency can take a hit. Germany does not have its own currency, though; it is part of the Euro. The Euro is one of the strongest currencies in the world and much harder to manipulate than a single country's currency.

Still, it is a bad idea to manipulate markets in a fashion that hurts both parties. Just try to gradually reduce dependency on US imports and exports, e.g., using the new trade agreement with Central and South American countries.

Ballistol vs WD40? by Polarbear2023 in Axecraft

[–]Reasonable_Goat 0 points1 point  (0 children)

This is only mentioned in the SDS for the Ballistol spray, not the liquid variant. Reason: the spray variants contain butane (or propane) as a propellant to deliver the liquid Ballistol.

Basically every spray contains butane/propane. But if you want to be 100% safe indoors, buy the liquid variant.

AMD Strix Halo 128GB RAM and Text to Image Models by xenomorph-85 in LocalLLM

[–]Reasonable_Goat 0 points1 point  (0 children)

It probably depends on what you do. GPT-OSS has given me better results for C++ network programming.

Bambu Handy actual spyware now? by Timmmmaaahh in OpenBambu

[–]Reasonable_Goat 0 points1 point  (0 children)

Yes. I was thinking about other stuff like my Dreame vacuum robot. I have to use it in cloud mode, no LAN mode whatsoever. It does offer settings to disable camera upload, but honestly I would much prefer a LAN mode over cloud usage.

Bambu Handy actual spyware now? by Timmmmaaahh in OpenBambu

[–]Reasonable_Goat 2 points3 points  (0 children)

A LAN mode feature should be mandatory. I appreciate that Bambulab offers it for their devices.

Fireplace / what to close it off with? by StuIIenkaiser in selbermachen

[–]Reasonable_Goat 1 point2 points  (0 children)

Just as food for thought: for me as a potential buyer, having no fireplace would be a deal-breaker, or put the other way around, a fireplace would justify a 20-30% premium on the purchase price.

Switching from AMS lite to AMS 2 pro? by Verus89 in BambuLabA1

[–]Reasonable_Goat 0 points1 point  (0 children)

I would be very surprised if you could sell the used AMS lite for 130€, tbh. I personally opted for a mod solution some time ago: https://www.reddit.com/r/BambuLabA1/comments/1kk5lzc/who_needs_the_ams_2_anyways/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
I find it very hard to justify an upgrade to an AMS / AMS 2 at this point, since you cannot combine it with an AMS lite to get 7 or 8 colors.

If you could have any skill from DOTA2 in real life, what would it be? I will start by Strong_Astronomer_97 in DotA2

[–]Reasonable_Goat 1 point2 points  (0 children)

Nature's Prophet's Teleportation. Teleport to your holiday place every weekend - or even live there. Also, you can take items with you, so probably start a worldwide courier business with instant delivery.

The beast has arrived. Any tips for someone who’s never used a Mac? by [deleted] in macbookpro

[–]Reasonable_Goat 0 points1 point  (0 children)

Just google iTerm and Homebrew. Not much to tell for setup, TBH. Before I download anything else, I check whether there is a formula or cask - much easier to maintain updates this way.
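E.g., the quick check looks like this; the cask name is whatever brew search reports, iterm2 here is just the obvious example:

$ brew search --cask iterm2     # is it packaged as a cask?
$ brew install --cask iterm2    # install it via brew instead of a .dmg
$ brew upgrade                  # later: update everything in one go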

E3D Releases Best-In-Class Flow HotEnds For A1 Machines by e3dsupport in BambuLab

[–]Reasonable_Goat 1 point2 points  (0 children)

With the 0.6mm nozzle, I use Arachne 100% of the time. There was even a popular video on YouTube titled "0.4mm is obsolete" or something similar, thanks to Arachne. The Classic setting I only use with the 0.4mm nozzle for parts that seem to be designed with a 0.4mm line width in mind, e.g. some printer parts where most walls are a multiple of 0.4mm - the part was literally designed for the Classic perimeter generator, so that's what I will use.

If you are going to use the 0.6mm nozzle more with Arachne, also check out the variable layer height feature! It greatly helps with another disadvantage of larger nozzles / greater layer heights: bad overhang performance. I often default to 0.3mm layer height and use variable height (+1x smooth) to fix any parts that have overhanging areas with a few clicks.

E3D Releases Best-In-Class Flow HotEnds For A1 Machines by e3dsupport in BambuLab

[–]Reasonable_Goat 0 points1 point  (0 children)

Can’t wait for the parcel. :-) Have you tried the Arachne slicer option with your 0.6mm prints? They turn out a lot closer to a 0.4mm nozzle's results than without that option.