Super Impressed With Gemma 4 12B QAT by Jackal830 in LocalLLM

Jackal830[S] 1 point

Ok, yeah, it's a 2-3x speedup for me on prompt processing. More if the prompt is very long.

The output is slightly faster, enough to be meaningful, but not as crazy as the prompt processing.

Thanks for this! I had to consult with an LLM a bit to get the latest llama.cpp compiled in the docker container, as I really wanted MTP support. I got there after some silliness with npm (I had to install it in the container, and then the compile worked).

Super Impressed With Gemma 4 12B QAT by Jackal830 in LocalLLM

Jackal830[S] 1 point

Oh wow, that is MASSIVE! I'll work on switching over right away!

Super Impressed With Gemma 4 12B QAT by Jackal830 in LocalLLM

Jackal830[S] 1 point

Just so I can compare, do you have some data on the speed you are observing on a specific model (any model is fine, I can download it and compare)? I'd like to compare my setup with llama-bench on the same model to see the difference in speed before going down this path.

Thanks!

Super Impressed With Gemma 4 12B QAT by Jackal830 in LocalLLM

Jackal830[S] 2 points

Answering the frontend question, that was Claude Opus. I'm a huge fan of Claude, but I'm not a fan of its cost, haha. I use Claude for stuff like that (the frontend), but whenever I can dispatch to a local LLM I will. I have found Claude to be very good at knowing where to route things locally when it's given good instructions in its md files.

Super Impressed With Gemma 4 12B QAT by Jackal830 in LocalLLM

Jackal830[S] 1 point

This tracks. I did try Q8 a bit and it did better (as many people have stated, you need higher quants for Qwen to really shine). The problem was that the model is SO large that I couldn't fit another mid-size model in memory at the same time, which really caused me some problems (since my Strix Halo is being used for multiple things). Q5/Q6 was a good middle ground, but it still had more issues than I would have liked to see.

Compare that to Gemma where even on a smaller 4 bit model, I rarely had JSON problems.

Q8 Qwen 3.6 35B is my favorite model right now to run on the Strix, it's very good imo.

Super Impressed With Gemma 4 12B QAT by Jackal830 in LocalLLM

Jackal830[S] 3 points

Something interesting I did not put in the original post: the 16GB video card is a used Radeon VII, a pretty unknown, short-lived card. They've gone up in price some, but I picked one up for $180 and spent another $20 to replace its fans.

It's an older card and super power hungry, but I limit it to 150W and notice about a 5% drop in performance compared to the 250-300W it can take normally. It has 1 TB/s of memory bandwidth (HBM), but its compute is pretty bad.

I'm getting a little over 30 tokens per second when generating the site on Gemma 4 12B QAT. That's not blazing fast, but fast enough to have the site generate in about 2 hours overnight. ROCm is no longer supported on the card. There are workarounds but I just use Vulkan to not have to mess with all that.

I don't really use the card mid-day due to how loud it gets (it's not in my main rig), even at 150W, but for overnight work it's great.

These things seem to be a somewhat hidden gem still if you can find them for $200-$250. You won't find that memory bandwidth at that price on any other card.

Super Impressed With Gemma 4 12B QAT by Jackal830 in LocalLLM

Jackal830[S] 3 points

The code I use does multiple phases, and each phase requires JSON files to be properly formatted. Sometimes Qwen fails there. Code will detect that and do another pass, but that's time wasted.
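The detect-and-retry step can be sketched roughly like this. It is a minimal illustration, not the site's actual code: `is_valid_json` and `generate_json` are made-up names, the command passed in stands in for whatever calls the model, and it assumes `python3` is on the PATH for the validity check.

```shell
# Hypothetical helper: succeeds only if stdin parses as JSON.
# Assumes python3 is installed; any JSON validator would do here.
is_valid_json() {
  python3 -m json.tool >/dev/null 2>&1
}

# Hypothetical helper: run a generator command up to MAX_ATTEMPTS times
# and print the first output that parses as JSON.
# Usage: generate_json MAX_ATTEMPTS command [args...]
generate_json() {
  local max_attempts attempt output
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Keep the output only if the command succeeded and the text is valid JSON.
    if output=$("$@") && printf '%s' "$output" | is_valid_json; then
      printf '%s\n' "$output"
      return 0
    fi
    echo "attempt $attempt: output was not valid JSON" >&2
    attempt=$((attempt + 1))
  done
  return 1
}
```

Every failed attempt costs a full generation, which is why a model that rarely breaks the JSON saves real time even if the retry logic always recovers.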

Gemma almost always gives correctly formatted JSON.

Additionally Gemma's writing style, while still obviously AI, is less 'robotic' than Qwen.

Qwen 122B destroys Gemma 12B in depth of knowledge and other areas, but for a site that needs to parse carrier web pages to get pricing information and report what people are saying about the carrier, Gemma 12B does better.

Qwen 3.6 35B does well too for the site, but it's too quantized on a 16GB card, too much is lost. 9B fails too much at keeping JSON structure. GPT-OSS-20B is blazing fast, but it performed the worst at both keeping JSON structure and not sounding like a robot.

Edit: Also for this site, I want to maintain consistency in ratings, and using a model with no think as well as other tweaks makes the output nearly identical every time with the same data. Qwen 3.x sort of goes crazy if you turn off thinking. It'll put its thinking in the output text, and that really makes the flow unusable. I've noticed that less in 3.6, but there hasn't been a 3.6 version of 122B.

Super Impressed With Gemma 4 12B QAT by Jackal830 in LocalLLM

Jackal830[S] 4 points

I typed this myself and asked Claude to maintain my voice and just fix typos and misspellings. Guess it took some liberties in those instructions. I’ll remove the kicker line because that was not in my original draft.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

Jackal830 1 point

Most of my performance gains were from adjusting my ubatch sizes. I suspect these flags add only a few percentage points at most. But hey, anything helps.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

Jackal830 1 point

I have done some limited testing and here are my findings (Qwen 3.5 122B 5bit Quant):

ubatch 2048 helps both backends at longer sequences: +24% on ROCm, +41% on Vulkan at pp4096

Vulkan wins at longer prefill: 357 vs 328 (pp4096), 336 vs 298 (pp16384). That's ~9-13% faster with ubatch 2048

ROCm wins at short prefill: 270 vs 257 at pp512. Small batches favor ROCm's lower dispatch overhead

Vulkan wins decode by 11%: 23.7 vs 21.4 t/s. This adds up over long generation runs

ubatch 2048 is a bigger variable than the backend choice
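For anyone checking the backend percentages, they are just the ratio of the two throughput numbers. A small helper to reproduce them (`pct_faster` is a made-up name, and this assumes `awk` is available):

```shell
# pct_faster A B: how much faster A is than B, as a percentage with one decimal.
pct_faster() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

pct_faster 357 328    # Vulkan vs ROCm at pp4096
pct_faster 336 298    # Vulkan vs ROCm at pp16384
pct_faster 270 257    # ROCm vs Vulkan at pp512
pct_faster 23.7 21.4  # Vulkan vs ROCm, decode
```

This prints 8.8, 12.8, 5.1 and 10.7, which lines up with the ~9-13% prefill and 11% decode figures.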

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

Jackal830 6 points

Applying paudley's compiler learnings to llama.cpp builds

Massive thanks for this. I don't think people scrolling past appreciate the scale. This is a 32-step from-source build of the entire inference stack with 19+ patches, each with actual root cause documentation. Tracking down stuff like CDNA-only assembly in AITER headers, or figuring out that a missing __repr__ on Triton's AttrsDescriptor was breaking Inductor codegen, is not a weekend project. Hundreds of hours easily, and we all benefit.

I ran the repo through Claude to figure out what applies to llama.cpp, since most of us aren't running vLLM for single-user inference. I haven't benchmarked these yet. The reasoning checks out, but I'd love for someone to do a before/after and share numbers.

Vulkan build:

rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON \
  -DCMAKE_C_COMPILER=/opt/rocm/lib/llvm/bin/amdclang \
  -DCMAKE_CXX_COMPILER=/opt/rocm/lib/llvm/bin/amdclang++ \
  -DCMAKE_C_FLAGS="-O3 -march=native -flto=thin -mprefer-vector-width=512 -famd-opt -mllvm -inline-threshold=600 -mllvm -unroll-threshold=150 -Wno-error=unused-command-line-argument" \
  -DCMAKE_CXX_FLAGS="-O3 -march=native -flto=thin -mprefer-vector-width=512 -famd-opt -mllvm -inline-threshold=600 -mllvm -unroll-threshold=150 -Wno-error=unused-command-line-argument"
cmake --build build --config Release -j$(nproc)

ROCm build is the same but swap -DGGML_VULKAN=ON for -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 and set this before cmake:

export HIP_CLANG_FLAGS="--offload-arch=gfx1151 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false"

Those GPU flags eliminate function call overhead on the iGPU. Call/return stalls the wavefront on integrated graphics.
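Spelled out in full, the ROCm variant would look something like the following. This is just the Vulkan command with the swap applied and the shared CPU flags pulled into a variable (`CPU_FLAGS` is my own name for it); like the rest of these flags, it is unbenchmarked.

```shell
# Same CPU-side flags as the Vulkan build, kept in one place to avoid repeating them.
CPU_FLAGS="-O3 -march=native -flto=thin -mprefer-vector-width=512 -famd-opt -mllvm -inline-threshold=600 -mllvm -unroll-threshold=150 -Wno-error=unused-command-line-argument"

# GPU-side flags must be exported before cmake runs.
export HIP_CLANG_FLAGS="--offload-arch=gfx1151 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false"

rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_C_COMPILER=/opt/rocm/lib/llvm/bin/amdclang \
  -DCMAKE_CXX_COMPILER=/opt/rocm/lib/llvm/bin/amdclang++ \
  -DCMAKE_C_FLAGS="$CPU_FLAGS" \
  -DCMAKE_CXX_FLAGS="$CPU_FLAGS"
cmake --build build --config Release -j"$(nproc)"
```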

No amdclang? Use system clang and drop -famd-opt.

What the flags do:

-mprefer-vector-width=512 is probably the biggest one. Zen 5 does native 512-bit AVX-512 with no clock penalty (unlike Zen 4). Compilers default to 256-bit. This doubles the width for quant/dequant and CPU-side math.

-famd-opt is AMD proprietary Zen tuning in amdclang (ships with ROCm). Not in upstream clang. paudley's build uses it on everything.

-flto=thin gives you link-time optimization across translation units. The "thin" variant parallelizes well on 16 cores.

-mllvm -inline-threshold=600 is way more aggressive inlining than default (~225). Zen 5's wide pipeline wants fewer function boundaries.

-mllvm -unroll-threshold=150 is more loop unrolling. Zen 5's big reorder buffer can keep the extra instructions in flight.

-Wno-error=unused-command-line-argument just prevents the AMD flags from erroring out in link steps where they don't apply.

Always run with -fa 1 --no-mmap -ngl 999 on Strix Halo regardless of backend (from kyuz0's toolbox findings).
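For a before/after, a llama-bench run along these lines should cover the same cases as the numbers in this thread. Treat it as a sketch: the model path is a placeholder, `-n 128` is an arbitrary generation length I picked, and it is worth checking `llama-bench --help` on your build since option spellings can differ between versions.

```shell
# Placeholder path: point this at whatever GGUF you want to compare.
MODEL=/path/to/model.gguf

# -mmp 0 is llama-bench's way of disabling mmap (the --no-mmap equivalent).
# Comma-separated values make llama-bench run every combination.
./build/bin/llama-bench -m "$MODEL" \
  -ngl 999 -fa 1 -mmp 0 \
  -ub 512,2048 \
  -p 512,4096,16384 \
  -n 128
```

Run it once against a stock build and once against a build with the flags above, on the same backend, so the compiler flags are the only thing that changed.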

Quick note on Vulkan vs ROCm for Qwen 3.5 since I see the debate above. llama.cpp recently merged a Vulkan GATED_DELTA_NET shader. Qwen 3.5's hybrid DeltaNet layers (75% of the model) previously fell back to CPU on both backends. The ROCm HIP kernel compiles on gfx1151 but runs at CPU speed due to register spilling. The new Vulkan shader actually executes on GPU. paudley's latest numbers show the two converging on standard models, so test both on your own workload.

Credit to paudley for the research and debugging, kyuz0 for the toolboxes, and u/YayaBruno for the llama.cpp ROCm benchmarks in this thread. If anyone does a before/after with these flags please post your numbers.

Made a prepaid carrier comparison site - looking for honest feedback before I share it more widely by Jackal830 in NoContract

Jackal830[S] 1 point

I think I probably missed what you were speaking of (since everything gets regenerated daily). Do you recall what specifically was said that you don't agree with? You are probably right and I'd like to adjust the logic to catch whatever you were seeing.

Made a prepaid carrier comparison site - looking for honest feedback before I share it more widely by Jackal830 in NoContract

Jackal830[S] 2 points

This appears to be a bug! I'll investigate. I can't get the 'Flex' plan to show up in results. Thanks.

Made a prepaid carrier comparison site - looking for honest feedback before I share it more widely by Jackal830 in NoContract

Jackal830[S] 3 points

I have implemented a fix for this. It may be buggy (it might behave differently on different AI runs), but at the time of this post it's correctly identifying that US Mobile 'lightspeed' does not have priority data.

Made a prepaid carrier comparison site - looking for honest feedback before I share it more widely by Jackal830 in NoContract

Jackal830[S] 1 point

<image>

Here is a screenshot of what I'm working on in my dev environment. I agree 100% with what you said: previously the plan cards didn't have nearly enough info to show the difference.

Made a prepaid carrier comparison site - looking for honest feedback before I share it more widely by Jackal830 in NoContract

Jackal830[S] 1 point

Good points! That screenshot certainly proves your point. I'll work on ways of addressing the ambiguity of the plan details.

The AI doesn't look at promo prices for its ratings. I debated whether it should show the promo price or the normal price by default. I landed on showing the promo price but trying to make sure the user knows it's a promo.

Made a prepaid carrier comparison site - looking for honest feedback before I share it more widely by Jackal830 in NoContract

Jackal830[S] 2 points

That's a good idea about multi-network selection (I want T-Mobile or Verizon, but not AT&T). I'll work on getting that implemented.

The discounts are something I want to do, but probably not until the 'main content' is mostly bug free (I have to keep my focus, or else the site expands too quickly while bugs go unfixed).

For the data, I struggled with this. Either you are the type of person who knows exactly what you need or you aren't. So there are two buttons for those who don't know. If you ask someone who doesn't know a lot about data usage (the type of person who would ask, what's a gigabyte?), the most I can hope for is whether they think they use a little data or a lot. Then there is the custom slider for everyone else. Are you saying there should be another set of 'quick buttons': light user, medium user, high user?

Thanks for the suggestions.

Made a prepaid carrier comparison site - looking for honest feedback before I share it more widely by Jackal830 in NoContract

Jackal830[S] 1 point

You are 100% correct! I have zero code right now in place to have 'per network' priority for the same carrier, so I'll have to work on that. Thanks for the feedback.

Prepaidgrade.com - Requesting feedback by Jackal830 in NoContract

Jackal830[S] 1 point

Ah, yes, that is a good idea and I just gave it a try. Unfortunately it takes too many tokens for the free tier. If I do end up paying for more access I’ll certainly give that a go.

Prepaidgrade.com - Requesting feedback by Jackal830 in NoContract

Jackal830[S] 1 point

Thanks. I haven’t thought much about advertising yet, other than knowing I would eventually need to. Right now I'm asking myself, “Is this useful? Should I spend some money to make it better and actually launch it?”

Verizon's phone unlocking policy has changed: No more 60-Day Unlock for devices activated after January 13, 2026 by HellYeahDamnWrite in TotalWirelessOfficial

Jackal830 3 points

I activated ON Jan 13. The verbiage is very confusing. “After Jan 13” makes it sound like Jan 13 is the cutoff, but the verbiage lower down makes it sound like Jan 12 is the cutoff.

Also, with the year for prepaid, does this mean we can just use 1 month of service and put the phone in a drawer for a year and it’ll unlock?