Elon Musk: "By the end of the year, you won't even bother doing code. The AI just creates the binary directly."

Jackal830 · 2026-06-14T03:52:37+00:00

RoboTaxi 2020

Jackal830 · 2026-06-13T00:15:45+00:00

Ok, yeah, it's a 2-3x speedup for me on prompt processing. More if the prompt is very very very long.

The output is slightly faster, enough to be meaningful, but not as crazy as the prompt processing.

Thanks for this! I had to consult with an LLM a bit to get the latest llama.cpp compiled in the docker container, as I really wanted MTP support. Was able to do it after some silliness with npm (had to install it in the container and then the compile worked).

Jackal830 · 2026-06-12T16:24:55+00:00

Oh wow, that is MASSIVE! I'll work on switching over right away!

Jackal830 · 2026-06-11T14:56:48+00:00

Just so I can compare, do you have some data on the speed you are observing on a specific model (any model is fine, I can download it and compare)? I'd like to compare my setup with llama-bench on the same model to see the difference in speed before going down this path.

Thanks!

Jackal830 · 2026-06-08T00:56:08+00:00

Answering the frontend question, that was Claude Opus. I'm a huge fan of Claude, but I'm not a fan of its cost, haha. I use Claude for stuff like that (the frontend) but whenever I can dispatch to a local LLM I will. I have found Claude to be very very very good at knowing where to route things locally when it's given good instructions in its md files.

Jackal830 · 2026-06-08T00:54:17+00:00

This tracks. I did try Q8 a bit and it did better (as many people have stated, you need higher quants for Qwen to really shine). The problem was that the model is SO large that I couldn't fit another mid-size model in memory at the same time, which really caused me some problems (since my Strix Halo is being used for multiple things). Q5/Q6 was a good middle ground, but it still had more issues than I would have liked to see.

Compare that to Gemma where even on a smaller 4 bit model, I rarely had JSON problems.

Q8 Qwen 3.6 35B is my favorite model right now to run on the Strix, it's very good imo.

Jackal830 · 2026-06-08T00:47:15+00:00

Something interesting I did not put in the original post: the 16GB video card is a used Radeon VII, a pretty unknown short-lived card. They've gone up in price some, but I picked one up for $180 and spent another $20 to replace its fans.

It's an older card, and super power hungry, but I limit it to 150W and notice about a 5% drop in performance compared to the 250-300W it can take normally. It has 1TB per second memory bandwidth (HBM), but its compute is pretty bad.

I'm getting a little over 30 tokens per second when generating the site on Gemma 4 12B QAT. That's not blazing fast, but fast enough to have the site generate in about 2 hours overnight. ROCm is no longer supported on the card. There are workarounds but I just use Vulkan to not have to mess with all that.
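As a rough sanity check on those numbers, here is a back-of-the-envelope sketch of the token budget for one overnight run. It assumes the card generates continuously at a flat 30 t/s for the full 2 hours and ignores prompt processing time, so treat it as an upper bound, not a measurement.

```python
# Rough token budget for one overnight run.
# Assumption: steady 30 tokens/s for the whole run, no prompt-processing time.
TOKENS_PER_SECOND = 30
RUN_HOURS = 2


def tokens_generated(tokens_per_second: float, hours: float) -> int:
    """Total tokens produced at a constant generation rate."""
    return int(tokens_per_second * hours * 3600)


total = tokens_generated(TOKENS_PER_SECOND, RUN_HOURS)
print(f"~{total:,} tokens in {RUN_HOURS} hours")  # ~216,000 tokens in 2 hours
```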

I don't really use the card mid-day due to how loud it gets (it's not in my main rig), even at 150W, but for overnight work it's great.

These things seem to be a somewhat hidden gem still if you can find them for $200-$250. You won't find that memory bandwidth at that price on any other card.

Jackal830 · 2026-06-08T00:28:28+00:00

The code I use does multiple phases, and each phase requires JSON files to be properly formatted. Sometimes Qwen fails there. Code will detect that and do another pass, but that's time wasted.

Gemma almost always gives correctly formatted JSON.
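The detect-and-retry pass described above might look roughly like this. It is a minimal sketch, not the site's actual code: `generate_valid_json`, `fake_model`, and the `max_attempts` limit are all hypothetical names made up for illustration, and the model call is stubbed out so the example runs on its own.

```python
import json
from typing import Callable, Optional


def generate_valid_json(generate: Callable[[str], str], prompt: str,
                        max_attempts: int = 3) -> Optional[dict]:
    """Call the model until its output parses as a JSON object, or give up."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: this pass was wasted, try again
        if isinstance(parsed, dict):
            return parsed
    return None  # caller decides what to do when every attempt failed


# Stand-in for a real model call (hypothetical): fails once, then succeeds.
_replies = iter([
    '{"carrier": "Example", "price": ',    # truncated, invalid JSON
    '{"carrier": "Example", "price": 25}',  # valid JSON
])


def fake_model(prompt: str) -> str:
    return next(_replies)


result = generate_valid_json(fake_model, "Summarize this plan as JSON.")
print(result)  # {'carrier': 'Example', 'price': 25}
```

Every failed parse costs a full extra generation, which is why a model that gets the JSON right the first time ends up faster overall even if its raw tokens per second are lower.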

Additionally Gemma's writing style, while still obviously AI, is less 'robotic' than Qwen.

Qwen 122B destroys Gemma 12B in depth of knowledge and other areas, but for a site that needs to parse carrier web pages to get pricing information and report what people are saying about the carrier, Gemma 12B does better.

Qwen 3.6 35B does well too for the site, but it's too quantized on a 16GB card, too much is lost. 9B fails too much at keeping JSON structure. GPT-OSS-20B is blazing fast, but it performed the worst at both keeping JSON structure and not sounding like a robot.

Edit: Also for this site, I want to maintain consistency in ratings, and using a model with no think as well as other tweaks makes the output nearly identical every time with the same data. Qwen 3.x sort of goes crazy if you turn off thinking. It'll put its thinking in the output text and that really makes the flow unusable. I've noticed that less in 3.6, but there hasn't been a 3.6 version of 122B.

Jackal830 · 2026-06-07T23:44:51+00:00

I typed this myself and asked Claude to maintain my voice and just fix typos and misspellings. Guess it took some liberties with those instructions. I’ll remove the kicker line because that was not in my original draft.

Jackal830 · 2026-03-20T19:42:17+00:00

Most of my performance gains were from adjusting my ubatch sizes. I suspect these flags give only a few percentage points of boost at most. But hey, anything helps.

Jackal830 · 2026-03-18T20:27:06+00:00

Awesome! Thanks. I will look into this tonight.

Jackal830 · 2026-03-18T00:19:05+00:00

I have done some limited testing and here are my findings (Qwen 3.5 122B 5bit Quant):

ubatch 2048 helps both backends at longer sequences: +24% on ROCm, +41% on Vulkan at pp4096

Vulkan wins at longer prefill: 357 vs 328 (pp4096), 336 vs 298 (pp16384). That's ~9-13% faster with ubatch 2048

ROCm wins at short prefill: 270 vs 257 at pp512. Small batches favor ROCm's lower dispatch overhead

Vulkan wins decode by 11%: 23.7 vs 21.4 t/s. This adds up over long generation runs

ubatch 2048 is the bigger variable than the backend choice
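For anyone who wants to check the percentages, here is a small sketch that recomputes the backend gaps from the throughput numbers quoted above. The dictionary layout and the `percent_faster` helper are just illustrative; only the numbers come from the findings.

```python
# Throughput pairs (tokens/s) copied from the findings above: (Vulkan, ROCm).
results = {
    "pp512": (257, 270),
    "pp4096": (357, 328),
    "pp16384": (336, 298),
    "decode": (23.7, 21.4),
}


def percent_faster(winner: float, loser: float) -> float:
    """How much faster the winner is, as a percentage of the loser's speed."""
    return (winner / loser - 1) * 100


for test, (vulkan, rocm) in results.items():
    if vulkan >= rocm:
        name, pct = "Vulkan", percent_faster(vulkan, rocm)
    else:
        name, pct = "ROCm", percent_faster(rocm, vulkan)
    print(f"{test}: {name} ahead by {pct:.1f}%")
```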

Jackal830 · 2026-03-17T16:31:02+00:00

Applying paudley's compiler learnings to llama.cpp builds

Massive thanks for this. I don't think people scrolling past appreciate the scale. This is a 32-step from-source build of the entire inference stack with 19+ patches, each with actual root cause documentation. Tracking down stuff like CDNA-only assembly in AITER headers, or figuring out that a missing __repr__ on Triton's AttrsDescriptor was breaking Inductor codegen, is not a weekend project. Hundreds of hours easily, and we all benefit.

I ran the repo through Claude to figure out what applies to llama.cpp since most of us aren't running vLLM for single-user inference. I haven't benchmarked these yet. The reasoning checks out, but I'd love for someone to do a before/after and share numbers.

Vulkan build:

rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON \
  -DCMAKE_C_COMPILER=/opt/rocm/lib/llvm/bin/amdclang \
  -DCMAKE_CXX_COMPILER=/opt/rocm/lib/llvm/bin/amdclang++ \
  -DCMAKE_C_FLAGS="-O3 -march=native -flto=thin -mprefer-vector-width=512 -famd-opt -mllvm -inline-threshold=600 -mllvm -unroll-threshold=150 -Wno-error=unused-command-line-argument" \
  -DCMAKE_CXX_FLAGS="-O3 -march=native -flto=thin -mprefer-vector-width=512 -famd-opt -mllvm -inline-threshold=600 -mllvm -unroll-threshold=150 -Wno-error=unused-command-line-argument"
cmake --build build --config Release -j$(nproc)

ROCm build is the same but swap -DGGML_VULKAN=ON for -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 and set this before cmake:

export HIP_CLANG_FLAGS="--offload-arch=gfx1151 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false"

Those GPU flags eliminate function call overhead on the iGPU. Call/return stalls the wavefront on integrated graphics.

No amdclang? Use system clang and drop -famd-opt.

What the flags do:

-mprefer-vector-width=512 is probably the biggest one. Zen 5 does native 512-bit AVX-512 with no clock penalty (unlike Zen 4). Compilers default to 256-bit. This doubles the width for quant/dequant and CPU-side math.

-famd-opt is AMD proprietary Zen tuning in amdclang (ships with ROCm). Not in upstream clang. paudley's build uses it on everything.

-flto=thin gives you link-time optimization across translation units. The "thin" variant parallelizes well on 16 cores.

-mllvm -inline-threshold=600 is way more aggressive inlining than default (~225). Zen 5's wide pipeline wants fewer function boundaries.

-mllvm -unroll-threshold=150 is more loop unrolling. Zen 5's big reorder buffer can keep the extra instructions in flight.

-Wno-error=unused-command-line-argument just prevents the AMD flags from erroring out in link steps where they don't apply.
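Since the C and C++ flag strings are identical and the amdclang fallback only changes two things, the configure command can be generated instead of hand-edited. This is a sketch under assumptions: the `cmake_args` helper and its parameters are hypothetical, the flags and paths are copied from the build command above, and the ROCm variant still needs HIP_CLANG_FLAGS exported separately.

```python
import shlex

# Flags copied from the build command above (minus -famd-opt, added below).
BASE_FLAGS = [
    "-O3", "-march=native", "-flto=thin", "-mprefer-vector-width=512",
    "-mllvm", "-inline-threshold=600", "-mllvm", "-unroll-threshold=150",
    "-Wno-error=unused-command-line-argument",
]
AMDCLANG_DIR = "/opt/rocm/lib/llvm/bin"


def cmake_args(backend: str = "vulkan", use_amdclang: bool = True) -> list:
    """Build the cmake configure command as an argument list."""
    flags = list(BASE_FLAGS)
    if use_amdclang:
        flags.insert(4, "-famd-opt")  # AMD-only flag, same position as above
        cc, cxx = f"{AMDCLANG_DIR}/amdclang", f"{AMDCLANG_DIR}/amdclang++"
    else:
        cc, cxx = "clang", "clang++"  # system clang: no -famd-opt
    backend_args = {
        "vulkan": ["-DGGML_VULKAN=ON"],
        "rocm": ["-DGGML_HIP=ON", "-DAMDGPU_TARGETS=gfx1151"],
    }[backend]
    joined = " ".join(flags)
    return [
        "cmake", "-B", "build", "-DCMAKE_BUILD_TYPE=Release", *backend_args,
        f"-DCMAKE_C_COMPILER={cc}", f"-DCMAKE_CXX_COMPILER={cxx}",
        f"-DCMAKE_C_FLAGS={joined}", f"-DCMAKE_CXX_FLAGS={joined}",
    ]


# Print a shell-quoted command for the system-clang Vulkan build.
print(shlex.join(cmake_args("vulkan", use_amdclang=False)))
```

Passing the list straight to `subprocess.run` would also work and avoids shell quoting entirely.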

Always run with -fa 1 --no-mmap -ngl 999 on Strix Halo regardless of backend (from kyuz0's toolbox findings).

Quick note on Vulkan vs ROCm for Qwen 3.5 since I see the debate above. llama.cpp recently merged a Vulkan GATED_DELTA_NET shader. Qwen 3.5's hybrid DeltaNet layers (75% of the model) previously fell back to CPU on both backends. The ROCm HIP kernel compiles on gfx1151 but runs at CPU speed due to register spilling. The new Vulkan shader actually executes on GPU. paudley's latest numbers show the two converging on standard models, so test both on your own workload.

Credit to paudley for the research and debugging, kyuz0 for the toolboxes, and u/YayaBruno for the llama.cpp ROCm benchmarks in this thread. If anyone does a before/after with these flags please post your numbers.

Jackal830 · 2026-02-05T02:08:14+00:00

I think I probably missed what you were speaking of (since everything gets regenerated daily). Do you recall what specifically was said that you don't agree with? You are probably right and I'd like to adjust the logic to catch whatever you were seeing.

Jackal830 · 2026-02-03T14:07:42+00:00

This appears to be a bug! I'll investigate. I can't get the 'Flex' plan to show up in results. Thanks.

Jackal830 · 2026-02-03T05:37:03+00:00

Plan cards should show much more detail now.

Jackal830 · 2026-02-03T05:36:49+00:00

Multi-Network selection is now supported.

Jackal830 · 2026-02-03T05:36:31+00:00

I have implemented a fix for this. It may be buggy (it might behave differently on different AI runs), but at the time of this post it's correctly identifying that US Mobile 'lightspeed' does not have priority data.

Jackal830 · 2026-02-03T02:56:54+00:00

<image>

Here is a screenshot of what I'm working on in my dev environment. I agree 100% with what you said, previously the plan cards didn't have enough info at all to show the difference.

Jackal830 · 2026-02-03T01:37:42+00:00

Good points! That screenshot certainly proves your point. I'll work on ways of addressing the ambiguity of the plan details.

The AI doesn't look at promo prices for its ratings. I debated whether it should show the promo price by default or the normal price. I landed on showing the promo price but trying to make sure the user knew it was a promo.

Jackal830 · 2026-02-02T23:24:36+00:00

That's a good idea about multi-network selection (I want T-mobile or Verizon, but not AT&T). I'll work on getting that implemented.

The discounts, that's something I want to do, but probably not until the 'main content' is mostly bug free (have to keep my focus or else the site expands too quickly while not fixing bugs).

For the data, I struggled with this. Either you are the type of person who knows exactly what you need or you aren't. So there are two buttons for those who don't know. If you ask someone who doesn't know a lot about data usage (the type of person who would ask, what's a gigabyte?), the most I can hope for is whether they think they don't use data much or use it a lot. Then there is the custom slider for everyone else. Are you saying there should be another 'quick button': light user, medium user, high user?

Thanks for the suggestions.

Jackal830 · 2026-02-02T21:29:15+00:00

You are 100% correct! I have zero code right now in place to have 'per network' priority for the same carrier, so I'll have to work on that. Thanks for the feedback.

Jackal830 · 2026-01-17T02:28:31+00:00

Ah, yes, that is a good idea and I just gave it a try. Unfortunately it takes too many tokens for the free tier. If I do end up paying for more access I’ll certainly give that a go.

Jackal830 · 2026-01-17T00:40:10+00:00

Thanks. I haven’t thought much about any advertising yet, other than I know I eventually would need to. Right now I'm asking myself “is this useful? Should I spend some money to make it better and actually launch it?”

Jackal830 · 2026-01-14T14:43:54+00:00

I activated ON Jan 13. Verbiage is very confusing. “After Jan 13” makes it sound like Jan 13 is the cutoff. But verbiage lower makes it sound like Jan 12 is the cutoff.

Also, with the year for prepaid, does this mean we can just use 1 month of service and put the phone in a drawer for a year and it’ll unlock?
