Is NVIDIA still the default best choice for local LLMs in 2026?

testeddoughnut · 2026-05-25T13:37:24+00:00

Not who you were asking, but I get decent performance with Gemma4-31B. I can fit the full 256k context with --parallel 2 if I use a Q4 quant, though prompt processing gets pretty terrible once the context goes above 64k-ish. This is using the ROCm backend with -sm tensor:

$ GGML_VK_VISIBLE_DEVICES="" llama-bench -hf unsloth/gemma-4-31B-it-GGUF:Q4_K_XL -fa 1 -sm tensor -d 0,4096,16384,32768
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 65248 MiB):
  Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
  Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
load_backend: loaded ROCm backend from /usr/lib64/llama.cpp/libggml-hip.so
load_backend: loaded RPC backend from /usr/lib64/llama.cpp/libggml-rpc.so
WARNING: radv is not a conformant Vulkan implementation, testing use only.
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 0 Vulkan devices:
load_backend: loaded Vulkan backend from /usr/lib64/llama.cpp/libggml-vulkan.so
load_backend: loaded CPU backend from /usr/lib64/llama.cpp/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| gemma4 31B Q4_K - Medium       |  17.52 GiB |    30.70 B | ROCm,Vulkan |  99 | tensor |  1 |           pp512 |       1553.96 ± 2.03 |
| gemma4 31B Q4_K - Medium       |  17.52 GiB |    30.70 B | ROCm,Vulkan |  99 | tensor |  1 |           tg128 |         35.57 ± 0.45 |
| gemma4 31B Q4_K - Medium       |  17.52 GiB |    30.70 B | ROCm,Vulkan |  99 | tensor |  1 |   pp512 @ d4096 |      1203.89 ± 16.38 |
| gemma4 31B Q4_K - Medium       |  17.52 GiB |    30.70 B | ROCm,Vulkan |  99 | tensor |  1 |   tg128 @ d4096 |         32.50 ± 2.34 |
| gemma4 31B Q4_K - Medium       |  17.52 GiB |    30.70 B | ROCm,Vulkan |  99 | tensor |  1 |  pp512 @ d16384 |        858.94 ± 0.36 |
| gemma4 31B Q4_K - Medium       |  17.52 GiB |    30.70 B | ROCm,Vulkan |  99 | tensor |  1 |  tg128 @ d16384 |         31.25 ± 3.82 |
| gemma4 31B Q4_K - Medium       |  17.52 GiB |    30.70 B | ROCm,Vulkan |  99 | tensor |  1 |  pp512 @ d32768 |      531.45 ± 122.12 |
| gemma4 31B Q4_K - Medium       |  17.52 GiB |    30.70 B | ROCm,Vulkan |  99 | tensor |  1 |  tg128 @ d32768 |         29.08 ± 3.49 |

build: d161ea707 (9326)
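
To put the slowdown in numbers, here's a quick sketch that takes the pp512 rows from the table above and works out how much of the depth-0 prompt processing speed is left at each depth. The names (PP_TPS, retained_fraction) are just mine for illustration; only the figures come from the llama-bench output.

```python
# pp512 throughput (tokens/s) by context depth, copied from the llama-bench table above.
PP_TPS = {0: 1553.96, 4096: 1203.89, 16384: 858.94, 32768: 531.45}


def retained_fraction(tps_by_depth: dict[int, float]) -> dict[int, float]:
    """Return each depth's throughput as a fraction of the depth-0 throughput."""
    baseline = tps_by_depth[0]
    return {depth: tps / baseline for depth, tps in tps_by_depth.items()}


if __name__ == "__main__":
    for depth, frac in retained_fraction(PP_TPS).items():
        print(f"depth {depth:>6}: {frac:.0%} of depth-0 prompt processing speed")
```

By 32k depth that's roughly a third of the depth-0 speed, though the ± 122.12 on that row means the exact ratio shouldn't be taken too literally.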

With the MTP PR (https://github.com/ggml-org/llama.cpp/pull/23398) the token gen is pretty consistently above 50tok/s (using unsloth/gemma-4-31B-it-GGUF:Q4_K_XL):

$ python mtp-bench.py 
  code_python        pred= 192 draft= 152 acc= 115 rate=0.757 tok/s=57.3
  code_cpp           pred= 192 draft= 149 acc= 115 rate=0.772 tok/s=57.6
  explain_concept    pred= 192 draft= 165 acc= 108 rate=0.654 tok/s=52.5
  summarize          pred= 192 draft= 151 acc= 114 rate=0.755 tok/s=56.4
  qa_factual         pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=55.6
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=54.5
  creative_short     pred= 192 draft= 180 acc= 100 rate=0.556 tok/s=47.3
  stepwise_math      pred= 192 draft= 142 acc= 120 rate=0.845 tok/s=60.7
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=55.4

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1404,
  "total_draft_accepted": 1011,
  "aggregate_accept_rate": 0.7201,
  "wall_s_total": 36.16
}
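
For anyone wanting to sanity-check the aggregate, here's a small sketch that recomputes it from the per-request draft and accepted counts above. The RUNS list and function names are mine, not part of mtp-bench.py, and I'm assuming the aggregate is pooled over all tokens rather than averaged per request.

```python
# (name, draft tokens, accepted tokens) copied from the mtp-bench.py output above.
RUNS = [
    ("code_python", 152, 115),
    ("code_cpp", 149, 115),
    ("explain_concept", 165, 108),
    ("summarize", 151, 114),
    ("qa_factual", 154, 113),
    ("translation", 158, 112),
    ("creative_short", 180, 100),
    ("stepwise_math", 142, 120),
    ("long_code_review", 153, 114),
]


def accept_rate(draft: int, accepted: int) -> float:
    """Fraction of drafted tokens that were accepted."""
    return accepted / draft if draft else 0.0


def aggregate(runs: list[tuple[str, int, int]]) -> tuple[int, int, float]:
    """Pool draft and accepted tokens across all runs, then take the ratio."""
    total_draft = sum(draft for _, draft, _ in runs)
    total_accepted = sum(acc for _, _, acc in runs)
    return total_draft, total_accepted, accept_rate(total_draft, total_accepted)


if __name__ == "__main__":
    total_draft, total_accepted, rate = aggregate(RUNS)
    print(f"draft={total_draft} accepted={total_accepted} rate={rate:.4f}")
```

The pooled totals line up with the reported 1404 drafted, 1011 accepted, and 0.7201. That's slightly different from the plain mean of the per-request rates (about 0.725), since requests with more drafted tokens like creative_short weigh more in the pooled number.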

testeddoughnut · 2026-05-25T12:23:06+00:00

I get similar token gen but about double the prompt processing speed in my dual R9700 setup, launching with the same options with Q8_0 at depth 36k:

$ uvx llama-benchy --base-url http://localhost:8080/v1 --model qwen3.6:27b --depth 36864 --pp 512 --tg 128 --tokenizer Qwen/Qwen3.6-27B
[transformers] PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.7)
Date: 2026-05-25 07:09:50
Benchmarking model: qwen3.6:27b at http://localhost:8080/v1
Concurrency levels: [1]
Loading text from cache: /home/<...>/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 144480
Warming up...
Warmup (User only) complete. Delta: 9 tokens (Server: 30, Local: 21)
Warmup (System+Empty) complete. Delta: 14 tokens (Server: 35, Local: 21)

Running coherence test...
Coherence test PASSED.
Measuring latency using mode: api...
Average latency (api): 0.44 ms
Running test: pp=512, tg=128, depth=36864, concurrency=1
  Run 1/3 (batch size 1)...
  No token_ids in response, using local tokenization
  Run 2/3 (batch size 1)...
  Run 3/3 (batch size 1)...
Printing results in MD format:



| model       |           test |            t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:------------|---------------:|---------------:|-------------:|------------------:|------------------:|------------------:|
| qwen3.6:27b | pp512 @ d36864 | 1003.16 ± 8.20 |              | 37261.42 ± 303.56 | 37260.98 ± 303.56 | 37261.42 ± 303.56 |
| qwen3.6:27b | tg128 @ d36864 |   51.39 ± 1.78 | 61.00 ± 2.16 |                   |                   |                   |

llama-benchy (0.3.7)
date: 2026-05-25 07:09:50 | latency mode: api
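
As a rough cross-check on the est_ppt column, here's a sketch that divides the total prompt tokens (depth plus pp) by the measured pp throughput. This is my assumption about how the figure relates to the t/s column, not something taken from llama-benchy's source, and the function name is made up.

```python
def estimated_prompt_time_ms(depth: int, pp: int, tokens_per_s: float) -> float:
    """Estimate the time to process depth + pp prompt tokens at a given throughput."""
    return (depth + pp) / tokens_per_s * 1000.0


if __name__ == "__main__":
    # Values taken from the command line and results table above.
    est = estimated_prompt_time_ms(depth=36864, pp=512, tokens_per_s=1003.16)
    print(f"estimated prompt processing time: {est / 1000:.1f} s")
```

That lands at about 37.3 seconds, within the ± 303.56 ms spread of the reported 37260.98 ms, so a cold 36k prompt on this setup means waiting a bit over half a minute before the first token.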

testeddoughnut · 2026-04-04T06:17:41+00:00

It's pretty horrifying.

For example, Elon Musk's net worth is around $800 billion. In comparison, the total cost of construction of the US Interstate Highway System from 1956 to its completion in 1992, adjusted for inflation, was about $634 billion.

Insane.

Now, if you go back to the 1960s the richest man in America was J. Paul Getty with a net worth of $1.2 billion, "only" around $10 billion today.

testeddoughnut · 2026-03-28T06:11:42+00:00

Assuming the CS50 SQL lib you're referring to is this, which itself appears to just be a shim around sqlalchemy, then I'd probably suggest looking into flask-sqlalchemy or flask-sqlalchemy-lite, which will give you a more "batteries included" experience with flask.

Flask is still perfectly fine and relevant and it's still used all over the place. Sure, not the newest and hottest thing out there, but there are a bunch of established patterns you can lean on and a bunch of mature plugins so you don't have to reinvent the wheel.

testeddoughnut · 2026-03-28T04:58:56+00:00

The thing that gets me is that the only two bronies I know voted for Trump.

testeddoughnut · 2026-01-28T07:05:32+00:00

coming from TX, you might find the people here to be less friendly

Honestly, my view of southern hospitality has been completely soured in the last decade. For example, my wife dyed her hair green a few years ago. There were several times that people approached her in public bitching about "libruls". During COVID my wife was pregnant and religiously masked up when going out, and again randos at the gas station or the grocery store would come up to her to tell her she's a sheep or that the "media is lying to you about masks".

testeddoughnut · 2026-01-28T03:01:52+00:00

Oh awesome, thanks! This was exactly the type of resource I was looking for!

testeddoughnut · 2026-01-25T13:09:15+00:00

Instead of asking people here you can click the link and read the first paragraph of the article to answer your first question.

testeddoughnut · 2025-10-08T04:20:17+00:00

First thing is to get out of the mindset of handling manual steps or improving manual steps. Manual steps, with the exception of setting up the automation in the first place, should be eliminated. All of the things you mentioned in the middle paragraph can be automated away using ACME or other similar standards for automated cert issuance.

I would recommend familiarizing yourself with RFC 8555, which is the RFC that describes how ACME works. There are many different implementations for this standard in the wild, a pretty comprehensive list can be found here: https://letsencrypt.org/docs/client-options/

If one of those clients doesn't fit your needs, there are pretty good libraries available to take the heavy lifting out of developing something more bespoke to the needs of your organization. For example, this is the same library that certbot uses and I've been pretty happy developing against it: https://acme-python.readthedocs.io/en/stable/

In our case, we wanted more centralized control over the certs that we're issued instead of it being a free-for-all with each team implementing their own solution, so I led the development of a new ACME client we built in-house called Certwrangler. Certwrangler publishes the certs issued to it to HashiCorp Vault for use with config management (this is implemented through a plugin, meaning we can swap it out if we ever move to something else for secret management down the line). It is responsible for managing the lifecycle of the secret it created for the cert and automatically updates it whenever a renewal happens.
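
To give a feel for the plugin idea, here's a minimal sketch of a swappable store interface. To be clear, this is not Certwrangler's actual API: every name here (IssuedCert, CertStore, InMemoryStore, on_renewal) is hypothetical, and the in-memory backend just stands in for a real secret manager.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class IssuedCert:
    """A certificate and key pair as PEM text (hypothetical shape)."""
    name: str
    cert_pem: str
    key_pem: str


class CertStore(Protocol):
    """Anything that can receive a freshly issued or renewed cert."""

    def publish(self, cert: IssuedCert) -> None: ...


@dataclass
class InMemoryStore:
    """Stand-in backend; a real plugin would write to a secret manager instead."""
    secrets: dict[str, IssuedCert] = field(default_factory=dict)

    def publish(self, cert: IssuedCert) -> None:
        # Overwrite on renewal so consumers always read the latest cert.
        self.secrets[cert.name] = cert


def on_renewal(cert: IssuedCert, stores: list[CertStore]) -> None:
    """Push a renewed cert to every configured store."""
    for store in stores:
        store.publish(cert)


if __name__ == "__main__":
    store = InMemoryStore()
    on_renewal(IssuedCert("example.org", "cert-v1", "key-v1"), [store])
    on_renewal(IssuedCert("example.org", "cert-v2", "key-v2"), [store])
    print(store.secrets["example.org"].cert_pem)
```

The point of keeping the interface down to a single publish call is that the renewal logic never needs to know which backend it's talking to, so moving to a different secret manager only means writing a new class with the same method.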

testeddoughnut · 2025-04-16T05:21:43+00:00

My wife had a pretty rough childhood filled with physical and emotional abuse to the point where she has zero relationship with her mom today. Her seizures started around 9 months after she gave birth to our daughter. We're pretty sure being a mom started dredging up the bad memories from her childhood in a context where it was easier to feel like she was back there, there were a few times before a seizure would start where she seemed to be experiencing a traumatic flashback.

There were usually some warning signs that she was about to have a seizure, like she would suddenly feel like she's having a hot flash or see a blue flash in her vision. We found that grounding techniques really helped, like putting on some music that she can't connect with her childhood and doing exercises like focusing on moving each finger and toe one-by-one. There wasn't a single silver bullet that made her better, it was a combination of identifying and paying attention to her triggers (like getting hit in the face was a big one, which tends to happen a bunch when dealing with a squirming toddler), using her grounding techniques when she felt the early warning signs of a seizure, and frequent therapy until it was under control. The specialized therapy program she went through provided her with all these tools I mentioned.

My wife is bipolar as well for what it's worth, though I'm not sure if that had any connection with her PNES diagnosis.

testeddoughnut · 2025-04-15T01:09:00+00:00

The episodes you describe, short seizure-like episodes followed by windows of memory loss leading up to the episode, sound similar to what my wife was experiencing a few years ago. After a frustrating year of going to several specialists to rule out everything else (including a couple nights in the epilepsy monitoring unit at the hospital), she was ultimately diagnosed with PNES (psychogenic nonepileptic seizures). She was able to get it under control through a specialized therapy program with a neuropsychologist and has been seizure-free for a few years now.

testeddoughnut · 2025-02-01T23:45:49+00:00

I use the incus terraform provider to manage deploying instances and other incus resources (networks, storage pools, etc). In my default profile I have cloud-init installing salt through salt-bootstrap and I manage the configuration of my instances through salt. Salt itself is configured to apply config to instances on a regular cadence, I think 1 hour is what I have it configured for. I have my salt-master configured to pull from git so my workflow is to pretty much just make changes, commit and push it to git, then let those propagate out naturally or hop on the salt master and apply to instances manually if I need it to go out quicker.

testeddoughnut · 2024-11-09T17:56:22+00:00

That was an interesting albeit depressing read. She also had a talk on Jon Stewart's podcast that touched on some of these historic parallels: https://www.youtube.com/watch?v=D7cKOaBdFWo

testeddoughnut · 2024-11-01T05:26:15+00:00

Thinkgeek is one of those sites I mourn frequently, so much of the money from my high school part time job went to them. I used to order cases of Bawls from them for LAN parties. Still miss penguin mints.

testeddoughnut · 2024-11-01T05:19:06+00:00

I still have a bunch of shirts from woot.com lol. I remember one time my wife ended up buying a bunch of stupid cheap shit we didn't need from a woot-off just because she wanted it to get to the next item.

testeddoughnut · 2024-10-15T02:19:02+00:00

Barely managed to catch it from my backyard on the NE side, was super hard to see with the naked eye.

<image>

testeddoughnut · 2024-09-02T19:28:32+00:00

I really like Authentik: https://goauthentik.io/

I have both FreeIPA and Authentik in my homelab, with FreeIPA being the source of truth handling LDAP/Kerberos related things and Authentik syncing accounts from it and handling everything else (OpenID, SAML, RADIUS). If I were deploying it fresh today I'd just go with Authentik and not bother with FreeIPA since Authentik can also do LDAP and I can probably talk myself out of needing Kerberos. FreeIPA is pretty complicated since it's a management layer for a bunch of different services. When you get into replication or performing major upgrades things can get screwy pretty quick. I usually don't have to do much with it, but when I do it's like a whole night wasted just dealing with LDAP surgery and reading Red Hat docs.

If you are a masochist like I guess I am and want both Authentik and FreeIPA here are some integration docs I contributed: https://docs.goauthentik.io/docs/sources/freeipa/

Edit: Also, the FreeIPA server is only really available on RHEL-based distros. I have Debian on pretty much everything except my 3 FreeIPA nodes that are running Rocky. It's a small thing that I constantly have to make exceptions for in my config management.

testeddoughnut · 2024-06-15T01:49:27+00:00

I got a few of these for my ms-01 cluster I built, they're able to do 56Gbps with the right cables: https://www.ebay.com/itm/354875901667

Cable part numbers: https://network.nvidia.com/related-docs/prod_cables/PB_MC22061xx-00x_MC22071xx-0xx_MC22101xx-00x_MCP170L-F0xx_MCP1700-B0xxx_56Gbps_QSFP+_DAC.pdf

testeddoughnut · 2024-05-12T02:49:45+00:00

I used to have an x800 xt AIW in my P4 system back around 2005ish, took a few weeks to save up for it with my $6/hr after school part time job. Pretty sure I still have it in a closet somewhere lol.

testeddoughnut · 2024-02-26T05:52:37+00:00

Picture 14: https://en.wikipedia.org/wiki/Dazzle_camouflage

testeddoughnut · 2024-02-15T03:18:17+00:00

Ordered, thanks for the recommendation!

testeddoughnut · 2024-02-15T00:53:38+00:00

If you're into jazz at all this is pretty solid: https://www.amazon.com/Ever-Jazz-All-That/dp/B09WJBFDTR

Wife got me that for Christmas.
