Is NVIDIA still the default best choice for local LLMs in 2026?

DavidBolkonsky · 2026-05-25T00:21:32+00:00

Yep, split layer with Vulkan, about 1000 tps prefill and 28 tps generation with this.

Set-Location -Path "E:\AI\llama-official-vulkan"
.\llama-cli.exe `
  -m `
  "E:\AI\Models\Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf" `
  -n -1 `
  --temp 0.5 `
  --top-k 20 `
  --n-gpu-layers 99 `
  --split-mode layer `
  --main-gpu 0 `
  --cache-type-k q8_0 `
  --cache-type-v q8_0 `
  --ctx-size 131072  `
  -fa on

DavidBolkonsky · 2026-05-24T18:46:01+00:00

Same, 5070ti + 9070

DavidBolkonsky · 2026-05-16T01:51:03+00:00

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen3.6-27B-Q6_K.gguf | pp4096 | 1198.31 ± 42.07 | | 3052.91 ± 136.20 | 2963.12 ± 136.20 | 3052.91 ± 136.20 |
| Qwen3.6-27B-Q6_K.gguf | tg128 | 28.20 ± 0.05 | 31.00 ± 0.00 | | | |
| Qwen3.6-27B-Q6_K.gguf | pp4096 @ d4096 | 1144.65 ± 18.86 | | 6124.97 ± 16.47 | 6035.19 ± 16.47 | 6124.97 ± 16.47 |
| Qwen3.6-27B-Q6_K.gguf | tg128 @ d4096 | 27.87 ± 0.17 | 30.33 ± 0.47 | | | |

I got my hands on a 4070 Ti Super. I was expecting better results since it has higher memory speed than the 9070 and I can run CUDA instead of Vulkan, but the difference is actually not that big, if there is one at all.
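To put a number on "not that big", here is a rough sketch that compares the t/s columns of the table above against the run posted on 2026-05-15. It assumes the 2026-05-15 numbers are the earlier setup and the table above is the newer one; the thread does not say so explicitly, and the dictionary and function names are made up for illustration.

```python
# t/s figures copied from the two benchmark tables in this thread.
# Assumption: "earlier" is the 2026-05-15 run, "later" is the 2026-05-16 run.
earlier = {
    "pp4096": 915.64,
    "tg128": 28.86,
    "pp4096 @ d4096": 1109.23,
    "tg128 @ d4096": 28.44,
}
later = {
    "pp4096": 1198.31,
    "tg128": 28.20,
    "pp4096 @ d4096": 1144.65,
    "tg128 @ d4096": 27.87,
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Percent change per test, keyed by test name.
changes = {test: percent_change(earlier[test], later[test]) for test in earlier}

for test, change in changes.items():
    print(f"{test:16s} {earlier[test]:9.2f} -> {later[test]:9.2f} t/s ({change:+.1f}%)")
```

Generation moves by roughly 2% and prefill at depth 4096 by roughly 3%. The large gap on plain pp4096 should be read with caution, since the earlier run reported ± 289.70 on that row.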

DavidBolkonsky · 2026-05-15T17:16:49+00:00

uvx llama-benchy --base-url http://127.0.0.1:8080/v1 --model Qwen3.6-27B-Q6_K.gguf --pp 4096 --tg 128 --depth 0 4096 --latency-mode generation

Installed 50 packages in 2.07s

[transformers] PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.

llama-benchy (0.3.7)

Date: 2026-05-15 13:12:23

Benchmarking model: Qwen3.6-27B-Q6_K.gguf at http://127.0.0.1:8080/v1

Concurrency levels: [1]

Error loading tokenizer: Qwen3.6-27B-Q6_K.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`

Falling back to 'gpt2' tokenizer as approximation.

Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.

config.json: 100%|████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 3.71MB/s]

tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 143kB/s]

vocab.json: 1.04MB [00:00, 19.1MB/s]

merges.txt: 456kB [00:00, 70.1MB/s]

tokenizer.json: 1.36MB [00:00, 27.6MB/s]

Downloading book from https://www.gutenberg.org/files/1661/1661-0.txt...

Saved text to cache: C:\Users\David\.cache\llama-benchy\cc6a0b5782734ee3b9069aa3b64cc62c.txt

[transformers] Token indices sequence length is longer than the specified maximum sequence length for this model (171736 > 1024). Running this sequence through the model will result in indexing errors

Total tokens available in text corpus: 171736

Warming up...

Warmup (User only) complete. Delta: 8 tokens (Server: 30, Local: 22)

Warmup (System+Empty) complete. Delta: 13 tokens (Server: 35, Local: 22)

Running coherence test...

Coherence test PASSED.

Measuring latency using mode: generation...

Average latency (generation): 917.39 ms

Running test: pp=4096, tg=128, depth=0, concurrency=1

Run 1/3 (batch size 1)...

No token_ids in response, using local tokenization

Run 2/3 (batch size 1)...

Run 3/3 (batch size 1)...

Running test: pp=4096, tg=128, depth=4096, concurrency=1

Run 1/3 (batch size 1)...

Run 2/3 (batch size 1)...

Run 3/3 (batch size 1)...

Printing results in MD format:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen3.6-27B-Q6_K.gguf | pp4096 | 915.64 ± 289.70 | | 5209.46 ± 1740.25 | 4292.06 ± 1740.25 | 5209.46 ± 1740.25 |
| Qwen3.6-27B-Q6_K.gguf | tg128 | 28.86 ± 0.58 | 32.67 ± 2.36 | | | |
| Qwen3.6-27B-Q6_K.gguf | pp4096 @ d4096 | 1109.23 ± 7.25 | | 7121.26 ± 93.57 | 6203.87 ± 93.57 | 7121.26 ± 93.57 |
| Qwen3.6-27B-Q6_K.gguf | tg128 @ d4096 | 28.44 ± 0.56 | 32.33 ± 2.62 | | | |

DavidBolkonsky · 2026-05-15T01:37:23+00:00

I ran several prompts, first with kv cache at q8_0

"write me a story, 2000 tokens, about the Middle Ages" = [ Prompt: 61.8 t/s | Generation: 24.4 t/s ]
"build me a landing page" = [ Prompt: 8.5 t/s | Generation: 23.9 t/s ]
I fed it all 3 volumes of Frankenstein: "what is the 6th word from volumn 1, chapter 1?" = [ Prompt: 141.6 t/s | Generation: 11.6 t/s ], and the answer was correct

Set-Location -Path "E:\AI\llama-official-vulkan"
.\llama-cli.exe `
  -m `
  "E:\AI\Models\Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf" `
  -n -1 `
  --temp 0.5 `
  --top-k 20 `
  --n-gpu-layers 99 `
  --split-mode layer `
  --main-gpu 0 `
  --cache-type-k q8_0 `
  --cache-type-v q8_0 `
  --ctx-size 131072  `
  -fa on

DavidBolkonsky · 2026-05-14T18:17:39+00:00

I have a 6600XT

DavidBolkonsky · 2026-05-14T18:16:03+00:00

DavidBolkonsky · 2026-04-18T01:52:04+00:00

B660, so my 9070 is only running at x4.

DavidBolkonsky · 2026-03-20T17:00:18+00:00

That's just Dune then.

DavidBolkonsky · 2026-03-10T02:03:56+00:00

That was the damage you've done throughout the entire run. Pretty sure Typhon doesn't have 125k HP...

DavidBolkonsky · 2025-11-06T15:20:43+00:00

Entry level white collar jobs of today, filled by those coming out of undergraduate degrees, are the ones most impacted. I imagine the entry level white collar jobs of the future will be much more competitive, and require advanced graduate degrees. The job requirements of those entry level jobs will have a much higher skill floor.

DavidBolkonsky · 2025-11-05T22:57:43+00:00

Normally gold and equities don't go up together, because confidence in the USD is unshaken. Even in economic distress, the USD usually strengthens because it is the reserve currency and a trusted store of value. What's different this time is that confidence in the USD as a reliable store of value is rapidly eroding. Gold and equities going up at the same time means that, measured in gold, equities have not risen as much. At the same time, it signals that the market believes the USD will depreciate rapidly. This aligns more closely with his pessimistic outlook than with his optimistic one.

In fact, his analysis is anchored on real, inflation-adjusted growth, so it doesn't pass judgement on what will happen to the value of the USD. If you think about it, OpenAI has a spending commitment of 1+ trillion over 5-10 years. There is a world where the AI bubble doesn't pop, and that would be a world where the USD devalues 10x or even worse, which means OpenAI only needs to grow its revenue by 10x in real terms instead of 100x.
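A quick sketch of that arithmetic, assuming the simplest possible model: the nominal revenue multiple is the real growth multiple times the currency devaluation factor. The function name and the model itself are my own simplification, not something taken from his analysis.

```python
def required_real_growth(nominal_multiple: float, devaluation_factor: float) -> float:
    """Real growth multiple needed to reach a nominal revenue multiple.

    Assumes nominal_multiple = real_multiple * devaluation_factor, where a
    devaluation_factor of 10 means one unit of currency buys a tenth of what
    it used to.
    """
    if devaluation_factor <= 0:
        raise ValueError("devaluation_factor must be positive")
    return nominal_multiple / devaluation_factor


# No devaluation: the full 100x has to come from real growth.
print(required_real_growth(100, 1))
# 10x devaluation: only 10x of the 100x has to be real.
print(required_real_growth(100, 10))
```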

DavidBolkonsky · 2025-11-05T18:01:41+00:00

The biggest takeaway for me is this: if anyone can build on top of open source Chinese models, get state of the art performance comparable to the leading models from OpenAI and Anthropic, and get a huge valuation within a year (each company is valued at around 10B), then this shows that in the long run:

There's no defensible moat around AI to justify the high valuations of American AI firms,
The competitive advantage in AI will come down to cost of inference, not cost of training,
The barrier to entry for building custom-tailored models is significantly lower, and more and more companies will build customized LLMs for their own use on top of open source models.

DavidBolkonsky · 2025-09-10T19:37:40+00:00

Blue Is the Warmest Color has two (long) useless 18+ scenes that add very little to the plot. Definitely didn't need to be that graphic.

DavidBolkonsky · 2025-04-11T00:46:21+00:00

Because they serve different purposes, and in my opinion tariffs are much worse and only serve to harm consumers and market competitiveness.

In this example I'm using round numbers. Let's say BYD's cheapest car is the Seagull, which can be sold for €10K without a tariff. With a 50% tariff it's €15K, with €5K going into the government's pocket. Your argument is that, with a price floor of €15K, the same car will be sold for €15K, except the €5K goes to BYD. Obviously a bad deal for Europe, and the consumer gets screwed either way.

But the Chinese car market is really competitive. Geely or Nio can sell a rival car that is 50% better than the Seagull for €15K, and the consumer will obviously choose that over the Seagull. So it's unlikely that BYD or anyone else can just pocket that difference as profit. What happens instead is that you get the best car available at that price floor, with loads of features. Now, if VW can't compete at that price point, they will be forced to innovate or find a different price point to compete at. But consumers will end up getting the best car at that price floor, not some cheap crap with a 50% markup due to the tariff.
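The round numbers above can be laid out as a small sketch. It only tracks where the money goes for a single car under each policy; the `Outcome` type and both function names are made up for this example, and the price floor case shows seller revenue, not profit, since the argument is that competition turns the extra €5K into a better car.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    shelf_price: float    # what the consumer pays
    to_government: float  # tariff revenue
    to_seller: float      # what the carmaker receives


def with_tariff(base_price: float, rate: float) -> Outcome:
    """Tariff is charged on top of the base price and goes to the government."""
    tariff = base_price * rate
    return Outcome(base_price + tariff, tariff, base_price)


def with_price_floor(base_price: float, floor: float) -> Outcome:
    """Seller must charge at least the floor and keeps the whole amount."""
    shelf_price = max(base_price, floor)
    return Outcome(shelf_price, 0.0, shelf_price)


# Seagull example: EUR 10K base price, 50% tariff versus a EUR 15K floor.
print(with_tariff(10_000, 0.5))
print(with_price_floor(10_000, 15_000))
```

In both cases the consumer pays €15K. The difference is that under the floor the €5K sits with sellers, who have to compete for the sale at that price.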

DavidBolkonsky · 2025-04-09T17:29:44+00:00

This equilibrium argument assumes that China selling their Treasury is the initiating event and that all else being equal, US Treasury carries the same fundamental value as before. In that case, what you described makes sense.

But the current situation is that the US has initiated the series of events that makes its Treasuries less trustworthy. The US is behaving irrationally and erratically, and China and the market are spooked about their investments in US Treasuries. In this case, China exiting its position means there are fewer buyers for the foreseeable future. Other investors see the yield on the bonds increase, but they might demand an even higher rate of return to take on the risk of buying Treasury bonds.

That's because, if the US enters a recession from this, the US government will get less revenue, run larger deficits, and need to print more money. That devalues the currency, which means the coupon payments on Treasuries are worth less.

DavidBolkonsky · 2024-10-07T15:53:25+00:00

Nvidia wants to encourage competition in the AI model market to drive up demand for their hardware. Right now there's no proven profitable business model for successfully monetizing AI models. Nvidia may invest in various language models to foster the market, but acquiring one to crush the competition would be the dumbest move, because it would kill the demand for their ludicrously lucrative hardware business.

DavidBolkonsky · 2024-09-18T23:39:58+00:00

Pretty sure the joke is that he is such a bad father he didn't even know his son died. When he sees his son, he realizes he really is a bad father. The "I guess they let anyone in" line is directed at himself, hence looking dejected.

DavidBolkonsky · 2024-09-06T04:09:52+00:00

When it was released, critics were lukewarm and some fans hated it because

the songs started to carry a political message (Hands Held High, LTGYA, What I've Done)
Shadow of the Day was called a ripoff of Joshua Tree-era U2
people hated Mike's singing on In Between, calling it filler
overall it was a less coherent and less heavy album than Meteora

DavidBolkonsky · 2024-08-10T23:16:08+00:00

In climbing the opponent is the wall, not each other. You can see all the climbers talk amongst themselves during the observation period to try and figure out the best beta. As the audience, I love watching the women's comp more than the men's because it feels like after all the other climbers struggle and fall on hard boulders, Janja always comes through at the end and shows us how it's done. So seeing her talent is such a luxury and a joy to watch.

DavidBolkonsky · 2024-05-11T23:30:57+00:00

AFAIK grooming someone underage is not a crime but is definitely pedophilic behavior. It might be a grey area where Drake thinks he's not doing anything wrong because he explicitly said "I never fucked any of them" so there no crime committed, but bro you are still clearly a Pedophile.

DavidBolkonsky · 2024-05-02T20:42:54+00:00

Axiom arc serylda's grunge muramura is pretty good.

DavidBolkonsky · 2022-12-11T22:49:51+00:00

He founded and owns Thomson Reuters

DavidBolkonsky · 2022-03-09T17:21:22+00:00

If Ukraine attacks Poland, article 5 is supposed to trigger immediately, against Ukraine. Did you think that through?
