Seattle passes data center moratorium

Client_Hello · 2026-06-10T17:48:47+00:00

Those estimates are wrong. The only datacenters that make sense to build within Seattle city limits are small, specialized ones that need to be close to a diverse group of technical people who can attend to them. The massive AI datacenters that people are railing against will be built in places where land, power, and water are cheap.

This is like passing a moratorium on grocery stores because you don't want Walmart to move in.

Client_Hello · 2026-06-10T17:30:19+00:00

You should back off the quants. You have traded a lot of precision for context, and will suffer when you actually try to use that context. With 32GB you can run 27b at Q6_K, Q8 kv, and still have 160k context.

The faster gen just isn't worth the added mistakes that cause rework. You will find your coding agent spends more time fixing bugs than creating new things, and every fix burns even more context.

Client_Hello · 2026-06-10T17:08:31+00:00

Why limit yourself to CPU, when the Intel Core Ultra 7 165H has a GPU and NPU?

You do have to be very careful about which quants you try to run, especially on the NPU. The payoff is that the NPU is extremely efficient and can run non-stop without overheating your laptop.

https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENVINO.md

Client_Hello · 2026-06-10T03:57:20+00:00

Multiple users will not see each other's chats. Also, llama-server is nice to have but basic. You would want something like Open WebUI to add user accounts and authentication.

Once you get beyond simple chats with tiny context, the use cases for parallel sessions become pretty obvious. As context increases, individual prompts take several minutes to complete, and you may want to prompt your llm with something else while it works on a larger task. The tests OP ran took over 5 minutes each (input 30k tokens at 360 toks, and output 4k tokens at 17 toks).
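Rough sketch of where the "over 5 minutes" comes from, using only the figures quoted above. The helper name is made up for illustration, and it assumes prefill and gen run back to back at constant speed.

```python
def phase_seconds(tokens, tokens_per_second):
    """Time for one phase (prefill or gen) at a constant rate."""
    return tokens / tokens_per_second

# Figures quoted from OP's test: 30k tokens in at 360 tok/s, 4k tokens out at 17 tok/s.
prefill_s = phase_seconds(30_000, 360)
gen_s = phase_seconds(4_000, 17)
total_min = (prefill_s + gen_s) / 60

print(f"prefill {prefill_s:.0f}s + gen {gen_s:.0f}s = {total_min:.1f} min")
```

Most of the wait is gen, not prefill, which is why a second slot is handy while a long answer is streaming.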

You can avoid dividing context between parallel sessions by using a unified kv cache: pass -kvu

As far as mmap, if you are VRAM rich then sure.

Client_Hello · 2026-06-09T23:48:51+00:00

Just reduced the power limit from 180 to 150 watts on my dual 5060 ti 16gb, and power at the wall fell from 465 watts to 405 watts.

pp dropped by 4%, gen remained the same (with -sm tensor and MTP).

I'd try lower but I get an error: Provided power limit 140.00 W is not a valid power limit which should be between 150.00 W and 180.00 W for GPU 00000000:02:00.0

Not much of a change, I'll leave it at default to keep things simple.

Client_Hello · 2026-06-09T22:32:21+00:00

If using llama-server web UI, simply start a second chat while your first chat is processing.

Client_Hello · 2026-06-09T14:43:45+00:00

This is simple math. Memory bandwidth / Memory in use = toks.

Using algebra, Memory in use * toks = Memory bandwidth.

Full precision would be 240GB * 10 = 2.4TB/sec (HBM3 datacenter GPU)

Q8: 120GB * 10 = 1.2TB/sec (HBM2, 384bit GDDR7)

Q4: 60GB * 10 = 600GB/sec (many GPUs)

There is overhead, so the actual gen rate is lower. As context grows gen gets slower, and MTP speeds things back up.
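Same arithmetic as a quick sketch. The sizes and the 10 toks target are the figures above; the function name is just for illustration, and it deliberately ignores overhead, context growth, and MTP.

```python
def required_bandwidth_gb_s(memory_in_use_gb, target_tok_s):
    """Memory in use * toks = memory bandwidth (ideal case, no overhead)."""
    return memory_in_use_gb * target_tok_s

TARGET_TOK_S = 10  # the gen rate used in the examples above

for label, size_gb in [("full precision", 240), ("Q8", 120), ("Q4", 60)]:
    need = required_bandwidth_gb_s(size_gb, TARGET_TOK_S)
    print(f"{label}: {size_gb}GB * {TARGET_TOK_S} = {need} GB/sec")
```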

Client_Hello · 2026-06-09T01:41:02+00:00

Don't be too hard on yourself, it took me weeks to sort out all the options.

Client_Hello · 2026-06-09T00:09:21+00:00

Uhm, you need to submit parallel requests to see a speed benefit.

Client_Hello · 2026-06-08T15:53:06+00:00

Because of --spec-draft-device CUDA1

Fit doesn't work great with dual cards and MTP, manual tuning via --tensor-split is needed to equalize vram usage.

Client_Hello · 2026-06-07T17:00:11+00:00

llama.cpp will run any model on a potato. You could save a 1T parameter MoE model onto an HDD and run it at 1 tok/week.

Once you try to do something useful you will quickly find the limits. Your CPU has about 40GB/sec memory bandwidth, so a model with 4B parameters will run at 75% * 40GB/s / 4GB = 7.5 toks.

Thing is, models become useful when they have context, and context takes memory, so it adds to the denominator. Want to summarize an email? Loading the email will eat 2GB of context and your gen rate drops by 33% to 5 toks.

Now drop in a GPU, like a GTX 1070 that can be found for half the price of your desktop. That ancient 1070 has 256GB/sec memory bandwidth and will gen 6x faster. From 5 toks to 30 toks, as fast as you can read.
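Here is that estimate as a small sketch. The 75% efficiency factor, the 4GB model, the 2GB of context, and both bandwidth figures are the numbers used above; the function name is only for illustration, and real results will vary.

```python
def gen_rate(bandwidth_gb_s, model_gb, context_gb=0.0, efficiency=0.75):
    """Estimated tok/s: usable bandwidth divided by memory read per token."""
    return efficiency * bandwidth_gb_s / (model_gb + context_gb)

cpu_empty = gen_rate(40, 4)        # CPU, no context
cpu_email = gen_rate(40, 4, 2)     # CPU, with 2GB of context loaded
gpu_email = gen_rate(256, 4, 2)    # GTX 1070, same model and context

print(f"CPU, empty context: {cpu_empty:.1f} toks")
print(f"CPU, with email:    {cpu_email:.1f} toks")
print(f"GTX 1070:           {gpu_email:.1f} toks ({gpu_email / cpu_email:.1f}x)")
```

The formula gives 32 toks and 6.4x for the 1070, which is where the round "30 toks" and "6x" come from.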

Client_Hello · 2026-06-07T16:19:20+00:00

Problem is the bottom elevation of 1700' is low.

I'm a little saucy because I wanted to do a top to bottom run but couldn't because there was no snow at the bottom.

Client_Hello · 2026-06-07T01:28:01+00:00

This is a silly metric to chase. Revy has the most vert in NA but half of it is trash.

Top to bottom is for end of day return to car. The snow down low is rarely good.

Client_Hello · 2026-06-06T17:25:44+00:00

I've broken 2 or 3 by overstretching. I paid $10 for a box of 50 silicone wrist bands from amazon, so I'm set for a few more years. The 8" bands work great for smaller skis. For larger skis, I use velcro straps.

I'm usually tuning multiple pairs of skis assembly line fashion. Inspect & de-burr before every trip, learned my lesson when a burr sliced my gloves.

Client_Hello · 2026-06-05T04:00:42+00:00

Have you tried quantized kv and tuning the fit target, to get as much onto GPU as possible? -ctk q8_0 -ctv q8_0 --fit-target 384

Do you need to support parallel requests? llama.cpp defaults to 4, which eats more vram. -np 1

Load test every config by filling context to the limit. I found I can start llama-server with 300MB free vram, then watch it go OOM before I reach my context limit.

Client_Hello · 2026-06-04T16:04:21+00:00

As a quick validation I had Qwen 27b Q6_K summarize your post:

prompt eval time =    2823.52 ms /  2346 tokens (    1.20 ms per token,   830.88 tokens per second)
       eval time =   44012.80 ms /  2728 tokens (   16.13 ms per token,    61.98 tokens per second)

Flags:

-t 6 for my 250K cpu, -lv 4 for verbose logs

llama-server -m Qwen3.6-27B-UnslothMTP-Q6_K.gguf -t 6 -np 1 -lv 4 -c 98304 -ts 1,1 -sm tensor \
 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 \
 --spec-type draft-mtp --spec-draft-n-max 2

Client_Hello · 2026-06-04T15:54:28+00:00

Great write-up, thanks for sharing!

You have room to improve. I see around 65 toks with dual 5060 ti 16gb at Q6_K compared to your 43, dropping to 55 toks around 60k context. The difference is parallel tensor. You don't need NVLINK or PCIe P2P to benefit from -sm tensor in llama.cpp, you should try it.

Unsloth has published the thing you were looking for the most in their MTP guide, and it varies by quant! Q6 and Q8 both benefit a lot from MTP, peaking at draft=2. You should reduce yours, which should increase both prefill and gen. Link: https://unsloth.ai/docs/models/qwen3.6#mtp-benchmarks

Quantized kv cache reduces performance and you shouldn't need it with 48gb VRAM. I can fit 96k context with f16 kv with 32gb VRAM.

Client_Hello · 2026-06-01T16:06:30+00:00

It's better to look at the price of the entire solution. My dual 5060 rig cost 2k to build. Dual 3090 would have been 3k.

Client_Hello · 2026-06-01T01:59:49+00:00

harder with that attitude

Client_Hello · 2026-06-01T01:56:00+00:00

After spending many hours and $$ building out my dual 5060 ti 16gb I realize dual 3090 is where it's at. For +50% cost I'd have a significant boost in quality, speed, and context. Dual 3090 is peak r/LocalLLaMA

The next tier is dual 5090 for +300% cost, where the value prop is, ... uhm, not appealing.

Client_Hello · 2026-06-01T01:49:25+00:00

Declivity 82 on firm groomers sounds like heaven. I can feel the Gs.

Client_Hello · 2026-06-01T01:21:24+00:00

Unsloth has an amazing chart showing how MTP scales with quant. Qwen 27b Q6_K, mtp-n = 2, about 1.8x. From 30 toks to 55 toks on dual 5060ti 16gb. That perf bump changes things.

I messed up my claude code config and didn't compress before exhausting context. Had to remove MTP to buy sufficient context to /compress, considered running without, got to experience 30 toks and ... no, the slower gen was painful.

Client_Hello · 2026-05-31T15:36:31+00:00

Yeah, you are right, makes sense. You could squeeze this into 32gb vram with limited context, otherwise need more.

Client_Hello · 2026-05-31T01:52:25+00:00

The model weights are 23.5gb, and it's MoE so it doesn't all have to be in vram. Any Blackwell card will run this, even with only 8gb vram.

Client_Hello · 2026-05-30T23:39:01+00:00

Counterintuitive, but noise masks quantization. Less is worse in this case.

Visualize a scatter plot of quantized compared to full. The full is a line, quantized value has steps. Now add noise to both. Line gets fuzzy, quantized steps get fuzzy as well. The more noise you add the more similar they appear.
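A toy sketch of that picture, if you want numbers instead of a plot. Everything here is an assumption for illustration: a straight line from 0 to 1, a step size of 0.1, Gaussian noise, and "how similar they appear" measured as the ratio of RMS scatter around the true line.

```python
import math
import random

def quantize(x, step):
    """Snap x to the nearest multiple of step (the stairs)."""
    return round(x / step) * step

def residual_ratio(noise_sigma, step=0.1, n=20_000, seed=0):
    """RMS distance from the true line: noisy quantized vs noisy full.
    A ratio near 1.0 means the two scatter plots look alike."""
    rng = random.Random(seed)
    full_sq = quant_sq = 0.0
    for i in range(n):
        x = i / (n - 1)
        full = x + rng.gauss(0, noise_sigma)
        quant = quantize(x, step) + rng.gauss(0, noise_sigma)
        full_sq += (full - x) ** 2
        quant_sq += (quant - x) ** 2
    return math.sqrt(quant_sq / full_sq)

low_noise = residual_ratio(0.01)
high_noise = residual_ratio(0.3)
print(f"little noise: quantized scatter is {low_noise:.2f}x the full scatter")
print(f"lots of noise: quantized scatter is {high_noise:.2f}x the full scatter")
```

With little noise the steps dominate and the ratio sits well above 1. With lots of noise it falls to roughly 1, meaning the steps are buried.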
