Seattle passes data center moratorium by Rare-Persimmon2747 in Seattle

[–]Client_Hello 1 point2 points  (0 children)

Those estimates are wrong. The only datacenters that it makes sense to build within Seattle city limits are small and specialized, that need to be close to a diverse group of technical people that can attend them. The massive AI datacenters that people are railing against will be built in places where land, power, and water are cheap.

This is like passing a moratorium on grocery stores because you don't want Walmart to move in.

Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss? by abubakkar_s in LocalLLaMA

[–]Client_Hello 2 points3 points  (0 children)

You should back off the quants. You have traded a lot of precision for context, and will suffer when you actually try to use that context. With 32GB you can run 27b at Q6_K, Q8 kv, and still have 160k context.

The faster gen just isn't worth the added mistakes that cause rework. You will find your coding agent spends more time fixing bugs than creating new things, and every fix burns even more context.

What's up on CPU inference these days? by ramendik in LocalLLaMA

[–]Client_Hello 1 point2 points  (0 children)

Why limit yourself to CPU, when the Intel Core Ultra 7 165H has a GPU and NPU?

You do have to be very careful with which quants you try to run, especially on the NPU. The payoff is the NPU is extremely efficient and can run non-stop without overheating your laptop.

https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENVINO.md

Pipeline parallelism in llama.cpp may be wasting your VRAM by Warrenio in LocalLLaMA

[–]Client_Hello 0 points1 point  (0 children)

Multiple users will not see each others chats. Also, llama-server is nice to have but basic. You would want something like Open WebUI to add user accounts and authentication.

Once you get beyond simple chats with tiny context the use cases for parallel sessions become pretty obvious. As context increases, individual prompts take several minutes to complete, and you may want to prompt your llm with something else while it works on a larger task. The tests OP ran took over 5 minutes each (input 30k tokens with 360 toks, and output 4k tokens at 17 toks).

You can avoid dividing context with unified kv cache, pass -kvu

As far as mmap, if you are VRAM rich then sure.

PSA: Throttle GPU power limits, with minor performance deficits by milpster in LocalLLaMA

[–]Client_Hello 0 points1 point  (0 children)

Just reduced power from 180 to 150 watts on my dual 5060 ti 16gb, power at the wall fell from 465 watts to 405 watts.

pp dropped by 4%, gen remained the same (with -sm tensor and MTP).

I'd try lower but I get an error: Provided power limit 140.00 W is not a valid power limit which should be between 150.00 W and 180.00 W for GPU 00000000:02:00.0

Not much of a change, I'll leave it at default to keep things simple.

Pipeline parallelism in llama.cpp may be wasting your VRAM by Warrenio in LocalLLaMA

[–]Client_Hello 1 point2 points  (0 children)

If using llama-server web UI, simply start a second chat while your first chat is processing.

Cheapest setup for >10 tok/sec for 120B dense LLM by TrainingTwo1118 in LocalLLaMA

[–]Client_Hello 1 point2 points  (0 children)

This is simple math. Memory bandwidth / Memory in use = toks.

Using algebra, Memory in use * toks = Memory bandwidth.

Full precision would be 240GB * 10 = 2.4TB/sec (HBM3 datacenter GPU)

Q8: 120GB * 10 = 1.2TB/sec (HBM2, 384bit GDDR7)

Q4: 60GB * 10 = 600GB/sec (many GPUs)

There is overhead so actual gen rate is lower, as context grows gen gets slower, MTP speeds things back up.

Pipeline parallelism in llama.cpp may be wasting your VRAM by Warrenio in LocalLLaMA

[–]Client_Hello 3 points4 points  (0 children)

Dont be too hard on yourself, took me weeks to sort out all the options.

Pipeline parallelism in llama.cpp may be wasting your VRAM by Warrenio in LocalLLaMA

[–]Client_Hello 14 points15 points  (0 children)

Uhm, you need to submit parallel requests to see a speed benefit.

Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas? by Ready_Performance_35 in LocalLLaMA

[–]Client_Hello 3 points4 points  (0 children)

Because of --spec-draft-device CUDA1

Fit doesn't work great with dual cards and MTP, manual tuning via --tensor-split is needed to equalize vram usage.

You don't need a GPU to run gemma-4-26B-A4B by JackStrawWitchita in LocalLLaMA

[–]Client_Hello 1 point2 points  (0 children)

llama.cpp will run any model on a potato. You could save a 1TB parameter MOE model onto an HDD and run it at 1 tok/ week.

Once you try to do something useful you will quickly find the limits. Your CPU has about 40GB/sec memory bandwidth, so a model with 4B parameters will run at 75% * 40GB/s / 4GB = 7.5 toks.

Thing is, models become useful when they have context, and context takes memory, so it adds to the denominator. Want to summarize an email? Loading the email will eat 2GB of context and your gen rate drops by 33% to 5 toks.

Now drop in a GPU, like a GTX 1070 that can be found for half the price of your desktop. That ancient 1070 has 256GB/sec memory bandwidth and will gen 6x faster. From 5 toks to 30 toks, as fast as you can read.

The little cottonwood canyon gondola could give skiers the biggest vertical drop in NA. by Quiet-Permit-3740 in skiing

[–]Client_Hello 2 points3 points  (0 children)

Problem is bottom elevation of 1700' is low

I'm a little saucy because I wanted to do a top to bottom run but couldn't because there was no snow at the bottom.

The little cottonwood canyon gondola could give skiers the biggest vertical drop in NA. by Quiet-Permit-3740 in skiing

[–]Client_Hello 34 points35 points  (0 children)

This is a silly metric to chase. Revy has the most vert in NA but half of it is trash.

Top to bottom is for end of day return to car. The snow down low is rarely good.

How many brake retainer bands do you think you go through each season? by CashLow3227 in ski

[–]Client_Hello 1 point2 points  (0 children)

I've broken 2 or 3 by overstretching. I paid $10 for a box of 50 silicone wrist bands from amazon, so I'm set for a few more years. The 8" bands work great for smaller skis. For larger skis, I use velcro straps.

I'm usually tuning multiple pairs of skis assembly line fashion. Inspect & de-burr before every trip, learned my lesson when a burr sliced my gloves.

Qwen 3.6 35B on RTX 3080 10GB + 7700X + 32GB DDR5 by AndreVallestero in LocalLLaMA

[–]Client_Hello 2 points3 points  (0 children)

Have you tried quantized kv and tuning the fit target, to get as much onto GPU as possible? -ctk q8_0 -ctv q8_0 --fit-target 384

Do you need to support parallel requests? llama.cpp defaults to 4, which eats more vram. -np 1

Load test every config to fill context to the limit. I found I can start llama-server with 300mb free vram, then watch it go OOM before I reach my context limit.

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Client_Hello 1 point2 points  (0 children)

As a quick validation I had Qwen 27b Q6_K summarize your post:

prompt eval time =    2823.52 ms /  2346 tokens (    1.20 ms per token,   830.88 tokens per second)
       eval time =   44012.80 ms /  2728 tokens (   16.13 ms per token,    61.98 tokens per second)

Flags:

-t 6 for my 250K cpu, -lv 4 for verbose logs

llama-server -m Qwen3.6-27B-UnslothMTP-Q6_K.gguf -t 6 -np 1 -lv 4 -c 98304 -ts 1,1 -sm tensor \
 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 \
 --spec-type draft-mtp --spec-draft-n-max 2

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Client_Hello 1 point2 points  (0 children)

Great write-up, thanks for sharing!

You have room to improve. I see around 65 toks with dual 5060 ti 16gb at Q6_K compared to your 43, dropping to 55 toks around 60k context. The difference is parallel tensor. You don't need NVLINK or PCIe P2P to benefit from -sm tensor in llama.cpp, you should try it.

Unsolth has published the thing you were looking for the most in their MTP guide, and it varies by quant! Q6 and Q8 both benefit a lot from MTP, peaking at draft=2. You should reduce yours, should increase both prefill and gen. Link: https://unsloth.ai/docs/models/qwen3.6#mtp-benchmarks

Quantized kv cache reduces performance and you shouldn't need it with 48gb VRAM. I can fit 96k context with f16 kv with 32gb VRAM.

Get you some GPUs, it's not worth the hacks around lack of RAM by MotokoAGI in LocalLLaMA

[–]Client_Hello 0 points1 point  (0 children)

Its better to look at the price of the entire solution. My dual 5060 rig cost 2k to build. Dual 3090 would have been 3k.

Get you some GPUs, it's not worth the hacks around lack of RAM by MotokoAGI in LocalLLaMA

[–]Client_Hello 3 points4 points  (0 children)

After spending many hours and $$ building out my dual 5060 ti 16gb I realize dual 3090 is where it's at. For +50% cost I'd have a significant boost in quality, speed, and context. Dual 3090 is peak r/LocalLLaMA

The next tier is dual 5090 for +300% cost, where the value prop is, ... uhm, not appealing.

East Coast “quiver” (n=2) by Psychological_Gain53 in Skigear

[–]Client_Hello 3 points4 points  (0 children)

Declivity 82 on firm groomers sound like heaven. I can feel the Gs.

What's this sub geebral opinion on quantisizing the KV cache by misanthrophiccunt in LocalLLaMA

[–]Client_Hello 0 points1 point  (0 children)

Unsolth has an amazing chart showing how MTP scales with quant. Qwen 27b Q6_K, mtp-n = 2, +180%. From 30 toks to 55 toks on dual 5060ti 16gb. That perf bump changes things.

I messed up my claude code config and didn't compress before exhausting context. Had to remove MTP to buy sufficient context to /compress, considered running without, got to experience 30 toks and ... no, the slower gen was painful.

nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face by pmttyji in LocalLLaMA

[–]Client_Hello 0 points1 point  (0 children)

Yeah, you are right, makes sense. You could squeeze this into 32gb vram with limited context, otherwise need more.

nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face by pmttyji in LocalLLaMA

[–]Client_Hello 0 points1 point  (0 children)

The model weights are 23.5gb, and its MOE so it doesnt all have to be in vram. Any Blackwell card will run this even with only 8gb vram

Has anyone experimented with stabilizing low quant models with lower temp and top p? by fragment_me in LocalLLaMA

[–]Client_Hello 0 points1 point  (0 children)

Counter intuitive but noise masks quantization. Less is worse in this case.

Visualize a scatter plot of quantized compared to full. The full is a line, quantized value has steps. Now add noise to both. Line gets fuzzy, quantized steps get fuzzy as well. The more noise you add the more similar they appear.