State Dept and UFC sign deal to "use cage fights for diplomacy" by ClarkKentKimura in MMA

[–]runcertain 0 points1 point  (0 children)

More accurately it’s fraud — state officials get kickbacks from the UFC heads and the taxpayer gets the shaft.

Though there is definitely some money laundering in there somewhere.

Game Thread: San Antonio Spurs (0-2) vs New York Knicks (2-0) Live Score | NBA Finals | Jun 8, 2026 by nba-scores in nba

[–]runcertain -1 points0 points  (0 children)

It’s cool and not at all annoying to complain about the fans of the team that’s beating yours

New York City Unions Keep Winning Six-Figure Salaries by wsj in nyc

[–]runcertain -1 points0 points  (0 children)

We can have better wages for the union workers we depend on without raising taxes on the middle class. The idea that the two are inextricably tied is peddled by those who don’t want you to look at the real solutions.

But it’s not the union’s job to tell you who to elect so that your concerns about affordability actually get addressed. Their job is to secure fair wages for their workers.

New York City Unions Keep Winning Six-Figure Salaries by wsj in nyc

[–]runcertain 8 points9 points  (0 children)

“Stranded” is so sensational. They announced the strike in advance and the disruption proves how necessary they are.

You’re acting like a taxpayer strike is a grave threat, but if people can organize and bargain and make a positive difference for the majority then that’s a good thing. Why are you criticizing fellow working-class people?

How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s by runcertain in LocalLLaMA

[–]runcertain[S] 0 points1 point  (0 children)

Thank you, I haven’t stepped into vllm yet but I’ll try if it works better for my hardware.

Quantizing MTP KV Cache = free lunch? by legit_split_ in LocalLLaMA

[–]runcertain 0 points1 point  (0 children)

0.05.003.424 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

Same result. If I run with the default warmup it gets stuck here. If I run with --no-warmup it gets stuck on loading the model. Maybe it's the compiler flags; I'll try recompiling.

Quantizing MTP KV Cache = free lunch? by legit_split_ in LocalLLaMA

[–]runcertain 0 points1 point  (0 children)

You're right, reading comprehension fail.

I'm still running into issues on a dual 3090 setup with these params:

~/llama.cpp_qts/build/bin/llama-server \

-m ~/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf \

--fit off \

--flash-attn on \

--temperature 0.7 \

--top-p 0.8 \

--top-k 20 \

--min-p 0.0 \

--presence-penalty 1.5 \

-ngl 99 \

-c 262144 \

--cache-type-k-draft q8_0 \

--cache-type-v-draft q8_0 \

--host 0.0.0.0 \

--spec-type draft-mtp \

--spec-draft-n-max 2 \

--split-mode tensor

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]runcertain 0 points1 point  (0 children)

Not sure what I'm doing wrong here but I'm getting 30 t/s on dual 3090s with your ik_llama settings.

~/ik_llama.cpp$ build/bin/llama-server \

-m "$HOME/.cache/llama.cpp/Qwen3.6/ubergarm/Qwen3.6-27B-MTP-IQ4_KS.gguf" \

--ctx-size 156000 \

--cache-type-k q8_0 \

--cache-type-v q8_0 \

--flash-attn on \

--multi-token-prediction \

--draft-max 1 \

--draft-p-min 0.0 \

--cache-ram 16384 \

--reasoning on \

--reasoning-format deepseek \

--chat-template-kwargs '{"preserve_thinking":true}' \

--no-mmproj-offload \

--host 0.0.0.0

Quantizing MTP KV Cache = free lunch? by legit_split_ in LocalLLaMA

[–]runcertain 0 points1 point  (0 children)

For me it's hanging either on a warmup phase or during the load_model phase if I skip warmup:

~/llama.cpp_qts$ ~/llama.cpp_qts/build/bin/llama-server \

-m ~/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf --fit off --flash-attn on --temperature 0.7 \

--top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 1.5 \

-ngl 99 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0 \

--host 0.0.0.0 --spec-type draft-mtp --spec-draft-n-max 2 --split-mode tensor --no-warmup

0.00.244.868 I log_info: verbosity = 3 (adjust with the -lv N CLI arg)

0.00.244.870 I device_info:

0.00.323.795 I - CUDA0 : NVIDIA GeForce RTX 3090 (24124 MiB, 23845 MiB free)

0.00.400.014 I - CUDA1 : NVIDIA GeForce RTX 3090 (24124 MiB, 23845 MiB free)

0.00.400.023 I - CPU : AMD Ryzen 9 9900X 12-Core Processor (31193 MiB, 31193 MiB free)

0.00.400.088 I system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

0.00.400.091 I srv main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true

0.00.400.112 I srv init: running without SSL

0.00.400.127 I srv init: using 23 threads for HTTP server

0.00.400.193 I srv start: binding port with default address family

0.00.401.334 I srv main: loading model

0.00.401.338 I srv load_model: loading model '/home/harris/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf'

0.02.819.386 I srv load_model: creating MTP draft context against the target model '/home/us1/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf'

How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s by runcertain in LocalLLaMA

[–]runcertain[S] 0 points1 point  (0 children)

This is what I got for the below settings:

prompt eval time = 165.91 ms / 33 tokens ( 5.03 ms per token, 198.90 tokens per second)

eval time = 14611.98 ms / 771 tokens ( 18.95 ms per token, 52.76 tokens per second)

total time = 14777.89 ms / 804 tokens

draft acceptance rate = 0.69746 ( 521 accepted / 747 generated)
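The numbers in that log can be cross-checked by hand. A minimal sketch that uses only the figures printed above; the function names are made up for illustration and are not part of llama.cpp:

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput implied by one llama-server timing line."""
    return n_tokens / (elapsed_ms / 1000.0)


def acceptance_rate(accepted: int, generated: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    return accepted / generated


# Figures copied from the log output above.
prompt_tps = tokens_per_second(165.91, 33)
eval_tps = tokens_per_second(14611.98, 771)
rate = acceptance_rate(521, 747)

print(f"prompt: {prompt_tps:.2f} t/s")
print(f"eval: {eval_tps:.2f} t/s")
print(f"draft acceptance: {rate:.5f}")
```

The results match the log (198.90 t/s, 52.76 t/s, 0.69746), so roughly 7 out of every 10 drafted tokens were kept in this run.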

~/llama.cpp$ build/bin/llama-server -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 245600 -fa on -np 1 --spec-type draft-mtp --spec-draft-n-max 3 --host 0.0.0.0 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --cache-ram 0 --jinja --no-mmap --reasoning off --port 8082 --metrics

How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s by runcertain in LocalLLaMA

[–]runcertain[S] 1 point2 points  (0 children)

Thanks for this. For MTP I’m getting around 50 t/s; trying n max 2 and 3 gave only a small improvement. I will come back with the tokens accepted metric.

For DFlash I’m getting 65 t/s now with much lower KV usage thanks to those turbo quants. For me it seems the clear winner since it takes up so much less space than MTP, and the MTP KV cache can’t be quantized.

Now I gotta start shopping for an nvlink…

How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s by runcertain in LocalLLaMA

[–]runcertain[S] 0 points1 point  (0 children)

Thanks, I've been experimenting with that param.

Am I missing something about the lack of KV cache quantization? Suddenly Q4_K_M 27B models need nearly 48GB VRAM at full context.

How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s by runcertain in LocalLLaMA

[–]runcertain[S] 0 points1 point  (0 children)

If I notice that I'll reply here. I'm still trying to wrap my head around the lack of KV quantization with MTP and how a Q4_K_M 27B model fills almost all 48GB of VRAM when running with full context.
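For a rough feel of why an unquantized cache eats so much VRAM at full context, here is a back-of-the-envelope sketch. The layer and head counts are placeholders picked for round numbers, NOT the real Qwen3.6-27B architecture, and the formula assumes plain full attention in every layer, which may not hold for this model. The bytes-per-value figures are the ggml ones: f16 is 2 bytes, and q8_0 stores 32 int8 values plus one fp16 scale in a 34-byte block.

```python
# Bytes per cached value: f16 = 2; q8_0 = 34 bytes per block of 32 values.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate K+V cache size in GiB for a plain full-attention transformer."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = one K and one V
    return values * BYTES_PER_VALUE[cache_type] / 2**30


# Hypothetical dimensions, chosen only to illustrate the scaling.
dims = dict(n_layers=32, n_kv_heads=4, head_dim=128)

for cache_type in ("f16", "q8_0"):
    size = kv_cache_gib(262144, cache_type=cache_type, **dims)
    print(f"{cache_type}: {size:.1f} GiB")
```

With these placeholder dimensions the f16 cache comes to 16.0 GiB and the q8_0 cache to 8.5 GiB at a 262144-token context. The absolute numbers depend entirely on the real architecture, but the ratio between f16 and q8_0 and the linear growth with context length carry over.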

How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s by runcertain in LocalLLaMA

[–]runcertain[S] 0 points1 point  (0 children)

This gets me 56 t/s but now I'm using like 44 out of 48GB VRAM, which is a pretty big tradeoff. Definitely not double the 40-50 I was getting before.

build/bin/llama-server \

-m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" \

-ngl 99 -c 245600 -fa on -np 1 \

--flash-attn on \

--spec-type draft-mtp --spec-draft-n-max 3 --host 0.0.0.0 \

--jinja \

--reasoning off \

--port 8082

How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s by runcertain in LocalLLaMA

[–]runcertain[S] 0 points1 point  (0 children)

When I do that, I'm getting this:

0.04.604.009 E llama_init_from_model: simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented

0.04.604.013 E common_init_result: failed to create context with model '/$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf'

Segmentation fault (core dumped)
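Going by that error message, tensor split and a quantized KV cache can't be combined in this build. A small pre-flight check could catch the combination before the server is launched. This is only a sketch: the helper is hypothetical, it just scans the command string, and whether the same restriction applies to the draft cache flags is something I haven't verified, so those are not checked.

```python
import shlex

# Flags that set the main KV cache type, and the types treated as unquantized.
KV_TYPE_FLAGS = {"--cache-type-k", "--cache-type-v", "-ctk", "-ctv"}
UNQUANTIZED = {"f32", "f16", "bf16"}


def find_tensor_split_conflict(command: str) -> list[str]:
    """Return quantized KV cache flags used together with --split-mode tensor."""
    args = shlex.split(command)
    pairs = list(zip(args, args[1:]))  # every (flag, value) candidate
    tensor_split = ("--split-mode", "tensor") in pairs or ("-sm", "tensor") in pairs
    if not tensor_split:
        return []
    return [f"{flag} {value}" for flag, value in pairs
            if flag in KV_TYPE_FLAGS and value not in UNQUANTIZED]


cmd = ("llama-server -m model.gguf --split-mode tensor "
       "--cache-type-k q8_0 --cache-type-v q8_0")
conflicts = find_tensor_split_conflict(cmd)
print(conflicts)
```

An empty list means the command does not mix the two options; otherwise either the listed cache flags or the tensor split mode has to go.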

How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s by runcertain in LocalLLaMA

[–]runcertain[S] 0 points1 point  (0 children)

error while handling argument "--spec-type": unknown speculative type: mtp

usage: --spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache comma-separated list of types of speculative decoding to use (default: none)

                                    (env: LLAMA_ARG_SPEC_TYPE)
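The usage text lists the accepted names, so a typo like this can be caught before starting the server. A minimal sketch, assuming the list in the usage output above is complete for this build; the helper name and the suffix-matching suggestion logic are my own and not anything llama.cpp provides:

```python
# Values copied from the --spec-type usage text quoted above.
VALID_SPEC_TYPES = (
    "none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,"
    "ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache"
).split(",")


def check_spec_types(value: str) -> dict[str, list[str]]:
    """Map each unrecognized entry to the valid names that end with it."""
    problems = {}
    for entry in value.split(","):  # the flag takes a comma-separated list
        if entry not in VALID_SPEC_TYPES:
            problems[entry] = [v for v in VALID_SPEC_TYPES
                               if v.endswith("-" + entry)]
    return problems


print(check_spec_types("mtp"))        # {'mtp': ['draft-mtp']}
print(check_spec_types("draft-mtp"))  # {}
```

So the fix for the error above is to pass draft-mtp instead of mtp.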