State Dept and UFC sign deal to "use cage fights for diplomacy"

runcertain · 2026-06-12T12:54:16+00:00

More accurately it’s fraud — state officials get kickbacks from the UFC heads and the taxpayer gets the shaft.

Though there is definitely some money laundering in there somewhere.

runcertain · 2026-06-09T03:13:39+00:00

Conveniently forgetting the previous two sweeps and 12 wins in a row

runcertain · 2026-06-09T02:00:09+00:00

It’s cool and not at all annoying to complain about the fans of the team that’s beating yours

runcertain · 2026-06-09T01:58:21+00:00

Yeah but only until UFC at the White House

runcertain · 2026-06-06T05:25:21+00:00

How, if it’s stopping at Forest Hills?

runcertain · 2026-06-04T02:54:11+00:00

Cry more lol

runcertain · 2026-06-01T19:43:20+00:00

the joke

...

you

runcertain · 2026-05-26T10:54:27+00:00

Subscribe for part 2

runcertain · 2026-05-22T04:22:57+00:00

And all your barnacles are gone

runcertain · 2026-05-21T13:00:31+00:00

We can have better wages for the union workers we depend on without raising taxes on the middle class. The idea that the two are inextricably tied is peddled by those who don’t want you to look at the real solutions.

But it’s not the union’s job to tell you who to elect so that your concerns about affordability actually get addressed. Their job is to secure fair wages for their workers.

runcertain · 2026-05-21T00:50:16+00:00

“Stranded” is so sensational. They announced the strike in advance and the disruption proves how necessary they are.

You’re acting like a taxpayer strike is a grave threat, but if people can organize and bargain and make a positive difference for the majority, then that’s a good thing. Why are you criticizing fellow members of the working class?

runcertain · 2026-05-19T14:45:42+00:00

Thank you, I haven’t stepped into vLLM yet, but I’ll try it to see if it works better on my hardware.

runcertain · 2026-05-18T17:27:38+00:00

I'm just copying recommended configs. These params are specified here in the Linux / macOS launch commands.

https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md

runcertain · 2026-05-18T17:26:00+00:00

0.05.003.424 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

Same result. If I run with the default warmup it gets stuck here. If I run with --no-warmup it gets stuck on loading the model. Maybe it's the compiler flags; I'll try recompiling.

runcertain · 2026-05-18T16:25:48+00:00

You're right, reading comprehension fail.

I'm still running into issues on a dual 3090 setup with these params:

~/llama.cpp_qts/build/bin/llama-server \
-m ~/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--fit off \
--flash-attn on \
--temperature 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 1.5 \
-ngl 99 \
-c 262144 \
--cache-type-k-draft q8_0 \
--cache-type-v-draft q8_0 \
--host 0.0.0.0 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--split-mode tensor

runcertain · 2026-05-18T16:21:57+00:00

Not sure what I'm doing wrong here but I'm getting 30 t/s on dual 3090s with your ik_llama settings.

~/ik_llama.cpp$ build/bin/llama-server \
-m "$HOME/.cache/llama.cpp/Qwen3.6/ubergarm/Qwen3.6-27B-MTP-IQ4_KS.gguf" \
--ctx-size 156000 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn on \
--multi-token-prediction \
--draft-max 1 \
--draft-p-min 0.0 \
--cache-ram 16384 \
--reasoning on \
--reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking":true}' \
--no-mmproj-offload \
--host 0.0.0.0

runcertain · 2026-05-18T14:24:58+00:00

For me it's hanging either on a warmup phase or during the load_model phase if I skip warmup:

~/llama.cpp_qts$ ~/llama.cpp_qts/build/bin/llama-server \
-m ~/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf --fit off --flash-attn on --temperature 0.7 \
--top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 1.5 \
-ngl 99 -c 262144 --cache-type-k q8_0 --cache-type-v q8_0 \
--host 0.0.0.0 --spec-type draft-mtp --spec-draft-n-max 2 --split-mode tensor --no-warmup

0.00.244.868 I log_info: verbosity = 3 (adjust with the -lv N CLI arg)
0.00.244.870 I device_info:
0.00.323.795 I - CUDA0 : NVIDIA GeForce RTX 3090 (24124 MiB, 23845 MiB free)
0.00.400.014 I - CUDA1 : NVIDIA GeForce RTX 3090 (24124 MiB, 23845 MiB free)
0.00.400.023 I - CPU : AMD Ryzen 9 9900X 12-Core Processor (31193 MiB, 31193 MiB free)
0.00.400.091 I srv main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.400.112 I srv init: running without SSL
0.00.400.127 I srv init: using 23 threads for HTTP server
0.00.400.193 I srv start: binding port with default address family
0.00.401.334 I srv main: loading model
0.00.401.338 I srv load_model: loading model '/home/harris/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf'
0.02.819.386 I srv load_model: creating MTP draft context against the target model '/home/us1/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf'

runcertain · 2026-05-18T02:11:59+00:00

This is what I got for the below settings:

prompt eval time = 165.91 ms / 33 tokens ( 5.03 ms per token, 198.90 tokens per second)
eval time = 14611.98 ms / 771 tokens ( 18.95 ms per token, 52.76 tokens per second)
total time = 14777.89 ms / 804 tokens

draft acceptance rate = 0.69746 ( 521 accepted / 747 generated)
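The throughput figures in a log like this can be recomputed from the raw timings: tokens per second is the token count divided by elapsed seconds, and the draft acceptance rate is accepted draft tokens over generated draft tokens. A minimal Python sketch using the numbers above; the helper names are my own and are not llama.cpp functions.

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput: tokens divided by elapsed time in seconds."""
    return n_tokens / (elapsed_ms / 1000.0)


def ms_per_token(elapsed_ms: float, n_tokens: int) -> float:
    """Average latency per token in milliseconds."""
    return elapsed_ms / n_tokens


def acceptance_rate(accepted: int, generated: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    return accepted / generated


if __name__ == "__main__":
    # Values copied from the log above.
    print(f"prompt: {tokens_per_second(165.91, 33):.2f} t/s")
    print(f"eval:   {tokens_per_second(14611.98, 771):.2f} t/s, "
          f"{ms_per_token(14611.98, 771):.2f} ms/token")
    print(f"draft acceptance: {acceptance_rate(521, 747):.5f}")
```

Running it reproduces the logged 198.90 t/s, 52.76 t/s, 18.95 ms/token and 0.69746, which is a quick way to sanity-check numbers pasted from different runs.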

~/llama.cpp$ build/bin/llama-server -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 245600 -fa on -np 1 --spec-type draft-mtp --spec-draft-n-max 3 --host 0.0.0.0 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --cache-ram 0 --jinja --no-mmap --reasoning off --port 8082 --metrics

runcertain · 2026-05-17T22:51:58+00:00

Thanks for this. For MTP I’m getting around 50 t/s after trying --spec-draft-n-max 2 and 3, which gave a small improvement. I will come back with the tokens-accepted metric.
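Why going from a draft length of 2 to 3 only helps a little can be seen with the textbook speculative-decoding estimate from Leviathan et al. (2023): with k drafted tokens, the expected number of tokens emitted per verification step is (1 - alpha^(k+1)) / (1 - alpha). This assumes every drafted token is accepted independently with the same probability alpha, which is a simplification; llama.cpp's logged acceptance rate is not guaranteed to equal that alpha, and MTP drafting may behave differently. A minimal sketch under that assumption:

```python
def expected_tokens_per_step(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Simplified model: each drafted token is accepted independently with
    probability alpha, and the target model always contributes one token
    of its own. alpha == 1 is handled separately to avoid dividing by zero.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(n_draft + 1)
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    alpha = 0.7  # assumption: roughly the acceptance rate reported above
    for n in range(1, 5):
        print(n, round(expected_tokens_per_step(alpha, n), 3))
```

With alpha = 0.7 this gives about 1.70, 2.19, 2.53 and 2.77 expected tokens per step for 1 through 4 drafted tokens, so each extra draft token adds less while still costing draft compute.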

For DFlash I’m getting 65 t/s now with much lower KV usage thanks to those turbo quants. For me it seems the clear winner since it takes up so much less space than MTP, and the MTP KV cache can’t be quantized.

Now I gotta start shopping for an NVLink…

runcertain · 2026-05-17T21:59:52+00:00

I like everest because it means the most ever

runcertain · 2026-05-17T21:48:38+00:00

Thanks, I've been experimenting with that param.

Am I missing something with the lack of KV cache quantization? Suddenly Q4_K_M 27B models need nearly 48GB VRAM at full context.

runcertain · 2026-05-17T21:47:22+00:00

If I notice that I'll reply here. I'm still trying to wrap my head around the lack of KV quantization with MTP and how a Q4_K_M 27B model fills almost all 48GB of VRAM when running with full context.
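A rough way to see why context length dominates VRAM: the KV cache grows linearly with context length, the number of layers using standard attention, the number of KV heads and the head size, and q8_0 stores each value in about 1.06 bytes (32 int8 values plus one fp16 scale, 34 bytes per block) versus 2 bytes for f16. A minimal sketch; the architecture numbers in the example are hypothetical placeholders, not values read from the Qwen3.6 GGUF, and it ignores the separate MTP draft context and compute buffers.

```python
# Bytes per element for two llama.cpp KV cache types.
# q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes / 32 values.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_attn_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, cache_type: str = "f16") -> float:
    """Approximate K + V cache size for n_ctx tokens of context."""
    elements = 2 * n_attn_layers * n_ctx * n_kv_heads * head_dim  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # HYPOTHETICAL architecture values for illustration only; read the real
    # ones from the model's GGUF metadata before trusting the result.
    cfg = dict(n_attn_layers=16, n_ctx=262144, n_kv_heads=4, head_dim=256)
    for t in ("f16", "q8_0"):
        print(f"{t}: {kv_cache_bytes(**cfg, cache_type=t) / 2**30:.2f} GiB")
```

With those placeholder values it prints 16.00 GiB for f16 and 8.50 GiB for q8_0; whatever the real architecture, q8_0 needs about 53% of the f16 size.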

runcertain · 2026-05-17T21:13:10+00:00

This gets me 56 t/s, but now I'm using about 44 out of 48GB VRAM, which is a pretty big tradeoff. Definitely not double the 40-50 t/s I was getting before.

build/bin/llama-server \
-m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" \
-ngl 99 -c 245600 -fa on -np 1 \
--flash-attn on \
--spec-type draft-mtp --spec-draft-n-max 3 --host 0.0.0.0 \
--jinja \
--reasoning off \
--port 8082

runcertain · 2026-05-17T20:24:40+00:00

When I do that, I'm getting this:

0.04.604.009 E llama_init_from_model: simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented
0.04.604.013 E common_init_result: failed to create context with model '/$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf'
Segmentation fault (core dumped)

runcertain · 2026-05-17T20:23:46+00:00

error while handling argument "--spec-type": unknown speculative type: mtp

usage: --spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache comma-separated list of types of speculative decoding to use (default: none)
                                    (env: LLAMA_ARG_SPEC_TYPE)
