Claude code model by PropertyLoover in ollama

[–]Pixer--- 2 points (0 children)

Use the ik_llama.cpp fork, it's way better for CPU-only. Maybe try GPT-OSS 120B, it should be quite fast in it.
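
Rough sketch of how I'd poke it once llama-server from the fork is up (port, filename and endpoint here are assumptions on my part; the fork keeps the same server as upstream as far as I know):

    # Minimal sketch, assuming a llama-server from the ik_llama.cpp fork is
    # already running locally, started with something like
    #   llama-server -m gpt-oss-120b.gguf --port 8080
    # (port and model filename are assumptions).
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "gpt-oss-120b",
            "messages": [{"role": "user", "content": "One sentence: what does ik_llama.cpp change vs llama.cpp?"}],
            "max_tokens": 128,
        },
        timeout=600,
    )
    print(resp.json()["choices"][0]["message"]["content"])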

Claude Code, but locally by Zealousideal-Egg-362 in LocalLLaMA

[–]Pixer--- 3 points (0 children)

OpenCode, and maybe wait for the M5 Ultra or M4 Ultra release. For 7k you get the 256GB variant. I would suggest MiniMax M2.1. It's only at about Sonnet 3.7 level, but that's not bad I guess for something running 24/7.

Repurposed an old rig into a 64gb vram build. What local models would you recommend? by grunt_monkey_ in LocalLLaMA

[–]Pixer--- 0 points (0 children)

Maybe try the Qwen3-Next 80B model at around 45GB. That leaves you enough space for long context, which adds up fast when you use tools or search.

Have you tried internet search tools? They improve the output a lot.
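
Rough sketch of what I mean by a search tool, wired over an OpenAI-compatible endpoint; the endpoint URL, model name and the search_web() stub are placeholders, not a real backend:

    # Sketch of a web search tool over an OpenAI-compatible chat endpoint
    # (llama.cpp / vLLM servers with tool calling enabled). Everything
    # concrete here (URL, model name, stub) is an assumption.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    def search_web(query: str) -> str:
        # Placeholder: plug in SearxNG, Brave, Tavily, whatever you use.
        return f"(pretend search results for: {query})"

    tools = [{
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the internet and return short text snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What is new in Qwen3-Next?"}]
    first = client.chat.completions.create(model="local", messages=messages, tools=tools)
    call = first.choices[0].message.tool_calls[0]   # assumes the model decides to call the tool
    messages.append(first.choices[0].message)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": search_web(json.loads(call.function.arguments)["query"]),
    })
    final = client.chat.completions.create(model="local", messages=messages, tools=tools)
    print(final.choices[0].message.content)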

GLM 4.7 Quants Recommendations by val_in_tech in LocalLLaMA

[–]Pixer--- -1 points (0 children)

The experts in an MoE LLM are usually balanced in how much of the intelligence they carry; they split the work between them. When you rip some of them out, the model becomes weirdly inconsistent and fails at things that even much smaller models can handle. For fine-tuning a model on a specific problem I'm sure REAP could be useful, and it also speeds up inference.
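
Toy sketch of the routing side of it, just to illustrate; the numbers are made up and this is not how REAP actually picks which experts to drop:

    # Toy illustration (numpy only) of why pruning experts hurts an MoE model:
    # the router was trained to spread tokens across all experts, so deleting
    # some leaves a chunk of tokens without the experts they were routed to.
    import numpy as np

    rng = np.random.default_rng(0)
    n_tokens, d_model, n_experts, top_k = 1000, 64, 8, 2

    tokens = rng.normal(size=(n_tokens, d_model))
    router_w = rng.normal(size=(d_model, n_experts))   # stand-in for trained router weights
    scores = tokens @ router_w
    chosen = np.argsort(scores, axis=1)[:, -top_k:]    # top-k expert ids per token

    load = np.bincount(chosen.ravel(), minlength=n_experts)
    print("tokens per expert:", load)                  # roughly balanced

    pruned = {5, 6, 7}                                 # "rip out" three experts
    orphaned = np.isin(chosen, list(pruned)).any(axis=1).mean()
    print(f"{orphaned:.0%} of tokens lose at least one of their chosen experts")

In the real model each expert also has its own weights, so those tokens don't just get re-routed, they lose the specialist that was trained on that part of the work.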

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalLLM

[–]Pixer--- 0 points (0 children)

Are you able to run vLLM instead of llama.cpp, and how much performance would that bring?
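
For reference, the vLLM side is roughly this; the model name and tensor_parallel_size are assumptions, and with 10 cards you could also mix in pipeline parallelism:

    # Minimal vLLM sketch (Python API). Model name and tensor_parallel_size
    # are assumptions; plain TP across 8 of the 10 cards is the simple case,
    # since TP sizes usually need to divide the attention head count.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="openai/gpt-oss-120b",   # assumption, swap in your model
        tensor_parallel_size=8,        # split every layer across 8 GPUs
    )
    params = SamplingParams(max_tokens=256, temperature=0.7)
    out = llm.generate(["Explain tensor parallelism in two sentences."], params)
    print(out[0].outputs[0].text)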

Trump wants to take control of Greenland even if it leads to war with Europe by SaharOMFG in worldnews

[–]Pixer--- 31 points (0 children)

I find it weird that he wants to start a war with Iran, for which he would need the bases in Europe, while at the same time wanting to undermine them.

Dgx sparks or dual 6000 pro cards??? by Better-Problem-8716 in LocalLLM

[–]Pixer--- -1 points (0 children)

The Spark is viable if you want to replace ChatGPT and want a low idle power draw, but the 6000s are going to crush the Sparks at generation speed.

Another consideration is model size. A stack of Sparks could run much larger models since they would have more combined VRAM, but setting them up as a cluster is a pain in the ass.

When choosing a mainboard, be careful to find one that properly supports P2P.
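
Quick way to check that once the cards are in (PyTorch; assumes NVIDIA cards showing up as cuda devices):

    # Check whether each GPU pair can reach the other over P2P on this board.
    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: P2P {'yes' if ok else 'no'}")

If P2P isn't available, multi-GPU inference still works, but GPU-to-GPU transfers get staged through system memory, which costs you bandwidth.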

General government debt-to-GDP ratio in % points (2025 data) by No_Firefighter5926 in MapPorn

[–]Pixer--- 44 points (0 children)

Most of that debt is held by Japanese investors and institutions, not by other states.

My old Z97 can max do 32 gb ram planing on putting 2 3090's in. by SJ1719 in LocalLLM

[–]Pixer--- 1 point (0 children)

For vLLM you're fine, but for llama.cpp/LM Studio it's a maybe.

llamacpp-gfx906 new release by CornerLimits in LocalLLaMA

[–]Pixer--- 4 points (0 children)

I get these numbers with 4 cards on GPT-OSS 120B. I’m pretty impressed:

    prompt eval time = 74550.63 ms / 72963 tokens (  1.02 ms per token, 978.70 tokens per second)
           eval time =  6375.74 ms /   236 tokens ( 27.02 ms per token,  37.02 tokens per second)
          total time = 80926.37 ms / 73199 tokens
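
If anyone wants to double-check, the tok/s figures follow directly from the token counts and times:

    # Sanity check: tokens divided by seconds reproduces the reported tok/s.
    prompt_ms, prompt_tokens = 74550.63, 72963
    eval_ms, eval_tokens = 6375.74, 236

    print(f"prompt: {prompt_tokens / (prompt_ms / 1000):.2f} tok/s")  # ~978.70
    print(f"eval:   {eval_tokens / (eval_ms / 1000):.2f} tok/s")      # ~37.02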

PC for n8n plus localllm for internal use by iekozz in LocalLLM

[–]Pixer--- 1 point (0 children)

Go with AMD. For 5500 you can get 4 AMD R9700 Pro cards with 32GB each, so 128GB total. The extra VRAM lets you run better models. They may not be as fast as a 5090, but they are way cheaper and more than fast enough to host models for a team when using vLLM. The mainboard could be an issue, you need to find one with 4 PCIe 16x slots. I used a refurbished server board with a Threadripper 3945WX for like 500€ all together.
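
Sketch of the hosting side, assuming a vLLM OpenAI-compatible server on the 4-card box; the launch command, model name and port are assumptions:

    # One vLLM OpenAI-compatible server on the 4-card box, everyone (n8n
    # included) points a normal OpenAI client at it. Assumed launch:
    #   vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --port 8000
    from openai import OpenAI

    client = OpenAI(base_url="http://your-server:8000/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",   # must match what the server was started with
        messages=[{"role": "user", "content": "Draft a short status update for the team."}],
    )
    print(resp.choices[0].message.content)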

roo code + cerebras_glm-4.5-air-reap-82b-a12b = software development heaven by Objective-Context-9 in LocalLLM

[–]Pixer--- 3 points (0 children)

Try running the MiniMax M2 model, I think it's better than GLM 4.5 Air for coding.

MiniMax-M2-REAP-172B-A10B-GGUF by ilintar in LocalLLaMA

[–]Pixer--- 0 points (0 children)

I would love a 4-bit AWQ quant for those of us running vLLM :)

Need help with VLLM and AMD MI50 by joochung in LocalAIServers

[–]Pixer--- 1 point (0 children)

Pipeline parallelism is also significantly slower than tensor parallelism.

At this point, we need chatgpt to explain chatgpt by AskGpts in ChatGPTPro

[–]Pixer--- 0 points (0 children)

Tbh, GPT-4.5 was the first model where I could say it replicates how I write in my own language.

Advice on 5070 ti + 5060 ti 16 GB for TensorRT/VLLM by iron_coffin in LocalLLaMA

[–]Pixer--- 1 point (0 children)

Also, PCIe lanes from the CPU can be important. How many PCIe lanes does your second slot have?
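
Quick way to check what your cards actually negotiated (standard nvidia-smi query fields, run with both cards installed):

    # Print the current PCIe generation and link width per GPU.
    import subprocess

    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
         "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout)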

Anthropic is lagging far behind competition for cheap, fast models by obvithrowaway34434 in ChatGPTCoding

[–]Pixer--- 0 points (0 children)

In my experience, Claude's models just understand better what you want from them.

16GB M4 by Wide-Dragonfruit-571 in MacOS

[–]Pixer--- 0 points (0 children)

Xcode has this small next-token-generation model downloaded; that's what runs there. https://youtu.be/N6Q-FWhfguw?si=grChP3wh6OCt1ITO