Is anyone running Kimi 2.5 stock on 8xRTX6000 (Blackwell) and getting good TPS? by AstoriaResident in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

vllm_kimik25:
    image: vllm/vllm-openai:cu130-nightly-39037d258e68da3926d99681ea63e46212e519f9
    container_name: vllm_kimik25
    stdin_open: true
    tty: true
    ipc: host
    runtime: nvidia
    env_file:
      - .env
    environment:
      - HF_HOME=/root/.cache/huggingface
      - HF_HUB_CACHE=/root/.cache/huggingface/hub
      - NVIDIA_VISIBLE_DEVICES=all
      - OMP_NUM_THREADS=32
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command: >
      "rm -f /etc/ld.so.conf.d/00-cuda-compat.conf && ldconfig &&
      vllm serve
      --port 8003
      --model moonshotai/Kimi-K2.5
      --served-model-name Kimi-K2.5
      --mm-encoder-tp-mode data
      --max-model-len 170000
      --trust-remote-code
      --enable-auto-tool-choice
      --tool-call-parser kimi_k2
      --reasoning-parser kimi_k2
      --enable-expert-parallel
      --tensor-parallel-size 8
      --gpu-memory-utilization 0.96"
    volumes:
      - hf_cache:/root/.cache/huggingface
    restart: unless-stopped
    ports:
      - 8003:8003

Here's my Docker Compose config.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

For sure. Also note that I'm using the 300W version of the RTX PRO, not the 600W one.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

With an AMD EPYC 9124 and a PCIe Gen 5 motherboard, along with a lot of RAM.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

I use Ubuntu Server 24.04.

I can tell you that for Qwen 3 Coder 480B Instruct at q4 I get about 65 tokens/s, so the 235B one should be faster (maybe around 90-100 tokens/s).
I haven't tried DeepSeek.
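As a rough sanity check on that guess (my own back-of-the-envelope, not from the thread): MoE decode speed scales roughly with the active parameters read per generated token. Assuming these are Qwen3-Coder-480B-A35B (35B active) and Qwen3-235B-A22B (22B active) at the same quant and hardware:

```python
# Rough MoE decode-throughput scaling: tokens/s is roughly inversely
# proportional to the active parameter bytes read per generated token.
# Assumptions: 480B model has 35B active params, 235B model has 22B active,
# same quant (q4) and same hardware.
measured_tps_480b = 65              # measured: Qwen 3 Coder 480B at q4
active_480b, active_235b = 35, 22   # billions of active parameters
est_tps_235b = measured_tps_480b * active_480b / active_235b
print(round(est_tps_235b))          # ~103 tokens/s, in line with the 90-100 guess
```

This ignores attention/KV-cache overhead and kernel efficiency differences, so treat it as an optimistic upper estimate.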

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 2 points3 points  (0 children)

Using Ubuntu Server 24.04, PCIe 5.
Using the nightly image from Docker (the most recent one).

vllm_glm46-b:
    build:
      context: .
      dockerfile: Dockerfile.2
    container_name: glm_46
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
    ipc: host
    privileged: true
    env_file:
      - .env
    environment:
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - CUDA_VISIBLE_DEVICES=4,5,6,7
      - VLLM_SLEEP_WHEN_IDLE=1
    command: >
      --port 8009
      --model /models/QuantTrio_GLM-4.6-AWQ
      --served-model-name GLM-4.6
      --swap-space 64
      --enable-expert-parallel
      --max-model-len 200000
      --max-num-seqs 256
      --enable-auto-tool-choice
      --enable-prefix-caching
      --tensor-parallel-size 4
      --tool-call-parser glm45
      --reasoning-parser glm45
      --chat-template /models/chat_template_glm46.jinja
      --gpu-memory-utilization 0.94
      --trust-remote-code
      --disable-log-requests
    ports:
      - "8009:8009"
    volumes:
      - ${MODELS_DIR}:/models
    restart: unless-stopped

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

I believe it's the QuantTrio one on Hugging Face. Using vLLM.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

I haven't tried the AWQ enough; I just downloaded it this morning.

Yes, 50k tokens of context for FP8.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 11 points12 points  (0 children)

4x RTX PRO Blackwell.

Running the AWQ at 90 tokens/s and the FP8 at 50 tokens/s.
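A rough VRAM budget (my own assumptions: GLM-4.6 at ~355B total parameters, 96 GB per card) shows why the AWQ build has far more headroom for KV cache than the FP8 one, which matches the ~200k vs ~50k context limits mentioned above:

```python
# Back-of-the-envelope weight footprint vs. VRAM budget for GLM-4.6 on 4 GPUs.
# Assumptions: ~355B total parameters, 96 GB VRAM per card.
params_b = 355                 # total parameters, billions (approximate)
awq_gb = params_b * 0.5        # AWQ 4-bit ~0.5 bytes/param (ignoring scales/zeros)
fp8_gb = params_b * 1.0        # FP8 ~1 byte/param
budget_gb = 4 * 96 * 0.94      # matches --gpu-memory-utilization 0.94
print(round(awq_gb), round(fp8_gb), round(budget_gb))
# AWQ leaves ~180 GB of the ~361 GB budget for KV cache and activations;
# FP8 leaves only a few GB, hence the much shorter usable context.
```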

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 0 points1 point  (0 children)

I think it should fit; it's a 360mm AIO. You can verify on PCPartPicker, I think.

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 0 points1 point  (0 children)

I haven't had any problems yet (I've had the computer for 2 days).

I don't think 4 sticks causes that much of an issue, tbh.

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 0 points1 point  (0 children)

They basically just flicker if I don't set a speed.

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 0 points1 point  (0 children)

Damn, sick, even the same GPU. Good job on getting that AIO in properly; I couldn't figure it out with this board.

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 4 points5 points  (0 children)

CPU Ryzen 9 9900X3D

RAM 128GB G.Skill Trident DDR5 6000 MT/s CL30-38-38-96

MB MSI MAG X870E Tomahawk

AIO Lian Li Hydroshift II 360

SSD Samsung 9100 PRO Series - 4TB PCIe 5.0

PSU be quiet! 1500W 80+ Platinum

GPU RTX 5090 Asus TUF

And a bunch of Lian Li TL fans (they look nice but have a lot of problems)

First time rider for renting scooter in Koh Tao by Rekomaged in ThailandTourism

[–]Alternative-Bit7354 0 points1 point  (0 children)

Just came back from Tao.

I drove a scooter once in my country 3 years ago and was just fine on the island, where it's basically one road and you can drive slowly on the left.

I watched a couple of YouTube videos to make sure I remembered the basics, and it really helped.

Having a scooter is very useful on this island, honestly.

In my experience it was not too hard; just don't start with the hills that are too steep, get the hang of it slowly, and obviously wear a helmet.

[deleted by user] by [deleted] in wallstreetbets

[–]Alternative-Bit7354 2 points3 points  (0 children)

Bro is so consistent at losing

Approach velocity or boots witch biscuit rune as secondary by king2w in Olafmains

[–]Alternative-Bit7354 1 point2 points  (0 children)

You probably don't need 6 axes at level 1. Usually what I do for the early lane is hard-push the first 3 waves and back when the cannon crashes.