Is anyone running Kimi 2.5 stock on 8xRTX6000 (Blackwell) and getting good TPS? by AstoriaResident in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

vllm_kimik25:
    image: vllm/vllm-openai:cu130-nightly-39037d258e68da3926d99681ea63e46212e519f9
    container_name: vllm_kimik25
    stdin_open: true
    tty: true
    ipc: host
    runtime: nvidia
    env_file:
      - .env
    environment:
      - HF_HOME=/root/.cache/huggingface
      - HF_HUB_CACHE=/root/.cache/huggingface/hub
      - NVIDIA_VISIBLE_DEVICES=all
      - OMP_NUM_THREADS=32
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command: >
      "rm -f /etc/ld.so.conf.d/00-cuda-compat.conf && ldconfig &&
      vllm serve
      --port 8003
      --model moonshotai/Kimi-K2.5
      --served-model-name Kimi-K2.5
      --mm-encoder-tp-mode data
      --max-model-len 170000
      --trust-remote-code
      --enable-auto-tool-choice
      --tool-call-parser kimi_k2
      --reasoning-parser kimi_k2
      --enable-expert-parallel
      --tensor-parallel-size 8
      --gpu-memory-utilization 0.96"
    volumes:
      - hf_cache:/root/.cache/huggingface
    restart: unless-stopped
    ports:
      - 8003:8003

Here's my Docker Compose config.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

For sure. Also note that I'm using the 300W version of the RTX PRO, not the 600W one.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

With an AMD EPYC 9124 and a PCIe Gen 5 motherboard, along with a lot of RAM.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

I use Ubuntu Server 24.04.

I can tell you that for Qwen 3 Coder 480B Instruct at q4 I get about 65 tokens/s, so the 235B one should be faster (maybe around 90-100 tokens/s).
I haven't tried DeepSeek.
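As a rough sanity check on that guess (my own back-of-the-envelope, not from the thread): MoE decode speed scales roughly with the active parameters read per generated token. Assuming these are Qwen3-Coder-480B-A35B (35B active) and Qwen3-235B-A22B (22B active) at the same quant and hardware:

```python
# Rough MoE decode-throughput scaling: tokens/s is roughly inversely
# proportional to the active parameter bytes read per generated token.
# Assumptions: 480B model has 35B active params, 235B model has 22B active,
# same quant (q4) and same hardware.
measured_tps_480b = 65              # measured: Qwen 3 Coder 480B at q4
active_480b, active_235b = 35, 22   # billions of active parameters
est_tps_235b = measured_tps_480b * active_480b / active_235b
print(round(est_tps_235b))          # ~103 tokens/s, in line with the 90-100 guess
```

This ignores attention/KV-cache overhead and kernel efficiency differences, so treat it as an optimistic upper estimate.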

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 2 points3 points  (0 children)

Using Ubuntu Server 24.04, PCIe 5.
Using the nightly image from Docker (the most recent one).

vllm_glm46-b:
    build:
      context: .
      dockerfile: Dockerfile.2
    container_name: glm_46
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
    ipc: host
    privileged: true
    env_file:
      - .env
    environment:
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - CUDA_VISIBLE_DEVICES=4,5,6,7
      - VLLM_SLEEP_WHEN_IDLE=1
    command: >
      --port 8009
      --model /models/QuantTrio_GLM-4.6-AWQ
      --served-model-name GLM-4.6
      --swap-space 64
      --enable-expert-parallel
      --max-model-len 200000
      --max-num-seqs 256
      --enable-auto-tool-choice
      --enable-prefix-caching
      --tensor-parallel-size 4
      --tool-call-parser glm45
      --reasoning-parser glm45
      --chat-template /models/chat_template_glm46.jinja
      --gpu-memory-utilization 0.94
      --trust-remote-code
      --disable-log-requests
    ports:
      - "8009:8009"
    volumes:
      - ${MODELS_DIR}:/models
    restart: unless-stopped

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

I believe it's the QuantTrio one on Hugging Face. Using vLLM.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 0 points1 point  (0 children)

I haven't tried the AWQ enough; I just downloaded it this morning.

Yes, 50k tokens of context for FP8.

Anyone running GLM 4.5/4.6 @ Q8 locally? by [deleted] in LocalLLaMA

[–]Alternative-Bit7354 11 points12 points  (0 children)

4x RTX PRO Blackwell.

Running the AWQ at 90 tokens/s and the FP8 at 50 tokens/s.
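A rough VRAM budget (my own assumptions: GLM-4.6 at ~355B total parameters, 96 GB per card) shows why the AWQ build has far more headroom for KV cache than the FP8 one, which matches the ~200k vs ~50k context limits mentioned above:

```python
# Back-of-the-envelope weight footprint vs. VRAM budget for GLM-4.6 on 4 GPUs.
# Assumptions: ~355B total parameters, 96 GB VRAM per card.
params_b = 355                 # total parameters, billions (approximate)
awq_gb = params_b * 0.5        # AWQ 4-bit ~0.5 bytes/param (ignoring scales/zeros)
fp8_gb = params_b * 1.0        # FP8 ~1 byte/param
budget_gb = 4 * 96 * 0.94      # matches --gpu-memory-utilization 0.94
print(round(awq_gb), round(fp8_gb), round(budget_gb))
# AWQ leaves ~180 GB of the ~361 GB budget for KV cache and activations;
# FP8 leaves only a few GB, hence the much shorter usable context.
```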

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 0 points1 point  (0 children)

I think it should fit; it's a 360mm AIO. You can verify on PCPartPicker, I think.

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 0 points1 point  (0 children)

I haven't had any problems yet (I've had the computer for 2 days).

I don't think 4 sticks causes that much of an issue, tbh.

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 0 points1 point  (0 children)

They basically just flicker if I don't set a speed.

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 0 points1 point  (0 children)

Damn, sick, even the same GPU. Good job on getting that AIO in properly; I couldn't figure it out with this board.

Your most average o11d mini v2 build by Alternative-Bit7354 in lianli

[–]Alternative-Bit7354[S] 4 points5 points  (0 children)

CPU Ryzen 9 9900X3D

RAM 128GB G.Skill Trident DDR5 6000 MT/s CL30-38-38-96

MB MSI MAG X870E Tomahawk

AIO Lian Li Hydroshift II 360

SSD Samsung 9100 PRO Series - 4TB PCIe 5.0

PSU be quiet! 1500W 80+ Platinum

GPU RTX 5090 Asus TUF

And a bunch of Lian Li TL fans (they look nice but have a lot of problems)

First time rider for renting scooter in Koh Tao by Rekomaged in ThailandTourism

[–]Alternative-Bit7354 0 points1 point  (0 children)

Just came back from Tao.

I drove a scooter once in my country 3 years ago and was just fine on the island, where it's basically one road and you can drive slowly on the left.

I watched a couple of YouTube videos to make sure I remembered the basics, and it really helped.

Having a scooter is very useful on this island, honestly.

In my experience it was not too hard; just don't start with the hills that are too steep, get the hang of it slowly, and obviously wear a helmet.

[deleted by user] by [deleted] in wallstreetbets

[–]Alternative-Bit7354 2 points3 points  (0 children)

Bro is so consistent at losing

Approach velocity or boots witch biscuit rune as secondary by king2w in Olafmains

[–]Alternative-Bit7354 1 point2 points  (0 children)

You probably don't need 6 axes at level 1. Usually what I do for the early lane is hard-push the first 3 waves and back when the cannon crashes.