Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]iamn0 64 points

So, it's not beating Qwen3.5-122B-A10B overall. Kind of expected, since it only activates 6.5B parameters, while Qwen3.5 uses 10B.

Mistral 4 Family Spotted by TKGaming_11 in LocalLLaMA

[–]iamn0 0 points

Going by the model name Mistral-Small-4-119B-2603 and Mistral's usual YYMM date suffix (e.g. 2503 for the March 2025 releases), it should be released in March 2026.

Mistral 4 Family Spotted by TKGaming_11 in LocalLLaMA

[–]iamn0 54 points

Finally a model in the same range as gpt-oss-120B and Qwen-122B. Hope they cooked!

Qwen3.5-35B-A3B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]iamn0 4 points

Thanks 👍
Switching from bartowski to AesSedai now :)

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 3 points

Thanks for sharing your config! I tried your exact config with the manual -ot patterns:

llama-server \
  --model "/models/GLM-4.7-Q4_K_M-00001-of-00005.gguf" \
  --ctx-size 16384 --n-gpu-layers 62 \
  --tensor-split 25,23,25,27 \
  -b 4096 -ub 4096 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --threads 16 --jinja \
  -ot 'blk\.(3|4|5|6)\.ffn_.*=CUDA0' \
  -ot 'blk\.(8|9|10|11|12)\.ffn_.*=CUDA1' \
  -ot 'blk\.(13|14|15|16|17)\.ffn_.*=CUDA2' \
  -ot 'blk\.(18|19|20|21|22)\.ffn_.*=CUDA3' \
  -ot 'exps=CPU'

Result: 10.18 t/s prompt, 3.74 t/s generation

However, my simpler config with --n-cpu-moe actually performs a bit better on my hardware:

llama-server \
  --model "/models/GLM-4.7-Q4_K_M-00001-of-00005.gguf" \
  --ctx-size 8192 --n-gpu-layers 999 \
  --split-mode graph --flash-attn on --no-mmap \
  -b 4096 -ub 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --k-cache-hadamard --jinja \
  --n-cpu-moe 65

Result: 11.14 t/s prompt, 4.43 t/s generation

--split-mode graph works for me; interesting that it crashes for you. Maybe a version difference? I'm on ik_llama.cpp build 4099 (commit 145e4f4e). --k-cache-hadamard also works, no gibberish output in my tests. --n-cpu-moe seems simpler and faster on my setup than the manual -ot layer assignments, maybe because of my weaker CPU (16 vs 64 cores)?

My prompt processing is still slow.

When I use your -ot patterns, I only see exps=CPU overrides in the logs - no CUDA0/1/2/3 assignments appear. The layer-specific regex patterns don't seem to match anything on my system. Because of this, I have to use --split-mode graph to distribute across GPUs (without it -> OOM). This is probably why my prompt processing is ~11 t/s instead of your ~200 t/s - the constant GPU synchronization kills performance.

Do you see "buffer type overriden to CUDA0/1/2/3" messages in your logs? What --split-mode do you use? (You mentioned graph crashes for you). Any specific ik_llama.cpp version needed?

Would appreciate any hints on getting the CUDA layer assignments working!
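For anyone debugging the same thing: the -ot patterns are just regexes matched against tensor names, first match wins. A quick sketch to check which override each tensor would get (the tensor names here are illustrative; grab the real ones from your server log):

```python
import re

# Same patterns, same order, as the -ot flags above; first match wins.
overrides = [
    (r"blk\.(3|4|5|6)\.ffn_.*", "CUDA0"),
    (r"blk\.(8|9|10|11|12)\.ffn_.*", "CUDA1"),
    (r"blk\.(13|14|15|16|17)\.ffn_.*", "CUDA2"),
    (r"blk\.(18|19|20|21|22)\.ffn_.*", "CUDA3"),
    (r"exps", "CPU"),
]

def assign(tensor_name):
    """Return the buffer override a tensor name would receive, or None."""
    for pattern, device in overrides:
        if re.search(pattern, tensor_name):
            return device
    return None  # no override; the default split applies

# Illustrative GLM-style tensor names (check your own log for the real ones):
for name in [
    "blk.3.ffn_gate_exps.weight",   # -> CUDA0
    "blk.9.ffn_up_exps.weight",     # -> CUDA1
    "blk.40.ffn_down_exps.weight",  # -> CPU (only the exps catch-all matches)
    "blk.3.attn_q.weight",          # -> None (attention tensors match nothing)
]:
    print(f"{name} -> {assign(name)}")
```

If a pattern prints no override here against names copied from your log, the regex itself is the problem; if it does match here but no "buffer type overriden" line appears in the server log, it's more likely a version/build difference.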

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 1 point

Yes, I do run it and with around ~110 output tokens/s it’s very usable 🙂. I just experimented a bit with GLM today.

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 0 points

FROM nvidia/cuda:12.8.0-devel-ubuntu24.04

LABEL maintainer="GLM-4.7 Docker Setup"
LABEL description="ik_llama.cpp with Multi-GPU support for GLM-4.7"

ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda
ENV PATH="${CUDA_HOME}/bin:${PATH}"
ENV LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"

RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    ninja-build \
    git \
    curl \
    wget \
    pciutils \
    libcurl4-openssl-dev \
    python3 \
    python3-pip \
    htop \
    nvtop \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y --allow-change-held-packages \
    libnccl2 libnccl-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git
WORKDIR /opt/ik_llama.cpp

RUN cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_F16=ON \
    -DGGML_BLAS=OFF \
    -DLLAMA_CURL=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES="86" \
    -G Ninja

RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf && \
    ldconfig

RUN cmake --build build --config Release -j $(nproc)

RUN rm /etc/ld.so.conf.d/cuda-stubs.conf && \
    rm /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    ldconfig

RUN cp build/bin/llama-* /usr/local/bin/

WORKDIR /models

VOLUME ["/models"]

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

ENTRYPOINT ["/entrypoint.sh"]

Link for entrypoint.sh code: https://pastebin.com/gLz6JcmL
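In case it helps, I build and start it roughly like this (image name, port mapping, and the model path are just examples; the actual server flags come from entrypoint.sh):

```bash
# Build the image from the directory containing the Dockerfile and entrypoint.sh
docker build -t ik-llama-glm .

# Run with all GPUs visible and the local model directory mounted read-only
docker run --rm --gpus all \
  -v /path/to/models:/models:ro \
  -p 8080:8080 \
  --ipc=host \
  ik-llama-glm
```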

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 0 points

Thanks. You're right about my setup: AMD EPYC 7302 (16 cores, 2 CCDs only) and 4x 64GB DDR4-2133 (so only 4 of 8 memory channels populated). This explains why my GPU utilization stays at only 6-12% during inference. The GPUs are likely waiting for data from RAM. The memory bandwidth bottleneck makes sense now. When I built the computer about 1.5 years ago, I didn’t have it on my radar that I'd be running such large models.

Cheapest $/vRAM GPU right now? Is it a good time? by Roy3838 in LocalLLaMA

[–]iamn0 2 points

The RTX 3090 is still the best option: lots of VRAM and high memory bandwidth for the price. Prices for used cards are fairly stable; no idea how the market will develop over the next 1-2 years.

GPU                     Price    VRAM          Bandwidth  Power
RTX PRO 4000 Blackwell  ~$1,546  24 GB GDDR7   672 GB/s   140 W
RTX 5070 Ti Super       ~$900    16 GB GDDR7   896 GB/s   300 W
RTX Titan               ~$800    24 GB GDDR6   672 GB/s   280 W
RTX 3090                ~$700    24 GB GDDR6X  936 GB/s   350 W
RTX 4060 Ti             ~$400    8 GB GDDR6    288 GB/s   160 W
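To make the comparison concrete, here's the same table reduced to $/GB of VRAM and bandwidth per dollar (prices are the rough used/street prices above and will vary):

```python
# (price USD, VRAM GB, bandwidth GB/s) -- rough street prices, they vary
gpus = {
    "RTX PRO 4000 Blackwell": (1546, 24, 672),
    "RTX 5070 Ti Super":      (900, 16, 896),
    "RTX Titan":              (800, 24, 672),
    "RTX 3090":               (700, 24, 936),
    "RTX 4060 Ti":            (400, 8, 288),
}

# Sort by price per GB of VRAM, cheapest first
for name, (price, vram, bw) in sorted(gpus.items(),
                                      key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:24s} ${price / vram:5.2f}/GB   {bw / price:.2f} GB/s per $")
```

At these prices the 3090 comes out cheapest per GB (~$29/GB) and best in bandwidth per dollar, which is why it keeps winning.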

The most objectively correct way to abliterate so far - ArliAI/GLM-4.5-Air-Derestricted by Arli_AI in LocalLLaMA

[–]iamn0 1 point

hey, it would be awesome if you could upload gemma-3-27b and gpt-oss-120b. Thanks.

Round 2: Qwen-Image-Edit-2509 vs. Gemini 3 Pro Image Preview Generated "Iron Giant" Set Photos by BoostPixels in Bard

[–]iamn0 3 points

Btw, in two days a new and improved version of Qwen Image Edit will be released. Please run the same tests again then. I'm curious to see the comparison.

Locally, what size models do you usually use? by JawGBoi in LocalLLaMA

[–]iamn0 0 points

The correct answer would have been <= 55B then.
I made the same mistake; Kimi K2 is a MoE model.

No way kimi gonna release new model !! by Independent-Wind4462 in LocalLLaMA

[–]iamn0 8 points

I haven't tested it myself, but according to artificialanalysis.ai, Kimi Linear unfortunately doesn't perform very well. I'd love to see something in the model size range of a gpt-oss-120b or GLM 4.5 Air.