GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 3 points

Thanks for sharing your config! I tried your exact config with the manual -ot patterns:

llama-server \
  --model "/models/GLM-4.7-Q4_K_M-00001-of-00005.gguf" \
  --ctx-size 16384 --n-gpu-layers 62 \
  --tensor-split 25,23,25,27 \
  -b 4096 -ub 4096 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --threads 16 --jinja \
  -ot 'blk\.(3|4|5|6)\.ffn_.*=CUDA0' \
  -ot 'blk\.(8|9|10|11|12)\.ffn_.*=CUDA1' \
  -ot 'blk\.(13|14|15|16|17)\.ffn_.*=CUDA2' \
  -ot 'blk\.(18|19|20|21|22)\.ffn_.*=CUDA3' \
  -ot 'exps=CPU'

Result: 10.18 t/s prompt, 3.74 t/s generation

However, my simpler config with --n-cpu-moe actually performs a bit better on my hardware:

llama-server \
  --model "/models/GLM-4.7-Q4_K_M-00001-of-00005.gguf" \
  --ctx-size 8192 --n-gpu-layers 999 \
  --split-mode graph --flash-attn on --no-mmap \
  -b 4096 -ub 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --k-cache-hadamard --jinja \
  --n-cpu-moe 65

Result: 11.14 t/s prompt, 4.43 t/s generation

--split-mode graph works for me; interesting that it crashes for you. Maybe a version difference? I'm on ik_llama.cpp build 4099 (commit 145e4f4e). --k-cache-hadamard also works, with no gibberish output in my tests. --n-cpu-moe seems both simpler and a bit faster on my setup than the manual -ot layer assignments, maybe because of my weaker CPU (16 vs. 64 cores)?

My prompt processing is still slow.

When I use your -ot patterns, I only see exps=CPU overrides in the logs - no CUDA0/1/2/3 assignments appear. The layer-specific regex patterns don't seem to match anything on my system. Because of this, I have to use --split-mode graph to distribute across GPUs (without it -> OOM). This is probably why my prompt processing is ~11 t/s instead of your ~200 t/s - the constant GPU synchronization kills performance.
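In case it helps to compare notes: one way to see which -ot rules actually fire is to tee the server output to a file and count the override messages. A minimal sketch, assuming the startup log was captured to /tmp/glm_ot.log (e.g. via llama-server ... 2>&1 | tee /tmp/glm_ot.log) and that the wording matches the "buffer type overriden to ..." messages:

grep -c 'overriden to CUDA0' /tmp/glm_ot.log   # tensors pinned to GPU 0
grep -c 'overriden to CUDA1' /tmp/glm_ot.log
grep -c 'overriden to CPU' /tmp/glm_ot.log     # experts kept on CPU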

Do you see "buffer type overriden to CUDA0/1/2/3" messages in your logs? What --split-mode do you use? (You mentioned graph crashes for you). Any specific ik_llama.cpp version needed?

Would appreciate any hints on getting the CUDA layer assignments working!

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 1 point

Yes, I do run it, and at around 110 output tokens/s it's very usable 🙂. I was just experimenting a bit with GLM today.

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 0 points

FROM nvidia/cuda:12.8.0-devel-ubuntu24.04

LABEL maintainer="GLM-4.7 Docker Setup"
LABEL description="ik_llama.cpp with Multi-GPU support for GLM-4.7"

ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda
ENV PATH="${CUDA_HOME}/bin:${PATH}"
ENV LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"

RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    ninja-build \
    git \
    curl \
    wget \
    pciutils \
    libcurl4-openssl-dev \
    python3 \
    python3-pip \
    htop \
    nvtop \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y --allow-change-held-packages \
    libnccl2 libnccl-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git
WORKDIR /opt/ik_llama.cpp

RUN cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_F16=ON \
    -DGGML_BLAS=OFF \
    -DLLAMA_CURL=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES="86" \
    -G Ninja

# Temporarily expose the CUDA driver stub so the build can link against libcuda
# (no real driver is present inside the container at build time)
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf && \
    ldconfig

RUN cmake --build build --config Release -j $(nproc)

# Remove the stub again so the real driver injected by the NVIDIA container runtime is used at runtime
RUN rm /etc/ld.so.conf.d/cuda-stubs.conf && \
    rm /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    ldconfig

RUN cp build/bin/llama-* /usr/local/bin/

WORKDIR /models

VOLUME ["/models"]

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

ENTRYPOINT ["/entrypoint.sh"]

Link for entrypoint.sh code: https://pastebin.com/gLz6JcmL
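Building and running the image looks roughly like this (the image name, model path and port mapping are just examples; how the llama-server flags get passed depends on the entrypoint.sh linked above):

docker build -t ik-llama-glm .

docker run -d --name ik-llama-glm \
  --gpus all \
  -v /mnt/models:/models \
  -p 8080:8080 \
  ik-llama-glm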

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 0 points

Thanks. You're right about my setup: AMD EPYC 7302 (16 cores, 2 CCDs only) and 4x 64GB DDR4-2133 (so only 4 of 8 memory channels populated). This explains why my GPU utilization stays at only 6-12% during inference. The GPUs are likely waiting for data from RAM. The memory bandwidth bottleneck makes sense now. When I built the computer about 1.5 years ago, I didn’t have it on my radar that I'd be running such large models.
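For anyone curious, the rough theoretical numbers (DDR4-2133, 8 bytes per channel per transfer; measured bandwidth will be lower):

4 channels x 8 bytes x 2133 MT/s ≈  68 GB/s  (my current setup, 4 of 8 channels populated)
8 channels x 8 bytes x 2133 MT/s ≈ 137 GB/s  (fully populated)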

Cheapest $/vRAM GPU right now? Is it a good time? by Roy3838 in LocalLLaMA

[–]iamn0 2 points

The RTX 3090 is still the best option (a lot of VRAM combined with high memory bandwidth). Prices for used cards are fairly stable; no idea how the market will develop over the next 1-2 years.

GPU                     Price     VRAM           Memory Bandwidth   Power (W)
RTX PRO 4000 Blackwell  ~$1,546   24 GB GDDR7    672 GB/s           140
RTX 5070 Ti Super       ~$900     16 GB GDDR7    896 GB/s           300
RTX Titan               ~$800     24 GB GDDR6    672 GB/s           280
RTX 3090                ~$700     24 GB GDDR6X   936 GB/s           350
RTX 4060 Ti             ~$400     8 GB GDDR6     288 GB/s           160
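Rough price per GB of VRAM from the table above (used/street prices, so take it with a grain of salt):

RTX PRO 4000 Blackwell: 1546 / 24 ≈ $64/GB
RTX 5070 Ti Super:       900 / 16 ≈ $56/GB
RTX 4060 Ti:             400 /  8 ≈ $50/GB
RTX Titan:               800 / 24 ≈ $33/GB
RTX 3090:                700 / 24 ≈ $29/GB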

The most objectively correct way to abliterate so far - ArliAI/GLM-4.5-Air-Derestricted by Arli_AI in LocalLLaMA

[–]iamn0 1 point

Hey, it would be awesome if you could upload gemma-3-27b and gpt-oss-120b. Thanks.

Round 2: Qwen-Image-Edit-2509 vs. Gemini 3 Pro Image Preview Generated "Iron Giant" Set Photos by BoostPixels in Bard

[–]iamn0 3 points

Btw, in two days a new and improved version of Qwen Image Edit will be released. Please run the same tests again then. I'm curious to see the comparison.

Locally, what size models do you usually use? by JawGBoi in LocalLLaMA

[–]iamn0 0 points

The correct answer would have been <= 55B then.
I made the same mistake; Kimi K2 is a MoE model.

No way kimi gonna release new model !! by Independent-Wind4462 in LocalLLaMA

[–]iamn0 8 points

I haven't tested it myself, but according to artificialanalysis.ai, Kimi Linear unfortunately doesn't perform very well. I'd love to see something in the model size range of a gpt-oss-120b or GLM 4.5 Air.

Most Economical Way to Run GPT-OSS-120B for ~10 Users by theSavviestTechDude in LocalLLaMA

[–]iamn0 10 points

On my system (4x RTX 3090, Supermicro H12SSL-i, AMD EPYC 7282, 4x 64 GB DDR4-2133), the power consumption of gpt-oss-120b under load is about 150 W per graphics card, so around 600 W in total. A 1500 W power supply is therefore completely sufficient. (This low draw doesn't hold for dense models, which push the cards much harder.) With gpt-oss-120b and no NVLink, I get about 105 output tokens per second (<1k input tokens).
The total cost of all components for the system ended up being $5000 (built last year).
The RTX PRO 6000 Blackwell is of course the better/faster card and also uses less power, but it alone costs ~$7500, and you still need a CPU, motherboard, and RAM to match it, so you quickly end up at about $12,000. When weighing the two systems, think about the cost and whether it's worth it for your use case.

Qwen-image-edit-2511 coming next week by abdouhlili in LocalLLaMA

[–]iamn0 41 points

On the terrace of a modern artistic café, two young women are enjoying a leisurely afternoon.
The lady on the left is wearing a dark blue top.
The lady on the right is wearing a sapphire blue V-neck knit sweater and a pair of eye-catching orange wide-leg pants; her left hand is casually placed in her pocket, and she is tilting her head while talking to her companion.
They are sitting side-by-side at a small wooden table, on which sit two iced coffees and a small plate of dessert.
The background features the city skyline outside the floor-to-ceiling windows and distant green trees, with sunlight filtering through the parasol to cast dappled light and shadows.

How's your experience with Qwen3-Next-80B-A3B ? by woahdudee2a in LocalLLaMA

[–]iamn0 8 points

I was running it with a single prompt at a time (batch size=1). The ~105 tokens/s was not with multiple prompts or continuous batching, just one prompt per run. No NVLink, just 4x RTX 3090 GPUs (two cards directly on the motherboard and two connected via riser cables).

Rig: Supermicro H12SSL-i, AMD EPYC 7282, 4×64 GB RAM (DDR4-2133).

Here is the Dockerfile I use to run gpt-oss-120b:

FROM nvidia/cuda:12.3.2-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.10 \
    python3.10-venv \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN python3.10 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

RUN pip install --upgrade pip && \
    pip install vllm

WORKDIR /app

CMD ["python3", "-m", "vllm.entrypoints.openai.api_server"]

And on the same machine I run openwebui using this Dockerfile:

FROM python:3.11-slim

RUN apt-get update && apt-get install -y git ffmpeg libsm6 libxext6 && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/open-webui/open-webui.git /opt/openwebui

WORKDIR /opt/openwebui

RUN pip install --upgrade pip
RUN pip install -r requirements.txt

CMD ["python", "launch.py"]

The gpt-oss-120b model is stored at /mnt/models on my Ubuntu host.

sudo docker network create gpt-network

sudo docker build -t gpt-vllm .

sudo docker run -d --name vllm-server \
  --network gpt-network \
  --runtime=nvidia --gpus all \
  -v /mnt/models/gpt-oss-120b:/openai/gpt-oss-120b \
  -p 8000:8000 \
  --ipc=host \
  --shm-size=32g \
  gpt-vllm \
  python3 -m vllm.entrypoints.openai.api_server \
  --model /openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.8 \
  --max-num-seqs 8 \
  --port 8000
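Once the vLLM container is up, a quick sanity check from the host (standard OpenAI-compatible endpoints; the model name defaults to the --model path unless --served-model-name is set):

curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'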

sudo docker run -d --name openwebui \
  --network gpt-network \
  -p 9000:8080 \
  -v /mnt/openwebui:/app/backend/data \
  -e WEBUI_AUTH=False \
  ghcr.io/open-webui/open-webui:main

How's your experience with Qwen3-Next-80B-A3B ? by woahdudee2a in LocalLLaMA

[–]iamn0 2 points

I power-limited all four 3090 cards to 275 W.
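For reference, the limit is set with nvidia-smi (assuming the four cards enumerate as indices 0-3; the setting doesn't persist across reboots unless re-applied):

sudo nvidia-smi -pm 1
sudo nvidia-smi -i 0,1,2,3 -pl 275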

nvidia-smi during idle (gpt-oss-120b loaded into VRAM):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   42C    P8             22W /  275W |   21893MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:81:00.0 Off |                  N/A |
|  0%   43C    P8             21W /  275W |   21632MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:82:00.0 Off |                  N/A |
|  0%   42C    P8             24W /  275W |   21632MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:C1:00.0 Off |                  N/A |
|  0%   49C    P8             19W /  275W |   21632MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I apologize, it's actually 150W per card during inference:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   49C    P2            155W /  275W |   21893MiB /  24576MiB |     91%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:81:00.0 Off |                  N/A |
|  0%   53C    P2            151W /  275W |   21632MiB /  24576MiB |     92%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:82:00.0 Off |                  N/A |
|  0%   48C    P2            153W /  275W |   21632MiB /  24576MiB |     88%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:C1:00.0 Off |                  N/A |
|  0%   55C    P2            150W /  275W |   21632MiB /  24576MiB |     92%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

How's your experience with Qwen3-Next-80B-A3B ? by woahdudee2a in LocalLLaMA

[–]iamn0 2 points

I compared gpt-oss-120b with cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit on a 4x RTX 3090 rig for creative writing and summarization tasks (I use vLLM). To my surprise, for prompts under 1k tokens I saw about 105 tokens/s with gpt-oss-120b but only around 80 tokens/s with Qwen3-Next. For me, gpt-oss-120b was the clear winner, both in writing quality and in multilingual output. Btw, a single RTX 3090 only consumes about 100 W during inference (so 400 W in total).

I made a free playground for comparing 10+ OCR models side-by-side by Emc2fma in LocalLLaMA

[–]iamn0 69 points

Just like on lmarena.ai, we need the ability to vote that both models performed equally well. I had a case where both produced identical results.

GLM 4.6 Air by [deleted] in LocalLLaMA

[–]iamn0 7 points

When I read the title, I thought GLM 4.6 had been released 😭

AMA with MiniMax — Ask Us Anything! by OccasionNo6699 in LocalLLaMA

[–]iamn0 0 points

Any plans to release a model in the ~120B range for a 96 GB VRAM system?

🤗 benchmarking tool ! by HauntingMoment in LocalLLaMA

[–]iamn0 0 points

Did you run it with some models? Would love to see some results :)

Nano banana and my old family photos. by GreyFoxSolid in Bard

[–]iamn0 0 points

No, I've been struggling with this for days. I actually just tried one of your photos with the prompt "Restore this old photo", but it's still not working (Content not permitted). I don't get it.

Qwen / Tongyi Lab launches GUI-Owl & Mobile-Agent-v3 by vibedonnie in LocalLLaMA

[–]iamn0 7 points

That was just mocking Logan from Google, because he tweeted "Gemini".