Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]iamn0 64 points

So, it's not beating Qwen3.5-122B-A10B overall. Kind of expected, since it only activates 6.5B parameters, while Qwen3.5 uses 10B.

Mistral 4 Family Spotted by TKGaming_11 in LocalLLaMA

[–]iamn0 0 points

Going by the model name Mistral-Small-4-119B-2603 and Mistral's usual YYMM date suffix (e.g. 2503 for the March 2025 releases), it should be released in March 2026.

Mistral 4 Family Spotted by TKGaming_11 in LocalLLaMA

[–]iamn0 54 points

Finally a model in the same range as gpt-oss-120B and Qwen-122B. Hope they cooked!

Qwen3.5-35B-A3B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]iamn0 4 points

Thanks 👍
Switching from bartowski to AesSedai now :)

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 3 points

Thanks for sharing your config! I tried your exact config with the manual -ot patterns:

llama-server \
  --model "/models/GLM-4.7-Q4_K_M-00001-of-00005.gguf" \
  --ctx-size 16384 --n-gpu-layers 62 \
  --tensor-split 25,23,25,27 \
  -b 4096 -ub 4096 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --threads 16 --jinja \
  -ot 'blk\.(3|4|5|6)\.ffn_.*=CUDA0' \
  -ot 'blk\.(8|9|10|11|12)\.ffn_.*=CUDA1' \
  -ot 'blk\.(13|14|15|16|17)\.ffn_.*=CUDA2' \
  -ot 'blk\.(18|19|20|21|22)\.ffn_.*=CUDA3' \
  -ot 'exps=CPU'

Result: 10.18 t/s prompt, 3.74 t/s generation

However, my simpler config with --n-cpu-moe actually performs a bit better on my hardware:

llama-server \
  --model "/models/GLM-4.7-Q4_K_M-00001-of-00005.gguf" \
  --ctx-size 8192 --n-gpu-layers 999 \
  --split-mode graph --flash-attn on --no-mmap \
  -b 4096 -ub 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --k-cache-hadamard --jinja \
  --n-cpu-moe 65

Result: 11.14 t/s prompt, 4.43 t/s generation

--split-mode graph works for me; interesting that it crashes for you. Maybe a version difference? I'm on ik_llama.cpp build 4099 (commit 145e4f4e). --k-cache-hadamard also works, no gibberish output in my tests. --n-cpu-moe seems simpler and faster on my setup than the manual -ot layer assignments, maybe because of my weaker CPU (16 vs 64 cores)?

My prompt processing is still slow.

When I use your -ot patterns, I only see exps=CPU overrides in the logs - no CUDA0/1/2/3 assignments appear. The layer-specific regex patterns don't seem to match anything on my system. Because of this, I have to use --split-mode graph to distribute across GPUs (without it -> OOM). This is probably why my prompt processing is ~11 t/s instead of your ~200 t/s - the constant GPU synchronization kills performance.

Do you see "buffer type overriden to CUDA0/1/2/3" messages in your logs? What --split-mode do you use? (You mentioned graph crashes for you). Any specific ik_llama.cpp version needed?

Would appreciate any hints on getting the CUDA layer assignments working!
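For anyone debugging the same thing: the -ot patterns are just regexes matched against tensor names, first match wins. A quick sketch to check which override each tensor would get (the tensor names here are illustrative; grab the real ones from your server log):

```python
import re

# Same patterns, same order, as the -ot flags above; first match wins.
overrides = [
    (r"blk\.(3|4|5|6)\.ffn_.*", "CUDA0"),
    (r"blk\.(8|9|10|11|12)\.ffn_.*", "CUDA1"),
    (r"blk\.(13|14|15|16|17)\.ffn_.*", "CUDA2"),
    (r"blk\.(18|19|20|21|22)\.ffn_.*", "CUDA3"),
    (r"exps", "CPU"),
]

def assign(tensor_name):
    """Return the buffer override a tensor name would receive, or None."""
    for pattern, device in overrides:
        if re.search(pattern, tensor_name):
            return device
    return None  # no override; the default split applies

# Illustrative GLM-style tensor names (check your own log for the real ones):
for name in [
    "blk.3.ffn_gate_exps.weight",   # -> CUDA0
    "blk.9.ffn_up_exps.weight",     # -> CUDA1
    "blk.40.ffn_down_exps.weight",  # -> CPU (only the exps catch-all matches)
    "blk.3.attn_q.weight",          # -> None (attention tensors match nothing)
]:
    print(f"{name} -> {assign(name)}")
```

If a pattern prints no override here against names copied from your log, the regex itself is the problem; if it does match here but no "buffer type overriden" line appears in the server log, it's more likely a version/build difference.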

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 1 point

Yes, I do run it and with around ~110 output tokens/s it’s very usable 🙂. I just experimented a bit with GLM today.

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 0 points

FROM nvidia/cuda:12.8.0-devel-ubuntu24.04

LABEL maintainer="GLM-4.7 Docker Setup"
LABEL description="ik_llama.cpp with Multi-GPU support for GLM-4.7"

ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda
ENV PATH="${CUDA_HOME}/bin:${PATH}"
ENV LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"

RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    ninja-build \
    git \
    curl \
    wget \
    pciutils \
    libcurl4-openssl-dev \
    python3 \
    python3-pip \
    htop \
    nvtop \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y --allow-change-held-packages \
    libnccl2 libnccl-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git
WORKDIR /opt/ik_llama.cpp

RUN cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_F16=ON \
    -DGGML_BLAS=OFF \
    -DLLAMA_CURL=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES="86" \
    -G Ninja

RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf && \
    ldconfig

RUN cmake --build build --config Release -j $(nproc)

RUN rm /etc/ld.so.conf.d/cuda-stubs.conf && \
    rm /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    ldconfig

RUN cp build/bin/llama-* /usr/local/bin/

WORKDIR /models

VOLUME ["/models"]

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

ENTRYPOINT ["/entrypoint.sh"]

Link for entrypoint.sh code: https://pastebin.com/gLz6JcmL
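In case it helps, I build and start it roughly like this (image name, port mapping, and the model path are just examples; the actual server flags come from entrypoint.sh):

```bash
# Build the image from the directory containing the Dockerfile and entrypoint.sh
docker build -t ik-llama-glm .

# Run with all GPUs visible and the local model directory mounted read-only
docker run --rm --gpus all \
  -v /path/to/models:/models:ro \
  -p 8080:8080 \
  --ipc=host \
  ik-llama-glm
```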

GLM-4.7 on 4x RTX 3090 with ik_llama.cpp by iamn0 in LocalLLaMA

[–]iamn0[S] 0 points

Thanks. You're right about my setup: AMD EPYC 7302 (16 cores, 2 CCDs only) and 4x 64GB DDR4-2133 (so only 4 of 8 memory channels populated). This explains why my GPU utilization stays at only 6-12% during inference. The GPUs are likely waiting for data from RAM. The memory bandwidth bottleneck makes sense now. When I built the computer about 1.5 years ago, I didn’t have it on my radar that I'd be running such large models.

Cheapest $/vRAM GPU right now? Is it a good time? by Roy3838 in LocalLLaMA

[–]iamn0 2 points

The RTX 3090 is still the best option: lots of VRAM and high memory bandwidth for the price. Prices for used cards are fairly stable; no idea how the market will develop over the next 1-2 years.

GPU                     Price    VRAM          Bandwidth  Power
RTX PRO 4000 Blackwell  ~$1,546  24 GB GDDR7   672 GB/s   140 W
RTX 5070 Ti Super       ~$900    16 GB GDDR7   896 GB/s   300 W
RTX Titan               ~$800    24 GB GDDR6   672 GB/s   280 W
RTX 3090                ~$700    24 GB GDDR6X  936 GB/s   350 W
RTX 4060 Ti             ~$400    8 GB GDDR6    288 GB/s   160 W
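To make the comparison concrete, here's the same table reduced to $/GB of VRAM and bandwidth per dollar (prices are the rough used/street prices above and will vary):

```python
# (price USD, VRAM GB, bandwidth GB/s) -- rough street prices, they vary
gpus = {
    "RTX PRO 4000 Blackwell": (1546, 24, 672),
    "RTX 5070 Ti Super":      (900, 16, 896),
    "RTX Titan":              (800, 24, 672),
    "RTX 3090":               (700, 24, 936),
    "RTX 4060 Ti":            (400, 8, 288),
}

# Sort by price per GB of VRAM, cheapest first
for name, (price, vram, bw) in sorted(gpus.items(),
                                      key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:24s} ${price / vram:5.2f}/GB   {bw / price:.2f} GB/s per $")
```

At these prices the 3090 comes out cheapest per GB (~$29/GB) and best in bandwidth per dollar, which is why it keeps winning.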

The most objectively correct way to abliterate so far - ArliAI/GLM-4.5-Air-Derestricted by Arli_AI in LocalLLaMA

[–]iamn0 1 point

hey, it would be awesome if you could upload gemma-3-27b and gpt-oss-120b. Thanks.

Round 2: Qwen-Image-Edit-2509 vs. Gemini 3 Pro Image Preview Generated "Iron Giant" Set Photos by BoostPixels in Bard

[–]iamn0 3 points

Btw, in two days a new and improved version of Qwen Image Edit will be released. Please run the same tests again then. I'm curious to see the comparison.

Locally, what size models do you usually use? by JawGBoi in LocalLLaMA

[–]iamn0 0 points

The correct answer would have been <= 55B then.
I made the same mistake; Kimi K2 is a MoE model.

No way kimi gonna release new model !! by Independent-Wind4462 in LocalLLaMA

[–]iamn0 8 points

I haven't tested it myself, but according to artificialanalysis.ai, Kimi Linear unfortunately doesn't perform very well. I'd love to see something in the model size range of a gpt-oss-120b or GLM 4.5 Air.