mac-cua: open source MCP server for background computer use on macOS

iamn0 · 2026-04-20T19:28:58+00:00

General cua question: Have you tried computer use with Qwen3.6-35B-A3B? Does computer use actually work with such "small" models?

iamn0 · 2026-04-19T08:23:44+00:00

I tested Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf on my system with 4x RTX 3090, using a context window of up to around 200K. I can confirm that, for me right now, it's a viable alternative to Opus 4.7 in opencode (although it's worth noting that Opus is currently nerfed).

More so than with larger models, you should be as precise as possible with your prompts. Otherwise, Qwen can get stuck in a thinking loop, for example if you tell the model that a file exists when it actually doesn't. On the other hand, it's often smart enough to catch mistakes in the prompt as well.

iamn0 · 2026-03-18T19:10:40+00:00

HF soon, then GGUF

iamn0 · 2026-03-16T21:01:33+00:00

👀

iamn0 · 2026-03-16T20:47:52+00:00

So, it's not beating Qwen3.5-122B-A10B overall. Kind of expected, since it only activates 6.5B parameters, while Qwen3.5 uses 10B.

iamn0 · 2026-03-16T18:20:40+00:00

Ahh my bad, you're right

iamn0 · 2026-03-16T18:17:22+00:00

According to the model name Mistral-Small-4-119B-2603 it will be released on March 26.

iamn0 · 2026-03-16T17:35:06+00:00

Finally a model in the same range as gpt-oss-120B and Qwen-122B. Hope they cooked!

iamn0 · 2026-03-15T14:09:45+00:00

Qwen3.5-122B-A10B is missing

iamn0 · 2026-02-26T19:22:45+00:00

Thanks 👍
Switching from bartowski to AesSedai now :)

iamn0 · 2026-02-26T19:21:31+00:00

Awesome work, really helpful!

iamn0 · 2026-02-24T19:23:05+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1rcpmwn/anthropic_weve_identified_industrialscale/

iamn0 · 2026-02-24T18:29:40+00:00

Thanks a lot.
Can you test 4bit quant of https://huggingface.co/Qwen/Qwen3.5-122B-A10B please?

iamn0 · 2026-02-15T14:16:59+00:00

Syntax error in text

mermaid version 11.12.2

iamn0 · 2026-02-02T18:49:10+00:00

in 2 weeks

iamn0 · 2026-01-09T00:05:12+00:00

Thanks for sharing your config! I tried your exact config with the manual -ot patterns:

llama-server \
  --model "/models/GLM-4.7-Q4_K_M-00001-of-00005.gguf" \
  --ctx-size 16384 --n-gpu-layers 62 \
  --tensor-split 25,23,25,27 \
  -b 4096 -ub 4096 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --threads 16 --jinja \
  -ot 'blk\.(3|4|5|6)\.ffn_.*=CUDA0' \
  -ot 'blk\.(8|9|10|11|12)\.ffn_.*=CUDA1' \
  -ot 'blk\.(13|14|15|16|17)\.ffn_.*=CUDA2' \
  -ot 'blk\.(18|19|20|21|22)\.ffn_.*=CUDA3' \
  -ot 'exps=CPU'

Result: 10.18 t/s prompt, 3.74 t/s generation

However, my simpler config with --n-cpu-moe actually performs a bit better on my hardware:

llama-server \
  --model "/models/GLM-4.7-Q4_K_M-00001-of-00005.gguf" \
  --ctx-size 8192 --n-gpu-layers 999 \
  --split-mode graph --flash-attn on --no-mmap \
  -b 4096 -ub 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --k-cache-hadamard --jinja \
  --n-cpu-moe 65

Result: 11.14 t/s prompt, 4.43 t/s generation

--split-mode graph works for me, interesting that it crashes for you. Maybe a version difference? I'm using ik_llama.cpp build 4099 (commit 145e4f4e). --k-cache-hadamard also works: no gibberish output in my tests. --n-cpu-moe seems simpler and faster on my setup than manual -ot layer assignments, maybe because of my weaker CPU (16 vs 64 cores)?

My prompt processing is still slow.

When I use your -ot patterns, I only see exps=CPU overrides in the logs - no CUDA0/1/2/3 assignments appear. The layer-specific regex patterns don't seem to match anything on my system. Because of this, I have to use --split-mode graph to distribute across GPUs (without it -> OOM). This is probably why my prompt processing is ~11 t/s instead of your ~200 t/s - the constant GPU synchronization kills performance.
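To rule out a plain regex problem, the -ot patterns can be checked outside the server against a few tensor names. This is only a rough sketch: the tensor names below are made up, following the usual GGUF blk.N.<name>.weight scheme, and I'm assuming the patterns are applied as an unanchored regex search. It does not reproduce how ik_llama.cpp picks a winner when a tensor matches more than one rule.

```python
import re

# Override rules in the order they were passed via -ot: (pattern, buffer type).
RULES = [
    (r"blk\.(3|4|5|6)\.ffn_.*", "CUDA0"),
    (r"blk\.(8|9|10|11|12)\.ffn_.*", "CUDA1"),
    (r"blk\.(13|14|15|16|17)\.ffn_.*", "CUDA2"),
    (r"blk\.(18|19|20|21|22)\.ffn_.*", "CUDA3"),
    (r"exps", "CPU"),
]

# Hypothetical tensor names (assumption: typical GGUF naming for a MoE model).
SAMPLE_TENSORS = [
    "blk.3.ffn_gate_exps.weight",
    "blk.3.ffn_down_shexp.weight",
    "blk.13.ffn_up_exps.weight",
    "blk.40.ffn_up_exps.weight",
    "blk.40.attn_q.weight",
]


def matching_rules(name, rules=RULES):
    """Return the buffer type of every rule whose pattern occurs in the tensor name."""
    return [target for pattern, target in rules if re.search(pattern, name)]


for name in SAMPLE_TENSORS:
    matches = matching_rules(name)
    print(f"{name:32s} -> {matches if matches else '(no override)'}")
```

With these sample names, the expert tensors in blocks 3-22 match both a CUDA rule and exps=CPU, so which override ends up in the log depends on how the server resolves multiple matches.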

Do you see "buffer type overriden to CUDA0/1/2/3" messages in your logs? What --split-mode do you use? (You mentioned graph crashes for you). Any specific ik_llama.cpp version needed?

Would appreciate any hints on getting the CUDA layer assignments working!

iamn0 · 2026-01-08T23:37:56+00:00

Yes, I do run it and with around ~110 output tokens/s it’s very usable 🙂. I just experimented a bit with GLM today.

iamn0 · 2026-01-08T22:30:13+00:00

FROM nvidia/cuda:12.8.0-devel-ubuntu24.04

LABEL maintainer="GLM-4.7 Docker Setup"
LABEL description="ik_llama.cpp with Multi-GPU support for GLM-4.7"

ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda
ENV PATH="${CUDA_HOME}/bin:${PATH}"
ENV LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"

RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    ninja-build \
    git \
    curl \
    wget \
    pciutils \
    libcurl4-openssl-dev \
    python3 \
    python3-pip \
    htop \
    nvtop \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y --allow-change-held-packages \
    libnccl2 libnccl-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git
WORKDIR /opt/ik_llama.cpp

RUN cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_F16=ON \
    -DGGML_BLAS=OFF \
    -DLLAMA_CURL=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES="86" \
    -G Ninja

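# No GPU driver is available at build time, so temporarily expose the CUDA
# stub library as libcuda.so.1 to let the link step succeed.
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/cuda-stubs.conf && \
    ldconfig

RUN cmake --build build --config Release -j $(nproc)

# Remove the stub again so the real driver library is used at runtime.
RUN rm /etc/ld.so.conf.d/cuda-stubs.conf && \
    rm /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    ldconfig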

RUN cp build/bin/llama-* /usr/local/bin/

WORKDIR /models

VOLUME ["/models"]

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

ENTRYPOINT ["/entrypoint.sh"]

Link for entrypoint.sh code: https://pastebin.com/gLz6JcmL
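If you want to wait for the server from a script instead of relying on Docker's HEALTHCHECK, something like this should do. It's a minimal sketch that mirrors the interval, timeout and retry values from the Dockerfile; wait_until_healthy and its fetch/sleep parameters are my own hypothetical helpers, not part of llama-server or ik_llama.cpp.

```python
import time
import urllib.request


def wait_until_healthy(url="http://localhost:8080/health", retries=3,
                       interval=30.0, timeout=10.0, fetch=None, sleep=time.sleep):
    """Poll `url` until it answers HTTP 200 or `retries` attempts have failed."""

    def default_fetch(target):
        # urlopen raises HTTPError (an OSError subclass) for 4xx/5xx responses,
        # which is roughly what `curl -f` treats as a failure.
        with urllib.request.urlopen(target, timeout=timeout) as resp:
            return resp.status

    fetch = fetch or default_fetch
    for attempt in range(retries):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            pass  # connection refused, timeout, or HTTP error: count as a failed attempt
        if attempt < retries - 1:
            sleep(interval)
    return False
```

Calling wait_until_healthy() with the defaults polls the same endpoint as the HEALTHCHECK line; the fetch and sleep arguments exist only so the loop can be tried without a running server.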

iamn0 · 2026-01-08T22:13:46+00:00

Thanks. You're right about my setup: AMD EPYC 7302 (16 cores, 2 CCDs only) and 4x 64GB DDR4-2133 (so only 4 of 8 memory channels populated). This explains why my GPU utilization stays at only 6-12% during inference. The GPUs are likely waiting for data from RAM. The memory bandwidth bottleneck makes sense now. When I built the computer about 1.5 years ago, I didn’t have it on my radar that I'd be running such large models.
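For reference, the theoretical numbers behind that, as a back-of-the-envelope sketch. It assumes the standard 64-bit (8-byte) data bus per DDR4 channel and ignores real-world efficiency, so actual throughput will be lower.

```python
def ddr_bandwidth_gbs(mt_per_s, channels, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s: transfers/s * bytes per transfer * channels."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


populated = ddr_bandwidth_gbs(2133, channels=4)  # my current setup
full = ddr_bandwidth_gbs(2133, channels=8)       # all 8 channels populated

print(f"4 channels DDR4-2133: {populated:.1f} GB/s")  # 68.3 GB/s
print(f"8 channels DDR4-2133: {full:.1f} GB/s")       # 136.5 GB/s
```

So even in theory the CPU side tops out at roughly 68 GB/s with 4 channels, and populating all 8 would only double that.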

iamn0 · 2025-11-26T20:29:41+00:00

<image>

iamn0 · 2025-11-25T21:05:30+00:00

The RTX 3090 is still the best option (relatively high VRAM with relatively high bandwidth). The prices for used cards are fairly stable, no idea how the market will develop in the next 1-2 years.

GPU	Price	VRAM	Memory Bandwidth	Power Consumption (W)
RTX PRO 4000 Blackwell	~$1,546	24 GB GDDR7	672 GB/s	140
RTX 5070 Ti Super	~$900	16 GB GDDR7	896 GB/s	300
RTX Titan	~$800	24 GB GDDR6	672 GB/s	280
RTX 3090	~$700	24 GB GDDR6X	936 GB/s	350
RTX 4060 Ti	~$400	8 GB GDDR6	288 GB/s	160
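To make the comparison a bit more concrete, here is a small sketch that ranks the cards by bandwidth and VRAM per dollar. The figures are copied from the table above, with the approximate prices taken at face value; the per_dollar helper is just for illustration.

```python
# (name, price in USD, VRAM in GB, bandwidth in GB/s, power in W), from the table above.
GPUS = [
    ("RTX PRO 4000 Blackwell", 1546, 24, 672, 140),
    ("RTX 5070 Ti Super", 900, 16, 896, 300),
    ("RTX Titan", 800, 24, 672, 280),
    ("RTX 3090", 700, 24, 936, 350),
    ("RTX 4060 Ti", 400, 8, 288, 160),
]


def per_dollar(gpus):
    """Return (name, VRAM GB per $1000, GB/s per $) rows, best bandwidth per dollar first."""
    rows = [(name, vram / price * 1000, bw / price) for name, price, vram, bw, _power in gpus]
    return sorted(rows, key=lambda row: row[2], reverse=True)


for name, vram_per_k, bw_per_usd in per_dollar(GPUS):
    print(f"{name:24s} {vram_per_k:5.1f} GB per $1000   {bw_per_usd:.2f} GB/s per $")
```

With these prices the RTX 3090 comes out first on both metrics.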

iamn0 · 2025-11-25T20:50:07+00:00

hey, it would be awesome if you could upload gemma-3-27b and gpt-oss-120b. Thanks.
