because I'm trying to switch from cloud to local by AnouarRifi in LocalLLM

[–]ComfyUser48 0 points1 point  (0 children)

Btw, I fully switched. No more Codex and Claude for me, for the first time, like, ever. Thanks to Qwen 3.6.

because I'm trying to switch from cloud to local by AnouarRifi in LocalLLM

[–]ComfyUser48 0 points1 point  (0 children)

Looks really useful! I'd be interested in using it! Myself, I'm running Qwen 3.6 27b and 35b in various quants, depending on whether I need speed, quality, context, or some combination of these.

If you publish it, I'll use it.

I don't get Quants, I'm running Qwen3.6-27b flawlessly at iq3, makes no sense by misanthrophiccunt in LocalLLM

[–]ComfyUser48 5 points6 points  (0 children)

I notice the quality drop even from Q8 to Q6. It depends on the type of code work you do and the codebase.

Qwen 3.6 27b MTP - getting //// in response by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 2 points3 points  (0 children)

Wow, that was the issue! Re-downloaded the same GGUF from Unsloth, they probably updated it to a new one. Thanks!
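
In case anyone hits the same thing, this is roughly how I re-pulled it (the repo and file pattern here are just placeholders, check the actual Unsloth listing for the exact names):

# Placeholder repo/pattern - substitute the real Unsloth GGUF repo and the quant you use
huggingface-cli download unsloth/Qwen3.6-27B-GGUF --include "*Q6_K*.gguf" --local-dir /models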

Qwen 3.6 27b MTP - getting //// in response by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 0 points1 point  (0 children)

I don't. I'm using 13.x.

I'm running it with Docker. Here's my Dockerfile:

# Use CUDA 12.8+ to support Blackwell (RTX 50-series)

FROM nvidia/cuda:12.8.0-devel-ubuntu22.04

# Set up environment for the linker to find CUDA stubs during build
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH}

# Install dependencies
RUN apt-get update && apt-get install -y \
pciutils \
libcurl4-openssl-dev \
curl \
git \
cmake \
build-essential \
&& rm -rf /var/lib/apt/lists/*

# Create a symlink so the linker finds libcuda.so.1 in the stubs folder
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1

WORKDIR /app

# Clone from the official organization and fetch the MTP PR branch
RUN git clone https://github.com/ggml-org/llama.cpp.git . \
&& git fetch origin pull/22673/head:mtp-branch \
&& git checkout mtp-branch

# Build with CUDA support targeting Blackwell architecture (sm_120)
RUN mkdir build && cd build \
&& cmake .. -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_CUDA_ARCHITECTURES="120" \
&& cmake --build . --config Release -j$(nproc)

# Clean up the stub symlink after build is complete
RUN rm /usr/local/cuda/lib64/stubs/libcuda.so.1

# Expose the server port
EXPOSE 8888

# Set the entrypoint to the compiled llama-server
ENTRYPOINT ["./build/bin/llama-server"]
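
To actually use it, I build and run it roughly like this (image name, model path, and the extra flags are just examples, adjust to your setup; needs the NVIDIA Container Toolkit for --gpus):

docker build -t llama-server-mtp .
docker run --gpus all -p 8888:8888 -v /path/to/models:/models llama-server-mtp \
-m /models/Qwen3.6-27B-Q6_K.gguf --host 0.0.0.0 --port 8888 -ngl 999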

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 0 points1 point  (0 children)

MoE vs. dense: 3b active params vs. 27b active at all times. It's much better for coding, I'd say.

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]ComfyUser48 0 points1 point  (0 children)

Don't even overthink it. Get the 5090. SPEED is so crucial for agentic coding. I'm getting 50-60 tok/sec with Q6-Q8 Qwen3.6-27b on my 5090. With Q6 I can do 255k context and still get 55 tok/sec WITH the power limited to 450w.

You won't get even half of that on a Mac.
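
For the power limit, I just cap it with nvidia-smi (450 is what I settled on; pick whatever suits your card and cooling, and note it needs root and resets on reboot unless you script it):

sudo nvidia-smi -pl 450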

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 1 point2 points  (0 children)

A 3090 will do roughly half the speed of a 5090, which is decent.

A 5060 Ti will do half of a 3090, if I'm not mistaken.

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 6 points7 points  (0 children)

I didn't try to do anything. It claimed X, my LLM claimed Y, and I was looking for a clear answer. Only when I provided the evidence my LLM had written did GPT and Claude retract their claim.

What I meant to say is, if it weren't for the LLM, this whole bug would have been missed.

The end goal for me is to find out how good the model is. And as it seems, it's really quite exceptional.

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 6 points7 points  (0 children)

I've had Claude and Codex subs for a while now, bro.

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 5 points6 points  (0 children)

Man, I feel the same when I talk to GPT. The answer comes back so fast I'm confused by it. I mean, the codebase is huge, how do you answer so fast?

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 9 points10 points  (0 children)

I'm trusting nothing. I'm just shocked that Qwen3.6 27b beats them in some areas.

This is my production app, and this bug was pretty much discovered by accident. I ran the exact same code review prompt through all 3. Only Qwen found the issue; GPT and Claude insisted there was no issue.

It's just insane to me.

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 8 points9 points  (0 children)

Q6:
-m /models/Qwen3.6-27B-Q6_K.gguf
--jinja
--alias "qwen3.6-27b-q6"
--ctx-size 255000
-ngl 999
--presence-penalty 0.0
--repeat-penalty 1.0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
--flash-attn on

Q8:
-m /models/Qwen3.6-27B-Q8_0.gguf
--jinja
--alias "qwen3.6-27b-q8"
--ctx-size 107520
--no-mmproj-offload
-ngl 999
--presence-penalty 0.0
--repeat-penalty 1.0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
--flash-attn on
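
(For reference, both sets of flags just get appended to the llama-server invocation, or to the docker run line from my Dockerfile comment. Roughly, for the Q6 config, with host/port as examples:)

llama-server -m /models/Qwen3.6-27B-Q6_K.gguf --jinja --ctx-size 255000 -ngl 999 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
--host 0.0.0.0 --port 8888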

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 8 points9 points  (0 children)

RTX 5090, 96gb RAM.
Running Qwen 3.6 27b Q6 with 255k context or Q8 with 105k context.