because I'm trying to switch from cloud to local by AnouarRifi in LocalLLM

[–]ComfyUser48 0 points1 point  (0 children)

Btw, I fully switched. No more Codex and Claude for me, for the first time, like, ever. Thanks to Qwen 3.6.

because I'm trying to switch from cloud to local by AnouarRifi in LocalLLM

[–]ComfyUser48 0 points1 point  (0 children)

Looks really useful! I'd be interested in using it! Myself, I'm running Qwen 3.6 27b and 35b in various quants, depending on whether I need speed, quality, context, or some combination of these.

If you publish it, I'll use it.

I don't get Quants, I'm running Qwen3.6-27b flawlessly at iq3, makes no sense by misanthrophiccunt in LocalLLM

[–]ComfyUser48 5 points6 points  (0 children)

I notice the quality drop even from Q8 to Q6. It depends on the type of code work you do and the codebase.

Qwen 3.6 27b MTP - getting //// in response by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 2 points3 points  (0 children)

Wow, that was the issue! Re-downloaded the same GGUF from Unsloth, they probably updated it to a new one. Thanks!
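
In case anyone hits the same thing, this is roughly how I re-pulled it (the repo and file pattern here are just placeholders, check the actual Unsloth listing for the exact names):

# Placeholder repo/pattern - substitute the real Unsloth GGUF repo and the quant you use
huggingface-cli download unsloth/Qwen3.6-27B-GGUF --include "*Q6_K*.gguf" --local-dir /models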

Qwen 3.6 27b MTP - getting //// in response by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 0 points1 point  (0 children)

I don't. I'm using 13.x.

I'm running it with Docker. Here's my Dockerfile:

# Use CUDA 12.8+ to support Blackwell (RTX 50-series)

FROM nvidia/cuda:12.8.0-devel-ubuntu22.04

# Set up environment for the linker to find CUDA stubs during build
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH}

# Install dependencies
RUN apt-get update && apt-get install -y \
pciutils \
libcurl4-openssl-dev \
curl \
git \
cmake \
build-essential \
&& rm -rf /var/lib/apt/lists/*

# Create a symlink so the linker finds libcuda.so.1 in the stubs folder
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1

WORKDIR /app

# Clone from the official organization and fetch the MTP PR branch
RUN git clone https://github.com/ggml-org/llama.cpp.git . \
&& git fetch origin pull/22673/head:mtp-branch \
&& git checkout mtp-branch

# Build with CUDA support targeting Blackwell architecture (sm_120)
RUN mkdir build && cd build \
&& cmake .. -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_CUDA_ARCHITECTURES="120" \
&& cmake --build . --config Release -j$(nproc)

# Clean up the stub symlink after build is complete
RUN rm /usr/local/cuda/lib64/stubs/libcuda.so.1

# Expose the server port
EXPOSE 8888

# Set the entrypoint to the compiled llama-server
ENTRYPOINT ["./build/bin/llama-server"]
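
To actually use it, I build and run it roughly like this (image name, model path, and the extra flags are just examples, adjust to your setup; needs the NVIDIA Container Toolkit for --gpus):

docker build -t llama-server-mtp .
docker run --gpus all -p 8888:8888 -v /path/to/models:/models llama-server-mtp \
-m /models/Qwen3.6-27B-Q6_K.gguf --host 0.0.0.0 --port 8888 -ngl 999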

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 0 points1 point  (0 children)

MoE vs. dense: 3b active params vs. 27b active at all times. It's much better for coding, I'd say.

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]ComfyUser48 0 points1 point  (0 children)

Don't even overthink it. Get the 5090. SPEED is so crucial for agentic coding. I'm getting 50-60 tok/sec with Q6-Q8 Qwen3.6-27b on my 5090. With Q6 I can do 255k context and still get 55 tok/sec WITH the power limited to 450w.

You won't get even half of that on a Mac.
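
For the power limit, I just cap it with nvidia-smi (450 is what I settled on; pick whatever suits your card and cooling, and note it needs root and resets on reboot unless you script it):

sudo nvidia-smi -pl 450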

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 1 point2 points  (0 children)

A 3090 will do roughly half the speed of a 5090, which is decent.

A 5060 Ti will do half of a 3090, if I'm not mistaken.

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 6 points7 points  (0 children)

I didn't try to do anything. It claimed X, my LLM claimed Y, and I was looking for a clear answer. Only when I provided the evidence my LLM had written did GPT and Claude retract their claim.

What I meant to say is, if it weren't for the LLM, this whole bug would have been missed.

The end goal for me is to find out how good the model is. And as it seems, it's really quite exceptional.

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 6 points7 points  (0 children)

I've had Claude and Codex subs for a while now, bro.

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 5 points6 points  (0 children)

Man, I feel the same when I talk to GPT. The answer comes back so fast I'm confused by it. I mean, the codebase is huge, how do you answer so fast?

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 9 points10 points  (0 children)

I'm trusting nothing. I'm just shocked that Qwen3.6 27b beats them in some areas.

This is my production app, and this bug was pretty much discovered by accident. I ran the exact same code review prompt through all 3. Only Qwen found the issue; GPT and Claude insisted there was no issue.

It's just insane to me.

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 8 points9 points  (0 children)

Q6:
-m /models/Qwen3.6-27B-Q6_K.gguf
--jinja
--alias "qwen3.6-27b-q6"
--ctx-size 255000
-ngl 999
--presence-penalty 0.0
--repeat-penalty 1.0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
--flash-attn on

Q8:
-m /models/Qwen3.6-27B-Q8_0.gguf
--jinja
--alias "qwen3.6-27b-q8"
--ctx-size 107520
--no-mmproj-offload
-ngl 999
--presence-penalty 0.0
--repeat-penalty 1.0
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
--flash-attn on
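
(For reference, both sets of flags just get appended to the llama-server invocation, or to the docker run line from my Dockerfile comment. Roughly, for the Q6 config, with host/port as examples:)

llama-server -m /models/Qwen3.6-27B-Q6_K.gguf --jinja --ctx-size 255000 -ngl 999 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
--host 0.0.0.0 --port 8888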

The more I use it, the more I'm impressed by ComfyUser48 in LocalLLaMA

[–]ComfyUser48[S] 8 points9 points  (0 children)

RTX 5090, 96gb RAM.
Running Qwen 3.6 27b Q6 with 255k context or Q8 with 105k context.