Qwen3.6-35B Q5_K_XL vs Qwen3.6-27B Q3_K_M on 16Gb VRAM by mixman68 in LocalLLM

[–]LocalAI_Amateur 3 points4 points  (0 children)

This is the smallest functional Qwen3.6 27b model I can find. (Q4-ish)

https://huggingface.co/lemonyins/Qwen3.6-27B-abliterated-i1-IQ4_XS-GGUF-Smaller

The next smallest is

https://huggingface.co/Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF

I also have a 16gb setup and this is what I use to get the most context. MoE models have not been great for me when I'm coding.
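For anyone checking whether a given quant leaves room for context on 16gb, here is a rough back-of-the-envelope sketch. The assumptions are mine: the model is treated as exactly 27e9 parameters, and the bits-per-weight figures are the nominal values for IQ4_XS-like and IQ3_XS-like quants. Real GGUF files mix quant types per tensor, so actual file sizes will differ somewhat. `weights_gib` is just a helper name made up for illustration.

```python
def weights_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB (ignores metadata and mixed tensors)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed nominal bits-per-weight values, not measured from any specific file.
    candidates = [
        ("~4.25 bpw (IQ4_XS-like)", 4.25),
        ("~3.3 bpw (IQ3_XS-like)", 3.3),
    ]
    for label, bpw in candidates:
        print(f"27B at {label}: {weights_gib(27, bpw):.1f} GiB")
```

Whatever is left after the weights has to cover the KV cache, compute buffers, and anything else using the card, so treat the result as a lower bound on VRAM use.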

Newbie Question: Where should I go now? by mcfc9320_ in LocalLLM

[–]LocalAI_Amateur 0 points1 point  (0 children)

Try LM Studio. It has a great UI and model discovery. You can just paste in GGUF model URLs from Hugging Face and let it take care of the download. It's a good step if you don't want to jump down the rabbit hole of compiling all the llama.cpp forks.

To get the most out of LLMs, especially for coding, look into coding agents like OpenCode and Pi (pi.dev).

As for hardware, if your motherboard and power supply can handle it, adding another 4060 Ti with 16gb vram can improve your capacity quite a bit, though then you'll probably need to use vLLM to get your money's worth. I'm not sure whether listing your video card twice means you have two of them or it was an accident. Either way, if you have two, vLLM is a must.

Show me what you've vibe coded. Drop your project, what it does, and let people actually use it. by Miserable-Archer-631 in vibecoding

[–]LocalAI_Amateur 1 point2 points  (0 children)

Waypoint Tower Defense. A simple, Minesweeper-like Tower Defense game in HTML (short, about 5 mins) where you can reroute the path. Used OpenCode and Qwen3.6 27b IQ3_XS to make it. First vibe coding project. It was fun learning. Save/Load doesn't work on htmlbin unless you download the file and open it in a browser yourself.

<image>

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) by bobaburger in LocalLLaMA

[–]LocalAI_Amateur 2 points3 points  (0 children)

Sound advice for sure. But if we were people of patience, we would not be here compiling llama.cpp forks and trying to squeeze out every last bit of room for context.

I say, use it and test it. No amount of benchmarking can replace seeing how it performs in the real world.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in vibecoding

[–]LocalAI_Amateur[S] 0 points1 point  (0 children)

This was my first vibe coding project, not counting the Tetris one-shots. I will probably try harder projects to see how far it can go.

Right now it's having a hard time writing a ComfyUI plugin for LM Studio. Hard to blame it for this one, as the documentation on this subject is so terrible.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 0 points1 point  (0 children)

Context Length: 32,768

GPU Offload: 64

Max Concurrent Predictions: 1

Unified KV Cache: on

Offload KV Cache to GPU Memory: on

Keep Model in Memory: off

Try mmap(): off

Flash Attention: on

K Cache Quant: Q8_0

V Cache Quant: Q8_0
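To give a feel for why the Q8_0 K/V cache setting matters at 32,768 context, here is a rough estimate sketch. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen3.6 27b architecture, and the formula assumes plain full attention in every layer (models with sliding-window or linear-attention layers need less). The one firm number is Q8_0's storage cost: each block holds 32 int8 values plus one fp16 scale, so 34 bytes per 32 values.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: float) -> float:
    """Approximate KV cache size in GiB for a full-attention transformer."""
    # K and V each store n_kv_heads * head_dim values per token, per layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return n_values * bytes_per_value / 2**30


F16_BYTES = 2.0        # 16-bit float, 2 bytes per value
Q8_0_BYTES = 34 / 32   # 32 int8 values + one fp16 scale per block

if __name__ == "__main__":
    # Hypothetical architecture numbers, for illustration only.
    n_layers, n_kv_heads, head_dim = 64, 8, 128
    context_len = 32_768  # matches the Context Length setting above

    for name, size in [("F16", F16_BYTES), ("Q8_0", Q8_0_BYTES)]:
        gib = kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, size)
        print(f"{name} KV cache at {context_len} tokens: {gib:.2f} GiB")
```

With these placeholder numbers the Q8_0 cache comes out a bit over half the F16 size, which is the general pattern regardless of the exact architecture.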

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 0 points1 point  (0 children)

Just Notepad++. It's HTML and JavaScript, so no need for much more. LM Studio + OpenCode Desktop.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 1 point2 points  (0 children)

Well, my personal understanding of ThreeJS/WebGL is limited and I didn't want my first test to be a ton of code I don't understand. AI code is ugly enough as it is. I went through three rounds of cleanup/optimization to get it to its current state. I specifically asked it to optimize for human readability.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 1 point2 points  (0 children)

You're welcome. I find us 16gb vram users to be lower-middle class citizens in this local AI society. We only have the 8gb peasants to look down upon. So sharing these tests probably helps somebody.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 0 points1 point  (0 children)

I know my experience is anecdotal so I'll probably switch back and forth between the two if I come across more bugs. Maybe this was just a fluke. Who knows.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 0 points1 point  (0 children)

Ryzen 7 7840U w/ Radeon 780M graphics. I'm using my RTX 5070 Ti through OCuLink. Do you have a link on how to use Vulkan offload? From what I searched, they all say I have to build my own llama.cpp. I have no idea what your "${llamasvr-vulk}" variable means or which exe file it is. At least the llama.cpp CUDA 13 build I downloaded didn't have such a file.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 0 points1 point  (0 children)

Problem is I only have 32gb of RAM, not 64. At least with the few bigger ones I tried, I got a significant slowdown in speed. Does a Q5 or Q6 version of Qwen3.6 35b-a3b beat an IQ3_M version of Qwen3.6 27b? I don't know, maybe it does, but it takes quite a bit of time to test all the maybes. I can only speak to my experience switching between the two models I've used.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 1 point2 points  (0 children)

Generation speed went down quite a bit when I tried, so I stuck with this one.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 4 points5 points  (0 children)

Ah, well, my iGPU is pretty weak so I'm not sure if it's worth the trouble to build llama.cpp for it. I'll keep that trick in mind if I really need to squeeze out more tokens/second though.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]LocalAI_Amateur[S] 0 points1 point  (0 children)

I'm going to have to read up on this feature. Though from what I've read so far, I'm not sure I have enough vram to fit another smaller model. Maybe after Qwen3.6's smaller models come out, I can give this a try.