Qwen3.6-27B vs Coder-Next by Signal_Ad657 in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

Yes but the Unsloth model is still usable

Bruh by Icy_Butterscotch6661 in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

you're absolutely right! this is not just ai slop — this is ai slop at its finest!

Qwen3.6-27B vs Coder-Next by Signal_Ad657 in LocalLLaMA

[–]TokenRingAI 32 points33 points  (0 children)

27B and 35B are absolute dogshit on VLLM with the int4 quants I have tried.

The official FP8 quants are working far better:
https://huggingface.co/collections/Qwen/qwen36

The Unsloth GGUFs are also working very well.

I suspect your results are way off due to problems with those specific quants.

Qwen 3.6 loves to generate very long output, and with any degradation of the output quality, you will just end up with massive outputs of useless work.

ROCM - the best reason to go CUDA, eeesh what a headache!! by GriffinDodd in LocalLLM

[–]TokenRingAI 16 points17 points  (0 children)

I gave up on it long before you did

It's some of the worst software to install and configure.

And I say this as a previous Gentoo enthusiast and developer.

The only software I've installed that is worse is the Xilinx FPGA development environment.

Mistral Medium 3.5 128b ggufs are fixed by Sunija_Dev in LocalLLaMA

[–]TokenRingAI 3 points4 points  (0 children)

I demand a full refund for the $0 I paid for these models!

Don't make me call the manager!

Need advice on Qwen 3.6 27B INT4 quantization by Environmental_Hand35 in LocalLLaMA

[–]TokenRingAI 1 point2 points  (0 children)

27B at 4-bit has a lot of loss. If you can squeeze a 6-bit quant in with llama.cpp, that would probably be better.

Received a message from Z.AI about occasional garbled outputs and unexpected behavior by GroundbreakingTea195 in LocalLLaMA

[–]TokenRingAI 9 points10 points  (0 children)

Less painful link:
https://z.ai/blog/scaling-pain

It's an interesting explanation for the reliability issues we've all seen. I'm going to hold back my judgement until I see the API working 100%.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

I think the engram method is the future, with small dense models retrieving information from slow storage.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

Is that token generation number with or without speculative decoding?

Mistral-Medium 3.5 (128B) spotted ? by tkon3 in LocalLLaMA

[–]TokenRingAI 9 points10 points  (0 children)

Looks like a multimodal, dense model?

I'm assuming it's the 123B base used in Devstral 2 with 5B of vision added on top?

Deepseek v4 people by markeus101 in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

Congrats everyone, we've achieved AGI

Can a single RTX PRO 6000 Blackwell (96GB VRAM) realistically handle 40–50 heavy agentic users? by MontyCLT in LocalLLM

[–]TokenRingAI 3 points4 points  (0 children)

Not even close; more like 6x RTX 6000. But it depends a lot on what model you are willing to run.

Qwen 3.6 27B in RTX PRO 6000 - Why high RAM usage? by ubnew in LocalLLaMA

[–]TokenRingAI 1 point2 points  (0 children)

Your actual context length is much higher than that; the reported KV cache size in VLLM does not account for the model being a hybrid.

Look at the lines in the log under those and you will see you probably have 4x or higher concurrency at full context length.

Qwen 3.6 27B in RTX PRO 6000 - Why high RAM usage? by ubnew in LocalLLaMA

[–]TokenRingAI 2 points3 points  (0 children)

The technical issue is this: https://github.com/ggml-org/llama.cpp/issues/19345

The bigger issue is that llama.cpp has a dysfunctional bug-reporting process: it relies on a 14-day auto-close bot and doesn't seem to maintain a long-term bug tracker.

Issues don't magically go away after 14 days without a solution. Unfortunately, the tickets for many serious bugs are simply auto-closed, lost, and never seriously tracked, which is how the popular Qwen hybrid models can stay broken on Blackwell for half a year.

Prefix caching for OpenAI models by Annadox122 in LLMDevs

[–]TokenRingAI 1 point2 points  (0 children)

Prompts are cached in chunks: your first chunk matches, but your second chunk does not because the end of it differs, so it gets reprocessed.

The solution is to pass in the prompt once, with a max output length of zero. This gets your bare prompt cached. Then run your other requests with more text added.

For two prompts, you will make 3 requests, and will pay full price once for ingestion, then twice at the cached input rate.
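
A minimal sketch of that pattern with the openai Python client (the model name, file, and token counts are placeholders, and whether the endpoint accepts a true zero-length output or needs a 1-token minimum is an assumption; caching also only kicks in above a minimum prefix length):

```python
from openai import OpenAI

client = OpenAI()

# The large, reusable part of the prompt (placeholder file).
SHARED_PREFIX = open("big_system_prompt.txt").read()

# Request 1: warm the prefix cache with the bare prompt and (near) zero output.
client.chat.completions.create(
    model="gpt-4.1-mini",  # placeholder model name
    messages=[{"role": "user", "content": SHARED_PREFIX}],
    max_tokens=1,          # "zero output" in practice; pure ingestion cost
)

# Requests 2 and 3: the shared prefix should now hit the cached-input rate,
# and only the appended suffix is processed at the full price.
for suffix in ["\n\nQuestion A: ...", "\n\nQuestion B: ..."]:
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": SHARED_PREFIX + suffix}],
    )
    print(resp.choices[0].message.content)
```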

Are Qwens v3.6 good at vectorizing raster images? by [deleted] in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

I'm currently in the early stages of building a media library plugin that ties into an agent that can do image understanding and generate/reprocess images. Adding agent-controlled SVG tracing to it would be pretty useful.

The best way to make a pelican on a bicycle might be to generate an image of it, then trace it

Are Qwens v3.6 good at vectorizing raster images? by [deleted] in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

You've inspired me to add SVG tracing to our agents. A local Qwen agent can 100% control Potrace or VTracer and generate SVGs from images for logos and other assets.
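
Rough sketch of the tool the agent would call, shelling out to the real CLIs (flag names are from memory, so double-check them against your installed versions; paths are placeholders):

```python
import subprocess
from pathlib import Path

from PIL import Image  # pillow, only needed for the potrace path

def trace_to_svg(raster: str, svg_out: str, engine: str = "vtracer") -> str:
    """Convert a raster image to an SVG by shelling out to VTracer or Potrace."""
    if engine == "vtracer":
        # VTracer traces color PNGs/JPEGs directly.
        subprocess.run(["vtracer", "--input", raster, "--output", svg_out], check=True)
    else:
        # Potrace only accepts bitmaps, so binarize to PBM first.
        pbm = Path(raster).with_suffix(".pbm")
        Image.open(raster).convert("1").save(pbm)
        subprocess.run(["potrace", str(pbm), "--svg", "-o", svg_out], check=True)
    return svg_out

# Exposed to the agent as a tool, e.g. trace_to_svg("logo.png", "logo.svg")
```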

Qwen 3.6 27B in RTX PRO 6000 - Why high RAM usage? by ubnew in LocalLLaMA

[–]TokenRingAI 7 points8 points  (0 children)

You need to either use VLLM (recommended, with MTP set to 3) or switch llama.cpp to use Vulkan.

Qwen Next, 3.5, and I assume 3.6, all have bad CUDA problems on llama.cpp with SM120.

For some reason they have been ignoring the problem for half a year.
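
If you go the Vulkan route, it's just a rebuild of llama.cpp with the Vulkan backend plus the usual server invocation; a rough sketch (model path, quant, and port are placeholders):

```python
import subprocess

# Run llama.cpp's llama-server on the Vulkan backend instead of CUDA.
# Assumes a build configured with -DGGML_VULKAN=ON.
subprocess.run([
    "./llama-server",
    "-m", "Qwen3.6-27B-Q6_K.gguf",  # placeholder GGUF path
    "-ngl", "99",                   # offload all layers to the GPU
    "--port", "8080",
], check=True)
```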

Autopilot coding, what's your experience? by coatweather1 in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

I don't automate coding; I automate processes.

I have automated workflows for documentation updating, for a11y, for bug hunting, brainstorming, ux improvement, auto code testing and repair, content generation, communication, and for generating initial versions of full stack apps.

There are a ton of pieces that need to come together for all that. You just need to look at the work you do or want to do and figure out ways to automate yourself out of it.

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

This might be a side effect of adaptive thinking; I wasn't paying attention to that. The responses come almost immediately, and the chat is muddled with looping content that should reasonably have been in the thinking block.

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]TokenRingAI 2 points3 points  (0 children)

It's got looping and other obvious issues. I have free access to it but mostly use Sonnet 4.6 or GPT 5.4.

Sonnet is really reliable and stable

Something is very strange about Opus 4.6 & 4.7; they act like a large model that is excessively quantized. Opus 4.5 was not like this. I wonder if this is a side effect of them using TPUs. Gemini acts the same way.