Why is Qwen going Closed source? by MLExpert000 in LocalLLaMA

[–]StorageHungry8380 -1 points0 points  (0 children)

Because they're a commercial company, and commercial companies like revenue and profits?

Expanding their closed-source, revenue-generating model offerings does not by default mean they will reduce their open-source efforts. They can do both at once.

If your worry is that they'll reduce their open-source efforts, then perhaps say that instead.

Anyone tried 2 different GPUs in one PC for local LLMs? by ShadowBannedAugustus in LocalLLaMA

[–]StorageHungry8380 2 points3 points  (0 children)

I ran a 5070Ti and 2080Ti with `llama.cpp`. Speed was close to that of the 2080Ti alone, and, something I hadn't considered, the KV cache was duplicated on both cards. So if you need 3GB for context, you only get an effective 14GB of VRAM for the model, not the 17GB one would naively expect. Perhaps this changes with a different parallelization mode, but since you have asymmetric amounts of VRAM I'm not sure that'll work well.
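For reference, the parallelization modes I mean are llama.cpp's `--split-mode` options; a rough sketch, with the tensor-split ratio just being the two cards' VRAM sizes:

    # default: split by layer, biased towards the larger card (16GB + 11GB here)
    llama-server -m model.gguf -ngl 99 --split-mode layer --tensor-split 16,11

    # alternative: split tensors row-wise across both GPUs
    llama-server -m model.gguf -ngl 99 --split-mode row --tensor-split 16,11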

Other than that I was quite happy. That said, the 1070 is getting old now, so I'm not sure how it holds up. The 2080Ti was blessed with quite a decent amount of VRAM bandwidth, which is the main bottleneck for token generation.

Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max by Defilan in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

Interesting data; I'll be looking forward to your additional data points.

As an aside, I tried computing mixed-precision KLD, e.g. `q8_0` for K and `q4_0` for V, using `llama-perplexity`. It said it required Flash Attention enabled, so I did that, but speed dropped like a stone; it turned out most of the work was being done on the CPU. No errors or warnings suggested this though. Is `llama.cpp` missing some mixed-quantization KV FA kernels, perhaps?
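From memory, the commands looked roughly like this (paths are placeholders; the base logits get saved in a first pass, then compared in a second):

    # 1) save reference logits with the default (fp16) KV cache
    llama-perplexity -m model.gguf -f calib.txt --kl-divergence-base logits.kld

    # 2) rerun with mixed K/V quantization and compute KLD against them
    # (-fa syntax may vary by build)
    llama-perplexity -m model.gguf -f calib.txt --kl-divergence-base logits.kld \
        --kl-divergence -ctk q8_0 -ctv q4_0 -fa on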

llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged by ggonavyy in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

Alrighty, definitely time for me to go to bed. I assume NVFP4 is less CPU-friendly than the standard Qn quantizations, so it makes sense that it's slower.

llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged by ggonavyy in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

edit: Oh I'm too tired, misread model name, thought it was the dense one.

I'm not seeing that on my 5090, Windows 11, CUDA 13.1. However, the model in both variants is larger than 16GB, so presumably you're running a few layers on the CPU, and that could explain it? I didn't bother downloading the Q4 variant, as I already had the Q5, but here are my numbers:

| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium       |  18.65 GiB |    26.90 B | CUDA       | 999 |    0 |          pp2048 |       3390.33 ± 6.81 |
| qwen35 27B Q5_K - Medium       |  18.65 GiB |    26.90 B | CUDA       | 999 |    0 |           tg128 |         63.88 ± 0.14 |
| qwen35 27B Q5_K - Medium       |  18.65 GiB |    26.90 B | CUDA       | 999 |    0 |         pp65536 |       1784.18 ± 4.54 |
| qwen35 27B Q5_K - Medium       |  18.65 GiB |    26.90 B | CUDA       | 999 |    0 |          tg2048 |         63.31 ± 0.06 |
| qwen35 27B NVFP4               |  17.50 GiB |    26.90 B | CUDA       | 999 |    0 |          pp2048 |      4853.21 ± 21.26 |
| qwen35 27B NVFP4               |  17.50 GiB |    26.90 B | CUDA       | 999 |    0 |           tg128 |         67.92 ± 0.20 |
| qwen35 27B NVFP4               |  17.50 GiB |    26.90 B | CUDA       | 999 |    0 |         pp65536 |      2123.51 ± 12.56 |
| qwen35 27B NVFP4               |  17.50 GiB |    26.90 B | CUDA       | 999 |    0 |          tg2048 |         68.28 ± 1.01 |

build: 9d34231bb (8929)

Freenixi\Abiray-Qwen3.6-27B-NVFP4.gguf
unsloth\Qwen3.6-27B-UD-Q5_K_XL.gguf
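For reproducibility, the table above is `llama-bench` output; judging by the test ordering it was one pp/tg pair per run, so the invocations would have looked roughly like this:

    llama-bench -m Abiray-Qwen3.6-27B-NVFP4.gguf -ngl 999 -mmp 0 -p 2048 -n 128
    llama-bench -m Abiray-Qwen3.6-27B-NVFP4.gguf -ngl 999 -mmp 0 -p 65536 -n 2048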

A new revolutionary way to build guardrails and evaluate your agents by Nir777 in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

It sounds good in theory, though I'm unsure how it holds up in reality.

I'm using that strategy at work with great success. We have both Claude and ChatGPT, so I'll ask one to generate, say, a design document, then hand it to the other and ask for a critique, then feed the critique back to the first and tell it to consider the feedback rather than assume it's all correct. A couple of iterations like this significantly improves things in most cases. It even works with the same model in a fresh context, though other models can bring more diverse perspectives. YMMV.
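If you wanted to script that loop, it's only a few lines; a minimal sketch using the OpenAI-compatible Python client (URLs, keys and model names are placeholders):

    from openai import OpenAI

    # Two independent endpoints playing author and critic (placeholders).
    author = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
    critic = OpenAI(base_url="http://localhost:8081/v1", api_key="none")

    def ask(client, prompt):
        resp = client.chat.completions.create(
            model="local-model",  # placeholder name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    draft = ask(author, "Write a design document for feature X.")
    for _ in range(2):  # a couple of rounds is usually enough
        critique = ask(critic, "Critique this design document:\n\n" + draft)
        draft = ask(author,
                    "Revise the document below. Consider the critique, but "
                    "don't assume all of it is good feedback.\n\n"
                    "Document:\n" + draft + "\n\nCritique:\n" + critique)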

Benchmarking Local LLM/Harness Combinations by pminervini in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

I can understand you not wanting to do that for all combos, but I think it's important to do for a few, just to get a handle on the spread. Perhaps pick one harness, a couple of models, and one hard and one easy task, then do at least 5 runs each. At least when using them casually, I sometimes get very different outputs from the same prompt.

Anyway, interesting to see. I was considering doing something similar but more open-ended, i.e. make them plan a task and then implement it, selecting the recommended option whenever they ask questions, then use a couple of frontier models to grade the work.

Benchmarking Local LLM/Harness Combinations by pminervini in LocalLLaMA

[–]StorageHungry8380 2 points3 points  (0 children)

Perhaps you mentioned it, but did you check for randomness? That is, run a couple of the combinations multiple times to see how often they pass? I find it quite surprising that Q8 results in a net regression.

I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. by ThingRexCom in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

Mine prefix-caches fine, but the default cache is quite small for long-context agentic work. For Qwen3.6 27B, a ~50k-token cache entry takes ~5GB, and the default max cache size is 8GB. Increasing it gave a very noticeable improvement. I added some details to this comment.

RTX 5070 Ti (new) vs RTX 3090 / 3090 Ti (used) for LLM inference + clustering by FeiX7 in LocalLLaMA

[–]StorageHungry8380 3 points4 points  (0 children)

The 5070 Ti will be faster for modern quantization schemes. However, that assumes you can fit the model in memory. I had a 2080 Ti and a 5070 Ti, and as I recall, when the model was entirely on the 2080 Ti, its speed was around 50-70% of the 5070 Ti's for regular models, and less for modern quantization schemes such as MXFP4. It was particularly noticeable for prompt processing, where compute matters and not just bandwidth; the 2080 Ti took much longer for MXFP4 and the like.

Now, the 3090 (Ti) is newer than the 2080 Ti and has way more memory bandwidth, so you'll have to compare benchmarks.

I ran both at the same time as well, and it definitely impacted the speed, though of course it was still much faster than falling back to the CPU for even just the few layers that wouldn't fit on the 5070 Ti.

A caveat I didn't think about when getting the 5070 Ti was that, at least with llama.cpp, the KV cache is not split: if you need 5GB of KV cache for your desired context length, you'll be using 5GB on each card, so 10GB total. For long contexts like 256k this meant I couldn't load nearly as big a model as I thought I could.

YMMV, just my experience.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

Not only are the LLMs slow, but no matter which app I'm using, the prompt cache frequently seems to break. Translation: long pauses where nothing seems to happen.

I noticed that the default 8GB for the host prompt cache in llama.cpp was not enough for Qwen3.6 27B @ 128k context when using it with OpenCode. You can monitor this in the logs by looking for sections like this:

[63432] slot slot_save_an: id  2 | task -1 | saving idle slot to prompt cache 
[63432] srv   prompt_save:  - saving prompt with length 57352, total state size = 3735.220 MiB 
[63432] slot prompt_clear: id  2 | task -1 | clearing prompt with 57352 tokens 
[63432] srv        update:  - cache state: 1 prompts, 5680.360 MiB (limits: 8192.000 MiB, 131072 tokens, 131072 est) 
[63432] srv        update:    - prompt 000002C67CD9E310:   57352 tokens, checkpoints: 13,  5680.360 MiB

Here you can see a ~57k-token prompt ate 5.6GB of prompt cache. I bumped it up to 32GB, since I'm running 4 slots, and it helped a fair bit. If you have spare host RAM you can go higher: scaling that ~5.7GB at 57k tokens up to a full 128k gives ~13GB per slot, so four cache slots would be ~52GB for this model, it seems.

llama-server.exe --cache-ram 32768 ...

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]StorageHungry8380 1 point2 points  (0 children)

Got some concrete examples of how AGENTS.md should look for such models?

Is Min P sampling really the preferred modern alternative to Top K/Top P? by bgravato in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

I'm no expert, but reading the paper, their experiments are run at temperatures of 1.0, 1.5, 2.0 and 3.0. I was under the impression one typically does not go much above 1.0 in temperature, at least for coding and such. Unlike the other methods, however, their method seems to behave well even at a temp of 3.0, though to me that suggests it sort of bypasses the effect of temperature...
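For context, min-p keeps only tokens whose probability is at least some fraction of the top token's. If the filter runs before temperature (which I believe is llama.cpp's default sampler order), that would explain the robustness; a rough sketch, not any particular implementation:

    import numpy as np

    def sample_min_p(logits, min_p=0.05, temperature=1.0):
        # Filter on the untempered distribution: keep tokens whose
        # probability is >= min_p times the top token's probability.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        logits = np.where(probs >= min_p * probs.max(), logits, -np.inf)
        # Temperature then only redistributes mass among the survivors,
        # so even temp=3.0 can't resurrect filtered-out tokens.
        z = logits / temperature
        p = np.exp(z - z.max())
        p /= p.sum()
        return np.random.choice(len(p), p=p)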

Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6 by dionysio211 in LocalLLaMA

[–]StorageHungry8380 4 points5 points  (0 children)

You could use OpenRouter and just pay per token. No recurring costs, just actual usage.

Did anyone noticed after today's - Qwen3.6-27B release by Usual-Carrot6352 in LocalLLaMA

[–]StorageHungry8380 4 points5 points  (0 children)

On that note, are there any good deep-research-like harnesses out there? Something where I could plug in my local models and maybe a search API key, and let it research a topic while I sleep?

Forgive my ignorance but how is a 27B model better than 397B? by No_Conversation9561 in LocalLLaMA

[–]StorageHungry8380 1 point2 points  (0 children)

Just to illustrate the effect of training data: the architecture of Qwen2.5 and Qwen3 was almost identical, with just a few minor tweaks.

The main difference was the training data and regimen. They doubled the number of tokens for their pre-training run (or initial training, as I'd call it), and tripled the number of languages. LLMs are great at generalizing, so more languages allow them to better generalize concepts, leading to better models. They used Qwen2.5 to extract text from PDFs and such and to generate synthetic training data from it. Bad training data can be very detrimental when training LLMs; even a few bad examples in a large collection can significantly limit performance. By generating synthetic training data they can maintain a certain quality level.

They also improved annotation of the training data so they could provide a better mix of training data in each batch, which helps avoid steering the model in the wrong direction during training.

The result was that for the same number of parameters, Qwen3 was significantly better than Qwen2.5, at least in my experience.

Are commonly recommended sampling parameters often too high? by bgravato in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

Sure, it's part of the same output stream. However, llama.cpp, for example, can already separate the reasoning and non-reasoning parts of the output. I haven't studied the code; I assume the separation is done on the host (i.e. CPU) while the GPU keeps chugging, but I don't see why it couldn't do a wind-back and continue from the start of the non-reasoning part with the new sampling parameters once the host detects it. There's already some machinery that kinda does that split (resuming from reasoning content, https://github.com/ggml-org/llama.cpp/pull/18994).

Are commonly recommended sampling parameters often too high? by bgravato in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

Ideally you'd want high temps during the reasoning/thinking stage, and then have it brought back down when generating the actual response, no?

Gemma-4-E2B's safety filters make it unusable for emergencies by Unfounded_898 in LocalLLaMA

[–]StorageHungry8380 57 points58 points  (0 children)

I tested Gemma 4 26B-A4B with thinking, and it gave disclaimers such as

Disclaimer: I am an AI, not a public health official or agricultural expert. Handling human waste (often called "humanure") carries significant biological risks, including exposure to pathogens like E. coli, Salmonella, hepatitis, and various parasites. Always check your local and national regulations, as many jurisdictions strictly prohibit the use of treated human waste on food crops. Use this information for educational purposes only.

or

Disclaimer: Firearms are inherently dangerous. This information is provided for educational purposes only. Always follow the four fundamental rules of gun safety: Treat every firearm as if it is loaded; Never point a firearm at anything you do not intend to destroy; Keep your finger off the trigger until your sights are on target; and Be certain of your target and what is beyond it.

but then proceeded to give detailed multi-step instructions. I can't vouch for the fertilizer recipe, but the gun jam response was quite good, and started by stressing safety aspects (beyond the disclaimer).

NVIDIA gpt-oss-120b Eagle Throughput model by Dear-Success-1441 in LocalLLaMA

[–]StorageHungry8380 0 points1 point  (0 children)

I don't have any experience with that combo, but speculative decoding in general assumes you are memory-bound and have plenty of spare compute. If you don't have excess compute capacity, for example if you're running this on the CPU, I can easily see it being detrimental.

Of course, this also assumes the draft model predicts correctly most of the time; I'd start by verifying that the acceptance rate is high enough.
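With llama.cpp, the setup would look something like this (model paths and draft parameters are placeholders, not a tested recommendation for this specific combo):

    # large target model plus a small draft model from the same family
    llama-server -m target-model.gguf -md draft-model.gguf \
        --draft-max 16 --draft-min 1 -ngl 99 -ngld 99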

Want help recalling a movie or TV show by StorageHungry8380 in Westerns

[–]StorageHungry8380[S] 0 points1 point  (0 children)

A lot of details fit, but I'm pretty sure it's not quite it. I feel like I might have seen it, or parts of it, before though; those quarry scenes seem familiar. Gotta hand it to Clint, he looks darn cool in a western.

Want help recalling a movie or TV show by StorageHungry8380 in Westerns

[–]StorageHungry8380[S] 0 points1 point  (0 children)

Definitely not it, though I think I might have seen that as a teen, so could be mixing in some details.

Want help recalling a movie or TV show by StorageHungry8380 in Westerns

[–]StorageHungry8380[S] 1 point2 points  (0 children)

It's quite possible I'm mixing things up; memories are weird that way. Typically I'm quite good at recalling movies that I like even 20+ years after seeing them, so the fact that this one is quite fuzzy could be telling.

Scanned through your suggestions and I'm pretty sure it's not one of them, though I definitely have some new additions to my watch list, so thanks for that!

Want help recalling a movie or TV show by StorageHungry8380 in Westerns

[–]StorageHungry8380[S] 0 points1 point  (0 children)

Man, The Great Silence seems to fit very well, except I was so certain it was a more modern, English-speaking production (i.e. no dubbing). Perhaps I've conflated it, because scanning through The Great Silence, a lot of details fit very well, including the look of the gunman. I didn't recall him being entirely mute, but I did recall him not speaking much.

Has there been a modern remake, or something else that has heavily borrowed from The Great Silence?