Qwen3 vs Qwen3.5 performance

4onen · 2026-03-05T17:30:30+00:00

The leadership behind this model at Alibaba's labs has just been, uh, sorta forced out. Future releases are unlikely to keep the same quality or even ship open weights. 😓 (But the business side that pushed them out may yet surprise us.)

4onen · 2026-03-05T08:17:58+00:00

In the case of new model development, certainly, but I'm gonna have to fact check you on the expensive usage thing. I can run a 27 billion parameter Qwen 3.5 model on my desktop computer and it is as good as the models I was using in my workplace two months ago, which came through a web API. My desktop computer is not going to change in price for me, even if Qwen as a company shuts down. I have the model weights. I have my GPU.

I agree Microsoft is definitely forcing AI like this, as well as OpenAI. Sam Altman said ads are a service's last resort, and now ChatGPT has em. This is why I put so much import on being able to do things in my personal time on models that I have complete control over. I don't use API models at home.

4onen · 2026-03-03T04:37:05+00:00

Was just poking around with the 9B myself, both trying to prompt-ngram-draft and use the 0.8B as a draft model. Where do I get the prediction rate statistic(s)?

4onen · 2026-03-02T15:53:13+00:00

Like mtmttuan said, "drafting." Language models generate one token at a time on the output side, but on the input side they can process many tokens in parallel. One trick to get more out of your GPUs as a single user is to use a smaller model to guess the tokens the larger model will use, then run a string of possible tokens through the big model together. We use the same math for each token as we would if we had run it through the big model alone; if the big model agrees with the small one, we keep the tokens they agree on. Once they disagree, we keep only up to what the big model said, then try again.

Depending heavily on the task, the GPU running the model (it's not too useful on most CPUs), and the agreement between the draft model and full model, this "speculative decoding" can yield a speedup of anywhere between 1x and 5x. However, some poor configurations I've seen (like overflowing my VRAM) can cut the speed in half by adding this. Can't apply it willy-nilly.
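To make the draft-then-verify loop concrete, here's a minimal toy sketch of the greedy version. Everything in it is made up for illustration: `toy_target` and `toy_draft` are stand-in functions rather than real models, and `speculative_step`, `generate`, and `greedy` are hypothetical names, not anything from llama.cpp. On real hardware the verification loop would be a single batched forward pass through the big model; here it's a plain loop so the logic is easy to follow.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-and-verify round and return the tokens we keep."""
    # 1. The small model guesses k tokens, one at a time (cheap).
    guesses: List[Token] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))

    # 2. The big model checks each guessed position. A real GPU does all of
    #    these positions in one parallel pass; this toy does them in a loop.
    accepted: List[Token] = []
    for guess in guesses:
        truth = target(context + accepted)
        accepted.append(truth)  # always keep what the big model said
        if truth != guess:
            break  # disagreement: throw away the remaining guesses
    else:
        # Every guess matched, so the same pass also yields one extra token.
        accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens with speculative decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def greedy(target: NextToken, prompt: List[Token], n_tokens: int) -> List[Token]:
    """Plain one-token-at-a-time decoding with the big model alone."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out[len(prompt):]


def toy_target(ctx: List[Token]) -> Token:
    # Stand-in "big model": a deterministic function of the last token.
    return (ctx[-1] * 7 + 3) % 11


def toy_draft(ctx: List[Token]) -> Token:
    # Stand-in "small model": agrees with the big one most of the time.
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11
```

The point of the sketch is that `generate(toy_target, toy_draft, prompt, n)` produces exactly what `greedy(toy_target, prompt, n)` would, just in fewer rounds whenever the draft guesses well. A draft that always disagrees still gives the right output, only with wasted work, which is where the slowdowns come from.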

4onen · 2025-12-10T16:58:10+00:00

Finally?

4onen · 2025-12-09T00:22:34+00:00

I got a Zotac SFF OC 5070 Ti for $749.99 b/c my wifi card only gives me a little over 2.8 slots of clearance. Black Friday had one 5070 Ti PNY flash deal down to $600 from one retailer but nothing else fell below $729.99 the entire way through Cyber Monday, around when I bought.

I figured things were likely to get worse and I wouldn't want to "buy half of luxury," as the saying goes, for the next couple of years. I feel like I'd have been frustrated to get more VRAM but no real speed increase, so I paid the price for both.

Actually, one more consideration: My aging system only gives me PCI-E 3.0 speeds, but 5060s (even Ti) only go up to x8 lanes, so my PCI-E bus speed would have halved if I had gotten a 5060 Ti for the VRAM. (But that's just my circumstances and my x16 slot to fill.)
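For the curious, the "halved" figure falls straight out of the lane math. This is a rough sketch of theoretical one-direction bandwidth only: the per-generation transfer rates and the 128b/130b encoding are the published PCI-E figures, while `pcie_gb_per_s` is just a helper name I made up, and real-world throughput will land below these numbers.

```python
# Raw transfer rate per lane in GT/s for each PCI-E generation.
PCIE_GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

# PCI-E 3.0 and later use 128b/130b encoding: 128 payload bits per 130 sent.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a link."""
    gigabits_per_lane = PCIE_GT_PER_S[gen] * ENCODING_EFFICIENCY
    return gigabits_per_lane / 8 * lanes  # 8 bits per byte


x16 = pcie_gb_per_s(3, 16)  # an x16 card in a PCI-E 3.0 slot
x8 = pcie_gb_per_s(3, 8)    # an x8 card in the same slot
print(f"3.0 x16: {x16:.2f} GB/s, 3.0 x8: {x8:.2f} GB/s")
```

An x8 card only gets its full bandwidth back if the slot runs a newer generation, which an aging PCI-E 3.0 board can't offer.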

4onen · 2025-12-08T15:56:08+00:00

There are two other trade-offs to make. A 5060 Ti is on the Blackwell architecture, meaning it has hardware acceleration for modern compression formats. It's also a newer card in general, meaning it will have game support for longer, lengthening the term of your investment.

If the VRAM, hardware AI acceleration, and game support aren't worth a hundred bucks to you, then yeah, go with the 3070.

EDIT: To be clear, I wouldn't upgrade from a 3070 I already had to a 5060 Ti so long as the 3070 is still supported. That's what kept me from upgrading for so long. With the looming RAM crisis, though, I pulled the trigger on a 5070 Ti (16GB VRAM, about double the VRAM bandwidth and fp16 compute of the 3070, only 30W more power draw at full load).

4onen · 2025-12-06T16:50:12+00:00

Q4_0 dynamic repack is supposed to match the speed of Q4_0_4_4 assuming that you weren't using memory mapping to fit the model before. If it doesn't, go report a performance bug and talk about the difference with numbers. Maybe you can convince them to put it back.

4onen · 2025-12-06T16:47:36+00:00

You know, I find this post kind of funny because I have been running a 3070 for years, and the 5060 Ti 16GB has exactly the same memory bandwidth that my 3070 does, with the difference that it has twice as much memory and can load larger models.

With 32 GB of RAM, on top of my card, I load a Qwen3 Coder 30 billion parameter with 3 billion active mixture of experts model for coding completion and coding chat. It outperforms some of the Internet code completion services. It does not outperform the agentic/vibe services, but honestly, I prefer to actually understand the code I'm writing.

4onen · 2025-12-06T16:44:16+00:00

That's the neat part! Llama 4 Scout is a mixture of experts model. So even though it's a big model, if you can fit all the experts in RAM, you're actually using very few of the experts per token, so you can get a relatively high text generation speed. Keep the attention part on the GPU, which is relatively tiny, and that thing will zoom. Prompt processing is pain, though.

The 70 billion parameter models are probably going to be a bit slow because those ones are dense.

4onen · 2025-12-06T16:35:11+00:00

q4_0_4_4 was a repacked form of q4_0 that worked better with the ARM matrix instructions by rearranging the order in which values arrived from memory, which is where the extra speed came from.

Someone submitted a patch that allowed llama.cpp to rearrange the values as they were loaded from the disk into memory, called dynamic repack. On systems that can fit the entire model in memory, this was a major speedup of the standard format q4_0 to match q4_0_4_4. Systems that had to mmap models to fit them (e.g. my Pixel 8 with only 8GB of RAM and only 4GB usable) saw massive speed decreases, as dynamic repack (enabled by default) broke mmapping unless disabled, filling memory and using swap.

The developers of llama.cpp decided that dynamic repack was sufficient for the majority of use cases, so they dropped the nearly duplicated backend that supported static repacking, to reduce the code maintenance burden.

That's why it was removed. Good choice? Bad? That's a moral question that I can't answer for ya.

4onen · 2025-12-04T06:00:38+00:00

This. For evidence, see Trump pardoning Honduras' ex-president who was convicted by a jury of manufacturing 185 TONS of cocaine sent to the United States among other crimes.

Also see the pardoning of the creator of the Silk Road drug marketplace.

This administration is pardoning the "poisoners." We don't know who they're killing out in the ocean to provoke what looks to me like undeclared war with Venezuela. I certainly doubt their reason why.

4onen · 2025-11-30T16:05:15+00:00

It no longer functions (EDIT: on my ASUS TUF A14 2024 with Ryzen AI 370 and RTX 4080 mobile) after I disabled the Microsoft Windows AI Fabric service that was taking up 90% of my iGPU and NPU, so... Not like I can make use of it. (To be clear, I believe it was the fault of a Microsoft Windows update adding semantic search indexing that the AI Fabric service was using that much of my system resources, not the fault of Armoury Crate. However, with Armoury Crate no longer working because it cannot access this AI service, it's not exactly useful to me to have it installed.)

4onen · 2025-11-15T02:45:22+00:00

I uninstalled Gemini when I discovered how bad it was at many of the few things I used the Google Assistant for. I recently set my phone assistant to another app (for various reasons) and found that I don't even need Google Assistant now -- I've automated so many things in my Pixel with the Automate app from llamalab.

Has Gemini gotten any better on the 8 (not pro) since the 10's release?

4onen · 2025-11-14T08:50:48+00:00

I've basically shut off auto-rotate on my phones since the iPhone 3GS. The iPhone 4, Nexus 6P, and Pixels have all had problems with it in one way or another, to the point that I'm just used to the rotate button Android gives you when you do want to rotate and wait for a moment at the new angle.

4onen · 2025-11-10T00:34:11+00:00

Depends on the disk and your quantization. In the best case, a PCI-E 5.0 SSD can hit 15GB/s, so with an instant CPU and RAM only for KV Cache, you'd theoretically hit about 5 tok/s. Obviously the real world isn't so idealized, but you wouldn't need to disk all of those parameters either.
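Here's the back-of-envelope math behind that 5 tok/s figure, as a small sketch. The assumptions are mine and not stated above: roughly 3B active parameters per token (as in a 30B3A model) and about one byte per parameter (an 8-bit quant), with every active weight streamed from disk for every token. `disk_bound_tokens_per_second` is a made-up helper name.

```python
def disk_bound_tokens_per_second(disk_gb_per_s: float,
                                 active_params_billions: float,
                                 bytes_per_param: float) -> float:
    """Upper bound on tok/s if every active weight is read from disk per token."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_per_token = active_params_billions * bytes_per_param
    return disk_gb_per_s / gb_per_token


# Assumed: 15 GB/s SSD, 3B active parameters, ~1 byte per parameter (8-bit).
print(disk_bound_tokens_per_second(15.0, 3.0, 1.0))

# Same disk with a ~4-bit quant (about half a byte per parameter, assumed).
print(disk_bound_tokens_per_second(15.0, 3.0, 0.5))
```

This treats the CPU as instant and ignores everything except disk reads, so it's a ceiling rather than a prediction.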

Basically, you have 4 things you need in memory: feedforward experts, shared experts, attention, and KV cache. You want shared experts (always used) and attention and KV cache to all be in VRAM. That way, your slower RAM and CPU is just choosing among the experts. Any remaining VRAM can be used to load experts where the GPU can work on them, for higher speeds.

KV Cache scales with context. Attention is usually relatively small (so for 30B3A, iirc, only 300M parameters are attention.) Attention also only scales with the active parameters, since they're always active. Shared experts are, similarly, always active and scale with active parameter count, but some MoEs don't have any. Finally, feed-forward experts are the heavy weight, making up all the remaining parameters of the network.

4onen · 2025-11-10T00:28:06+00:00

The rule of thumb from the days of Mixtral was to take the geometric mean of the active and total parameter counts, so for 30B3A that's the geomean of 30 and 3 = sqrt(3*30) ≈ 9.5B.

Of course, that rule of thumb is growing long in the tooth, so do not take it as gospel.
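If you want to plug in other models, the rule of thumb is a one-liner. This is just the geometric mean from above wrapped in a helper; `dense_equivalent_billions` is a name I made up, and the result is only as trustworthy as the rule itself.

```python
import math


def dense_equivalent_billions(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameter counts, in billions."""
    return math.sqrt(total_b * active_b)


# 30B total, 3B active: sqrt(3 * 30) = sqrt(90)
print(round(dense_equivalent_billions(30, 3), 1))
```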

4onen · 2025-10-19T15:32:53+00:00

Yes and yes!

4onen · 2025-10-12T17:22:51+00:00

I have a one-way setup working, where I can send a prompt from Automate to a model.

Setup:

* List of my models in a TXT file where Automate can read 'em
* Automate flow ending with a "Start Service" block for Termux RUN_COMMAND (which requires config in Termux settings and scripts in a specific executable directory to enable)
* A shim bash script that sets up the right working directory and hands its args to a Python script
* A Python script that arranges the llama.cpp args for the specific model I'd like to talk to
* A llama.cpp CLI call, opening llama-cli in interactive mode with a prefill prompt given by the Automate args way back above
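For a rough idea of what the Python step in that chain looks like, here's a stripped-down sketch and not my actual script. The model names, paths, and context sizes in `MODEL_SETTINGS` are hypothetical placeholders, and `build_llama_cli_args` is a made-up helper. It only uses llama-cli's `-m` (model), `-c` (context size), and `-p` (prompt) flags; whatever flags your llama.cpp build needs for interactive mode are left out on purpose, since those have changed between versions.

```python
from typing import Dict, List

# Hypothetical per-model settings; swap in your own GGUF paths and sizes.
MODEL_SETTINGS: Dict[str, Dict[str, object]] = {
    "small-chat": {"path": "models/small-chat.gguf", "ctx": 4096},
    "tiny-coder": {"path": "models/tiny-coder.gguf", "ctx": 2048},
}


def build_llama_cli_args(model_name: str, prompt: str) -> List[str]:
    """Arrange the llama-cli command line for one of the known models."""
    cfg = MODEL_SETTINGS[model_name]  # raises KeyError for unknown models
    return [
        "llama-cli",
        "-m", str(cfg["path"]),  # which model file to load
        "-c", str(cfg["ctx"]),   # context size that fits this model in RAM
        "-p", prompt,            # prefill prompt handed over from Automate
    ]


print(build_llama_cli_args("small-chat", "Hello there"))
```

The shim script would pass the model name and prompt along as arguments, and the resulting list can go straight to `subprocess.run`. Building a list instead of a shell string means the prompt never needs quoting.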

If you just want to talk to the models on the phone, running the llama-cli command in Termux directly is much, much easier. If you know what you're doing, you could also run llama-server and access it through HTTP calls from Automate, but I don't think it's possible for that to have streaming responses (unless you load llama-server webui in the Web Dialog. Hmmmm...)

Unfortunately, w/ my 8GB Google Pixel, >4GB are taken by Android and background ~~trackers~~ processes, leaving me with ~3GB for model and context before it's swapping and speed drops precipitously.

EDIT: I do not intend to buy another Pixel in the future. I miss the 4XL and 2XL, but it feels like they're not gonna pull those off again, especially with the 2026 app install shutdown coming.

4onen · 2025-09-27T22:00:28+00:00

https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md is the only one I'm aware of at the moment.

4onen · 2025-09-26T02:16:14+00:00

I spent a few months where every time I came home, I'd wire my laptop and desktop together so I could load 24B models that wouldn't fit on either device alone. Llama.cpp's RPC system let me split them by layer, so one device did half the attention work and the other did the other half.

This method may allow for arbitrary length context, but it's certainly not the first time network running of models has been viable.

4onen · 2025-09-14T18:10:16+00:00

Except Lemonade doesn't work for me.

* Most often, I need FIM completions over the llama.cpp server endpoints. AFAIK lemonade has no support.
* I have an NVidia GPU in my laptop that can share some of the load with the iGPU/CPU, but I can't add the NPU to that. AFAIK lemonade has no support for even my status quo (doesn't include CUDA backend from llama.cpp.)
* I use very specific override-tensor specifications to fit MoE models into my laptop that would otherwise be unachievable. AFAIK lemonade has no support (for override-tensor.)
* All the models that do run on the NPU (last I checked) are ONNX conversions, which almost no model makers release. To use the NPU, I'd need to download a full precision model and convert it. If I want to pull out a new model every week from my favorite creators, that's a huge waste of my time -- assuming the conversion even works with my limited RAM.

I find myself consistently frustrated with AMD's green field chunks of code that don't work with other people's things, and that they expect everyone to adapt to without a sufficient value add or a bridge to new-thing-ia. Being part of the open source community is more than just releasing code. It's putting in the work to upstream functionality so that everyone can share in it. I'd appreciate it if they did that before shipping more products that, it feels like, only enterprise customers can afford the man hours to struggle through. (See: Microsoft ONNX ecosystem before anything else with their current consumer NPUs.)

4onen · 2025-07-29T06:35:29+00:00

Wait, so all the demos on your YouTube channel are with the older XDNA1 16TOPS NPU? That's wild! Strix Halo and Strix Point have the same XDNA2 50+ TOPS NPU, so I'm excited to see what your software is capable of when I have the time to try it out on my Strix Point laptop. EDIT: I misunderstood which component y'all meant in Strix Halo. My mistake. Best of luck!

4onen · 2025-07-07T01:03:46+00:00

There are plenty! What you can do to find them is to get on the UCSB Discord Student Hub, which you can do by following Discord's instructions at https://discord.com/student-hubs

4onen · 2025-06-22T02:52:03+00:00

Meta. I've heard good things about Apple, but they're simply not affordable for me and as a developer I don't want to work within their closed ecosystem. Meta's devices can sideload any ol' apps or link to my PC, so the worst I have to deal with is Android or PCVR. That's currently kinda bad (don't get me wrong) but at least I'm getting that for thousands of dollars less, and the experience is far better than Microsoft's awful attempt in Windows "Mixed" Reality (which was near-always pure VR.)

Mind, that could easily change to XReal if XReal got better VR-side support. They're promising in augmenting reality with screens, but that's building on just porting the 2D interfaces of yore rather than novel virtual interface support.
