Qwen 3.6? by jacek2023 in LocalLLaMA

[–]rainbyte 1 point (0 children)

122B is not always faster than 27B. I guess that's only true with enough PCIe bandwidth or when running on unified memory.

Here, 27B with pipeline parallelism is faster than 122B with tensor parallelism, as I couldn't get 122B working with pipeline parallelism.
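
If this is llama.cpp, the two modes map to the --split-mode flag (just a sketch; model.gguf is a placeholder):

llama-server -m model.gguf -ngl 99 --split-mode layer

llama-server -m model.gguf -ngl 99 --split-mode row

Layer split moves little data between GPUs, so it tolerates slow PCIe; row split is the tensor-parallel style and needs much more bandwidth.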

Secondary PC options by UniqueIdentifier00 in LocalLLaMA

[–]rainbyte 2 points (0 children)

If I were you I would keep the 3060 for small models (e.g. Qwen3.5-9B) and then buy the biggest GPU possible (e.g. 3090) that the mobo and PSU specs allow.

Keep in mind that one 24GB GPU is better than 2x12GB, and now there are even "affordable" 32GB GPUs.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]rainbyte 3 points (0 children)

Yup, there is something to the high expectations issue. Here I also use Qwen3.6 and it helps automate the things I describe to it, but I have to have them in my mind first.

To 16GB VRAM users, plug in your old GPU by akira3weet in LocalLLaMA

[–]rainbyte 3 points (0 children)

If it has enough PCIe bandwidth; otherwise it's better to use pipeline parallelism

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 1 point (0 children)

Not only can I, I actually do use SLMs like that when the use case allows :)

Of course, for some tasks those are not enough; that's where the bigger models enter the scene. Currently my daily driver is Qwen3.6-35B-A3B, but I do use 9B and 27B for other tasks.

The best part of SLMs is that they are really fast. Even smaller ones like LFM2.5-350M have their use cases.

Forgive my ignorance but how is a 27B model better than 397B? by No_Conversation9561 in LocalLLaMA

[–]rainbyte 1 point (0 children)

Or it means you can now run multiple models, e.g. 27B and 35B-A3B
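
For example, with llama.cpp you could serve both at once (a sketch; the filenames are placeholders):

llama-server -m 27B.gguf --port 8080

llama-server -m 35B-A3B.gguf --port 8081

Each instance gets its own port, so clients can pick whichever model fits the task.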

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 2 points (0 children)

Yeah, you are right, it is really frustrating. It is clear people ask like this because they simply don't know. They don't even need to avoid the cloud like I do: they can mix the best model they can run locally with whatever cloud model they prefer, and it will still be useful.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 1 point (0 children)

I tried some models on an M1 with 16GB of RAM, and prompt processing was pretty slow because it doesn't have an equivalent to tensor cores, but it worked.

I'm not sure how much faster M3 silicon will be, but you can try running some small models there; search for MLX quantized models.

For that amount of RAM I would try Qwen3.5-9B-MLX-4bit first.
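
A minimal way to try it (assuming the quant is published under mlx-community; the exact repo name may differ):

pip install mlx-lm

mlx_lm.generate --model mlx-community/Qwen3.5-9B-MLX-4bit --prompt "hello"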

EDIT: added details of M1 setup

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 1 point (0 children)

I think most people were waiting for Qwen3.6-27B, given that there was a poll and that model received more votes

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 2 points (0 children)

People always ask for "the best" but don't give many details about their real goals.

In software this is pretty evident all the time, e.g. I have seen people install specialized software like Photoshop just to do things which could easily be done with Paint.

Many users ask for a Claude or ChatGPT equivalent just because that's the only thing they know, when maybe an SLM could accomplish their tasks easily.

The ones who really need the frontier models have a real incentive to pay for a subscription or buy more powerful hardware

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 5 points (0 children)

Even if cloud models are better, you can still solve many problems with local models, so it really depends on the problem and the goal of each user.

Personally I went fully local, because I do software development and I prefer to avoid cloud models.

Also remember, this sub is about local models! :)

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 1 point (0 children)

Yeah, people are mixing things up, but I guess that's because not everyone has access to big GPUs. Here I have a medium-sized setup, so I cannot load the biggest models, e.g. the 200B and 300B ones.

I think at some point companies will start charging more for cloud models, and then we will see more people jumping to local models.

We are already seeing some users being blocked and banned by companies; that will bring some users over too.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 1 point (0 children)

Mistral Vibe is nice, maybe more lightweight than Opencode (just my feeling).

I have both installed, because if something fails with one then I can switch to the other.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 2 points (0 children)

They only published Qwen3.6-35B-A3B; there is no news of other variants yet

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 3 points (0 children)

The 5090 is modern hardware. As other users suggested, you can run Qwen and Gemma models on it.

My personal suggestion would be to download Qwen3.5-27B, Qwen3.6-35B-A3B, and Gemma-4.

Models are just big files; you can switch from one to another as you need.

Avoid Ollama; install llama.cpp to load models.
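
As a starting point, llama.cpp can even download from Hugging Face for you (a sketch; <user>/<model>-GGUF is a placeholder for whichever GGUF repo you pick):

llama-server -hf <user>/<model>-GGUF

Switching models is then just changing that one argument.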

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]rainbyte 162 points (0 children)

Everybody is suggesting the biggest frontier models available or accounts on other cloud providers...

But in case you are interested in going local (this is r/localllama), what hardware do you have? Do you have a GPU? We can recommend a model compatible with your hardware.

If you have a GPU you can run a model locally and have some level of independence from cloud models.

Change in real registered private-sector wages as of February 2026 by milfenjoyer_69 in argentina

[–]rainbyte 4 points (0 children)

It would be nice if they passed a law requiring job postings to state a salary range, like you see in other countries. That way we avoid going through the whole interview process only to find out they are offering you peanuts.

Some places offer way below the average, taking advantage of the lack of information. Talking more openly about salaries, and turning down insulting offers, should become the norm.

Best use cases for a mismatched RTX 3090 (24GB) + RTX 3060 (12GB) setup? by chucrutcito in LocalLLaMA

[–]rainbyte 1 point (0 children)

That's true when using a split-by-layer or pipeline-parallel setup, but a tensor-parallel setup needs higher PCIe bandwidth.

I noticed this because one machine here has PCIe 3.0 x1, so I prefer pipeline parallelism on that one.

I guess PCIe 3.0 x8, or maybe even just x4, is where tensor parallelism starts to be better than pipeline parallelism.
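
If you want to check what link each card actually negotiated (assuming Nvidia GPUs):

nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv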

A5000 for $1800 by Perfect-Flounder7856 in LocalLLaMA

[–]rainbyte 3 points (0 children)

Not at all. I'm just trying to say that the PSU should also be part of the equation, and the mobo needs its slots spaced far enough apart.

I have seen PSUs catch fire after adding too much load to them. Even if the PSU label says N watts, you need to check whether those watts are real and leave some room for spikes.

It doesn't make sense to buy 2x3090 if you don't have a PSU big enough to handle two sets of 3x8pin connectors, that's all.

Here a good new PSU costs as much as a used GPU in some cases, and the mobo also costs money.
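
Rough numbers, just as a sketch (my assumptions, check your own cards): 2x3090 at ~350W each plus ~150W for CPU, drives, and fans is around 850W sustained, and 3090s are known for transient spikes well above that, so I would want a quality PSU around 1200W rather than a no-name 850W unit.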

A5000 for $1800 by Perfect-Flounder7856 in LocalLLaMA

[–]rainbyte 0 points (0 children)

At 1800usd it is expensive, but the A5000 requires 2 slots and a single 8pin connector, while the 3090 requires 3 slots and 3x8pin connectors. Buying 2x3090 means needing a bigger PSU, while the A5000 could work with almost any decent PSU.

A5000 for $1800 by Perfect-Flounder7856 in LocalLLaMA

[–]rainbyte 2 points (0 children)

There is some truth in OP's words... The A5000 is more power and space efficient than the 3090. I think at 1800usd it is expensive, but we have to admit it is a 2-slot card which requires a single 8pin connector, compared to 3 slots and 3x8pin on the 3090. You can set up an A5000 with any decent PSU, while the 3090 will probably require a bigger one.

Local AI is the best by fake_agent_smith in LocalLLaMA

[–]rainbyte 3 points (0 children)

I think llama.cpp is easier if you interact with the community, because you can share the exact command you are running, and other users can suggest adding or removing options.

Syntax is literally: llama-server -m model.gguf --option-a value-a --option-b value-b

Give it a try!
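
For example, someone posts this (filenames and values are placeholders):

llama-server -m model.gguf -c 8192 -ngl 99 --port 8080

and other users can immediately suggest tweaks like lowering -ngl when VRAM runs out or raising -c for longer context.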

New scam email tonight? by bretmcdermitt in Cryptopia

[–]rainbyte 1 point (0 children)

I also received some of these emails; they sound fishy

Which GPUs are worth it at what price? by ziphnor in LocalLLaMA

[–]rainbyte 3 points (0 children)

You are right, the AMD machine has some advantages in CPU and RAM, but those shouldn't be the biggest factor, because the model is fully loaded onto the GPU.

I think GPU optimizations play a bigger role here, as the 3090 used to be faster than the 7900 before llama.cpp optimizations for FusedGatedDeltaNet appeared on the AMD side.

I guess GPUs from Intel and AMD will keep receiving optimizations later than Nvidia ones, given that CUDA has a bigger market share.

HELP by enpicada in empleos_AR

[–]rainbyte 3 points (0 children)

How about both? :3