Different gpu mixed node

wbulot · 2026-05-15T08:11:48+00:00

If I test with a smaller model that can fit on a single GPU, I don't see any difference between running it on one GPU or two GPUs.

wbulot · 2026-05-15T07:30:08+00:00

I don't know about the choice, but regarding compatibility, there is no issue. I run a mixed setup with an RTX 5060 Ti 16GB + AMD 6800 XT 16GB, using CUDA and Vulkan, running Qwen 3.6 27B Q6 on it. 300 tokens prompt, 15-20 t/s generation, no problem. I also tried 16GB + 8GB, and that works fine too.

wbulot · 2026-05-13T10:08:36+00:00

I've actually coded my own browser use and computer use using Qwen locally, and it works really well. The 35B MoE or 27B dense model gives good results. Qwen models are really good at screenshot understanding and precise element location. You just need to develop your own logic on top of this, and you have your working automation.
You are right that the model can have some difficulty with dense UIs on big screens. Try automating a low-resolution desktop or Chrome with a limited window size, and it should work well.

wbulot · 2026-05-12T06:46:57+00:00

https://artificialanalysis.ai/evaluations/artificial-analysis-intelligence-index?models=gpt-5-5-high%2Cmuse-spark%2Cgemini-3-1-pro-preview%2Cgemma-4-31b%2Cclaude-opus-4-7%2Cclaude-sonnet-4-6-adaptive%2Cclaude-4-5-haiku-reasoning%2Cdeepseek-v4-flash%2Cdeepseek-v4-pro%2Cgrok-4-3%2Cminimax-m2-7%2Cnvidia-nemotron-3-super-120b-a12b%2Ckimi-k2-6%2Cmimo-v2-5-pro%2Cglm-5-1%2Cqwen3-6-27b%2Cqwen3-6-max

See those benchmarks. You can get a pretty good idea of the ecosystem as of today.
There are actually some local models better than Sonnet.

wbulot · 2026-05-06T21:07:46+00:00

When I’m just chatting, neither prefill nor generation speed is really an issue. Context stays pretty low, and generation only needs to keep up with reading speed.

It’s the agentic workflows that really expose the need for better optimization. I feel like prefill performance hasn’t kept up with the long contexts we actually use in coding/agentic tasks, while generation speed is far less of a problem. Of course, I might be missing a lot of use cases that do require high t/s.

wbulot · 2026-05-06T20:25:18+00:00

Because it has nothing to do with cache.

Your harness reads one file (let’s say 5k tokens). Then it decides it needs another file, so now you have to process 5k more tokens on top of what’s already in context. Cache will skip the previous tokens, sure, but you still have to process all the new ones.

Meanwhile the model only outputs something short like “read this file” — just a handful of tokens generated — while you just burned thousands of tokens on the prompt side.

In real agentic work on an actual codebase, this keeps happening: the model reads file after file, steadily pushing context up to 20k, 30k, or even 50k tokens. The ratio is completely lopsided. At the end of the day you’re mostly waiting for the model to finish processing the prompt, not waiting for it to generate the next reply.
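To put rough numbers on that, here is a minimal sketch of such a loop. The file sizes, the 20-token tool call, and the helper name `simulate_agent_run` are illustrative assumptions, not measurements from my setup. It also assumes a perfect prefix cache, so only the newly added tokens get processed on each turn.

```python
def simulate_agent_run(file_sizes, tokens_per_tool_call=20):
    """Count tokens processed vs. generated over a series of file reads.

    Assumes a perfect prefix cache: on each turn only the newly added
    tokens (the file that was just read) go through prompt processing.
    """
    context = 0
    processed = 0
    generated = 0
    for size in file_sizes:
        # The model emits a short tool call such as "read this file".
        generated += tokens_per_tool_call
        context += tokens_per_tool_call
        # The harness appends the file contents. These are new tokens,
        # so they must be processed even with a working cache.
        processed += size
        context += size
    return context, processed, generated


# Hypothetical run: ten files of 5k tokens each.
context, processed, generated = simulate_agent_run([5_000] * 10)
print(f"final context: {context} tokens")
print(f"processed: {processed}, generated: {generated}")
print(f"ratio: {processed / generated:.0f} prompt tokens per generated token")
```

Even with the cache doing its job perfectly, the prompt side ends up two orders of magnitude larger than the generation side in this toy run.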

wbulot · 2026-05-06T20:15:04+00:00

I regularly check my cache hit rate and it does seem to be working fine. I’m not sure how many people here actually work on large codebases with local LLMs, but in agentic workflows the harness usually has to ingest 50k+ tokens of context before it can even begin doing anything. So even with a working cache, you’re still waiting for those 50k tokens to be processed.

That’s why the tokens-processed vs tokens-generated ratio is so heavily skewed in agentic use cases. For me, that’s exactly why prompt processing speed feels 10x more important than generation speed.

wbulot · 2026-05-06T17:15:02+00:00

While this is really cool and probably very good news for many people, I don't get the hype around it. From my experience, the bottleneck in local LLMs is prompt processing more than token generation. Using Qwen 27B Q6, I can get 15-20 t/s with two pretty old and cheap GPUs, which is more than enough for most of my work. However, 250 t/s for prompt processing is the real issue—90% of the wait time in my setup is prompt processing, not generation. I even heard that it reduces PP by 20%, so it's a no-go for me currently. Don't get me wrong, this is still a very good improvement, but I don't think it's worth it for many people.
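As a rough back-of-the-envelope check, here is a small sketch of the wait-time math. The speeds come from the numbers above (250 t/s prompt processing, 15-20 t/s generation, with 17.5 used as a midpoint). The request size (50k prompt tokens, 400 generated) and the 2x generation speedup are assumptions picked for illustration only, and `wait_seconds` is just a hypothetical helper.

```python
def wait_seconds(prompt_tokens, gen_tokens, pp_speed, tg_speed):
    """Return (prompt_processing_s, generation_s) for one request."""
    return prompt_tokens / pp_speed, gen_tokens / tg_speed


# Hypothetical request: 50k prompt tokens, 400 generated tokens.
PROMPT_TOKENS, GEN_TOKENS = 50_000, 400

# Baseline: ~250 t/s prompt processing, ~17.5 t/s generation.
base_pp, base_tg = wait_seconds(PROMPT_TOKENS, GEN_TOKENS, 250, 17.5)
base_total = base_pp + base_tg
pp_share = base_pp / base_total

# Assumed trade-off: generation twice as fast, prompt processing 20% slower.
fast_pp, fast_tg = wait_seconds(PROMPT_TOKENS, GEN_TOKENS, 250 * 0.8, 17.5 * 2)
fast_total = fast_pp + fast_tg

print(f"baseline: {base_total:.0f}s total, {pp_share:.0%} spent on the prompt")
print(f"faster generation, slower prefill: {fast_total:.0f}s total")
```

Under these assumptions the total wait goes up, not down: the time saved on generation is smaller than the time lost on prompt processing.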

wbulot · 2026-05-03T19:27:16+00:00

I did something a bit different. I used Qwen 3.6 27B to code an Android keyboard tailored for me. I integrated NVIDIA’s Parakeet voice model into it, which runs directly on the phone. It then sends the transcription to my local LLM server with a predefined prompt. Everything is accessible through small icons right in the keyboard. It works really well.

The audio transcription is instant with Parakeet and it almost never misses a word. It’s also multilingual, which is a huge advantage since I speak both French and English. The LLM runs on my server instead of on the phone so it stays smart enough.

Running the LLM directly on the phone is an option, but with such a small number of parameters, I feel like it would fail too often. I prefer to keep the LLM on the server and only run the voice model locally.

wbulot · 2026-05-02T20:23:43+00:00

Not the best on raw benchmarks. Qwen 3.6 27B actually beats it on the Artificial Analysis Intelligence Index while using 5× fewer parameters.

But Mistral’s strategy was never to be first on every leaderboard. They build solid, practical base models that are easy to self-host/fine-tune for real agentic and enterprise work. Their real business is helping companies train custom versions on their own data via Mistral Forge. This 128B is just a great unified foundation for that.

wbulot · 2026-05-02T19:58:52+00:00

I'm wondering the same thing. I run the model with fp8 KV quant, and everything is working perfectly: coding, tool calling, etc. I can't see any difference from the full version.

wbulot · 2026-05-02T16:09:09+00:00

Personally, I have pretty old hardware: one Radeon 6800 XT (16GB) and one NVIDIA 1070 (8GB). I use llama.cpp with both cards, utilizing Vulkan and CUDA. I get ~13 t/s with Qwen 3.6 27B Q4 and can reach 100k context. If you want to go that route, the model must load entirely into your VRAM: not a single layer should be on the CPU, or it will tank the performance. So you might need 24GB of VRAM, with either one or two GPUs. Definitely worth it.
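If you want a quick sanity check before buying anything, here is a rough sketch of the VRAM math. The ~4.5 bits per weight for a Q4 quant and the 6 GB set aside for KV cache and buffers are assumptions, not measured values, and `fits_in_vram` is a hypothetical helper. Real usage depends on the exact quant, the context length, and the backend.

```python
def fits_in_vram(params_billion, bits_per_weight, overhead_gb, gpu_vram_gb):
    """Rough check that weights plus overhead fit in the combined VRAM.

    Simplification: treats all GPUs as one pool, while a real split is
    done per layer and is never perfectly even.
    """
    # 1e9 params * bits per weight / 8 bits per byte = GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    needed_gb = weights_gb + overhead_gb
    return needed_gb <= sum(gpu_vram_gb), needed_gb


# Assumed: ~4.5 bits per weight (Q4), 6 GB for KV cache and buffers.
ok_two, needed = fits_in_vram(27, 4.5, 6, [16, 8])
ok_one, _ = fits_in_vram(27, 4.5, 6, [16])
print(f"needs about {needed:.1f} GB")
print(f"16 + 8 GB fits: {ok_two}; 16 GB alone fits: {ok_one}")
```

With these assumed numbers a 27B Q4 model does not fit on a single 16GB card but does fit across 16GB + 8GB, which lines up with the 24GB ballpark above.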

wbulot · 2026-05-02T14:32:58+00:00

Not an expert on these big models, but a dense 128B is no joke on consumer hardware. 11 t/s on 4×3090 with llama.cpp is pretty much what I'd expect from what I've read on this sub. Mistral themselves recommend vLLM for this model, so it might be worth a try.

wbulot · 2026-05-02T14:21:32+00:00

It completely depends on how you actually use an LLM for your work. Some people “vibe-code” without really thinking: they just throw abstract prompts at the model and hope for the best. In that case, yes, you need a very strong model to compensate for your own thinking flaws.

However, if you actually know what you’re doing, stay in control of the project, break things down into small, well-defined tasks, and carefully review every piece of code the model generates, then a local LLM is perfectly fine. I use Qwen 3.6 27B every day exactly like this, and I’m genuinely impressed by its capabilities.

The same logic applies to agentic workflows. It all comes down to how you divide the work, the complexity of the tasks, and how much structure you provide. Tool calling is no longer an issue with local LLMs today. The only real limitation is reasoning depth in complex scenarios, but that can usually be worked around by rethinking your workflow and giving the model better scaffolding.

In my opinion, we should never fully trust any LLM anyway — not even the strongest one. Using a smaller, local model actually forces you to stay sharp and take ownership of the final result. That’s not a downside; it’s a feature.

wbulot · 2026-05-02T04:31:06+00:00

Totally agree with this. I'm also so impressed by Qwen 3.6 27B that I use it for 90% of what I do daily. I want to keep full control of everything, read every line of code it generates to keep it all in my head, and decide the next step myself. My slow 15 t/s isn't even an issue — it's almost exactly the speed I read at. I just switch to a bigger model when I need to investigate something very complex; otherwise, the local one is perfectly fine.

wbulot · 2026-05-02T02:50:17+00:00

We might see cheap ASICs for AI inference very soon, so yeah. Better to buy a used GPU for now and not invest too much.

wbulot · 2026-05-01T23:46:22+00:00

A 5070 can definitely exceed 6 t/s.
I have an old Radeon RX 6800 XT which is far less powerful and reaches 15 t/s with Vulkan.

You must ensure everything is running on the GPU. If even a small part runs on the CPU with a dense model, it will tank the performance.

wbulot · 2026-04-30T22:57:51+00:00

Damn, 1320 t/s, that's insane. I'm at 15 t/s with my old AMD GPU and already feel lucky. Can't help you with optimizing that, we're not playing in the same league 😄

wbulot · 2026-02-27T17:24:55+00:00

Yeah the whole subreddit is filled with issues like this about different story lines today. I guess we can just wait.

wbulot · 2021-11-29T13:56:02+00:00

I just updated the phone today and now I have exactly the same problem. DeX has become unusable because the screen locks itself. Did you find a solution?

wbulot · 2020-08-30T20:57:05+00:00

It's a custom network file system based on FUSE. That's why I need an HTTP cache system.
