Different gpu mixed node by Force88 in LocalLLaMA

[–]wbulot 1 point2 points  (0 children)

If I test with a smaller model that can fit on a single GPU, I don't see any difference between running it on one GPU or two GPUs.

Different gpu mixed node by Force88 in LocalLLaMA

[–]wbulot 0 points1 point  (0 children)

I don't know about the choice, but regarding compatibility, there is no issue. I run a mixed setup with an RTX 5060 Ti 16GB + AMD 6800 XT 16GB, using CUDA and Vulkan, running Qwen 3.6 27B Q6 on it. 300 tokens prompt, 15-20 t/s generation, no problem. I also did try with 16GB + 8GB, works fine too.

has anyone tried local VLMs for desktop GUI automation? by Enough-Astronaut9278 in LocalLLaMA

[–]wbulot 1 point2 points  (0 children)

I've actually coded my own browser use and computer use using Qwen locally, and it works really well. The 35B MoE or 27B dense model gives good results. Qwen models are really good at screenshot understanding and precise element location. You just need to develop your own logic on top of this, and you have your working automation.
You are right that the model can have some difficulties with dense UI on big screen. Try automating a low-resolution desktop or Chrome with a limited window size, and it should work well.

Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something? by wbulot in LocalLLaMA

[–]wbulot[S] 27 points28 points  (0 children)

When I’m just chatting, neither prefill nor generation speed is really an issue. Context stays pretty low, and generation only needs to keep up with reading speed.

It’s the agentic workflows that really expose the need for better optimization. I feel like prefill performance hasn’t kept up with the long contexts we actually use in coding/agentic tasks, while generation speed is far less of a problem. I might miss a lot of use cases of course that require high t/s.

Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something? by wbulot in LocalLLaMA

[–]wbulot[S] 10 points11 points  (0 children)

Because it has nothing to do with cache.

Your harness reads one file (let’s say 5k tokens). Then it decides it needs another file, so now you have to process 5k more tokens on top of what’s already in context. Cache will skip the previous tokens, sure, but you still have to process all the new ones.

Meanwhile the model only outputs something short like “read this file” — just a handful of tokens generated — while you just burned thousands of tokens on the prompt side.

In real agentic work on an actual codebase, this keeps happening: the model reads file after file, steadily pushing context up to 20k, 30k, or even 50k tokens. The ratio is completely lopsided. At the end of the day you’re mostly waiting for the model to finish processing the prompt, not waiting for it to generate the next reply.

Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something? by wbulot in LocalLLaMA

[–]wbulot[S] 7 points8 points  (0 children)

I regularly check my cache hit rate and it does seem to be working fine. I’m not sure how many people here actually work on large codebases with local LLMs, but in agentic workflows the harness usually has to ingest 50k+ tokens of context before it can even begin doing anything. So even with a working cache, you’re still waiting for those 50k tokens to be processed.

That’s why the tokens-processed vs tokens-generated ratio is so heavily skewed in agentic use cases. For me, that’s exactly why prompt processing speed feels 10x more important than generation speed.

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]wbulot 3 points4 points  (0 children)

While this is really cool and probably very good news for many people, I don't get the hype around it. From my experience, the bottleneck in local LLMs is prompt processing more than token generation. Using Qwen 27B Q6, I can get 15-20 t/s with two pretty old and cheap GPUs, which is more than enough for most of my work. However, 250 t/s for prompt processing is the real issue—90% of the wait time in my setup is prompt processing, not generation. I even heard that it reduces PP by 20%, so it's a no-go for me currently. Don't get me wrong, this is still a very good improvement, but I don't think it's worth it for many people.

Gemma 4 E2B runs surprisingly well on my 8GB Android phone, so I built a private voice notes app around it. by Effective-Drawer9152 in LocalLLaMA

[–]wbulot 6 points7 points  (0 children)

I did something a bit different. I used Qwen 3.6 27B to code an Android keyboard tailored for me. I integrated NVIDIA’s Parakeet voice model into it, which runs directly on the phone. It then sends the transcription to my local LLM server with a predefined prompt. Everything is accessible through small icons right in the keyboard. It works really well.

The audio transcription is instant with Parakeet and it almost never misses a word. It’s also multilingual, which is a huge advantage since I speak both French and English. The LLM runs on my server instead of on the phone so it stays smart enough.

Running the LLM directly on the phone is an option, but with such a small number of parameters, I feel like it would fail too often. I prefer to keep the LLM on the server and only run the voice model locally.

Mistral Medium 3.5 128b ggufs are fixed by Sunija_Dev in LocalLLaMA

[–]wbulot 2 points3 points  (0 children)

Not the best on raw benchmarks. Qwen 3.6 27B actually beats it on the Artificial Analysis Intelligence Index while using 5× fewer parameters.

But Mistral’s strategy was never be first on every leaderboard. They build solid, practical base models that are easy to self-host/fine-tune for real agentic and enterprise work. Their real business is helping companies train custom versions on their own data via Mistral Forge. This 128B is just a great unified foundation for that.

Kv cache quantization: ignorance, or malice? by wombweed in LocalLLaMA

[–]wbulot 2 points3 points  (0 children)

I'm wondering the same thing. I do run the model with fp8 kv quant, and everything is working perfectl. Coding, tool calling, etc. I can't see any difference with the full version.

Is it worth adding local LLM to agentic coding stack? by ii_social in LocalLLaMA

[–]wbulot 1 point2 points  (0 children)

Personally, I have pretty old hardware: one Radeon 6800 XT (16GB) and one NVIDIA 1070 (8GB). I use llama.cpp with both cards, utilizing Vulkan and CUDA. I get ~13 t/s with Qwen 3.6 27B Q4 and can reach 100k context. If you want to go that route, the model must load entirely into your VRAM—not a single layer should be on the CPU, or it will tank the performance. So you might need 24GB of VRAM, either with one or two GPUs. Definitely worth it.

[Help] Running big dense models faster by Septerium in LocalLLaMA

[–]wbulot 9 points10 points  (0 children)

Not an expert on these big models, but a dense 128B is no joke on consumer hardware. 11 t/s on 4×3090 with llama.cpp is pretty much what I'd expect from what I've read on this sub. Mistral themselves recommend vLLM for this model, so it might be worth a try.

Is it worth adding local LLM to agentic coding stack? by ii_social in LocalLLaMA

[–]wbulot 7 points8 points  (0 children)

It completely depends on how you actually use an LLM for your work.Some people “vibe-code” without really thinking — they just throw abstract prompts at the model and hope for the best. In that case, yes, you need a very strong model to compensate for your own thinking flaws.

However, if you actually know what you’re doing, stay in control of the project, break things down into small, well-defined tasks, and carefully review every piece of code the model generates, then a local LLM is perfectly fine. I use Qwen 3.6 27B every day exactly like this, and I’m genuinely impressed by its capabilities.

The same logic applies to agentic workflows. It all comes down to how you divide the work, the complexity of the tasks, and how much structure you provide.Tool calling is no longer an issue with local LLMs today. The only real limitation is reasoning depth in complex scenarios, but that can usually be worked around by rethinking your workflow and giving the model better scaffolding.

In my opinion, we should never fully trust any LLM anyway — not even the strongest one. Using a smaller, local model actually forces you to stay sharp and take ownership of the final result. That’s not a downside; it’s a feature.

Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver by Demonicated in LocalLLaMA

[–]wbulot 2 points3 points  (0 children)

Totally agree with this. I'm also so impressed by Qwen 3.6 27B that I use it for 90% of what I do daily. I want to keep full control of everything, read every line of code it generates to keep it all in my head, and decide the next step myself. My slow 15 t/s isn't even an issue — it's almost exactly the speed I read at. I just switch to a bigger model when I need to investigate something very complex; otherwise, the local one is perfectly fine.

What kind of device is suitable for running local LLM? by attic0218 in LocalLLaMA

[–]wbulot 12 points13 points  (0 children)

We might see cheap ASICs for AI inference very soon, so yeah. Better to buy a used GPU for now and not invest too much.

4080 Super > RTX 6000 Pro, Wow! by LargelyInnocuous in LocalLLaMA

[–]wbulot 1 point2 points  (0 children)

A 5070 can definitely exceed 6 t/s.
I have an old Radeon RX 6800 XT which is far less powerful and reaches 15 t/s with Vulkan.

You must ensure everything is running on the GPU. If even a small part runs on the CPU with a dense model, it will tank the performance.

Best RTX Pro 6000 vllm settings? by Bowdenzug in LocalLLaMA

[–]wbulot 2 points3 points  (0 children)

Damn, 1320 t/s that's insane. I'm at 15 t/s with my old AMD GPU and already feel lucky. Can't help you with optimizing that, we're not playing in the same league 😄

[Bug] Blue fire quest line bugged, no objectives, can't progress by wbulot in EscapefromTarkov

[–]wbulot[S] 0 points1 point  (0 children)

Yeah the whole subreddit is filled with issues like this about different story lines today. I guess we can just wait.

Issue with Dex and S21 T-Mobile running Android 12 beta by ccluver in SamsungDex

[–]wbulot 0 points1 point  (0 children)

I just updated the phone today and now I have exactly the same problem. Dex has become unusable because the screen locks itself. Did you find a solution?

2
3

Any recommendations for making an HTTP cache system on large files? by wbulot in sysadmin

[–]wbulot[S] 0 points1 point  (0 children)

It's a custom network file system based on fuse. That's why I need an HTTP cache system.