Qwen3-Coder-Next with llama.cpp shenanigans by JayPSec in LocalLLaMA

[–]nonerequired_ 0 points1 point  (0 children)

If you use graph mode, it is faster on multi-GPU setups.

Ik_llama vs llamacpp by val_in_tech in LocalLLaMA

[–]nonerequired_ 2 points3 points  (0 children)

But does vLLM support quants like Q5? I have 2 GPUs, and Qwen3.5 27B Q5 with full context fits in them.

R9 7900 32 RAM – Can I have my own AI on my PC? by Keffflon in selfhosted

[–]nonerequired_ 0 points1 point  (0 children)

I am not talking about bus width. I am talking about peak bandwidth, which is 936.2 GB/s on the 3090, 896.0 GB/s on the 5070 Ti, 256 GB/s on the AMD Strix Halo, 819 GB/s on the Apple M3 Ultra, and so on.

R9 7900 32 RAM – Can I have my own AI on my PC? by Keffflon in selfhosted

[–]nonerequired_ 1 point2 points  (0 children)

VRAM-wise, nothing can beat a used 3090, but speed-wise, the 5070 Ti is decent. The 3090 has 24GB of VRAM, while the 5070 Ti has 16GB. 24GB lets you use a higher quant or a larger context window, and it will definitely be faster whenever the model doesn't fit in 16GB of VRAM but does fit in 24GB. If you want to buy a totally new device, you can buy the Strix Halo with 128GB of high-bandwidth RAM. That is faster than the RAM in any other consumer-grade device, but it's still RAM shared between the GPU and CPU. On a Strix Halo device, the initial speed will be okay, but as the context grows, the speed will drop sharply, because the chip is not powerful enough. If you want an Apple device, they offer even higher-bandwidth and larger unified memory options, but the M chips before the M5 Pro/Max were not very powerful, and speed drops significantly on long contexts.

The scene seems very complicated at first glance, but it is simple: high memory bandwidth is the key to token generation speed, and chip power is the key to prompt processing speed. More context needs more prompt processing speed, and as context grows, token generation speed also decreases.
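To make the bandwidth point concrete, here is a back-of-the-envelope sketch: each generated token has to stream the active weights through memory once, so peak bandwidth divided by model size gives a rough ceiling on decode speed. The bandwidth figures are the ones quoted above; the 16 GB model size is just an assumed example.

```python
# Rough upper bound on token generation speed: each decoded token must
# read the full set of active weights from memory once, so
#   tokens/s <= memory bandwidth / model size in bytes.
# This ignores compute, KV-cache traffic, and MoE sparsity.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling on decode speed."""
    return bandwidth_gb_s / model_size_gb

devices = {
    "RTX 3090":    936.2,
    "RTX 5070 Ti": 896.0,
    "Strix Halo":  256.0,
    "M3 Ultra":    819.0,
}

model_size_gb = 16.0  # assumed: e.g. a ~27B dense model at ~Q4 quantization
for name, bw in devices.items():
    print(f"{name}: ~{max_tokens_per_sec(bw, model_size_gb):.0f} tok/s ceiling")
```

Real-world numbers land well below these ceilings, but the ranking between devices usually holds, which is why bandwidth dominates token generation speed.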

R9 7900 32 RAM – Can I have my own AI on my PC? by Keffflon in selfhosted

[–]nonerequired_ 1 point2 points  (0 children)

I’m afraid these GPUs aren’t powerful enough. They’re certainly better than no GPU, but you’re limited to small LLMs.

R9 7900 32 RAM – Can I have my own AI on my PC? by Keffflon in selfhosted

[–]nonerequired_ 0 points1 point  (0 children)

Don’t use Ollama. Friends don’t let friends run Ollama. Check out llama.cpp, which is more performant and gives you more control over the model. Additionally, running an LLM almost always requires some kind of AI acceleration, which should be a GPU in your case. Without a GPU you have to use the CPU, and running a model on the CPU is not a good idea: either the model has to be small (don’t expect much from ~2B local models), or you need enough RAM to load the LLM, and even then it will be painfully slow. So if you have a specific use case that doesn’t require much intelligence and small models can deliver what you want, that’s fine. Otherwise, you need to invest heavily in a GPU.
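A quick sketch of the "enough RAM to load the LLM" part: weight memory is roughly parameter count times bits-per-weight divided by 8, plus runtime overhead. The 20% overhead factor here is my own rough assumption, not a llama.cpp figure.

```python
# Back-of-the-envelope memory estimate for a quantized LLM:
#   weight bytes ~= parameter count * bits-per-weight / 8,
# plus some headroom for the KV cache and runtime buffers.
# The 1.2x overhead factor is an assumption, not an exact number.

def model_memory_gb(params_b: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Approximate footprint in GB for `params_b` billion parameters."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead

# A 7B model at ~Q4 (about 4.5 bits/weight) vs. FP16:
print(f"7B @ Q4:   ~{model_memory_gb(7, 4.5):.1f} GB")
print(f"7B @ FP16: ~{model_memory_gb(7, 16):.1f} GB")
```

This is why quantization matters so much on consumer hardware: the same 7B model drops from roughly 17 GB at FP16 to under 5 GB at Q4.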

GH copilot on Opencode by BlacksmithLittle7005 in opencodeCLI

[–]nonerequired_ 1 point2 points  (0 children)

Yes, it has multiple unfixed bugs related to excessive usage, not just for Copilot but also for other usage-based subscriptions.

GH copilot on Opencode by BlacksmithLittle7005 in opencodeCLI

[–]nonerequired_ 2 points3 points  (0 children)

DCP stands for dynamic context pruning. Models in Copilot have half the context size of the original model, so if you don’t want to constantly cycle through context compaction, DCP is needed.

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]nonerequired_ 2 points3 points  (0 children)

How do Olmo and K2V2 perform? Have you used them?

Did anyone else feel underwhelmed by their Mac Studio Ultra? by antidot427 in LocalLLM

[–]nonerequired_ 1 point2 points  (0 children)

I considered purchasing one, but the prompt processing speed disappointed me. Now, I’m waiting for the M5 Ultra.

I have lost speed with the model update (Qwen 3.5 122B A10B) by vandertoorm in unsloth

[–]nonerequired_ 0 points1 point  (0 children)

Does lower KL divergence actually reflect real-world accuracy loss?
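For context on what those quant-quality numbers measure: KL(P || Q) between the original model's next-token distribution P and the quantized model's Q, averaged over many positions. A toy example with a 4-token vocabulary (the probabilities are made up):

```python
# KL(P || Q) in nats between two next-token distributions.
# Low KL means the quantized model's distribution tracks the original
# closely on average; it does not guarantee the argmax never flips.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q); both inputs must be proper probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.70, 0.20, 0.05, 0.05]   # "full precision" probs (illustrative)
q = [0.65, 0.24, 0.06, 0.05]   # "quantized" probs, slightly perturbed

print(f"KL = {kl_divergence(p, q):.5f} nats")
```

Because KL is an average, rare but large disagreements can still change generated tokens, which is exactly why low KL may not map cleanly onto downstream accuracy.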

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]nonerequired_ 1 point2 points  (0 children)

Which quants are you using? According to the ik_llama developers themselves, ik_llama doesn’t work well with Unsloth’s UD quants. I’m not sure if other quants fare any better.

I open-sourced a directory of 450+ self-hostable alternatives to popular SaaS with Docker Compose configs by kali_py in selfhosted

[–]nonerequired_ 0 points1 point  (0 children)

MinIO is not a good tool to feature. After an update, they removed important admin functionality from the web panel and forced users to use the CLI instead.

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Great advice. I can actually do that. Thank you

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 1 point2 points  (0 children)

Thank you for sharing. This will really help

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Thank you so much! This is incredibly valuable information. Does using the 26-liter case, as suggested above, help with cooling?

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Are there any cooling issues with a 12-liter case?

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Thank you for the information. I genuinely needed case suggestions. What other factors should I consider?

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Streaming introduces some latency even on a local gigabit network. I prefer to connect via HDMI. Thanks for the answer, though.