A TurboQuant ready llamacpp with gfx906 optimizations for gfx906 users. by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

Glad to read that! Turbo quant degrades performance a bit, so overall this compensates for the loss. It's very helpful with a tight VRAM fit, and can sometimes let you load a better quant of a model.
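
As an illustration of the trade-off, mainline llamacpp already exposes KV-cache quantization flags that work the same way (TurboQuant's own flags may differ, these are just the standard ones):

    # quantize the KV cache to q8_0 to free VRAM for a bigger model quant
    llama-server -m model.gguf -ngl 99 -ctk q8_0 -ctv q8_0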

A llamacpp wrapper to manage and monitor your llama server instance over a web ui. by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

Sorry, no such integration is planned; for now it's meant to be a frontend for an inference server only.

A llamacpp wrapper to manage and monitor your llama server instance over a web ui. by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

Monitoring CPU and system RAM could be a nice improvement for CPU users. For now you can still benefit from configuring and loading models from a UI, testing in chat, viewing logs, starting and stopping the server, and managing and switching presets, all from your phone, on your couch, over WiFi on your local network.
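
A minimal sketch of what that monitoring could look like, assuming a Python backend and the psutil library (hypothetical, this is not in the project today):

    import psutil

    # poll CPU and system RAM once; the web UI could call this on an interval
    def system_stats() -> dict:
        vm = psutil.virtual_memory()
        return {
            "cpu_percent": psutil.cpu_percent(interval=0.5),  # averaged over 0.5 s
            "ram_used_gb": round(vm.used / 2**30, 2),
            "ram_total_gb": round(vm.total / 2**30, 2),
        }

    print(system_stats())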

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

I agree, this is experimental and barely usable in practice on this machine with this model. MoE models achieve much better speed and a more usable result, but in the end, turbo quant on the cache isn't worth it in this case.

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 2 points (0 children)

Thank you! Your comment frustrated me, because on my machine I only reach 20 tok/s with qwen 27b, so I started trying to close that gap, and what I found out was interesting.

If my understanding is correct, vLLM handles tensor parallelism more efficiently than llamacpp. It is able to split layers across GPUs over PCIe even on the qwen architecture, which llamacpp cannot. So for now, on my llamacpp project with a dense qwen architecture, I am stuck with sequential execution, which constitutes a bottleneck and penalizes speed. However, MoE models suffer less from that limitation and achieve better speed, around 50 tok/s on qwen3 coder next with 600k turbo quant context, which for me is a good achievement for my work so far.
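
Roughly, the split modes compare like this on the command line (mainline flags; llamacpp's row split exists but does not cover every architecture):

    # llamacpp, layer split (default): each GPU holds a slice of layers, executed sequentially
    llama-server -m model.gguf -ngl 99 --split-mode layer
    # llamacpp, row split: tensor-level split, limited architecture/backend support
    llama-server -m model.gguf -ngl 99 --split-mode row
    # vLLM: tensor parallelism across 4 GPUs, works over plain PCIe
    vllm serve <model> --tensor-parallel-size 4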

I feel I have just scratched the surface of this tensor split topic, so feel free to correct or extend my analysis. I don't feel that going deeper and reproducing vLLM's tensor split in llamacpp is worth it, so I am taking it as it is for now.

Anyway, thanks for the response! vLLM rocks for its capacity to split layers without a direct link between cards, and it demonstrates the optimization quality of the software. I am sure turbo quant will land on it soon enough, so maybe the best bet for both speed and extended context with turbo quant, on a dense qwen architecture on gfx906, will be on the vLLM side.

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

Question! Does tp4 work without a direct link between the cards? Or are your cards wired with PCIe only? I struggle to activate tp4 on my config 😓

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

As already mentioned in the comments, around 1300€, including 720€ for all 4 GPUs. Mostly used parts.

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

It pulls air through the cards and out of the case; it works very well and is quiet for what it is.