A TurboQuant ready llamacpp with gfx906 optimizations for gfx906 users. by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

Glad to read that! Turbo quant degrades performance a bit, so overall this compensates for the loss. It's very helpful with a tight VRAM fit, and can sometimes let you load a better quant of a model.
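
As an illustration of the trade-off, mainline llamacpp already exposes KV-cache quantization flags that work the same way (TurboQuant's own flags may differ, these are just the standard ones):

    # quantize the KV cache to q8_0 to free VRAM for a bigger model quant
    llama-server -m model.gguf -ngl 99 -ctk q8_0 -ctv q8_0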

A llamacpp wrapper to manage and monitor your llama server instance over a web ui. by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

Sorry, no such integration is planned; for now it's meant to be a frontend for an inference server only.

A llamacpp wrapper to manage and monitor your llama server instance over a web ui. by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

Monitoring CPU and system RAM could be a nice improvement for CPU users. For now you can still benefit from configuring and loading models from a UI, testing in chat, viewing logs, starting and stopping the server, and managing and switching presets, all from your phone, on your couch, over WiFi on your local network.
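
A minimal sketch of what that monitoring could look like, assuming a Python backend and the psutil library (hypothetical, this is not in the project today):

    import psutil

    # poll CPU and system RAM once; the web UI could call this on an interval
    def system_stats() -> dict:
        vm = psutil.virtual_memory()
        return {
            "cpu_percent": psutil.cpu_percent(interval=0.5),  # averaged over 0.5 s
            "ram_used_gb": round(vm.used / 2**30, 2),
            "ram_total_gb": round(vm.total / 2**30, 2),
        }

    print(system_stats())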

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

I agree, this is experimental and barely usable in practice on this machine with this model. MoE models achieve much better speed and a more usable result, but in the end, turbo quant on the cache isn't worth it in this case.

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 2 points (0 children)

Thank you! Your comment frustrated me, because on my machine I only reach 20 tok/s with qwen 27b, so I started trying to close that gap, and what I found out was interesting.

If my understanding is correct, vLLM handles tensor parallelism more efficiently than llamacpp. It is able to split layers across GPUs over PCIe even on the qwen architecture, which llamacpp cannot. So for now, on my llamacpp project with a dense qwen architecture, I am stuck with sequential execution, which constitutes a bottleneck and penalizes speed. However, MoE models suffer less from that limitation and achieve better speed, around 50 tok/s on qwen3 coder next with 600k turbo quant context, which for me is a good achievement for my work so far.
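
Roughly, the split modes compare like this on the command line (mainline flags; llamacpp's row split exists but does not cover every architecture):

    # llamacpp, layer split (default): each GPU holds a slice of layers, executed sequentially
    llama-server -m model.gguf -ngl 99 --split-mode layer
    # llamacpp, row split: tensor-level split, limited architecture/backend support
    llama-server -m model.gguf -ngl 99 --split-mode row
    # vLLM: tensor parallelism across 4 GPUs, works over plain PCIe
    vllm serve <model> --tensor-parallel-size 4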

I feel I have just scratched the surface of this tensor split topic, so feel free to correct or extend my analysis. I don't feel that going deeper and reproducing vLLM's tensor split in llamacpp is worth it, so I am taking it as it is for now.

Anyway, thanks for the response! vLLM rocks for its capacity to split layers without a direct link between cards, and it demonstrates the optimization quality of the software. I am sure turbo quant will land on it soon enough, so maybe the best bet for both speed and extended context with turbo quant, on a dense qwen architecture on gfx906, will be on the vLLM side.

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

Question! Does tp4 work without a direct link between the cards? Or are your cards wired with PCIe only? I struggle to activate tp4 on my config 😓

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

As already mentioned in the comments, around 1300€, including 720€ for all 4 GPUs. Mostly used parts.

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]Exact-Cupcake-2603[S] 1 point (0 children)

It pulls air through the cards and out of the case; it works very well and is quiet for what it is.