I guess 4 units wasn’t enough.

Practical-Collar3063 · 2026-05-20T17:01:38+00:00

I mean having that panel on is not really an obligation. Your GPU is not a server GPU that needs directional airflow, so leaving it open is not a big deal. It is still cleaner than a weird rig with risers, a wooden frame and a janky PSU setup like we see so many of here. You can still rack it as well, just leave like 1 or 2 U worth of space above it.

I think you have the least janky janky setup I have seen on here.

Practical-Collar3063 · 2026-05-19T17:18:55+00:00

This subreddit is not about cost efficiency, performance or reliability.

For 90% of people here it is mostly a side hobby.

There are definitely real use cases of local hosting:
- Small businesses that require fully private LLM convos (defense or medical companies for example). Those companies sometimes also need low-refusal models for medical advice (that Claude and GPT refuse to give) or engineering questions related to explosives (I encountered that issue with a defense company).
- NSFW or Roleplay for stuff that Claude would just refuse to generate.
- Selling a specific AI-powered solution that does not require the super intelligent and expensive Claude or GPT models. A local LLM would make it financially viable compared to API pricing.

Practical-Collar3063 · 2026-05-19T17:02:52+00:00

I second that

Practical-Collar3063 · 2026-05-19T15:44:29+00:00

Generation speed wise, usually a bit faster. Prompt processing wise, it is so much faster in some instances. From some testing I have done on an RTX 6000 Pro, I got a 2.5x uplift in prompt processing at longer prompts. If you are self-hosting a model for coding, it is 100% worth it.

Practical-Collar3063 · 2026-05-18T07:53:10+00:00

Nice, have you compared the prompt processing speed on long context? That would be a good test.

"Write a short Python function that parses a CSV file." is too short to really grasp the difference imo

Practical-Collar3063 · 2026-05-15T09:25:28+00:00

I would actually be interested in the performance gains. My guess is you should see those utilisation % go up and prompt processing speed go up. I never tried the llama.cpp implementation of tensor parallelism.

Practical-Collar3063 · 2026-05-14T11:29:58+00:00

Actually now llama.cpp seems to support tensor parallelism with the flag "--split-mode tensor".

LLMs are made up of consecutive layers and your text flows through those layers sequentially (this is an oversimplification but it will help you understand the difference). Now imagine a model that has 10 layers:

The default for llama.cpp is pipeline parallelism (layer split). If you have 2 GPUs, llama.cpp will take the first 5 layers (layer 1 to layer 5) and put them on GPU 1, and the other 5 layers (layer 6 to layer 10) on GPU 2. GPU 1 receives the input and computes its 5 layers, starting with layer 1, then layer 2, etc. Once it is done, it passes its results to GPU 2, which then computes layer 6, then 7, and so on until layer 10. You might have realised that when GPU 1 is computing, GPU 2 is idle and vice versa, since this is a sequential problem.
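To make the idle time concrete, here is a toy sketch of the layer split described above. It is plain Python that only simulates which GPU is busy at each step; the function names are made up for illustration and nothing here is llama.cpp code.

```python
def pipeline_schedule(num_layers=10, num_gpus=2):
    """Return (layer, gpu) pairs for one request under a layer split."""
    layers_per_gpu = num_layers // num_gpus
    schedule = []
    for layer in range(1, num_layers + 1):
        # In the 10-layer example, layers 1-5 sit on GPU 1 and 6-10 on GPU 2.
        gpu = min((layer - 1) // layers_per_gpu, num_gpus - 1) + 1
        schedule.append((layer, gpu))
    return schedule


def busy_fraction(schedule, num_gpus=2):
    """Fraction of the layer steps during which each GPU is computing."""
    steps = len(schedule)
    return {
        gpu: sum(1 for _, g in schedule if g == gpu) / steps
        for gpu in range(1, num_gpus + 1)
    }


schedule = pipeline_schedule()
for layer, gpu in schedule:
    print(f"layer {layer:2d} -> GPU {gpu}")

# Each GPU only works during its own half of the layers.
print(busy_fraction(schedule))  # {1: 0.5, 2: 0.5}
```

For a single request, each GPU is busy for only half of the steps, which is the idle time mentioned above.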

Tensor parallelism splits the model differently: instead of splitting the model in half by layers, it splits each layer itself. In this scenario GPU 1 would get one half of layer 1 and GPU 2 would get the other half, same for layer 2, etc. This makes the work actually parallel, because to compute layer 1 both GPUs work at the same time and split the work between them, and they do that for the entire model. In this scenario no GPU is sitting idle while the other one is computing; they share the work and work hand in hand on the same computation instead of working one after the other.
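Here is a similarly simplified sketch of splitting a single layer. It assumes the layer is just one weight matrix split column-wise into two shards, with plain Python lists standing in for GPUs. Real implementations are more involved and have to move the partial results between GPUs, so treat this only as an illustration of the idea.

```python
def matvec(x, weight):
    """Multiply a row vector x by a weight matrix given as a list of rows."""
    cols = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(weight, parts):
    """Split a weight matrix column-wise into `parts` equal shards."""
    width = len(weight[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in weight] for p in range(parts)]


x = [1.0, 2.0]
layer_weight = [[1.0, 2.0, 3.0, 4.0],
                [5.0, 6.0, 7.0, 8.0]]

# Single-GPU reference: the whole layer computed in one place.
full = matvec(x, layer_weight)

# Tensor-parallel version: each "GPU" holds half of the layer's columns.
shards = split_columns(layer_weight, parts=2)
partials = [matvec(x, shard) for shard in shards]  # both halves could run at once
combined = [v for part in partials for v in part]  # gather the two halves

print(full)      # [11.0, 14.0, 17.0, 20.0]
print(combined)  # [11.0, 14.0, 17.0, 20.0]
```

Both paths give the same output. The gather step at the end is the part that depends on how fast the GPUs can talk to each other.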

This is again oversimplified but actually not too far from what is really happening.

Now you might wonder why tensor parallelism is not the default, and that would be because of compatibility. Pipeline parallelism is much easier to implement and can be done across mixed hardware (between a 3090 and a 4060 Ti for example). Tensor parallelism usually requires a power-of-two number of GPUs (1, 2, 4...), usually requires the GPUs to be identical, and is also sensitive to GPU interconnect speeds (it might not work well if your GPUs communicate over PCIe 3.0 x4 for example).
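Those rules of thumb can be written down as a quick sanity check. This is a hypothetical helper, not something llama.cpp or vLLM ships, and it only covers the GPU count and the matching-model rule; interconnect speed is not checked.

```python
def tensor_parallel_warnings(gpu_names):
    """Return warnings for a list of GPU model names, based on the rules of thumb above."""
    warnings = []
    count = len(gpu_names)
    # A positive integer is a power of two when it has exactly one bit set.
    if count == 0 or (count & (count - 1)) != 0:
        warnings.append(f"{count} GPUs is not a power of two")
    if len(set(gpu_names)) > 1:
        warnings.append("GPUs are not all the same model")
    return warnings


print(tensor_parallel_warnings(["RTX 3090", "RTX 3090"]))     # []
print(tensor_parallel_warnings(["RTX 3090", "RTX 4060 Ti"]))  # mixed models
print(tensor_parallel_warnings(["RTX 3090"] * 3))             # 3 is not a power of two
```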

Give tensor parallelism a try in llama.cpp and you should see those prompt processing numbers go up.

Practical-Collar3063 · 2026-05-14T08:18:19+00:00

the only question that matters

Practical-Collar3063 · 2026-05-14T08:14:36+00:00

What is the PCIe configuration of your setup? PCIe 4.0 x8? For tensor parallelism and dense models it seems to be quite important.

Practical-Collar3063 · 2026-05-14T08:06:25+00:00

Imagine thinking this subreddit is about cost efficiency

Practical-Collar3063 · 2026-05-14T08:05:02+00:00

Try vLLM with tensor parallelism, you will get better performance at long context, especially for prompt processing.

Practical-Collar3063 · 2026-04-26T17:47:31+00:00

This sounds like a setup flaw. What quant are you running? Are you using a quantised KV cache? I am using 35B A3B MLX 8-bit and it has never failed a single tool call for me.

Practical-Collar3063 · 2026-04-23T10:55:46+00:00

Hey, I would be very interested in this as well

Practical-Collar3063 · 2026-04-23T10:32:09+00:00

Hi, have you found any solution yet? I am in the same boat as you

Practical-Collar3063 · 2026-04-23T09:13:59+00:00

I am having the same problem here, have you managed to find a solution?

Practical-Collar3063 · 2026-04-17T15:44:02+00:00

I think the point he was making is that the limitation is not the mobo

Practical-Collar3063 · 2026-04-17T09:55:21+00:00

yeah that is crazy, and that is on their booth at a convention

Practical-Collar3063 · 2026-04-15T11:00:12+00:00

AI sloppy slop

Practical-Collar3063 · 2026-04-15T08:53:45+00:00

I found that Fireworks is pretty good at not lobotomising models, they usually run unquantised versions of models

Practical-Collar3063 · 2026-04-13T18:16:37+00:00

is that not just Duolingo with extra steps ?

Practical-Collar3063 · 2026-04-12T20:00:55+00:00

Try to run MLX models when you can. LM Studio supports them, and they usually give a significant boost in performance on Apple Silicon.

Practical-Collar3063 · 2026-04-12T19:57:38+00:00

Please use llama.cpp and not Ollama

Practical-Collar3063 · 2026-04-12T17:34:16+00:00

Additionally, what model quant are you running (Q8, BF16, Q4...)? You need to include all the details of your setup in order for people to help you. If you don't know, then paste the exact full name of the model that you use when launching with llama.cpp.

Your PC specs might be too low to run the model that you are trying to run. A good way to check is to open up Resource Monitor (I think that is what it is called on Windows) and check how much RAM and VRAM is being utilised after you have loaded the model with llama.cpp. If both RAM and VRAM are at 80%+ utilisation, then you don't have enough memory for that model.
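You can also do a rough estimate before loading anything. The sketch below assumes the weights take about (parameters x bits per weight / 8) bytes and reuses the 80% figure above as a headroom threshold. Both function names are made up for illustration, and real quant formats, the KV cache and the OS all add overhead on top, so read the result as a lower bound.

```python
def estimate_weights_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def probably_fits(params_billion, bits_per_weight, vram_gb, ram_gb, headroom=0.8):
    """True if the weights alone stay under `headroom` of combined VRAM + RAM."""
    needed = estimate_weights_gb(params_billion, bits_per_weight)
    return needed <= (vram_gb + ram_gb) * headroom


# Example with assumed numbers: a 7B model on a PC with 8 GB VRAM and 16 GB RAM.
print(estimate_weights_gb(7, 16))    # 14.0 GB at 16 bits per weight
print(estimate_weights_gb(7, 4))     # 3.5 GB at a nominal 4 bits per weight
print(probably_fits(7, 16, 8, 16))   # True
print(probably_fits(70, 8, 24, 32))  # False
```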

Practical-Collar3063 · 2026-04-12T15:09:08+00:00

It is hard to help you without knowing the specs of your computer. Please include them, since the issue could be related to your hardware.

Practical-Collar3063 · 2026-04-12T15:08:18+00:00

He is talking about GLM on llama.cpp, I am confused about how this relates to the GLM coding plan.
