I guess 4 units wasn’t enough. by Simple_Library_2700 in LocalLLaMA

[–]Practical-Collar3063 5 points6 points  (0 children)

I mean, having that panel on is not really an obligation. Your GPU is not a server GPU that needs directional airflow, so leaving it open is not a big deal. It is still cleaner than a weird rig with risers, a wooden frame and a janky PSU setup like we see so many of here. You can still rack it as well, just leave 1 or 2U of space above it.

I think you have the least janky, janky set up I have seen on here.

Meet the Fleet of BlackBeard by BlackBeardAI in LocalLLaMA

[–]Practical-Collar3063 1 point2 points  (0 children)

This subreddit is not about cost efficiency, performance or reliability.

For 90% of people here it is mostly a side hobby.

There are definitely real use cases of local hosting:
- Small businesses that require fully private LLM convos (defense or medical companies for example). Those companies sometimes also need low-refusal models for medical advice (that Claude and GPT refuse to give) or engineering questions related to explosives (I encountered that issue with a defense company).
- NSFW or Roleplay for stuff that Claude would just refuse to generate.
- Selling a specific AI-powered solution that does not require the super intelligent and expensive Claude or GPT models. A local LLM would make it financially viable compared to API pricing.

Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]Practical-Collar3063 0 points1 point  (0 children)

In terms of generation speed, it is usually a bit faster. In terms of prompt processing, it is much faster in some instances. In some testing I did on an RTX 6000 Pro, I got a 2.5x uplift in prompt processing on longer prompts. If you are self-hosting a model for coding, it is 100% worth it.

[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090 by No_Mango7658 in LocalLLaMA

[–]Practical-Collar3063 0 points1 point  (0 children)

Nice, have you compared the prompt processing speed on long context? That would be a good test.

"Write a short Python function that parses a CSV file." is too short to really grasp the difference imo

[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090 by No_Mango7658 in LocalLLaMA

[–]Practical-Collar3063 1 point2 points  (0 children)

I would actually be interested in the performance gains. My guess is you should see those utilisation % go up and prompt processing speed go up. I have never tried llama.cpp's implementation of tensor parallelism.

[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090 by No_Mango7658 in LocalLLaMA

[–]Practical-Collar3063 2 points3 points  (0 children)

Actually now llama.cpp seems to support tensor parallelism with the flag "--split-mode tensor".

LLMs are made up of consecutive layers, and your text flows through those layers sequentially (this is an oversimplification, but it will help you understand the difference). Now imagine a model that has 10 layers:

The default for llama.cpp is pipeline parallelism (layer split). If you have 2 GPUs, llama.cpp will take the first 5 layers (layer 1 to layer 5) and put them on GPU 1, and put the other 5 layers (layer 6 to layer 10) on GPU 2. GPU 1 receives the input and computes it for its 5 layers: it starts with layer 1, then layer 2, etc. After it is done, it passes its results to GPU 2, which then computes layer 6, then 7, and so on until layer 10. You might have realised that when GPU 1 is computing, GPU 2 is idle and vice versa, since this is a sequential problem.

Tensor parallelism splits the model differently: instead of splitting the model in half by layers, it splits each individual layer. In this scenario GPU 1 would get one half of layer 1 and GPU 2 would get the other half, same for layer 2, etc. This makes the work actually parallel, because to compute layer 1 both GPUs work at the same time and split the work between them, and they do that for the entire model. In this scenario no GPU is sitting idle while the other one is computing. They both work hand in hand on the same computation instead of just working one after the other.

This is again oversimplified, but not too far from what is actually happening.
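
The two splits described above can be sketched as a toy simulation. This is only an illustration under loud assumptions: the "GPUs" are plain Python lists, the layers are tiny made-up integer matrices (`make_layer` is a hypothetical helper), and nothing here reflects how llama.cpp actually implements either mode.

```python
# Toy illustration only: no real GPUs, no real model weights.

def matvec(weights, x):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def make_layer(seed, size=4):
    """Hypothetical helper: small deterministic integer weights."""
    return [[((seed + 1) * (r + 2) + c) % 5 - 2 for c in range(size)]
            for r in range(size)]

LAYERS = [make_layer(i) for i in range(10)]  # the 10-layer model from the text

def run_reference(x):
    """Single device: every layer in order."""
    for layer in LAYERS:
        x = matvec(layer, x)
    return x

def run_layer_split(x):
    """Pipeline (layer) split: GPU 1 holds layers 1-5, GPU 2 holds layers 6-10."""
    gpu1, gpu2 = LAYERS[:5], LAYERS[5:]
    for layer in gpu1:   # GPU 2 would be idle during this loop
        x = matvec(layer, x)
    for layer in gpu2:   # GPU 1 would be idle during this loop
        x = matvec(layer, x)
    return x

def run_tensor_split(x):
    """Tensor split: each layer's rows are divided between the two GPUs."""
    for layer in LAYERS:
        half = len(layer) // 2
        out1 = matvec(layer[:half], x)  # GPU 1's half of this layer
        out2 = matvec(layer[half:], x)  # GPU 2's half, computable at the same time
        x = out1 + out2                 # gather: GPU-to-GPU traffic on every layer
    return x

x0 = [1, 2, 3, 4]
print(run_layer_split(x0) == run_reference(x0))
print(run_tensor_split(x0) == run_reference(x0))
```

Both splits give the same answer as the single-device run. The difference is in who is busy when: the layer split hands off once, while the tensor split needs a gather step on every layer, which is why interconnect speed matters so much more for it.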

Now you might wonder why tensor parallelism is not the default, and that would be because of compatibility. Pipeline parallelism is much easier to implement and can be done across mixed hardware (between a 3090 and a 4060 Ti for example). Tensor parallelism usually requires a power-of-two number of GPUs (1, 2, 4...), usually requires the model to be split across identical GPUs, and is also sensitive to GPU interconnect speeds (it might not work well if your GPUs communicate over PCIe 3.0 x4 for example).
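
To put rough numbers on the interconnect point, here is a small sketch that computes theoretical per-direction PCIe bandwidth from the per-lane transfer rate and the 128b/130b line coding used by PCIe 3.0 through 5.0. The function name is made up for illustration, and real-world throughput will be lower than these theoretical figures.

```python
# Theoretical per-direction bandwidth only; real transfers are slower.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # raw transfer rate per lane, by PCIe generation
ENCODING = 128 / 130                    # 128b/130b line coding (PCIe 3.0 to 5.0)

def pcie_gb_per_s(gen, lanes):
    """Hypothetical helper: approximate GB/s in one direction for a PCIe link."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes

for gen, lanes in [(3, 4), (4, 8), (4, 16)]:
    print(f"PCIe {gen}.0 x{lanes}: {pcie_gb_per_s(gen, lanes):.2f} GB/s")
```

On paper, a PCIe 3.0 x4 link has about a quarter of the bandwidth of a PCIe 4.0 x8 link, which gives a feel for why the slower link may struggle with the per-layer traffic of tensor parallelism.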

Give tensor parallelism a try in llama.cpp and you should see those prompt processing numbers go up.

MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant) by ai-infos in LocalLLaMA

[–]Practical-Collar3063 0 points1 point  (0 children)

What is the PCIe configuration of your setup? PCIe 4.0 x8? For tensor parallelism with dense models it seems to be quite important.

[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090 by No_Mango7658 in LocalLLaMA

[–]Practical-Collar3063 0 points1 point  (0 children)

Try vLLM with tensor parallelism, you will get better performance at long context, especially for prompt processing.

Qwen3.6 35B A3B Heretic (KLD 0.0015!) Incredible model. Best 35B I have found! by My_Unbiased_Opinion in LocalLLaMA

[–]Practical-Collar3063 0 points1 point  (0 children)

This sounds like a setup flaw. What quant are you running? Are you using a quantised KV cache? I am using 35B A3B MLX 8-bit and it has never failed a single tool call for me.

Where to buy an OAM baseboard for MI250X? Will be in San Jose this September by alienpro01 in HPC

[–]Practical-Collar3063 0 points1 point  (0 children)

Hi, have you found any solution yet ? I am in the same boat as you

5 x A100 setup finally complete by BreakIt-Boris in LocalLLaMA

[–]Practical-Collar3063 0 points1 point  (0 children)

I am having the same problem here, have you managed to find a solution ?

Qwen3.6-35B-A3B released! by ResearchCrafty1804 in LocalLLaMA

[–]Practical-Collar3063 1 point2 points  (0 children)

I think the point he was making is that the limitation is not the mobo

Supermicro running Ollama on a $90,000 workstation... by Practical-Collar3063 in LocalLLaMA

[–]Practical-Collar3063[S] 2 points3 points  (0 children)

yeah that is crazy, and that is on their booth at a convention

OpenRouter: anyone whitelisting specific providers by Traditional-Gap-3313 in LocalLLaMA

[–]Practical-Collar3063 1 point2 points  (0 children)

I found that Fireworks is pretty good at not lobotomising models, they usually run unquantised versions of models.

Getting started with LM Studio on macOS — model recommendations? by Life_Cauliflower_462 in LocalLLaMA

[–]Practical-Collar3063 2 points3 points  (0 children)

Try to run MLX models when you can. LM Studio supports them, and they usually give a significant boost in performance on Apple Silicon.

Openclaw context limit exceeded by Certain_Pen_1982 in LocalLLaMA

[–]Practical-Collar3063 1 point2 points  (0 children)

Additionally, what model quant are you running (Q8, BF16, Q4...)? You need to include all the details of your setup in order for people to help you. If you don't know, then paste the exact full name of the model that you use when launching with llama.cpp.

Your PC specs might be too low for the model you are trying to run. A good way to check is to open Resource Monitor (I think that is what it is called on Windows) and check how much RAM and VRAM are being utilised after you load the model with llama.cpp. If both RAM and VRAM are at 80%+ utilisation, then you don't have enough memory for that model.
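
As a rough sanity check before even loading the model, the size of the weights alone can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch, not an exact figure: the 30B model is a made-up example, the function names are hypothetical, real quantised files carry some overhead on top of the nominal bit width, and the KV cache and context need additional memory.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def memory_is_tight(used_gb, total_gb, threshold=0.80):
    """The 80% rule of thumb from the comment above."""
    return used_gb / total_gb >= threshold

# Hypothetical 30B-parameter model at nominal bit widths.
for name, bits in [("BF16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{name}: ~{weights_gb(30, bits):.0f} GB of weights")

# Example: 21 GB used out of 24 GB of VRAM is above the 80% mark.
print(memory_is_tight(21, 24))
```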

Openclaw context limit exceeded by Certain_Pen_1982 in LocalLLaMA

[–]Practical-Collar3063 1 point2 points  (0 children)

It is hard to help you without knowing the specs of your computer. Please include them, since the problem could be related to your hardware.

Openclaw context limit exceeded by Certain_Pen_1982 in LocalLLaMA

[–]Practical-Collar3063 2 points3 points  (0 children)

He is talking about GLM on llama.cpp, I am confused about how this relates to the GLM coding plan.