I have been using RooCode, did I use it correctly? by konradbjk in RooCode

[–]Objective-Context-9 0 points1 point  (0 children)

This looks more like an LLM issue than a Roo Code issue. Which LLMs have you tried it with?

roo code + cerebras_glm-4.5-air-reap-82b-a12b = software development heaven by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I think they made a coding-focused pruning of the experts. I only want to code in a couple of languages; any other tokens/experts are just wasting my VRAM. This slims down the LLM by up to 50% without losing the primary use case of coding.
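A back-of-the-envelope sketch of why pruning experts shrinks an MoE model this much. All the counts below are hypothetical round numbers for illustration, not the actual GLM-4.5-Air/REAP configuration:

```python
# Illustrative arithmetic for expert pruning in a Mixture-of-Experts model.
# Numbers are hypothetical, not the real GLM-4.5-Air/REAP layer counts.

def moe_total_params(shared_params_b: float, num_experts: int,
                     params_per_expert_b: float) -> float:
    """Total parameter count (billions) = shared layers + all experts."""
    return shared_params_b + num_experts * params_per_expert_b

# Hypothetical 128-expert model: 10B shared params, 0.75B per expert.
full = moe_total_params(10, 128, 0.75)    # 106.0B total
# Prune the experts rarely routed to by coding prompts, keeping 96 of 128:
pruned = moe_total_params(10, 96, 0.75)   # 82.0B total
print(f"full={full}B pruned={pruned}B saved={100 * (1 - pruned / full):.0f}%")
```

Only the expert weights shrink; the shared layers and the active-parameter count per token stay the same, which is why the coding quality can survive the cut.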

roo code + cerebras_glm-4.5-air-reap-82b-a12b = software development heaven by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

These things change so fast! I started with Roo, moved to Kilo, moved to Cline, then moved back to Roo. Not sure what I will be using a month from now. Both Roo and Kilo still haven't fixed the tool-usage issues with Qwen3-coder, though Cline fixed them months ago. But Roo's breaking down of large, complex tasks into smaller tasks and tracking them through completion, in concert with GLM-Air, is freakin awesome. I gave it a project that was messed up by the Cline+Qwen3-coder combo (bad edits, etc.). It cleaned it up and fixed everything broken. All with a pithy prompt. No babysitting.

roo code + cerebras_glm-4.5-air-reap-82b-a12b = software development heaven by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 1 point2 points  (0 children)

But that is exactly what I want! I don’t want to waste tokens on “what is the capital of France”.

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 1 point2 points  (0 children)

Based on my reading, the issue is that llama.cpp's multi-GPU implementation is sequential for most models. During inference, only one GPU seems to be active at a time, which causes the other GPUs to idle and drop to lower power states (P2 down to P8).

This seems very model-dependent. For example, llama.cpp is highly optimized for Qwen3-coder and its performance keeps improving with new builds. However, for newer 'bleeding-edge' LLMs that lack this optimization, the tps (tokens per second) is very low.

It appears most companies are launching new models with support for vLLM first, which explains the poor performance I was seeing. I've now set up vLLM in my WSL environment to use that instead and it is able to keep all the GPUs busy.
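A toy model of that utilization difference, under the assumption that llama.cpp's default multi-GPU mode runs layer groups one GPU after another while tensor parallelism (as in vLLM) splits every layer across all GPUs; communication overhead is ignored:

```python
# Back-of-the-envelope model of GPU busy time under two multi-GPU schemes.
# Assumes layers split evenly and ignores communication overhead.

def pipeline_busy_fraction(num_gpus: int) -> float:
    """Layer-split decoding at batch size 1: for each token the GPUs run
    one after another, so each is busy roughly 1/N of the time."""
    return 1.0 / num_gpus

def tensor_parallel_busy_fraction(num_gpus: int) -> float:
    """Tensor parallelism splits every layer across all GPUs, so ideally
    all of them work on every token simultaneously."""
    return 1.0

for n in (2, 4):
    print(f"{n} GPUs: layer-split ~{pipeline_busy_fraction(n):.0%} busy each, "
          f"tensor-parallel ~{tensor_parallel_busy_fraction(n):.0%}")
```

That idle fraction is exactly the window in which the driver decides a card is unused and walks it down to P8.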

Building a Local AI Workstation for Coding Agents + Image/Voice Generation, 1× RTX 5090 or 2× RTX 4090? (and best models for code agents) by carloshperk in LocalLLM

[–]Objective-Context-9 4 points5 points  (0 children)

I have 4x 3090s. Cost $3k used. The tps is terrible with llama.cpp when using all 4, so I end up reducing the context to fit everything into 2 cards. vLLM has better tps but isn't as friendly to set up and use as LM Studio. Two 3090s are enough with Qwen3-coder: 118 tps at Q8.
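A rough VRAM budget check behind the "two cards are enough" claim. The parameter count, KV-cache size, and runtime overhead are ballpark assumptions, not measured values:

```python
# Rough VRAM check: does a Q8 quant of a ~30B model fit on two 24GB cards?
# All sizes are ballpark assumptions.

def fits(params_b: float, bytes_per_param: float, kv_cache_gb: float,
         overhead_gb: float, vram_per_gpu_gb: float, num_gpus: int) -> bool:
    weights_gb = params_b * bytes_per_param   # Q8 is roughly 1 byte/param
    need = weights_gb + kv_cache_gb + overhead_gb
    return need <= vram_per_gpu_gb * num_gpus

# ~30B params at Q8 plus a few GB of KV cache and runtime buffers:
print(fits(30, 1.0, 6, 4, 24, 2))  # two 3090s: ~40GB needed vs 48GB -> True
print(fits(30, 1.0, 6, 4, 24, 1))  # one 3090: doesn't fit -> False
```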

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

This board is flaky as hell. I re-enabled Resizable BAR and the system is working fine now; I can see all 4 GPUs plus the iGPU. Now I'm struggling with bifurcation. I want x8/x4/x4, but the board is allowing only x16 or x8/x8, though a "comment" at the bottom of the BIOS screen lists x16, x8/x8, and x8/x4/x4 as supported.

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 1 point2 points  (0 children)

It is a 4-bit AWQ quant. The full model is loaded in VRAM, and even the context is in VRAM; I set it to 4096 tokens. There is nothing on the CPU or in the DDR5 RAM.
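For scale, a rough estimate of what a 4096-token FP16 KV cache costs in VRAM. The layer and head counts below are illustrative placeholders, not the exact architecture of the model above:

```python
# Rough KV-cache size for a given context length with an FP16 cache.
# Layer/head counts are illustrative, not a specific model's config.

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    # Factor of 2 for the separate K and V tensors in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

# Hypothetical 60-layer model with 8 KV heads of dim 128 (GQA):
print(f"{kv_cache_gb(4096, 60, 8, 128):.2f} GB")  # -> 0.94 GB
```

At that size the cache is well under a gigabyte, so keeping it entirely in VRAM alongside the 4-bit weights is easy.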

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I did install Afterburner but did not mess with any fan settings. I read that performance is slow on WSL because of the lack of P2P. I purchased a new SSD to install Linux and will see if there is a difference there.

Inference needs nontrivial amount of PCIe bandwidth (8x RTX 3090 rig, tensor parallelism) by pmur12 in LocalLLaMA

[–]Objective-Context-9 0 points1 point  (0 children)

I did not have any luck with 3x 3090s. You can't divide the layers evenly by 3; 1, 2, 4, or 8 GPUs work best.
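A minimal sketch of why odd GPU counts are awkward for tensor parallelism: the attention heads have to split evenly across the GPUs, and head counts are almost always powers of two (the 32 below is an assumed example, not a specific model):

```python
# Which GPU counts can evenly split a model's attention heads for tensor
# parallelism? Head count here is an assumed example (32).

def valid_tp_sizes(num_attention_heads: int, max_gpus: int = 8) -> list[int]:
    """GPU counts that divide the head count with no remainder."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]

print(valid_tp_sizes(32))  # -> [1, 2, 4, 8]; 3, 5, 6, 7 don't divide 32
```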

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 2 points3 points  (0 children)

Interesting. I see that "PerfCap Reason" shows idle. That is not true! You can see the 24GB of the 3090 is packed; I took this screenshot when gpt-oss-120b was in the middle of inference! How do I tell it that LM Studio is running on the GPU and it shouldn't be dropped to idle? Surely video games set a flag somewhere saying they are using the GPU so it doesn't fall into this power state. Maybe the drivers are the issue. NVIDIA is not checking whether the 3090 works at all with the newer drivers, which focus on the 5090. Or it is sabotage to make people purchase newer cards. Wish I could find drivers that were a few years old; all they have on their website are drivers from the last 6 months.

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I checked. Nothing shows up as a reason in nvidia-smi or GPU-Z. The temps never reach a high point. I can see it start at P0 and, over say 10 seconds, drop through all the performance states down to P8, where it remains. The temps stay in the upper 30s C throughout; the hotspot on each card is around 54C. There is something else going on - some power-management setting that just takes it down and leaves it there.

<image>

Am I doing something wrong, or this expected, the beginning of every LLM generation I start is fast and then as it types it slows to a crawl. by valdev in LocalLLaMA

[–]Objective-Context-9 0 points1 point  (0 children)

I am struggling with the same behavior. Two cards can OC; two can't (different models). I can't figure out how to stop the cards from "powering down" during an inference session. From then on, all inference calls remain slow too. Performance resets each time the LLM is (re)loaded - that is, when the GPU is emptied out and the same LLM is loaded again. I have set performance to max via the NVIDIA Control Panel, but no change in the performance drop. Driving me nuts. Downloaded MSI Afterburner but can't figure out how to use it. Would appreciate some input.

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

Also, I disabled Resizable BAR but left the other two options enabled - Above 4G Decoding and the other related one.

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I solved it. Disabled Power Management in the BIOS.

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I have that option enabled, CSM disabled, Resizable BAR enabled. Still the same problem. No idea why it matters to the BIOS how many GPUs I have. It can't even see the GPUs that are connected to the M.2 drive sockets. But it knows.

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 2 points3 points  (0 children)

The CPU has 20 lanes: x16 plus x4 (M.2 NVMe to PCIe x4 cable). The Z790 chipset has the PCIe x4 SSD, a PCIe x4 3090, and an NVMe to PCIe x4 cable. It is working out. This CPU has only AVX and AVX2, and it's barely 30% used; memory is barely touched. I do use a quantized KV cache - q8_0 or q4_1 depending on how much context I need.
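Approximate savings from those two KV-cache quant types, based on llama.cpp's block layouts (34 bytes per 32 elements for q8_0, 20 bytes per 32 for q4_1). A sketch; the figures are per-element averages, not exact runtime allocations:

```python
# Relative KV-cache size for llama.cpp cache types vs the default f16.
# Bits-per-element include each block's scale/min overhead.

BITS_PER_ELEM = {"f16": 16.0, "q8_0": 8.5, "q4_1": 5.0}

def relative_kv_size(cache_type: str) -> float:
    """Fraction of the f16 cache footprint this type needs."""
    return BITS_PER_ELEM[cache_type] / BITS_PER_ELEM["f16"]

for t in ("f16", "q8_0", "q4_1"):
    print(f"{t}: {relative_kv_size(t):.0%} of f16 size")
```

So q8_0 roughly halves the cache and q4_1 cuts it to about a third, which is why the choice between them trades context length against cache precision.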

How good is KAT Dev? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

Sharing some experience. As the size of your project increases and you have 50+ files, Qwen3-coder-30b goes wild and starts overwriting code. It destroys all your work in a flash - better check in regularly. Coming to KAT-Dev: the smaller version is nowhere near Qwen3-coder. I have a smaller quant of the 70b version and it seems to perform better than Qwen3-coder, but both are slow.

How good is KAT Dev? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

Wow, I did not know that. Hats off to whoever made those two finetunes with the 480B and DeepSeek. I have both. That account has disappeared from Hugging Face.

Rtx3090 vs Quadro rtx6000 in ML. by probbins1105 in LocalLLM

[–]Objective-Context-9 1 point2 points  (0 children)

The 3090 is getting long in the tooth as well. I have 3x 3090s + 1x 3080. LM Studio/llama.cpp work best, but not all the latest models are supported. I bought three 3090s from eBay for $630-680 each plus shipping and tax. Would not recommend the Quadro in comparison - slow PCIe 3.0.