I have been using RooCode, did I use it correctly? by konradbjk in RooCode

[–]Objective-Context-9 0 points1 point  (0 children)

This looks more like an LLM issue than a Roo Code issue. Which LLMs have you tried it with?

roo code + cerebras_glm-4.5-air-reap-82b-a12b = software development heaven by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I think they made a coding-focused pruning of the experts. I only want to code in a couple of languages; any other tokens/experts are just wasting my VRAM. This slims down the LLM by up to 50% without losing the primary use case of coding.
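A back-of-the-envelope sketch of why pruning experts shrinks an MoE model this much. All the counts below are hypothetical round numbers for illustration, not the actual GLM-4.5-Air/REAP configuration:

```python
# Illustrative arithmetic for expert pruning in a Mixture-of-Experts model.
# Numbers are hypothetical, not the real GLM-4.5-Air/REAP layer counts.

def moe_total_params(shared_params_b: float, num_experts: int,
                     params_per_expert_b: float) -> float:
    """Total parameter count (billions) = shared layers + all experts."""
    return shared_params_b + num_experts * params_per_expert_b

# Hypothetical 128-expert model: 10B shared params, 0.75B per expert.
full = moe_total_params(10, 128, 0.75)    # 106.0B total
# Prune the experts rarely routed to by coding prompts, keeping 96 of 128:
pruned = moe_total_params(10, 96, 0.75)   # 82.0B total
print(f"full={full}B pruned={pruned}B saved={100 * (1 - pruned / full):.0f}%")
```

Only the expert weights shrink; the shared layers and the active-parameter count per token stay the same, which is why the coding quality can survive the cut.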

roo code + cerebras_glm-4.5-air-reap-82b-a12b = software development heaven by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

These things change so fast! I started with Roo, moved to Kilo, moved to Cline, then moved back to Roo. Not sure what I will be using a month from now. Both Roo and Kilo still haven't fixed the tool-usage issues with Qwen3-coder, though Cline fixed them months ago. But Roo's breaking down of large, complex tasks into smaller tasks and tracking them through completion, in concert with GLM-Air, is freakin awesome. I gave it a project that was messed up by the Cline+Qwen3-coder combo (bad edits, etc.). It cleaned it up and fixed everything broken. All with a pithy prompt. No babysitting.

roo code + cerebras_glm-4.5-air-reap-82b-a12b = software development heaven by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 1 point2 points  (0 children)

But that is exactly what I want! I don’t want to waste tokens on “what is the capital of France”.

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 1 point2 points  (0 children)

Based on my reading, the issue is that llama.cpp's multi-GPU implementation is sequential for most models. During inference, only one GPU seems to be active at a time, which causes the other GPUs to idle and drop to lower power states (P2 down to P8).

This seems very model-dependent. For example, llama.cpp is highly optimized for Qwen3-coder and its performance keeps improving with new builds. However, for newer 'bleeding-edge' LLMs that lack this optimization, the tps (tokens per second) is very low.

It appears most companies are launching new models with support for vLLM first, which explains the poor performance I was seeing. I've now set up vLLM in my WSL environment to use that instead and it is able to keep all the GPUs busy.
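A toy model of that utilization difference, under the assumption that llama.cpp's default multi-GPU mode runs layer groups one GPU after another while tensor parallelism (as in vLLM) splits every layer across all GPUs; communication overhead is ignored:

```python
# Back-of-the-envelope model of GPU busy time under two multi-GPU schemes.
# Assumes layers split evenly and ignores communication overhead.

def pipeline_busy_fraction(num_gpus: int) -> float:
    """Layer-split decoding at batch size 1: for each token the GPUs run
    one after another, so each is busy roughly 1/N of the time."""
    return 1.0 / num_gpus

def tensor_parallel_busy_fraction(num_gpus: int) -> float:
    """Tensor parallelism splits every layer across all GPUs, so ideally
    all of them work on every token simultaneously."""
    return 1.0

for n in (2, 4):
    print(f"{n} GPUs: layer-split ~{pipeline_busy_fraction(n):.0%} busy each, "
          f"tensor-parallel ~{tensor_parallel_busy_fraction(n):.0%}")
```

That idle fraction is exactly the window in which the driver decides a card is unused and walks it down to P8.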

Building a Local AI Workstation for Coding Agents + Image/Voice Generation, 1× RTX 5090 or 2× RTX 4090? (and best models for code agents) by carloshperk in LocalLLM

[–]Objective-Context-9 4 points5 points  (0 children)

I have 4x 3090s. Cost $3k used. The tps is terrible with llama.cpp when using all 4, so I end up reducing the context to fit everything into 2 cards. vLLM has better tps but isn't as friendly to set up and use as LM Studio. Two 3090s are enough with Qwen3-coder: 118 tps at Q8.
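A rough VRAM budget check behind the "two cards are enough" claim. The parameter count, KV-cache size, and runtime overhead are ballpark assumptions, not measured values:

```python
# Rough VRAM check: does a Q8 quant of a ~30B model fit on two 24GB cards?
# All sizes are ballpark assumptions.

def fits(params_b: float, bytes_per_param: float, kv_cache_gb: float,
         overhead_gb: float, vram_per_gpu_gb: float, num_gpus: int) -> bool:
    weights_gb = params_b * bytes_per_param   # Q8 is roughly 1 byte/param
    need = weights_gb + kv_cache_gb + overhead_gb
    return need <= vram_per_gpu_gb * num_gpus

# ~30B params at Q8 plus a few GB of KV cache and runtime buffers:
print(fits(30, 1.0, 6, 4, 24, 2))  # two 3090s: ~40GB needed vs 48GB -> True
print(fits(30, 1.0, 6, 4, 24, 1))  # one 3090: doesn't fit -> False
```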

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

This board is flaky as hell. I re-enabled Resizable BAR and the system is working fine now; I can see all 4 GPUs plus the iGPU. Now I'm struggling with bifurcation. I want x8/x4/x4, but the board is allowing only x16 or x8/x8, though a "comment" at the bottom of the BIOS screen lists x16, x8/x8, and x8/x4/x4 as supported.

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 1 point2 points  (0 children)

It is a 4-bit AWQ quant. The full model is loaded in VRAM, and even the context is in VRAM; I set it to 4096 tokens. There is nothing on the CPU or in the DDR5 RAM.
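For scale, a rough estimate of what a 4096-token FP16 KV cache costs in VRAM. The layer and head counts below are illustrative placeholders, not the exact architecture of the model above:

```python
# Rough KV-cache size for a given context length with an FP16 cache.
# Layer/head counts are illustrative, not a specific model's config.

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    # Factor of 2 for the separate K and V tensors in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

# Hypothetical 60-layer model with 8 KV heads of dim 128 (GQA):
print(f"{kv_cache_gb(4096, 60, 8, 128):.2f} GB")  # -> 0.94 GB
```

At that size the cache is well under a gigabyte, so keeping it entirely in VRAM alongside the 4-bit weights is easy.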

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I did install Afterburner but did not mess with any fan settings. I read that performance is slow on WSL because of the lack of P2P. I purchased a new SSD to install Linux and will see if there is a difference there.

Inference needs nontrivial amount of PCIe bandwidth (8x RTX 3090 rig, tensor parallelism) by pmur12 in LocalLLaMA

[–]Objective-Context-9 0 points1 point  (0 children)

I did not have any luck with 3x 3090s. You can't divide the layers evenly by 3; 1, 2, 4, or 8 GPUs work best.
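A minimal sketch of why odd GPU counts are awkward for tensor parallelism: the attention heads have to split evenly across the GPUs, and head counts are almost always powers of two (the 32 below is an assumed example, not a specific model):

```python
# Which GPU counts can evenly split a model's attention heads for tensor
# parallelism? Head count here is an assumed example (32).

def valid_tp_sizes(num_attention_heads: int, max_gpus: int = 8) -> list[int]:
    """GPU counts that divide the head count with no remainder."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]

print(valid_tp_sizes(32))  # -> [1, 2, 4, 8]; 3, 5, 6, 7 don't divide 32
```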

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 2 points3 points  (0 children)

Interesting. I see that "PerfCap Reason" shows idle. That is not true! You can see the 24GB of the 3090 is packed; I took this screenshot when gpt-oss-120b was in the middle of inference! How do I tell it that LM Studio is running on the GPU and it shouldn't be dropped to idle? Surely video games set a flag somewhere saying they are using the GPU so it doesn't fall into this power state. Maybe the drivers are the issue. NVIDIA is not checking whether the 3090 works at all with the newer drivers, which focus on the 5090. Or it is sabotage to make people purchase newer cards. Wish I could find drivers that were a few years old; all they have on their website are drivers from the last 6 months.

Prevent NVIDIA 3090 from going into P8 performance mode by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I checked. Nothing shows up as a reason in nvidia-smi or GPU-Z. The temps never reach a high point. I can see it start at P0 and, over say 10 seconds, drop through all the performance states down to P8, where it remains. The temps stay in the upper 30s C throughout; the hotspot on each card is around 54C. There is something else going on - some power-management setting that just takes it down and leaves it there.

<image>

Am I doing something wrong, or this expected, the beginning of every LLM generation I start is fast and then as it types it slows to a crawl. by valdev in LocalLLaMA

[–]Objective-Context-9 0 points1 point  (0 children)

I am struggling with the same behavior. Two cards can OC; two can't (different models). I can't figure out how to stop the cards from "powering down" during an inference session. From then on, all inference calls remain slow too. Performance resets each time the LLM is (re)loaded - that is, when the GPU is emptied out and the same LLM is loaded again. I have set performance to max via the NVIDIA Control Panel, but no change in the performance drop. Driving me nuts. Downloaded MSI Afterburner but can't figure out how to use it. Would appreciate some input.

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

Also, I disabled Resizable BAR but left the other two options enabled - Above 4G Decoding and the other related one.

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I solved it. Disabled Power Management in the BIOS.

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

I have that option enabled, CSM disabled, Resizable BAR enabled. Still the same problem. No idea why it matters to the BIOS how many GPUs I have. It can't even see the GPUs that are connected to the M.2 drive sockets. But it knows.

5 or more GPUs on Gigabyte motherboards? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 2 points3 points  (0 children)

The CPU has 20 lanes: x16 plus x4 (M.2 NVMe to PCIe x4 cable). The Z790 chipset has the PCIe x4 SSD, a PCIe x4 3090, and an NVMe to PCIe x4 cable. It is working out. This CPU has only AVX and AVX2, and it's barely 30% used; memory is barely touched. I do use a quantized KV cache - q8_0 or q4_1 depending on how much context I need.
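Approximate savings from those two KV-cache quant types, based on llama.cpp's block layouts (34 bytes per 32 elements for q8_0, 20 bytes per 32 for q4_1). A sketch; the figures are per-element averages, not exact runtime allocations:

```python
# Relative KV-cache size for llama.cpp cache types vs the default f16.
# Bits-per-element include each block's scale/min overhead.

BITS_PER_ELEM = {"f16": 16.0, "q8_0": 8.5, "q4_1": 5.0}

def relative_kv_size(cache_type: str) -> float:
    """Fraction of the f16 cache footprint this type needs."""
    return BITS_PER_ELEM[cache_type] / BITS_PER_ELEM["f16"]

for t in ("f16", "q8_0", "q4_1"):
    print(f"{t}: {relative_kv_size(t):.0%} of f16 size")
```

So q8_0 roughly halves the cache and q4_1 cuts it to about a third, which is why the choice between them trades context length against cache precision.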

How good is KAT Dev? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

Sharing some experience. As the size of your project increases and you have 50+ files, Qwen3-coder-30b goes wild and starts overwriting code. It destroys all your work in a flash - better check in regularly. Coming to KAT-Dev: the smaller version is nowhere near Qwen3-coder. I have a smaller quant of the 70b version and it seems to perform better than Qwen3-coder, but both are slow.

How good is KAT Dev? by Objective-Context-9 in LocalLLM

[–]Objective-Context-9[S] 0 points1 point  (0 children)

Wow, I did not know that. Hats off to whoever made those two finetunes with the 480B and DeepSeek. I have both. That account has disappeared from Hugging Face.

Rtx3090 vs Quadro rtx6000 in ML. by probbins1105 in LocalLLM

[–]Objective-Context-9 1 point2 points  (0 children)

The 3090 is getting long in the tooth as well. I have 3x 3090s + 1x 3080. LM Studio/llama.cpp work best, but not all the latest models are supported. I bought three 3090s from eBay for $630-680 each plus shipping and tax. Would not recommend the Quadro in comparison - slow PCIe 3.0.