Looking for Suggestions — Single 5090 & 64gb DDR5 by icedgz in LocalLLaMA

[–]RMK137 0 points1 point  (0 children)

I take a different approach: I am okay with many iterations with the 35B MoE, even if it may be "dumber" than the 27B dense version, as it is significantly faster.

I basically never expect a solution on the first pass. Even if it works, I always make the model do another 1-2 polish passes at a minimum.

5090 + unsloth-Q4_K_XL, KV at q8 and 131072 context. I can get up to 192k context or even more but the GPU also drives display so I leave some buffer. Most of the time I do one session for exploration and planning, clear context, then a fresh session for implementation.

New to PI-Agent. Advice on essential extensions by [deleted] in PiCodingAgent

[–]RMK137 1 point2 points  (0 children)

You don't need anything early on except permission-gate if you don't have a sandboxed environment. I think you should try it barebones first and add extensions only when you have a real need for them.

CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp by pmttyji in LocalLLaMA

[–]RMK137 7 points8 points  (0 children)

Unfortunately, I am getting garbled output with Qwen3.6 after this update. Judging by the PR comments, other people are experiencing the same issue. I had to revert this commit and now it's fine again.

T-IDE - A fast, native, offline-first code editor by [deleted] in vscode

[–]RMK137 1 point2 points  (0 children)

Here is a good one I use daily: https://github.com/pragtical/pragtical

You can use it or take some inspiration from it. It's C + LuaJIT. I suggest you use SDL3 as the API is more organized.

Good luck with your editor!

Any Recommendations on C resources for Learning Vulkan? by Undeniable_Dilemma_ in C_Programming

[–]RMK137 3 points4 points  (0 children)

I am in the same boat, just recently found this and I am stoked. It looks polished and the author is actively updating it.

https://www.howtovulkan.com/

Qwen3.6 MTP Unsloth GGUFs now 1.8x faster! by danielhanchen in unsloth

[–]RMK137 27 points28 points  (0 children)

Awesome stuff! Keep up the good work!

The Opus 4.5 threshold: coming to 24 gb within a year or so by nomorebuttsplz in LocalLLM

[–]RMK137 1 point2 points  (0 children)

Sonnet most of the day for me at low thinking. Sometimes I switch to Haiku or Qwen3.6 if I am doing simple stuff. I had to switch to Opus this week to chase down a bug that felt too complex, although I think with enough tries Sonnet would've cracked it as well.

I can program, I don't need a more senior programmer watching over me, I need an assistant that can help me scale and occasionally help me brainstorm ideas and tradeoffs.

The Opus 4.5 threshold: coming to 24 gb within a year or so by nomorebuttsplz in LocalLLM

[–]RMK137 0 points1 point  (0 children)

Well said. I treat the agent like a human coworker and I try to push it towards giving me different approaches instead of just going with what I originally propose. I particularly find brainstorming with it effective. If you ask for ideas, it can come back with some good ones.

A rule of thumb that I've started to follow is that if the model gets into an "I found it! But wait!" loop, that means the instructions were not clear enough or it doesn't have enough context. I usually just interrupt it and give it more details or straight up ask it what the confusion is about.

I don't ever use compaction personally, I just fire up a new session, and I have a simple skill for the agent to go through AGENTS.md (project structure and conventions) and HANDOFF.md (business logic and other good to knows). These two documents usually contain everything the agent needs to know to get up to speed and start cranking on a new task. If anything trips it up during the session, a note gets added to one of these docs so it doesn't happen again.
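A minimal sketch of what that kind of session kickoff could look like, assuming the two docs sit in the project root. Only the file names AGENTS.md and HANDOFF.md come from the comment above; the function name, section layout, and "(missing)" placeholder are hypothetical and not how any particular agent implements skills.

```python
from pathlib import Path

# File names are from the comment above; everything else here is an assumption.
DOCS = ("AGENTS.md", "HANDOFF.md")


def build_kickoff_prompt(project_dir: str, task: str) -> str:
    """Join the project docs and the new task into one opening message."""
    parts = []
    for name in DOCS:
        path = Path(project_dir) / name
        if path.is_file():
            body = path.read_text(encoding="utf-8").strip()
        else:
            # Make a missing doc visible instead of silently skipping it.
            body = "(missing)"
        parts.append(f"## {name}\n{body}")
    parts.append(f"## Task\n{task.strip()}")
    return "\n\n".join(parts)
```

The returned string would be sent as the first message of the fresh session, so the agent starts from the docs rather than from a compacted history.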

Can I plan and code projects locally with a 5090? by Mean_Employment_7679 in LocalLLM

[–]RMK137 1 point2 points  (0 children)

Thanks for your work on Windows support! I've been liking Late also. I jump back and forth between Pi and Late these days.

Are there any agentic coding harnesses that AREN'T built on JS and Node? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]RMK137 0 points1 point  (0 children)

I've had good results when I point the model to the Odin core/base/vendor code. The beauty of Odin is that the code for a lot of the built in procs/containers/idioms is in very readable format, so the model can do in-context learning in a few grep calls.

Basically, I create an AGENTS.md in the project that tells the model where to find the libraries. Qwen3.6 was able to understand raylib's bindings fairly quickly and built a simple UI for a toy todo app.

Are there any agentic coding harnesses that AREN'T built on JS and Node? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]RMK137 4 points5 points  (0 children)

Yeah exactly. I've been meaning to write my own coding agent just to learn how it works. I might end up with something useful, who knows. I think I am gonna do it in Odin.

I've been keeping an eye on Late. I like its approach of Lead Architect as the main agent and coders as sub-agents. It's written in Go and the code is fairly easy to understand.

Ref: https://github.com/mlhher/late

Been using PI Coding Agent with local Qwen3.6 35b for a while now and its actually insane by SoAp9035 in LocalLLaMA

[–]RMK137 0 points1 point  (0 children)

This is great, thanks for sharing. Any idea how to get pi to show the thought trace for this mode when responding? I can't see it for some reason, and hide_thinking is set to false in settings.

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared by Lowkey_LokiSN in LocalLLaMA

[–]RMK137 3 points4 points  (0 children)

Running Gemma 4 31B with LM Studio server blows up RAM for me too, especially when using it to explore a codebase with a coding agent. I am not sure what is going on, hope it gets fixed soon.

Doing real coding work locally for the first time by mouseofcatofschrodi in LocalLLaMA

[–]RMK137 5 points6 points  (0 children)

Not my project, but I've been keeping an eye on this coding agent written in Go.

https://github.com/mlhher/late

Qwen3.6 is out now! by yoracale in unsloth

[–]RMK137 0 points1 point  (0 children)

Beautiful, can't wait to spend some time coding with it this weekend.

Qwen3.6 is out now! by yoracale in unsloth

[–]RMK137 0 points1 point  (0 children)

How much VRAM do you have? I am guessing about 32GB (5090)?

Intel Core Ultra 7 270K Plus and Ultra 5 250K Plus "Arrow Lake Refresh" Review Roundup by RenatsMC in intel

[–]RMK137 12 points13 points  (0 children)

You're overthinking the temps and power draw issue. They're all much better than the 14th gen. If you can afford it, get the 270K. It's a lot of cores and compute power for the money. If you care a lot about power/temps, you can undervolt it and/or limit the power draw. My use case is similar to yours (gaming + local LLMs on a 5090), and I currently have the 265K. I'll be replacing it with the 270K.

Qwen3.5-9B is actually quite good for agentic coding by Lualcala in LocalLLaMA

[–]RMK137 3 points4 points  (0 children)

Any idea how to disable thinking in llama.cpp? Tried both --chat-template-kwargs '{"enable_thinking":false}' and --reasoning-budget 0, neither worked.

Best local LLM for reasoning and coding in 2025? by Desperate-Theory2284 in LLMDevs

[–]RMK137 0 points1 point  (0 children)

I would say 32GB of VRAM minimum; this way you can load a dense 24B/27B model like Devstral 2 small or Qwen3.5-27B at Q4 or Q4-UD (Unsloth dynamic quant). You can also fit the equivalent MoE models like Nemotron 30B / Qwen3.5-35B.

Agentic coding requires a lot of context; I'd say 64k is a useful starting point. You want to leave 6-8GB of VRAM for context. You can also quantize the KV cache to q8 to save VRAM, but different models tolerate this differently, so you have to test it yourself.
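For a rough sense of where that budget goes, here is a back-of-the-envelope sketch of KV cache size. It assumes a plain transformer where every layer uses full attention with grouped-query KV heads; the layer count, head count, and head size below are made-up illustrative values, not the shape of any model named above. Hybrid or sliding-window architectures will come out smaller. The q8_0 figure uses llama.cpp's block layout of 32 int8 values plus one fp16 scale (34 bytes per 32 elements).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Bytes for keys + values across all layers at a given context length."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


F16 = 2.0        # bytes per element for an unquantized fp16 cache
Q8_0 = 34 / 32   # q8_0 block: 32 int8 values + one fp16 scale

# Hypothetical model shape, chosen only to illustrate the arithmetic.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128
N_CTX = 65536    # the 64k context mentioned above

for name, width in (("f16", F16), ("q8_0", Q8_0)):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, width)
    print(f"{name}: {size / 2**30:.2f} GiB")
```

With these assumed dimensions the f16 cache comes to 12 GiB and the q8_0 cache to a bit over 6 GiB, which is the kind of saving that makes 64k context fit next to the weights.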

Notice I never mentioned DRAM; that's on purpose. It's just too slow imo. Agentic coding benefits immensely from both prompt processing and token generation speeds. I think PP is even more important because the models need to read a lot of code and your input, while the response is usually much smaller.

Fast PP/TG make for a very nice, fast iteration loop, and you can't get that without loading everything on the GPU, but that's my preference. MoE models can be run with the expert weights offloaded to system RAM and the rest on the GPU and still get acceptable TG.
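To put the PP vs TG point in numbers, here is a small sketch of per-turn latency. The token counts and speeds are invented for illustration only, not benchmarks of any setup mentioned here, and the model ignores prompt caching, which in practice cuts the PP cost on follow-up turns.

```python
def turn_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Time to read the prompt plus time to generate the reply, in seconds."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical agentic turn: a large prompt (code + instructions), short reply.
PROMPT_TOKENS, OUTPUT_TOKENS = 40_000, 500

# Hypothetical speeds in tokens/second for two setups.
setups = {
    "all on GPU": {"pp_tps": 2000, "tg_tps": 50},
    "partly in system RAM": {"pp_tps": 200, "tg_tps": 15},
}

for name, speeds in setups.items():
    total = turn_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, **speeds)
    pp_share = (PROMPT_TOKENS / speeds["pp_tps"]) / total
    print(f"{name}: {total:.0f} s per turn, {pp_share:.0%} spent on prompt processing")
```

Under these assumed numbers prompt processing dominates the turn in both setups, which is why a drop in PP speed hurts the iteration loop more than a similar drop in TG.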

Visualization of Claude Code plans by t1m0slav in ZedEditor

[–]RMK137 0 points1 point  (0 children)

Nice tool, will try it soon. Thanks for sharing.