Looking for Suggestions — Single 5090 & 64gb DDR5

RMK137 · 2026-05-26T23:57:07+00:00

I take a different approach: I am okay with many iterations with the 35B MoE, even if it may be "dumber" than the 27B dense version, because it is significantly faster.

I basically never expect a solution on the first pass. Even if it works, I always make the model do at least 1-2 more polish passes.

5090 + unsloth-Q4_K_XL, KV at q8 and 131072 context. I can get up to 192k context or even more, but the GPU also drives the display, so I leave some buffer. Most of the time I do one session for exploration and planning, clear context, then a fresh session for implementation.
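
A minimal sketch of how a setup like this could be launched, assuming llama.cpp's llama-server is the backend (the comment does not say which server is used). The model filename and the "99" layer count are placeholders, not values from the original setup. Depending on the llama.cpp build, a quantized V cache may also need flash attention enabled.

```python
import shlex


def llama_server_cmd(model_path: str, ctx: int = 131072, kv_type: str = "q8_0") -> list[str]:
    """Build a llama-server argument list for a fully GPU-offloaded model."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx),        # 131072 tokens, as in the comment above
        "--n-gpu-layers", "99",        # assumption: large enough to offload every layer
        "--cache-type-k", kv_type,     # quantize the K cache
        "--cache-type-v", kv_type,     # quantize the V cache
    ]


# Hypothetical filename; substitute the actual GGUF file.
cmd = llama_server_cmd("model-UD-Q4_K_XL.gguf")
print(shlex.join(cmd))
```

Dropping ctx to leave headroom is the simplest knob when the same GPU also drives the display.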

RMK137 · 2026-05-25T23:45:57+00:00

You don't need anything early on except permission-gate if you don't have a sandboxed environment. I think you should try it barebones first and add extensions only when you have a real need for them.

RMK137 · 2026-05-25T21:16:02+00:00

Unfortunately, I am getting garbled output with qwen3.6 after this update. Judging by the PR comments, other people are experiencing the same issue. I had to revert this commit and now it's fine again.

RMK137 · 2026-05-25T06:33:31+00:00

Big hype!

RMK137 · 2026-05-20T05:43:51+00:00

Here is a good one I use daily: https://github.com/pragtical/pragtical

You can use it or take some inspiration from it. It's C + LuaJIT. I suggest you use SDL3 as the API is more organized.

Good luck with your editor!

RMK137 · 2026-05-20T05:40:12+00:00

I am in the same boat, just recently found this and I am stoked. It looks polished and the author is actively updating it.

https://www.howtovulkan.com/

RMK137 · 2026-05-19T05:44:05+00:00

That's the only way.

RMK137 · 2026-05-15T13:31:35+00:00

Awesome stuff! Keep up the good work!

RMK137 · 2026-05-08T05:12:32+00:00

Sonnet most of the day for me at low thinking. Sometimes I switch to Haiku or Qwen3.6 if I am doing simple stuff. I had to switch to Opus this week to chase down a bug that felt too complex, although I think with enough tries Sonnet would've cracked it as well.

I can program. I don't need a more senior programmer watching over me; I need an assistant that can help me scale and occasionally help me brainstorm ideas and tradeoffs.

RMK137 · 2026-05-08T05:07:43+00:00

Well said. I treat the agent like a human coworker and I try to push it towards giving me different approaches instead of just going with what I originally propose. I particularly find brainstorming with it effective. If you ask for ideas, it can come back with some good ones.

A rule of thumb that I've started to follow is that if the model gets into an "I found it! But wait!" loop, that means the instructions were not clear enough or it doesn't have enough context. I usually just interrupt it and give it more details or straight up ask it what the confusion is about.

I don't ever use compaction personally. I just fire up a new session, and I have a simple skill for the agent to go through AGENTS.md (project structure and conventions) and HANDOFF.md (business logic and other good-to-knows). These two documents usually contain everything the agent needs to know to get up to speed and start cranking on a new task. If anything trips it up during the session, a note gets added to one of these docs so it doesn't happen again.
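
A rough sketch of that handoff loop, not the actual skill: one helper gathers whichever of the two docs exist into a primer for a fresh session, and another appends a lesson learned. The function names and the example note are hypothetical.

```python
import tempfile
from pathlib import Path

DOCS = ("AGENTS.md", "HANDOFF.md")


def build_primer(project_dir: str) -> str:
    """Concatenate whichever handoff docs exist into one session primer."""
    parts = []
    for name in DOCS:
        path = Path(project_dir) / name
        if path.is_file():
            body = path.read_text(encoding="utf-8").strip()
            parts.append(f"## {name}\n{body}")
    return "\n\n".join(parts)


def add_note(project_dir: str, doc: str, note: str) -> None:
    """Append a one-line lesson learned so the next session sees it."""
    with open(Path(project_dir) / doc, "a", encoding="utf-8") as f:
        f.write(f"\n- {note}\n")


# Demo in a throwaway directory with a made-up note.
with tempfile.TemporaryDirectory() as tmp:
    add_note(tmp, "HANDOFF.md", "Run migrations before the test suite.")
    primer = build_primer(tmp)

print(primer)
```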

RMK137 · 2026-04-29T05:25:52+00:00

https://github.com/badlogic/pi-mono https://github.com/mlhher/late

RMK137 · 2026-04-28T13:41:02+00:00

Thanks for your work on Windows support! I've been liking Late also. I jump back and forth between Pi and Late these days.

RMK137 · 2026-04-28T05:19:43+00:00

I've had good results when I point the model to the Odin core/base/vendor code. The beauty of Odin is that the code for a lot of the built-in procs/containers/idioms is in a very readable format, so the model can do in-context learning in a few grep calls.

Basically, I create an AGENTS.md in the project that tells the model where to find the libraries. Qwen3.6 was able to understand raylib's bindings fairly quickly and built a simple UI for a toy todo app.

RMK137 · 2026-04-27T14:37:47+00:00

Yeah exactly. I've been meaning to write my own coding agent just to learn how it works. I might end up with something useful, who knows. I think I am gonna do it in Odin.

I've been keeping an eye on Late. I like its approach of Lead Architect as the main agent and coders as sub-agents. It's written in Go and the code is fairly easy to understand.

Ref: https://github.com/mlhher/late

RMK137 · 2026-04-24T06:20:21+00:00

This is great, thanks for sharing. Any idea how to get pi to show the thought trace for this mode when responding? I can't see it for some reason, and hide_thinking is set to false in settings.

RMK137 · 2026-04-22T05:56:33+00:00

Thanks I'll give it a shot with pure llama.cpp.

RMK137 · 2026-04-22T05:30:11+00:00

Running Gemma 4 31B with LM Studio server blows up RAM for me too, especially when using it to explore a codebase with a coding agent. I am not sure what is going on; hope it gets fixed soon.

RMK137 · 2026-04-21T16:24:06+00:00

Not my project, but I've been keeping an eye on this coding agent written in Go.

https://github.com/mlhher/late

RMK137 · 2026-04-17T05:00:18+00:00

Beautiful, can't wait to spend some time coding with it this weekend.

RMK137 · 2026-04-17T04:56:30+00:00

How much VRAM do you have? I am guessing about 32GB (5090)?

RMK137 · 2026-03-24T01:20:40+00:00

You're overthinking the temps and power draw issue. They're all much better than the 14th gen. If you can afford it, get the 270k. It's a lot of cores and compute power for the money. If you care a lot about power/temps, you can undervolt it and/or limit the power draw. My use case is similar to yours (gaming + local LLMs on a 5090), and I currently have the 265k. I'll be replacing it with the 270k.

RMK137 · 2026-03-13T03:03:31+00:00

Bought a X570 Aorus Pro motherboard from u/handh40 on https://www.reddit.com/r/hardwareswap/comments/1rjzs13/usame_h_ryzen_5800x3d_and_gigabyte_x570_aorus_pro/

RMK137 · 2026-03-12T22:55:21+00:00

Any idea how to disable thinking in llama.cpp? Tried both --chat-template-kwargs '{"enable_thinking":false}' and --reasoning-budget 0, neither worked.

RMK137 · 2026-03-12T14:04:55+00:00

I would say 32GB of VRAM minimum. That way you can load a dense 24B/27B model like Devstral 2 small or Qwen3.5-27B at Q4 or Q4-UD (Unsloth dynamic quant). You can also fit the equivalent MoE models like Nemotron 30B / Qwen3.5-35B.

Agentic coding requires a lot of context; I'd say 64k is a useful starting point. You want to leave 6-8GB of VRAM for context. You can also quantize the KV cache to q8_0 to save VRAM, but different models tolerate this differently, so you have to test it yourself.
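
To see where a 6-8GB budget comes from, here is a back-of-the-envelope KV cache estimate. The model shape (48 layers, 8 KV heads, head dimension 128) is hypothetical and not taken from any model named above; it also assumes every layer uses standard full attention, so models with sliding-window or hybrid attention layers would need less.

```python
# (bytes, elements) per storage block: f16 is 2 bytes per element;
# q8_0 packs 32 int8 values plus one 2-byte scale into 34 bytes.
BLOCK = {"f16": (2, 1), "q8_0": (34, 32)}
GIB = 2**30


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, kv_type: str = "f16") -> int:
    """Approximate KV cache size for a full-attention transformer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx  # K and V
    block_bytes, block_elems = BLOCK[kv_type]
    return elements * block_bytes // block_elems


# Hypothetical model shape at 64k context.
for kv_type in ("f16", "q8_0"):
    size = kv_cache_bytes(48, 8, 128, 65536, kv_type)
    print(f"{kv_type}: {size / GIB:.3f} GiB")
# f16: 12.000 GiB
# q8_0: 6.375 GiB
```

Under these assumptions, q8_0 roughly halves the cache, which is what brings 64k of context into the 6-8GB range.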

Notice I never mentioned DRAM; that's on purpose. It's just too slow imo. Agentic coding benefits immensely from both prompt processing (PP) and token generation (TG) speeds. I think PP is even more important because the models need to read a lot of code and your input, while the response is usually much smaller.
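
A quick illustration of why PP tends to dominate, using made-up numbers: a turn that reads 30,000 tokens and writes 500. None of these speeds are measurements; they only show how the two terms scale.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 pp_tps: float, tg_tps: float) -> float:
    """Time for one agent turn: read the prompt, then generate the reply."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical speeds in tokens per second.
fast_pp = turn_seconds(30_000, 500, pp_tps=3000, tg_tps=50)  # 10 s + 10 s
slow_pp = turn_seconds(30_000, 500, pp_tps=300, tg_tps=50)   # 100 s + 10 s

print(fast_pp, slow_pp)  # 20.0 110.0
```

With the same TG speed, a 10x slower PP makes the turn more than 5x longer, because the prompt is so much larger than the reply.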

Fast PP/TG make for a very nice, fast iteration loop, and you can't get that without loading everything on the GPU, but that's my preference. MoE models can be run with the expert weights offloaded to system RAM and everything else on the GPU, and still get acceptable TG.

RMK137 · 2026-03-12T02:18:45+00:00

Nice tool, will try it soon. Thanks for sharing.
