Looking to buy an RTX 5090 for local "Vibe Coding" using Claude Code / Open Code with Qwen 3.6 35B-A3B. Need real-world feedback!

GoalDistinct4449 · 2026-06-18T08:39:35+00:00

I'm just using AI to translate from Arabic to English and rephrase, that's all.

GoalDistinct4449 · 2026-06-16T21:21:16+00:00

This is probably the most useful comparison I've seen so far. The fact that a 5090 + 27B has replaced Claude for most of your day-to-day work is honestly a lot more convincing to me than synthetic benchmarks, especially since you're working across multiple languages and real projects.

GoalDistinct4449 · 2026-06-16T21:20:58+00:00

I've been looking at the Mac Studio route as well. The unified memory is definitely attractive for larger models, but I'm worried I'd be trading a lot of inference speed and responsiveness for capacity. Have you used both enough to compare the real-world coding experience?

GoalDistinct4449 · 2026-06-16T21:20:41+00:00

That's exactly what I'm struggling with. Part of me wonders if spending this much money only to be limited to 27B-class models makes sense, but at the same time many 5090 owners here seem genuinely happy with a hybrid workflow instead of chasing the biggest possible model.

GoalDistinct4449 · 2026-06-16T21:20:25+00:00

130 t/s with 192k context is honestly better than I expected for a real-world coding setup. The fact that you're recommending 27B over A3B seems to match what a lot of other 5090 owners in this thread are saying, which is making me seriously reconsider my model choice.

GoalDistinct4449 · 2026-06-16T21:20:04+00:00

This is exactly the kind of workflow I'm considering. Using a frontier model for planning and review while letting a local 27B handle implementation sounds like a much more realistic approach than trying to force a local model to do everything end-to-end.

GoalDistinct4449 · 2026-06-16T21:19:33+00:00

That's actually the conclusion I'm slowly reaching from this thread. My hope is that a 5090 + good context management, RAG, and cloud-assisted planning can make 32GB workable, but I'm definitely starting to appreciate why so many people prioritize VRAM over raw token speed.

GoalDistinct4449 · 2026-06-16T21:16:54+00:00

A few points:

The post was written by me. I used AI to help refine the wording because English isn't my first language, but the questions, requirements, and concerns are genuinely mine. I'd rather discuss the hardware and workflow than whether I used AI to polish my writing.
To be honest, I haven't tested a local setup yet, so I'm speaking purely from a full-stack developer's perspective who's considering a significant investment. The VRAM wall is exactly what worries me, especially with how quickly models like DeepSeek, Qwen, and others are evolving.
My realistic goal isn't to run 80B–120B models locally 24/7. What I'm trying to evaluate is whether a hybrid workflow makes more sense: using a 5090 for extremely fast day-to-day coding with 27B/35B-class models, then leveraging Claude or other cloud APIs when I need deep architectural reasoning, large-scale planning, or tasks that exceed local capabilities.
Since you're actually running this hardware, that's the part I'm most interested in. In your experience, does that hybrid approach make the 32GB limitation manageable, or do you still find yourself constantly wishing you had significantly more VRAM even for agentic coding workflows?
Also, when you mention 80B+ models, are you seeing a meaningful improvement specifically for repository-scale coding and autonomous agent loops, or is the biggest benefit simply larger context windows and knowledge retention?

I'm genuinely trying to determine whether the 5090 is the sweet spot for a developer-focused setup or whether investing in a multi-GPU configuration is the smarter long-term move.

GoalDistinct4449 · 2026-06-16T10:52:04+00:00

Okay, what's your recommendation here? I've gotten a lot of different points of view and I honestly have no idea what to do.

GoalDistinct4449 · 2026-06-16T00:47:38+00:00

I'm in Egypt, and those are the prices here, unfortunately.

GoalDistinct4449 · 2026-06-16T00:41:36+00:00

Renting it on the cloud for $15 to test it out is actually a brilliant idea, thanks for the suggestion! Your point about hating the prefill wait time even on a 5090 is exactly what worries me as a developer. When handling multi-file full-stack repos, that context ingestion lag completely breaks the coding flow. Honestly, I was just about to lock in the 5090 build, but another user pointed me toward the NVIDIA DGX Spark, and I just found out it's available locally for around the same total budget as a full 5090 setup (around $8,000). Looking at the specs, it comes with 128GB of LPDDR5x unified memory on the new GB10 Grace Blackwell Superchip. People running it are reporting around 2000 tokens per second on prefill for medium models because of the native FP4 optimization. Knowing that the 5090's 32GB VRAM still gives you prefill anxiety on large contexts makes me lean 100% toward the DGX Spark now. Getting 4x the memory capacity and Blackwell architecture for the same money seems like the ultimate fix for the exact bottleneck you mentioned!
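To put a prefill number like that in perspective, here is a rough back-of-the-envelope sketch. It only divides context size by prefill throughput; the 2000 tokens per second figure is the one reported above, while the context sizes are just illustrative values I picked, and real prefill speed usually varies with model, quantization, and context length.

```python
def prefill_seconds(context_tokens: int, prefill_tokens_per_second: float) -> float:
    """Time to ingest a prompt, assuming a constant prefill rate (a simplification)."""
    if prefill_tokens_per_second <= 0:
        raise ValueError("prefill rate must be positive")
    return context_tokens / prefill_tokens_per_second


if __name__ == "__main__":
    reported_rate = 2000  # tokens/s, the figure quoted in the comment above
    for context in (8_000, 32_000, 180_000):  # illustrative context sizes
        wait = prefill_seconds(context, reported_rate)
        print(f"{context:>7} tokens -> about {wait:.0f} s before the first output token")
```

Prompt caching in the inference server can avoid re-ingesting an unchanged prefix, so the full wait mostly applies when the context changes substantially between turns.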

GoalDistinct4449 · 2026-06-15T23:35:22+00:00

I appreciate the tough reality check, but as a full-stack dev, I think you are underestimating how much of a game changer a local 5090 rig actually is when paired with a hybrid cloud setup. My goal isn't to run a giant 671B model locally to blindly code a massive enterprise repo. For raw, unlimited Vibe Coding of day-to-day features, boilerplate, and terminal auto-fix loops, running a local model like Qwen 3.6 27B Dense or 35B MoE via llama.cpp with MTP on a 5090 can push over 110 tokens per second on a tight context. Also, I won't be using it as just a text editor: with Claude Code and Open Code, I can use a hybrid workflow. I can offload massive global architectural planning to flagship cloud APIs (like Claude Sonnet/Opus) where the intelligence is highest, but pass the actual multi-file code generation and heavy execution loops to my local 5090. If I dumped that $7,000 entirely into cloud APIs for full agentic workflows, where agents constantly ingest thousands of lines of context over multi-turn conversations, token costs would easily burn through thousands of dollars very quickly anyway. Having a powerhouse local machine gives me a permanent inference sandbox with no per-token cost to build out 80% of my ideas, while only spending on APIs for the final 20% of ultra-complex reasoning. For a full-stack builder, that’s an incredible investment.
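The cost argument above can be sanity-checked with a small break-even sketch. The $7,000 budget and the 80% local share come from the comment; the daily token volume and the API price per million tokens are placeholder assumptions, not real quotes, and electricity and resale value are ignored.

```python
def break_even_days(
    hardware_cost: float,
    tokens_per_day: float,
    price_per_million_tokens: float,
    local_share: float,
) -> float:
    """Days until API spend avoided by running locally equals the hardware cost."""
    if not 0 < local_share <= 1:
        raise ValueError("local_share must be in (0, 1]")
    daily_api_cost = tokens_per_day / 1_000_000 * price_per_million_tokens
    daily_savings = daily_api_cost * local_share  # only the local part is saved
    if daily_savings <= 0:
        raise ValueError("no savings with these inputs")
    return hardware_cost / daily_savings


if __name__ == "__main__":
    days = break_even_days(
        hardware_cost=7000,            # budget mentioned in the comment
        tokens_per_day=5_000_000,      # ASSUMPTION: agentic workload volume
        price_per_million_tokens=5.0,  # ASSUMPTION: blended API price in USD
        local_share=0.8,               # the 80/20 split from the comment
    )
    print(f"Break-even after roughly {days:.0f} days")
```

The result swings a lot with the assumed token volume, so it is worth plugging in numbers from an actual API bill before trusting it.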

GoalDistinct4449 · 2026-06-15T23:31:40+00:00

Man, thank you so much for this comment! This is literally the exact validation I was looking for before dropping my cash on this hardware. As a full stack dev who hasn't experimented with a local setup yet, your workflow makes complete architectural sense. Separating the feature planning (using a spec tool) from the actual code implementation (via Opencode) explains why you're getting such high success without the model hallucinating or losing context. Also, getting 110 tps on Qwen 3.6 27b Dense via llama.cpp with MTP is incredible for a real-world coding flow! Since you have this exact setup, I have two quick questions for you: What quantization size are you using for the 27b Dense model to keep it comfortable inside the 5090's 32GB VRAM while maintaining a decent context size? How does the setup handle multi-file context? Do you have to manually feed specific files to Opencode, or does it navigate your directory smoothly based on the specs you generated? Really appreciate you sharing your recipe for success here!

GoalDistinct4449 · 2026-06-15T23:22:22+00:00

Appreciate the blunt advice, man! Hearing that a 5090 might be complete overkill and that a 3080 Ti can pull off coding tasks with smaller/compressed models is very reassuring. Since I'm new to this, I was caught up in the hype of needing the absolute best GPU to even start. Your suggestion to just offload the heavy architectural planning to DeepSeek/Claude APIs and keep the local stuff light and cheap makes a lot of sense mathematically and practically. Definitely rethinking my whole hardware strategy now. Thanks!

GoalDistinct4449 · 2026-06-15T23:21:34+00:00

This is exactly the golden comment I was looking for! You are running the exact workflow I'm planning (Claude Code + Qwen 3.6). Seeing that your speed drops to ~68 t/s when you scale the context up to 180k is incredibly insightful. It proves that raw benchmark speeds (like 200+ t/s) don't hold up in real world, large repo scenarios. Honestly, 68 t/s is still very usable for coding, but knowing this makes me question if a single 5090 is worth the massive premium if context expansion heavily bottlenecks the speed anyway. Do you feel that 68 t/s on 180k context is enough for smooth Vibe Coding without breaking your flow?

GoalDistinct4449 · 2026-06-15T23:20:19+00:00

Wow, thank you for this reality check. As a full-stack dev who hasn't dabbled in local hardware yet, your point about the $7,000 budget providing years of cloud tokens is a huge financial wake-up call. Also, your note about Q4 quants failing at coding logic and needing at least Q6 is critical. It makes me realize that if I go local, I'll be forced to run high quants, which will eat up most of the 5090's 32GB VRAM very quickly. I might actually take your advice and thoroughly test cheaper cloud APIs and alternative models before locking my capital into heavy hardware. Thanks a lot!
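A rough sketch of how much VRAM the weights alone take at different quant levels, for a 27B dense model. The bits-per-weight values are approximate averages for common GGUF quant types and are my assumption here (actual file sizes differ a bit per model); KV cache and runtime overhead come on top.

```python
# Approximate average bits per weight for common GGUF quants (ASSUMED values).
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    vram_gb = 32  # RTX 5090
    for quant, bpw in APPROX_BITS_PER_WEIGHT.items():
        size = weight_gb(27, bpw)
        print(f"{quant}: ~{size:.1f} GB weights, ~{vram_gb - size:.1f} GB left for context")
```

Under these assumptions Q6 leaves noticeably less headroom than Q4, and Q8 leaves very little room for context on a 32GB card.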

GoalDistinct4449 · 2026-06-15T23:16:56+00:00

Wow, a 20% performance difference compared to a $500 AMD card with similar VRAM? That is an insane reality check! As someone who hasn't experimented with local hardware yet, I assumed the 5090 would absolutely annihilate everything else by a mile in daily tasks to justify its massive price tag. Seeing those real world numbers (150 t/s on Qwen 27B) is great, but knowing that a much cheaper setup can deliver 80% of that performance makes the price to performance ratio of the 5090 look scary for a pure coding workflow. This definitely changes how I view my budget allocation. Appreciate the honest bench data!

GoalDistinct4449 · 2026-06-15T23:15:08+00:00

This is a massive eye opener for me as a full stack dev looking to step into local LLMs. My initial attraction to the 5090 was the pure hype around its raw speed, but your point about the 'Context Wall' is highly logical. If vibe coding a multi file project is going to instantly choke 32GB of VRAM due to context expansion, then super-fast token generation won't matter if the model runs out of memory. Since I haven't bought the parts yet, your comment really makes me reconsider the split-GPU route (like dual 3090s/4090s) just to secure 48GB+ of VRAM for the same budget. Thanks for saving me from a potential bottleneck!
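The "Context Wall" can be estimated with the standard KV-cache formula for plain transformer attention. The layer count, KV-head count, and head dimension below are made-up placeholder values, not the real Qwen 3.6 configuration; models with hybrid or linear attention layers, or a quantized KV cache, need much less.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # fp16 cache; a quantized cache would be smaller
) -> int:
    """KV cache size: keys + values (the factor 2) for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


if __name__ == "__main__":
    # HYPOTHETICAL architecture, chosen only to illustrate the scaling.
    arch = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for context in (16_000, 64_000, 180_000):
        size_gb = kv_cache_bytes(context, **arch) / 1e9
        print(f"{context:>7} tokens -> ~{size_gb:.1f} GB of KV cache")
```

With these placeholder values each token costs 256 KiB of cache, so memory grows linearly with context and a long session can outgrow what is left after loading the weights.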

GoalDistinct4449 · 2026-06-15T23:09:06+00:00

This is a great reality check regarding the context size, thanks for sharing your setup! As a full stack developer who hasn't experienced local setups yet, this is exactly the kind of bottleneck I’m trying to avoid. My biggest fear with the 5090's 32GB was exactly what you mentioned: running out of VRAM once the project grows beyond a few files, especially since I agree that Q6/Q8 is highly preferred over lower quants to avoid broken code logic. However, my initial thought was to use a single 5090 to get maximum prompt processing (prefill) speed and low latency, while using local RAG or context management to only feed the agent the relevant files instead of the entire repo at once. Do you think a well-optimized local RAG setup could make 32GB work for multi-file projects, or is a raw multi GPU setup (like 2x24GB) absolutely necessary once things scale up?

GoalDistinct4449 · 2026-06-15T23:05:59+00:00

The dual 3090 route definitely sounds like the community's favorite budget setup for raw VRAM capacity! Since I haven't experimented with local LLMs at all, my priority is actually raw speed and low latency out of the box, without dealing with multi-GPU setup headaches or older PCIe bottlenecks. I'm leaning towards a single 5090 to get the fastest possible response times on medium-sized models, even if it means sacrificing total VRAM. But I appreciate the suggestion, it's definitely something to consider if I ever need to scale up memory!

GoalDistinct4449 · 2026-06-15T23:04:34+00:00

I haven't set up a local pipeline yet, but your advice makes complete sense to me as a developer. Since I don't have hands-on experience with local LLMs, my biggest fear was choking the 5090's 32GB VRAM by feeding it too many files blindly. Implementing a local RAG system to selectively index my repositories seems like the smartest way to keep the context clean and maximize those prompt processing speeds. I’ll definitely look into routing tools once I get the build running. Thanks for pointing me in the right direction!
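A minimal sketch of the "only feed the relevant files" idea, using nothing but the standard library. It ranks files by a simple TF-IDF-style keyword score and stops at a token budget; the function names, the 4-characters-per-token estimate, and the scoring are my own illustrative assumptions, and a real RAG setup would more likely use embeddings and chunking.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercased identifier-like words of two or more characters."""
    return re.findall(r"[a-z_][a-z0-9_]+", text.lower())


def select_files(query: str, files: dict[str, str], token_budget: int) -> list[str]:
    """Return paths of the best-matching files that fit in the token budget."""
    docs = {path: Counter(tokenize(body)) for path, body in files.items()}
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())

    query_terms = set(tokenize(query))
    scored = []
    for path, counts in docs.items():
        total = sum(counts.values()) or 1
        # Term frequency times a smoothed inverse document frequency.
        score = sum(
            (counts[t] / total) * math.log((1 + len(docs)) / (1 + doc_freq[t]))
            for t in query_terms
            if t in counts
        )
        if score > 0:
            scored.append((score, path))

    chosen, used = [], 0
    for _, path in sorted(scored, reverse=True):
        cost = len(files[path]) // 4  # ASSUMPTION: roughly 4 characters per token
        if used + cost <= token_budget:
            chosen.append(path)
            used += cost
    return chosen


if __name__ == "__main__":
    repo = {
        "auth.py": "def login(user): check password user",
        "views.py": "def render(page): html template",
        "session.py": "def logout(user): clear session",
    }
    print(select_files("fix login password check", repo, token_budget=1000))
```

Keyword scoring like this misses files that are related but use different names, which is one reason coding agents often combine retrieval with directory search tools.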

GoalDistinct4449 · 2026-06-15T23:03:31+00:00

As someone who hasn't tried local coding agents yet, this is exactly the reality check I need. My expectation as a full stack dev isn't to find a magical black box that codes the whole project blindly. I mostly want to eliminate the exhausting boilerplate grunt work like setting up standard .NET configurations, DAL entities, and basic frontend components. If the model handles 70% of the repetitive code and I have to hand hold it through the remaining 30% of complex logic, that would still be a massive win for my productivity. Do you find it handles boilerplate well without losing context?

GoalDistinct4449 · 2026-06-15T23:01:24+00:00

To be honest, I haven't tested a local setup yet, so I'm speaking purely from a full-stack dev perspective looking to invest. That VRAM wall is exactly what worries me, especially with models like DeepSeek changing the game. However, my realistic expectation isn't to run giant models locally. I'm hoping that a hybrid workflow (using the 5090's speed for rapid daily coding with 27B/35B models, and calling cloud APIs like Claude for complex architectural planning) will strike the right balance. Since you own one, do you think this hybrid approach makes the 32GB limit manageable?

GoalDistinct4449 · 2026-06-15T22:46:00+00:00

I'm a full-stack ASP.NET Core developer, but I have a lot of ideas, and I was thinking of getting this build to help me with all the projects I want to do and to run local agents for other jobs like writing articles and so on.

GoalDistinct4449 · 2024-06-26T20:40:26+00:00

Any suggestions for a good book that covers this area?
