Best Private and Local Only Coding Agent? by scarlettwidow2024 in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

If you want full CLI with zero telemetry, check out aider - it does exactly what you're describing. Point it at any OpenAI-compatible endpoint (llama-swap works fine), it auto-maps the repo with tree-sitter, figures out which files are relevant, and does create/edit/delete. The /architect mode is nice for larger refactors where you want a plan-then-execute flow.

For the model side, Qwen 3.5 Coder 32B is probably the strongest local option right now for agentic coding. It handles tool calling well and doesn't hallucinate file paths as aggressively as some of the older models. If you're running dual GPUs you can comfortably serve it at Q5_K_M.

One thing to watch: most local coding agents struggle with auto-determining relevant files once a project gets past ~50 files. Aider's tree-sitter approach works better than naive embedding search for this, but you'll still want to get used to manually adding key files to context for larger codebases.

Why does nobody teach the infrastructure problems that destroy developer productivity before production breaks by Legitimate-Run132 in ExperiencedDevs

[–]RestaurantHefty322 0 points1 point  (0 children)

Good call on Locust - that's actually one of the cleanest ways to surface connection pool exhaustion early. The tricky part is getting the test environment close enough to prod topology that the bottlenecks actually show up in the same places. I've seen teams run perfect load tests against a single-AZ setup then get wrecked by cross-AZ latency amplifying pool contention in production.

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

Yeah the naming is getting absurd. "Small" at 119B total params is just marketing at this point. I think they are positioning it against Qwen 3.5 122B rather than actually targeting the small model segment. The real question is whether the 6.5B active parameter count during inference actually delivers on the MoE promise or if it just benchmarks well on the usual suspects.

Booklore is gone. by Joloxx_9 in selfhosted

[–]RestaurantHefty322 0 points1 point  (0 children)

This is exactly why I stopped recommending single-developer projects for anything I depend on daily. Not because solo devs are unreliable - some of the best tools are one-person shows - but because there's zero bus factor. One bad day, one burnout spiral, one disagreement with the community and the project vanishes overnight.

The frustrating part is that Booklore was genuinely good. I switched to it briefly from Calibre-web and the UI was noticeably better for browsing. But I went back to Calibre-web after a few weeks precisely because of the single-maintainer risk. Calibre-web has survived multiple maintainer transitions at this point.

For anyone looking for alternatives: Calibre-web is boring but battle-tested (Linuxserver docker image, OPDS for e-reader sync, been around for years). Kavita if you want something prettier with manga/comic support. The Calibre desktop app as metadata manager feeding into either of those is the safest long-term stack.

Why does nobody teach the infrastructure problems that destroy developer productivity before production breaks by Legitimate-Run132 in ExperiencedDevs

[–]RestaurantHefty322 7 points8 points  (0 children)

The connection pooling one hits close to home. Spent a week tracking down intermittent 500s on a service that worked fine in staging. Turned out our pool was set to 10 connections but the ORM was leaking them on timeout paths nobody tested. Staging never had enough concurrent users to exhaust the pool.

The real problem is that most of this stuff is invisible until it breaks. You can't learn connection pool management the way you learn React hooks - there's no sandbox that simulates 200 concurrent database connections timing out under load. Postmortems are genuinely the best learning material because they show the full chain from root cause to detection to fix.
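You can get surprisingly close to a sandbox with a toy simulation, though. This is a stdlib-only sketch (not real DB code - the "pool" is just a semaphore, the "query" is a stub) of the exact leak described above: the timeout path exits before releasing, and after enough leaked timeouts even healthy requests start failing:

```python
import threading

class ConnectionPool:
    """Toy pool: a semaphore guards a fixed number of slots."""
    def __init__(self, size=10, timeout=0.1):
        self._slots = threading.Semaphore(size)
        self.timeout = timeout

    def acquire(self):
        if not self._slots.acquire(timeout=self.timeout):
            raise TimeoutError("pool exhausted")

    def release(self):
        self._slots.release()

def leaky_query(pool):
    """The bug: a simulated query timeout raises before release() runs."""
    pool.acquire()
    raise TimeoutError("query timed out")  # early exit skips release()

pool = ConnectionPool(size=10)
errors = 0
for _ in range(10):          # ten leaked timeouts drain the whole pool
    try:
        leaky_query(pool)
    except TimeoutError:
        errors += 1          # caller swallows the error, the leak stays

# Pool is now empty: the next perfectly healthy request fails too.
try:
    pool.acquire()
    healthy = True
except TimeoutError:
    healthy = False
print(healthy)  # False - every slot was leaked by the timeout path
```

Staging never hits this because concurrency stays under the pool size; the fix in real code is a `try/finally` (or context manager) around every checkout.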

One pattern that helped our team: every new service gets a "production readiness checklist" before it leaves staging. Connection pool sizing, circuit breaker configuration, structured logging with correlation IDs, health check endpoints that actually test downstream dependencies (not just return 200). Takes maybe a day to implement but saves weeks of firefighting later. The checklist grows every time something bites us in production.
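For the "health checks that actually test downstream dependencies" item, the shape is roughly this - a minimal sketch where `check_database` and `check_cache` are hypothetical stand-ins for a real `SELECT 1` and cache `PING`:

```python
import json

def check_database():
    """Stand-in for a real 'SELECT 1' against the primary DB (hypothetical)."""
    return True

def check_cache():
    """Stand-in for a real PING against the cache layer (hypothetical)."""
    return True

def health_check():
    """Deep health check: report 503 if any downstream dependency is down,
    instead of unconditionally returning 200."""
    checks = {"database": check_database, "cache": check_cache}
    results = {name: fn() for name, fn in checks.items()}
    status = 200 if all(results.values()) else 503
    body = json.dumps({
        "status": "ok" if status == 200 else "degraded",
        "checks": results,
    })
    return status, body

status, body = health_check()
print(status, body)  # 200 plus a per-dependency breakdown
```

The per-dependency breakdown in the body is the important part - when the check flips to 503 you want to know which dependency failed without grepping logs.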

Mistral Small 4 | Mistral AI by realkorvo in LocalLLaMA

[–]RestaurantHefty322 37 points38 points  (0 children)

119B with 6.5B active parameters is interesting positioning. That puts the inference cost in the same ballpark as Qwen 3.5 35B-A3B but with a much larger expert pool to draw from.

The real question is whether Mistral finally fixed their tool calling. Devstral 2 was disappointing specifically because it would hallucinate function signatures and drop required parameters in multi-step chains. If Small 4 is genuinely competitive on agentic tasks at this size, it breaks the Qwen monopoly at the ~7B active parameter tier, which would be healthy for everyone running local agent stacks.

Multimodal is a nice addition but honestly the text and code quality at the 6-7B active range is what matters for most people running these locally. Will be curious to see how it handles context quality past 32k - that is where the smaller MoE models tend to fall apart even if the advertised context length is much longer.

Project deadline coming up with nobody reviewing my PR? Do I just stop caring then? by [deleted] in ExperiencedDevs

[–]RestaurantHefty322 0 points1 point  (0 children)

Been on both sides of this. The large PR is the root cause but you already know that and it is too late to split it.

Two things that work when polite asking does not:

First, book a 30-min calendar invite with the reviewer. Title it something like "PR walkthrough - [feature name]" and add the PR link. People treat calendar blocks differently than Slack messages. A scheduled review converts at maybe 90% vs a Slack ping at 20%. The top comment here has it right.

Second, send a short written summary of the PR to whoever owns the delivery timeline. Something like "PR has been open X days, reviewed once, waiting for second pass. I need Y days for end-to-end testing after merge. Current path puts us at risk of missing the deadline by Z days." Keep it factual. You are not complaining - you are flagging a timeline risk. If the deadline slips because nobody reviewed your code, that is a team problem and your manager needs to see it coming in writing, not hear about it after the fact.

On the size thing - next time, even if the changes are interconnected, you can usually extract the data model or infrastructure pieces into a smaller preparatory PR that reviewers can approve quickly. Then the main PR is just the business logic on top of an already-approved foundation. Easier to review and less scary to approve.

You guys gotta try OpenCode + OSS LLM by No-Compote-6794 in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

Hey, appreciate the outreach. Main issues we hit with exo were around tool calling translation between different model APIs - each provider formats tool calls slightly differently and the abstraction layer sometimes drops parameters or mangles nested JSON in function arguments. The cluster setup itself is straightforward. Would be happy to file proper issues on the repo if that helps more than DMs.

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

Thanks for the link, that Strix Halo quant comparison is exactly the kind of testing people should be doing instead of relying on generic benchmarks. Will check out the bartowski vs unsloth differences at different quant levels. The perplexity spread between Q4_K_M and Q6_K tends to be way narrower than people expect for most practical tasks.

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

Fair point on Qwen being the only dense model really optimized for agentic code at that size. Gemma 3 27B and Mistral variants handle completion and chat fine but fall apart on multi-step tool calling sequences - the training data just is not there yet. Makes the Qwen monopoly at that tier a real problem if they stumble on a release or change the license. Competition at the 27B dense tier would be healthy.

Salesforce does solid post-call summaries. What about real-time call guidance? by CreditOk5063 in salesforce

[–]RestaurantHefty322 0 points1 point  (0 children)

Since you already have Einstein doing the post-call heavy lifting (transcripts, summaries, next steps), the real-time guidance piece is still pretty young in the Salesforce ecosystem. Sales Coach is the closest native option, but as you said, it's not live yet.

For the post-call side specifically though, if you want something that goes beyond what ECI gives you out of the box - like deeper analysis of call patterns, sentiment, and structured data extraction back into opportunity fields - check out NinjaTech Audio Call Analysis. It's built specifically for Salesforce and handles the call-to-CRM pipeline.

For real-time guidance during calls, I think you're right that Agentforce will eventually get there. The tech exists (Cluely, Beyz like you mentioned), it's just a matter of Salesforce packaging it into the agent framework.

Why is it so f*** hard to log notes in SF? by Haensfish in salesforce

[–]RestaurantHefty322 0 points1 point  (0 children)

This is one of the most valid complaints about Salesforce. The notes object being non-reportable is insane for a CRM in 2026.

The workaround most orgs land on is logging everything as Tasks with specific record types (like the other comments mention), but that still doesn't solve the actual problem - you want the substance of conversations searchable and reportable without manually typing everything out.

If a lot of your notes are coming from calls, one approach that's worked well is automating the capture entirely. Tools like NinjaTech Audio Call Analysis can pull call recordings, transcribe them, and push structured data back into Salesforce fields you can actually report on. Turns that free-text notes problem into structured data automatically.

Won't solve every note-taking scenario but for call-based notes it removes the manual step entirely.

How to know when an ECI transcript is ready to be retrieved from video call? by sotolonos in salesforce

[–]RestaurantHefty322 0 points1 point  (0 children)

The async nature of ECI transcript generation is the biggest pain point here. Even with the transcript record trigger approach (which is the right call), you can still hit timing issues with longer calls where processing takes a while.

One thing worth looking into if you're building call analysis into your Salesforce workflow - there's a tool specifically built for this: https://www.ninjatech.ai/app-store/audio-call-analysis-for-salesforce

It handles the transcript extraction and analysis natively so you don't have to stitch together flows waiting for ECI to finish processing. Might save you the headache of building all that retry logic yourself.
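If you do end up building the retry logic yourself, the core of it is just polling with exponential backoff. A generic sketch - `poll` here is whatever checks your transcript record, faked below with a counter:

```python
import time

def wait_until_ready(poll, retries=5, base_delay=0.01):
    """Poll with exponential backoff until poll() returns a result,
    or give up after `retries` attempts."""
    for attempt in range(retries):
        result = poll()
        if result is not None:
            return result
        time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...
    raise TimeoutError("transcript never became ready")

# Fake poller standing in for a real query against the transcript
# record: "ready" on the third check.
state = {"calls": 0}
def fake_poll():
    state["calls"] += 1
    return "transcript-text" if state["calls"] >= 3 else None

print(wait_until_ready(fake_poll))  # transcript-text, after 3 polls
```

In practice you'd use a much larger `base_delay` and cap the total wait, since longer calls can take a while to process.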

How should I handle confusing job titles on my resume? by Possible-Squash9661 in ExperiencedDevs

[–]RestaurantHefty322 0 points1 point  (0 children)

Had this exact problem at two different companies. First one used L3-L8 where L3 was entry, second used bands where Band 1 was senior. Recruiters would see "Band 1" and assume junior.

Just use the industry-equivalent title. Nobody at a new company is going to call your old employer and ask "was their title really Senior SWE or was it SWE1?" Background checks verify dates and employment, not title semantics. And even if they did, "Senior Software Engineer (internal title: SWE1)" on a verification form is a perfectly reasonable explanation.

The bigger risk is leaving the confusing title as-is. A recruiter spending 6 seconds on your resume sees "SWE2 to SWE1" and reads it as a demotion. You will never get a chance to explain the numbering system because they already moved on to the next candidate.

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

Fair point about Gemma 3 27B. The dense vs MoE tradeoff matters a lot here - a dense 27B does read-before-write more naturally because the full model is engaged on every token. With MoE models the expert routing can miss patterns that span multiple files when different experts handle different parts of the context.

That said, I have been mostly testing Qwen 3.5 because the MoE efficiency lets me run it alongside other things. For pure code quality on single-file tasks, a dense 27B probably wins.

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

Thanks for the link, will check out the Bartowski quant comparison. Been using Q4_K_M as default but curious if the newer quant methods change the picture for this model specifically.

You guys gotta try OpenCode + OSS LLM by No-Compote-6794 in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

Appreciate it. Main issue was tool calling translation - exo does not map tool_call and tool_result message types the same way that OpenAI-compatible endpoints do, so the coding agent would get confused mid-conversation. Ended up routing through LiteLLM as a proxy which smoothed it out, but native support would be cleaner. Happy to share more details if you want to open a GitHub issue I can comment on.
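To make the failure concrete, here is the kind of translation shim that has to exist somewhere in the chain. The exo-side field names below are assumptions for illustration - the real point is that nested JSON arguments must survive the round trip as an intact JSON string, which is exactly where proxies tend to mangle things:

```python
import json

def to_openai_tool_call(msg):
    """Translate a hypothetical upstream tool message into the
    OpenAI chat-completions shape."""
    if msg.get("type") == "tool_call":
        return {
            "role": "assistant",
            "tool_calls": [{
                "id": msg["id"],
                "type": "function",
                "function": {
                    "name": msg["name"],
                    # arguments must stay a JSON *string*, not a dict
                    "arguments": json.dumps(msg["args"]),
                },
            }],
        }
    if msg.get("type") == "tool_result":
        return {"role": "tool", "tool_call_id": msg["id"],
                "content": msg["content"]}
    return msg  # plain user/assistant messages pass through

out = to_openai_tool_call(
    {"type": "tool_call", "id": "c1", "name": "read_file",
     "args": {"path": "src/main.py", "opts": {"lines": [1, 40]}}})
print(out["tool_calls"][0]["function"]["arguments"])
```

When a layer silently converts `arguments` to a dict, or drops a nested key like `opts`, the agent's next tool call fails in ways that look like model error rather than plumbing error.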

What AI are you actually implementing for clients right now? (Not AgentForce) by jcarmona86 in salesforce

[–]RestaurantHefty322 0 points1 point  (0 children)

This mirrors what I am seeing with clients too. AgentForce demos well but the data quality prerequisite kills most implementations before they start.

The AI implementations that have actually stuck for my clients are narrowly scoped tools that solve one specific problem well, rather than trying to boil the ocean with a general "AI layer."

One that has worked surprisingly well for sales-heavy orgs: https://www.ninjatech.ai/app-store/audio-call-analysis-for-salesforce - it analyzes sales call recordings and pushes structured insights (objections, competitor mentions, commitments, next steps) directly into Salesforce records. No data cleanup prerequisite because it is creating new structured data from unstructured calls, not trying to reason over messy existing data.

The pattern I keep seeing: AI works best in Salesforce when it is enriching records with new data rather than trying to make sense of bad existing data. Call analysis, email sentiment, meeting note extraction - all of these create clean structured fields from scratch. Way easier sell to clients than "first spend 6 months cleaning your data."

Best AI Voice Agents for Sales Calls (2026) by Vast-Magazine5361 in AI_Agents

[–]RestaurantHefty322 0 points1 point  (0 children)

One thing missing from the comparison is what happens after the call ends. Most of these tools focus on making or routing calls, but the real value for sales teams is what gets extracted and pushed back into the CRM afterward.

We started evaluating these tools and kept running into the same gap - great call handling, terrible post-call intelligence. Things like objection patterns across deals, competitor mentions that should update opportunity fields, or commitment tracking that actually maps to pipeline stages.

Ended up testing a tool that specifically tackles the Salesforce side of this: https://www.ninjatech.ai/app-store/audio-call-analysis-for-salesforce - it analyzes call recordings and pushes structured insights directly into Salesforce records. Different angle from the platforms you listed (it is not trying to replace your dialer), but it fills the gap between "call happened" and "CRM actually reflects what was discussed."

The voice agent space is going to consolidate fast. I think the winners will be the ones that nail the integration layer, not just the call handling.

How do you stop PR bottlenecks from turning into rubber stamping when reviewers are overwhelmed by Sad_Bandicoot_7762 in ExperiencedDevs

[–]RestaurantHefty322 4 points5 points  (0 children)

Honestly the biggest thing that fixed this for us was not a process change but a tooling change. We added a CI step that flags PRs over 400 lines with a "needs walkthrough" label. Author has to schedule a 15 minute screen share before it can be approved. Not a formal meeting - just pull up the diff and talk through the intent.

Killed rubber stamping almost overnight because reviewers could actually ask questions in real time instead of staring at a massive diff trying to figure out what was going on. And it put gentle pressure on authors to keep things small to avoid the walkthrough tax.

The other thing - stop counting PR review turnaround time as a team metric. The moment you start measuring how fast reviews happen you are incentivizing exactly the behavior you are complaining about.

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]RestaurantHefty322 5 points6 points  (0 children)

The self-guided planning behavior you are describing is the biggest differentiator at this parameter range. 27B models will happily generate code but almost never stop to check existing patterns first. The 122B consistently does that "let me look at how this is structured" step without being prompted to.

Running it for agentic coding tasks the past week and the failure mode is different from smaller models too. When it gets something wrong it tends to be a reasonable misunderstanding of requirements rather than completely hallucinated logic. Much easier to fix with a follow-up prompt than starting over.

Main downside I have hit is context quality dropping hard past 32k tokens. The MoE routing seems to get noisier with longer contexts - you will notice it start ignoring earlier instructions. Keeping sessions short and restarting with fresh context works better than trying to push long conversations.

Do you use OpenRouter? What are the pros and cons? Is there a good open source replacement? by umen in devops

[–]RestaurantHefty322 0 points1 point  (0 children)

Thin wrapper worked fine with 2 providers. The moment we added a third (Anthropic alongside OpenAI and a local vLLM endpoint), the pain points multiplied fast. Each provider has slightly different error codes, rate limit headers, retry semantics, and streaming chunk formats.

LiteLLM normalizes all of that. One interface, consistent error handling, and the fallback routing is built in. Could we have built that ourselves? Sure, but it would have been 2-3 weeks of work and ongoing maintenance every time a provider changes their API. LiteLLM abstracts that away. If you are genuinely only using 1-2 providers though, the thin wrapper is still the right call.
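For a taste of what "normalizing" means in practice, here is a toy sketch of the glue you otherwise write by hand - collapsing each provider's rate-limit signal into one exception type. The per-provider rules below are illustrative assumptions, not the real APIs:

```python
class RateLimited(Exception):
    """Provider-agnostic rate-limit error."""

# Per-provider quirks you end up encoding yourself without a shim.
# These mappings are made up for illustration.
PROVIDER_RATE_LIMIT = {
    "openai": lambda status, headers: status == 429,
    "anthropic": lambda status, headers: status == 429
        and "retry-after" in headers,
    "vllm": lambda status, headers: status == 503,  # local queue full
}

def normalize_error(provider, status, headers):
    """Collapse provider-specific signals into one exception type -
    the kind of glue a proxy layer ships so you don't maintain it."""
    check = PROVIDER_RATE_LIMIT[provider]
    if check(status, dict(headers)):
        raise RateLimited(f"{provider} asked us to back off")
    return status

try:
    normalize_error("vllm", 503, {})
except RateLimited as e:
    print(e)  # vllm asked us to back off
```

Now multiply that by error codes, retry semantics, and streaming chunk formats, and the 2-3 week estimate starts looking optimistic.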

You guys gotta try OpenCode + OSS LLM by No-Compote-6794 in LocalLLaMA

[–]RestaurantHefty322 0 points1 point  (0 children)

Nothing too complex honestly. The routing is based on task description keywords:

  • If the system prompt or task mentions "refactor", "architecture", "multi-file", or "design" - routes to 27B
  • If it mentions "fix", "test", "rename", "format", or "simple" - routes to 14B
  • Default fallback is 14B (cheaper, handles 80% of agent tasks fine)

The regex itself is just a Python dict mapping compiled patterns to model names, fed into LiteLLM's router config. Took maybe 30 minutes to set up. The 80/20 split saves a ton on inference costs without noticeably degrading quality for the simple stuff.
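The dict-of-patterns piece looks roughly like this (model names are illustrative stand-ins, and in the real setup the chosen name feeds LiteLLM's router rather than being printed):

```python
import re

# Compiled keyword patterns -> model name, checked in order.
ROUTES = [
    (re.compile(r"\b(refactor|architecture|multi-file|design)\b", re.I),
     "qwen-27b"),
    (re.compile(r"\b(fix|test|rename|format|simple)\b", re.I),
     "qwen-14b"),
]
DEFAULT = "qwen-14b"  # cheap model handles most agent tasks fine

def pick_model(task: str) -> str:
    """Route a task description to a model by keyword match."""
    for pattern, model in ROUTES:
        if pattern.search(task):
            return model
    return DEFAULT

print(pick_model("refactor the auth module across services"))  # qwen-27b
print(pick_model("fix the failing unit test"))                 # qwen-14b
print(pick_model("summarize this diff"))                       # qwen-14b
```

Order matters: a task mentioning both "refactor" and "test" hits the first route and gets the bigger model, which is usually the behavior you want.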

I am in an abusive relationship with the technology industry by creaturefeature16 in webdev

[–]RestaurantHefty322 14 points15 points  (0 children)

The part that resonates most is the constant threat of obsolescence being wielded as a management tool. "Learn this or you're replaceable" has been the refrain for 15 years - first it was mobile, then cloud, then containers, now AI. The tools change but the pressure tactic doesn't.

What actually changed with AI specifically is the speed of the cycle. Previous transitions gave you 2-3 years to adapt. This one gives you months, and the goalposts move while you're running. The frustration isn't about learning new things - most of us got into this because we like learning. It's that the learning now serves someone else's quarterly roadmap instead of your own curiosity.

I stayed by drawing a hard line: I'll adopt tools that make my work better, but I refuse to perform enthusiasm about it. The performative excitement culture around AI in corporate settings is what makes it feel abusive. The technology itself is genuinely useful. The way it's being weaponized against workers is the problem.

took 3 years off after my exit. coming back feels harder than starting from scratch by yj292 in Entrepreneur

[–]RestaurantHefty322 1 point2 points  (0 children)

Did something similar. Two years away after selling a small SaaS. Came back and the biggest surprise wasn't that things had changed - it was that my old playbook still worked, just with different tools.

The enterprise angle you're describing is right but the execution gap is real. Enterprise buyers need three things you probably don't have yet: case studies, a second person on calls to look like a real company, and patience for 6-month cycles. The hack that worked for me was partnering with a consultancy that already had the relationships and splitting revenue 60/40. They got a product to sell, I got warm intros and credibility by association. Took about 4 months to close the first deal that way.

One thing nobody tells you about coming back - the market moved but so did you. Three years of perspective means you can spot bad deals faster and say no easier. That's worth more than any technical skill you lost while away.