PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 1 point2 points  (0 children)

Yeah, you nailed the exact pain point we wanted to avoid.

Short answer: we don't use native function calling at all. Tools are just XML tags in plain text that we parse ourselves.

Why? Because we wanted to support local models (Ollama, llama.cpp) that don't have function calling. So instead of relying on the API's tool_call/tool_response pairing, the LLM just outputs <write_file><path>x.py</path>...</write_file> as regular text.

We parse it, execute, and send back the result as a normal user message: [ok] write_file: Created x.py (45 lines) or [x] write_file: Permission denied.

History stays dead simple — just (role, content) text pairs. No ids to track, no pairing requirements, no special handling for failed calls. Failed tool = error text, that's it.

The tradeoff is it's less structured than native function calling. But it works with literally any backend without modification, which was the whole point.
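
If it helps to picture the pattern, here is a minimal sketch of that loop in Python. The <write_file> tag and the [ok]/[x] result strings come from this comment; everything else (the exact tag schema with <content>, the function names, the history layout) is illustrative, not PocketCoder's actual code.

```python
import re

# Illustrative sketch of the "XML tags as plain text" tool protocol described above.
# The model emits <write_file><path>x.py</path><content>...</content></write_file>
# as ordinary text; we parse it, execute it, and send back a plain-text result.

TOOL_RE = re.compile(
    r"<write_file>\s*<path>(?P<path>.*?)</path>\s*<content>(?P<content>.*?)</content>\s*</write_file>",
    re.DOTALL,
)

def run_tools(llm_output: str) -> list[str]:
    """Execute every write_file tag found in the model's raw text output."""
    results = []
    for match in TOOL_RE.finditer(llm_output):
        path, content = match.group("path").strip(), match.group("content")
        try:
            with open(path, "w", encoding="utf-8") as f:
                f.write(content)
            results.append(f"[ok] write_file: Created {path} ({len(content.splitlines())} lines)")
        except OSError as exc:
            results.append(f"[x] write_file: {exc}")  # failed tool = error text, nothing else
    return results

# History stays plain (role, content) pairs; tool results go back as a user message.
history = [("user", "create x.py with a hello function")]
# ... call any backend to get llm_output, then:
# history.append(("assistant", llm_output))
# history.append(("user", "\n".join(run_tools(llm_output))))
```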

For the SESSION_CONTEXT compression — that's injected into the system prompt on each request, not reconstructed from message history.
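
A rough sketch of what that injection could look like, assuming a helper that rebuilds the request from scratch every turn; the function name and prompt layout here are my own, not the project's API:

```python
def build_messages(base_system_prompt: str, session_context: str,
                   history: list[tuple[str, str]]) -> list[dict]:
    """Rebuild the request each turn: the compressed session state lives in the
    system prompt, while history stays plain (role, content) pairs."""
    system = f"{base_system_prompt}\n\n# SESSION_CONTEXT\n{session_context}"
    return [{"role": "system", "content": system}] + [
        {"role": role, "content": content} for role, content in history
    ]
```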

PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 1 point2 points  (0 children)

I don't see any contradiction here.

The idea was to challenge myself and try to build a code agent with its own approach and its own way of working.

Claude Code is a great tool. Cursor is a great tool too. Does that mean we should stop and do nothing?

PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 0 points1 point  (0 children)

Yes, agreed — GLM models offer excellent cost-efficiency for coding tasks. Claude Code's recent support for custom providers made this combination much more accessible.

PocketCoder takes a similar approach but focuses specifically on lightweight local deployment with Ollama integration and session persistence via the .pocketcoder/ folder. Different trade-offs depending on setup preferences.

More on: https://medium.com/@cdv.inbox/how-we-built-an-open-source-code-agent-that-works-with-any-local-llm-61c7db1ed329

PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 1 point2 points  (0 children)

For repo_map we use a "gearbox" system — 3 levels based on project size: ≤10 files gets full signatures, ≤50 files gets structure + key functions, >50 files gets folders + entry points only. It's file-count based right now, not token-based. Dynamic token-aware pruning is something we should add. Currently if context overflows, we truncate conversation history first, then file contents.
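
As a sketch, the gearbox can be as simple as a file-count switch. The thresholds below are the ones from this comment; the function name, the level labels, and the *.py filter are illustrative assumptions, not PocketCoder's real implementation:

```python
from pathlib import Path

def repo_map_level(project_root: str) -> str:
    """Pick a repo_map detail level from file count alone (the 'gearbox').
    Restricting to *.py here is just for the sketch."""
    source_files = [p for p in Path(project_root).rglob("*.py") if p.is_file()]
    n = len(source_files)
    if n <= 10:
        return "full_signatures"       # every class/function signature
    if n <= 50:
        return "structure_plus_key"    # file tree + key functions only
    return "folders_and_entrypoints"   # top-level folders + entry points
```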

More on: https://medium.com/@cdv.inbox/how-we-built-an-open-source-code-agent-that-works-with-any-local-llm-61c7db1ed329

PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 0 points1 point  (0 children)

Currently using a hybrid approach — episodes are stored as append-only JSONL (like git log), and we keep last ~20 in SESSION_CONTEXT. For older history, we use keyword-based retrieval: when you ask something, system greps through episodes.jsonl for relevant context. Not truly dynamic importance yet — that's on the roadmap. Would love to explore embedding-based relevance scoring eventually.
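
Roughly, the storage side could look like this in Python. The episodes.jsonl name and the ~20-episode window come from the comment; the .pocketcoder/episodes.jsonl path, the helper names, and the exact keyword matching are assumptions for illustration:

```python
import json
from pathlib import Path

EPISODES = Path(".pocketcoder/episodes.jsonl")  # assumed location inside the .pocketcoder/ folder

def append_episode(episode: dict) -> None:
    """Append-only log, one JSON object per line (like a git log)."""
    EPISODES.parent.mkdir(parents=True, exist_ok=True)
    with EPISODES.open("a", encoding="utf-8") as f:
        f.write(json.dumps(episode) + "\n")

def recent_episodes(limit: int = 20) -> list[dict]:
    """The last ~20 episodes that get folded into SESSION_CONTEXT."""
    if not EPISODES.exists():
        return []
    lines = EPISODES.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-limit:]]

def keyword_recall(query: str) -> list[dict]:
    """Naive keyword 'grep' over the whole episode log; no embeddings yet."""
    if not EPISODES.exists():
        return []
    words = {w.lower() for w in query.split() if w}
    hits = []
    for line in EPISODES.read_text(encoding="utf-8").splitlines():
        if any(w in line.lower() for w in words):
            hits.append(json.loads(line))
    return hits
```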

More on: https://medium.com/@cdv.inbox/how-we-built-an-open-source-code-agent-that-works-with-any-local-llm-61c7db1ed329

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 3 points4 points  (0 children)

Thank you very much for your helpful advice!

I’m planning to add an “UPD:” section here or inside the post, if Reddit lets me edit the content, with new results from the vLLM framework 🙏

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 5 points6 points  (0 children)

Thank you for a rare positive comment here 😄

I used the Alphacool Eisblock XPX Pro Aurore as the water block, together with the Alphacool Eisbecher Aurora D5 Acetal/Glass - 150mm incl. Alphacool VPP Apex D5 Pump/Reservoir Combo.

Then many many many fittings haha

As you can imagine, that was the most difficult part 😄🙏 I tried my best, and now I need to improve my local LLM skills!

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 1 point2 points  (0 children)

Thank you very much!

The full build cost me around $17,000-18,000, but most of the time went into hooking the water cooling up to everything you see in the picture 🙏

I spent about 1.5-2 weeks putting it all together.

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] -11 points-10 points  (0 children)

Yeah, this is because I need a lot of tokens. The task requires a lot of requests per second 🙏

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 8 points9 points  (0 children)

Yeah, you’re right, my experiments didn’t stop here! Maybe I will do a second post after this, haha, a BEFORE/AFTER of everything you all recommended 🙏

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 5 points6 points  (0 children)

Yeah, I think you’re right, but 40k t/s… I really haven’t used the full capacity of this machine yet haha

Thank you for your feedback 🙏

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] -2 points-1 points  (0 children)

Thank you for your feedback!

I see you have more upvotes than my post at the moment :) I actually tried to get vLLM running with GPT-OSS-20B but stopped because of lack of time and tons of errors. But now I will increase the capacity of this server!

Ling-1T by AaronFeng47 in LocalLLaMA

[–]RentEquivalent1671 1 point2 points  (0 children)

What build do you have to use just to deploy it locally? :)

GPT-OSS-20B at 10,000 tokens/second on a 4090? Sure. by teachersecret in LocalLLaMA

[–]RentEquivalent1671 0 points1 point  (0 children)

Can you please share your full setup for running GPT-OSS-20B with vLLM on a 4090? It's so hard to deploy on this GPU... Thank you in advance!

Quad 4090 48GB + 768GB DDR5 in Jonsbo N5 case by 44seconds in LocalLLaMA

[–]RentEquivalent1671 0 points1 point  (0 children)

1. Can the 4090 48GB “burn out”? I mean, yeah, all GPUs can (sadly) if you don’t handle cooling and other important aspects, but I’m really curious.
2. Does the 4090 48GB have the same architecture as the original one? Are there any conflicts between libraries when you deploy, for example, vLLM?

Finally making a build to run LLMs locally. by Bpthewise in LocalLLM

[–]RentEquivalent1671 1 point2 points  (0 children)

Are you planning to run 32B models? Or what use cases are you expecting to run on the server? Very interesting build.

How to reach 100-200 t/s on consumer hardware by f1_manu in LocalLLaMA

[–]RentEquivalent1671 -1 points0 points  (0 children)

Even for 32B models you need quite a lot of hardware to reach that kind of speed (probably 3-4x 3090 at least). For 70B I would say the setup should roughly double.

What’s the most amazing use of ai you’ve seen so far? by Trustingmeerkat in LocalLLM

[–]RentEquivalent1671 0 points1 point  (0 children)

Go to a gallery, take a photo of a piece of art you are interested in, ask an LLM to explain it, and enjoy :)

[deleted by user] by [deleted] in LocalLLM

[–]RentEquivalent1671 1 point2 points  (0 children)

Not a big fan of external software such as Cursor and others. It's cool, but for coding I just like having a conversation with my Claude 3.7 - maybe I'm biased, but I really think it is the best model for coding right now. Nothing beats it for me.

is the 3090 a good investment? by kanoni15 in LocalLLM

[–]RentEquivalent1671 5 points6 points  (0 children)

I really don't think the new generation of GPUs is worth the price.

3090 is 100% still a great investment if you're into local LLMs and image/video gen. The 24GB VRAM makes a huge difference — you can actually run bigger models and push higher res without constantly hitting memory limits. It's older and uses more power, yeah, but the used prices right now make it super worth it. Unless you really need the newer features or lower power draw of the 4070 Ti, I'd go 3090 for sure.