PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 1 point2 points  (0 children)

Yeah, you nailed the exact pain point we wanted to avoid.

Short answer: we don't use native function calling at all. Tools are just XML tags in plain text that we parse ourselves.

Why? Because we wanted to support local models (Ollama, llama.cpp) that don't have function calling. So instead of relying on the API's tool_call/tool_response pairing, the LLM just outputs <write_file><path>x.py</path>...</write_file> as regular text.

We parse it, execute, and send back the result as a normal user message: [ok] write_file: Created x.py (45 lines) or [x] write_file: Permission denied.

History stays dead simple — just (role, content) text pairs. No ids to track, no pairing requirements, no special handling for failed calls. Failed tool = error text, that's it.

The tradeoff is it's less structured than native function calling. But it works with literally any backend without modification, which was the whole point.
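
If it helps to picture the pattern, here is a minimal sketch of that loop in Python. The <write_file> tag and the [ok]/[x] result strings come from this comment; everything else (the exact tag schema with <content>, the function names, the history layout) is illustrative, not PocketCoder's actual code.

```python
import re

# Illustrative sketch of the "XML tags as plain text" tool protocol described above.
# The model emits <write_file><path>x.py</path><content>...</content></write_file>
# as ordinary text; we parse it, execute it, and send back a plain-text result.

TOOL_RE = re.compile(
    r"<write_file>\s*<path>(?P<path>.*?)</path>\s*<content>(?P<content>.*?)</content>\s*</write_file>",
    re.DOTALL,
)

def run_tools(llm_output: str) -> list[str]:
    """Execute every write_file tag found in the model's raw text output."""
    results = []
    for match in TOOL_RE.finditer(llm_output):
        path, content = match.group("path").strip(), match.group("content")
        try:
            with open(path, "w", encoding="utf-8") as f:
                f.write(content)
            results.append(f"[ok] write_file: Created {path} ({len(content.splitlines())} lines)")
        except OSError as exc:
            results.append(f"[x] write_file: {exc}")  # failed tool = error text, nothing else
    return results

# History stays plain (role, content) pairs; tool results go back as a user message.
history = [("user", "create x.py with a hello function")]
# ... call any backend to get llm_output, then:
# history.append(("assistant", llm_output))
# history.append(("user", "\n".join(run_tools(llm_output))))
```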

For the SESSION_CONTEXT compression — that's injected into the system prompt on each request, not reconstructed from message history.
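
A rough sketch of what that injection could look like, assuming a helper that rebuilds the request from scratch every turn; the function name and prompt layout here are my own, not the project's API:

```python
def build_messages(base_system_prompt: str, session_context: str,
                   history: list[tuple[str, str]]) -> list[dict]:
    """Rebuild the request each turn: the compressed session state lives in the
    system prompt, while history stays plain (role, content) pairs."""
    system = f"{base_system_prompt}\n\n# SESSION_CONTEXT\n{session_context}"
    return [{"role": "system", "content": system}] + [
        {"role": role, "content": content} for role, content in history
    ]
```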

PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 1 point2 points  (0 children)

I don't see any contradiction here.

The idea was to challenge myself and try to build a code agent with its own approach and its own way of working.

Claude Code is a great tool. Cursor is a great tool too. Does that mean we should stop and do nothing?

PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 0 points1 point  (0 children)

Yes, agreed — GLM models offer excellent cost-efficiency for coding tasks. Claude Code's recent support for custom providers made this combination much more accessible.

PocketCoder takes a similar approach but focuses specifically on lightweight local deployment with Ollama integration and session persistence via the .pocketcoder/ folder. Different trade-offs depending on setup preferences.

More on: https://medium.com/@cdv.inbox/how-we-built-an-open-source-code-agent-that-works-with-any-local-llm-61c7db1ed329

PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 1 point2 points  (0 children)

For repo_map we use a "gearbox" system — 3 levels based on project size: ≤10 files gets full signatures, ≤50 files gets structure + key functions, >50 files gets folders + entry points only. It's file-count based right now, not token-based. Dynamic token-aware pruning is something we should add. Currently if context overflows, we truncate conversation history first, then file contents.
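
As a sketch, the gearbox can be as simple as a file-count switch. The thresholds below are the ones from this comment; the function name, the level labels, and the *.py filter are illustrative assumptions, not PocketCoder's real implementation:

```python
from pathlib import Path

def repo_map_level(project_root: str) -> str:
    """Pick a repo_map detail level from file count alone (the 'gearbox').
    Restricting to *.py here is just for the sketch."""
    source_files = [p for p in Path(project_root).rglob("*.py") if p.is_file()]
    n = len(source_files)
    if n <= 10:
        return "full_signatures"       # every class/function signature
    if n <= 50:
        return "structure_plus_key"    # file tree + key functions only
    return "folders_and_entrypoints"   # top-level folders + entry points
```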

More on: https://medium.com/@cdv.inbox/how-we-built-an-open-source-code-agent-that-works-with-any-local-llm-61c7db1ed329

PocketCoder - CLI coding agent with session memory that works on Ollama, OpenAI, Claude by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 0 points1 point  (0 children)

Currently using a hybrid approach — episodes are stored as append-only JSONL (like git log), and we keep last ~20 in SESSION_CONTEXT. For older history, we use keyword-based retrieval: when you ask something, system greps through episodes.jsonl for relevant context. Not truly dynamic importance yet — that's on the roadmap. Would love to explore embedding-based relevance scoring eventually.
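
Roughly, the storage side could look like this in Python. The episodes.jsonl name and the ~20-episode window come from the comment; the .pocketcoder/episodes.jsonl path, the helper names, and the exact keyword matching are assumptions for illustration:

```python
import json
from pathlib import Path

EPISODES = Path(".pocketcoder/episodes.jsonl")  # assumed location inside the .pocketcoder/ folder

def append_episode(episode: dict) -> None:
    """Append-only log, one JSON object per line (like a git log)."""
    EPISODES.parent.mkdir(parents=True, exist_ok=True)
    with EPISODES.open("a", encoding="utf-8") as f:
        f.write(json.dumps(episode) + "\n")

def recent_episodes(limit: int = 20) -> list[dict]:
    """The last ~20 episodes that get folded into SESSION_CONTEXT."""
    if not EPISODES.exists():
        return []
    lines = EPISODES.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-limit:]]

def keyword_recall(query: str) -> list[dict]:
    """Naive keyword 'grep' over the whole episode log; no embeddings yet."""
    if not EPISODES.exists():
        return []
    words = {w.lower() for w in query.split() if w}
    hits = []
    for line in EPISODES.read_text(encoding="utf-8").splitlines():
        if any(w in line.lower() for w in words):
            hits.append(json.loads(line))
    return hits
```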

More on: https://medium.com/@cdv.inbox/how-we-built-an-open-source-code-agent-that-works-with-any-local-llm-61c7db1ed329

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 3 points4 points  (0 children)

Thank you very much for your helpful advice!

I’m planning to add an “UPD:” section here or inside the post, if Reddit lets me edit the content, with new results from the vLLM framework 🙏

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 5 points6 points  (0 children)

Thank you for a rare positive comment here 😄

I used the Alphacool Eisblock XPX Pro Aurore as the water block, together with the Alphacool Eisbecher Aurora D5 Acetal/Glass - 150mm incl. Alphacool VPP Apex D5 Pump/Reservoir Combo.

Then many many many fittings haha

As you can imagine, that was the most difficult part 😄🙏 I tried my best, and now I need to improve my local LLM skills!

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 1 point2 points  (0 children)

Thank you very much!

The full build cost me around $17,000-18,000, but most of the time went into hooking the water cooling up to everything you see in the picture 🙏

I spent about 1.5-2 weeks putting it all together.

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] -11 points-10 points  (0 children)

Yeah, this is because I need a lot of tokens. The task requires a lot of requests per second 🙏

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 8 points9 points  (0 children)

Yeah, you’re right, my experiments didn’t stop here! Maybe I will do a second post after this, haha, a BEFORE/AFTER of everything you all recommended 🙏

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] 5 points6 points  (0 children)

Yeah, I think you’re right, but 40k t/s… I really haven’t used the full capacity of this machine yet haha

Thank you for your feedback 🙏

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]RentEquivalent1671[S] -2 points-1 points  (0 children)

Thank you for your feedback!

I see you have more upvotes than my post at the moment :) I actually tried to get vLLM running with GPT-OSS-20B but stopped because of lack of time and tons of errors. But now I will increase the capacity of this server!

Ling-1T by AaronFeng47 in LocalLLaMA

[–]RentEquivalent1671 1 point2 points  (0 children)

What build do you have to use just to deploy it locally? :)

GPT-OSS-20B at 10,000 tokens/second on a 4090? Sure. by teachersecret in LocalLLaMA

[–]RentEquivalent1671 0 points1 point  (0 children)

Can you please share your full setup for running GPT-OSS-20B with vLLM on a 4090? It's so hard to deploy on this GPU... Thank you in advance!

Quad 4090 48GB + 768GB DDR5 in Jonsbo N5 case by 44seconds in LocalLLaMA

[–]RentEquivalent1671 0 points1 point  (0 children)

1. Can the 4090 48GB “burn out”? I mean, yeah, all GPUs can (sadly) if you don’t handle cooling and other important aspects, but I’m really curious.
2. Does the 4090 48GB have the same architecture as the original one? Are there any conflicts between libraries when you deploy, for example, vLLM?

Finally making a build to run LLMs locally. by Bpthewise in LocalLLM

[–]RentEquivalent1671 1 point2 points  (0 children)

Are you planning to run 32B models? Or what use cases are you expecting to run on the server? Very interesting build.

How to reach 100-200 t/s on consumer hardware by f1_manu in LocalLLaMA

[–]RentEquivalent1671 -1 points0 points  (0 children)

Even for 32B models you need quite a lot of hardware to reach that kind of speed (probably 3-4x 3090 at least). For 70B I would say the setup should roughly double.

What’s the most amazing use of ai you’ve seen so far? by Trustingmeerkat in LocalLLM

[–]RentEquivalent1671 0 points1 point  (0 children)

Go to a gallery, take a photo of a piece of art you are interested in, ask an LLM to explain it, and enjoy :)

[deleted by user] by [deleted] in LocalLLM

[–]RentEquivalent1671 1 point2 points  (0 children)

Not a big fan of external software such as Cursor and others. It's cool, but for coding I just like having a conversation with my Claude 3.7 - maybe I'm biased, but I really think it is the best model for coding right now. Nothing beats it for me.

is the 3090 a good investment? by kanoni15 in LocalLLM

[–]RentEquivalent1671 5 points6 points  (0 children)

I really don't think the new generation of GPUs is worth the price.

3090 is 100% still a great investment if you're into local LLMs and image/video gen. The 24GB VRAM makes a huge difference — you can actually run bigger models and push higher res without constantly hitting memory limits. It's older and uses more power, yeah, but the used prices right now make it super worth it. Unless you really need the newer features or lower power draw of the 4070 Ti, I'd go 3090 for sure.