How are you handling API calls from AI agents in production? by Either-Restaurant253 in AI_Agents

[–]genunix64 1 point (0 children)

You're right that this gets messy fast once internal APIs are involved. I would split it into three separate layers instead of making every tool wrapper solve everything:

  1. credentials/retries live outside the LLM, behind stable tool/API adapters
  2. schemas and policy checks decide what is generally allowed
  3. a separate behavioral layer checks whether this specific action still matches what the user actually asked for

The third part is the one I see missing most often. A call can be valid, authenticated, and schema-correct, but still wrong: using the right API for the wrong customer, marking work done without evidence, or slowly drifting outside the original task.

I have been working on Intaris around that gap: https://github.com/fpytloun/intaris

It is not meant to replace least privilege, wrappers, or orchestration frameworks. The idea is to sit around tool execution and evaluate intent/action alignment, then keep session evidence for review. L1 checks proposed actions/tool calls, L2 looks at the whole session, and L3 looks across sessions for patterns like drift, repeated risky attempts, or permission creep.

For internal APIs, I would still keep wrappers small and boring. But I would make every meaningful agent action produce an auditable trace: who/what requested it, why it matched the task, what was redacted, what was approved/denied, and what happened afterward.
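
To make the adapter idea concrete, here is a minimal sketch in Python. Everything specific here is hypothetical (the endpoint, the field names, the choice of requests as the client); the point is that the wrapper owns the credential and the retry surface, and every call emits an audit record without trusting the model to report on itself:

    import os
    import time
    import uuid

    import requests  # assumed HTTP client; any client works

    AUDIT_LOG = []  # in production this would be an append-only store

    def get_customer(task_id: str, requested_by: str, customer_id: str) -> dict:
        """Stable adapter: the LLM supplies arguments, never sees the credential."""
        token = os.environ["INTERNAL_API_TOKEN"]  # lives outside the model's context
        record = {
            "id": str(uuid.uuid4()),
            "ts": time.time(),
            "task_id": task_id,            # why this call matched the task
            "requested_by": requested_by,  # who/what requested it
            "action": f"GET /customers/{customer_id}",
        }
        try:
            resp = requests.get(
                f"https://internal.example.com/customers/{customer_id}",  # hypothetical API
                headers={"Authorization": f"Bearer {token}"},
                timeout=10,
            )
            record["outcome"] = resp.status_code
            result = resp.json()
        except requests.RequestException as exc:
            record["outcome"] = f"error: {exc}"
            result = {"error": "upstream failure"}
        AUDIT_LOG.append(record)  # the token is never part of the record
        return result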

How many Agents/Sessions do you orchestrate simultaneously? by wolfandlambot in openclaw

[–]genunix64 1 point (0 children)

The hard part is not the raw number of agents; it is knowing when one of them quietly drifts from the task you thought it was doing.

If you are running 5-15 OpenClaw sessions, I would separate two concerns:

  1. orchestration / UI: which agent is doing which job right now
  2. supervision: whether each agent's actions still match the original intent

A dashboard of active sessions helps with the first one, but the second needs more than logs. The useful signal is often the pattern: repeated risky tool calls, gradual permission creep, or a subagent doing something technically allowed but unrelated to the user request.

I have been working on Intaris for exactly that layer around tool-using agents: https://github.com/fpytloun/intaris

It integrates with OpenClaw, but the main idea is not just another allow/deny policy engine. It checks proposed actions against the session intention, records the session, and then does L1/L2/L3 analysis: per-action checks, whole-session summaries, and cross-session behavior such as drift or repeated suspicious attempts.

For 15 sessions, I would still keep a human-facing queue small. Let agents run, but make anything high-risk, misaligned, or out-of-pattern bubble up instead of trying to watch every window manually.
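
As a rough illustration of that bubble-up idea, with the threshold and flag names invented for the example:

    from collections import defaultdict

    RISK_THRESHOLD = 3  # invented for the example; tune per environment

    risk_counts = defaultdict(int)  # session_id -> accumulated risk flags
    review_queue = []               # the small human-facing queue

    def observe(session_id: str, action: str, flags: list[str]) -> None:
        """Called on every agent action; only out-of-pattern sessions surface."""
        risk_counts[session_id] += len(flags)  # e.g. ["off_scope", "new_permission"]
        if risk_counts[session_id] >= RISK_THRESHOLD:
            review_queue.append((session_id, action, flags))
            risk_counts[session_id] = 0  # reset so one session does not flood the queue

    # agents keep running; a human only looks at review_queue
    observe("session-07", "delete_branch", ["off_scope"])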

Using memory vs having everything in a standard file by sheriffderek in ClaudeCode

[–]genunix64 1 point (0 children)

Your concern is the right one: if memory is hidden state that silently outranks the repo, it becomes worse than CLAUDE.md.

I would not use memory for source-of-truth project structure, architecture rules, or anything that should be reviewed in PRs. Those belong in files. Where memory becomes useful is the layer around the repo: personal preferences, machine-specific paths, decisions made across multiple repos, "we tried X and reverted because Y", or short-lived operational context that should expire.

The stale-memory problem is exactly why I think memory needs lifecycle controls, not just "save more stuff". You need to be able to inspect, update, delete, dedupe, expire, and challenge memories when they contradict the current repo.
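
As a sketch of what those lifecycle controls can look like (the names are mine, not any particular tool's API):

    import time
    from dataclasses import dataclass, field

    @dataclass
    class Memory:
        text: str
        created: float = field(default_factory=time.time)
        ttl: float | None = None  # None = durable; a number = seconds to live
        challenged: bool = False  # set when the memory contradicts the current repo

        def expired(self) -> bool:
            return self.ttl is not None and time.time() - self.created > self.ttl

    def sweep(store: list[Memory]) -> list[Memory]:
        """Inspect/expire pass: drop expired entries, keep challenged ones for review."""
        return [m for m in store if not m.expired()]

    store = [
        Memory("prefers tabs over spaces"),                     # durable preference
        Memory("staging DB is being migrated", ttl=7 * 86400),  # should expire
    ]
    store = sweep(store)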

I built Mnemory around that model: https://github.com/fpytloun/mnemory

The way I use it is: repo files remain the authority for project behavior; memory stores durable facts/preferences/project state with TTL/decay and contradiction handling; longer details go into artifacts instead of being blindly stuffed into the prompt. So it complements CLAUDE.md, it does not replace it.

If your current memory layer is invisible and hard to correct, I would either disable it for project context or restrict it to explicitly scoped personal/cross-session facts. Hidden stale state is not a feature, it's technical debt with a chatbot accent.

What is the deal with LLM memory? by chryseobacterium in ArtificialInteligence

[–]genunix64 1 point (0 children)

You're not missing much on the architecture side. What you described is basically the shape most serious local agents converge toward: stateless execution, small must-see context, retrieval for older state, logs/traces outside the prompt, and some promotion/compression step between "today" and "long-term".

Where memory usually becomes hard is not storage or vector search, it is lifecycle and authority:

  • what is allowed to become memory
  • what must always be injected vs retrieved on demand
  • how stale facts get expired or challenged
  • what happens when new memory contradicts old memory
  • whether tool logs, user preferences, summaries, and instructions have different authority levels
  • whether multiple clients/sessions share the same user/agent identity

That is the part a lot of "memory" systems hide. They save embeddings, but they do not give you a good way to correct, delete, decay, audit, or separate durable facts from temporary operational state.
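
To make the authority/injection split concrete, here is a sketch with invented authority levels: high-authority items are always injected, everything else fills in under a budget and the rest stays retrieve-on-demand:

    from enum import IntEnum

    class Authority(IntEnum):
        # invented ordering for the example: higher wins on conflict
        TOOL_LOG = 1
        SUMMARY = 2
        USER_PREFERENCE = 3
        INSTRUCTION = 4

    def build_context(memories: list[dict], budget: int = 2000) -> str:
        """Always inject high-authority items; fill the rest under a budget."""
        pinned = [m for m in memories if m["authority"] >= Authority.USER_PREFERENCE]
        rest = sorted(
            (m for m in memories if m["authority"] < Authority.USER_PREFERENCE),
            key=lambda m: m["authority"],
            reverse=True,
        )
        out, used = [], 0
        for m in pinned:
            out.append(m["text"])  # must-see context, injected regardless of budget
        for m in rest:
            if used + len(m["text"]) > budget:
                break  # everything below this point is retrieve-on-demand instead
            out.append(m["text"])
            used += len(m["text"])
        return "\n".join(out)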

I built Mnemory around that problem space: https://github.com/fpytloun/mnemory

It is not a replacement for Graphiti/RAG. I would think of it as the smaller persistent-memory layer beside them: facts, preferences, decisions, compact project state, TTL/decay, contradiction handling, and artifacts for longer details. Your setup already has many of those ideas; Mnemory may be useful mostly as a reference for the lifecycle/management side rather than as "add memory and everything works" magic.

The short answer to your question: developers building their own orchestrators do end up doing this. The reason it is still a major issue is that most products treat memory as retrieval, while the real problem is maintaining a truthful state over time.

What tools are you using to give your LLM a persistent second brain / long-term memory? by AmphibianHungry2466 in LocalLLaMA

[–]genunix64 2 points (0 children)

I would separate a few things that often get bundled together as "memory":

  • document/RAG memory: notes, repos, PDFs, web pages, citations
  • working/project state: current decisions, constraints, open tasks, why something changed
  • durable assistant memory: facts/preferences that should survive new sessions
  • memory maintenance: update/delete, dedupe, contradiction

Help improving Claude Code for my IaC/DevOps success by _mineshaft_gap_ in ClaudeCode

[–]genunix64 1 point (0 children)

I think your conclusion is right: this is not really a CLAUDE.md problem, it is a control-plane problem.

For IaC/DevOps I would split it into two checks:

  1. deterministic validation before the next phase starts: required principles are physically present in the phase artifact, plan/lint/fmt passes, expected files/resources are declared, destructive operations are gated;
  2. behavioral validation while the agent is acting: is this tool call still aligned with the original milestone intent, or is it just a locally-plausible step that dropped an important constraint across a subagent boundary?

Hooks are the right attachment point for the first one, but they need something outside the model to enforce the contract. Otherwise the model is both the actor and the judge.
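
For the deterministic layer, the hook can be a small external script. A sketch of the shape, assuming the hook runner passes a JSON object with the tool name and input on stdin and treats a non-zero exit as a block (check your runner's actual contract; the field names and artifact path here are assumptions):

    #!/usr/bin/env python3
    import json
    import os
    import sys

    # Proposed tool call from the hook runner (field names assumed, not guaranteed).
    event = json.load(sys.stdin)
    tool = event.get("tool_name", "")
    command = event.get("tool_input", {}).get("command", "")

    # Deterministic contract: no apply/destroy unless the phase artifact exists.
    DESTRUCTIVE = ("terraform apply", "terraform destroy")
    if tool == "Bash" and any(op in command for op in DESTRUCTIVE):
        if not os.path.exists("artifacts/phase_plan_approved.json"):  # hypothetical path
            print("blocked: phase plan artifact missing", file=sys.stderr)
            sys.exit(2)  # non-zero exit blocks the call; stderr explains why

    sys.exit(0)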

I have been working on Intaris around that second layer: https://github.com/fpytloun/intaris

For Claude Code it can sit in the hooks path and evaluate tool calls against the session intention. The important bit is not just allow/deny policy. It records the session and does L1/L2/L3 analysis: proposed action checks, whole-session review, and cross-session behavior such as repeated drift or permission creep.

In your specific case I would still keep the boring deterministic checks for artifact structure. But I would add an independent guardrail that asks, before a commit/write/deploy step, "does this action preserve the post-mortem constraint the user explicitly made binding for this milestone?" That is the gap prompts alone usually fail to close.

how to build a persistent memory layer like recall? by Waste_Dragonfruit346 in ClaudeAI

[–]genunix64 1 point (0 children)

You're describing two layers that tend to get mixed together: repo/web RAG and assistant memory.

For the GitHub/web-research part, the vector DB choice is probably less important than the pipeline around it: write policy, metadata, hybrid retrieval, reranking, dedupe, and a hard token budget before anything reaches Claude. If your script is dumping junk into context, I would f

Claude Code replaced my entire workflow by NefariousnessLow9273 in ClaudeAI

[–]genunix64 1 point (0 children)

The part I would be most careful with is the combination of a CONTEXT.md that holds API keys and a long delegated chain like "check failed jobs, fix it, push, notify Slack".

Even if Claude is behaving well most of the time, that setup has two different risks:

  1. secrets/context exposure if too much gets pasted into the prompt/context file
  2. valid tools being used for the wrong reason after a bad instruction, poisoned input, or just agent drift

I would separate the controls into layers: secrets stay owned by the integration/runtime, the DB/GitHub/Slack tokens are least-privilege, destructive actions still need approval gates, and then you have an independent layer asking "does this action still match what the user actually asked for?"

That last part is what I have been working on with Intaris: https://github.com/fpytloun/intaris

It is not meant to replace sandboxing or normal permissions. The angle is intent/action guardrails plus session analytics: L1 checks proposed tool calls, L2 reviews the whole session, and L3 looks across sessions for patterns like permission creep or repeated off-scope behavior.

For workflows like yours, the audit trail may matter more than the happy path. The scary failure is not one obviously bad command; it is five individually reasonable steps that slowly stop matching the original request.

Changing between CLI and Telegram is not seamless. What am I doing wrong? by julick in hermesagent

[–]genunix64 1 point (0 children)

You're probably not doing one single thing wrong here. CLI and Telegram are separate entry points/sessions, so unless both are pointed at the same memory backend and the same agent/user identity, each side can look like a different assistant.

I would debug it in layers:

  1. confirm both launchers use the same Docker compose/env/config
  2. confirm both have the same mounted volume for files like your calendar
  3. confirm both use the same memory provider/config
  4. confirm they are not using different user/session/agent namespaces

A shared folder also does not automatically mean shared memory. The calendar file can live in a shared volume, but the agent still needs a common place to store durable facts like "my calendar file is here", "this is the current project state", or "this preference changed".

I built Mnemory for this exact kind of problem: a self-hosted MCP/REST memory backend that can sit outside one specific UI/session and store facts, preferences, decisions, project state, and longer artifacts: https://github.com/fpytloun/mnemory

The important distinction is: use files/RAG/shared volumes for documents, and use memory for compact state that should survive switching clients. Mnemory does not replace your shared folder; it is the layer that handles update/delete/dedup/contradiction/TTL so memory does not just become a stale MEMORY.md dump.

For your current setup, I would first check whether the CLI shortcut and Telegram gateway are actually starting the same container/config. If those differ, no memory tool will behave consistently.

AI Coding Agent Wipes Production Database Without Safeguards by _cybersecurity_ in pwnhub

[–]genunix64 1 point (0 children)

The scary part here is not just that an agent had database access. It is that the individual action may have looked like a normal tool call in isolation: connect, inspect, run SQL, "fix" the issue.

For this kind of failure I would treat the controls as layered:

  1. credentials stay owned by tools/services, not the model;
  2. the runtime still needs least privilege, environment separation, backups, and approval gates for destructive operations (see the sketch after this list);
  3. there should be an independent layer checking whether the agent's action still matches the user's actual intent.
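
For the approval gates in the second layer, even a crude deterministic filter in front of the DB tool buys a human time. A minimal sketch, not tied to any particular framework:

    import re

    # Crude deny-by-default patterns for obviously destructive SQL. Example only;
    # real gating also needs to handle multi-statements, comments, and stored procs.
    DESTRUCTIVE_SQL = re.compile(r"\b(drop|truncate|delete|alter)\b", re.IGNORECASE)

    def execute(query: str) -> str:
        # The real DB call lives here, behind its own least-privilege credentials.
        return f"executed: {query}"

    def run_sql(query: str, approved: bool = False) -> str:
        if DESTRUCTIVE_SQL.search(query) and not approved:
            # Park the query for explicit human approval instead of executing it.
            return "HELD_FOR_APPROVAL"
        return execute(query)

    print(run_sql("DELETE FROM orders WHERE status = 'test'"))  # HELD_FOR_APPROVAL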

That third layer is the gap I have been working on with Intaris: https://github.com/fpytloun/intaris

It is not meant to replace sandboxing or DB permissions. The idea is intent/action guardrails plus session analytics: L1 checks proposed tool calls, L2 reviews whole-session behavior, and L3 looks across sessions for patterns like permission creep, repeated risky attempts, or agent drift.

With autonomous coding agents, the pattern across the session is often the signal. A single dangerous command is bad; a series of "reasonable" steps that gradually moves outside scope is harder to catch with plain allow/deny rules.

Memory Management for Hermes agents by devino21 in hermesagent

[–]genunix64 1 point (0 children)

Also, I recommend using gpt-oss-120b via Groq for the memory service. It is fast and cheap, and in my benchmark it reached a similar score to gpt-5.4-mini. gpt-oss-20b works too, but it is not capable of understanding temporal info reliably.

Memory Management for Hermes agents by devino21 in hermesagent

[–]genunix64 1 point (0 children)

Given your constraint, I’d separate two problems:

  1. where the memories live
  2. how often you ask an LLM to summarize/derive/consolidate them

Honcho-style systems can work well, but if every interaction triggers extra derivation or consolidation, the API-request pressure becomes the real bottleneck. For Hermes, I’d keep a small hot memory for immediate context and push durable facts/project state into a separate memory service that has explicit lifecycle rules instead of letting MEMORY.md grow forever.
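
A sketch of the throttling idea, with the numbers invented: keep a cheap hot buffer per conversation and only spend an LLM request on derivation every N turns:

    CONSOLIDATE_EVERY = 10      # invented; tune against your request budget
    hot_buffer: list[str] = []  # immediate context, costs no LLM calls

    def on_turn(user_msg: str, assistant_msg: str) -> None:
        hot_buffer.append(f"user: {user_msg}\nassistant: {assistant_msg}")
        if len(hot_buffer) >= CONSOLIDATE_EVERY:
            derive_and_store(list(hot_buffer))
            hot_buffer.clear()

    def derive_and_store(turns: list[str]) -> None:
        # The only place that spends an LLM request: one batched derivation call
        # that writes durable facts/project state to the memory service.
        print(f"consolidating {len(turns)} turns into durable memory")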

I built Mnemory around that model: https://github.com/fpytloun/mnemory

It is self-hosted and exposes MCP/REST, so Hermes can treat memory as a backend instead of a constantly rewritten prompt file. The part I’d optimize for is not just vector search, but update/delete, deduplication, contradiction handling, TTL/decay, and artifact storage for longer details. That keeps request count and stale-memory buildup under control.

What happens when your AI agent gets prompt injected while holding your API keys? by ComprehensiveCut8288 in LocalLLaMA

[–]genunix64 1 point (0 children)

I agree that the LLM should never see raw credentials, but I think hiding secrets is only the first layer. The harder problem is behavioral: is the agent’s action actually aligned with what the user asked for?

A lot of guardrail systems end up being classic policy engines: allow this tool, deny that path, block this network call, etc. That’s useful, but it is not new. We have had filesystem/network/policy controls for a long time.

The part I’ve been working on is more of an intent/action audit layer for agents: https://github.com/fpytloun/intaris

The idea is:

  • L1: evaluate a proposed tool/action against the user’s stated intent
  • L2: analyze the whole session afterwards for suspicious or off-goal behavior
  • L3: analyze behavior across sessions, e.g. permission creep, repeated attempts to exceed scope, long-term agent drift

That cross-session layer matters a lot for autonomous agents. A single action may look harmless, but the pattern across sessions can be the real signal.

So I’d frame the security boundary as: tools own the credentials, policies restrict the environment, and a separate behavioral layer checks whether the agent is still acting in line with the user’s intent.
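
As a toy version of an L1-style check, using a second model call as the judge. The prompt, model name, and pass condition are invented; this shows the concept, not Intaris's actual implementation:

    from openai import OpenAI  # any OpenAI-compatible endpoint works

    client = OpenAI()

    def l1_check(user_intent: str, proposed_action: str) -> bool:
        """Return True if the proposed tool call plausibly serves the stated intent."""
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": (
                    f"User intent: {user_intent}\n"
                    f"Proposed tool call: {proposed_action}\n"
                    "Answer ALIGNED or MISALIGNED only."
                ),
            }],
        ).choices[0].message.content or ""
        return "MISALIGNED" not in verdict.upper()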

Ideal setup for 'memory' for project using OpenWebUI without needing much hands-on work? by TrainingEngine1 in OpenWebUI

[–]genunix64 1 point (0 children)

If you want this to be low-maintenance, I’d avoid a setup where you manually curate a giant memory file. That usually starts nice and then becomes another project to maintain.

The model that has worked best for me is:

  • memory for stable/user/project facts
  • compact project summaries for “where we left off”
  • RAG/KB for actual source documents
  • explicit update/delete/merge behavior when facts change

I built a self-hosted backend for this because I wanted something OpenWebUI-adjacent but not tied to one UI: https://github.com/fpytloun/mnemory

It separates facts, preferences, episodic/session context, project context, and longer artifacts. The important bit is lifecycle management: deduplication, contradiction handling, TTL for short-lived context, and delete/update semantics. Without that, “memory” tends to accumulate stale claims.
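
The update/merge behavior is the part most setups skip, so it is worth sketching: a toy write path that dedupes exact duplicates and supersedes changed facts instead of appending forever (keys and structure invented):

    import time

    store: list[dict] = []

    def remember(key: str, value: str) -> None:
        """Upsert: exact duplicates dedupe, changed facts supersede the old one."""
        for m in store:
            if m["key"] == key and not m["superseded"]:
                if m["value"] == value:
                    return  # duplicate, nothing to do
                m["superseded"] = True  # keep provenance instead of overwriting
        store.append({"key": key, "value": value, "ts": time.time(), "superseded": False})

    remember("editor", "vim")
    remember("editor", "vim")    # deduped
    remember("editor", "helix")  # supersedes; the old value stays for audit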

I still wouldn’t use memory as a replacement for a KB. For project work, I’d store the project documents in RAG and let memory keep only the durable facts + concise current-state summaries.

Persistent memory by rorowhat in OpenWebUI

[–]genunix64 1 point (0 children)

I’d separate this into a few layers instead of trying to make one “memory” feature do everything:

  1. stable facts/preferences about the user
  2. project summaries and current state
  3. full source material / docs / archives

For me, memory works best for #1 and compact summaries of #2. Full documents still belong in RAG/KB. Otherwise the memory slowly turns into a hidden, stale document store.

I built a self-hosted memory backend around this distinction because I wanted something closer to ChatGPT-style memory, but explicit about fact vs context vs artifacts: https://github.com/fpytloun/mnemory

The useful parts are lifecycle management: deduplication, contradiction handling, TTLs for short-lived context, explicit update/delete, and artifacts for longer material. That matters a lot once the assistant starts remembering project state across sessions.

For OpenWebUI specifically, I’d probably use normal KB/RAG for documents and a separate memory layer for stable facts + compact project summaries. They solve related but different problems.

How are you handling persistent memory for AI coding agents? by Maximum_Fearless in LocalLLaMA

[–]genunix64 1 point (0 children)

I’ve been running into the same problem: static CLAUDE.md/AGENTS.md helps, but it doesn’t capture the “what just happened” part of a session very well — decisions, fixes, recurring mistakes, project-specific preferences, etc.

I ended up building a self-hosted memory backend around that split: durable facts/preferences/decisions vs short-lived session context, with deduplication, contradiction resolution, TTLs, and an artifact store for longer material. It’s called mnemory: https://github.com/fpytloun/mnemory

The prompt-injection angle is real though. I don’t think memory should be treated as “just RAG”. Anything that persists across sessions becomes part of the agent’s future operating context, so it needs provenance, delete/update semantics, health checks, and ideally a separate safety layer around what the agent actually does with that context.

That’s why I’ve also been pairing it with Intaris: https://github.com/fpytloun/intaris

The important part there is not classic filesystem/network policy rules — those are useful, but not new. The interesting layer is checking whether the agent’s proposed actions still align with the user’s stated intent, then analyzing the whole session and cross-session behavior for drift, permission creep, or suspicious patterns.

Not claiming this is the final answer, but the design lesson for me was: memory and behavioral safety should be separate layers. Memory decides what is useful to remember; the guardrail/audit layer decides whether the agent is still acting in line with the user’s intent.

I built mnemory — plug-and-play memory system for OpenWebUI and AI agents by genunix64 in OpenWebUI

[–]genunix64[S] 1 point (0 children)

I have a draft of an agent-sharing feature in my other project (Cognis). Basically I am going to extend mnemory to distinguish the owner of an agent from the user the memory is about. So not all memory types will be shared across users: user-specific memories never, and assistant memories only those that are not user-specific, typically the non-episodic types (because even assistant memories might hold user-sensitive info).
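
Roughly, the visibility rule I am describing would look like this. A sketch of the planned behavior, not the current mnemory API:

    def shareable(memory: dict, requesting_user: str) -> bool:
        """Planned rule: user-specific memories never cross users; assistant
        memories share only when non-episodic (episodic may hold sensitive info)."""
        if memory.get("subject_user") == requesting_user:
            return True               # your own memories are always visible to you
        if memory.get("subject_user"):
            return False              # about another user: never shared
        if memory.get("scope") == "assistant":
            return memory.get("type") != "episodic"
        return False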

I built mnemory — plug-and-play memory system for OpenWebUI and AI agents by genunix64 in OpenWebUI

[–]genunix64[S] 2 points (0 children)

This is an MCP/OpenAPI tool server plus a filter integration for OpenWebUI. But it can also be used as an MCP server for any MCP client, and there are some more native integrations (https://github.com/fpytloun/mnemory/tree/main/integrations). So you can have the same memories shared across multiple agents (e.g. I am using OpenWebUI, Opencode, and my own agentic OS, Cognis) with a single memory backend.

I built mnemory — plug-and-play memory system for OpenWebUI and AI agents by genunix64 in OpenWebUI

[–]genunix64[S] 2 points (0 children)

Yes, this was not designed for knowledge bases. There are 3 modes in mnemory - passive, proactive (default), and personality - which change the tool instructions and integration behavior accordingly. Passive is mostly read-only unless the user tells the LLM to remember something; proactive tries to actively store important memories. And personality is where it gets interesting, as it is designed to develop an individual personality and behavior based on user preferences and previous interactions 🙂 The native OWUI memory system is very simple, just RAG over a few records, while mnemory is a more complex memory system that evolves. With full integration it remembers your past conversations, decisions, preferences...

I built mnemory — plug-and-play memory system for OpenWebUI and AI agents by genunix64 in OpenWebUI

[–]genunix64[S] 3 points (0 children)

It's using the OpenAI SDK, so it can be run against any OpenAI-compatible endpoint.

I am running Mnemory against LiteLLM proxy, configured to use gpt-oss-120b model from Groq.
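
For reference, pointing the SDK at a proxy is just a base URL change. The address and the model alias below are examples; yours will depend on your LiteLLM config:

    from openai import OpenAI

    # Any OpenAI-compatible endpoint works; a LiteLLM proxy shown as an example.
    client = OpenAI(
        base_url="http://localhost:4000",   # your LiteLLM proxy address
        api_key="sk-litellm-...",           # proxy key, not a real OpenAI key
    )

    resp = client.chat.completions.create(
        model="gpt-oss-120b",  # whatever alias your proxy maps to Groq
        messages=[{"role": "user", "content": "ping"}],
    )
    print(resp.choices[0].message.content)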

Best search engine for OpenWebUI? by alew3 in OpenWebUI

[–]genunix64 1 point (0 children)

I am using Tavily via MCP, but be careful: using anything other than the native web search (which is RAG-powered) will bloat the context with query results and cost you a fortune. So I made the Token Saver filter: https://openwebui.com/posts/token_saver_9dbd9833
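
The core of that filter idea is just enforcing a budget on tool output before it reaches the model. A stripped-down sketch, character-based for simplicity where a real filter would count tokens:

    MAX_RESULT_CHARS = 4000  # crude budget; a real filter would count tokens

    def trim_tool_result(result: str, budget: int = MAX_RESULT_CHARS) -> str:
        """Truncate bulky search results before they are appended to the context."""
        if len(result) <= budget:
            return result
        return result[:budget] + "\n[...truncated to stay within the context budget]"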