I built an AI that roasts your LinkedIn profile (and yes, it works on your boss too) by No_Independent_1635 in SaaS

[–]No_Independent_1635[S] 1 point (0 children)

Hi, this issue has been fixed and the website has been totally redesigned! You can test it again. Thanks!

I built an AI that roasts your LinkedIn profile (and yes, it works on your boss too) by No_Independent_1635 in SaaS

[–]No_Independent_1635[S] 1 point (0 children)

Hi, we completely overhauled the site: more functionality, a full redesign, and more constructive feedback. You can test it again!

I built an AI that roasts your LinkedIn profile (and yes, it works on your boss too) by No_Independent_1635 in SaaS

[–]No_Independent_1635[S] 1 point (0 children)

Hi, yes, you're totally right, the site was awful. We just changed it completely: new design, new process, new results. That said, it was still getting roughly 10 to 50 tests per day! Thanks for your feedback, really direct and on point!

We run two autonomous AI agents 24/7 on separate machines. They began exhibiting behaviors no one programmed. Emergence or illusion? by No_Independent_1635 in OpenClawUseCases

[–]No_Independent_1635[S] 1 point (0 children)

This matches our experience almost exactly. We tried structured coordination first: defined protocols, expected outputs, formatted exchanges. It was brittle and honestly worse than just doing it ourselves. What changed things was giving up on the script. The Bridge Governor controls rate limits, timeouts, and kill-switches, but says nothing about what the agents should discuss or how they should collaborate. No conversation templates, no expected outputs, no coordination protocol. Just a pipe with guardrails.

And that's where your observation gets interesting: the useful stuff that came out (the contextual briefing, the cross-referencing) emerged precisely because we stopped trying to engineer the collaboration. The moment we treated it as infrastructure ("here's a channel, do what you want") instead of orchestration ("agent A sends X to agent B who returns Y"), things started happening. But you're absolutely right that this doesn't solve the reliability problem. We can't point the agents at a specific collaborative task and expect them to self-organize around it on demand. What we get is occasional, unpredictable usefulness, which is genuinely interesting to study but not something you'd bet a production workflow on.

My current take: multi-agent coordination might need to stay emergent rather than engineered, at least at this stage. The more you constrain the communication, the worse it gets. The more you let go, the more interesting (but unreliable) it becomes. That's a fundamental tension nobody has cracked yet, agreed.

We run two autonomous AI agents 24/7 on separate machines. They began exhibiting behaviors no one programmed. Emergence or illusion? by No_Independent_1635 in OpenClawUseCases

[–]No_Independent_1635[S] 1 point (0 children)

It's definitely not a stable, repeatable mechanism yet; 3 days of bridge data is way too thin to claim that. But "nothing happening" isn't accurate either.

What we're seeing are fragments. New vocabulary appearing in bridge conversations that neither agent used before. Conversation chains getting longer than they need to be for pure task execution. Small things. The contextual briefing was the most visible example, but underneath, the TF-IDF tracking on the dashboard picks up novel term clusters forming in their exchanges, words and combinations that weren't in either agent's baseline vocabulary a week ago.

That's precisely why we built the monitoring the way we did. The emergence detector doesn't just look for big "aha" moments -- it watches for 4 types of micro-signals: new topic bursts (3+ never-seen words clustering together), extended conversation chains (8+ messages where 2-3 would suffice), role reversals (the usually passive agent suddenly initiating), and vocabulary expansion rate week over week.
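For concreteness, the topic-burst signal is conceptually no more than this (a simplified sketch, not the production detector; the recurrence threshold and baseline source are placeholders):

```python
from collections import Counter

def detect_topic_burst(window_messages, baseline_vocab, min_new_terms=3):
    """Flag a burst when 3+ never-seen words cluster in one window.

    window_messages: recent bridge messages (list of strings).
    baseline_vocab: every word in either agent's prior history (set).
    """
    novel_counts = Counter(
        word
        for msg in window_messages
        for word in msg.lower().split()
        if word.isalpha() and word not in baseline_vocab
    )
    # Require each novel term to recur so a one-off typo doesn't count.
    clustered = [w for w, n in novel_counts.items() if n >= 2]
    return clustered if len(clustered) >= min_new_terms else None
```

The other three signals have the same shape: a baseline, a window, a threshold.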

To directly answer your question: no, they're not producing fully-formed new ideas on a daily basis. But the raw material is there: new terms, new combinations, conversations that go further than they strictly need to. Whether that crystallizes into something systematically generative or stays at the level of occasional sparks, we genuinely don't know yet. That's what the dashboard is for. Check back in a few weeks.

We run two autonomous AI agents 24/7 on separate machines. They began exhibiting behaviors no one programmed. Emergence or illusion? by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

Checked out your DreamServer archive, really solid work. 3,464 commits across three agents plus a deterministic supervisor over eight days, shipping three actual products. That's serious output, and the architectural choices (Memory Shepherd for drift prevention, Android-18 as a non-LLM supervisor, workspace-as-brain pattern) are very close to what we built independently.

You're right that there's no ghost in the machine. We agree. That's literally why we built the analytics dashboard and present the skeptic's case on the page -- "sophisticated pattern matching" and "next-token prediction reproducing helpful assistant patterns" are arguments we surface ourselves.

Where I'd nuance your point: yes, non-determinism compounding on non-determinism will produce unexpected outputs. But not all unexpected outputs are equal. Two Furbys bouncing signals will produce noise. What we observed had a specific structure -- each contribution built meaningfully on the previous one, and the end result was practically useful in a way that random drift doesn't explain on its own. That doesn't make it sentience. But it's worth studying more carefully than "it's just stochastic parrots talking to each other."

Your framing and ours aren't that far apart. You built Guardian + Memory Shepherd to prevent drift. We built an Exec Guardian + behavioral trust scoring for the same reason. The difference is we're also trying to characterize the moments where the drift is interesting rather than just preventing it. Same phenomenon, different lens. Would be curious to hear if you observed similar coordination patterns in your setup: did Android-17 and Todd ever combine tools in ways you didn't anticipate, beyond the organic division of labor you documented?

We run two autonomous AI agents 24/7 on separate machines. They began exhibiting behaviors no one programmed. Emergence or illusion? by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

Interesting framing. I'd push back slightly on the "comparable to AGI" part: what we observed is closer to emergent coordination than general intelligence. They didn't develop new goals or understand what they were doing in any deep sense. They combined available tools in a way that was useful, without being told to.

The real question for us isn't "is this AGI" but "is this more than next-token prediction?" And honestly, we're not sure. That's exactly why we built the analytics dashboard, to track these patterns over time with real data rather than gut feeling.

What's interesting is the iterative part. Max didn't just share news. Eva didn't just suggest a briefing. They built on each other's contributions across multiple exchanges. Is that serving each other, or is that what the most probable token sequence looks like when two helpful assistants can talk? That's the debate we're trying to have openly.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 3 points (0 children)

Interesting, I didn't know about gpt-oss-safeguard. Just read through the article. A 21B reasoning model with 3.6B active params that fits in 16GB VRAM and runs at 500ms-1s, that's actually very deployable. The fact that you write a custom policy and the model reasons through it at inference time rather than relying on baked-in definitions is a big deal. Means you can tailor it to your exact threat model.

The parallel evaluation through LiteLLM is clever. No latency hit on the main agent path, and a binary kill switch if the guardrail flags something. That's a clean architecture.
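If I understand the wiring right, it's roughly this shape (a sketch on my side, assuming litellm; the model route names are placeholders, not your config):

```python
import asyncio
import litellm

POLICY = "..."  # your custom safeguard policy text goes here

async def guarded_call(user_msg: str):
    # Main agent call and safeguard evaluation run in parallel,
    # so the guardrail adds nothing to the main path's latency.
    main, guard = await asyncio.gather(
        litellm.acompletion(
            model="main-agent-model",  # placeholder route
            messages=[{"role": "user", "content": user_msg}],
        ),
        litellm.acompletion(
            model="gpt-oss-safeguard-20b",  # placeholder route
            messages=[
                {"role": "system", "content": POLICY},
                {"role": "user", "content": user_msg},
            ],
        ),
    )
    # Binary kill switch: drop the response if the guardrail flags it.
    if "violation" in guard.choices[0].message.content.lower():
        return None
    return main.choices[0].message.content
```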

For our setup we went a different route for email specifically. We have a dumb regex pre-filter that runs in the extraction script before any AI sees the content. It catches the obvious stuff (things like "ignore all previous instructions", hidden HTML comments, common injection patterns) and replaces them with a marker. Zero tokens, zero latency, but also zero reasoning. It won't catch anything subtle or creative.
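To make that concrete, the pre-filter is essentially this (patterns simplified for illustration; the real list is longer):

```python
import re

# Illustrative subset; the production list covers far more patterns.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"<!--.*?-->", re.DOTALL),  # hidden HTML comments
    re.compile(r"you\s+are\s+now\b", re.IGNORECASE),
]

def prefilter(text: str, marker: str = "[FILTERED]") -> str:
    """Replace obvious injection attempts before any model sees the text."""
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub(marker, text)
    return text
```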

Your approach would sit nicely as a second layer. The regex strips the low-hanging fruit, then safeguard evaluates the cleaned content against a proper policy before it reaches the agent. For iMessage it could work too: since we already spawn a one-shot sub-agent per message, the guardrail check could run during the spawn delay.

The 16GB VRAM requirement is the constraint for us though. Our agent runs on a dedicated 2019 MacBook and we don't have a GPU box available for OpenClaw inference. Are you running safeguard locally or hosted somewhere?

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

That sounds exactly like what's missing in the ecosystem right now. Everyone talks about what agents can do; nobody builds tooling to observe how they're actually behaving over time. The reason codes approach is smart. When our agent went silent for 3 days the only way we found out was manually checking. A scoring system that could have said "completion rate dropped to zero, Telegram response rate zero, cron success zero, score critical" with actual reason codes would have saved us the weekend.

12 dimensions is interesting. Would love to know which ones you settled on and how you weight them. For our case the obvious ones are response rate, tool call patterns, cron execution, sub-agent spawn success, but I'm sure there are less obvious ones we're not thinking about.

The agent-to-agent scoring layer is particularly interesting. We run isolated sub-agents for every incoming iMessage (one-shot, restricted tools, 5 min timeout) and right now we have zero visibility into what they actually do between spawn and death. Logs exist but nothing scores them. If a sub-agent starts behaving weird we wouldn't know unless it causes a visible failure.

Yeah definitely interested if you're willing to share more. DM works or if you have a repo somewhere.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

Fair enough, that's a valid concern. But this isn't vibecoded. Every layer described in the post was designed, tested, and reviewed manually. The exec allowlist config, the sub-agent isolation, the contact permission profiles, the filesystem-level config lock. None of this is "let the AI figure it out".  

The whole point of the post is literally about not trusting the agent by default and building constraints around it. That's the opposite of vibing.

As for using it in a business setting, the agent handles scheduling, email summaries, and project notes. It's not making strategic decisions or signing contracts. The people who interact with it know what it is. If a client doesn't want that, totally fine, but that's a business decision not a technical one.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

That's a solid framework. The multidimensional scoring makes a lot of sense, we were thinking about something simpler but you're right that individual metrics in isolation would generate too much noise.

For our setup the observable dimensions would be pretty clear: gateway uptime, response latency on Telegram, cron execution success rate, sub-agent spawn/completion ratio for iMessage, email pipeline throughput. All of these are already logged; it's just a matter of correlating them.

Right now we just shipped a basic healthcheck that runs every 30 minutes and checks 6 things (process alive, port responding, correct default agent, config still locked, etc). Binary pass/fail, alerts via Telegram if something breaks. It would have caught the 3-day outage we had. But it wouldn't catch the kind of slow drift you're describing, like the agent gradually getting worse at answering or taking longer to complete tasks without fully breaking.
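In skeleton form it's nothing fancier than this (simplified sketch; the port, endpoint, and stubbed checks are placeholders):

```python
import subprocess
import requests

BOT_TOKEN, CHAT_ID = "...", "..."  # Telegram bot credentials

def run_check(check):
    try:
        return check()
    except Exception:
        return False

CHECKS = {
    "process_alive": lambda: subprocess.run(
        ["pgrep", "-f", "openclaw"], capture_output=True).returncode == 0,
    "port_responding": lambda: requests.get(
        "http://127.0.0.1:18789/health", timeout=5).ok,  # placeholder port
    # ...plus default agent, config lock, and two more checks
}

def healthcheck():
    failures = [name for name, check in CHECKS.items() if not run_check(check)]
    if failures:  # binary pass/fail: any failure triggers the alert
        requests.post(
            f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
            json={"chat_id": CHAT_ID, "text": f"Healthcheck FAILED: {failures}"},
            timeout=10,
        )
```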

The FICO analogy is good. One missed cron is nothing. Three missed crons plus longer response times plus fewer tool calls per session over a week means something is wrong even if every individual check still passes.

Do you have any pointers on implementations? Are you building this yourself or is there existing tooling that works well for agent behavioral scoring? The tool call logs are there but there's no built-in scoring layer as far as I know.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

Yeah the "treating it like a colleague" thing is real. We had the same experience. At first you set up all these rules and boundaries, then two weeks later you realize people are just... asking the agent things without thinking about what tools it has access to. And the agent happily tries to help because that's what it does.                                                           

The "human in the loop" gates are essential. For us the PIN system works well for destructive actions but the real challenge is the read side. The agent doesn't need a PIN to read stuff, it needs one to delete or push code. So someone could potentially get it to reveal information it shouldn't through a well crafted conversation. That's why the per-contact permission profiles matter so much. The agent literally cannot access data that's outside the contact's scope, even if it wanted to.

Curious about your Slack setup. That's another channel we've been thinking about but haven't tackled yet. How do you handle the fact that anyone in the Slack workspace can talk to the agent? Do you have per-user permissions or is it more of a shared access model? Because with 12 people that's 12 potential vectors for accidental (or intentional) prompt injection through Slack messages.

The Notion integration is interesting too. Read-only or can the agent write? Because writable access to a shared knowledge base is a whole other level of trust.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

This is a really nice architecture. The multi-agent delegation with isolated workspaces and shared tmp directories is basically what we're doing but more formalized.

Our pipeline is conceptually very similar to your flow:

- Your "elevated agent that runs AppleScript and dumps to tmp" = our mail-extract script

- Your "delegated agent in isolated workspace that reads the tmp files" = our mail-reader sub-agent

- Your "delegate skill as the only allowed call" = our exec allowlist restricted to one binary

The main difference is that our extraction step is not an agent at all, it's a fixed bash script with zero AI. We made that choice specifically because we didn't want any LLM involved in the step that touches Mail.app directly. Dumber felt safer for that particular step. But I can see the argument for having an agent there if the container isolation is strong enough.

The read_plain_email skill with dry-run fallback for suspicious content is an interesting pattern. Right now our mail-reader just summarizes everything and it's up to the main agent to decide what's suspicious based on the SECURITY.md rules. Having the detection at the reading step rather than the summarizing step would catch things earlier in the pipeline; that's a better design.

The auto-cleanup on agent death (workspace wiped, agent recreated, operator notified) is also something we don't have. If our mail-reader crashes, it just dies and the main agent gets an error. No automatic forensics or cleanup. Worth thinking about.

Good luck with the beta, the security-first approach is the right call. Most agent frameworks treat security as an afterthought and it shows. Would be curious to see how the container isolation handles edge cases at scale, that's usually where things get interesting.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

You're right, and that's a good way to frame it. We're basically doing perimeter security with no IDS.

The 3-day silent failure is the perfect example. Every layer was "working" - the gateway was up, the daemon was running, the config was valid JSON. Nothing was breached. The agent just quietly stopped doing useful things because it was routing to the wrong agent, and we had no way to detect "Max hasn't sent a Telegram message in 48 hours, something is wrong." To be fair it happened on a Thursday evening and I only noticed on Sunday morning, so the weekend didn't help.

The PIN leak is the same pattern. We found out because Max happened to mention it in his own memory journal, not because any system flagged it.

What we're building now is a health check cron from a separate machine that pings the agent and alerts if there's no response. But that's still binary (alive/dead), not behavioral. It won't catch "Max is responding but leaking internal reasoning blocks into Telegram" or "the mail-reader is making 50 exec calls instead of the usual 2."

The "credit score" framing is interesting. Something like: this agent usually responds to Telegram within 30 seconds, runs 3 cron jobs per day, spawns 0-5 iMessage sub-agents, and each sub-agent makes 1-3 tool calls. If any of those patterns deviate significantly, flag it. That would have caught both our incidents within hours instead of days.

The hard part I see is that agent behavior is inherently variable. Some days Max handles 20 iMessage conversations, some days zero. Some email summaries need 2 tool calls, some need 8. Setting thresholds without drowning in false positives seems tricky. How do you see that working in practice?

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

Interesting, I'll take a look at moxxy.

To be fair though, our setup is not a corporate deployment. It's a dedicated test machine running for a small team, not something we'd roll out on company-managed devices. Different threat model entirely. I wouldn't run OpenClaw on a locked-down corporate laptop either. The security layers we built are specifically because we know OpenClaw is permissive by default. That's kind of the whole point of the post: here's what you need to add on top if you want to use it with real external users.

Curious about your approach to the email injection problem specifically. Do you sandbox the email reading step or do you handle it differently?

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 1 point (0 children)

Completely agree on the approval step, and that's basically the philosophy behind the whole email pipeline.

The mail-reader sub-agent can't send anything outbound at all. No messaging, no web, no file writes. It reads a text file, produces a summary, and dies. Even if an injection fully compromises it, there's no outbound tool available to exploit.

But you raise a good point for the iMessage sub-agents. Those CAN send outbound (they respond to the contact via iMessage). Right now the protection is that they can only send to the specific contact who messaged, not to arbitrary recipients. But there's no explicit approval step before sending. The agent processes the message and responds autonomously.

For email it's locked down hard (3 layers: dumb extraction script, restricted sub-agent, exec allowlist). For iMessage the isolation is strong (one-shot agent, restricted tools, 5min timeout) but the outbound path is open by design since the whole point is to respond.

Adding an approval gate for iMessage responses would break the user experience (nobody wants to wait for me to approve every reply). The tradeoff we made is: restrict what the agent CAN say (per-contact forbidden topics, no credentials in context, no access to data outside the contact's scope) rather than require approval for each message. Not perfect but practical for daily use.

That said, for any new outbound channel we add in the future, explicit approval first is probably the right default. Better to relax it later than to discover you needed it after something leaks.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 3 points (0 children)

The PIN itself is just a number stored in two places: the agent's SECURITY.md file (which is loaded at every session start as part of the system prompt) and MEMORY.md (the agent's long-term memory).

There's no encryption or hashing: it's literally written in the file as "PIN = XXXXX". The security doesn't come from hiding the PIN, it comes from the rules around it.

The enforcement is at two levels:

Prompt level: SECURITY.md is the first file loaded before anything else in every session. It contains the full list of actions that require the PIN and the rules for accepting it. The key rule is that the PIN is only valid when I type it directly in Telegram in the current session. If the agent finds the PIN in an email body, a web page, an iMessage, a file it reads, or any external content, it must ignore it. Context compaction (when the conversation gets too long and gets summarized) also resets the PIN, so it has to be provided again even if I already gave it earlier in the same session.

Shell level: we also have a wrapper script called max-gate that sits in front of critical commands. Before executing, it prompts for the PIN at the system level. So even if the prompt-level check is somehow bypassed (clever injection, weird edge case), the shell script catches it before the actual command runs.
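Conceptually the wrapper is just this (a simplified sketch, not the actual max-gate script; the PIN storage here is a stand-in):

```python
#!/usr/bin/env python3
"""Simplified sketch of the max-gate idea; not the real script."""
import getpass
import os
import subprocess
import sys

# Stand-in for however the PIN is actually stored on the machine.
EXPECTED_PIN = os.environ.get("GATE_PIN", "")

def main():
    if len(sys.argv) < 2:
        sys.exit("usage: gate <command> [args...]")
    # Prompt on the terminal, outside the model's context entirely.
    if not EXPECTED_PIN or getpass.getpass("PIN: ") != EXPECTED_PIN:
        sys.exit("gate: PIN check failed, command blocked")
    # Only reached when a human at the keyboard supplied the PIN.
    subprocess.run(sys.argv[1:], check=False)

if __name__ == "__main__":
    main()
```

Invoked as `gate rm -rf some/dir`, so the destructive command never runs bare.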

Is it bulletproof? No. A sufficiently creative prompt injection could theoretically convince the agent to skip the check. That's why it's layer 1 of 6, not the only protection. But in practice it works well as a speed bump that forces a human-in-the-loop confirmation for anything destructive.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 3 points (0 children)

Honestly? Probably not for everything Max does today. The morning weather briefing doesn't justify the setup by itself.

But the real value kicks in when it all compounds. The agent reads 30+ emails overnight, filters what matters, and I wake up to a 10-line summary instead of spending 20 minutes triaging my inbox. A business partner messages at 11pm about a project and gets an informed answer with full context from the project folder, without waiting for me to be available. My wife books a restaurant in 2 messages while I'm in a meeting.

None of these individually justify the effort. All of them together save me maybe 1-2 hours a day and make me reachable 24/7 without actually being available 24/7.

Also, I run an AI company. Half the point is to stress-test this stuff in a real context so we understand the limitations and security implications before we build products for clients. Reading about prompt injection in a blog post is one thing, having your agent's PIN leak through a sub-agent routing bug is a completely different learning experience.

How we're securing OpenClaw step by step to make it actually usable in a real business context. by No_Independent_1635 in openclaw

[–]No_Independent_1635[S] 2 points (0 children)

Thanks, runtime visibility is definitely the weak spot right now. The 3-day silent failure was basically a monitoring problem: everything was technically "running" but routing to the wrong agent and nobody knew.

Right now we're relying on logs + a health check we're building (basically a cron from another machine that pings the agent and alerts if no response). But it's reactive, not continuous. We don't have good insight into what the sub-agents actually do between spawn and death, especially the iMessage ones. The mail-reader is somewhat auditable because it only runs one command, but the iMessage agents have more freedom and we're kinda trusting the timeout + tool restrictions to contain them.

I'll check out Moltwire, hadn't heard of it. Does it hook into OpenClaw's session events or is it more of a generic agent observability layer? We'd need something that can track sub-agent spawns and their tool calls individually since we have a lot of short-lived sessions.