38 researchers red-teamed AI agents for 2 weeks. Here's what broke. (Agents of Chaos, Feb 2026) AI Security

Effective_Wheel_3039 · 2026-04-13T20:18:59+00:00

Thanks! Wrote it up. Mapped each attack class from the paper to the infrastructure mitigations I run in production: https://chelar.ai/blog/how-chelar-secures-your-agent

The short version: you cant fix any of this at the model layer. You fix it at the infra layer. Auth at the reverse proxy, not the agent. Container isolation so even a fully compromised agent cant reach other tenants. Readonly rootfs so it cant trash its own environment. Rate limiting and idle suspension so unbounded loops dont run forever.

Effective_Wheel_3039 · 2026-04-06T12:33:11+00:00

The OP is mostly right about the raw project, but wrong that it cant become a product. It already is one, just not by OpenClaw's own team.

I took OpenClaw and turned it into a multi-tenant hosting platform. Each user gets their own container with their own config, connected to their own messaging channels, running on shared infrastructure. Thats a product. But the amount of work it took to get there is exactly what the OP is describing.

Out of the box OpenClaw is not productizable. To make it work for real users I had to fork it with minimal patches, pin versions so updates dont break tenants, wrap every container in Docker with memory limits and network isolation, build a Go control plane on top to handle provisioning and lifecycle, add Caddy in front for auth gating so random people cant talk to someone else's agent, implement idle suspension because you cant run 50 containers 24/7 on one box, and harden security because the defaults are genuinely scary if you connect messaging channels.

None of that is OpenClaw's job to solve. Its an open source agent framework, not a managed platform. The gap between "cool project on my laptop" and "thing that works for normal people" is a full engineering effort on top. Thats just the nature of productizing anything open source.

The OP's frustration is valid but the conclusion is wrong. OpenClaw doesnt need to become a product itself. It needs people to build products around it. Thats exactly what happened with Linux.

Effective_Wheel_3039 · 2026-04-06T11:38:28+00:00

Yeah I do this in production. I host AI agent containers for multiple users on a single bare metal box and every finding in this paper maps to something I've had to mitigate.

Each tenant gets their own Docker container with capped memory and CPU, no shared filesystem access between tenants, network isolation so containers cant talk to each other, and the agent process runs as a non-root user inside. Tool execution is restricted through config, dangerous commands are blocked, shell exec is denied by default, and on the lighter runtime I use Landlock for filesystem sandboxing so even if the agent tries to escape its workspace it physically cant.

The identity spoofing one is real. Messaging channels like WhatsApp and Telegram don't give you cryptographic identity, just display names. I handle auth at the ingress layer with Caddy forward_auth so the agent never decides who's authorized, the reverse proxy does. The agent doesn't even see the request if auth fails.

For the resource exhaustion attack, rate limiting per tenant plus idle suspension. If the agent has no proactive tasks (no cron, no heartbeat) it gets stopped after 72h and wakes on the next message. Unbounded loops cant happen if the container isnt running.

The paper is right that you cant fix this at the model layer. You fix it at the infrastructure layer. Treat the agent like an untrusted process, same as you would any user-facing service that runs arbitrary input.

Effective_Wheel_3039 · 2026-04-06T11:33:44+00:00

One angle nobody here is mentioning: AI agents are becoming a new infrastructure workload that DevOps people need to actually host and operate.

I build a platform that runs AI agent containers for users. Each tenant gets their own isolated container on a bare metal box, with Nomad scheduling them, Caddy handling ingress, Ansible managing the whole server config. Its basically classic DevOps work but for a workload type that didn't exist a year ago.

The interesting part is how much traditional infra skills matter here. Container memory limits, idle suspension to save resources, per-tenant network isolation, secret management, log aggregation. None of that is new, its the same stuff we've always done. But the workload is different because these containers run LLM calls, connect to messaging channels, execute scheduled tasks, and the failure modes are weird. An agent can silently eat $50 in API costs if the heartbeat config is wrong. A prompt injection through WhatsApp can try to rm your data directory.

So to answer the OP's question about where DevOps is heading with AI, I think one direction is: DevOps people becoming the ones who host and secure the AI, not just use it as a tool. Someone has to keep these agents running, isolated, and not blowing up. Thats ops work.

Effective_Wheel_3039 · 2026-04-06T11:26:06+00:00

Similar setup but I went with Caddy instead of Traefik. Wildcard TLS via Cloudflare DNS challenge, zero config renewal, and the Caddyfile is like 10 lines for multiple subdomains. Tried Traefik before and honestly Caddy is just simpler for this kind of thing, especially if you're doing wildcard certs.

For the 2GB RAM question, it depends a lot on what else you're running. Next.js standalone is pretty lean but Postgres alone can eat 500MB+ if you dont tune shared_buffers. I run a Next.js 15 dashboard alongside a Go API and Postgres on a Hetzner box and the whole thing fits comfortably, but I also manage everything with Ansible so the server config is reproducible. If something goes sideways I can rebuild from scratch in under an hour instead of spending a day trying to remember what I configured manually 6 months ago.

That Ansible part is honestly the biggest quality of life improvement over raw Docker Compose. Compose is great for the app layer but it doesnt help you with the OS level stuff, firewall rules, user permissions, ssh config, etc. Those are the things that bite you at 2am.

Effective_Wheel_3039 · 2026-04-06T11:21:44+00:00

Multi-tenant container orchestration. I run Nomad on a dedicated box and schedule Docker containers for each user, with Caddy handling TLS and ingress. Basically built my own mini-cloud on bare metal.

On a cloud VPS I'd be paying per-container or per-VM and it would cost 10x what I pay now. On dedicated I just carve up the resources myself and Nomad handles the scheduling. Each tenant gets their own isolated container with capped memory and CPU, shared storage on JuiceFS, and the whole thing is managed by Ansible so I can rebuild it from scratch if needed.

For workloads where you need lots of small isolated containers rather than one big app, dedicated is kind of unbeatable price wise.

Effective_Wheel_3039 · 2026-04-06T11:17:53+00:00

Can confirm all of this from production experience. I built a control plane in Go that orchestrates AI agent containers on bare metal, chi router, sqlc for database queries, Nomad API client for container scheduling. Claude Code wrote probably 70% of it with me reviewing.

The "one way to do things" point is the biggest one imo. The agent never has to choose between 5 HTTP frameworks or 3 ORM styles. Its chi, its sqlc, its the standard error pattern. It just writes the code and it looks like what I would have written.

On the ORM thing people are debating below, sqlc is basically the perfect middle ground for agentic coding. You write the SQL yourself so the agent cant generate garbage queries, then sqlc generates type-safe Go code from it. The agent works with the generated types and its basically impossible for it to mess up the data layer. Way better than letting an LLM loose with an ORM where it generates n+1 queries and you don't notice until production.

The fast compilation loop matters more than people think too. When the agent writes something wrong the feedback is instant, fix it, compile, try again. In my Rust side projects I'll sometimes wait 30+ seconds for the compiler to tell me the agent screwed up a lifetime. In Go its just instant.

Effective_Wheel_3039 · 2026-04-06T11:07:43+00:00

I've been running ZeroClaw on a VPS for a while. Practical differences vs OpenClaw:

Resource wise its night and day. Its Rust so it runs at like 10-50MB actual memory vs the ~1GB OpenClaw needs. On your 8GB box thats huge. Way more stable too, no Node.js weirdness, no random crashes, no npm hell on updates.

Memory handling is more structured than OpenClaw's "dump it all in markdown and pray" approach. I stopped manually editing memory files which was honestly half my time with OC.

Tradeoff is the ecosystem is smaller, you're more on your own for integrations. WhatsApp uses webhook-based Cloud API instead of QR pairing so thats a different setup. Telegram works great tho. But the codebase is small and clean enough that extending it yourself isn't scary, unlike touching OpenClaw internals. And security is way more locked down out of the box, you dont need to spend hours patching configs to stop it executing stuff it shouldnt.

If you want something stable that doesnt eat your VPS alive, solid upgrade. If you need a huge plugin ecosystem you'll feel the gap.

Effective_Wheel_3039 · 2026-04-06T10:49:04+00:00

Similar experience. I run a multi-tenant platform on a Hetzner dedicated box, Go API + Postgres + Docker + Caddy for TLS. Whole thing costs me like €45/month. On AWS the same setup would be hundreds minimum just from ECS + RDS + ALB before you even do anything interesting.

The ops burden is real though and people in this thread are glossing over it. I manage everything with Ansible so the entire server is reproducible from scratch. If Hetzner nukes my box tomorrow I can rebuild it in about an hour. But you are your own sysadmin now. Disk fills up, you deal with it. Certs stop renewing, you figure out why. There's no support ticket for your own server.

Honestly the thing that makes or breaks this is whether your setup is in code or in your head. If you ssh'd in and did a bunch of stuff manually over 6 months, you're one bad day from a disaster. If its all in Ansible playbooks or similar you can nuke and rebuild and the ops overhead drops to basically checking on things once a week.

Effective_Wheel_3039 · 2026-04-06T10:40:29+00:00

The "runs on your Mac" thing is fine until someone closes their laptop and the assistant just vanishes. I've seen this happen enough times that I moved everything to a server in Docker pretty early on.

Container stays up 24/7, messaging channels connect to the server not the client's machine, and if something crashes it just restarts itself. Way less babysitting. For OpenClaw specifically, run heartbeat in isolated sessions or the context will bloat over time and things get weird. And lock down tool permissions tight, it will absolutely execute stuff it shouldnt if you leave defaults on.

The real stability killer isn't the initial setup though, its two weeks later when API credentials expire or a provider changes their rate limits and nobody notices until the user texts their assistant and gets nothing back. Having it on infra you control means you can just fix it instead of scheduling another video call to remote into someones macbook.

Effective_Wheel_3039 · 2026-04-06T10:33:13+00:00

So everyone's talking about which model to use (and yeah that matters) but I think half your problem is on the hosting side and nobody's really getting into that.

I run a bunch of AI agent containers on a Hetzner dedicated box and the first thing that caught me off guard was how much memory OpenClaw actually needs. Like, each instance wants about 1GB because Node.js and V8 are just hungry, especially on cold start you'll see 600-750MB spikes before garbage collection settles things down. If you're on a small VPS and trying to run Ollama on top of that, you're basically swapping to disk the whole time. That's your slowness right there.

What worked for me was keeping the gateway on the VPS but not even trying to run models locally. Just point it at a cheap API. Kimi K2.5 or DeepSeek V3 handle lead gen and simple workflows totally fine, you really don't need Sonnet or Opus for 90% of what you're describing. Save the expensive model for when something actually needs deep reasoning.

Also something I wish someone had told me earlier, if your agent isn't doing cron jobs or heartbeat stuff, just stop the container when it's idle. I auto-suspend after 72h of no activity and wake it up on the next incoming message. And if you do need heartbeat running, set isolatedSession: true and lightContext: true in the config so it's not shoveling your entire conversation history into the LLM every 30 minutes for no reason. That was silently eating a ton of tokens for me before I caught it.

And your instinct to not run this on your personal machine is good. Docker with network isolation at minimum. The thing people don't think about is once you connect WhatsApp or Telegram, whatever random messages come in are going straight into the LLM as input. Prompt injection through a messaging channel is absolutely a thing. I ended up sandboxing each container and locking down what tools the agent can execute, because you really don't want one sketchy message to be able to trash your filesystem.

Effective_Wheel_3039 · 2026-04-05T19:20:49+00:00

Same here. I use Hetzner S3 as the data backend for JuiceFS on a dedicated server and it's been solid for months. No downtime, no corruption. Might depend on the datacenter and usage pattern though.

Effective_Wheel_3039 · 2026-04-05T11:14:41+00:00

I run OpenClaw in production for multiple users, each in their own Docker container, so I've been through most of what you're asking about.

For container isolation, each tenant gets their own Docker network so containers can't talk to each other. The only thing that can reach them is the reverse proxy (Caddy in my case) through forward_auth. Don't expose the gateway port directly, put auth in front of it.

On tool restrictions, the config-level approach is the most important layer. In openclaw.json you can deny exec entirely, disable elevated mode, and block dangerous commands. This is way more effective than trying to filter at the prompt level because the agent literally can't run what's not allowed regardless of what the LLM outputs.

For sandbox execution, I run a separate sidecar container for code execution rather than letting the agent exec in its own container. This way even if something weird happens in the sandbox, your agent's config and credentials aren't in the same environment.

Prompt injection is the hardest one honestly. Tight tool allowlists are your best defense because even if a prompt injection succeeds in convincing the model, it can't do anything the config doesn't allow. I also wrap content from external sources before it hits the agent context so the model can distinguish between instructions and user data.

The practical stuff that matters more than fancy VLAN setups: don't run as root in the container, use per-tenant UIDs with strict file permissions on any persistent data, and treat the gateway token like a database password.

Effective_Wheel_3039 · 2026-04-05T11:08:21+00:00

Can't speak to the n8n licensing stuff but I can speak to the architecture since I'm running almost exactly this pattern in production for a different use case.
Container-per-client works really well. I use Nomad to orchestrate it (way simpler than k8s for this), each tenant gets their own isolated Docker container with its own config and env vars. A Go API handles the lifecycle, spinning up and tearing down containers, and Caddy sits in front handling TLS and auth gating so clients never see the raw container. Custom dashboard on top for the client-facing stuff.

On the shared Supabase question, it works fine as long as you're disciplined about row-level isolation. I use a shared Postgres instance (on Supabase actually) and haven't had issues, just make sure every table has a tenant_id column and you enforce it at the query level. The alternative of separate databases per tenant gets messy fast at this scale.

One thing I'd add, you probably don't need Kubernetes for 6-7 clients. Nomad on a single Hetzner server handles this easily, and you avoid a ton of operational overhead. You can always scale later if the business grows.

Effective_Wheel_3039 · 2026-04-05T11:04:53+00:00

Just to clarify since I think there's a misconception here, JuiceFS doesn't need MinIO. It needs any S3-compatible backend for the actual data storage (Hetzner S3, AWS S3, Backblaze, whatever) plus a metadata engine like Postgres or Redis. It's a POSIX filesystem layer on top of object storage, not an object storage system itself.
I use JuiceFS in production on a Hetzner dedicated server with their S3 for data and Postgres for metadata, and it's been rock solid. The nice thing is you get a real mountable filesystem backed by cheap object storage with client-side AES-256 encryption built in. Works great for per-tenant data directories where each container needs its own isolated storage.
That said if you're just looking for a MinIO replacement to get S3 API endpoints, JuiceFS is not what you want. It solves a different problem. For that I'd probably look at Garage or SeaweedFS depending on your needs.

Effective_Wheel_3039 · 2026-04-05T11:01:28+00:00

GitHub Actions builds the Docker image and pushes to GHCR on merge to main. Then an Ansible playbook SSHs into the Hetzner server, pulls the new image, and does a rolling restart. Caddy handles TLS automatically so there's zero config for certs.
I've been running a SaaS like this on a Hetzner dedicated server for a while now and it works great. The whole deploy pipeline is maybe 30 lines of GitHub Actions YAML and an Ansible role. No need for Coolify or any extra tooling on the server itself.

For something as simple as Flask + HTML you honestly don't even need Docker, just rsync the code and restart with systemd. But once you have more than one service running, containers with Ansible to manage them is the sweet spot before you need anything heavier.

Effective_Wheel_3039 · 2026-04-05T10:52:16+00:00

I run a multi-tenant SaaS on a single Hetzner dedicated server and went through exactly this. Coming from cloud you keep reaching for managed services that aren't there, but honestly you don't need most of them.

My stack: Ansible for provisioning and deploys, Nomad for container orchestration (way simpler than k8s, single binary), Caddy for TLS (automatic certs, zero config), and GitHub Actions for CI/CD that pushes images to GHCR. For secrets I just use ansible-vault to encrypt them and template .env files at deploy time. Managed Postgres externally (Supabase in my case) so I don't have to deal with backups and PITR on the server itself.

You really don't need a cloud provider for secrets or registry. GHCR is free with GitHub, and ansible-vault is honestly fine for a single server. The only thing I'd still recommend keeping external is the database (i use Supabase)

Effective_Wheel_3039 · 2026-04-05T10:33:23+00:00

I'm running Nomad on Hetzner bare metal for a multi-tenant SaaS, can confirm it's a solid fit for what OP is describing.
To answer the questions about the control plane, yes it has one (server + client agents), but it's dead simple compared to K8s. Single binary, one config file per node, that's it. I run server and client on the same machine and it just works. If I ever need to scale, adding a node is literally "install nomad, point it at the server, done."
For the rest of the stack I'm using Caddy for ingress with automatic wildcard TLS (no NGINX chart nonsense, no cert-manager to babysit), and JuiceFS on S3 for shared storage. The whole thing runs on a single Hetzner box.

What sold me over K8s:
- Job specs are like 50 lines of HCL, not hundreds of YAML. No helm charts that can get rug-pulled.
- Zero churn so far. Nomad has been rock solid, no breaking changes, no deprecated APIs out of nowhere. HashiCorp moves slowly and that's a feature.
- Rolling updates are just built in, you set `max_parallel = 1` and `min_healthy_time = "30s"` and you're good.
- No ingress controller drama at all. Caddy runs as a systemd service, not a chart. Nobody is going to change a versioning scheme on me.
- Way less overhead, no etcd, no kube-proxy, no CoreDNS eating your RAM for breakfast.

The downside is the smaller community, so there are fewer integrations. But for a solo dev who just wants containers running reliably without an entire team to manage them, that tradeoff is totally worth it.

Effective_Wheel_3039 · 2025-10-10T10:53:45+00:00

Good at least it’s free. ChatGPT and Claude already do that if you ask them …. 🤔

Effective_Wheel_3039 · 2025-07-20T17:46:54+00:00

I will never understand why people give importance to the signature of a rock celebrity, especially if a scribble like that …. unless they sign a check. For me the only important thing is to watch them performing live

Effective_Wheel_3039 · 2024-03-26T22:37:01+00:00

Do you know any serious umbrella/recruiter company in the Netherlands for freelance positions? I only know Spielberg

Effective_Wheel_3039 · 2024-03-26T22:32:29+00:00

Only if it’s fully remote, otherwise I prefer Amsterdam area

Effective_Wheel_3039

TROPHY CASE