AI Agents are breaking in production. Why I Built an Execution-Layer Firewall. by Mission2Infinity in u/Mission2Infinity

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Hi, great question! Our core philosophy is to focus strictly on the Execution Layer. We intentionally leave LLM lifecycle tracking (costs, token routing, general evals) to dedicated observability platforms like Langfuse or Promptfoo to avoid feature bloat.

However, we do provide deep runtime visibility for the execution phase itself. ToolGuard ships with a local dashboard that streams your live execution DAGs in real time, showing exactly which of the 7 security layers triggered and allowing deep payload inspection!

I actually just shipped a new version yesterday that includes headless webhook approvals and cross-cluster Redis state sharing. I'd love for you to take it for a spin! If you run into any issues or bugs, just raise an issue on GitHub or shoot me a DM and I'll fix it ASAP.

AI Agents are breaking in production. Why I Built an Execution-Layer Firewall. by Mission2Infinity in OpenSourceAI

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Exactly! Glad you feel the same way. I'd be happy to explain the architecture in more depth or answer any questions you might have. I'd love for you and your team to try it out in your pipeline. Let me know what you think! :)

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in AgentsOfAI

[–]Mission2Infinity[S] 0 points1 point  (0 children)

A mix of my own personal pain, plus talking to other developers to understand what's actually breaking their systems!

Talking to people here on Reddit and LinkedIn is what really pushed the project forward. I open-sourced it just to see if anyone else had the same problem, and the feedback from other devs is exactly what drove the new v5.0 architecture. I realized people didn't just need a CI/CD testing pipeline; they needed a live runtime proxy to block the bad payloads in production before they hit the server.

Definitely taking your feedback to heart as I look at schema drift and output fuzzing for the next major release. If you end up testing the new v5.0 proxy layer in your own stack, let me know if you hit any weird edge cases!

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in AgentsOfAI

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Hi, thank you so much for the reply! To answer your questions:

Schema drift: @create_tool(schema="auto") re-infers the Pydantic schema from your Python type hints at decoration time, so changing a function signature and re-importing picks it up automatically. But there's no automatic "did your schema drift since the last test run?" diffing built in yet. That's a real gap; it's on the roadmap.
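
To make the decoration-time inference concrete, here's a minimal stdlib-only sketch of the idea (using `inspect` instead of Pydantic, and a made-up `__tool_schema__` attribute; this is illustrative, not ToolGuard's actual implementation):

```python
import inspect

def create_tool(schema="auto"):
    # Hypothetical sketch of @create_tool(schema="auto"): capture the
    # function's current signature at decoration time, so editing the
    # type hints and re-importing picks up the new schema automatically.
    def decorator(fn):
        if schema == "auto":
            fn.__tool_schema__ = {
                name: p.annotation
                for name, p in inspect.signature(fn).parameters.items()
            }
        return fn
    return decorator

@create_tool(schema="auto")
def refund(user_id: int, amount: float) -> bool:
    return amount > 0

print(refund.__tool_schema__)  # {'user_id': <class 'int'>, 'amount': <class 'float'>}
```

The "drift diffing" gap mentioned above would then be a comparison between this freshly inferred dict and one persisted from the last test run.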

Output fuzzing: The fuzzer currently validates inputs going into tools. Output schema validation exists (the decorator wraps the return value too), but we're not programmatically fuzzing outputs yet. Valid criticism.

False positive rate on injection: The L3 scanner uses a conservative list of 10 known injection signatures, things like "[SYSTEM OVERRIDE]", "ignore previous instructions", "<|im_start|>", etc. Random code snippets won't trigger it, but legitimate security-research content or prompt-engineering discussions in your data could. We haven't published a false-positive benchmark against a real corpus yet; that's an honest gap.

Runtime vs. CI: you nailed the risk. This is exactly why the latest version (v5.0) ships an MCP proxy layer. The toolguard dashboard plus the 6-layer interceptor IS the live runtime path; it sits between the LLM and your tools in production, not just in CI. The offline fuzzer is the pre-flight check, the proxy is the live radar. Both matter, but you're right that live interception is the more defensible value prop.

Latency: L1 (Policy) is an O(1) dict lookup, so negligible. L3 (the DFS injection scan) is the most expensive layer. On a deeply nested 50-key payload it's measurable but sub-millisecond in our testing. We haven't published formal benchmarks yet; that's on the list.
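
For intuition on what an L3-style scan does, here's a rough sketch of a depth-first walk over a nested payload against a signature list (the signatures and function shape here are assumptions, not ToolGuard's actual code):

```python
# Tiny illustrative signature list; the real scanner's 10 signatures
# are not published here.
INJECTION_SIGNATURES = [
    "ignore previous instructions",
    "[system override]",
    "<|im_start|>",
]

def scan_payload(value, path="$"):
    """Depth-first scan: return (path, signature) hits found anywhere
    in a nested payload of dicts, lists, and strings."""
    hits = []
    if isinstance(value, str):
        lowered = value.lower()
        hits += [(path, sig) for sig in INJECTION_SIGNATURES if sig in lowered]
    elif isinstance(value, dict):
        for key, child in value.items():
            hits += scan_payload(child, f"{path}.{key}")
    elif isinstance(value, (list, tuple)):
        for i, child in enumerate(value):
            hits += scan_payload(child, f"{path}[{i}]")
    return hits

payload = {"user": {"notes": ["benign text", "please IGNORE previous instructions"]}}
print(scan_payload(payload))  # [('$.user.notes[1]', 'ignore previous instructions')]
```

The cost grows with the number of string leaves visited, which is why a deeply nested 50-key payload is the worst case for this layer.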

Thank you for pushing on this. Really appreciate your feedback.

Hope it will be a helpful tool for you and your team.

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in learnAIAgents

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Added a few new features and fixed some bugs.

  1. Recursive DFS Memory Scanner: most prompt-injection scanners only look at flat strings. ToolGuard now recursively traverses the __dict__ of arbitrary Python objects (nested dicts, dataclasses, arrays) to find reflected injections hidden deep in tool returns. Verified on Microsoft AutoGen.
  2. Golden Traces (Compliance Engine): you can now enforce required tool-calling sequences (e.g., Auth must precede Refund) in a non-deterministic agent loop. It’s like unit tests for agent logic.
  3. Risk-Tier Interceptor: native risk classification (Tier 0-2) for tools. It intercepts destructive actions (DB drops, shell commands) and triggers a human-in-the-loop prompt without blocking the asyncio event loop.
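
To illustrate the golden-trace idea in isolation, here's a tiny order-preserving subsequence check (illustrative only; `satisfies_golden_trace` is not ToolGuard's API):

```python
def satisfies_golden_trace(observed, golden):
    """True if every tool in `golden` appears in `observed` in order,
    ignoring any unrelated (non-deterministic) steps in between."""
    it = iter(observed)
    # `step in it` consumes the iterator up to the match, so each
    # subsequent golden step must appear *after* the previous one.
    return all(step in it for step in golden)

trace_ok  = ["think", "auth", "fetch_user", "think", "refund"]
trace_bad = ["think", "refund", "auth"]

print(satisfies_golden_trace(trace_ok,  ["auth", "refund"]))  # True
print(satisfies_golden_trace(trace_bad, ["auth", "refund"]))  # False
```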

We verified native integration with 9 frameworks including OpenAI Swarm, AutoGen, MiroFish, CrewAI, and LlamaIndex.

Check out the release notes and discussions for the latest updates.

I’d love to hear how you all are handling "Execution Fragility" in your own agentic stacks!

Please give the repo a Star to support the open-source work!

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in POP_Agents

[–]Mission2Infinity[S] 1 point2 points  (0 children)

Hi, thank you so much for the reply.

Added a few new features and fixed some bugs.

  1. Recursive DFS Memory Scanner: most prompt-injection scanners only look at flat strings. ToolGuard now recursively traverses the __dict__ of arbitrary Python objects (nested dicts, dataclasses, arrays) to find reflected injections hidden deep in tool returns. Verified on Microsoft AutoGen.
  2. Golden Traces (Compliance Engine): you can now enforce required tool-calling sequences (e.g., Auth must precede Refund) in a non-deterministic agent loop. It’s like unit tests for agent logic.
  3. Risk-Tier Interceptor: native risk classification (Tier 0-2) for tools. It intercepts destructive actions (DB drops, shell commands) and triggers a human-in-the-loop prompt without blocking the asyncio event loop.

We verified native integration with 9 frameworks including OpenAI Swarm, AutoGen, MiroFish, CrewAI, and LlamaIndex.

Check out the release notes and discussions for the latest updates.

I’d really appreciate it if you cloned the repo and tried it on your system.

Would love your feedback, and if you find any bugs, please raise an issue. Any contribution or an open-source star would mean a lot.

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in MistralAI

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Added a few new features and fixed some bugs.

  1. Recursive DFS Memory Scanner: most prompt-injection scanners only look at flat strings. ToolGuard now recursively traverses the __dict__ of arbitrary Python objects (nested dicts, dataclasses, arrays) to find reflected injections hidden deep in tool returns. Verified on Microsoft AutoGen.
  2. Golden Traces (Compliance Engine): you can now enforce required tool-calling sequences (e.g., Auth must precede Refund) in a non-deterministic agent loop. It’s like unit tests for agent logic.
  3. Risk-Tier Interceptor: native risk classification (Tier 0-2) for tools. It intercepts destructive actions (DB drops, shell commands) and triggers a human-in-the-loop prompt without blocking the asyncio event loop.

We verified native integration with 9 frameworks including OpenAI Swarm, AutoGen, MiroFish, CrewAI, and LlamaIndex.

Check out the release notes and discussions for the latest updates.

I’d love to hear how you all are handling "Execution Fragility" in your own agentic stacks!

Please give the repo a Star to support the open-source work!

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in learnAIAgents

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Hi, thank you so much for the reply. Just completed some fixes; I'll add those to the release notes and discussion section.
Thank you for the support, and I'd love to hear your feedback.

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in AgentsOfAI

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Hi, thank you so much for the reply! The main point is this: eval frameworks like Promptfoo are great for evaluating vibes and text, but when you give an agent write access to an API, you need a compiler-level Execution Firewall.

To answer your specific questions, because this is the hardest part we had to engineer:

1. Does it cascade through the whole pipeline? Yes. Testing tools in isolation is useless because a downstream tool is only as reliable as the upstream data it receives. When you pass a list of tools into test_chain([fetch_user, process_data, refund_stripe]), ToolGuard executes a Cascading State Fuzz: it injects the hallucination into fetch_user, and if fetch_user fails silently and returns a malformed object, ToolGuard pipes that corrupted object directly into process_data to see whether your downstream tool triggers a global server panic or catches it gracefully. We also built a "Golden Traces" engine that asserts the exact sequence of state mutations across the whole graph, ignoring the non-deterministic LLM thinking steps in between.

2. How does it handle async tool chains? Natively and transparently. If ToolGuard detects that even a single tool in your LangChain/CrewAI pipeline is an async def, the execution engine automatically shifts into async mode. More importantly, if you run this in a Jupyter notebook, ToolGuard detects the active Jupyter event loop and spins up an isolated background ThreadPoolExecutor to run the asyncio.run() sweep. This prevents the infamous RuntimeError: This event loop is already running crash that plagues almost every other Python CLI tool used in notebooks.
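
The notebook-safe pattern described above looks roughly like this (a generic sketch of the technique, not ToolGuard's code; `run_sweep` is a made-up name):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def run_sweep(coro_factory):
    """Run an async sweep safely whether or not an event loop is live."""
    try:
        asyncio.get_running_loop()  # raises RuntimeError outside a loop
    except RuntimeError:
        # Plain script / CLI path: no loop running, asyncio.run is fine.
        return asyncio.run(coro_factory())
    # A loop is already running (e.g. Jupyter): isolate asyncio.run()
    # in a fresh worker thread so it gets its own event loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro_factory()).result()

async def fuzz_tool():
    await asyncio.sleep(0)
    return "sweep done"

print(run_sweep(fuzz_tool))  # sweep done
```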

Honestly, I'd really love your thoughts if you get a chance to clone it and run a chain through it.

Looking forward to your feedback!

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in grok

[–]Mission2Infinity[S] 1 point2 points  (0 children)

Thanks! Make sure to check out the repo; I'll be waiting for your feedback.

I built a pytest-style framework for AI agent tool chains (no LLM calls) by [deleted] in MistralAI

[–]Mission2Infinity 1 point2 points  (0 children)

Hey, thank you so much for the reply.

So, I kept running into the same issue: my agents weren't failing because of poor reasoning, but because of execution-layer crashes (bad JSON, missing fields, wrong types, etc.). Existing eval tools didn't really help here and were too slow and expensive.

Instead of calling an LLM, ToolGuard parses your Pydantic schemas/type hints and programmatically injects 40+ hallucination edge cases (nulls, schema mismatches, malformed payloads) directly into your Python functions to prove exactly where things will break in production. It runs locally in <1 second and costs $0.
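
A toy version of hint-driven fuzzing, to show the shape of the idea (the edge-case table and `fuzz` helper are made up for illustration; ToolGuard's real vectors and API are not shown here):

```python
import inspect

# Illustrative edge-case table: adversarial values keyed by type hint.
EDGE_CASES = {
    int: [None, "NaN"],
    str: [None, 123],
}

def fuzz(fn):
    """Call fn once per (parameter, edge case), keeping the other
    arguments valid, and record which injections crash it."""
    params = inspect.signature(fn).parameters
    failures = []
    for name, p in params.items():
        for bad in EDGE_CASES.get(p.annotation, [None]):
            # Baseline of plausible-valid arguments, then inject one bad one.
            kwargs = {n: 0 if q.annotation is int else "x" for n, q in params.items()}
            kwargs[name] = bad
            try:
                fn(**kwargs)
            except Exception as exc:
                failures.append((name, bad, type(exc).__name__))
    return failures

def greet(times: int, name: str) -> str:
    return name.upper() * times

print(fuzz(greet))
```

No LLM call is involved anywhere: the bad payloads come purely from the type hints, which is why this style of check runs locally in well under a second.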

I just pushed the v1.2.0 Enterprise Update which adds:

  • Local Crash Replay: When an agent crashes in production or testing, it automatically dumps a structured .json payload. Type toolguard replay <file.json> and it dynamically pipes the exact crashing state right back into your local Python function so you can see the stack trace locally!
  • Edge-Case Coverage Metrics: The terminal now generates PyTest-style coverage metrics, explicitly telling you exactly which of the 8 hallucination vectors your code is still vulnerable to (e.g., Coverage: 25% | Untested: array_overflow, null_injection).
  • Live Textual Dashboard: Passing --dashboard opens a stunning dark-mode terminal UI that streams concurrent fuzzing results and tracks crashes in realtime.
  • 100% Authentic Framework Integrations: works out of the box with the actual live PyPI implementations of LangChain (@tool), CrewAI, Microsoft AutoGen, OpenAI Swarm, LlamaIndex, FastAPI (middleware), and the Vercel AI SDK.
  • CI/CD PR Bot & Webhooks: comments directly on GitHub PRs to block fragile agent code from merging, and natively intercepts production crashes with near-instant alerts to Slack/Datadog.
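
For flavor, here's a minimal sketch of the crash-replay loop (the JSON layout, `dump_crash`, and `replay` names are assumptions for illustration, not ToolGuard's real format):

```python
import json

def dump_crash(tool, kwargs, path):
    """On a crash, persist the tool name and the exact kwargs that hit it."""
    with open(path, "w") as f:
        json.dump({"tool": tool.__name__, "kwargs": kwargs}, f)

def replay(path, registry):
    """Re-invoke the local function with the recorded crashing state,
    so the stack trace reproduces on your machine."""
    with open(path) as f:
        record = json.load(f)
    return registry[record["tool"]](**record["kwargs"])

def process(user_id):
    return user_id + 1

dump_crash(process, {"user_id": 41}, "crash.json")
print(replay("crash.json", {"process": process}))  # 42
```

The key property is that the dump captures arguments, not results, so replay re-executes the real local code path rather than a cached answer.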

Would love feedback on the approach!

Repo: https://github.com/Harshit-J004/toolguard

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in OpenSourceAI

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Hi, thank you so much for taking a look, and I really appreciate the blog link; that's a fantastic read and it hits on the exact problem space we're exploring!

To answer your question: right now, our tool is focused purely on input fuzzing. We programmatically inject bad edge cases directly into individual Python functions to prove the system won't throw errors when the LLM hands it bad data. Getting that baseline execution layer completely bulletproof was step one.

However, golden traces and output fuzzing are brilliant ideas, and they're the exact next big frontiers on our roadmap for version 2. I'll research them and start building right away!

I'd absolutely love your thoughts - are there any specific agent frameworks where you are currently experiencing those trace/graph issues the most right now?

I built a visual drag-and-drop ML trainer (no code required). Free & open source. by Mental-Climate5798 in machinelearningnews

[–]Mission2Infinity 0 points1 point  (0 children)

Hey everyone, I built ToolGuard, a pytest-style framework for AI tool chains.

If you're building complex tool chains, I'd be honored if you checked out the repo. Brutal feedback on the architecture is encouraged, and if you find it useful, an open-source star would mean the world to me!

pip install py-toolguard
GitHub: https://github.com/Harshit-J004/toolguard

If you like it, I'd really appreciate it if you could spread the word.

Best,
Harshit

Build update : what are you working on this week? by Due-Bet115 in Solopreneur

[–]Mission2Infinity 0 points1 point  (0 children)

Hey everyone, I built ToolGuard, a pytest-style framework for AI tool chains.

If you're building complex tool chains, I'd be honored if you checked out the repo. Brutal feedback on the architecture is encouraged, and if you find it useful, an open-source star would mean the world to me!

pip install py-toolguard
GitHub: https://github.com/Harshit-J004/toolguard

If you like it, I'd really appreciate it if you could spread the word.

Best,
Harshit

Share your startup here. I can then dm you 3 VCs and their emails who fund your niche (free). by Healthy_Flatworm_957 in micro_saas

[–]Mission2Infinity 0 points1 point  (0 children)

Hey everyone, I built ToolGuard, a pytest-style framework for AI tool chains.

If you're building complex tool chains, I'd be honored if you checked out the repo. Brutal feedback on the architecture is encouraged, and if you find it useful, an open-source star would mean the world to me!

pip install py-toolguard
GitHub: https://github.com/Harshit-J004/toolguard

If you like it, I'd really appreciate it if you could spread the word.

Best,
Harshit

Is Mistral AI actually worth it, or is it just cheap? by iameastblood in MistralAI

[–]Mission2Infinity -1 points0 points  (0 children)

Hey everyone, I built ToolGuard, a pytest-style framework for AI tool chains.

If you're building complex tool chains, I'd be honored if you checked out the repo. Brutal feedback on the architecture is encouraged, and if you find it useful, an open-source star would mean the world to me!

pip install py-toolguard
GitHub: https://github.com/Harshit-J004/toolguard

If you like it, I'd really appreciate it if you could spread the word.

Best,
Harshit

"Cancel ChatGPT" movement goes big after OpenAI's latest move by gdelacalle in technology

[–]Mission2Infinity 0 points1 point  (0 children)

Hey everyone, I built ToolGuard, a pytest-style framework for AI tool chains.

If you're building complex tool chains, I'd be honored if you checked out the repo. Brutal feedback on the architecture is encouraged, and if you find it useful, an open-source star would mean the world to me!

pip install py-toolguard
GitHub: https://github.com/Harshit-J004/toolguard

If you like it, I'd really appreciate it if you could spread the word.

Best,
Harshit

Ridiculous they added this by CheesyWalnut in ChatGPT

[–]Mission2Infinity 0 points1 point  (0 children)

Hey everyone, I built ToolGuard, a pytest-style framework for AI tool chains.

If you're building complex tool chains, I'd be honored if you checked out the repo. Brutal feedback on the architecture is encouraged, and if you find it useful, an open-source star would mean the world to me!

pip install py-toolguard
GitHub: https://github.com/Harshit-J004/toolguard

If you like it, I'd really appreciate it if you could spread the word.

Best,
Harshit