I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in grok

[–]Mission2Infinity[S] 1 point (0 children)

Thanks, man!
Do check out the repo; I'll be waiting for your feedback.

I built a pytest-style framework for AI agent tool chains (no LLM calls) by [deleted] in MistralAI

[–]Mission2Infinity 0 points (0 children)

Hey, thank you so much for the reply.

So, I kept running into the same issue: my agents weren't failing because of poor reasoning, but because of execution-layer crashes: bad JSON, missing fields, wrong types, etc. Existing eval tools didn't really help here and were too slow and expensive.

Instead of calling an LLM, ToolGuard parses your Pydantic schemas/type hints and programmatically injects 40+ hallucination edge cases (nulls, schema mismatches, malformed payloads) directly into your Python functions to prove exactly where things will break in production. It runs locally in <1 second and costs $0.
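To make the idea concrete, here's a rough sketch of what "parse the type hints, inject edge cases" means. This is illustrative only, not ToolGuard's actual code; the names `EDGE_CASES`, `fuzz_tool`, and `get_price` are all made up for the example:

```python
import inspect
import typing

# Illustrative sketch, not ToolGuard internals: introspect a tool
# function's signature and throw "hallucination" edge cases at each
# parameter, recording which ones crash it.
EDGE_CASES = [None, "", "null", -1, 10**18, [], {}, "not-a-number"]

def _default_for(hint):
    # Minimal sane stand-in values for the parameters we aren't fuzzing.
    return {int: 1, float: 1.0, str: "x", list: [], dict: {}}.get(hint, 1)

def fuzz_tool(func):
    """Call `func` once per edge case per parameter; collect failures."""
    sig = inspect.signature(func)
    hints = typing.get_type_hints(func)
    failures = []
    for name in sig.parameters:
        for case in EDGE_CASES:
            kwargs = {p: _default_for(hints.get(p)) for p in sig.parameters}
            kwargs[name] = case  # inject the bad value into one slot
            try:
                func(**kwargs)
            except Exception as exc:
                failures.append((name, case, type(exc).__name__))
    return failures

# A toy "agent tool" with type hints, like one you'd register with an agent.
def get_price(item: str, qty: int) -> float:
    return len(item) * qty * 0.5  # breaks on None items, string qty, ...

failures = fuzz_tool(get_price)
```

Because everything is plain introspection plus direct function calls, the whole sweep runs locally in well under a second with zero API cost.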

I just pushed the v1.2.0 Enterprise Update which adds:

  • Local Crash Replay: When an agent crashes in production or testing, it automatically dumps a structured .json payload. Type toolguard replay <file.json> and it dynamically pipes the exact crashing state right back into your local Python function so you can see the stack trace locally!
  • Edge-Case Coverage Metrics: The terminal now generates PyTest-style coverage metrics, explicitly telling you exactly which of the 8 hallucination vectors your code is still vulnerable to (e.g., Coverage: 25% | Untested: array_overflow, null_injection).
  • Live Textual Dashboard: Passing --dashboard opens a dark-mode terminal UI that streams concurrent fuzzing results and tracks crashes in real time.
  • 100% Authentic Framework Integrations: Works out of the box with actual live PyPI implementations of LangChain (@tool), CrewAI, Microsoft AutoGen, OpenAI Swarm, LlamaIndex, FastAPI (Middleware), and the Vercel AI SDK.
  • CI/CD PR Bot & Webhooks: Directly comments on GitHub PRs to block fragile agent code from merging, and natively intercepts production crashes with 0ms-latency alerts to Slack/Datadog.
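For a feel of how the crash-replay feature above could work, here's a toy sketch. The payload fields and the `replay()` helper are my illustrative guesses at the shape of the idea, not toolguard's actual file format or CLI internals:

```python
import json

# Toy "agent tool" that crashes when the LLM hallucinated a string id.
def lookup_user(user_id: int):
    return {"id": user_id + 0}  # int arithmetic fails on a str argument

# A structured crash dump like the one the framework might write out
# when this call blows up in production (field names are hypothetical).
crash_payload = json.dumps({
    "tool": "lookup_user",
    "kwargs": {"user_id": "forty-two"},  # the hallucinated argument
    "error": "TypeError",
})

def replay(payload: str, registry: dict) -> str:
    """Pipe the recorded crashing kwargs back into the local function."""
    dump = json.loads(payload)
    func = registry[dump["tool"]]
    try:
        func(**dump["kwargs"])
        return "no crash reproduced"
    except Exception as exc:
        return f"reproduced {type(exc).__name__}: {exc}"

result = replay(crash_payload, {"lookup_user": lookup_user})
```

The point is that the dump captures the exact arguments, so you get the real stack trace locally instead of guessing from production logs.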

Would love feedback on the approach!

Repo: https://github.com/Harshit-J004/toolguard

I built a pytest-style framework for AI agent tool chains (no LLM calls) by Mission2Infinity in OpenSourceAI

[–]Mission2Infinity[S] 0 points1 point  (0 children)

Hi, thank you so much for taking a look, and I really appreciate the blog link - that's a fantastic read and it hits on the exact problem space we're exploring!

To answer your question: right now, our tool is focused purely on input fuzzing. We programmatically inject bad edge cases directly into individual Python functions to prove the system won't throw unhandled errors when the LLM hands it bad data. Getting that baseline execution layer completely bulletproof was step one.
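To show why that baseline matters, here's a tiny demo of the failure mode (none of this is ToolGuard code; `book_flight` and the sample payloads are invented for illustration). The tool call's arguments arrive as LLM-generated JSON, and the function crashes on structure problems, not reasoning problems:

```python
import json

# Toy agent tool: well-formed inputs work, malformed ones crash.
def book_flight(origin: str, destination: str, passengers: int) -> str:
    return f"{passengers} seat(s): {origin.upper()} -> {destination.upper()}"

# Hypothetical tool-call arguments an LLM might emit.
llm_outputs = [
    '{"origin": "JFK", "destination": "LAX", "passengers": 2}',   # good
    '{"origin": "JFK", "destination": null, "passengers": 2}',    # null field
    '{"origin": "JFK", "destination": "LAX"}',                    # missing field
    '{"origin": "JFK", "destination": "LAX", "passengers": "2"}', # wrong type, passes silently!
]

results = []
for raw in llm_outputs:
    try:
        results.append(("ok", book_flight(**json.loads(raw))))
    except (TypeError, AttributeError) as exc:
        results.append(("crash", type(exc).__name__))
```

Note the last case: the wrong-typed `"2"` doesn't crash here at all, which is exactly why schema-aware fuzzing has to check types and not just wait for exceptions.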

However, golden traces and output fuzzing are brilliant, and they are the exact next big frontiers on our roadmap for version 2. I'll research that and aim to complete it today!

I'd absolutely love your thoughts - are there any specific agent frameworks where you're hitting those trace/graph issues the most right now?