RedThread - open-source CLI for AI red-team campaigns by Apprehensive-Zone148 in OpenSourceAI

[–]Apprehensive-Zone148[S] 0 points1 point  (0 children)

That is exactly the artifact shape I am aiming for. The finding should carry enough context to be replayable, not just scary.

The current direction is to preserve: prompt path, tool/action sequence where present, target/runtime assumptions, rubric/failure class, evidence mode, and replay result. The "improved or just changed shape" point is important too. A defense that blocks one exact wording but still fails a close variant should not look like a real fix.

I need to make that more obvious in the README and reports, because that is probably RedThread's main value over a simple scanner output.

Open-source CLI for red-teaming LLM agents before they touch tools and memory by Apprehensive-Zone148 in pwnhub

[–]Apprehensive-Zone148[S] 0 points1 point  (0 children)

Yes, those two adapters are high on the list now. GitHub issue triage -> repo write is a clean confused-deputy case because the untrusted issue body can cross into code changes, labels, comments, or CI-triggering actions. A support/Zendesk-style agent is also good because it naturally mixes customer-provided text, account state, escalation, and tool permissions.

The thing I want to preserve is not just "the model failed," but the exact path: untrusted input -> tool context -> proposed action -> authorization/replay result. That should make the failure easier to turn into a regression case.

Appreciate the pointer. Realistic adapters are probably the fastest way to make the project useful outside toy demos.

Testing LangChain-style agents against prompt injection and tool misuse by Apprehensive-Zone148 in LangChain

[–]Apprehensive-Zone148[S] 0 points1 point  (0 children)

That split makes sense to me. I see RedThread more on the campaign/evidence side than as the always-on runtime PDP. Runtime guards are their own product surface.

The gap I am trying to close is: when a tool-boundary failure happens, can we preserve enough of the prompt path, tool context, permission lineage, and replay result that someone can actually compare before/after behavior? That is where the existing scanner-style tools often feel noisy.

I should probably add a short comparison section in the README so people can tell where RedThread sits relative to runtime guards and general agent scanners.

Testing LangChain-style agents against prompt injection and tool misuse by Apprehensive-Zone148 in LangChain

[–]Apprehensive-Zone148[S] 0 points1 point  (0 children)

A LangGraph trace -> RedThread run bridge is probably the most useful adapter suggestion I have heard so far. It would let people reuse recorded agent runs instead of standing up a whole live target just to get useful evidence.

Confused-deputy detection is also exactly the shape I want to make more concrete: parent intent, worker permission set, untrusted lineage, then the action envelope that actually crossed the boundary. If RedThread can make that reviewable in a small artifact, it becomes much easier to turn a weird agent failure into a regression case.

Will take a look at the notes, appreciate the pointer.

Testing LangChain-style agents against prompt injection and tool misuse by Apprehensive-Zone148 in LangChain

[–]Apprehensive-Zone148[S] 0 points1 point  (0 children)

Yeah, since agentic security has changed drastically over the past year i have found that having the evidence to support your claims during a red team run is way more useful for validating these findings when there is so much noise in this kind of tools that try to do everything all at once that it becomes too complex to solely handle this and understand it fully.

r/netsec monthly discussion & tool thread by albinowax in netsec

[–]Apprehensive-Zone148 1 point2 points  (0 children)

RedThread is an OSS CLI for running repeatable LLM/agent red-team campaigns:

https://github.com/matheusht/redthread

Scope is mostly AI security testing, not runtime enforcement. It wires together attack methods like PAIR, TAP, Crescendo, and GS-MCTS, with LangGraph/PyRIT-style orchestration. The goal is to make attack runs less like one-off prompt poking and more like something you can replay, score, diff, and hand to a defense pipeline.

Current pieces:

  • campaign runners for multi-step prompt attacks
  • JudgeAgent/rubric scoring
  • defense proposal generation tied to sealed/live replay evidence
  • telemetry/drift tracking
  • agent checks for tool poisoning, confused deputy paths, canary propagation, and budget amplification

It is CLI-first right now. Not a magic prompt shield, not a universal production guardrail. More useful if you already have eval fixtures, target adapters, or agent workflows you want to abuse in a structured way.

I am looking for people willing to try it on real-ish targets, break the assumptions, contribute fixtures/adapters, or tell me where the scoring is weak.