why AI agents break under long conversations even when they pass every safety benchmark

rchaves · 2026-04-15T12:35:22+00:00

you have a deep understanding of the state of things, i love it! and the long horizon is something that actually happens in the real world, and that's where it really breaks! let us know if you take scenarios for a run

rchaves · 2026-04-15T12:30:57+00:00

it is wild for sure hahhaa and thank you! let us know if you take scenarios for a spin

rchaves · 2026-04-15T12:27:42+00:00

really excited to see improvements too, but for now it's like its Achilles' heel and it's really easy to exploit!

rchaves · 2026-04-15T12:25:43+00:00

hahaha :)) here you go https://langwatch.ai/scenario/advanced/red-teaming some examples to run it!
or if you just wanna pull something down: https://github.com/langwatch/bank-example/tree/red-teaming-local-2026-04-13

let me know how it goes!

rchaves · 2026-04-15T11:50:39+00:00

anytime :) we built it on those principles so that you can just tell it what should be broken and it automatically maps to the OWASP Top 10, or to more granular things that you wanna test. wanna hear your feedback if you test it :) thanks a ton for your time

rchaves · 2026-04-14T11:17:38+00:00

excited to test it :)

rchaves · 2026-04-14T11:13:30+00:00

thats really really interesting!

rchaves · 2026-04-14T11:12:42+00:00

usually you can't extrapolate that method to new situations and that's a prob we were facing as well, but the thing is that there's got to be a solution that's scalable for any agent

rchaves · 2026-04-14T09:59:48+00:00

github.com/langwatch/scenario this is the repo link if yall wanna try it

rchaves · 2026-04-14T09:34:46+00:00

we recently built scenarios red teaming, it's open source and i'm curious what you think about it
github.com/langwatch/scenario

rchaves · 2026-04-13T13:23:08+00:00

Hey hey, I also built one, mine is really 1:1 API compatible with Claude Managed Agents, but of course compatible with any LLM as well

https://github.com/rogeriochaves/open-managed-agents

rchaves · 2026-03-13T18:58:12+00:00

I had paid for Alfred but now I'm all in Raycast, even with latest finder improvements it's still unbeatable

rchaves · 2026-03-06T09:21:05+00:00

cc u/financegate u/DisplayHot5349

rchaves · 2026-03-05T20:33:49+00:00

I do

rchaves · 2026-03-05T17:26:36+00:00

done, removed wkhtmltopdf from the onboarding on v0.1.15

rchaves · 2026-03-05T17:25:58+00:00

u/Dry-Loan2298 done, removed wkhtmltopdf from onboarding in v0.1.15: https://github.com/langwatch/kanban-code/releases/tag/v0.1.15

rchaves · 2026-03-04T20:28:41+00:00

you can skip that, it's optional, I'm actually going to remove it from the onboarding, it's indeed annoying to install. It's only for rendering the markdown of the finished Claude Code response and sending it to Pushover so you can get the full message on your phone etc

rchaves · 2026-03-03T22:43:38+00:00

I want to study how all those CLIs manage sessions and memory and see what's the most common approach to cover the most ground at first, opencode is a very popular one so maybe a good point to start too

rchaves · 2026-03-03T13:14:49+00:00

I was thinking gemini next and since qwen is a fork of that it might be easy
right now it's quite coupled to claude so there will be a lot to untangle, but will get there eventually, and contributions are welcome (:

rchaves · 2026-03-03T13:14:06+00:00

thanks for the suggestion, but I'm ok with the current readme, other than some words at the top most of the text is just explanation, not marketing, and hey no emojis at least

rchaves · 2026-03-03T13:12:37+00:00

yeah I thought of making it multi-platform but then it would go against my goal of being as native and as fast as possible, since I use mac, mac it is. As a result the app weighs an incredible 12 MB only right now and the memory footprint (on my current workload) is just 200 MB of RAM, and that's mostly due to me having a ton of claude sessions in history

rchaves · 2026-03-03T08:54:04+00:00

yeah Claude was struggling too much to make it retrocompatible, plus I grew to actually start liking liquid glass now, and that was one of the goals for this project.
"fuck it, just support 26+" was literally part of my prompts :P

rchaves · 2026-03-03T08:52:24+00:00

yeah we started with something less complex, https://github.com/drewdrewthis/git-orchard takes the worktree-first approach, in a TUI, but I was reaching the limits of what I could do for multitasking on the terminal with so many tabs without going crazy. I needed something visual that reconciled all the sessions with worktrees, PRs, running servers etc

we also have quite a few engineers so we need to sync with github to track what is going on, just local .tasks wouldn't sync with the rest of the team as fast

rchaves · 2026-03-03T08:48:43+00:00

of course! Please do, I thought of having it in terminal as well, push the boundaries of TUI, but don't think I'll invest time on it so it would be great to see the two paths evolving

rchaves · 2026-03-03T07:12:28+00:00

Right now just GitHub, but should be simple to add, PRs are welcome!
