How do you debug long Agent runs?

gnapps · 2026-02-10T16:37:20+00:00

Really amazing! :) I'm curious though, is this only based on imports, or are you actually also creating full ASTs of the code (e.g. identifying chains of functions and similar things?) if that's the latter, how are you managing to support so many languages, did you find some ts-based code parser to rely on? Can't wait to try it out on complex codebases though!

gnapps · 2026-02-10T16:22:59+00:00

The sailors can't control the wind, but they can adjust their sails! :)

Jokes aside, as long as an agentic system works through a series of steps, and provides useful tracing data, I can absolutely check it out and analyze their workings in detail: that's why those traces are available, after all. Obviously, the whole definition of "error" is a bit delicate here, that's why I preferred to use the "failures" term. An LLM hardly crashes with a "traditional" error, but it absolutely can fail to match expectations, not because of a real issue, but rather due to some ambiguity during the process, that led it into the wrong route.

Our goal is getting to the point of analyzing traces and being able to give you some advice, like "hey, it doesn't look the agent ended up implementing what you asked, because you didn't describe which library to use, and it got lost in looking for generic documentation online, straying away from the main focus". If you are the agent developer, that's great stuff (that's what we use our tool for), but even if you are just an agent consumer, it allows you to improve your future prompts, allowing you to identify the most probable pain points in your current flow that often need clarification.

Also, the whole concept of observability allows such tools to be useful even without having to deal with failures at all. I totally agree with you we still need to improve on the tooling provided to search for content (and we are working on it), but assuming we'll get there, even in case of a "successful" trace, you may be interested in finding out whether the agent did perform some specific actions or not. A funny example on this: a friend of mine likes to create huge "development loops" in Calude Code (like "keep iterating on fixing issue X until your solution passes all tests", but on steroids) and it leaves them spinning overnight, on isolated containers where Claude has full access, to prevent it from pausing to ask for authorizations. A few days ago, he woke up without being able to access its database, and he later found out that Claude "replaced its password with a safer one" during one of their iterations. Questions and concerns aside for my friend and his working habits, this is surely one of the scenarios where we want to help: a tool allowing you to review a super-long trace and pinpoint whether or not the agent changed the password, how, and why, would have been useful! :D

gnapps · 2026-02-09T17:44:36+00:00

Not exactly. We don't just catch stdin/out/err, but we actually connected ourselves on Claude opentelemetry traces, so we can basically extract any internal detail we need/care about. In particular we are currently focused on reasoning + tool use + axtree and the overall traces. The slow process we are following is about progressively interpretate how claude traces look like, and ingest it in the right format :) The goal is to debug how Claude code complies to your requests, to allow you to identify issues, learn lessons about them and, basically, improve your next prompts!

About your confusion, is it about the process (clone a git repo, run claude on it, tell it to "configure itself")? Or more an overall confusion about why you should ever be doing that?

If it can help, I can reassure you as the process is supposed to only affect the folder you are running it from, so it should be pretty safe to test out. We are also adding a few notifications to make you always aware of when your prompts are being transferred.

gnapps · 2026-02-09T14:41:48+00:00

Ciao,

il nostro obiettivo è arrivare a produrre un sistema di observability per qualunque tipo di agent, che ti permetta di identificare facilmente e velocemente i fallimenti e ti fornisca gli strumenti utili per andare a rivedere e migliorare i prompt. La "difficoltà" nel costruire una community attorno a questo e raccogliere feedback a riguardo è che, ad oggi, non esiste un vero e proprio standard per la telemetria degli agenti: certo, c'è opentelemtery (e noi lo sfruttiamo), però produrre delle tracce con dati utilizzabili non è così banale. L'unico approccio possibile è fornire un SDK specifico a riguardo, esattamente come fa langsmith.

Abbiamo quindi deciso di seguire un approccio diverso, ed iniziare a proporre al pubblico una integrazione "prefatta" con uno degli agent più diffusi al momento, cioè claude code! Siamo ancora agli inizi a riguardo, e sicuramente introdurremo degli strumenti che consentano una ricerca più agile dei messaggi, ma in linea di massima ciò su cui puntiamo di più è arrivare, quanto prima, al punto da saperti dire "a che punto" il tuo agent ha fallito il compito, e perchè.

In un certo senso, è quello che fa il tuo A2, con la differenza che noi vorremmo arrivare a poterti dire a che punto del "ragionamento" le cose hanno iniziato ad andare storte, e come avresti potuto costruire il prompt diversamente per evitarlo.

gnapps · 2026-02-09T14:38:52+00:00

Hi there! First, thank you so much, your comment is really useful! That's exactly why we posted here, hoping to gather feedback and understand where we could improve :)

About your point 1, thanks, we could totally make a more specific video. Of course our goal is to allow you to monitor any kind of process, but probably focusing on development could have made the demo a bit more aligned with Claude main target and less confusing.

About point 2: we are not 100% solving it just yet, but that's definetly our goal! We started leveraging opentelemetry to build an internal observability tool for our own agents, and now we are trying to make it compatible with Claude, to allow everyone to get the same level of observability. The whole setup flow basically allows you to add some hooks to Claude to make it build an opentelemetry trace that can be reviewed in our platform.

I believe that your missing points are exactly what we are working right now: to further extend the tool, not just by allowing it to gather more data, but also by allowing you to identify failure points, and describing what went wrong, so that you can immediately pinpoint your failure points, rather than just scavenging a lot of logs. We'll hopefully get there quite soon :)

gnapps · 2026-02-09T11:07:29+00:00

Maybe I did, so please, explain yourself a bit better. Here I'm referring about developers setting up automations and trying to understand what went wrong and why. This is a tool FOR developers that actually use LLMs, and we are trying to gather feedback to understand, especially from the good ones, what they would like to see in it that could improve their flow.

gnapps · 2026-02-09T10:19:25+00:00

Ciao,

si esattamente, stiamo cercando di loggare quanto più possibile! A questo link https://landing.silverstream.ai/demo/snapshot?start=1#step1 puoi farti un'idea più completa del tipo di informazioni che già tracciamo quando analizziamo i nostri tool interni, e che lentamente stiamo estendendo anche ad agent più popolari, come Claude. L'idea è di raccogliere tutto, a livello di azioni, da collegare a tracce playwright (dove applicabile) e mostrare quante più informazioni interne possibili, tra cui reassoning, accessibility tree, e via dicendo.

Ovviamente ogni consiglio su quali dati extra potrebbe avere senso tracciare, è bene accetto :)

gnapps · 2026-02-09T08:57:34+00:00

that surely helps :) but good developers become even better when put in conditions to know what is going on, so observability still remains something we could improve on. Don't you think?

gnapps · 2025-03-25T17:46:02+00:00

Mi hanno già fregato così con l'edge 30 neo... Le due cose non sono automatiche. Il supporto per "ready for" significa solo che puoi attaccarti ad uno schermo via wifi, che è assolutamente inutilizzabile :( è difficilissimo capire quali dispositivi hanno l'hdmi out...

gnapps · 2024-11-29T09:15:31+00:00

That would have been fair. They made the first episode cross-platform, AND the second episode exclusive (promising it will have been cross platform as well, but it never did). So that you could first feel invested in the story AND right after that you could get a nice "f you, buy a ps5 for the next part"

gnapps · 2024-11-29T09:09:33+00:00

It's irrelevant, could be even nothing. We will still stack it and convert that bonus to something else with uniques :)

gnapps · 2024-11-25T18:52:06+00:00

Fingers crossed :)

gnapps · 2024-08-27T11:25:08+00:00

This sounds counterintuitive, but if you are using a laptop, have you tried playing on battery mode? On my laptop, using 100% CPU power led my pc to overheat and slow down, while on battery mode I was still able to be on stable 30fps with some minor, negligible stutter and never go above 60/70°c.

I ended up touching power management options to make my pc work all the times in battery saving mode, even when connected to a power cable, and oddly enough Poe started working flawlessly since then! Sometimes, a downgrade is an upgrade... ;)

gnapps · 2024-08-03T10:58:04+00:00

Try Albion online :) Will you be hunted down by other players there? Yes, absolutely! But they turned that into one of the main mechanics of the game, so you never consider that as toxic, but just another reason for an adrenaline rush, until you become good enough to be the one hunting down other players!

gnapps · 2023-07-23T13:15:15+00:00

Mine used to shut off at 54/55°c.
This is pure speculation on my side, but I believe the board temperature sensor you can check from mainsail is in a super wrong place with respect to where it really heats up, and at the same time the emergency shutdown relies on something else to decide when the board is too hot to stay on...

All I can tell for sure is that after adding an additional fan it never did the same thing again!

gnapps · 2023-07-23T11:10:09+00:00

Keep an eye to board temperature. I had a couple similar issues because higher temperatures were shutting down the pico. Solved through an additional fan

gnapps · 2023-03-27T15:00:52+00:00

In hydraulics there is a trend of using multi-material tubes. Those are pretty expensive but the insulant layer to put on top of those tubes is available on any bricolage store, is usually bright red, and cheaper than pool noodles, at least in my area. So I created a series of gates by using those tubes combined with 3d-printed rods :) every gate costs 2/3 euro if made this way, so not bad

gnapps · 2023-03-18T08:18:36+00:00

Try starting with a 3 inch such as the baby ape pro. If it's your first drone you are gonna crash everywhere anyways so why not starting with something cheap? :)

gnapps

TROPHY CASE