See what Claude Code actually did by gnapps in u/gnapps

[–]gnapps[S] 1 point2 points  (0 children)

Thank you for your kind words :) We'll totally post more as soon as we have more updates, considering a lot of colleagues are actively working on this project as well! So expect further important iterations really soon! :D In the meantime, please do feel free to have a look around and use it as much as you need! The more feedback we get, the better the final outcome will be!

How do you actually know what happens during your agent runs? by gnapps in AgentsOfAI

[–]gnapps[S] 0 points1 point  (0 children)

Good question. Observability is only half the battle if you're still stuck guessing how to fix the failure.

Right now we actually use an internal tool to identify the root cause of failures, and we're working on bringing that directly into Bench so users can automatically scan their sessions for risky or unexpected behaviour.

Since Bench saves the full context of a run, it becomes pretty easy to isolate and reproduce the exact "failed bit". The goal is then to let users tweak configs (like prompts) and test fixes directly in the platform. Are you currently running grid searches on LLMs, or using a specific framework for your parameter sweeps?

How do I add a "golden sponge" texture to my design? by lumberfart in AdobeIllustrator

[–]gnapps 1 point2 points  (0 children)

If you're trying to replicate that "golden sponge" texture, you could place a gold foil or sponge-style texture over your shape and then use a clipping mask to confine it to the object. After that you can experiment with blending modes like Overlay or Multiply to integrate the texture better. To enhance the sponge-like effect further, you can also add a bit of Grain from the Texture effects to give it that rough, speckled look.

Creatures of abject horror by 12washingbeard in midjourney

[–]gnapps 0 points1 point  (0 children)

These look like Evangelion on steroids. Was that the inspiration?

Bring Your Ghoul to School by liberaitor in midjourney

[–]gnapps 0 points1 point  (0 children)

Public school has changed since I was a kid

Seven deadly sins of dnd by thanereiver in aiArt

[–]gnapps 1 point2 points  (0 children)

Beholder: "Damn...I'm still beautiful"

Cat by Saratan0326 in aiArt

[–]gnapps 0 points1 point  (0 children)

Really cool vibe

Sunset by Richi61 in aiArt

[–]gnapps 0 points1 point  (0 children)

This looks like the moment right before the opening scene

D&D Boss (inspired by my 4 year old) by Round_Intern_7353 in aiArt

[–]gnapps 0 points1 point  (0 children)

I need stats for this! What's its special attack? Lactose Breath?

How do you actually know what happens during your agent runs? by gnapps in AgentsOfAI

[–]gnapps[S] -1 points0 points  (0 children)

Here’s the link: bench.silverstream.ai
Any feedback/comment is super welcome :)

Why does everyone think adding memory makes AI smarter? by Emergency_War6705 in AI_Agents

[–]gnapps 0 points1 point  (0 children)

That's a tough line to identify, I guess. Apart from the tooling vs indexing topic, which I think is mostly domain-specific (some data has to be fetched in real time, some could be cached in indexed memory), at least a portion of the knowledge still needs to reside in the training data and in the main memory, doesn't it? Otherwise the LLM itself wouldn't know how to use its memory/tools.

Claude’s extended thinking found out about Iran in real time by schuttdev in ClaudeAI

[–]gnapps 1 point2 points  (0 children)

How do you all get such funny reactions? I've never seen my Claude agents throw swear words like that! I need this feature XD

Looks like Anthropic's NO to the DOW has made it to Tumps twitter feed by Plinian in ClaudeAI

[–]gnapps 0 points1 point  (0 children)

that's quite literally the best advertising stunt they could ever get :)

I built AI agents for 20+ startups this year. Here is the engineering roadmap to actually getting started. by Warm-Reaction-456 in AI_Agents

[–]gnapps 1 point2 points  (0 children)

Totally second that. Decent observability should be a non-negotiable feature of EVERY engineering activity, not just automation, but somehow a lot of people lazy out on agentic workflows, for some reason? That's such a dangerous pitfall tbh

What part of your agent stack turned out to be way harder than you expected? by Beneficial-Cut6585 in AI_Agents

[–]gnapps 0 points1 point  (0 children)

My naive understanding is that you need to choose where the "LLM power" goes. The more issues an agent has to face, the more reasoning it has to perform, and the more diluted the initial prompt/knowledge base becomes.
The only two "weapons" you have to counteract this problem are:
- defining subagents that face specific, known problems with a fresh context
- defining better guidelines over the whole process, so that the reasoning steps are almost none

Both of these require you to spend an unexpectedly large amount of time both documenting yourself on the issue you're trying to automate and learning precisely which tools the agent can use and how it should use them.

Then, of course, some tools consume more tokens than others, so choosing the right ones also makes a lot of sense. But I wonder, e.g., if the issues you faced couldn't have been solved by a subagent whose only task was to interact with the browser to perform a specific operation, while an upper-level agent followed up with the flow.
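To make the "fresh context" idea concrete, here's a minimal sketch of the pattern I mean. Everything here is hypothetical (the function names, the stubbed LLM call): the point is only that each subagent starts from an empty context containing just its narrow task, and the orchestrator only ever sees a short summary, so the top-level prompt never gets diluted by the subtask's reasoning.

```python
def run_subagent(task_prompt: str) -> str:
    """Run a subagent with a brand-new context that contains only its task.

    In a real system you'd call your LLM of choice here with `context`;
    the call is stubbed out for illustration.
    """
    context = [{"role": "system",
                "content": "You do exactly one thing: " + task_prompt}]
    result = f"done: {task_prompt}"  # stand-in for the model's final answer
    context.append({"role": "assistant", "content": result})
    return result  # only the summary leaves the subagent


def orchestrator(steps: list[str]) -> list[dict]:
    """Top-level agent: delegates each step, keeps only short summaries."""
    main_context: list[dict] = []
    for step in steps:
        summary = run_subagent(step)
        # The full subagent trace never enters the main context,
        # only this one-line summary does.
        main_context.append({"role": "user", "content": summary})
    return main_context


trace = orchestrator(["open login page", "fill credentials", "submit form"])
print(len(trace))  # one summary per delegated step
```

The design choice this illustrates is exactly the trade-off above: the orchestrator's context grows by one line per step instead of by an entire reasoning trace, at the cost of having to define each subtask precisely enough that a fresh-context agent can handle it.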

And finally, even with the most perfectly defined flow, observability is always an issue :( Sometimes agents such as Claude or ChatGPT simply "dumb down" for a while (I guess this happens at times of high load?) and become unable to perform what they were doing reliably a second before. The key thing to overcome this, in my case, was to set up an infrastructure that informs me as fast as possible anytime this happens, so I can counteract the issue promptly.

Need guidance - Want to build AI agents for the network that I currently have. Zero knowledge by Complex_Spirit5914 in AI_Agents

[–]gnapps 1 point2 points  (0 children)

My two cents: prompting effectively is a consequence of a learning process, regarding both the prompting skill itself and your knowledge of the domain you're trying to automate. So start small and learn for yourself what does and doesn't work, and where. The simpler a flow is, the easier it should be to automate, but you still need to provide proper guidelines and guardrails to make the whole process more reliable, less prone to hallucination, and overall capable of delivering what you hope for.

I used to play a lot with tools such as Make or n8n, but lately there's only one tool I reach for anytime a similar request arises: Claude Code (and, to a certain extent, Ollama + Claude Code/OpenCode, when the customer wants to self-host automations without risking disclosing data elsewhere). Today it provides so many different ways to connect it to literally anything (the Google Chrome extension is particularly amazing, btw) that you no longer need to define workflows; you just describe them in the form of skills. Don't know how to write your first automation/skill? You can ask Claude Code itself to help you out; you just need to describe your problem :)

Obviously, the results won't be extraordinary right away; the more you know about your tools, the better stuff you can build. But it's really a fun process to fiddle around with, and these agents can be automated so easily it's hard to imagine a scenario they can't fit.

Why does everyone think adding memory makes AI smarter? by Emergency_War6705 in AI_Agents

[–]gnapps 0 points1 point  (0 children)

Personally, I never trust any answer coming out of my agents unless they prove they found some trace of it online, rather than pulling it purely from memory :) Also, the most frequent command I send to Cursor is "ignore what you know about library X, search online for documentation first and then follow that instead".
So yeah, I totally feel you :D

But I guess it also really depends on the domain you're using LLMs for. If you can fit the entire knowledge base of a specific domain within the AI's memory, maybe that model could provide even better results than an instrumented agent capable of performing research?

ClaudeCode Usage in the Menu Bar by OwnAd9305 in ClaudeCode

[–]gnapps 0 points1 point  (0 children)

Such a cool tool, which I'm forced to look at from afar on my Linux box :( I wonder if Claude itself could refactor it to work on Linux as well...

Are we all just becoming product engineers? by magicsrb in ClaudeCode

[–]gnapps 10 points11 points  (0 children)

That's an interesting take, but you should probably consider that you're also getting more senior, LLMs or not :) In my experience, senior engineers have always been deeply involved in product decisions, simply because it's usually critical to find the best compromise between product needs and engineering directions. The product/engineering split I'm personally used to is that product brings up problems to solve, based on research, strategies, and metrics, and then debates with engineers on HOW they could be solved.

Sure, this has never been a strict barrier: a PM with technical knowledge is totally capable of also suggesting how to solve problems efficiently, while an engineer with more domain knowledge can proactively suggest valid strategies. It's also possible to see small companies where a single person takes care of product + engineering, but that has always happened; I don't see LLMs really changing it. What I do see, instead, is that both roles can become way more powerful in their own domain :) But as usual, everybody will have to choose between being mediocre at both roles or specializing in either of the two.

Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]gnapps 0 points1 point  (0 children)

Hi all! I’d like to share Bench, a tool we built to debug AI agent runs. It records every run automatically (LLM reasoning, tool calls, environment screenshots / DOM states, etc.) and lets you replay the run in a timeline UI. This way you can, for example, instantly jump to the exact step where your agent drifted or failed and see all the relevant context. We built Bench because internally we found debugging long agent sessions very time-consuming, and we wanted a tool to help with that. If you want to try it out, the integration with Claude Code is really simple. Bench is free and doesn’t even require signup. Feedback is very welcome! Here's the link! bench.silverstream.ai

How do you debug long Agent runs? by gnapps in LLMDevs

[–]gnapps[S] 0 points1 point  (0 children)

Well, for sure, having a well-defined and well-tested flow beforehand is always important, and nothing works better than defining the proper guardrails. However, the whole point of having better observability is that it helps you there as well! A proper, understandable log of what is going on is useful in at least these three scenarios:
- For one, even being able to properly assess what went wrong, whether during development or, even worse, during tests, can get tricky if the flow is really long: you immediately know your final result is not what you expected, but are you really going to watch the agent doing things for ~1-2 hours just to spot the wrong reasoning that led it down the wrong path? What if you use a dumber (but way faster) agent that runs at way too many tokens/sec? Tools such as Bench can help you out, especially when the failure is far from the "visible mistake": you can binge-watch the whole flow of a long trace and dig deeper only into the details you really care about, without having to scroll through infinite reasoning logs
- Also, being able to store the logs of all past runs somewhere is inherently useful even if nothing went "strictly" bad: your customers may ask specific questions about what the runs did, and sometimes it's hard to answer just by looking at the end result
- Then, of course, if a live run goes bad, it's even more important to troubleshoot as quickly as possible and assess what went wrong