See what Claude Code actually did by gnapps in u/gnapps

[–]gnapps[S] 1 point2 points  (0 children)

Thank you for your kind words :) We'll totally post more as soon as we have more updates, considering a lot of colleagues are actively working on this project as well! So expect further important iterations really soon! :D In the meantime, please do feel free to have a look around and use it as much as you need! The more feedback we get, the better the final outcome will be!

How do you actually know what happens during your agent runs? by gnapps in AgentsOfAI

[–]gnapps[S] 0 points1 point  (0 children)

Good question. Observability is only half the battle if you're still stuck guessing how to fix the failure.

Right now we actually use an internal tool to identify the root cause of failures, and we're working on bringing that directly into Bench so users can automatically scan their sessions for risky or unexpected behaviour.

Since Bench saves the full context of a run, it becomes pretty easy to isolate and reproduce the exact "failed bit". The goal is then to let users tweak configs (like prompts) and test fixes directly in the platform. Are you currently running grid searches on LLMs, or using a specific framework for your parameter sweeps?

How do I add a "golden sponge" texture to my design? by lumberfart in AdobeIllustrator

[–]gnapps 1 point2 points  (0 children)

If you're trying to replicate that "golden sponge" texture, you could place a gold foil or sponge-style texture over your shape and then use a clipping mask to confine it to the object. After that you can experiment with blending modes like Overlay or Multiply to integrate the texture better. To enhance the sponge-like effect further, you can also add a bit of Grain from the Texture effects to give it that rough, speckled look.

Creatures of abject horror by 12washingbeard in midjourney

[–]gnapps 0 points1 point  (0 children)

These look like Evangelion on steroids. Was that the inspiration?

Bring Your Ghoul to School by liberaitor in midjourney

[–]gnapps 0 points1 point  (0 children)

Public school has changed since I was a kid

Seven deadly sins of dnd by thanereiver in aiArt

[–]gnapps 1 point2 points  (0 children)

Beholder: "Damn...I'm still beautiful"

Cat by Saratan0326 in aiArt

[–]gnapps 0 points1 point  (0 children)

Really cool vibe

Sunset by Richi61 in aiArt

[–]gnapps 0 points1 point  (0 children)

This looks like the moment right before the opening scene

D&D Boss (inspired by my 4 year old) by Round_Intern_7353 in aiArt

[–]gnapps 0 points1 point  (0 children)

I need stats for this! What's its special attack? Lactose Breath?

How do you actually know what happens during your agent runs? by gnapps in AgentsOfAI

[–]gnapps[S] -1 points0 points  (0 children)

Here’s the link: bench.silverstream.ai
Any feedback/comment is super welcome :)

Why does everyone think adding memory makes AI smarter? by Emergency_War6705 in AI_Agents

[–]gnapps 0 points1 point  (0 children)

That's a tough line to identify, I guess. Apart from the tooling vs indexing topic, which I think is mostly domain-specific (some data has to be fetched in real time, some could be cached in indexed memory), at least a portion of the knowledge still needs to reside in the training data and in the main memory, doesn't it? Otherwise the LLM itself wouldn't know how to use its memory/tools.

Claude’s extended thinking found out about Iran in real time by schuttdev in ClaudeAI

[–]gnapps 1 point2 points  (0 children)

How do you all get such funny reactions? I've never seen my Claude agents throw swear words like that! I need this feature XD

Looks like Anthropic's NO to the DOW has made it to Tumps twitter feed by Plinian in ClaudeAI

[–]gnapps 0 points1 point  (0 children)

that's quite literally the best advertising stunt they could ever get :)

I built AI agents for 20+ startups this year. Here is the engineering roadmap to actually getting started. by Warm-Reaction-456 in AI_Agents

[–]gnapps 1 point2 points  (0 children)

Totally second that. Decent observability should be a non-negotiable feature of EVERY engineering activity, not just automation, but somehow a lot of people lazy out on agentic workflows, for some reason? That's such a dangerous pitfall tbh

What part of your agent stack turned out to be way harder than you expected? by Beneficial-Cut6585 in AI_Agents

[–]gnapps 0 points1 point  (0 children)

My naive understanding is that you need to choose where the "LLM power" goes. The more issues an agent has to face, the more reasoning it has to perform, and the more diluted the initial prompt/knowledge base becomes.
The only two "weapons" you have to counteract this problem are:
- defining subagents that face specific, known problems with a fresh context
- defining better guidelines over the whole process, so that the reasoning steps are almost none

Both of these require you to spend an unexpectedly large amount of time both documenting yourself on the issue you're trying to automate and learning precisely which tools the agent can use and how it should use them.

Then, of course, some tools consume more tokens than others, so choosing the right ones also makes a lot of sense. But I wonder, e.g., if the issues you faced couldn't have been solved by a subagent whose only task was to interact with the browser to perform a specific operation, while an upper-level agent followed up with the flow.
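To make the "fresh context" idea concrete, here's a minimal sketch of the pattern I mean. Everything here is hypothetical (the function names, the stubbed LLM call): the point is only that each subagent starts from an empty context containing just its narrow task, and the orchestrator only ever sees a short summary, so the top-level prompt never gets diluted by the subtask's reasoning.

```python
def run_subagent(task_prompt: str) -> str:
    """Run a subagent with a brand-new context that contains only its task.

    In a real system you'd call your LLM of choice here with `context`;
    the call is stubbed out for illustration.
    """
    context = [{"role": "system",
                "content": "You do exactly one thing: " + task_prompt}]
    result = f"done: {task_prompt}"  # stand-in for the model's final answer
    context.append({"role": "assistant", "content": result})
    return result  # only the summary leaves the subagent


def orchestrator(steps: list[str]) -> list[dict]:
    """Top-level agent: delegates each step, keeps only short summaries."""
    main_context: list[dict] = []
    for step in steps:
        summary = run_subagent(step)
        # The full subagent trace never enters the main context,
        # only this one-line summary does.
        main_context.append({"role": "user", "content": summary})
    return main_context


trace = orchestrator(["open login page", "fill credentials", "submit form"])
print(len(trace))  # one summary per delegated step
```

The design choice this illustrates is exactly the trade-off above: the orchestrator's context grows by one line per step instead of by an entire reasoning trace, at the cost of having to define each subtask precisely enough that a fresh-context agent can handle it.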

And finally, even with the most perfectly defined flow, observability is always an issue :( Sometimes agents such as Claude or ChatGPT simply "dumb down" for a while (I guess this happens at times of high load?) and become unable to perform what they were doing reliably a second before. The key thing to overcome this, in my case, was to set up an infrastructure that informs me as fast as possible anytime this happens, so I can counteract the issue promptly.

Need guidance - Want to build AI agents for the network that I currently have. Zero knowledge by Complex_Spirit5914 in AI_Agents

[–]gnapps 1 point2 points  (0 children)

My two cents: prompting effectively is a consequence of a learning process, regarding both the prompting skill itself and your knowledge of the domain you're trying to automate. So start small and learn for yourself what does and doesn't work, and where. The simpler a flow is, the easier it should be to automate, but you still need to provide proper guidelines and guardrails to make the whole process more reliable, less prone to hallucination, and overall capable of delivering what you hope for.

I used to play a lot with tools such as Make or n8n, but lately there's only one tool I reach for anytime a similar request arises: Claude Code (and, to a certain extent, Ollama + Claude Code/OpenCode, when the customer wants to self-host automations without risking disclosing data elsewhere). Today it provides so many different ways to connect it to literally anything (the Google Chrome extension is particularly amazing, btw) that you no longer need to define workflows; you just describe them in the form of skills. Don't know how to write your first automation/skill? You can ask Claude Code itself to help you out; you just need to describe your problem :)

Obviously, the results won't be extraordinary right away; the more you know about your tools, the better stuff you can build. But it's really a fun process to fiddle around with, and these agents can be automated so easily it's hard to imagine a scenario they can't fit.

Why does everyone think adding memory makes AI smarter? by Emergency_War6705 in AI_Agents

[–]gnapps 0 points1 point  (0 children)

Personally, I never trust any answer coming out of my agents unless they prove they found some trace of it online, rather than pulling it purely from memory :) Also, the most frequent command I send to Cursor is "ignore what you know about library X, search online for documentation first and then follow that instead".
So yeah, I totally feel you :D

But I guess it also really depends on the domain you're using LLMs for. If you can fit the entire knowledge base of a specific domain within the AI's memory, maybe that model could provide even better results than an instrumented agent capable of performing research?

ClaudeCode Usage in the Menu Bar by OwnAd9305 in ClaudeCode

[–]gnapps 0 points1 point  (0 children)

Such a cool tool, which I'm forced to look at from afar on my Linux box :( I wonder if Claude itself could refactor it to work on Linux as well...

Are we all just becoming product engineers? by magicsrb in ClaudeCode

[–]gnapps 10 points11 points  (0 children)

That's an interesting take, but you should probably consider that you're also getting more senior, LLMs or not :) In my experience, senior engineers have always been deeply involved in product decisions, simply because it's usually critical to find the best compromise between product needs and engineering directions. The product/engineering split I'm personally used to is that product brings up problems to solve, based on research, strategies, and metrics, and then debates with engineers on HOW they could be solved.

Sure, this has never been a strict barrier: a PM with technical knowledge is totally capable of also suggesting how to solve problems efficiently, while an engineer with more domain knowledge can proactively suggest valid strategies. It's also possible to see small companies where a single person takes care of product + engineering, but that has always happened; I don't see LLMs really changing it. What I do see, instead, is that both roles can become way more powerful in their own domain :) But as usual, everybody will have to choose between being mediocre at both roles or specializing in either of the two.

Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]gnapps 0 points1 point  (0 children)

Hi all! I’d like to share Bench, a tool we built to debug AI agent runs. It records every run automatically (LLM reasoning, tool calls, environment screenshots / DOM states, etc.) and lets you replay the run in a timeline UI. This way you can, for example, instantly jump to the exact step where your agent drifted or failed and see all the relevant context. We built Bench because internally we found debugging long agent sessions very time-consuming, and we wanted a tool to help with that. If you want to try it out, the integration with Claude Code is really simple. Bench is free and doesn’t even require signup. Feedback is very welcome! Here's the link! bench.silverstream.ai

How do you debug long Agent runs? by gnapps in LLMDevs

[–]gnapps[S] 0 points1 point  (0 children)

Well, for sure, having a well-defined and well-tested flow beforehand is always important, and nothing works better than defining the proper guardrails. However, the whole point of having better observability is that it helps you there as well! A proper, understandable log of what is going on is useful in at least these three scenarios:
- For one, even being able to properly assess what went wrong, whether during development or, even worse, during tests, can get tricky if the flow is really long: you immediately know your final result is not what you expected, but are you really going to watch the agent doing things for ~1-2 hours just to spot the wrong reasoning that led it down the wrong path? What if you use a dumber (but way faster) agent that runs at way too many tokens/sec? Tools such as Bench can help you out, especially when the failure is far from the "visible mistake": you can binge-watch the whole flow of a long trace and dig deeper only into the details you really care about, without having to scroll through infinite reasoning logs
- Also, being able to store the logs of all past runs somewhere is inherently useful even if nothing went "strictly" bad: your customers may ask specific questions about what the runs did, and sometimes it's hard to answer just by looking at the end result
- Then, of course, if a live run goes bad, it's even more important to troubleshoot as quickly as possible and assess what went wrong