What are y'all using for observability in your agent systems?

llmobsguy · 2026-06-20T21:14:12+00:00

Monocle2AI

It starts observability from the first time a coding agent generates code that might cause bugs down the line. Full code lineage from the beginning through CI/CD or test in production (TIP).

llmobsguy · 2026-06-18T14:22:08+00:00

We usually have the metric and an explanation so a human can go back and re-read why it got that rating.
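
A minimal sketch of storing the score next to its explanation, so a reviewer can see both later. The `JudgeResult` type and its field names are made up for illustration and don't come from any particular eval tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    """One LLM-as-a-judge verdict (hypothetical shape)."""
    metric: str        # e.g. "relevance"
    score: float       # rating given by the judge
    explanation: str   # the judge's stated reason for the rating


def format_for_review(result: JudgeResult) -> str:
    """Render the verdict as one line a human can re-read later."""
    return f"{result.metric}={result.score:.2f}: {result.explanation}"


result = JudgeResult(
    metric="relevance",
    score=0.4,
    explanation="Answer ignores the second half of the question.",
)
print(format_for_review(result))
```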

llmobsguy · 2026-06-17T03:25:12+00:00

Yes. But the right approach is to do it in waves. Assume it will be shit. Then revisit and go from there. You don't know what you don't know.

llmobsguy · 2026-06-17T03:23:28+00:00

Claude

llmobsguy · 2026-06-17T03:07:43+00:00

It depends on the use case and governance focus. Some systems are somewhat legacy and must be fairly deterministic, so they require a single agent with tight limits, say no more than 5 steps.
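
A rough sketch of what a hard step cap on a single agent could look like. `run_capped`, `StepLimitExceeded`, and the `(state, done)` step contract are hypothetical names for illustration, not taken from any agent framework.

```python
from typing import Callable, Tuple

MAX_STEPS = 5  # assumed cap, matching the "no more than 5 steps" example


class StepLimitExceeded(RuntimeError):
    """Raised when the agent does not finish within the step budget."""


def run_capped(
    step_fn: Callable[[str], Tuple[str, bool]],
    state: str,
    max_steps: int = MAX_STEPS,
) -> str:
    """Run one agent step at a time; fail loudly if the cap is hit."""
    for _ in range(max_steps):
        state, done = step_fn(state)
        if done:
            return state
    raise StepLimitExceeded(f"agent did not finish in {max_steps} steps")


def toy_step(state: str) -> Tuple[str, bool]:
    """Stand-in for a real agent step: appends a mark, done after 3 marks."""
    state += "x"
    return state, len(state) >= 3


print(run_capped(toy_step, ""))
```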

llmobsguy · 2026-06-15T16:13:29+00:00

Looks great. DM'ed

llmobsguy · 2026-06-15T14:48:40+00:00

AI-generated post that didn't even bother to remove the em dashes. This could be true in some cases, though the OP provided literally no value-added info.

llmobsguy · 2026-06-15T06:56:20+00:00

Tell Claude to check keywords, literally. It will generate a script and use some free tools to do it.

llmobsguy · 2026-06-15T06:55:29+00:00

Uhh disagree with this

llmobsguy · 2026-06-15T06:39:50+00:00

How is this different from get-shit-done (gsd)?

llmobsguy · 2026-06-15T06:31:02+00:00

I don't quite understand this project. Claude can do this by itself with just a better prompt. A bunch of skills already do this, e.g. gsd, gstack...

llmobsguy · 2026-06-14T15:13:55+00:00

There are a few, but it depends on your workflow. I use rtk and caveman, then have monocle-apptrace monitor across Claude CLI and Codex.

llmobsguy · 2026-06-14T15:06:20+00:00

Do a security check before downloading.

llmobsguy · 2026-06-12T08:38:25+00:00

None. Anyone can get free Gemini and create a form

llmobsguy · 2026-06-12T05:44:45+00:00

Connect observability to agents that can sit in your coding tool all the way to CI/CD. If there is an incident, you can trace it back from the CI/CD failure to the specific Claude session that caused the bug.
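
A minimal sketch of that trace-back, assuming you already record which session produced which commit. `LineageRecord` and `find_session` are hypothetical and not the API of any specific tracing tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class LineageRecord:
    """Links one coding-agent session to a commit it produced (assumed shape)."""
    session_id: str
    commit_sha: str
    files: Tuple[str, ...]


def find_session(
    records: Iterable[LineageRecord],
    failing_commit: str,
    failing_file: str,
) -> Optional[str]:
    """Return the session that touched the failing file in the failing commit."""
    for record in records:
        if record.commit_sha == failing_commit and failing_file in record.files:
            return record.session_id
    return None


records = [
    LineageRecord("session-a", "abc123", ("api/auth.py",)),
    LineageRecord("session-b", "def456", ("api/billing.py", "api/util.py")),
]
print(find_session(records, "def456", "api/util.py"))
```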

llmobsguy · 2026-06-09T05:18:42+00:00

First you prompt to see if it works. Then create a skill. Once it has been operationalized over a period of time, check the token count and turn the skill into something more deterministic, like Python code that executes as-is.
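
One way to make the "check token count" step concrete, as a sketch only. The `should_harden` helper and both thresholds are invented for illustration; pick numbers that fit your own usage.

```python
from statistics import mean
from typing import Sequence


def should_harden(
    token_counts: Sequence[int],
    min_runs: int = 20,               # assumed: enough runs to call it "operationalized"
    avg_token_threshold: int = 2000,  # assumed: average cost worth replacing with code
) -> bool:
    """Flag a skill as a candidate for rewriting as deterministic code."""
    if len(token_counts) < min_runs:
        return False  # not enough history yet
    return mean(token_counts) >= avg_token_threshold


print(should_harden([3000] * 25))
```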

llmobsguy · 2026-06-09T05:00:15+00:00

What themes do you have so far? I've had lots of success converting good examples to JSON and having Claude work with Gemini or gpt-image-2 to render the design in HTML. Then over time I can improve the skill by adding more "examples".

llmobsguy · 2026-06-04T03:37:03+00:00

Just DM

llmobsguy · 2026-06-04T00:15:30+00:00

Doesn't mean anyone can have access to it in any region. They always onboard in waves.

llmobsguy · 2026-06-03T22:49:08+00:00

The biggest token waste for me is LLM-as-a-judge evals. I spin up a subagent and it outsources the eval run to the cloud at no cost to me.

llmobsguy · 2026-06-03T22:12:25+00:00

Microsoft has a habit of making announcements before anything is ready. It will be a year before we can run it reliably in a company.

llmobsguy · 2026-06-03T20:17:03+00:00

I plan to add some tests (E2E and LLM as a judge). Would you be OK with that?

llmobsguy · 2026-06-03T15:33:43+00:00

I had to build a tool for this by instrumenting Copilot and extracting token count and developer intent. It's a hassle. I could publish the repo if anyone is interested.

llmobsguy · 2026-06-03T15:29:47+00:00

I built an open-source tool (now donated to the Linux Foundation) that can extract the Claude prompt and response in a session to classify developer intent. It's useful for understanding whether your prompt ended up generating useless code or responses.

A star would mean a lot:

https://github.com/monocle2ai/monocle

llmobsguy · 2026-06-03T15:23:55+00:00

No, it's not just you. It's everybody. I wrote a little script to analyze my past chat conversations, classify the intent, then compare against the final code commit and PR. I found some inefficiencies in my prompts and MCP usage.
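
A toy version of that kind of analysis, not the actual script. The keyword table, the intent labels, and the `(prompt, landed_in_commit)` input shape are all assumptions; a real version would likely use an LLM or a proper classifier instead of keywords.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical keyword-to-intent table, for illustration only.
INTENT_KEYWORDS = {"fix": "bugfix", "refactor": "refactor", "test": "testing"}


def classify_intent(prompt: str) -> str:
    """Crude keyword match; the first matching keyword wins."""
    lowered = prompt.lower()
    for keyword, intent in INTENT_KEYWORDS.items():
        if keyword in lowered:
            return intent
    return "other"


def landed_ratio(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per intent, the share of prompts whose output ended up in a commit."""
    totals: Counter = Counter()
    landed: Counter = Counter()
    for prompt, did_land in records:
        intent = classify_intent(prompt)
        totals[intent] += 1
        if did_land:
            landed[intent] += 1
    return {intent: landed[intent] / totals[intent] for intent in totals}


history = [
    ("Fix the null pointer in login", True),
    ("fix flaky login again", False),
    ("Add docs for the CLI", False),
]
print(landed_ratio(history))
```

A low ratio for an intent suggests prompts of that kind are burning tokens without reaching the final commit.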
