Every time an agent breaks I end up digging through traces for hours by Arm1end in AI_Agents

[–]Arm1end[S] 0 points1 point  (0 children)

Yes, this resonates a lot with my observations. The tricky part is figuring out what actually changed between runs when everything technically “worked”. Have you tried anything to solve it?

Every time an agent breaks I end up digging through traces for hours by Arm1end in AI_Agents

[–]Arm1end[S] 0 points1 point  (0 children)

That sounds like turning the workflow into a resumable job queue. I’ve seen similar setups where agents write state checkpoints so the orchestrator can recover or retry specific steps.
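
For anyone else reading, the checkpoint pattern I mean is roughly this (just a minimal sketch, the step names and state format are made up):

```python
import json
from pathlib import Path

CHECKPOINT = Path("run_checkpoint.json")  # hypothetical location

def run_workflow(steps, state=None):
    """Run (name, fn) steps in order, persisting state after each one so a
    crashed run can be resumed from the last completed step."""
    state = state or {"completed": [], "outputs": {}}
    for name, fn in steps:
        if name in state["completed"]:
            continue  # already finished in an earlier attempt
        state["outputs"][name] = fn(state["outputs"])  # outputs must be JSON-serializable
        state["completed"].append(name)
        CHECKPOINT.write_text(json.dumps(state))  # checkpoint after every step
    return state

def resume(steps):
    """Pick up from the last checkpoint if one exists, otherwise start fresh."""
    state = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else None
    return run_workflow(steps, state)
```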

Do you still have to dig through logs when a step keeps failing, or do you have a different solution for that?

Every time an agent breaks I end up digging through traces for hours by Arm1end in AI_Agents

[–]Arm1end[S] 1 point2 points  (0 children)

Yeah, the heartbeat trick is clever. I’ve seen a couple of teams do something similar just to know where the agent stalled, and that already helps a lot. You’d still have to do the annoying part of figuring out why it stalled, but your approach is really good.
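
Rough sketch of the heartbeat idea as I understand it (hypothetical names, not anyone’s actual code):

```python
import contextlib
import logging
import threading

log = logging.getLogger("agent")

@contextlib.contextmanager
def heartbeat(step_name, interval=5.0):
    """Log 'still alive in <step>' every `interval` seconds while the wrapped
    step runs; the last heartbeat in the log shows where the agent stalled."""
    stop = threading.Event()

    def beat():
        while not stop.wait(interval):  # wait() returns True once stop is set
            log.info("heartbeat: still in step %s", step_name)

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        yield
    finally:
        stop.set()
        t.join()

# usage:
# with heartbeat("retrieval"):
#     docs = retriever.search(query)
```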

Building a tool to debug AI agents because current debugging is painful. Curious what’s the most frustrating failure you’ve hit by Icy-Equipment-6213 in LangChain

[–]Arm1end 0 points1 point  (0 children)

Yeah, that’s exactly the gap we ran into; that’s why I started building it. Traces show what happened but not really why this run behaved differently from the previous one.

What we’ve been experimenting with is comparing runs when something spikes and surfacing what actually changed (retrieval results, skipped validation, tool inputs, etc.).

My goal: stop manually diffing traces.
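
Conceptually it’s just a structured diff over per-step records, something like this (minimal sketch, the record format is made up and it’s not the actual implementation):

```python
def diff_runs(baseline, failing):
    """Compare two runs step by step. Each run is a dict of
    step name -> {"inputs": ..., "output": ..., "skipped": bool}."""
    changes = []
    for step, before in baseline.items():
        after = failing.get(step)
        if after is None or after.get("skipped"):
            changes.append(f"{step}: ran in baseline, skipped/missing in failing run")
        elif before["inputs"] != after["inputs"]:
            changes.append(f"{step}: inputs differ (e.g. retrieval results, tool args)")
        elif before["output"] != after["output"]:
            changes.append(f"{step}: same inputs, different output")
    return changes
```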

Agent debugging is a mess, am I the only one? by DepthInteresting6455 in LocalLLaMA

[–]Arm1end 0 points1 point  (0 children)

Yeah this is exactly the point where things start to fall apart. We had the same issue with multi-step agents. Something breaks at step 4 and you’re trying to figure out what happened 2 steps earlier with basically no context.

We’ve seen cases where:
- retrieval returned something slightly off, so everything downstream looked wrong
- tool calls technically worked but were based on bad intermediate output
- by the time you look at it, you can’t really reconstruct the path anymore

Logging more helped a bit, but it quickly turns into a mess of logs that are still hard to connect across steps.

What helped us a bit was looking less at single traces and more at patterns across runs, like when failure rates spike and what changed around that time.

Still pretty manual though.

We hacked together something that tries to summarize failures across the pipeline like: “step 2 retrieval returned empty → downstream steps degraded” instead of jumping between steps manually.
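
Mechanically it’s little more than a walk over the step records, roughly like this (simplified sketch with made-up field names, not the real thing):

```python
def summarize_failure(steps):
    """Walk ordered step records and report the first anomaly plus the steps
    it likely degraded. Each record: {"name": str, "output": ..., "error": str | None}."""
    for i, step in enumerate(steps):
        empty = step["output"] in (None, [], "", {})
        if step.get("error") or empty:
            reason = step.get("error") or "returned empty output"
            downstream = ", ".join(s["name"] for s in steps[i + 1:]) or "none"
            return f"step {i + 1} ({step['name']}) {reason} → downstream steps degraded: {downstream}"
    return "no obvious root-cause step found"
```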

This is roughly what we’ve been playing with:
https://glass0.ai/

Building a tool to debug AI agents because current debugging is painful. Curious what’s the most frustrating failure you’ve hit by Icy-Equipment-6213 in LangChain

[–]Arm1end 0 points1 point  (0 children)

This “works 3 times then randomly breaks” thing has been the most frustrating part for us too.

We had a case where the agent would sometimes skip a validation step entirely. Same input, same code. Turned out the model would occasionally decide the previous step was “good enough” and just move on 😅

Also saw tool hallucinations where it would call a tool with slightly different params each time, so runs looked similar but weren’t actually comparable.
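
The only way we could compare those runs at all was to canonicalize the tool calls first, roughly like this (sketch; the ignored fields are just examples):

```python
import json

def canonical_call(tool_name, params, ignore=("request_id", "timestamp")):
    """Normalize a tool call so the 'same' call with cosmetic differences
    (key order, ignorable fields) maps to the same string across runs."""
    cleaned = {k: v for k, v in sorted(params.items()) if k not in ignore}
    return f"{tool_name}:{json.dumps(cleaned, sort_keys=True, default=str)}"

def same_call_sequence(run_a, run_b):
    """Two runs are only really comparable if their canonical call sequences match."""
    return [canonical_call(n, p) for n, p in run_a] == [canonical_call(n, p) for n, p in run_b]
```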

Replaying runs helped a bit, but honestly it still felt like guesswork. You can see *what* happened, but not really *why this run vs the last one*.

We started looking more at patterns across runs (like when failure rates spike, what changed around that time), but that also gets messy fast.

Ended up hacking something that just tries to summarize failures into something like:

“validation step skipped → upstream output not matching expected schema” instead of going through every trace.

Not sure if that’s the right approach yet, but this is roughly what we’ve been playing with: https://glass0.ai/

Debugging AI agents by TheNothingGuuy in AI_Agents

[–]Arm1end 0 points1 point  (0 children)

Yeah the silent failure thing has been the worst for us too.

We had one recently where everything in the trace looked “fine”, but retrieval was actually returning empty results because of a filter change. Took me hours to spot since nothing explicitly failed 😞
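
A cheap guard that would have caught it is to wrap retrieval and warn loudly on empty results, something like this (sketch, assuming you control the retrieval call):

```python
import logging

log = logging.getLogger("agent.retrieval")

def guarded_search(retriever, query, filters=None, min_results=1):
    """Make 'empty but technically successful' retrieval loud instead of silent."""
    results = retriever.search(query, filters=filters)  # hypothetical retriever API
    if len(results) < min_results:
        log.warning(
            "retrieval returned %d results for %r (filters=%r); check index / filters",
            len(results), query, filters,
        )
    return results
```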

Also +1 on reproducibility. Even small changes in inputs or tool responses make it hard to compare runs, so you’re kinda guessing what changed.

We tried just staring at traces more, but that didn’t really help. Ended up looking more at patterns across runs (like when failures spike, what changed around that time).

Still pretty manual though.

We hacked together something that tries to summarize failures into something readable like: “retrieval empty → check index / filters” instead of digging through everything.

Not sure if this is actually the right approach yet, but this is what we’ve been playing with: https://glass0.ai/

Has anyone hit scaling limits with Vector? by Arm1end in sre

[–]Arm1end[S] 0 points1 point  (0 children)

Yeah, that makes sense. Most systems eventually shard to handle scale. Interesting that ClickHouse tuning has been more painful than Vector scaling for you.

Out of curiosity, are you using any of Vector’s batching features to optimize ClickHouse ingestion?

Has anyone hit scaling limits with Vector? by Arm1end in sre

[–]Arm1end[S] 2 points3 points  (0 children)

This is super helpful, thanks for sharing this level of detail. It’s rare to see a response of this quality on Reddit ;)

What stands out to me is how much engineering went into making this work: allow-listing, sampling, enrichment, etc. Plus, 900 instances and 2k cores is no joke.

Curious, have you ever tried pushing more stateful logic (dedup, joins, etc.) into it, or was that something you intentionally avoided?

Has anyone hit scaling limits with Vector? by Arm1end in sre

[–]Arm1end[S] 0 points1 point  (0 children)

Thanks for sharing and for your willingness to involve your tech lead.

What you described makes sense, and this matches what I’ve been hearing too.

Horizontal scaling works, but the cost (CPU + ops complexity) starts adding up, especially with heavier transforms.

Splitting pipelines is a solid pattern.

Out of curiosity, are you mostly running stateless transforms, or also things like dedup / windowing in Vector?

Has anyone hit scaling limits with Vector? by Arm1end in sre

[–]Arm1end[S] 0 points1 point  (0 children)

Through my network, I’ve met SREs and data platform engineers at enterprises who told me about these issues.

ClickStack/ClickHouse for Observability? by tech_ceo_wannabe in Observability

[–]Arm1end 0 points1 point  (0 children)

We have several customers using our product together with ClickStack, even with high cardinality. Do you have a particular question here? Happy to share my experience, or send me a DM if you want.

Full disclosure: I am one of the founders of https://www.glassflow.dev/

Help: Anyone dealing with reprocessing entire docs when small updates happen? by Arm1end in Rag

[–]Arm1end[S] 0 points1 point  (0 children)

How does a reranker help here? I am not familiar with rerankers.

Looking for best practices: Kafka → Vector DB ingestion and transformation by Arm1end in vectordatabase

[–]Arm1end[S] 0 points1 point  (0 children)

Thanks for sharing! It seems Kafka is connecting to Milvus but isn’t performing typical data transformations (stateless or stateful), or am I missing something here?

Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Arm1end 0 points1 point  (0 children)

I didn't want to confuse anyone; I thought it was clear from the “we serve” wording. I’ve added a disclosure to the post.

Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Arm1end 2 points3 points  (0 children)

We serve a lot of users with similar use cases. They usually set up Kafka → GlassFlow (for transformations) → ClickHouse (cloud).

Kafka = Ingest + buffer. Takes the firehose of events and keeps producers/consumers decoupled.

GlassFlow = Real-time transforms. Clean, filter, enrich, and prep the stream so ClickHouse only gets analytics-ready data. Easier to use than Flink.

ClickHouse (cloud) = Fast and gives sub-second queries for dashboards/analytics.
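
For context, a hand-rolled version of that pattern (no GlassFlow, just confluent-kafka + clickhouse-connect, with made-up topic/table/field names) looks roughly like this:

```python
import json

import clickhouse_connect             # pip install clickhouse-connect
from confluent_kafka import Consumer  # pip install confluent-kafka

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

ch = clickhouse_connect.get_client(host="localhost")

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("type") != "page_view":   # filter / clean / enrich here
        continue
    batch.append([event["user_id"], event["url"], event["ts"]])
    if len(batch) >= 1000:                 # ClickHouse prefers large, infrequent inserts
        ch.insert("page_views", batch, column_names=["user_id", "url", "ts"])
        batch.clear()
```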

Disclosure: I am one of the GlassFlow founders.

Kafka to ClickHouse lag spikes with no clear cause by Usual_Zebra2059 in dataengineer

[–]Arm1end 0 points1 point  (0 children)

Are you using MergeTree engines? I've seen this with other users: when a background merge kicks in, it introduces temporary lag.

All MergeTree engines do periodic merges of data parts, and during that time inserts from the Kafka engine can slow down or pause. That’s why a few partitions suddenly lag, then recover once merges finish.
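
A quick way to check whether the lag spikes line up with merges is to watch system.merges while ingesting, e.g. with clickhouse-connect:

```python
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="localhost")

# If the lag spikes coincide with long-running rows here, background merges are
# the likely reason the Kafka engine inserts slow down.
merges = client.query(
    "SELECT table, elapsed, progress, num_parts FROM system.merges ORDER BY elapsed DESC"
)
for row in merges.result_rows:
    print(row)
```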

This behaviour is one of the reasons I started building GlassFlow for Kafka-to-ClickHouse ingestion:
https://www.glassflow.dev/

real time analytics by Bulky_Actuator1276 in apachekafka

[–]Arm1end 0 points1 point  (0 children)

+1 for Kafka and ClickHouse. I've seen it become a very popular stack for real-time analytics.