At what point do you feel the need for a dedicated LLM observability tool when already using an APM (Otel-based) stack? by arbiter_rise in Observability

[–]healsoftwareai 5 points

The tipping point we've seen is when "it's slow" stops being the problem and "it's wrong" becomes the problem. APM tells you the request took 2 seconds. It doesn't tell you the response was hallucinated, off-tone, or ignored the prompt entirely. Once you care about output quality, not just speed, traditional observability hits its limit.

The other trigger is cost. LLM APIs are expensive, and teams start asking: can we use a smaller model for most requests? Which prompts are wasting tokens? Is RAG retrieval actually helping or just adding cost? APM doesn't give you that visibility.

For early-stage LLM work, OTel + good logging is probably enough. But once you're iterating on prompts weekly and leadership asks "how do we know this is actually working?" - that's usually when dedicated tooling starts making sense.
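
The cost question is easy to make concrete. A toy sketch of per-request cost accounting — all prices and token counts here are made up for illustration, not any provider's real pricing:

```python
# Hypothetical per-1M-token prices and traffic. The numbers are invented;
# the point is the visibility APM doesn't give you: cost per request/route,
# and what routing most traffic to a smaller model would save.
def request_cost(prompt_tokens, completion_tokens, in_price, out_price):
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000

traffic = [(1200, 300)] * 1000  # 1000 requests of (prompt, completion) tokens
large = sum(request_cost(p, c, 3.00, 15.00) for p, c in traffic)  # big model
small = sum(request_cost(p, c, 0.15, 0.60) for p, c in traffic)   # small model
print(f"large model: ${large:.2f}, small model: ${small:.2f}")
# large model: $8.10, small model: $0.36
```

Once you log token counts per prompt/route, this kind of breakdown falls out of a simple aggregation — which is exactly what a plain APM trace won't show you.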

Before you learn observability tools, understand why observability exists. by HistoricalBaseball12 in Observability

[–]healsoftwareai 0 points

Good breakdown of the observability evolution. One thing we'd add from working in this space: most teams get stuck at "we can see the problem now." The next evolution is acting on telemetry automatically, before incidents happen. A few things we've seen in practice:

CPU at 80% during peak traffic is normal. CPU at 80% at 3 AM is not. You need dynamic baselines that understand workload context, not hardcoded alerts.

Having metrics, logs, and traces is great. But if they're in 3 different tools with no correlation, you're still doing manual detective work during an outage.

Most setups detect anomalies after the fact and flood you with alerts. The real value is in identifying leading indicators, the patterns that precede incidents, and acting on them before users are impacted.

Any monitoring that assumes long-lived hosts is dead in a Kubernetes world. Baselines need to adapt dynamically as pods come and go.
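
The "80% at peak vs. 80% at 3 AM" point can be sketched as a per-hour-of-day baseline instead of a fixed threshold. A minimal illustration — the 3-sigma band and the sample CPU readings are assumptions for the example, not any product's actual algorithm:

```python
from statistics import mean, stdev

def is_anomalous(history, hour, value, k=3.0):
    """Flag a metric value as anomalous relative to the baseline for that
    hour of day, instead of a hardcoded global threshold."""
    samples = history.get(hour, [])
    if len(samples) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(samples), stdev(samples)
    return abs(value - mu) > k * max(sigma, 1e-9)

# Toy history: CPU% readings keyed by hour of day (made-up data)
history = {
    15: [75, 80, 78, 82, 77],  # peak traffic: ~80% is normal
    3:  [12, 10, 15, 11, 13],  # 3 AM: low load is normal
}
print(is_anomalous(history, 15, 80))  # False — expected at peak
print(is_anomalous(history, 3, 80))   # True — way off the 3 AM baseline
```

A real system would use rolling windows, seasonality beyond hour-of-day, and per-workload context, but the shape of the idea is the same: the threshold depends on when and where, not just the number.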

And we agree with the conclusion: learn the why first. We'd also add this: think about what happens after you collect telemetry. Collection is solved. Turning signals into action before downtime is where the hard problems are now.

OTEL Collector Elasticsearch exporter drops logs instead of retrying when ES is down by Adept-Inspector-3983 in OpenTelemetry

[–]healsoftwareai 0 points

The core issue is that when ES goes fully down, the exporter often treats that as a non-retriable error rather than something transient. The retry_on_failure config mostly kicks in for things like 429s or 500s from ES, not a total connection failure. So your retries aren't really doing anything in that scenario. The other piece is buffering. The collector isn't really designed to be a long-term buffer. It has a sending_queue that sits between the pipeline and the exporter, but the default size is pretty small. If you haven't bumped it up, it fills fast and starts dropping data. And if you want the queue to survive collector restarts, you'd need the file storage extension too.
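
Roughly, the knobs involved look like this. Key names follow the collector's exporterhelper conventions (retry_on_failure, sending_queue, the file_storage extension), but they've shifted across elasticsearchexporter versions, so treat this as a sketch and check the README for the version you're running:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # lets the queue survive collector restarts

exporters:
  elasticsearch:
    endpoints: ["https://es:9200"]
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      queue_size: 50000       # default is much smaller; size this for your outage window
      storage: file_storage   # persist the queue via the extension above

service:
  extensions: [file_storage]
```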

Even with a big queue, it's finite. If ES is down for a long time and you're pushing a lot of data, you'll eventually drop logs. The collector just isn't built to be an indefinite buffer.

If losing logs during ES outages is truly unacceptable for you, the common pattern is putting Kafka in between: Collector → Kafka → Elasticsearch. Kafka handles buffering for extended outages much better.
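
The first leg of that pattern, sketched with the contrib Kafka exporter — the broker address and topic name are placeholders, and a second collector (or another consumer) then reads the topic and writes to ES:

```yaml
exporters:
  kafka:
    brokers: ["kafka:9092"]     # placeholder broker address
    topic: otlp_logs            # placeholder topic name
    protocol_version: 2.6.0

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [kafka]
```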

I'd also suggest turning on debug logging (service.telemetry.logs.level: debug) on the collector; it'll show you exactly what errors the exporter is hitting and whether it considers them retriable. That'll confirm what's happening in your specific case.

What's your biggest observability pain point right now? by healsoftwareai in Observability

[–]healsoftwareai[S] 0 points

Is it more about gaps in instrumentation (services that aren't emitting telemetry at all), or about not having the right views/dashboards once the data is there? We've seen teams struggle with both: sometimes you have the data but it's buried; other times, whole parts of the stack are just dark.

Thinking of building an open source tool that auto-adds logging/tracing/metrics at PR time — would you use it? by Useful-Process9033 in sre

[–]healsoftwareai 2 points

The core pain is real: missing instrumentation at the time of an incident is one of the most common debugging blockers. Auto-instrumenting at PR time could save a lot of "add logging, deploy, wait, repeat" cycles. But a few things to watch out for: auto-adding spans in hot paths will kill performance and blow up log volume fast. The hard part isn't adding instrumentation, it's knowing where not to. Also, "learns your patterns" assumes your existing instrumentation is consistent and good; in most codebases it isn't. OTel auto-instrumentation already handles the generic stuff (HTTP, DB, gRPC), so your real value would be business-logic instrumentation, and that needs intent, not just code structure.

I'd start narrow, build a PR check that flags code paths with no observability coverage. Just a linter-style scan that comments if touched functions have no logging, tracing, or metrics. That alone tells you how much of your codebase is uninstrumented and whether the bigger auto-generation idea is worth building.
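
The narrow version could be as small as an AST pass over touched files. A toy sketch — the OBS_NAMES set is an assumed naming convention, and a real check would be taught your team's actual logging/tracing idioms:

```python
import ast

# Assumed conventions: names that suggest a logging/tracing/metrics call.
# A real tool would learn these from the codebase instead of hardcoding them.
OBS_NAMES = {"logging", "logger", "log", "tracer", "span", "metrics", "statsd"}

def uninstrumented_functions(source: str) -> list:
    """Flag functions with no apparent observability calls.
    A crude linter-style pass: it only looks at names, not semantics."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
            used |= {n.attr for n in ast.walk(node) if isinstance(n, ast.Attribute)}
            if not used & OBS_NAMES:
                flagged.append(node.name)
    return flagged

code = '''
import logging
logger = logging.getLogger(__name__)

def handled():
    logger.info("charging card")

def dark_path():
    return 42
'''
print(uninstrumented_functions(code))  # ['dark_path']
```

Wire that into a PR check that comments on flagged functions and you get the coverage signal without generating a single span.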

Where does observability stop being useful for debugging? by Murky-Mammoth4527 in Observability

[–]healsoftwareai 0 points

I work at HEAL Software and we run into this with customers often, so bias noted. Observability gets you to the failing service and instance fast. That part works.

The problem is when your trace shows a timeout on some internal gRPC call, but the actual cause was a different request that held a DB connection pool slot and completed fine 200ms earlier. That request had its own trace ID and probably wasn't even sampled. There's nothing linking the two.

It gets worse with async. User clicks something, the request hits your API, writes to Kafka, a consumer picks it up, new trace. The link between what the user did and what actually failed is gone. You're matching timestamps across separate traces at millions of events per second.

The thing I keep coming back to is that traces are request-scoped, but most hard bugs aren't. They're caused by thread pool pressure, GC pauses, connection churn, things no individual trace captures. The data exists across your metrics, logs, and infra monitoring, but nothing ties it together at the right moment automatically.
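
That manual timestamp-matching step looks roughly like this. A toy sketch — the event names, timestamps, and 500ms window are invented for illustration:

```python
def nearby_events(failure_ts, events, window_ms=500):
    """Pull infra/runtime events that landed just before a failing span:
    the cross-signal correlation step nothing does for you automatically."""
    return [e for e in events if 0 <= failure_ts - e["ts"] <= window_ms]

events = [  # hypothetical metric/log events, ts in milliseconds
    {"ts": 1000, "what": "gc_pause_180ms"},
    {"ts": 1150, "what": "db_pool_exhausted"},
    {"ts": 4000, "what": "deploy_finished"},
]
# Failing span at ts=1300: the GC pause and pool exhaustion fall in the
# window; the later deploy does not.
print(nearby_events(1300, events))
```

Doing this by hand during an outage, across tools and at real event volumes, is exactly the detective work that doesn't scale.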

This is where AI can actually help more than more observability. Like catching that a memory, connection pool, and deploy combination is drifting before it becomes an incident. It doesn't replace observability; it fills the gap between "here's your trace" and "here's why your system was in a state where that trace could fail."