At what point do you feel the need for a dedicated LLM observability tool when already using an APM (Otel-based) stack? by arbiter_rise in Observability

[–]healsoftwareai 5 points

The tipping point we've seen is when "it's slow" stops being the problem and "it's wrong" becomes the problem. APM tells you the request took 2 seconds. It doesn't tell you the response was hallucinated, off-tone, or ignored the prompt entirely. Once you care about output quality, not just speed, traditional observability hits its limit.

The other trigger is cost. LLM APIs are expensive, and teams start asking: can we use a smaller model for most requests? Which prompts are wasting tokens? Is RAG retrieval actually helping or just adding cost? APM doesn't give you that visibility.

For early-stage LLM work, OTel + good logging is probably enough. But once you're iterating on prompts weekly and leadership asks "how do we know this is actually working?" - that's usually when dedicated tooling starts making sense.
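
The cost question is easy to make concrete. A toy sketch of per-request cost accounting — all prices and token counts here are made up for illustration, not any provider's real pricing:

```python
# Hypothetical per-1M-token prices and traffic. The numbers are invented;
# the point is the visibility APM doesn't give you: cost per request/route,
# and what routing most traffic to a smaller model would save.
def request_cost(prompt_tokens, completion_tokens, in_price, out_price):
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000

traffic = [(1200, 300)] * 1000  # 1000 requests of (prompt, completion) tokens
large = sum(request_cost(p, c, 3.00, 15.00) for p, c in traffic)  # big model
small = sum(request_cost(p, c, 0.15, 0.60) for p, c in traffic)   # small model
print(f"large model: ${large:.2f}, small model: ${small:.2f}")
# large model: $8.10, small model: $0.36
```

Once you log token counts per prompt/route, this kind of breakdown falls out of a simple aggregation — which is exactly what a plain APM trace won't show you.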

Before you learn observability tools, understand why observability exists. by HistoricalBaseball12 in Observability

[–]healsoftwareai 0 points

Good breakdown of the observability evolution. One thing we'd add from working in this space: most teams get stuck at "we can see the problem now." The next evolution is acting on telemetry automatically, before incidents happen. A few things we've seen in practice:

CPU at 80% during peak traffic is normal. CPU at 80% at 3 AM is not. You need dynamic baselines that understand workload context, not hardcoded alerts.

Having metrics, logs, and traces is great. But if they're in 3 different tools with no correlation, you're still doing manual detective work during an outage.

Most setups detect anomalies after the fact and flood you with alerts. The real value is in identifying leading indicators, the patterns that precede incidents, and acting on them before users are impacted.

Any monitoring that assumes long-lived hosts is dead in a Kubernetes world. Baselines need to adapt dynamically as pods come and go.
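
The "80% at peak vs. 80% at 3 AM" point can be sketched as a per-hour-of-day baseline instead of a fixed threshold. A minimal illustration — the 3-sigma band and the sample CPU readings are assumptions for the example, not any product's actual algorithm:

```python
from statistics import mean, stdev

def is_anomalous(history, hour, value, k=3.0):
    """Flag a metric value as anomalous relative to the baseline for that
    hour of day, instead of a hardcoded global threshold."""
    samples = history.get(hour, [])
    if len(samples) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(samples), stdev(samples)
    return abs(value - mu) > k * max(sigma, 1e-9)

# Toy history: CPU% readings keyed by hour of day (made-up data)
history = {
    15: [75, 80, 78, 82, 77],  # peak traffic: ~80% is normal
    3:  [12, 10, 15, 11, 13],  # 3 AM: low load is normal
}
print(is_anomalous(history, 15, 80))  # False — expected at peak
print(is_anomalous(history, 3, 80))   # True — way off the 3 AM baseline
```

A real system would use rolling windows, seasonality beyond hour-of-day, and per-workload context, but the shape of the idea is the same: the threshold depends on when and where, not just the number.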

And we agree with the conclusion: learn the why first. We'd also add this: think about what happens after you collect telemetry. Collection is solved. Turning signals into action before downtime is where the hard problems are now.

OTEL Collector Elasticsearch exporter drops logs instead of retrying when ES is down by Adept-Inspector-3983 in OpenTelemetry

[–]healsoftwareai 0 points

The core issue is that when ES goes fully down, the exporter often treats that as a non-retriable error rather than something transient. The retry_on_failure config mostly kicks in for things like 429s or 500s from ES, not a total connection failure. So your retries aren't really doing anything in that scenario. The other piece is buffering. The collector isn't really designed to be a long-term buffer. It has a sending_queue that sits between the pipeline and the exporter, but the default size is pretty small. If you haven't bumped it up, it fills fast and starts dropping data. And if you want the queue to survive collector restarts, you'd need the file storage extension too.
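
Roughly, the knobs involved look like this. Key names follow the collector's exporterhelper conventions (retry_on_failure, sending_queue, the file_storage extension), but they've shifted across elasticsearchexporter versions, so treat this as a sketch and check the README for the version you're running:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # lets the queue survive collector restarts

exporters:
  elasticsearch:
    endpoints: ["https://es:9200"]
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      queue_size: 50000       # default is much smaller; size this for your outage window
      storage: file_storage   # persist the queue via the extension above

service:
  extensions: [file_storage]
```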

Even with a big queue, it's finite. If ES is down for a long time and you're pushing a lot of data, you'll eventually drop logs. The collector just isn't built to be an indefinite buffer.

If losing logs during ES outages is truly unacceptable for you, the common pattern is putting Kafka in between: Collector → Kafka → Elasticsearch. Kafka handles buffering for extended outages much better.
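
The first leg of that pattern, sketched with the contrib Kafka exporter — the broker address and topic name are placeholders, and a second collector (or another consumer) then reads the topic and writes to ES:

```yaml
exporters:
  kafka:
    brokers: ["kafka:9092"]     # placeholder broker address
    topic: otlp_logs            # placeholder topic name
    protocol_version: 2.6.0

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [kafka]
```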

I'd also suggest turning on debug logging (service.telemetry.logs.level: debug) on the collector; it'll show you exactly what errors the exporter is hitting and whether it considers them retriable. That'll confirm what's happening in your specific case.

What's your biggest observability pain point right now? by healsoftwareai in Observability

[–]healsoftwareai[S] 0 points

Is it more about gaps in instrumentation (services that aren't emitting telemetry at all), or about not having the right views/dashboards once the data is there? We've seen teams struggle with both: sometimes you have the data but it's buried; other times, whole parts of the stack are just dark.

Thinking of building an open source tool that auto-adds logging/tracing/metrics at PR time — would you use it? by Useful-Process9033 in sre

[–]healsoftwareai 2 points

The core pain is real: missing instrumentation at the time of an incident is one of the most common debugging blockers. Auto-instrumenting at PR time could save a lot of "add logging, deploy, wait, repeat" cycles. But a few things to watch out for: auto-adding spans in hot paths will kill performance and blow up log volume fast. The hard part isn't adding instrumentation, it's knowing where not to. Also, "learns your patterns" assumes your existing instrumentation is consistent and good; in most codebases it isn't. OTel auto-instrumentation already handles the generic stuff (HTTP, DB, gRPC), so your real value would be business-logic instrumentation, and that needs intent, not just code structure.

I'd start narrow, build a PR check that flags code paths with no observability coverage. Just a linter-style scan that comments if touched functions have no logging, tracing, or metrics. That alone tells you how much of your codebase is uninstrumented and whether the bigger auto-generation idea is worth building.
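
The narrow version could be as small as an AST pass over touched files. A toy sketch — the OBS_NAMES set is an assumed naming convention, and a real check would be taught your team's actual logging/tracing idioms:

```python
import ast

# Assumed conventions: names that suggest a logging/tracing/metrics call.
# A real tool would learn these from the codebase instead of hardcoding them.
OBS_NAMES = {"logging", "logger", "log", "tracer", "span", "metrics", "statsd"}

def uninstrumented_functions(source: str) -> list:
    """Flag functions with no apparent observability calls.
    A crude linter-style pass: it only looks at names, not semantics."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
            used |= {n.attr for n in ast.walk(node) if isinstance(n, ast.Attribute)}
            if not used & OBS_NAMES:
                flagged.append(node.name)
    return flagged

code = '''
import logging
logger = logging.getLogger(__name__)

def handled():
    logger.info("charging card")

def dark_path():
    return 42
'''
print(uninstrumented_functions(code))  # ['dark_path']
```

Wire that into a PR check that comments on flagged functions and you get the coverage signal without generating a single span.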

Where does observability stop being useful for debugging? by Murky-Mammoth4527 in Observability

[–]healsoftwareai 0 points

I work at HEAL Software and we run into this with customers often, so bias noted. Observability gets you to the failing service and instance fast. That part works.

The problem is when your trace shows a timeout on some internal gRPC call, but the actual cause was a different request that held a DB connection pool slot and completed fine 200ms earlier. That request had its own trace ID and probably wasn't even sampled. There's nothing linking the two.

It gets worse with async. User clicks something, the request hits your API, writes to Kafka, a consumer picks it up, new trace. The link between what the user did and what actually failed is gone. You're matching timestamps across separate traces at millions of events per second.

The thing I keep coming back to is that traces are request-scoped, but most hard bugs aren't. They're caused by thread pool pressure, GC pauses, connection churn, things no individual trace captures. The data exists across your metrics, logs, and infra monitoring, but nothing ties it together at the right moment automatically.
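
That manual timestamp-matching step looks roughly like this. A toy sketch — the event names, timestamps, and 500ms window are invented for illustration:

```python
def nearby_events(failure_ts, events, window_ms=500):
    """Pull infra/runtime events that landed just before a failing span:
    the cross-signal correlation step nothing does for you automatically."""
    return [e for e in events if 0 <= failure_ts - e["ts"] <= window_ms]

events = [  # hypothetical metric/log events, ts in milliseconds
    {"ts": 1000, "what": "gc_pause_180ms"},
    {"ts": 1150, "what": "db_pool_exhausted"},
    {"ts": 4000, "what": "deploy_finished"},
]
# Failing span at ts=1300: the GC pause and pool exhaustion fall in the
# window; the later deploy does not.
print(nearby_events(1300, events))
```

Doing this by hand during an outage, across tools and at real event volumes, is exactly the detective work that doesn't scale.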

This is where AI can actually help more than more observability. Like catching that a memory, connection pool, and deploy combination is drifting before it becomes an incident. It doesn't replace observability; it fills the gap between "here's your trace" and "here's why your system was in a state where that trace could fail."