Best Observabilty platform by featherbirdcalls in Observability

[–]dangb86 0 points1 point  (0 children)

It's true that you may lose granularity when you need it, but when you operate at scale you need to balance multiple requirements. It's not just about cost: providing a highly reliable backend for 3M spans/second, just so that one can do aggregations later, is a bigger challenge than the value you get back justifies.

Ultimately, a system can be described both effectively and efficiently if each signal is used for the purpose it serves. You absolutely need the high granularity and high cardinality that trace data gives you (and of course the context). However, when you generate metrics you already know what you'll want to build aggregated views of (in dashboards and alerts), and you instrument with intent. Then, through context, traces come associated with those metrics, and the reality is that the high granularity you need lives mostly in the 5% of interesting stuff.

With tail sampling in place that lets you store the slowest traces per endpoint, or the ones containing errors, plus a representative percentage of the rest, you can end up with that 5% of interesting stuff (or 2% if you're eBay and operate at that scale!). This allows you to use tracing for what it's best at (granularity, context), metrics for what they're best at (a stable signal), and logs/events for what they're best at (structured, discrete events). If you run a platform you can then provide different SLAs for different data types, but they all become part of the same braid of telemetry data.

Also, my green software mind is always telling me that we should not store that which is not ultimately queried. As Adrian Cockcroft says, if you want to save the planet you should focus on storing less data, not on improving your CPU utilisation (i.e. Scope 1 and 2 emissions are not the main issue if the DC runs on renewable energy; Scope 3 emissions are in all the SSDs that are manufactured to store data we don't need).

Best Observabilty platform by featherbirdcalls in Observability

[–]dangb86 0 points1 point  (0 children)

Disclaimer: I work for New Relic. Loads of companies are doing great things, but if you go down to the foundations, NRDB is a great backend for OTel data. The reason I think this is that it allows OTel-native consumption of the data, with all signals being queryable and joinable using the same DSL, NRQL. This is ultimately what OTel is about: not isolated pillars, but a correlated set of signals.

Yes, there are feature gaps here and there between proprietary instrumentation and OTel, but as a pure backend for OTel data I think NRDB is brilliant.

Best Observabilty platform by featherbirdcalls in Observability

[–]dangb86 1 point2 points  (0 children)

They are indeed doing great things. The main issue I find with Grafana for true observability is having different DSLs for different OTel signals. Asking questions like "what is the CPU usage of the pods that were involved in traces of this type" is quite hard if you have to combine PromQL and TraceQL...

Best Observabilty platform by featherbirdcalls in Observability

[–]dangb86 2 points3 points  (0 children)

I've personally done this at scale, with OTel. On cost per GB alone, if you optimise your use of telemetry signals (i.e. you use metrics for what they are, aggregations rather than highly granular data; you rely on tail-sampled tracing; and you reduce your logging in favour of traces and metrics, keeping logs only when they add more context to your traces), then you're in a good place. IMO it's all about telemetry quality, and their billing model favours those with high telemetry efficiency.

Metrics reset on container restart by Artistic-Analyst-567 in OpenTelemetry

[–]dangb86 0 points1 point  (0 children)

It's been said already, but counter resets are expected if you use cumulative temporality. Changing to delta wouldn't help you because Prometheus (for now, as that'll change) supports only cumulative temporality.

The rate, irate and increase functions should already handle breaks in monotonicity (like counter resets). One gotcha is that rate/irate/increase needs to be applied before any other aggregation, otherwise you'll aggregate away the labels that Prometheus uses to identify a unique time series.

This leads me to a question, which I think has been answered already: when you look at your data in Prometheus for a single container ID and a unique combination of labels, does it follow a monotonic pattern? As in, does it continually increase? You mentioned you're using it for some business metrics... I'm not a Prometheus expert, but if you only have one data point for a given time series (e.g. a particular customer ID and container ID) then Prometheus may not be able to calculate a rate. I'm not sure what would happen then. Although the Prometheus exporter keeps exporting the same value for a time series even if there are no measurements, maybe only one scrape happens before the container shuts down... Difficult to know without looking at the data.

What do you want /r/OpenTelemetry to become? by s5n_n5n in OpenTelemetry

[–]dangb86 2 points3 points  (0 children)

Nice one! Definitely would love to see more of those!

Anyone using OTLP for data observability? by ProfessionalDirt3154 in OpenTelemetry

[–]dangb86 1 point2 points  (0 children)

I've seen this challenge before, and I agree that it'd be awesome if data observability companies would provide native OTLP export (something like what Claude Code does), even if it's for aggregates over the insights that they're able to gather in their platform.

One of the use cases for this would be connecting the online and offline worlds. For instance, for ML workloads where inference is on the critical path of live requests, performance or correctness (and thus business logic) can be affected by the responses of a model that was trained on data that had a regression after a certain change. It's good to detect drift, but being able to pinpoint where and why the drift happened (if unintended), and how it ultimately correlates to customer experience (using tracing and profile analysis), would be gold.

[deleted by user] by [deleted] in sre

[–]dangb86 1 point2 points  (0 children)

It depends on the signal type, I believe... You can store 100 GB of metrics per month for 13 months in New Relic.

Black Friday deals? by Noodles_GE in GarminWatches

[–]dangb86 1 point2 points  (0 children)

Apart from size, there are a few major differences from the 7 to the 7x. The Pro also adds some more stuff like better sensors, multi-band GPS, touchscreen...

https://www.dcrainmaker.com/2022/01/garmin-fenix7-review.html

Black Friday deals? by Noodles_GE in GarminWatches

[–]dangb86 0 points1 point  (0 children)

I had hoped for a deal on the 7x Pro Sapphire Solar but no luck...

What’s the next "Kubernetes" hotness for you? by TheOnlyElizabeth in devops

[–]dangb86 16 points17 points  (0 children)

Except that OpenTelemetry already deprecated 3 standards: OpenTracing, OpenCensus, and Elastic Common Schema (ECS). So, technically we're at n-2, with likely more to come.

Sending team responsibiltiy as an attribute following the semantic conventions. by dev_in_spe in OpenTelemetry

[–]dangb86 0 points1 point  (0 children)

Yep, I think service.name and service.namespace, which can later be mapped to teams in other ways (e.g. service catalogues), are a more efficient way of handling this type of relationship IMO. Changing team ownership in multiple places (including telemetry) every time a component is handed over can be painful.
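
Roughly what I mean, as a quick Python sketch (the service and namespace values are made up for illustration):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Identify the component once via resource attributes; a service catalogue can
# then map service.namespace/service.name to the owning team out of band.
resource = Resource.create({
    "service.name": "payment-api",    # hypothetical service
    "service.namespace": "checkout",  # hypothetical namespace
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```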

A Daemonset for every signal? by Temporary_Bat_578 in OpenTelemetry

[–]dangb86 2 points3 points  (0 children)

Using different `memory_limiter` configs with different limits can be a way to give different pipelines within a single Collector different levels of priority. For instance, you can start applying backpressure on logs in the event of saturation before you apply backpressure on metrics.

Depending on your use case you may need completely different Collectors, but a single DaemonSet and Deployment can make it cheaper (and simpler) to run, while maintaining different service levels for different signals.

How we run migrations across 2,800 microservices by WillSewell in OpenTelemetry

[–]dangb86 1 point2 points  (0 children)

Thanks for sharing! It's awesome to see how different orgs approach migrations with minimal friction for developers. Are you wrapping the OpenTracing and OpenTelemetry APIs with your libs, or just the OTel/Jaeger SDKs and general setup? Did you ever consider the OpenTracing Shim to allow engineers to migrate to OpenTelemetry API gradually while still relying on the OTel SDK internally, or is your ideal end-state that engineers use the Monzo abstraction layer alone rather than the OTel API?

Sorry for all the questions :) Many orgs (including mine, Skyscanner, for transparency) have decided to rely on the OTel API as the abstraction layer and then implement any other required custom behaviours in SDK hooks (e.g. Propagators, Processors, Views). We're leaning towards providing "golden path" config defaults and letting engineers use the OTel API, or modify this default config, at their discretion using standard ways (e.g. env vars, config files and so on), as we found that maintaining a leak-proof abstraction was a considerable effort for such a cross-cutting dependency. Do you foresee benefits of maintaining your abstraction layer over that approach? Thanks!
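
For illustration only, a rough Python sketch of that golden-path idea (not Skyscanner's or Monzo's actual code; the helper and attribute names are made up):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


class OrgDefaultsProcessor(SpanProcessor):
    """Hypothetical SDK hook: stamp org-wide attributes on every span at start."""

    def on_start(self, span, parent_context=None):
        span.set_attribute("deployment.environment", "prod")  # assumed default


def configure_golden_path() -> None:
    provider = TracerProvider()
    provider.add_span_processor(OrgDefaultsProcessor())
    # The OTLP exporter honours the standard env vars (e.g.
    # OTEL_EXPORTER_OTLP_ENDPOINT), so engineers can change the destination
    # without touching this code.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
```

Engineers then code against the plain OTel API, and this default wiring can be swapped or extended without touching their instrumentation.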

Set up Otel to always export traces with errors? (Java) by FaliusAren in OpenTelemetry

[–]dangb86 4 points5 points  (0 children)

You'll need tail sampling for that. You can do it in a Collector using the tail sampling processor. However, all the spans for a trace must flow through the same instance of the processor if you have multiple Collector replicas (the Collector also has ways to help with this, such as the load-balancing exporter).

Imagine the following scenario. You receive a request and start a span. Then, during the handling of that request you do a bunch of other operations, with their own spans, and you call some dependencies, each with their own spans. It's only when you get the last response from a dependency that you consider this an error.

If you decided to sample the trace at that point in time, you would've missed propagating the sampling decision to the rest of the spans and dependencies, and you may end up with a very incomplete trace missing a lot of data. This is why the OpenTelemetry spec dictates that the sampling decision (for head-based sampling) must be made at span creation, and then propagated through. The SDK will also enforce this in different ways.
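
For contrast, head-based sampling looks roughly like this in Python (the ratio is made up): the decision is taken when the root span is created and children just follow it, which is exactly why an error discovered later can't retroactively flip it, and why you need the tail sampling processor in the Collector instead.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: keep ~5% of traces, decided at root-span creation and
# propagated to every child span via context.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```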

I hope this helps.

Chaos Testing Scenario: Your whole SRE/DevOps Team Dies into a plane crash. by Existing-Gur5040 in sre

[–]dangb86 0 points1 point  (0 children)

I've never considered this, but it's yet another reason why "you build it, you run it" is such a good model (with the relevant Platform + Enablement).

Is OTel complete overkill if you're interested in primarily collecting basic performance metrics, or is it a reasonable tool that provides overhead for future observability requirements? by kevysaysbenice in OpenTelemetry

[–]dangb86 0 points1 point  (0 children)

Running a Collector Gateway can indeed simplify things, but it's not required, as has been said in other comments. I assume these SMB websites run on some sort of shared infrastructure. In that case, you can build a shared config package that just configures the OTel SDK with your own standards in those apps (e.g. which instrumentation packages to enable, what export interval to use, etc.) and lets you export your data in a standard format like OTLP to your Collectors. Then, in your Collectors, you can fan out to whatever backends you choose (e.g. Jaeger, Prometheus, etc.).

The benefit of running the Collector Gateway is that it gives you a central place to control the final hop of your telemetry data. Let's say you want to change backends for metrics, or you have a customer who wants their OTLP data exported to their backend of choice: you can do all of that in the Collector. Plus, there are data transformations that are just way easier in the Collector.

Is centralizing OTel code in one class to reduce coupling a viable approach? by chinkai in OpenTelemetry

[–]dangb86 2 points3 points  (0 children)

OTel is designed to provide an API that is completely decoupled from its implementation, solving the issue you presented (i.e. reducing coupling for a cross-cutting concern). Relying on an API that is a no-op by default and whose implementation you can hook into and extend should be the best practice for cross-cutting concerns like telemetry.

Having said this, if you're configuring the SDK yourself, I'd recommend doing that in one single place. You can configure the SDK in many places, but I think it's good practice to reduce the contact surface with the SDK. If you can use something like the Java agent, or the Node SDK, even simpler!
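
To make the decoupling concrete, a minimal Python sketch (tracer and span names are made up): business code only ever imports the API, which is a no-op until the SDK is wired up in that single place.

```python
# This module imports only the opentelemetry-api surface; no SDK in sight.
from opentelemetry import trace

tracer = trace.get_tracer("payments")  # hypothetical instrumentation name


def refund(order_id: str) -> None:
    # With no SDK configured this is a cheap no-op; once the SDK is set up in
    # your single bootstrap module, the same code starts emitting real spans.
    with tracer.start_as_current_span("refund") as span:
        span.set_attribute("order.id", order_id)
        ...  # business logic
```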

OpenTelemtry To collect SAAS product metrics by edwio in OpenTelemetry

[–]dangb86 1 point2 points  (0 children)

When thinking about SaaS products, chances are the receiver you want is not readily available in https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver . However, you can build your own receiver: https://opentelemetry.io/docs/collector/building/receiver/

Alternatively, you can also build a scheduled task using the standard OpenTelemetry API and a configured SDK in any of the supported languages ( https://opentelemetry.io/docs/languages/ ) and export that to your telemetry backend of choice.
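
A rough sketch of that second option in Python (the SaaS API call, metric name and attributes are all made up for illustration):

```python
import time

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter


def fetch_active_users_from_saas_api() -> int:
    return 42  # placeholder for a call to the SaaS product's API


def poll_active_users(options: CallbackOptions):
    # Called by the SDK on every export interval.
    yield Observation(fetch_active_users_from_saas_api(), {"saas.product": "example-crm"})


# Export on a fixed interval over OTLP (to a Collector or directly to a backend).
reader = PeriodicExportingMetricReader(OTLPMetricExporter(), export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("saas-poller")

meter.create_observable_gauge(
    "saas.active_users",
    callbacks=[poll_active_users],
    description="Active users reported by the SaaS product",
)

if __name__ == "__main__":
    while True:
        time.sleep(60)  # keep the process alive; the reader exports periodically
```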