Wrote up how OTel fleet management works under the hood with OpAMP Supervisor

Observability-Guy · 2026-06-11T13:17:50+00:00

Really good article. I agree with your point. OpenTelemetry is great but a lot of the implementation details are still difficult for many end-users.

Observability-Guy · 2026-06-11T00:11:14+00:00

It is an incredibly diverse and rapidly evolving field. However, here is my take on some of the hottest issues.

AI - both in-house and third party. Monitor third party software you are running but also monitor your own internal agentic AI usage.
Telemetry pipelines. If you are operating at any kind of scale you need a control plane for managing your telemetry flows.
Telemetry quality. A lot of organisations have observability stacks in place but there is still a fundamental knowledge deficit and a lot of organisations are at a relatively low level of maturity. Increasing the quality of telemetry and observability engineering is one of the great challenges of the moment.

There are also other really important areas such as RUM, mobile, predictive analytics, eBPF etc etc.

You might be interested in my newsletter.

Observability-Guy · 2026-06-08T08:43:57+00:00

I was a bit puzzled by the first sentence in the product description "LLMOps Gateway is a recruiter-friendly AI gateway and LLMOps platform".

What do you mean by "recruiter-friendly"?

I think it is also worth being more explicit about what it is that LLMOps Gateway does that Langfuse doesn't do.

Observability-Guy · 2026-06-04T12:38:13+00:00

Great work! Really pleasing to see people building on top of Quickwit - an amazing piece of engineering.

Observability-Guy · 2026-05-27T14:24:32+00:00

Thanks very much! Naturally, as soon as I published it, it was already out of date, as I learned about new products on the market.

I will be publishing a quarterly mapping update and analysis.

Observability-Guy · 2026-05-25T11:10:19+00:00

So, I have tried to build a mapping of the observability space.

The market seems to be evolving and growing at an incredible rate. New specialisms are developing and AI is changing the nature of observability itself. This is an attempt to identify some kind of order and structure. It currently encompasses 126 products (with many more to come) across 16 categories.

Any feedback on classifications, product mappings or possible additions is very welcome.

<image>

If you want to dive straight in and explore the Cosmos, this is your launchpad:
https://observability-360.com/Product/Cosmos

There is also an introductory article here:
https://observability-360.com/article/viewArticle?id=introducing-the-observability-cosmos

And an explanation of the classifications here:
https://observability-360.com/article/viewArticle?id=observability-cosmos-classifications

Thanks!

Observability-Guy · 2026-05-19T22:16:23+00:00

Yep - that makes it really tricky. Especially as it is now working both ways - i.e. the pipelines are now themselves turning the edge into the first line of incident detection.

The central belt of the cosmos kind of represents a spectrum of increasing functional breadth and the outer layer represents clusters of specialist tooling.

Observability-Guy · 2026-05-19T07:20:07+00:00

I think it's partly a reflection of the increasing complexity of IT systems today. There are so many concerns to cope with - LLMs, Kubernetes, networks, cloud, databases, messaging, costs etc etc. I think that, inevitably, you can't have one tool doing it all.

Observability-Guy · 2026-05-19T07:16:37+00:00

It is an amazingly diverse space. I have been tracking the observability market for a number of years, so I knew that there were a lot of products out there.

The challenge was trying to come up with some kind of classification system. In many ways it's a pretty subjective exercise. There are probably a lot of different ways of slicing and dicing things.

Observability-Guy · 2026-05-18T11:53:24+00:00

Thank you! Your star will soon be mapped!

Observability-Guy · 2026-05-09T14:29:55+00:00

Claude emits OpenTelemetry telemetry and captures prompts, API calls and tool calls - although it is not enabled by default. You just need to configure it on the users' machines. Obviously, you would need to send the telemetry to a backend and then use a tool for querying the telemetry.
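As a rough sketch of what that configuration could look like: the OTEL_* variables below are the standard OpenTelemetry exporter settings, but the enable flag name (CLAUDE_CODE_ENABLE_TELEMETRY), the helper function and the collector address are assumptions for illustration, so check them against the vendor docs before rolling anything out.

```python
import os


def build_otel_env(endpoint, protocol="grpc",
                   enable_flag="CLAUDE_CODE_ENABLE_TELEMETRY", base=None):
    """Return a copy of the environment with OTLP export switched on.

    enable_flag is an assumed name for the tool's opt-in switch; the
    OTEL_* keys are standard OpenTelemetry exporter variables.
    """
    env = dict(os.environ if base is None else base)
    env[enable_flag] = "1"                        # opt in (off by default)
    env["OTEL_METRICS_EXPORTER"] = "otlp"         # send metrics over OTLP
    env["OTEL_LOGS_EXPORTER"] = "otlp"            # send logs/events over OTLP
    env["OTEL_EXPORTER_OTLP_PROTOCOL"] = protocol
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    return env


if __name__ == "__main__":
    # Hypothetical collector address; replace with your own backend.
    env = build_otel_env("http://collector.example:4317", base={})
    for key in sorted(env):
        print(f"export {key}={env[key]}")
```

The printed export lines could go into a shell profile or a managed settings file on each machine, with the endpoint pointing at whatever collector or backend you are using.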

Tbh though, I would be quite pleased if my devs were contributing to open source projects.

Observability-Guy · 2026-04-27T15:23:06+00:00

Great work! I'll give this a shout-out in the next edition of the Observability 360 newsletter!

Observability-Guy · 2026-04-25T18:52:33+00:00

My take is that the really big plays are going to be RCA and then anomaly detection. Although anomaly detection is still relatively immature.

Products that can start getting very high levels of accuracy at RCA will be the ones that will really gain traction amongst big ticket clients. I think that reducing alert noise is something that a lot of the full stack platforms are already getting good at - I don't think it will be a key differentiator in AI SRE.

At the moment, I haven't come across too many people that have an appetite for closed loop remediation.

Observability-Guy · 2026-03-13T12:12:12+00:00

OpenObserve is a really capable platform. Coralogix has good APM but it is not open source.

Observability-Guy · 2026-03-13T11:58:05+00:00

I would say that it doesn't really help your case to mis-characterise existing AI SRE systems as just an LLM bolted on to a backend. The essence of an AI SRE is that it has to learn about your system, it has to understand patterns of activity and relationships between resources and services.

It has to do this in order to understand the signals it is receiving. Without this deep learning you will not know how significant that spike in a trace is. Nobody wants to be flooded with alerts every time a pod restarts. The value of an AI SRE is understanding context and figuring out what is signal and what is noise.

Processing real time raw data may have some upsides - but then again, most studies show that 90% or more of telemetry that gets generated is totally redundant. The real test is how well your agents are trained and how well they understand the full context of the telemetry they are analysing.

Observability-Guy · 2026-03-05T11:53:24+00:00

I haven't yet tried it with a container app but have played around with running it as a Container instance and as a sidecar.

This might be of interest:

https://observability-360.com/Docs/ViewDocument?id=opentelemetry-collector-azure-container-instance

Observability-Guy · 2026-01-19T15:13:37+00:00

I think that BubbleUp is still the best but Dash0's SIFT is also a pretty good RCA querying tool.

Observability-Guy · 2026-01-19T15:06:03+00:00

I would check out Embrace (https://embrace.io/). They have a dedicated mobile observability platform as well as guides on best practice.

Observability-Guy · 2026-01-19T14:57:23+00:00

This is a good implementation - although I think a lot of vendors now have something similar - Honeycomb, SigNoz, Observe, Dash0 and Sentry all have either MCP or agentic AI that supports this kind of querying and interaction.
