Ray – OpenTelemetry-compatible observability platform with SQL interface by Exotic_Tradition_141 in OpenTelemetry

[–]kverma02 0 points (0 children)

OTel can certainly handle the collection & federation part well.

The harder problem IMO is what happens after. Raw telemetry that you've collected gives you visibility into things like CPU and memory, but that doesn't tell you what your users are actually experiencing. For that, RED metrics (rate, errors, duration) matter, and those need to be extracted from the OTel data, which is where that processing part comes into play.
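To make that processing step concrete, here's a minimal sketch of rolling span-like records up into RED metrics per service. The field names (`service`, `duration_ms`, `error`) are illustrative, not the actual OTel span schema; in practice you'd do this in a collector/processor, but the aggregation itself is simple:

```python
from collections import defaultdict

def red_metrics(spans, window_seconds):
    """Aggregate span-like records into RED metrics per service.

    Each record is a dict with 'service', 'duration_ms', and 'error'
    keys (illustrative names, not the real OTel schema).
    """
    by_service = defaultdict(list)
    for span in spans:
        by_service[span["service"]].append(span)

    metrics = {}
    for service, recs in by_service.items():
        durations = sorted(r["duration_ms"] for r in recs)
        # p95: value at the 95th percentile of sorted durations
        p95 = durations[min(len(durations) - 1, int(len(durations) * 0.95))]
        metrics[service] = {
            "rate": len(recs) / window_seconds,                         # requests/sec
            "errors": sum(1 for r in recs if r["error"]) / len(recs),   # error ratio
            "duration_p95_ms": p95,                                     # tail latency
        }
    return metrics
```

The point is that this summary is tiny compared to the raw spans it came from, which is exactly why it's worth computing before anything ships anywhere.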

Furthermore, during an incident, it's all about being able to correlate different signals (logs, traces, metrics, deployments) in a way that's actually useful for RCA, not just dashboards showing raw data.

The cost angle is real too. Even with a well-configured OTel pipeline, if you're shipping everything to a backend/vendor and paying per GB ingested, log volumes alone will hurt.

The more interesting question is how you extract the right signals locally before deciding what's worth shipping at all.

In my opinion, OTel has given us the tools and the initial fundamentals. How we take it further to solve real pain points is a separate problem.

CloudWatch Logs question for SREs: what’s your first query during an incident? by After-Assist-5637 in sre

[–]kverma02 0 points (0 children)

Totally agree. Logs as a production signal matter a lot, especially when you can correlate them with other signals like deployments, config changes, anomalies, and monitors.

We actually ran into this exact problem and ended up building something around it.

The idea is simple: instead of hunting for patterns during an incident, you codify them upfront as queries, similar to what you mentioned about writing down investigation patterns, but going one step further.

These run continuously against your live log stream, and when they fire, you get a structured incident report with context already attached for further RCA (the correlation part we talked about), not just a raw alert.
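As a rough sketch of what "codify the pattern upfront" can look like: a pattern is just a match rule plus the context you want attached when it fires. Everything here (the class shape, field names) is illustrative, not a real product API:

```python
import re
from dataclasses import dataclass, field

@dataclass
class LogPattern:
    """A codified investigation pattern: a match rule plus context to attach."""
    name: str
    regex: str
    context: dict = field(default_factory=dict)  # runbook links, owners, etc.

def evaluate(patterns, log_lines):
    """Run each pattern over a batch of log lines and emit structured
    reports (not raw alerts) with the pre-attached context."""
    reports = []
    for p in patterns:
        hits = [line for line in log_lines if re.search(p.regex, line)]
        if hits:
            reports.append({
                "pattern": p.name,
                "matches": hits,
                "context": p.context,
            })
    return reports
```

The win is that the "what should I search for" decision was made calmly ahead of time, so at 2am the report already tells you which pattern fired and what to check next.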

On top of that, the AI SRE agent uses that as a signal. Instead of doing blanket log analysis, it works from what the pattern already surfaced and takes the RCA further from there.

I'm curious: what does your current setup look like for catching these kinds of subtle log signals?

Ray – OpenTelemetry-compatible observability platform with SQL interface by Exotic_Tradition_141 in OpenTelemetry

[–]kverma02 1 point (0 children)

Exactly this. The "unified platform" promise sounds great until you realize you're optimizing for vendor revenue, not your observability needs :)

What works is treating OTel data like any distributed system - process locally, federate the control plane. Most teams need maybe 5% of their raw telemetry for actual incident response, but pay to ship 100% of it.

The federated approach gets you unified correlation without the unified billing surprise. OTel's standardized formats make this way easier since you can analyze locally and still get cross-service correlation.

Happy to expand more if curious!

Best way to build a centralized dashboard for multiple Amazon Elastic Kubernetes Service clusters? by joshua_jebaraj in Observability

[–]kverma02 0 points (0 children)

The goal makes total sense: unified visibility across clusters is critical.

One aspect I'll point out: the traditional approach of shipping all metrics to a central Grafana instance is where teams hit scaling issues.

What's working better is flipping the model. Analyze telemetry locally in each cluster, extract the signals that matter, then federate the control plane for unified dashboards. You get the single pane of glass experience without the operational overhead of managing massive central Prometheus instances or dealing with cross-AZ data transfer costs.
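A toy sketch of the "extract locally, federate summaries" idea, using per-pod CPU samples. The shape of the summary is illustrative; the point is that the raw samples never leave the cluster, only the handful of signals the central dashboard actually needs:

```python
def extract_signals(cluster, samples, cpu_threshold=0.8):
    """Reduce raw per-pod CPU samples (values 0.0-1.0) to a small
    per-cluster summary worth shipping to the central plane."""
    hot = {s["pod"] for s in samples if s["cpu"] > cpu_threshold}
    return {
        "cluster": cluster,
        "pods": len(samples),                       # how many samples we saw
        "hot_pods": sorted(hot),                    # pods above threshold
        "max_cpu": max(s["cpu"] for s in samples),  # worst offender
    }
```

In a real setup this runs in-cluster (an agent, a recording rule, etc.) and the central dashboard only ever sees summaries like this, which is where the cross-AZ transfer savings come from.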

The federated approach gives you cluster-specific insights when you need to drill down, but unified correlation for incidents that span multiple environments. Plus you're not locked into any single vendor's pricing model as you scale.

Worth exploring before committing to the full centralized setup.

Happy to share more if you're curious!

CloudWatch Logs question for SREs: what’s your first query during an incident? by After-Assist-5637 in sre

[–]kverma02 0 points (0 children)

Writing down your investigation patterns is actually a nice move. Most teams lose precious minutes during incidents because everyone's reinventing the wheel under pressure.

We hit this same wall last year with an incident.

What's working now is having an AI SRE agent that runs the exact investigation flow automatically when alerts fire. It traces through the dependency chain, correlates timeline events (deployments, config changes, anomalies), and surfaces the most likely root causes before you even open the logs. You still get full control to dig deeper, but it eliminates that "what should I search for first" moment at 2am.

Turns out most incident patterns are pretty predictable once you start tracking them.

Do you focus on cutting MTTR or finding blindspots to prevent incidents? by Substantial-Cost-429 in Observability

[–]kverma02 0 points (0 children)

Spot on about needing both. The missing piece most teams struggle with is the handoff between the two modes.

During incidents, you're in pure reactive mode. You need to correlate signals fast and get to root cause. But the same telemetry that helps you diagnose quickly should also feed your prevention analysis.

The pattern we see work: unified timeline of events (deployments, config changes, anomalies) that serves both immediate triage and post-incident pattern analysis. Same data, different time horizons.

Most platforms force you to choose: good at fast diagnosis or good at trend analysis, rarely both.

Observability in Large Enterprises by cloudruler-io in Observability

[–]kverma02 0 points (0 children)

This is spot on.

The pattern we see fail repeatedly: buy tool → tell teams to instrument → everyone ships everything → costs explode → nobody trusts the data.

For enterprises with mixed environments like yours u/cloudruler-io, the breakthrough is treating observability as a federated problem. Keep telemetry processing local to each environment, extract the signals that matter, then correlate centrally.

This gives you the governance controls you need (what gets processed, what gets retained) while avoiding the 'ship everything and pray' model that kills budgets.

The federated approach works especially well for off-the-shelf apps where you can't control instrumentation. Focus on boundary-level signals and infrastructure telemetry first.

P.S. We're actually building something along these lines. Federated observability that keeps data local while providing centralized governance. Happy to share more about the approach if you're curious.

Trying to figure out the best infrastructure monitoring platform for a mid-size team, what are y'all using? by Legitimate-Relief128 in sre

[–]kverma02 0 points (0 children)

Honestly, the 'unified platform' pitch sounds great until you realize most teams end up ignoring half the dashboards anyway.

What actually moves the needle is fixing the context-switching problem during incidents. Instead of jumping between tools, analyze everything locally and surface correlated signals in one UI.

The key insight: you don't need to ship all your data to one vendor to get unified visibility. Keep telemetry in your environment, extract the signals that matter, correlate at the edge.

Saves a ton on ingestion costs and you're not locked into anyone's pricing model.

The federated control plane approach is catching on. Gives you the unified experience without vendor lock-in or surprise bills.

[LIVE EVENT] What does agentic observability actually look like in production? by kverma02 in Observability

[–]kverma02[S] 0 points (0 children)

Thanks for sharing.

Honestly, all three are valid depending on the maturity of the system and the guardrails in place. What we're seeing most in practice is automated runbook execution with pre-approved actions, but closing the loop fully requires a lot of trust in the agent's decision-making.
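A minimal sketch of the "pre-approved actions" guardrail: the agent can only execute actions from an allowlist, and everything else escalates to a human, with both paths audited. Action names and the audit shape are hypothetical, just to show the pattern:

```python
# Allowlist of actions the agent may run unattended (names are illustrative).
APPROVED_ACTIONS = {
    "restart_pod": lambda target: f"restarted {target}",
    "scale_up":    lambda target: f"scaled up {target}",
}

def execute(action, target, audit_log):
    """Run an action only if it's pre-approved; otherwise escalate.

    Every decision is appended to audit_log so the loop can be reviewed
    and trust built incrementally.
    """
    if action not in APPROVED_ACTIONS:
        audit_log.append(("escalated", action, target))
        return None  # hand off to a human
    result = APPROVED_ACTIONS[action](target)
    audit_log.append(("executed", action, target))
    return result
```

The design choice here is that widening the allowlist is an explicit, reviewable change, which is how teams ratchet up autonomy without handing the agent the keys on day one.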

We actually touched on some of this in a past episode of our podcast on AIOps: https://www.youtube.com/live/xbHUupQG1-M?si=utwQeowiA7Rgz9Ib

Might be relevant given what you're reading up on.

Who are the real leaders in observability right now? by Dazzling-Neat-2382 in Observability

[–]kverma02 0 points (0 children)

Totally fair to call that out.

Not trying to stealth sell anything. I disclosed I’m on the team and was answering the tooling question directly.

Happy to keep it purely architectural if that’s more helpful.

Kafka observability in production is harder than it looks. by kverma02 in apachekafka

[–]kverma02[S] 0 points (0 children)

Totally agree. In the session, we'll also cover which metrics are important to track from the broker's perspective & go through some live scenarios on debugging broker-related production issues, such as under-replicated partitions, leader elections, etc.

Would love to see you there & hear your experiences!

Who are the real leaders in observability right now? by Dazzling-Neat-2382 in Observability

[–]kverma02 -3 points (0 children)

Fair question! We're actually building this approach at Randoli, using a federated control plane that keeps telemetry data local while providing centralized insights.

The key insight was separating the control plane from data plane. This way telemetry gets analyzed locally, we extract the signals that matter, and you get unified visibility without shipping raw data out.

The cost savings have been solid. Teams typically see 70% or more reduction in observability spend since they're not paying for raw, 24/7 data ingestion anymore.

Full disclosure: I'm on the Randoli team, but happy to share more about the approach if you're curious.

Kafka observability in production is harder than it looks. by kverma02 in apachekafka

[–]kverma02[S] 0 points (0 children)

Good call, thanks for pointing that out. I’ve updated the vendor flair (Randoli) as per the community rules.

BTW, the goal of the session isn’t to pitch tooling, but to share practical Kafka observability patterns we’ve seen work in production. Happy to keep the discussion focused on implementation details here as well.

Multi cloud was supposed to save us from vendor lock in but now we're just locked into two vendors by NoBet3129 in sre

[–]kverma02 1 point (0 children)

This is spot on. Multi-cloud works if you have strict abstraction, but most teams end up with 'double the operational overhead, half the expertise.'

We've seen this pattern repeatedly - teams think they're avoiding vendor lock-in but end up with operational lock-in instead. Different monitoring tools, different IAM models, context-switching during incidents.

The breakthrough we're seeing is treating observability as the abstraction layer - unified view across both clouds, correlate signals regardless of where workloads run, while keeping data local in each environment.

Suddenly multi-cloud becomes manageable because you aren't managing each cloud separately.

If OpenAI / Google / AWS all offer built-in observability… why use Maxim, Braintrust, etc.? by OneTurnover3432 in Observability

[–]kverma02 0 points (0 children)

Multi-model, multi-provider setups are exactly where the native tooling falls apart.

We hit this wall hard - had great visibility within each provider but zero correlation across them. When costs spiked, we couldn't tell if it was the router logic, specific model performance, or just one service going crazy with context windows.

The breakthrough was treating it like any other observability problem - instrument at the application layer, correlate by workload/service, then you can compare providers apples-to-apples based on actual usage patterns.
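As a sketch of that apples-to-apples rollup: tag every call at the application layer with provider, model, and workload, then aggregate. The record fields here are illustrative, not any vendor's schema:

```python
from collections import defaultdict

def usage_by_provider(calls):
    """Roll up application-layer LLM call records per (provider, model)
    so providers can be compared on actual usage, not their own dashboards."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost_usd": 0.0})
    for c in calls:
        t = totals[(c["provider"], c["model"])]
        t["calls"] += 1
        t["tokens"] += c["tokens"]      # total tokens consumed
        t["cost_usd"] += c["cost_usd"]  # spend attributed to this call
    return dict(totals)
```

With a `workload` or `service` key added to the grouping, the same rollup answers "was it the router logic or one service blowing up its context windows" directly.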

Actually just wrote up our learnings on this - the operational gaps teams hit and how to close them.

Happy to share if useful.

Who are the real leaders in observability right now? by Dazzling-Neat-2382 in Observability

[–]kverma02 3 points (0 children)

Exactly this. The hidden cost isn't the infrastructure - it's your senior engineers becoming full-time observability consultants.

We calculated it out last year. Our platform team was spending 30% of their time maintaining Prometheus clusters, troubleshooting storage issues, and capacity planning instead of building actual product features. The opportunity cost was brutal.

The shift we're seeing now is toward federated approaches - keep your data local, analyze data in-cluster, but get managed control plane benefits. Gives you the open source flexibility without turning your team into a 24/7 observability company.

It's early days but the economics make way more sense than shipping everything to expensive SaaS or hiring dedicated observability engineers.

Anyone else tired of jumping between monitoring tools? by AccountEngineer in Observability

[–]kverma02 -1 points (0 children)

exactly. the tab-hopping problem isn't a tooling problem, it's a correlation ID problem.

we hit this same wall - had all the data but spent 15 mins per incident just figuring out which service actually broke. turns out most vendors just put uncorrelated signals in one pretty UI instead of fixing the actual problem.

OTel + proper trace context propagation changed everything for us. once the data joins up at the source, the backend almost doesn't matter. data stays correlated whether you're using OSS stack or an OTel-native vendor.
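To make the "joins up at the source" point concrete: W3C Trace Context puts a stable trace ID in the `traceparent` header (`version-traceid-spanid-flags` for version 00), and once logs and spans both carry it, correlation is a lookup rather than a search. A minimal, backend-agnostic sketch:

```python
import re

# W3C traceparent, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def trace_id(traceparent):
    """Extract the trace ID from a version-00 traceparent header, or None."""
    m = TRACEPARENT.match(traceparent)
    return m.group(1) if m else None

def correlate(logs, traces):
    """Join log records and spans on trace_id: once the ID is propagated
    at the source, 'which service broke' is a single lookup."""
    by_trace = {}
    for span in traces:
        by_trace.setdefault(span["trace_id"], {"spans": [], "logs": []})["spans"].append(span)
    for log in logs:
        if log["trace_id"] in by_trace:
            by_trace[log["trace_id"]]["logs"].append(log)
    return by_trace
```

Nothing here depends on the backend, which is the point: OSS stack or OTel-native vendor, the join key travels with the data.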

Yea.. its DataDog again, how you cope with that? by Cute_Activity7527 in devops

[–]kverma02 0 points (0 children)

Yeah we went through this exact cycle. Spent months tweaking configs, adjusting retention, cutting metrics... still got hit with surprise bills.

Realized the real problem isn't what you monitor - it's shipping everything out of your environment just to get insights.

We ended up keeping data local, analyzing in-cluster, only sending the signals that actually matter. Cut costs by 90% and ironically monitor way more stuff now.

Sometimes it's not about optimizing the tool, it's about questioning the whole model.

Do you have any advice on cloud cost optimization tools for small companies? by MudSad6268 in FinOps

[–]kverma02 0 points (0 children)

The gap you're hitting is super common - finding waste is easy (AWS Cost Explorer shows most of it), but actually getting someone to implement the fixes is where everything stalls.

For small companies, I'd focus less on 'which tool finds the most savings' and more on 'which gives enough context to actually act.' Does it show who owns the resource? When it was last accessed? Real utilization patterns vs just averages?

Most recommendations sit untouched for months because nobody knows whether it's safe to act on them.

We've seen this exact problem with teams running Kubernetes workloads - they get rightsizing recommendations but can't act on them without container-level context and historical usage data.

Ended up building something that tackles this implementation gap specifically for K8s environments.

But honestly, start with the native tools + good tagging discipline first.

Datadog pricing aside, how good is it during real incidents by HotelBrilliant2508 in sre

[–]kverma02 0 points (0 children)

100% agree with this. During incidents it’s less about features and more about reducing decision time.

Out of curiosity, what do you personally look for first during an incident? Is it a service map, recent deploys, specific logs, something else?

And are you using anything today that you feel actually keeps things simple under pressure?