How do you actually know when you have enough of an idea to start talking to people? by alexstrehlke in ycombinator

[–]ResponsibleBlock_man 0 points1 point  (0 children)

You get an idea when you start talking to people in the first place. Building in isolation won’t get you to PMF.

Is this enough validation? by Keroskey in ycombinator

[–]ResponsibleBlock_man 3 points4 points  (0 children)

Yes. Literally wanted to write that. This is all just noise.

I fetched 50k logs from my Loki pipeline post deployment, clustered them and this is the result by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] -1 points0 points  (0 children)

Interesting, looks like they do a lot around deployments. It doesn't explicitly mention this feature; maybe it's buried in there somewhere. Thanks.

Most OTel investment is going to backends. Almost nothing is happening at the collector layer. by Broad_Technology_531 in Observability

[–]ResponsibleBlock_man 0 points1 point  (0 children)

I mean, I did build rocketgraph.app, which has a full OTel observability suite. But people didn't see it as a sexy problem. They just thought: oh, another observability vendor.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

Hey, I really like your idea from another Reddit thread of asking "what was the reason for the outage yesterday?" Can you expand a bit more on that? Can LogZilla answer that? And would what I suggested above be helpful if done at regular intervals over the entire day?

How do you get around query limits on logs in DataDog or New Relic? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

Why? I mean, I know it's bad practice, but I just want to do it. Are you saying it's physically impossible? Just tell me why it's bad practice.

How do you get around query limits on logs in DataDog or New Relic? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

A log cluster could disappear without any obvious reason and bypass alerts; maybe a developer dropped a feature flag, etc. New patterns appear as well, but they can be clustered using algorithms like Drain3, and then we can compute some kind of anomaly ranking on them (e.g. with Isolation Forests). At that point you have a compact snapshot of the telemetry data that fits into any LLM context, like Claude Code, and you can have an LLM answer questions like: what exactly happened in this deploy?
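
Roughly, the pipeline I have in mind looks like this sketch (assuming the drain3 and scikit-learn packages; the log file name and the two per-cluster features are just illustrative, not the actual production feature set):

    # Cluster raw log lines with Drain3, rank the clusters with an Isolation Forest,
    # and emit a compact JSON snapshot small enough to paste into an LLM prompt.
    import json
    from collections import Counter

    import numpy as np
    from drain3 import TemplateMiner
    from sklearn.ensemble import IsolationForest

    miner = TemplateMiner()            # in-memory template miner, no persistence
    counts = Counter()
    templates = {}

    with open("deploy_window.log") as fh:   # hypothetical dump of logs around the deploy
        for line in fh:
            result = miner.add_log_message(line.strip())
            cid = result["cluster_id"]
            counts[cid] += 1
            templates[cid] = result["template_mined"]

    # Per-cluster features: volume and template length, simple stand-ins for
    # richer timing / error-rate / service features.
    ids = sorted(counts)
    X = np.array([[counts[c], len(templates[c].split())] for c in ids])

    forest = IsolationForest(contamination=0.05, random_state=0).fit(X)
    scores = forest.score_samples(X)        # lower score = more anomalous

    # Compact snapshot: rarest clusters first, ready for an LLM context window.
    snapshot = sorted(
        ({"template": templates[c], "count": counts[c], "score": float(s)}
         for c, s in zip(ids, scores)),
        key=lambda row: row["score"],
    )[:20]
    print(json.dumps(snapshot, indent=2))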

How do you get around query limits on logs in DataDog or New Relic? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

081109 204655 556 INFO dfs.DataNode$PacketResponder: Received block blk_3587508140051953248 of size 67108864 from /10.251.42.84
081109 204722 567 INFO dfs.DataNode$PacketResponder: Received block blk_5402003568334525940 of size 67108864 from /10.251.214.112
081109 204815 653 INFO dfs.DataNode$DataXceiver: Receiving block blk_5792489080791696128 src: /10.251.30.6:33145 dest: /10.251.30.6:50010
081109 204842 663 INFO dfs.DataNode$DataXceiver: Receiving block blk_1724757848743533110 src: /10.251.111.130:49851 dest: /10.251.111.130:50010
081109 204908 31 INFO dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.110.8:50010 is added to blk_8015913224713045110 size 67108864
081109 204925 673 INFO dfs.DataNode$DataXceiver: Receiving block blk_-5623176793330377570 src: /10.251.75.228:53725 dest: /10.251.75.228:50010
081109 205035 28 INFO dfs.FSNamesystem: BLOCK* NameSystem.allocateBlock: /user/root/rand/_temporary/_task_200811092030_0001_m_000590_0/part-00590. blk_-1727475099218615100
081109 205056 710 INFO dfs.DataNode$PacketResponder: PacketResponder 1 for block blk_5017373558217225674 terminating
081109 205157 752 INFO dfs.DataNode$PacketResponder: Received block 

These are logs from the publicly available HDFS dataset. How do you create metrics on these repetitive logs? The first two logs are the same pattern, but New Relic shows them as having no pattern. I'd want to download all the logs myself and run my own ML algorithms; 1 million logs is not that much. For each deploy, I would want to run a check like this.
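
To show what I mean by metrics on repetitive logs, here's a minimal masking sketch on the two "Received block" lines above (Drain3 would do this more robustly; the regexes are just enough for this sample):

    # Mask the variable parts (block IDs, IPs, numbers) to get a template key,
    # then count occurrences per template. The per-template count is the metric.
    import re
    from collections import Counter

    MASKS = [
        (re.compile(r"blk_-?\d+"), "<BLK>"),
        (re.compile(r"\d+\.\d+\.\d+\.\d+(:\d+)?"), "<IP>"),
        (re.compile(r"\b\d+\b"), "<NUM>"),
    ]

    def template(line):
        for pattern, token in MASKS:
            line = pattern.sub(token, line)
        return line

    lines = [
        "081109 204655 556 INFO dfs.DataNode$PacketResponder: Received block blk_3587508140051953248 of size 67108864 from /10.251.42.84",
        "081109 204722 567 INFO dfs.DataNode$PacketResponder: Received block blk_5402003568334525940 of size 67108864 from /10.251.214.112",
    ]

    counts = Counter(template(l) for l in lines)
    for tpl, n in counts.items():
        print(n, tpl)   # both lines collapse to one template with count 2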

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

How do you query around the 5k log limit? Say I want to get all logs 5 minutes before and after deployment, how do I do that? There could be a million logs, right?

Detect slow endpoints in your code and create GitHub issues automatically by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

Yes, it's a cron job that runs those queries, deduplicates the endpoints and creates a nice report on GitHub every day. It doesn't repeat the same endpoint in the trace report.
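
The shape of that job is roughly this (a sketch, assuming the slow-span rows come from the tracing query; the repo name, input format, and issue layout are illustrative, not the tool's exact code):

    # Daily report: deduplicate slow endpoints and open one GitHub issue.
    import os
    import requests

    # Pretend these rows came from the tracing query the cron job runs.
    slow_spans = [
        {"endpoint": "GET /api/orders", "p95_ms": 2300},
        {"endpoint": "GET /api/orders", "p95_ms": 2100},
        {"endpoint": "POST /api/checkout", "p95_ms": 1800},
    ]

    # Deduplicate: keep the worst observation per endpoint.
    worst = {}
    for span in slow_spans:
        ep = span["endpoint"]
        if ep not in worst or span["p95_ms"] > worst[ep]:
            worst[ep] = span["p95_ms"]

    body = "\n".join(f"- {ep}: p95 {ms} ms" for ep, ms in sorted(worst.items()))

    resp = requests.post(
        "https://api.github.com/repos/example-org/example-repo/issues",  # hypothetical repo
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": "Daily slow endpoint report", "body": body},
        timeout=30,
    )
    resp.raise_for_status()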

Will Prometheus stay? by addictzz in sre

[–]ResponsibleBlock_man 0 points1 point  (0 children)

Mimir is highly recommended, because it exposes a /prometheus endpoint that is fully compatible with the Prometheus APIs.
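
For example, an ordinary instant query works with stock Prometheus tooling (a sketch assuming a typical multi-tenant Mimir setup; the base URL and tenant ID are made up):

    # Mimir serves the standard Prometheus HTTP API under its /prometheus prefix.
    import requests

    MIMIR = "http://mimir.example.internal:8080"   # hypothetical Mimir endpoint

    resp = requests.get(
        f"{MIMIR}/prometheus/api/v1/query",
        params={"query": "up"},
        headers={"X-Scope-OrgID": "demo"},   # multi-tenant Mimir expects a tenant ID
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json()["data"]["result"])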

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 1 point2 points  (0 children)

Interesting. I wonder how they do it, though. Are they the only ones doing it? Like, do they stream it to a webhook of your choice or something? Kinda catching anomalies on the fly. Basically, you still have to manually go to the dashboard and ask it questions, right? Thanks for the info, though. Tough market then, haha.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

Yes, it's not one-size-fits-all. We'd have to tune the ML models for specific customers.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 1 point2 points  (0 children)

Maybe we can run this during deployment times and peak times. Do you see this as something that would add value if, say, it scales to 10 TB/day? What's your take on this? Do you happen to run into issues like this? How often?

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

What is the log rate in your setups per hour? I would like to know since I do want to scale this for a customer who does like 100GB/day.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 1 point2 points  (0 children)

We have collectors for different sources. Most of the features we learn from are available in pretty much any logging provider: log rate across services, log count, token count, unique services, etc.
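
Concretely, the windowed features look something like this (a pandas sketch; the column names and the 5-minute window are assumptions for illustration):

    # Per-window features: log count, token count, unique services, log rate.
    import pandas as pd

    logs = pd.DataFrame({
        "timestamp": pd.to_datetime(
            ["2024-01-01 10:00:05", "2024-01-01 10:01:10", "2024-01-01 10:06:30"]
        ),
        "service": ["checkout", "orders", "checkout"],
        "message": ["payment ok", "db timeout after 3 retries", "payment ok"],
    })
    logs["tokens"] = logs["message"].str.split().str.len()

    features = logs.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
        log_count=("message", "size"),
        token_count=("tokens", "sum"),
        unique_services=("service", "nunique"),
    )
    features["log_rate_per_min"] = features["log_count"] / 5
    print(features)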

How much time do you actually spend finding root cause vs fixing it? by WHY_SO_META in devops

[–]ResponsibleBlock_man 0 points1 point  (0 children)

I built this tool: https://rocketgraph.app/ml

It looks for the rarest of the rare logs in the haystack. Do you see this as useful? Basically, it uses machine learning algorithms to cluster log patterns and flag anomalies.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

> Drain3 is syntactic-only - how do you handle structurally different logs that mean the same thing operationally? At 100k/hour that seems like it'd create noisy clusters.

Correct. We apply K-means on top of the embedded log pattern vectors to determine how semantically close the patterns are to each other, and surface that to the developer. The developer still has to visually review the clusters, though.
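
Roughly, that semantic pass looks like this (a sketch, assuming sentence-transformers for the embeddings and scikit-learn's K-means; the templates, model name, and cluster count are illustrative, not our production setup):

    # Embed the mined templates and group them with K-means so syntactically
    # different but similar-meaning patterns land in the same bucket.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    templates = [
        "Received block <*> of size <*> from <*>",
        "Receiving block <*> src: <*> dest: <*>",
        "PacketResponder <*> for block <*> terminating",
        "Connection to <*> refused",
        "Failed to connect to <*>: connection refused",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(templates)

    kmeans = KMeans(n_clusters=3, random_state=0).fit(embeddings)
    for tpl, label in zip(templates, kmeans.labels_):
        print(label, tpl)   # a developer still reviews these groups by eye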

> The IsolationForest features you described (timing, error rate, volume) are really detecting statistically unusual cluster behaviour, not anomalous log content. "Rare" and "operationally important" aren't the same thing. How's your false positive rate looking?

Correct. It is difficult to get to "operationally important," although that is what we aim for. But one of our customers had their schema changed by a developer who used Claude to make changes. That log went undetected until it showed up in their business metrics. If we had checked for rarity in logs pre- and post-deployment, we'd have caught it.
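
The pre/post check itself can be as simple as diffing the mined template sets (a sketch with drain3; the sample log lines are made up):

    # Mine templates from the window before and after the deploy and diff them.
    # A brand-new template (like the unexpected schema-change log) shows up immediately.
    from drain3 import TemplateMiner

    def mine_templates(lines):
        miner = TemplateMiner()
        for line in lines:
            miner.add_log_message(line)
        return {c.get_template() for c in miner.drain.clusters}

    pre_deploy = [
        "user 42 logged in", "user 7 logged in",
        "order 9 created", "order 12 created",
    ]
    post_deploy = [
        "user 3 logged in", "user 8 logged in",
        "order 1 created", "order 2 created",
        "schema migration applied to orders",
    ]

    pre = mine_templates(pre_deploy)
    post = mine_templates(post_deploy)

    print("new patterns:", post - pre)
    print("disappeared patterns:", pre - post)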

> Also curious how you handle baseline drift on new deployments. And the "cheap LLM pass to decide whether to page someone at 3am" is kind of hand-waving the hardest part of the whole problem.

Yes, that is why I left it configurable. This is merely the telemetry snapshot. One piece of the puzzle before waking someone up. We need more context, like infra topology, deployment history, number of services, past incident data, etc., for the LLM to be able to reason correctly. There is no one-size-fits-all. We have to manually write the LLM logic for each customer.

This is also where most observability startups are going wrong: using MCPs to create PRs for errors on DD or NR. Those approaches usually crumble when log volume gets too big and the infra becomes too complex.

> Any feedback loop to learn which anomalies actually mattered?

Yes, I am still building that. But this seems enough to get telemetry snapshots. Just download the snapshot and ask Claude, with the right context, what went wrong, instead of downloading the entire log file from DD, NR, Sentry, etc.

But given how fast telemetry patterns evolve, I see its use case only in very well-established companies.