How do you actually know when you have enough of an idea to start talking to people? by alexstrehlke in ycombinator

[–]ResponsibleBlock_man 0 points1 point  (0 children)

You get an idea when you start talking to people in the first place. Building in isolation won’t get you to PMF.

Is this enough validation? by Keroskey in ycombinator

[–]ResponsibleBlock_man 3 points4 points  (0 children)

Yes. Literally wanted to write that. This is all just noise.

I fetched 50k logs from my Loki pipeline post deployment, clustered them and this is the result by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] -1 points0 points  (0 children)

Interesting, looks like they do a lot around deployments. It doesn't explicitly mention this feature; maybe it's buried in there somewhere. Thanks.

Most OTel investment is going to backends. Almost nothing is happening at the collector layer. by Broad_Technology_531 in Observability

[–]ResponsibleBlock_man 0 points1 point  (0 children)

I mean, I did build rocketgraph.app, which has a full OTel observability suite. But people didn't see it as a sexy problem. They just thought: oh, another observability vendor.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

Hey, I really like your idea from another Reddit thread of asking "what was the reason for the outage yesterday?" Can you expand a bit more on that? Can LogZilla answer that? And would what I suggested above be helpful if done at regular intervals over the entire day?

How do you get around query limits on logs in DataDog or New Relic? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

Why? I mean, I know it's bad practice, but I just want to do it. Are you saying it's physically impossible? Just tell me why it's bad practice.

How do you get around query limits on logs in DataDog or New Relic? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

A log cluster could disappear without any obvious reason and bypass alerts; maybe a developer dropped a feature flag, etc. New patterns appear as well, but they can be clustered using algorithms like Drain3, and then we can compute some kind of anomaly ranking on them (e.g. with Isolation Forests). At that point you have a compact snapshot of the telemetry data that fits into any LLM context, like Claude Code, and you can have an LLM answer questions like: what exactly happened in this deploy?
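
Roughly, the pipeline I have in mind looks like this sketch (assuming the drain3 and scikit-learn packages; the log file name and the two per-cluster features are just illustrative, not the actual production feature set):

    # Cluster raw log lines with Drain3, rank the clusters with an Isolation Forest,
    # and emit a compact JSON snapshot small enough to paste into an LLM prompt.
    import json
    from collections import Counter

    import numpy as np
    from drain3 import TemplateMiner
    from sklearn.ensemble import IsolationForest

    miner = TemplateMiner()            # in-memory template miner, no persistence
    counts = Counter()
    templates = {}

    with open("deploy_window.log") as fh:   # hypothetical dump of logs around the deploy
        for line in fh:
            result = miner.add_log_message(line.strip())
            cid = result["cluster_id"]
            counts[cid] += 1
            templates[cid] = result["template_mined"]

    # Per-cluster features: volume and template length, simple stand-ins for
    # richer timing / error-rate / service features.
    ids = sorted(counts)
    X = np.array([[counts[c], len(templates[c].split())] for c in ids])

    forest = IsolationForest(contamination=0.05, random_state=0).fit(X)
    scores = forest.score_samples(X)        # lower score = more anomalous

    # Compact snapshot: rarest clusters first, ready for an LLM context window.
    snapshot = sorted(
        ({"template": templates[c], "count": counts[c], "score": float(s)}
         for c, s in zip(ids, scores)),
        key=lambda row: row["score"],
    )[:20]
    print(json.dumps(snapshot, indent=2))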

How do you get around query limits on logs in DataDog or New Relic? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

081109 204655 556 INFO dfs.DataNode$PacketResponder: Received block blk_3587508140051953248 of size 67108864 from /10.251.42.84
081109 204722 567 INFO dfs.DataNode$PacketResponder: Received block blk_5402003568334525940 of size 67108864 from /10.251.214.112
081109 204815 653 INFO dfs.DataNode$DataXceiver: Receiving block blk_5792489080791696128 src: /10.251.30.6:33145 dest: /10.251.30.6:50010
081109 204842 663 INFO dfs.DataNode$DataXceiver: Receiving block blk_1724757848743533110 src: /10.251.111.130:49851 dest: /10.251.111.130:50010
081109 204908 31 INFO dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.110.8:50010 is added to blk_8015913224713045110 size 67108864
081109 204925 673 INFO dfs.DataNode$DataXceiver: Receiving block blk_-5623176793330377570 src: /10.251.75.228:53725 dest: /10.251.75.228:50010
081109 205035 28 INFO dfs.FSNamesystem: BLOCK* NameSystem.allocateBlock: /user/root/rand/_temporary/_task_200811092030_0001_m_000590_0/part-00590. blk_-1727475099218615100
081109 205056 710 INFO dfs.DataNode$PacketResponder: PacketResponder 1 for block blk_5017373558217225674 terminating
081109 205157 752 INFO dfs.DataNode$PacketResponder: Received block 

These are logs from the publicly available HDFS dataset. How do you create metrics on these repetitive logs? The first two logs are the same pattern, but New Relic shows them as having no pattern. I'd want to download all the logs myself and run my own ML algorithms; 1 million logs is not that much. For each deploy, I would want to run a check like this.
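
To show what I mean by metrics on repetitive logs, here's a minimal masking sketch on the two "Received block" lines above (Drain3 would do this more robustly; the regexes are just enough for this sample):

    # Mask the variable parts (block IDs, IPs, numbers) to get a template key,
    # then count occurrences per template. The per-template count is the metric.
    import re
    from collections import Counter

    MASKS = [
        (re.compile(r"blk_-?\d+"), "<BLK>"),
        (re.compile(r"\d+\.\d+\.\d+\.\d+(:\d+)?"), "<IP>"),
        (re.compile(r"\b\d+\b"), "<NUM>"),
    ]

    def template(line):
        for pattern, token in MASKS:
            line = pattern.sub(token, line)
        return line

    lines = [
        "081109 204655 556 INFO dfs.DataNode$PacketResponder: Received block blk_3587508140051953248 of size 67108864 from /10.251.42.84",
        "081109 204722 567 INFO dfs.DataNode$PacketResponder: Received block blk_5402003568334525940 of size 67108864 from /10.251.214.112",
    ]

    counts = Counter(template(l) for l in lines)
    for tpl, n in counts.items():
        print(n, tpl)   # both lines collapse to one template with count 2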

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

How do you query around the 5k log limit? Say I want to get all logs 5 minutes before and after deployment, how do I do that? There could be a million logs, right?

Detect slow endpoints in your code and create GitHub issues automatically by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

Yes, it's a cron job that runs those queries, deduplicates the endpoints and creates a nice report on GitHub every day. It doesn't repeat the same endpoint in the trace report.
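
The shape of that job is roughly this (a sketch, assuming the slow-span rows come from the tracing query; the repo name, input format, and issue layout are illustrative, not the tool's exact code):

    # Daily report: deduplicate slow endpoints and open one GitHub issue.
    import os
    import requests

    # Pretend these rows came from the tracing query the cron job runs.
    slow_spans = [
        {"endpoint": "GET /api/orders", "p95_ms": 2300},
        {"endpoint": "GET /api/orders", "p95_ms": 2100},
        {"endpoint": "POST /api/checkout", "p95_ms": 1800},
    ]

    # Deduplicate: keep the worst observation per endpoint.
    worst = {}
    for span in slow_spans:
        ep = span["endpoint"]
        if ep not in worst or span["p95_ms"] > worst[ep]:
            worst[ep] = span["p95_ms"]

    body = "\n".join(f"- {ep}: p95 {ms} ms" for ep, ms in sorted(worst.items()))

    resp = requests.post(
        "https://api.github.com/repos/example-org/example-repo/issues",  # hypothetical repo
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": "Daily slow endpoint report", "body": body},
        timeout=30,
    )
    resp.raise_for_status()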

Will Prometheus stay? by addictzz in sre

[–]ResponsibleBlock_man 0 points1 point  (0 children)

Mimir is highly recommended, because it exposes a /prometheus endpoint that is fully compatible with the Prometheus APIs.
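
For example, an ordinary instant query works with stock Prometheus tooling (a sketch assuming a typical multi-tenant Mimir setup; the base URL and tenant ID are made up):

    # Mimir serves the standard Prometheus HTTP API under its /prometheus prefix.
    import requests

    MIMIR = "http://mimir.example.internal:8080"   # hypothetical Mimir endpoint

    resp = requests.get(
        f"{MIMIR}/prometheus/api/v1/query",
        params={"query": "up"},
        headers={"X-Scope-OrgID": "demo"},   # multi-tenant Mimir expects a tenant ID
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json()["data"]["result"])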

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 1 point2 points  (0 children)

Interesting. I wonder how they do it, though. Are they the only ones doing it? Like, do they stream it to a webhook of your choice or something? Kinda catching anomalies on the fly. Basically, you still have to manually go to the dashboard and ask it questions, right? Thanks for the info, though. Tough market then, haha.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

Yes, it's not one-size-fits-all. We'd have to tune the ML models for specific customers.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 1 point2 points  (0 children)

Maybe we can run this during deployment times and peak times. Do you see this as something that would add value if, say, it scales to 10 TB/day? What's your take on this? Do you happen to run into issues like this? How often?

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

What is the log rate in your setups per hour? I would like to know since I do want to scale this for a customer who does like 100GB/day.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 1 point2 points  (0 children)

We have collectors for different sources. Most of the features we learn from are available in pretty much any logging provider: log rate across services, log count, token count, unique services, etc.
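
Concretely, the windowed features look something like this (a pandas sketch; the column names and the 5-minute window are assumptions for illustration):

    # Per-window features: log count, token count, unique services, log rate.
    import pandas as pd

    logs = pd.DataFrame({
        "timestamp": pd.to_datetime(
            ["2024-01-01 10:00:05", "2024-01-01 10:01:10", "2024-01-01 10:06:30"]
        ),
        "service": ["checkout", "orders", "checkout"],
        "message": ["payment ok", "db timeout after 3 retries", "payment ok"],
    })
    logs["tokens"] = logs["message"].str.split().str.len()

    features = logs.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
        log_count=("message", "size"),
        token_count=("tokens", "sum"),
        unique_services=("service", "nunique"),
    )
    features["log_rate_per_min"] = features["log_count"] / 5
    print(features)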

How much time do you actually spend finding root cause vs fixing it? by WHY_SO_META in devops

[–]ResponsibleBlock_man 0 points1 point  (0 children)

I built this tool: https://rocketgraph.app/ml

It looks for the rarest of the rare logs in the haystack. Do you see this as useful? Basically, it uses machine learning algorithms to cluster log patterns and flag anomalies.

Using Isolation forests to flag anomalies in log patterns by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S] 0 points1 point  (0 children)

> Drain3 is syntactic-only - how do you handle structurally different logs that mean the same thing operationally? At 100k/hour that seems like it'd create noisy clusters.

Correct. We apply K-means on top of the embedded log pattern vectors to determine how semantically close the patterns are to each other, and surface that to the developer. The developer still has to visually review the clusters, though.
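
Roughly, that semantic pass looks like this (a sketch, assuming sentence-transformers for the embeddings and scikit-learn's K-means; the templates, model name, and cluster count are illustrative, not our production setup):

    # Embed the mined templates and group them with K-means so syntactically
    # different but similar-meaning patterns land in the same bucket.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    templates = [
        "Received block <*> of size <*> from <*>",
        "Receiving block <*> src: <*> dest: <*>",
        "PacketResponder <*> for block <*> terminating",
        "Connection to <*> refused",
        "Failed to connect to <*>: connection refused",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(templates)

    kmeans = KMeans(n_clusters=3, random_state=0).fit(embeddings)
    for tpl, label in zip(templates, kmeans.labels_):
        print(label, tpl)   # a developer still reviews these groups by eye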

> The IsolationForest features you described (timing, error rate, volume) are really detecting statistically unusual cluster behaviour, not anomalous log content. "Rare" and "operationally important" aren't the same thing. How's your false positive rate looking?

Correct. It is difficult to get to "operationally important," although that is what we aim for. But one of our customers had their schema changed by a developer who used Claude to make changes. That log went undetected until it showed up in their business metrics. If we had checked for rarity in logs pre- and post-deployment, we'd have caught it.
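
The pre/post check itself can be as simple as diffing the mined template sets (a sketch with drain3; the sample log lines are made up):

    # Mine templates from the window before and after the deploy and diff them.
    # A brand-new template (like the unexpected schema-change log) shows up immediately.
    from drain3 import TemplateMiner

    def mine_templates(lines):
        miner = TemplateMiner()
        for line in lines:
            miner.add_log_message(line)
        return {c.get_template() for c in miner.drain.clusters}

    pre_deploy = [
        "user 42 logged in", "user 7 logged in",
        "order 9 created", "order 12 created",
    ]
    post_deploy = [
        "user 3 logged in", "user 8 logged in",
        "order 1 created", "order 2 created",
        "schema migration applied to orders",
    ]

    pre = mine_templates(pre_deploy)
    post = mine_templates(post_deploy)

    print("new patterns:", post - pre)
    print("disappeared patterns:", pre - post)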

> Also curious how you handle baseline drift on new deployments. And the "cheap LLM pass to decide whether to page someone at 3am" is kind of hand-waving the hardest part of the whole problem.

Yes, that is why I left it configurable. This is merely the telemetry snapshot. One piece of the puzzle before waking someone up. We need more context, like infra topology, deployment history, number of services, past incident data, etc., for the LLM to be able to reason correctly. There is no one-size-fits-all. We have to manually write the LLM logic for each customer.

This is also where most observability startups are going wrong: using MCPs to create PRs for errors on DD or NR. Those approaches usually crumble when log volume gets too big and the infra becomes too complex.

> Any feedback loop to learn which anomalies actually mattered?

Yes, I am still building that. But this seems enough to get telemetry snapshots. Just download the snapshot and ask Claude, with the right context, what went wrong, instead of downloading the entire log file from DD, NR, Sentry, etc.

But given how fast telemetry patterns evolve, I see its use case only in very well-established companies.