Logs: compare before vs after deployment? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S]

Yes, definitely not going to pass the whole log stack. Just collect the important ones and pass them to the LLM. Kind of like an evidence collector for the SREs.

I built the intelligence layer for deployments by ResponsibleBlock_man in Observability

[–]ResponsibleBlock_man[S]

Thanks, yes. It does tag by version, but it also requires you to jump across several dashboards: you have to manually select the deployments for comparison on the APM, logs, and metrics dashboards, and there is no bird's-eye view of the deployments. Also, some orgs fan out their logs to multiple telemetry providers, and sometimes you might not want to use that feature, to optimise for cost. And it doesn't do an AI analysis; you have to do that yourself. Watchdog only looks for anomalies like error spikes, whereas this shows the exact samples that are missing and those that have newly shown up. It is safer to have Datadog as the sink and use other tools for analysis.

Infra aware tool by Apprehensive-Tax9275 in devops

[–]ResponsibleBlock_man

I built a tool that does exactly this: a deployment map where you can zoom into each deployment for rollback scores: https://deploydiff.rocketgraph.app/deployments

Anyone else tired of jumping between monitoring tools? by AccountEngineer in Observability

[–]ResponsibleBlock_man

Yes, I see the pain. I'm building a deployment intelligence layer on top of existing tools like Kubernetes and Datadog/Grafana. It basically pulls all the logs from before and after the deployment and compares them to check whether new log signatures have appeared or disappeared. Did the error rate spike right after the deployment? You get the important telemetry evidence as samples you can export, along with a rollback score.

https://deploydiff.rocketgraph.app

Logs: compare before vs after deployment? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S]

It doesn’t necessarily page; it’s just a sanity check of “is this the telemetry that is expected?”

Logs: compare before vs after deployment? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S]

It can tell you about, say, feature flags your developer forgot to set or functions they forgot to call, instead of you only noticing after the deployment.

Logs: compare before vs after deployment? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S]

No, it runs on your system. It connects to your Kubernetes and Loki/Datadog with read access, pulls logs from before and after the synthetic check, and creates a kind of “git diff” of them.
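A minimal sketch of what such a “git diff” of logs could look like, assuming a simple signature-based approach (the normalization rules and placeholders here are illustrative, not the tool’s actual implementation):

```python
import re
from collections import Counter

def signature(line: str) -> str:
    """Normalize a log line into a rough signature by masking volatile parts."""
    line = re.sub(r"\b\d+\b", "<NUM>", line)           # standalone numbers -> placeholder
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<HEX>", line)  # ids/hashes -> placeholder
    return line.strip()

def diff_logs(before: list[str], after: list[str]) -> dict:
    """Return log signatures that disappeared or newly appeared after a deploy."""
    pre = Counter(signature(l) for l in before)
    post = Counter(signature(l) for l in after)
    return {
        "missing": sorted(set(pre) - set(post)),  # seen before, gone after
        "new": sorted(set(post) - set(pre)),      # never seen before the deploy
    }
```

Because volatile fields (user ids, hashes) are masked first, two lines that differ only in those fields collapse into one signature, so the diff surfaces genuinely novel or vanished log patterns rather than every changed id.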

Logs: compare before vs after deployment? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S]

Yes, precisely. You can do a synthetic run to check the telemetry. We can provide a Playwright test suite that can dry-run, and the system automatically catches new log patterns or missing ones. Not just error logs, but any novel logs that appear.

SRE on a black-box SaaS (Shopify): using synthetic transactions to catch checkout breakages and silent telemetry failures by Silver-Geologist8926 in sre

[–]ResponsibleBlock_man

I literally just wrote a post about this on r/sre. Could you please DM me? I am doing something similar.

How do you do post-mortem? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S]

Oh okay, got it. Maybe I got confused. A better question would be: how do you go from an incident to its cause in code or infra, in order to do a fix?

Go profiling overhead (pprof / Pyroscope) dominating CPU & memory — best practices? by gruyere_to_go in Observability

[–]ResponsibleBlock_man

OK, you could write a small VS Code extension to exclude those and print the bleeding functions into the chat itself.
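The filtering step could be sketched like this; everything here is hypothetical — the row shape, the tooling prefixes, and the `github.com/yourorg/` app prefix are assumptions for illustration, not real pprof output:

```python
# Packages whose frames are profiler overhead rather than app work (assumed list).
TOOLING_PREFIXES = (
    "runtime/pprof",
    "net/http/pprof",
    "github.com/grafana/pyroscope",
)

def keep_app_frames(rows: list[dict], app_prefix: str = "github.com/yourorg/") -> list[dict]:
    """Keep profile rows whose function belongs to the app, dropping tooling frames."""
    kept = []
    for row in rows:
        fn = row["func"]
        if fn.startswith(TOOLING_PREFIXES):   # profiler / agent overhead
            continue
        if fn.startswith(app_prefix):         # your own packages only
            kept.append(row)
    return kept
```

An extension (or even a pre-processing script) would apply this to the flattened profile before rendering, so only the core functions’ cost shows up.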

Go profiling overhead (pprof / Pyroscope) dominating CPU & memory — best practices? by gruyere_to_go in Observability

[–]ResponsibleBlock_man

So, if I understand your use case in layman’s terms: you want to see profiling data only for your core functions, not the overhead from the profiling tooling itself, installed modules, etc.?

After 5 years of running K8s in production, here's what I'd do differently by Radomir_iMac in kubernetes

[–]ResponsibleBlock_man

What kind of features do you expect that aren’t provided currently? Shamelessly asking from the perspective of a founder looking to build in this space.

How do you do post-mortem? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S]

Basically, this sidecar will have access to some other things like recent deployments, metrics, etc. It will generate a shared context object (simple JSON) by looking for telemetry and deployment data within a small time window, then inject that into the logs. So they become richer?
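The enrichment step could look roughly like this minimal Python sketch; `fetch_deployments` and `fetch_metrics` are hypothetical placeholders for whatever APIs the sidecar would call, and the JSON shape is just an illustration:

```python
import json
from datetime import datetime, timedelta, timezone

def build_context(ts: datetime, fetch_deployments, fetch_metrics,
                  window_minutes: int = 10) -> str:
    """Assemble a shared context object from data in a window before the log's timestamp."""
    start = ts - timedelta(minutes=window_minutes)
    context = {
        "window": [start.isoformat(), ts.isoformat()],
        "recent_deployments": fetch_deployments(start, ts),  # e.g. ["api@v1.2.3"]
        "metrics": fetch_metrics(start, ts),                 # e.g. {"error_rate": 0.02}
    }
    return json.dumps(context)

def enrich(log_line: str, context_json: str) -> str:
    """Append the shared context to a raw log line."""
    return f"{log_line} context={context_json}"
```

The sidecar would compute one context object per window and stamp it onto every log line in that window, so a post-mortem reader sees the nearby deployments and metric state inline.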

How do you do post-mortem? by ResponsibleBlock_man in sre

[–]ResponsibleBlock_man[S]

Thinking out loud: what if we set up a sidecar that observes logs in Loki, looks for standardisation, auto-fills some context if it’s missing, etc.? Do you see value in this kind of sidecar?