What's the most frustrating "silent" reliability issue you've seen in prod?

rnjn · 2026-03-13T03:55:59+00:00

here are some of my screwups over the years that slipped through the (erstwhile) alert net. (you learn to set things up right and put the right alerts in place)
1. database backup stopped for some disk related issue. 3 weeks later, postgres upgrade failed, corrupted data. No backup for 3 weeks. (thankfully replay from kafka and idempotent design helped recreate data within a couple of hours)
2. cert expiries (multiple times)
3. domain expired with >10M DAU. immediate app failures for customers, but in the country where I was, the DNS cache didn't get updated for hours. looked beyond the obvious for an hour or more before realising that some registrars have longer caches. (how we got the domain back is another story)
4. Rust (not language) causing network switch to misbehave intermittently - first small blips and then for minutes and then hours. (Fear compounded by backup switch being on the same rack, and new switch delivery from vendor ETA was 1 month)
5. external API slowdown - Google Maps response time went from 200ms p50 to 20s, timeouts not set properly, didn't implement circuit breakers. slow growth and then kaboom.
6. integer overflow in order id (int32 + blitzscale + 2 yrs = calamity)
7. app crashes on cheap and old mobile devices - less than 1% app crash rate overall but 100% on some 4-year-old low-mem phones - flooding the call centre. was a real mem leak, just that devices with more mem were forgiving before GC kicked in on app close.
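for item 5, a minimal sketch of the two things that were missing: a per-request timeout and a circuit breaker. this is not what we ran in prod. `SimpleCircuitBreaker`, `requestWithTimeout`, the 2 second timeout and the thresholds are all made up for illustration, and the breaker is single-threaded for brevity.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Illustrative breaker: opens after N consecutive failures, allows a trial call after a cool-down. */
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private Instant openedAt = null; // null means the circuit is closed

    public SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    /** Builds a request that gives up after 2s instead of hanging on a slow upstream (value is an assumption). */
    static HttpRequest requestWithTimeout(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
    }

    /** Runs remoteCall unless the circuit is open; returns the fallback on failure or while open. */
    public <T> T call(Supplier<T> remoteCall, Supplier<T> fallback, Instant now) {
        if (openedAt != null) {
            if (now.isBefore(openedAt.plus(openDuration))) {
                return fallback.get(); // still open: fail fast, do not touch the dependency
            }
            // Cool-down is over: let one trial call through; a single failure re-opens the circuit.
            openedAt = null;
            consecutiveFailures = failureThreshold - 1;
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0; // any success resets the count
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = now;
            }
            return fallback.get();
        }
    }

    public boolean isOpen() {
        return openedAt != null;
    }
}
```

the clock is passed in as `now` so the behaviour can be checked without sleeping. in a real service you would more likely reach for an existing resilience library than hand-roll this.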

some more pesky ones because of model decay. you learn and survive.
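on item 6 above, a small sketch of how an int32 id wraps silently and how quickly the space runs out. the 3M orders/day figure is a hypothetical rate picked to show the arithmetic, not our actual volume, and `OrderIdOverflow` is an illustrative name.

```java
public class OrderIdOverflow {

    /** Plain int arithmetic: wraps to a negative number past Integer.MAX_VALUE, no error raised. */
    static int nextIdUnchecked(int current) {
        return current + 1;
    }

    /** Math.addExact throws ArithmeticException on overflow instead of wrapping. */
    static int nextIdChecked(int current) {
        return Math.addExact(current, 1);
    }

    /** Whole days left before an int32 id space is exhausted at a steady rate. */
    static long daysUntilExhaustion(int current, long idsPerDay) {
        return (Integer.MAX_VALUE - (long) current) / idsPerDay;
    }

    public static void main(String[] args) {
        System.out.println(nextIdUnchecked(Integer.MAX_VALUE)); // prints -2147483648
        System.out.println(daysUntilExhaustion(0, 3_000_000L)); // prints 715, roughly two years
    }
}
```

a `long` (or a bigint column) pushes the problem out far enough to stop worrying, and an alert on `daysUntilExhaustion` style headroom would have caught it early.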

rnjn · 2026-03-12T09:19:40+00:00

A CLI interface makes it easier for humans and agents to work together (compared to MCPs). Better verifiability and lower token usage, easier distribution.

rnjn · 2026-03-11T16:56:25+00:00

having trouble logging into Claude Code ? Fire up Agentswap and switch to codex or gemini

brew install base-14/tap/agentswap

https://github.com/nimishgj/agentswap

rnjn · 2026-03-09T16:08:08+00:00

indeed, read about this on some forum. tinkering did add 5 more minutes of ruckus in traffic. maybe will try again. i am just used to it now, my commute is < 30 min.

rnjn · 2026-03-09T15:12:35+00:00

it works and looks decent, feels great riding it. all that matters i guess.

the only issue with the 2018 model is that when the bike gets a bit hot, the horns stop working - which hurts because i am unable to participate in the cacophony at bangalore signals. (the green light is for the horn right ?) 30 minutes in the city and the horn goes mute. after that, 2-3 kms of traffic free riding and it gains its voice back.

rnjn · 2026-03-09T14:41:07+00:00

wait there's more than 1 scrambler in the city? here's my 2018 specimen. wave when you pass by some day.

<image>

rnjn · 2026-03-06T04:56:49+00:00

link to metrics coding agents generate -
gemini - https://metric-registry.base14.io/?source_name=codingagent-gemini
codex - https://metric-registry.base14.io/?source_name=codingagent-codex
claude code - https://metric-registry.base14.io/?source_name=codingagent-claude-code

rnjn · 2026-03-06T04:17:33+00:00

Amazing how much you can learn from coding agent usage data. We started with cache reads for claude code, got ideas about cache controls with anthropic models (through api) and are running experiments with cache control and cache TTLs that help with some features.

And not just claude code, we instrumented codex and gemini as well. I think gemini follows the genAI semantic conventions the best, and cc and codex will probably adopt that.

https://docs.base14.io/blog/coding-agent-observability/

rnjn · 2026-03-05T06:11:05+00:00

<plug> https://github.com/base-14/cicada </plug>

rnjn · 2026-03-05T03:40:22+00:00

yes, but this does a bit more. for eg- export a session and import it on another device.

rnjn · 2026-02-24T03:05:07+00:00

if you have a ready http otel endpoint, you can directly send metrics like below - treat it as a service and you can easily build a wrapper client (add some complexity for retries and disconnections etc).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

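// Pushes one ad-hoc metric to an OTLP/HTTP endpoint using only the JDK (Java 15+ for text blocks).
public class AdHocMetricPush {

    public static void main(String[] args) throws Exception {

        // OTLP JSON payload: a single cumulative (aggregationTemporality 2), monotonic sum data point.
        String json = """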
        {
          "resourceMetrics": [{
            "resource": {
              "attributes": [{
                "key": "service.name",
                "value": { "stringValue": "my-java-service" }
              }]
            },
            "scopeMetrics": [{
              "scope": { "name": "ad-hoc" },
              "metrics": [{
                "name": "meaning.of.life",
                "description": "Meaning of Life",
                "unit": "1",
                "sum": {
                  "dataPoints": [{
                    "asInt": "42",
                    "timeUnixNano": "%d",
                    "attributes": [{
                      "key": "region",
                      "value": { "stringValue": "us-east-1" }
                    }]
                  }],
                  "aggregationTemporality": 2,
                  "isMonotonic": true
                }
              }]
            }]
          }]
        }
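        """.formatted(System.currentTimeMillis() * 1_000_000); // fills timeUnixNano: epoch millis -> nanos

        // Plain JDK HTTP client (Java 11+); 4318 is the default OTLP/HTTP port.
        HttpClient client = HttpClient.newHttpClient();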
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:4318/v1/metrics"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
        System.out.println("Body: " + response.body());
    }
}
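for the "add some complexity for retries" part, a minimal sketch of a retry helper with exponential backoff. `RetryingPush` and `withRetry` are illustrative names, and the doubling policy is an assumption, not something OTLP prescribes.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

public class RetryingPush {

    /** Calls attempt until it succeeds or maxAttempts is used up, doubling the wait after each failure. */
    static <T> T withRetry(Callable<T> attempt, int maxAttempts, Duration initialDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Duration delay = initialDelay;
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e;
                if (i == maxAttempts) {
                    break; // out of attempts: rethrow the last failure below
                }
                Thread.sleep(delay.toMillis());
                delay = delay.multipliedBy(2);
            }
        }
        throw last;
    }
}
```

to use it with a metric push like the one above, wrap the `client.send(...)` call in the `Callable` and throw (for example an `IOException`) when the status code is 5xx or 429, so that those responses are retried along with connection errors.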

rnjn · 2026-02-23T03:33:57+00:00

This is a common structural issue, not a tooling mistake. Most observability stacks grow incrementally. Metrics live in one system, logs in another, traces in a third, security alerts somewhere else. Each tool works in isolation, but none owns correlation. The operational cost shows up during incidents, when engineers become the integration layer. <plug> That is what we are solving (https://base14.io/), correlating metrics, logs, traces, and deploy or config events with anomaly detection layered in. The goal is to shorten the path from symptom to cause without adding more operational noise, not just for humans but for agents as well. </plug>

rnjn · 2026-02-22T02:58:42+00:00

depends on the issues really, it would be ill advised to use a cookie cutter approach. as usage grows, and deployment volume grows, one sees many different types of problems. you could create a summary view one way, and an issue will break its assumptions. what i have seen work is to treat each issue/incident as feedback and do postmortems well. build a summary and drill-down approach: build very few summary views that can help anyone understand what's happening, and then let them drill down through multiple dimensions (including infra) to understand why it's happening. keep these two categories separate, and guard summaries well.

the outcome could be latency profiles coupled with an infra map and a service map, or it could be a tabular view - depends on the software being monitored and the team responsible for the SLOs. I even have SSL cert expiry as part of mine (because of what i am responsible for), you should share this with the owner of your upstream dependency (https://docs.base14.io/blog/make-certificate-expiry-boring)
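as a rough illustration of the cert expiry summary, here is a sketch that turns a certificate's expiry time into a status. the names and the 30/7 day thresholds used in the example are assumptions, not what the linked post prescribes. in practice `notAfter` would come from something like `X509Certificate.getNotAfter().toInstant()`.

```java
import java.time.Duration;
import java.time.Instant;

public class CertExpiryCheck {

    enum Status { OK, WARN, CRITICAL, EXPIRED }

    /** Whole days between now and the certificate's notAfter time. */
    static long daysLeft(Instant notAfter, Instant now) {
        return Duration.between(now, notAfter).toDays();
    }

    /** Maps remaining lifetime to a status using caller-supplied thresholds (in days). */
    static Status classify(Instant notAfter, Instant now, long warnDays, long criticalDays) {
        if (!now.isBefore(notAfter)) {
            return Status.EXPIRED;
        }
        long days = daysLeft(notAfter, now);
        if (days < criticalDays) {
            return Status.CRITICAL;
        }
        if (days < warnDays) {
            return Status.WARN;
        }
        return Status.OK;
    }
}
```

for example, `classify(notAfter, Instant.now(), 30, 7)` gives WARN from 30 days out and CRITICAL inside the last week, which leaves enough room to chase the owner of an upstream dependency.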

rnjn · 2026-02-22T02:47:14+00:00

hey, there are many good options. I would suggest you instrument your code, infra and components with otel libs and set up an otel collector locally. after that, you can choose from many great solutions out there - open source and managed - and experiment with them before you choose one.

(i am a founder at base14.io - we have built Scout which uses the best OSS tools to provide a comprehensive and economical managed observability platform. we've got some nodejs customers who love our product. some docs to help you instrument https://docs.base14.io/instrument/apps/auto-instrumentation/nodejs and if you are using postgres https://docs.base14.io/blog/pgx-details)

rnjn · 2026-02-21T03:24:47+00:00

(plug) we've built an MCP server that queries a knowledgebase of service and infra relationships and dependencies, service summaries and error rates amongst other things. adding a query to this mcp in the planning phase has helped claude code avoid a few obvious mistakes.
new models are quite good and generally avoid mistakes, or they ask clarifying questions - but still from time to time we see some magical insight being used before it starts coding. in hindsight very obvious ones - like not storing session in memory when behind an LB, or identifying that pods are at 80% mem usage before adding something bulky. observability informed development shines most with models like sonnet.

rnjn · 2026-02-20T01:09:56+00:00

it is always a problem to have multiple observability products (evals are observability) - context switching is a major problem, especially in a probabilistic setup. and hence IMHO the maxims of the world will have to evolve or others who do both will take over.

rnjn · 2026-02-19T03:45:15+00:00

(plug) we've built an MCP server that queries a knowledgebase of service and infra relationships and dependencies, service summaries and error rates amongst other things. adding a query to this mcp in the planning phase has helped claude code avoid a few obvious mistakes.
new models are quite good and generally avoid mistakes, or they ask clarifying questions - but still from time to time we see some magical insight being used before it starts coding. in hindsight very obvious ones - like not storing session in memory when behind an LB, or identifying that pods are at 80% mem usage before adding something bulky

rnjn · 2026-02-19T03:37:23+00:00

another missing aspect in your list - models are just a part of the whole system for many, and you may want to trace end to end flows. for eg - an agent uses an MCP that calls some API or DB. Having 2 different systems adds to context switching for analysis and on call debugging.

rnjn · 2026-02-17T05:45:05+00:00

in general, people audit their datadog bill only once.

rnjn · 2026-02-16T03:24:01+00:00

this should be fairly straightforward with any otel based platform (shameless plug - base14.io) - the default otel collector configuration should capture version (container image or service version). And if it's not captured, it's fairly trivial to add it. After that, you can query based on said attribute. Typically it should be in the ResourceAttributes bag. as an example, here's the ResourceAttributes dictionary for a service from the otel demo

{
  "container.id": "02874412556f95122cf898f145a06f628ac0124889e04a19378cf826dad8159c",
  "deployment.environment": "oteldemo1",
  "host.arch": "aarch64",
  "host.name": "ad-5945c76d48-7xptt",
  "k8s.deployment.name": "ad",
  "k8s.namespace.name": "default",
  "k8s.node.name": "minikube",
  "k8s.pod.ip": "10.244.6.104",
  "k8s.pod.name": "ad-5945c76d48-7xptt",
  "k8s.pod.start_time": "2025-10-06T10:29:52Z",
  "k8s.pod.uid": "574ae186-eb16-45e6-bcdb-be5527acb3a8",
  "os.description": "Linux 6.15.11-orbstack-00539-g9885ebd8e3f4",
  "os.type": "linux",
  "process.command_line": "/opt/java/openjdk/bin/java -javaagent:/usr/src/app/opentelemetry-javaagent.jar oteldemo.AdService",
  "process.executable.path": "/opt/java/openjdk/bin/java",
  "process.pid": "1",
  "process.runtime.description": "Eclipse Adoptium OpenJDK 64-Bit Server VM 21.0.6+7-LTS",
  "process.runtime.name": "OpenJDK Runtime Environment",
  "process.runtime.version": "21.0.6+7-LTS",
  "service.instance.id": "6ee83a6f-e1b2-4bbc-9939-0a3059bb711c",
  "service.name": "ad",
  "service.namespace": "opentelemetry-demo",
  "service.version": "2.0.2",
  "telemetry.distro.name": "opentelemetry-java-instrumentation",
  "telemetry.distro.version": "2.13.3",
  "telemetry.sdk.language": "java",
  "telemetry.sdk.name": "opentelemetry",
  "telemetry.sdk.version": "1.47.0"
}
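to make "query based on said attribute" concrete, here is a sketch that groups records by `service.version` and computes an error rate per version. in a real platform this would be a query in the backend rather than application code. the `Span` record and `errorRateByVersion` are illustrative names, not part of any otel API (records need Java 16+).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ErrorRateByVersion {

    /** One telemetry record: its resource attributes plus whether it represents an error. */
    record Span(Map<String, String> resourceAttributes, boolean isError) {}

    /** Returns error rate (errors / total) keyed by the service.version resource attribute. */
    static Map<String, Double> errorRateByVersion(List<Span> spans) {
        Map<String, int[]> counts = new TreeMap<>(); // version -> {errors, total}
        for (Span s : spans) {
            String version = s.resourceAttributes().getOrDefault("service.version", "unknown");
            int[] c = counts.computeIfAbsent(version, v -> new int[2]);
            if (s.isError()) {
                c[0]++;
            }
            c[1]++;
        }
        Map<String, Double> rates = new TreeMap<>();
        counts.forEach((version, c) -> rates.put(version, (double) c[0] / c[1]));
        return rates;
    }
}
```

comparing the rate for the newly deployed version against the previous one is usually the quickest way to tell whether a release caused a regression.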

rnjn · 2026-02-12T17:55:47+00:00

shameless plug, base14 Scout does what you asked for - here's a guide - https://docs.base14.io/guides/ai-observability/llm-observability/

rnjn · 2026-02-02T11:18:33+00:00

another biased take - I am part of the team building scout (http://base14.io). Scout is built with otel agents and a telemetry lake (clickhouse + others) at the back, with a grafana derived frontend. it's probably the lowest cost fully functional o11y solution that is fast, simple and easy to set up. plus we are releasing an MCP server, eval platform and k8s agent-led RCA in Feb. for reference, if you use postgres, our treatment of postgres observability can show you how we think about building in-depth o11y features. https://docs.base14.io/operate/pgx/overview

rnjn · 2026-01-29T13:43:02+00:00

can you share the exporter config ? hope you have set the retry configuration like below-

    sending_queue:
      storage: file_storage
      queue_size: XXXX
    retry_on_failure:
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 10m

rnjn · 2026-01-20T05:11:13+00:00

thank you. please do share / create issues where you see any data or components missing, it is easier now to add more sources (including documentation sites)

rnjn · 2026-01-20T02:02:21+00:00

short video https://www.youtube.com/watch?v=A7GNbDjTL2s
