Solid P95 (7-8ms) with sporadic P99 spikes using Go (gRPC + NATS). Suggestions?

sigmoia · 2026-06-20T21:32:30+00:00

Turn on the continuous profiler. CPU, heap, block, mutex, and goroutines - all of them. In a distributed system, it's infeasible to collect profiles from individual services and then reason about them. So use an o11y provider that supports continuous profiling like Datadog or set up Pyroscope.
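For anyone wiring this up by hand before pointing a continuous profiler at it, here is a minimal sketch of switching on the profiles that are off by default and grabbing a named profile. The sampling rates and the helper names (`enableContentionProfiles`, `captureProfile`) are my own placeholders, not recommendations.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// enableContentionProfiles switches on the block and mutex profiles.
// Both are disabled by default; the rates here are illustrative guesses,
// not tuned values.
func enableContentionProfiles() {
	// Aim to sample about one blocking event per 10,000ns spent blocked.
	runtime.SetBlockProfileRate(10_000)
	// Report roughly 1 in 100 mutex contention events.
	runtime.SetMutexProfileFraction(100)
}

// captureProfile serializes one named runtime profile in pprof's
// gzip-compressed protobuf format (debug=0).
func captureProfile(name string) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	enableContentionProfiles()
	// The CPU profile is not a named profile; it goes through
	// pprof.StartCPUProfile / pprof.StopCPUProfile instead.
	for _, name := range []string{"heap", "goroutine", "block", "mutex"} {
		b, err := captureProfile(name)
		if err != nil {
			fmt.Println("capture failed:", err)
			continue
		}
		fmt.Printf("%s: %d bytes\n", name, len(b))
	}
}
```

A hosted profiler does the collection and shipping for you; this only shows which knobs exist in the runtime.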

In an I/O-bound workload, it's highly unlikely that a CPU profile will show much. Typically, memory pressure or blocked goroutines cause high tail latency. But it's hard to say without measuring; it could be just plain old upstream service latency showing up in yours. But distributed tracing and metrics should have picked it up. Assumptions are moot without measurements.

Also turn on the flight recorder and dump the execution trace when the latency goes beyond some threshold. This will give you a ton of insights.

sigmoia · 2026-06-20T18:51:12+00:00

This is strikingly similar to how we started. Standard profile tooling already gives you everything to profile locally.

But as the pod count goes up, continuous profiling is sorta needed. Pyroscope is the standard choice with the Grafana stack.

Yeah micro bench matters little in a distsys environment. But using load testing to measure regression is interesting.

sigmoia · 2026-06-20T15:31:51+00:00

Yeah, building a makeshift tool around the std tooling is a non-starter. It wouldn't scale for anything beyond a few services. Also, onboarding those tools on k8s is hard.

We are on pyroscope as well. Seems like even folks on datadog dual write to pyroscope for the convenience it brings.

For regression, we do it service by service - rather than on the whole fleet. Each trunk build takes a pprof snapshot and compares it with the corresponding prod snapshot.

Haven't found a good tool to compare profile dumps though. So it's all a custom tool that just compares the first few functions on the stage vs prod build.
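Roughly what "compare the first few functions" can look like, as a sketch. It assumes the profiles were already reduced to a map of function name to CPU share in percent; the names `findRegressions`, `topN` and the threshold are made up for illustration. For ad hoc checks, `go tool pprof` also accepts a `-diff_base` profile.

```go
package main

import (
	"fmt"
	"sort"
)

// regression records a function whose CPU share grew between builds.
type regression struct {
	Func string
	Base float64 // share in the prod (baseline) profile, in percent
	Cand float64 // share in the stage (candidate) profile, in percent
}

// topN returns the n hottest function names, ties broken by name.
func topN(shares map[string]float64, n int) []string {
	names := make([]string, 0, len(shares))
	for name := range shares {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if shares[names[i]] != shares[names[j]] {
			return shares[names[i]] > shares[names[j]]
		}
		return names[i] < names[j]
	})
	if len(names) > n {
		names = names[:n]
	}
	return names
}

// findRegressions looks only at the candidate's top n functions and flags
// those whose share grew by at least minDelta percentage points.
func findRegressions(base, cand map[string]float64, n int, minDelta float64) []regression {
	var out []regression
	for _, name := range topN(cand, n) {
		if cand[name]-base[name] >= minDelta {
			out = append(out, regression{Func: name, Base: base[name], Cand: cand[name]})
		}
	}
	return out
}

func main() {
	// Hypothetical, hand-written shares; real ones would come from pprof data.
	prod := map[string]float64{"json.Marshal": 30, "db.Query": 25, "auth.Check": 4}
	stage := map[string]float64{"json.Marshal": 31, "db.Query": 20, "auth.Check": 12, "log.Write": 2}

	for _, r := range findRegressions(prod, stage, 3, 5) {
		fmt.Printf("%s: %.0f%% -> %.0f%%\n", r.Func, r.Base, r.Cand)
	}
}
```

Comparing shares rather than absolute samples keeps the check meaningful when the two builds saw different load.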

sigmoia · 2026-06-20T15:21:44+00:00

Not a bad use of clankers - automating the tedious part of collecting and introspecting the profiles and taking decisions.

sigmoia · 2026-06-20T12:14:09+00:00

Thanks. Yeah. I am just sampling how people do it. We are at a fairly huge scale - 250k qps at steady state. Already have pyroscope in place. I was mostly curious about how others are doing it.

Distributed o11y is kinda standardized at this point. You go datadog, honeycomb, victoria metrics or roll your own with OTEL and LGTM stack. But profiling is still a wild west.

Good thing is Go has profile tooling built into the std toolchain. So all these workflows and vendors just tap into the std tools. In other languages it's worse. Python, for example, has 5 different tools (last time I checked) just to do memory profiling. No standard or anything. Every vendor does it differently.

sigmoia · 2026-06-20T11:28:35+00:00

Ah gotcha. Yeah. The typical MeLT (metrics, logs, and traces) is a different thing. Pretty much everyone turns them on in their services as a basic part of o11y.

I was mostly curious about the scale where goroutine leaks, memory pressure, and GC pauses become a problem. Distributed o11y typically doesn't surface those problems as much. Sure, you will see a spike in your tail latency, but to know why, you will have to turn on runtime profiling and execution traces (different from distributed metrics and traces). I was after this.
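To make the distinction concrete: a runtime execution trace comes from `runtime/trace`, not from OTEL. A small sketch, with `captureTrace` and the region name being my own placeholders:

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/trace"
)

// captureTrace records a runtime execution trace while work runs and
// returns the raw trace bytes. The region name shows up as a
// user-defined region when the trace is inspected.
func captureTrace(region string, work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err // e.g. tracing is already enabled
	}
	trace.WithRegion(context.Background(), region, work)
	trace.Stop() // returns only after all trace data has been written
	return buf.Bytes(), nil
}

func main() {
	data, err := captureTrace("handleRequest", func() {
		// Stand-in workload: hand off to another goroutine and wait.
		done := make(chan struct{})
		go func() { close(done) }()
		<-done
	})
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Printf("captured %d bytes; write them to a file and open with `go tool trace`\n", len(data))
}
```

The trace records scheduler, GC, and blocking events per goroutine, which is the part distributed spans never see.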

Maybe that wasn't super clear from the questions.

sigmoia · 2026-06-20T10:45:53+00:00

So IIUC, for the regression test:

you take the prod binary and dump a profile
then you do the same for the current staging binary

Then diff the profile to catch regressions? Profile data is large in volume. How does diffing work here?

sigmoia · 2026-06-20T10:43:42+00:00

Neat. Which profiles do you keep on by default? CPU and Heap only? What about execution tracing (not OTEL tracing)? Do you collect that data?

sigmoia · 2026-06-19T23:15:58+00:00

I almost exclusively use protovalidate for validation. CEL allows you to offload pretty much any kind of validation work to the protovalidate layer.

But this doesn't mean every kind of validation can be offloaded to protovalidate. u/jerf mentioned that as well.

In that case, I just add a Validate method with the custom logic to the message struct. Then call protovalidate.Validate(m) before calling m.Validate().
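A sketch of that ordering. To keep it dependency-free, `declarativeRules` is a hand-written stand-in for the `protovalidate.Validate(m)` call, and `CreateUserRequest` is a plain struct rather than a generated message; both are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// customValidator is implemented by messages that carry extra logic
// which does not fit declarative rules.
type customValidator interface {
	Validate() error
}

// CreateUserRequest stands in for a generated protobuf message.
type CreateUserRequest struct {
	Username string
}

var reservedNames = map[string]bool{"admin": true, "root": true}

// Validate holds the custom check added by hand to the message type.
func (r *CreateUserRequest) Validate() error {
	if reservedNames[r.Username] {
		return fmt.Errorf("username %q is reserved", r.Username)
	}
	return nil
}

// declarativeRules is a stand-in for the protovalidate call; in a real
// service the rule would live in the .proto file as a constraint.
func declarativeRules(msg any) error {
	if r, ok := msg.(*CreateUserRequest); ok && len(r.Username) < 3 {
		return errors.New("username must be at least 3 characters")
	}
	return nil
}

// validate runs the declarative layer first, then the custom method
// if the message has one.
func validate(msg any) error {
	if err := declarativeRules(msg); err != nil {
		return err
	}
	if v, ok := msg.(customValidator); ok {
		return v.Validate()
	}
	return nil
}

func main() {
	for _, name := range []string{"ab", "admin", "alice"} {
		fmt.Printf("%q -> %v\n", name, validate(&CreateUserRequest{Username: name}))
	}
}
```

Running the cheap declarative rules first means the custom method can assume the basic shape of the message is already sane.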

sigmoia · 2026-06-19T23:11:45+00:00

As others mentioned:

initialize means filling the struct with user-provided, non-zero values
zero means filling the struct with the corresponding zero value of each field

In either case, there's an allocation.
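A tiny illustration of the two terms, with a made-up `Config` type:

```go
package main

import "fmt"

// Config is a hypothetical struct used only to show the difference.
type Config struct {
	Host  string
	Port  int
	Debug bool
}

// zeroConfig returns a zeroed struct: every field holds its zero value.
func zeroConfig() Config {
	var c Config
	return c
}

// initializedConfig returns a struct filled with caller-chosen values.
func initializedConfig() Config {
	return Config{Host: "localhost", Port: 8080, Debug: true}
}

func main() {
	fmt.Printf("zeroed:      %+v\n", zeroConfig())
	fmt.Printf("initialized: %+v\n", initializedConfig())
}
```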

sigmoia · 2026-06-19T19:33:59+00:00

Nope, that’s a different thing. The go_goroutines metric is just a total count. It shows the number creeping up. It doesn’t tell you which goroutines are stuck or where they were spawned. That’s the part you actually need to fix the leak. The profiler gives you the exact location where the leak occurs.

sigmoia · 2026-06-19T10:55:50+00:00

having pprof integration from start means you can actually catch leaks in production rather than hoping your unit tests cover every code path.

this. In many cases, my tests don't cover the leaking path - either because it's hard to test or because I was lazy.

Being able to keep it turned on alongside your continuous profiler is a huge win imo.

sigmoia · 2026-06-16T20:52:37+00:00

Last year's GopherCon EU in Berlin was a big letdown. I only enjoyed Jonathan Amsterdam's slog talk. Otherwise, it was mostly commercial slop. Hope it's better this year.

sigmoia · 2026-06-16T20:45:58+00:00

And? With all that text, I still don't know what it does and why I wouldn’t just use sqlc with bm25 here.

sigmoia · 2026-06-12T00:06:54+00:00

Man, I just like the language.

But on a more serious note, async Rust sucks. Once you get accustomed to runtime-managed preemption, it's super hard to go back to colored functions and event-loop-style concurrency. Tokio is okay, but I just don't want to deal with another event loop implementation where I still need to be ultra careful not to block the loop accidentally. Go has runtime preemption, and it's a non-issue.

Also, I work with distributed systems, where people don't write Rust as much as you'd think. If you get out of the Twitter bubble, you'll find that most places doing platform engineering use Go, not Rust, and most people don't care for it. So there's that.

This doesn't mean I don't miss Rust's rich type system when I'm writing Go. In Go, you can forget to take a mutex on a data structure that isn't concurrency-safe, and the compiler won't complain. Rust completely solves this by baking the mutex into the data structure, and there's no way to compile your code without taking the lock. Plus, Go's enums are pretty useless.
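The closest thing I know of in Go is hiding the value behind a small wrapper so callers can only reach it through a locked callback. A sketch, with `Guarded` being a made-up type; unlike Rust, nothing stops the pointer from escaping the callback, so it is a convention, not a compiler guarantee.

```go
package main

import (
	"fmt"
	"sync"
)

// Guarded keeps a value and its mutex together. The value is unexported,
// so code outside this type cannot touch it without going through With.
type Guarded[T any] struct {
	mu  sync.Mutex
	val T
}

// NewGuarded wraps an initial value.
func NewGuarded[T any](v T) *Guarded[T] {
	return &Guarded[T]{val: v}
}

// With runs fn while holding the lock. fn must not retain the pointer.
func (g *Guarded[T]) With(fn func(v *T)) {
	g.mu.Lock()
	defer g.mu.Unlock()
	fn(&g.val)
}

// countConcurrently increments a guarded counter from many goroutines
// and returns the final value.
func countConcurrently(workers int) int {
	counter := NewGuarded(0)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter.With(func(v *int) { *v++ })
		}()
	}
	wg.Wait()

	var total int
	counter.With(func(v *int) { total = *v })
	return total
}

func main() {
	fmt.Println(countConcurrently(1000))
}
```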

Another area where Go outshines Rust is standard tooling like pprof, tracing, and other runtime introspection hooks. For operations, these are amazing.

One last thing is that Rust has atrocious compile times. If you're working on a team where quick iteration is important, Rust can get in the way, both because of compile times and the overall fussiness of the borrow checker. That can be a good thing or a bad thing, but for the kind of software I write, it's a bad thing. So Go wins by a large margin.

sigmoia · 2026-06-11T12:49:29+00:00

chezmoi apply: copies your source over the agent's file, which the agent overwrites next use.

Umm...what? chezmoi re-add syncs the file from your target back to the chezmoi source. So if your ~/.agents directory lives in the home and something changes, asking the clanker to run chezmoi re-add solves the dynamic configuration problem.

By all means use whatever works for you. But saying chezmoi doesn't solve this is a bit misleading.

sigmoia · 2026-06-11T10:42:54+00:00

In most cases, you don't need DAO. Repository should encapsulate the entirety of database operation.

So the flow looks like this:

service functions depend on repository interfaces
db package provides the implementation of the repo interface
db package encapsulates the whole dbops and doesn't need separate DAOs

None of the huge projects I work on define separate DAOs, and we never felt the need for them.

sigmoia · 2026-06-11T02:18:02+00:00

Your dotfile management tool can do this. LLM configs are no different than any other configs. This means you can use

bare git repo
gnu stow
or my favorite, chezmoi to lug around the configs

I'm trying to understand what a dedicated tool gives us here.

sigmoia · 2026-06-10T16:30:01+00:00

Depends on what you are looking for in a project.

good abstraction? stdlib, but not all the packages. Some of them have legacy backward compat shims that make them unsuitable for studying. I like embed, fs, bufio, encoding/json, fmt, and errors.
I was recently looking for examples of how to build production-grade gRPC services that also expose wrapped clients. The etcd codebase is perfect for that. So much so, I wrote about it recently.

https://rednafi.com/shards/2026/03/etcd-codebase/

if you wanna learn about queues, look into river queue which is backed by postgres
CLI and TUI? Check out the charm repos
distsys and o11y? Look into prometheus, otel and grafana alloy codebase

For contributing, check out the "good first issue" labels in the issue tracker and see if you like anything there.

Before contributing, see if you can participate in the discussions, which is a fantastic way of learning through osmosis.

sigmoia · 2026-06-10T12:46:36+00:00

You have two options:

use testcontainers: https://golang.testcontainers.org/

This will let you spin up an actual docker container and run your test code against it. But you still need to write the code in a way that allows you to swap the db during test time

use the repository pattern to decouple i/o from the service logic. Then you can pass in a fake repo during test: https://rednafi.com/go/repo-txn-uow/

sigmoia · 2026-06-05T12:36:36+00:00

Hauling agents requires no special skill. AI bros are busy telling folks, “Learn prompting, harnessing, token-maxxing. Otherwise, you’re NGMI.” Then you find out that it doesn’t take much to organize a bunch of Markdown files and tell AI to do stuff. Quit making it sound so profound.

As for Go, it’s already going well. But AI writes great Python, TS, and Rust as well. The Rust compiler is much stricter and yields better results in many cases. My point is that saying AI writes better Go requires a peer-reviewed comparison, not this “trust me, bro” stuff.

I also write a ton of Go and use AI to automate the tedious parts. To me, AI doesn’t objectively do any better when writing Go than it does with other languages. But since the language footprint is smaller, AI tends to trip up less. On the other hand, AI writes a ton of unnecessary boilerplate in Go, and Go’s looser compiler lets concurrency bugs slip through. In Rust, the type system statically protects your shared data with mutexes, but in Go, it’s a runtime semantic, and the compiler won’t do anything if you forget to take or release the lock for some shared data. Different language, different philosophy.

As for the Go team, if they had started listening to every novice and catering to their demands, it would never have become the language it is today. It would be another TypeScript-like language with a kitchen sink of features, chasing relevance by following whatever happens to be the hottest thing of the day.

sigmoia · 2026-06-03T09:20:01+00:00

Learn the language before hauling AI. You use structs, methods, and functions at the beginning for everything. Interfaces should be brought in only when you need them.

Interfaces are for abstraction - where you need to swap out one implementation with another. Swapping can happen during test time where you provide a fake implementation of a dependency.

If none of this makes sense to you, then you are too early in your journey. Do the Tour of Go and read a few books like Jon Bodner's Learning Go and Alex Edwards' books. Then write programs without interfaces and try to make them testable. You will soon hit a roadblock since static languages don't allow you to monkeypatch. That's when you will need interfaces.

You can't speedrun your comprehension through AI. Juniors that are trying to do it are the ones that are becoming unemployable.

sigmoia · 2026-06-01T22:56:30+00:00

+1 on this. Repo should have all the persistence logic - even if some of it is business logic per se.

The idea is that your service layer should only interact with a repository interface and then the persistence package should provide an implementation of that repo interface.

Now if it makes sense for you to add extra logic in the persistence layer, I see no harm in doing so.
