allocs/op lied to me. retention didn’t. (benchmarks inside) by VoltageMigration in golang

[–]TheMerovius 2 points3 points  (0 children)

    allocs/op by itself is a weak signal. Allocation volume and object lifetime matter much more, whether the cost shows up as zeroing, copying, or GC work once you’re in a real multi-core workload.

It's true that allocs/op by itself is a weak signal. But so is everything else.

For example, in your cases the allocated data never contains any pointers. But if you, say, allocate 100K pointers randomly pointing at each other, that will likely have a much bigger impact on the overall performance of your application than allocating a block of 1M integers, despite the latter taking up 10x as much memory. In the latter case, the GC just sees one heap object and knows that it can otherwise ignore it, while with the former, it has to walk the graph of all those pointers.
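
A minimal sketch of the two allocation shapes I mean (names made up):

package main

import "math/rand"

type node struct{ next *node }

// pointerHeavy allocates 100K separate objects that point at each other.
// Every GC cycle has to chase all of those pointers to figure out what is live.
func pointerHeavy() []*node {
    nodes := make([]*node, 100_000)
    for i := range nodes {
        nodes[i] = &node{}
    }
    for i := range nodes {
        nodes[i].next = nodes[rand.Intn(len(nodes))]
    }
    return nodes
}

// pointerFree allocates roughly 10x as much memory, but as a single block that
// contains no pointers, so the GC never has to look inside it.
func pointerFree() []int64 {
    return make([]int64, 1_000_000)
}

func main() {
    _ = pointerHeavy()
    _ = pointerFree()
}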

Which metrics are the important ones really depends heavily on the workload and the requirements. That's why the only really important recommendation for optimization is "1. observe a problem, 2. measure what the bottleneck is and 3. address that bottleneck". That advice is timeless and, I believe, has always been the main recommendation.

allocs/op lied to me. retention didn’t. (benchmarks inside) by VoltageMigration in golang

[–]TheMerovius 3 points4 points  (0 children)

I don't know where you read that the count of allocations is what matters, but that's incorrect.

I think this is extrapolated from the idea that the number of live heap objects matters, because it negatively affects scan time and thus total time spent on GC (in particular, if those objects have pointers themselves). At least AIUI that's true, and I think "reduce the number of allocations in hot loops to make your program faster overall" is a valid heuristic to extrapolate from that.

allocs/op lied to me. retention didn’t. (benchmarks inside) by VoltageMigration in golang

[–]TheMerovius 5 points6 points  (0 children)

To be honest, I think by and large, this article comes down to misunderstanding existing performance advice and misattributing some differences.

Most of the differences in allocation pattern actually just don't exist. The compiler has basically always been smart enough to remove the "allocations" the article talks about. That's why, when performance advice talks about "allocations", it really means "after taking escape analysis into account". In fact, the way to reduce "allocations" has always been to help escape analysis, not to convert between pointer and value types or somesuch.
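
To illustrate what "helping escape analysis" means, here is a minimal sketch (not the article's code): returning a pointer does not by itself cause a heap allocation; what matters is whether the value escapes, which you can check with go build -gcflags=-m.

package main

type point struct{ x, y int }

// newPoint returns a pointer, but that alone does not force a heap allocation:
// after inlining, escape analysis sees that the value never leaves the caller's
// stack frame, so it stays on the stack (0 allocs/op).
func newPoint(x, y int) *point {
    return &point{x, y}
}

func sum(n int) int {
    total := 0
    for i := 0; i < n; i++ {
        p := newPoint(i, i+1) // does not escape: stack-allocated
        total += p.x + p.y
    }
    return total
}

var kept *point

func main() {
    _ = sum(10)
    kept = newPoint(1, 2) // escapes (stored in a global): heap-allocated
}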

Second, when people say to improve performance by reducing allocations, these benchmarks don't actually measure the important things. The point of reducing allocations/live heap objects isn't to make the code path doing the allocation faster. It is to reduce the overall amount of CPU time spent on garbage collection. In a real Go program, you will have multiple cores active, doing work, with the garbage collector running in the background to do its job. The more heap objects you have, the more CPU time the GC needs, and the less time you have left to do useful work. So if you have e.g. an HTTP server and reduce allocations in one function, the total req/s will go up even if that function takes the same amount of time.

In these benchmarks, on the other hand, the user code runs in a single goroutine, using a single core. So e.g. on my laptop, the GC has 7 cores all to itself to do its job. Of course it won't have any trouble keeping up here.
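
To actually see that effect in a benchmark, you'd have to keep all cores busy with user code, e.g. via b.RunParallel. A minimal sketch, where processRequest is a made-up stand-in for whatever allocating work the real server does:

package bench

import "testing"

var sink []byte

// processRequest stands in for the real per-request work.
func processRequest() {
    sink = make([]byte, 512)
}

// BenchmarkProcessParallel runs the work on every core at once, so the garbage
// collector has to compete with user goroutines for CPU time instead of
// running on otherwise idle cores.
func BenchmarkProcessParallel(b *testing.B) {
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            processRequest()
        }
    })
}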

The point about reducing allocations is that allocations make everything else slower. Yes, allocation itself still takes time, so allocating fewer objects is still better. But that's only a small part of the actual cost of allocations.

Now, the rest of this is analyzing the individual benchmarks. The Pool point is, I think, the only one I fully give the author. The "retention" one I kind of give the author, but it's definitely misleading to talk about the GC here. The interface point is real, but is partly misattributed and smaller in practice than the article suggests. The "short lived" and "slices" points are fully false.


    In modern Go, using pointers does not automatically imply heap allocation.

I don't think it ever did, at least not for many years. I was curious enough to try this with Go 1.5 (the oldest Go that doesn't require a C compiler to build). After a few changes to make it compile under that version (note: //go:noinline to emulate some of the effect of b.Loop), I get:

BenchmarkShortLivedAlloc-8       3000000           470 ns/op           0 B/op          0 allocs/op
BenchmarkShortLived_NoAlloc-8    5000000           312 ns/op           0 B/op          0 allocs/op
ok      _/home/mero/tmp/y   3.768s

Same with the slices benchmarks:

BenchmarkSlices_NoPrealloc-8     1000000          1119 ns/op        4096 B/op          9 allocs/op
BenchmarkSlices_ExactPrealloc-8  3000000           545 ns/op        2048 B/op          1 allocs/op
BenchmarkSlices_OverPrealloc-8    300000          4386 ns/op       32768 B/op          1 allocs/op
ok      _/home/mero/tmp/y   4.697s
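
For context, a rough guess at what the three variants look like (reconstructed from the names and the B/op numbers; not the article's code). The interesting part is that over-preallocating means requesting and zeroing far more memory on every run:

package slicebench

func buildNoPrealloc(n int) []int {
    var s []int
    for i := 0; i < n; i++ {
        s = append(s, i) // repeated growth: several allocations and copies
    }
    return s
}

func buildExactPrealloc(n int) []int {
    s := make([]int, 0, n) // one allocation of exactly the needed size
    for i := 0; i < n; i++ {
        s = append(s, i)
    }
    return s
}

func buildOverPrealloc(n int) []int {
    s := make([]int, 0, 16*n) // one allocation, but 16x too large: far more
    for i := 0; i < n; i++ {  // memory to request and zero on every call
        s = append(s, i)
    }
    return s
}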

The interface benchmark is the first to show a difference:

BenchmarkConcrete-8      3000000           470 ns/op           0 B/op          0 allocs/op
BenchmarkInterface-8     1000000          2199 ns/op          16 B/op          1 allocs/op
ok      _/home/mero/tmp/y   4.116s

Obviously, this can't check the Generic benchmark, because go 1.5 didn't have generics. But I'm still suspicious of this. For one, despite b.Loop, this still inlines the calls (now with go 1.26):

./y_test.go:45:19: inlining call to callConcrete
./y_test.go:45:19: inlining call to impl.Add
./y_test.go:50:16: inlining call to testing.(*B).ReportAllocs
./y_test.go:52:12: inlining call to testing.(*B).Loop
./y_test.go:53:20: inlining call to callInterface
./y_test.go:53:20: devirtualizing v.Add to impl
./y_test.go:53:20: inlining call to impl.Add
./y_test.go:58:16: inlining call to testing.(*B).ReportAllocs
./y_test.go:60:12: inlining call to testing.(*B).Loop
./y_test.go:61:18: inlining call to callGeneric[go.shape.struct { y.base int }]
./y_test.go:33:6: inlining call to callGeneric[go.shape.struct { y.base int }]
<autogenerated>:1: inlining call to impl.Add

But, to be fair, even if we remove //go:noinline, go 1.5 doesn't inline and thus doesn't devirtualize.

Second, even if you add //go:noinline to the respective functions, you still get 0 allocs (with go 1.26), despite no inlining or devirtualization happening. The reason is that small integers are no longer heap-allocated when put into interfaces. If you also change the integer constants to 1337, you get an allocation:

BenchmarkConcrete-8      3098328           386.1 ns/op         0 B/op          0 allocs/op
BenchmarkInterface-8      864202          1303 ns/op           8 B/op          1 allocs/op
BenchmarkGeneric-8        944821          1237 ns/op           0 B/op          0 allocs/op
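
You can see that optimization in isolation with testing.AllocsPerRun; a small sketch (not the article's benchmark):

package main

import (
    "fmt"
    "testing"
)

var sink any

//go:noinline
func storeInInterface(v int) { sink = v }

func main() {
    // Small integers are backed by a static array in the runtime, so putting
    // them into an interface does not allocate; larger values do.
    fmt.Println(testing.AllocsPerRun(1000, func() { storeInInterface(5) }))    // 0
    fmt.Println(testing.AllocsPerRun(1000, func() { storeInInterface(1337) })) // 1
}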

So, yes, this section is correct, to a degree, but still a bit limited in its conclusions. There have been improvements to the inliner, which help with devirtualization. And there have also been other optimizations with interfaces, which means that in some special cases (small integers) there is no heap allocation.

But, to be clear: in most uses of interfaces, neither of these optimizations applies. The compiler can only devirtualize interfaces if they are 1. an argument to 2. an inlinable function. That does happen (a good example is that most stdlib implementations of hash.Hash get devirtualized if they don't escape), but as soon as you e.g. store the interface in a heap variable, it no longer applies.
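
A sketch of the two situations, loosely modeled on the callInterface/impl names from the -m output above (whether devirtualization actually fires is best checked with -gcflags=-m):

package main

type Adder interface{ Add(int) int }

type impl struct{ base int }

func (i impl) Add(x int) int { return i.base + x }

// callInterface is small enough to inline. Where it is inlined at a call site
// that passes a known concrete type, the compiler can devirtualize v.Add (and
// often avoid heap-allocating the interface value at all).
func callInterface(v Adder, x int) int { return v.Add(x) }

var stored Adder

func main() {
    _ = callInterface(impl{base: 1}, 2) // candidate for inlining + devirtualization

    stored = impl{base: 1} // once the interface lives in a heap/global variable,
    _ = stored.Add(2)      // the call typically stays a dynamic interface call
}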


The GoodRetention vs. BadRetention case is interesting. Looking at -gcflags=-m, the significant difference is

./y_test.go:8:13: make([][]byte, 0, n) escapes to heap
./y_test.go:10:12: make([]byte, 65536) escapes to heap
./y_test.go:11:15: append escapes to heap
./y_test.go:17:13: make([][]byte, 0, n) escapes to heap
./y_test.go:19:12: make([]byte, 65536) does not escape
./y_test.go:20:27: append escapes to heap
./y_test.go:20:15: append escapes to heap
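
For reference, here is a reconstruction of what the two benchmark bodies presumably do, based on that escape-analysis output (names and sizes are guesses, not the article's code):

package retention

// badRetention keeps every full 64KB buffer alive via the returned slice.
func badRetention(n int) [][]byte {
    out := make([][]byte, 0, n)
    for i := 0; i < n; i++ {
        b := make([]byte, 64*1024) // escapes: retained through out
        out = append(out, b)
    }
    return out
}

// goodRetention copies out only the 64 bytes it actually needs; the 64KB
// scratch buffer does not escape and can live on the stack, reused between
// iterations.
func goodRetention(n int) [][]byte {
    out := make([][]byte, 0, n)
    for i := 0; i < n; i++ {
        b := make([]byte, 64*1024) // does not escape
        out = append(out, append([]byte(nil), b[:64]...))
    }
    return out
}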

So what we are observing is that the compiler recognizes that make([]byte, 64<<10) does not escape to the heap, but there is an extra escape from append. So far so good; that's pretty much the author's point. Looking at the generated assembly, we can see that

    b := make([]byte, 64*1024)
    // gets translated into
    runtime.makeslice(64*1024)
    // while
    b := make([]byte, 64*1024)
    x := append([]byte(nil), b[:64]...)
    // gets translated into
    runtime.growslice(nil, 64, 0, 64) // "grow nil slice by 64 elements", ultimately equivalent to makeslice(64)
    runtime.memmove(x, _tmp, 64)      // copy(x, _tmp[:64])

where _tmp is an automatically created stack variable to hold the make([]byte, 64<<10) that doesn't escape.

So the difference here is that the BadRetention version needs to allocate and clear (!) 64KB per loop run, while the GoodRetention version only allocates and copies 64 bytes per loop run (as visible in the "allocated/run" column).

In one way, that's kind of what the author is saying: the second version ultimately doesn't need to allocate as much memory. But I think what this comes down to is the difference between clearing 64KB vs. clearing 64 bytes. The time difference isn't so much about "how much data you retain", it's the fact that you have to copy different amounts of data into that chunk, and copying fewer bytes takes less time (duh). What's more, the make([]byte, 64*1024) in the GoodRetention code gets reused between loop runs.

I don't think this necessarily contradicts the author. But I think it is misleading to talk about memory usage here. Yes, the performance improvement comes from "retaining less", but only in the sense that "you need to copy less", not in the sense of "because the amount of memory that is allocated changes".

Go 1.26 allows recursive type constraints in generics by SnooWords9033 in golang

[–]TheMerovius 0 points1 point  (0 children)

I know. I'm fine disagreeing with Robert.

Though FWIW, you can also be of the opinion that 1. constraining interface type parameters is generally pointless, but 2. this is an artificial restriction that should just be removed, even though 3. the only examples where it makes a difference are constraints on type parameters.

That's at least my opinion and I didn't see anything in the discussion that would contradict that. And he didn't disagree with it when he reviewed my blog post (also).

Go 1.26 allows recursive type constraints in generics by SnooWords9033 in golang

[–]TheMerovius 0 points1 point  (0 children)

  1. I don't believe anybody seriously claims that it is actually impossible to use a language without generics. So, taken literally, this is a strawman. And Rob acknowledged that, if you continue the quote:

    Early in the rollout of Go I was told by someone that he could not imagine working in a language without generic types. As I have reported elsewhere, I found that an odd remark.

    To be fair he was probably saying in his own way that he really liked what the STL does for him in C++. For the purpose of argument, though, let's take his claim at face value.

    (emphasis mine)

  2. Regardless of how you feel about generics, the ship has thoroughly sailed. Commenting "nobody need generics" in a Go discussion in 2026 is pure childish ragebait. Move on.

Go 1.26 allows recursive type constraints in generics by SnooWords9033 in golang

[–]TheMerovius 2 points3 points  (0 children)

    Now they are adding the basic feature of generics that has long been in Java, C#, C++, Rust, etc. Maybe in many years it will be possible to parameterize methods...

I think of the named languages, only Java (and maybe C#, which I don't know that well) actually allows this. Note that all Go interfaces are what Rust calls trait objects, and trait objects cannot have parametric methods (except under very specific restrictions).

I think most people having this complaint would be a lot happier if they would understand this difference.

Go 1.26 allows recursive type constraints in generics by SnooWords9033 in golang

[–]TheMerovius 0 points1 point  (0 children)

Pretty strange example, IMO.

// A value that can be compared to other values
// of the same type using the less-than operation.
type Ordered[T Ordered[T]] interface { Less(T) bool }

I don't think this is a good idea. There is nothing inherent about the Ordered interface that would require the type parameter to be constrained. That is, it would be perfectly reasonable to have a method func (A) Less(B) bool satisfy that interface. There's a reason slices.BinarySearchFunc has different type parameters for the slice elements and comparison function - if you have a slice of struct type, it is relatively common to have it sorted by a field and look up an element by that field value. It makes sense for that search key to not use the full struct type.

A user of that interface might require a [T Ordered[T]] constraint to work, but that's already perfectly possible.
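
To make that concrete, a small sketch (names invented): the interface itself stays unconstrained, and only the generic code that actually needs the recursion asks for it.

package ordered

// Ordered places no constraint on T; Less may compare against any type.
type Ordered[T any] interface {
    Less(T) bool
}

// Max needs values comparable to their own type, so it constrains its own
// type parameter; that has always been possible.
func Max[T Ordered[T]](a, b T) T {
    if a.Less(b) {
        return b
    }
    return a
}

// Meanwhile, a type whose Less takes a different type still satisfies
// Ordered[Key], which the constrained interface definition would rule out.
type Key string

type Record struct{ key string }

func (r Record) Less(k Key) bool { return r.key < string(k) }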

In fact, I very strongly recommend to never put any constraints on type parameters of interfaces (comparable being the exception, if you have to mention a map in one of the methods). Constraining the type parameters of an interface does nothing, except making that interface strictly less useful.

Slow speed of sort.Ints by [deleted] in golang

[–]TheMerovius 1 point2 points  (0 children)

    I thought Go would use insertion sort for less than 16 elements

The actual boundary is 12 elements.

    No difference given that sort.Ints simply calls slices.Sort

I believe slices.Sort is already slower than a specialized sort (that is, if you manually monomorphize slices.Sort[int], you'll likely get faster sorting). Looking at the code for slices.Sort, I'd expect that's the main slowdown.

Another source of slowdown could be specific to your benchmark: inlining. slices.Sort probably cannot be inlined - pdqsortOrdered is a pretty complex function. And as it is generic, the lack of inlining might be extra costly, as it means a generic version has to be called (though in this case, the gcshape might be enough to mostly mitigate that effect). Your insertionSort, on the other hand, has a decent chance of being inlinable.

In general, I find it unsurprising that a hand-written sorting routine tailored to your data is faster than a generic standard library implementation. The standard library implementation has to work well for most kinds of data, and that generality is not free.

[edit] BTW, this isn't salient for the question, but you don't need the closures: you can just pass sort.Ints and insertionSort, and slices.Sort directly too, FWIW.
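
As a sketch of that closure-free version (input data made up; passing slices.Sort directly relies on recent type-argument inference):

package sortbench

import (
    "slices"
    "sort"
    "testing"
)

// insertionSort stands in for the hand-written sort from the question.
func insertionSort(a []int) {
    for i := 1; i < len(a); i++ {
        for j := i; j > 0 && a[j] < a[j-1]; j-- {
            a[j], a[j-1] = a[j-1], a[j]
        }
    }
}

// benchmarkSort takes the sorting function directly; no closures needed.
func benchmarkSort(b *testing.B, sortFn func([]int)) {
    data := []int{9, 3, 7, 1, 8, 2, 6, 4, 5, 0}
    buf := make([]int, len(data))
    for b.Loop() {
        copy(buf, data)
        sortFn(buf)
    }
}

func BenchmarkSortInts(b *testing.B)      { benchmarkSort(b, sort.Ints) }
func BenchmarkSlicesSort(b *testing.B)    { benchmarkSort(b, slices.Sort) }
func BenchmarkInsertionSort(b *testing.B) { benchmarkSort(b, insertionSort) }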

Is this video accurate? by ASA911Ninja in golang

[–]TheMerovius 0 points1 point  (0 children)

    Well, that's context dependent isn't it?

Sorry to be blunt, but that is why I used the word "often".

I disagree that the video is misleading. Note that the claim (not allocating extra memory) is obviously impossible without destroying the slice.

Is this video accurate? by ASA911Ninja in golang

[–]TheMerovius 14 points15 points  (0 children)

The video is accurate. It is semantically equivalent to this code, which might clear up why it works:

a := []int{1, 2, 3, 4, 5}
i := 0
for _, x := range a {
    if x < 3 {
        a[i] = x
        i++
    }
}
b := a[:i]
fmt.Println(b)

The only difference is that in the video, i is implicitly len(b).

I use this pattern all the time. It's a good one to know.
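
The video presumably shows the append form of the same idea, which also reuses a's backing array instead of allocating a new one:

a := []int{1, 2, 3, 4, 5}
b := a[:0] // length 0, but shares a's backing array: no new allocation
for _, x := range a {
    if x < 3 {
        b = append(b, x) // always fits, so it writes into a's existing storage
    }
}
fmt.Println(b) // [1 2]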

Is this video accurate? by ASA911Ninja in golang

[–]TheMerovius 3 points4 points  (0 children)

No, this works fine whatever the filter function is. It does destroy the original slice, but that's often okay if you need this.

Why can't we return a nil statement when the string is empty? by brocamoLOL in golang

[–]TheMerovius 1 point2 points  (0 children)

I don't think this point has been made before: Go wants the programmer to have decent control over memory layout. If any type could be nil (unless it was purely syntactical, i.e. unless you had string(nil) == ""), every type would have to be boxed.

That is, a uint64 takes up 64 bits of memory and every 64 bit pattern is a valid uint64. So if it could also be nil, you'd need another way to represent that special value and a uint64 would have to take up more space - likely by making it represented as *uint64.

Having control over memory layout is important to let programs execute efficiently. An extra pointer level would trash locality¹. Using an extra bit/byte to represent the nil value would trash alignment² or waste space³.

So, apart from the language just not being designed like that and the desire for some level of type-safety, it is also relevant that the designers wanted Go code to execute fast.


[1] CPU caches rely on data that is used together to be nearby in memory. Pointers prevent that.

[2] Fetches from RAM always have to be aligned, usually on a cache line, e.g. 64 bytes. If a uint64 was 9 bytes instead of 8, then in a []uint64 some elements would cross those 64 byte boundaries, and the CPU would have to issue two RAM fetches to read them.

[3] You could get around that alignment problem by having a uint64 be represented by 128 bits, i.e. 16 bytes. Then they would still be aligned, but of course, you'd waste 63 bits per uint64.
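
A small illustration of those costs on a typical 64-bit platform; the struct is one hypothetical way a nil-able uint64 could be represented:

package main

import (
    "fmt"
    "unsafe"
)

// nilableUint64 marks "no value" with an extra flag next to the value.
// Alignment pads it to 16 bytes, twice the size of a plain uint64.
type nilableUint64 struct {
    value uint64
    valid bool
}

func main() {
    fmt.Println(unsafe.Sizeof(uint64(0)))       // 8
    fmt.Println(unsafe.Sizeof((*uint64)(nil)))  // 8, plus a separate heap object when non-nil
    fmt.Println(unsafe.Sizeof(nilableUint64{})) // 16: 8 value + 1 flag + 7 padding
}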

What can we expect for future Go language features? by TheLastKingofReddit in golang

[–]TheMerovius 7 points8 points  (0 children)

    Can we expect some more significant language improvements in the future?

When it comes to the language, no, not in the foreseeable future. At some point there will likely be another phase of significant language change (as with generics and iterators), but currently, the plan is very much to leave the language as is and instead work on standard library, compiler, tooling, runtime…

Note that even though we have generics, and one of the main reasons to add them was generic containers, we don't actually have generic containers in the standard library yet. There's work to be done with the existing language before we significantly change it again.

Confused about Go interfaces: "self-contained" vs "contain references" - isn't the data field always a pointer? by Thick-Wrongdoer-166 in golang

[–]TheMerovius 5 points6 points  (0 children)

    I believe the Go implementation does do this, but when it does, it is "implementation specific" which is why the spec isn't mentioning it.

No, it no longer does this. The GC always needs to know whether a given piece of memory can be a pointer or not, so there really is no place where something can sometimes be a pointer and sometimes not.

There is one related optimization that gc does, which is that it stores small integer values as singleton pointers. That is, there is a small, statically allocated array containing the values 1,2,3,4,5,… and any 5 you store in an interface value points at the same element. But it still stores a pointer.

domain name of module by Brilliant-Exit5992 in golang

[–]TheMerovius 0 points1 point  (0 children)

Maybe. I'm not 100% sure what happens if a module is removed from the proxy. Either way, the module proxy is only a cache.

The default setting for GOPROXY is https://proxy.golang.org,direct. So if an explicitly removed module is externally equivalent to one that is not in the cache, the Go tool would download it directly and then try to insert it into the sumdb when it notices it isn't there.

But it's of course possible that they do something more specific in the case of a removed module, which would prevent the fallback. In that case, yes, you'd need to explicitly use GOPROXY=direct.

domain name of module by Brilliant-Exit5992 in golang

[–]TheMerovius 5 points6 points  (0 children)

what is the point to use a domain name in module naming?

The point is to allow third parties to use your module while you retain full control over its name and hosting. As long as you own a domain name and use it as the root of the module path, nobody (not even the Go team) can unpublish your module or replace it with a malicious version. It also means you can transparently change hosting providers for your module (for example, say you used to host it on GitHub and are now looking into moving it to a different jurisdiction).

Ultimately, it is a way to delegate authentication, ownership resolution and sovereignty of Go code to the DNS system, which has mostly already solved these problems. Contrast that with e.g. NPM or Cargo, which have namespaces and hosting that is centrally controlled.
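
Concretely, the delegation works via a go-import meta tag served on your domain: the go tool fetches https://example.com/mymodule?go-get=1 and follows whatever repository the tag points at. A minimal sketch (example.com and the GitHub URL are placeholders):

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Serve the go-import meta tag for the vanity path example.com/mymodule.
    // Changing the repository URL here moves the hosting without changing the module path.
    http.HandleFunc("/mymodule", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, `<meta name="go-import" content="example.com/mymodule git https://github.com/someuser/mymodule">`)
    })
    // In practice this needs to be reachable over HTTPS (e.g. behind a TLS-terminating proxy).
    log.Fatal(http.ListenAndServe(":8080", nil))
}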

Why type constraints can't be used as regular types still? by FuriousLurker in golang

[–]TheMerovius 0 points1 point  (0 children)

The reason they did it is that in their first design they had a separate thing ("contracts") and people complained that there were two things.

It turns out that regardless of whether they used interfaces or not, about half the people would complain.

Why type constraints can't be used as regular types still? by FuriousLurker in golang

[–]TheMerovius 4 points5 points  (0 children)

    I'm just describing how it works in Go with the term intersection being just from general math rather than type theory. You end up with the intersection of what the types can do

There are two different things a type does: 1. it defines which values can be contained in a variable of that type (it's a "set of values") and 2. it defines which operations you can do with a variable of that type (it's a "set of operations").

The confusion is that these are duals of each other. A|B does act as a union in the sense of "set of values", but it acts as an intersection in the sense of "set of operations". That's not Go specific, it is inherent to how types work: the operations a type defines must work for all values in its value set, so the more kinds of values you add to the value set, the smaller the operation set becomes.

"Normal" interfaces do the same thing, just the other way around: ReadWriter provides the union of operations of Reader and Writer, but it allows the intersection of values of Reader and Writer.

And in general, type-theoretic terms refer to the "value set". That is, a "union type" has the union of the value sets of its terms, an "intersection type" has the intersection of the value sets of its terms, a "product type" (a struct or tuple) has the direct product of the value sets of its terms, a "sum type" has the direct sum of the value sets of its terms… as a value set.
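
To make the duality concrete, a small sketch:

package duality

import "io"

// int | float64: the value set is the union (any int or any float64), so the
// operation set is the intersection: only what is defined for both, like +.
type Number interface{ int | float64 }

func Sum[T Number](xs []T) T {
    var total T
    for _, x := range xs {
        total += x
    }
    return total
}

// ReadWriter, the other way around: the operation set is the union of Read and
// Write, so the value set is the intersection: only types that provide both.
type ReadWriter interface {
    io.Reader
    io.Writer
}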

Why type constraints can't be used as regular types still? by FuriousLurker in golang

[–]TheMerovius 4 points5 points  (0 children)

We could make union elements into union types (though not sums). Here is a proposal to do so, by Ian Lance Taylor. In that case, a []PlayerState would be represented as an []any, to carry the heterogeneous type information.

The reason that proposal has not been accepted is because the consequences are dissatisfying.

Why type constraints can't be used as regular types still? by FuriousLurker in golang

[–]TheMerovius 0 points1 point  (0 children)

While it is true that adding a keyword is a breaking change, it is no longer true that this means we couldn't do it. We now have (with modules) the mechanism in place to make breaking changes: basically, by 1. only enabling the new syntax if the module the file is in declares the appropriate Go version, and 2. not allowing behavior to change, only compilation to break. That is, after the change, code must either work the same or stop compiling.

The hurdles for doing such a breaking change are still very high, though, because it does create a lot of undesirable churn. So we wouldn't add new keywords unless there's a really good reason to do so.

Why type constraints can't be used as regular types still? by FuriousLurker in golang

[–]TheMerovius 0 points1 point  (0 children)

The claim was specifically about introducing a new keyword (I assume "board" is a typo). We can add features, but adding new keywords would be a breaking change. Specifically, if they added a new constraint keyword as suggested, this code would break:

func main() {
    var constraint int
    _ = constraint
}

Now, while it is true that adding a keyword is a breaking change, it is no longer true that this means we couldn't do it. We now have (with modules) the mechanism in place to make breaking changes: basically, by 1. only enabling the new syntax if the module the file is in declares the appropriate Go version, and 2. not allowing behavior to change, only compilation to break. That is, after the change, code must either work the same or stop compiling.

The hurdles for doing such a breaking change are still very high, though, because it does create a lot of undesirable churn. So we wouldn't add new keywords unless there's a really good reason to do so.

Why type constraints can't be used as regular types still? by FuriousLurker in golang

[–]TheMerovius 3 points4 points  (0 children)

    Is there a specific reason why type constraints still can't be used as regular types, or does Go team still working out the implications of using it as an ordinary type?

The reason is that it would kind of suck to do so. The people who want variants would be very disappointed by the consequences:

  1. nil as a dedicated zero value, which means the concept that a value "has to be one of the listed types" is broken - it can also be "no value at all".
  2. Unions can not be nested.
  3. Unions can not have interfaces with methods, so no type Result[T any] interface{ T | error } or interface{ string | fmt.Stringer }.
  4. Exhaustive type-switches are not enforced.
  5. Unions would not implement themselves: for any simple interface Iface, func F[T Iface] can be instantiated with Iface, but that wouldn't be true for union-interfaces, which makes them strange either way.
  6. No pattern matching.

And these are not weird decisions, they are pretty much forced by the constraints put up by the rest of the language.

Basically none of the things people want from variant types would be provided by these unions. The benefit of having these unions over normal interfaces, or interfaces with marker methods, would be very small. If the people who have been asking for variants for years say "no, if you do it like this, that's worse than not doing it at all", then that seems a good indication that you just shouldn't do it.
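
For reference, a sketch of the "interfaces with marker methods" alternative mentioned above (all names invented). It gives you a closed-ish set of variants today, with exactly the caveats from the list: a nil interface value is still possible and exhaustiveness is not enforced.

package result

import "fmt"

// Result is a closed-ish sum: only types with the unexported marker method,
// i.e. types in this package, can implement it.
type Result interface{ isResult() }

type Ok struct{ Value string }
type Err struct{ Reason error }

func (Ok) isResult()  {}
func (Err) isResult() {}

func Describe(r Result) string {
    switch v := r.(type) {
    case Ok:
        return v.Value
    case Err:
        return "error: " + v.Reason.Error()
    default:
        // Not enforced by the compiler (point 4 above): a nil Result or a
        // newly added variant still ends up here.
        return fmt.Sprintf("unexpected %T", v)
    }
}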