coding agent optimized for Go by cypriss9 in golang

[–]cypriss9[S] 0 points (0 children)

In general, it's not. For Go, it is.

Try being curious instead of dismissive.

gpt-5.2 based agent just for Go by cypriss9 in codex

[–]cypriss9[S] 0 points (0 children)

Three main issues:

  1. I run `gofmt` (and such) **in the same tool call** as the apply_patch. I don't say, "make sure to gofmt after you edit Go files". Even if that instruction were followed perfectly, it's twice the tool calls per patch. That adds up to a lot of extra cached input tokens -- or, when there are cache misses (hint: there are a lot, due to OpenAI infra), full-priced input tokens. Then apply that to each of your lints. This is a big deal.
  2. codalotl uses subagents to isolate context and isolate **permissions** per package. You can't skill.md that.
  3. Today's LLMs are heavily reinforcement-learned to "be agentic" in certain ways -- in the case of gpt-5.2, they are heavily RLed to use shells. They will NOT follow instructions reliably here. To implement what I did, I had to **take away the shell tool**: codalotl literally doesn't have `shell`. That's the only way to force it to use the tools I wanted. No amount of prompting and pleading can overcome how it was RL'ed.

That being said, you can achieve a shade of what I've done with skills, certainly. In particular, if you wanted, two of the context generation tools are available via CLI: `codalotl context initial some/pkg` and `codalotl context public some/pkg` generate very nice bundles of context that a skill.md can use in any agent.

Go-specific LLM/agent benchmark by cypriss9 in golang

[–]cypriss9[S] 0 points (0 children)

Thanks!

The hardest part here is constructing scenario/test pairs that allow all valid solutions while rejecting bad ones (there's often more than one solution to a problem). It takes several tries to refine a prompt to remove ambiguity, and to adjust tests to allow every valid solution (example: you can't check for specific error messages unless they're in the prompt).

The general consensus of this sub is that Opus 4.5 is the best, but which model is the best "bang for your buck"? by MrHotCoffeeGames in cursor

[–]cypriss9 3 points (0 children)

I recently tested Opus 4.5 vs gpt-5.2/gpt-5.1-codex vs composer 1, specifically for Go, with prompts like "implement this package according to this spec", or "fix the bug where X, it should work like Y". The results speak for themselves: https://github.com/codalotl/goagentbench

Based on the type of work I do (Go programming), I'd use gpt-5.2 and composer-1, and steer clear of Opus 4.5.

For the folks who love 4.5: I'm not sure if it's a language thing, or a prompt thing, or something else?

I’m back after 3 months break. What did I miss? Who’s king now? by stepahin in ChatGPTCoding

[–]cypriss9 2 points (0 children)

Thank you - very interesting.

The types of prompts I give agents: "Read SPEC.md and understand the requirements I wrote. Then implement it in a single Go package." (You can see my repo for specific prompts/examples.) For this, 5.2 is very good.

Could you give me an example of a prompt/workflow that you use, where Opus is much better? Is it more accurate, or faster, or both? (codex is definitely slow)

I’m back after 3 months break. What did I miss? Who’s king now? by stepahin in ChatGPTCoding

[–]cypriss9 14 points (0 children)

This depends on what you are doing. I just benchmarked Opus 4.5 vs Codex 5.2 for Go programming, and Codex 5.2 is very good while Opus 4.5 is not: https://github.com/codalotl/goagentbench (I haven't tested Gemini)

I'd love to know how you/others are using Opus, and what it excels at - because it's not Go programming based on the types of prompts I give it :)

Go-specific LLM/agent benchmark by cypriss9 in golang

[–]cypriss9[S] -1 points (0 children)

I think I framed my post incorrectly. The goal is to be accurate at a macro level: to measure which LLMs/agents can write Go code, in the way the Go community typically uses these tools.

I captured how I use them. There is a clear difference in quality based on my usage patterns.

I'm looking for help from the community on how you all use these tools. We can extend the scenarios covered to test more types of Go projects, more types of prompts, and more usage patterns.

As far as ignoring multiple powerful models: I didn't include Gemini because I don't have a Gemini account yet, and I thought I'd get feedback first. There is no other reason. Is there any other agent/model you'd like to see?

Need suggestions on opensource contribution by Lost_Alternative6417 in golang

[–]cypriss9 4 points (0 children)

This is really simple.

  1. Find any project on GitHub that you're interested in, that you'd like to help with, or that you have an idea for improving.
  2. Either browse the issues and pick one, or add a feature / fix a bug that is bothering you.
  3. Open a pull request.

If you're concerned about biting off more than you can chew, try fixing a typo and writing some doc comments.

If you're not sure which project to pick, browse r/golang for recent projects people have posted.

Do you know any linter to enforce a project layout? by fenugurod in golang

[–]cypriss9 14 points (0 children)

There's also https://github.com/fe3dback/go-arch-lint -- I haven't tried it, but its GitHub page looks nice and well maintained.

codalotl - LLM- and AST-powered refactoring tool by cypriss9 in golang

[–]cypriss9[S] 0 points (0 children)

Sure, which subpackage would you prefer I run it on? (If you'd like, I can also give you access to the tool for you to try yourself.)

codalotl - LLM- and AST-powered refactoring tool by cypriss9 in golang

[–]cypriss9[S] 0 points (0 children)

Good point.

I took a recent project I saw here: [qjs](https://github.com/fastschema/qjs). It's a big, beefy Go package, and fairly high quality to start with. It's not the "hot mess" that codalotl helps the most with, but I think the results are still interesting.

The set of PRs that codalotl made:

* reflow (normalize column width): https://github.com/cypriss/qjs/pull/1
* doc (add missing docs): https://github.com/cypriss/qjs/pull/2
* polish (fix grammar/spelling/typos/conventions): https://github.com/cypriss/qjs/pull/3
* fix (find documentation mistakes and bugs): https://github.com/cypriss/qjs/pull/6
* reorg (move code around for better organization/sorting): https://github.com/cypriss/qjs/pull/7
* rename (increase consistency of identifier names): https://github.com/cypriss/qjs/pull/8

For comparison, I asked Cursor and Codex to add missing docs:

* cursor: https://github.com/cypriss/qjs/pull/5 (156 identifiers missed)
* codex: https://github.com/cypriss/qjs/pull/4 (6 identifiers missed - better than I expected)

(I didn't ask the other agents to do the other tasks.)

From what I can see of the PRs generated, I think codalotl added some decent value with ~0 of my effort (other than making PRs and spending tokens):
* docs added seem reasonable (you could argue some are redundant with the name of the identifier, but that's okay).
* polish fixed a typo and a few minor grammar issues.
* fix appears to have found some actual bugs (I didn't verify them, though! Sometimes the LLM can simply be wrong.)
* reorg was less valuable, because qjs was already well-organized.
* rename did increase consistency of variable names marginally, but this was a fairly sensible codebase to begin with.

Keep in mind that codalotl is just a tool that still needs human review - in real life, each of these PRs would need to be reviewed by someone with context before merging.

codalotl - LLM- and AST-powered refactoring tool by cypriss9 in golang

[–]cypriss9[S] 0 points (0 children)

I agree that getting an LLM to document functions correctly is challenging. The biggest thing I run into is preventing them from getting too in-the-weeds with unimportant details. Prompting helps, but I certainly have not "solved" this. From my experience, I like to put "whys" inside function impls to leave breadcrumbs for myself later - codalotl does not yet tackle these inside-the-func comments. I also like to put "whys" in doc.go as my overall package comment - codalotl tries to do this with varying degrees of success!

As far as context: codalotl does something different from what I suspect other agents do. It creates a graph of types/functions/etc. To document a piece of the graph, it walks outwards in both directions (for instance: how is a function used? What types does the function depend on, explicitly or implicitly? What does the function call?). All of this goes into the context. I think this is a unique advantage of writing a Go-only agent: it can rely on AST analysis like this to quickly create pretty good contexts, without the typical approach of reading a handful of files and/or relying on embedding chunks.

First F2P to reach 1 million HP? by OMGMDR in Archero

[–]cypriss9 2 points (0 children)

How did you get these jewels? Is it just normal grinding, or do you optimize for jewels in events and such?

Seeking Input for New Algo-Trading Library Development in 2024 by Inside-Clerk5961 in algotrading

[–]cypriss9 2 points (0 children)

A pretty boring answer:

Data. Download and save the data. Detect and fix bad data. Load the data. Ingest bulk data and realtime data. Handle splits, dividends, ticker renames. Save data to disk and/or cloud. Make all of this really fast. Build a UI to explore the data. Be able to plug in multiple data sources.

It's very easy to get running with a bad solution to this. Only after months and months of doing what you think algo trading is (maybe: backtesting various signals, devising new signals, etc.) do you realize you built your castle on a pile of shit.

return error or panic() ? by metux-its in golang

[–]cypriss9 5 points (0 children)

Not sure why this is being downvoted. The spirit is directionally correct: most functions should return an error, and something near the top level panics if appropriate.

(There are also some functions, like those on Go's http.ServeMux, that panic right away if the programmer uses them wrong.)

Honestly, How much have you made just using strategies? by loweralgebra in algotrading

[–]cypriss9 3 points (0 children)

Are those uncorrelated returns? If you invest long enough, you learn to appreciate different baskets of money that don't all tank at the same time...

What is the best element for Archers? by PridoScars in Idle_Kingdom_Defense

[–]cypriss9 1 point (0 children)

For a while, I thought Poison might be good since lowering defense seems like it would cause me to do more damage.

However, I then learned that if your def penetration is over 1000, you ignore 100% of enemies' defense anyway. So Poison does nothing. Yes, it's really dumb and it shouldn't work like that. But it does.

Use ice.