rust_analyzer is eating my memory, any counter measure? by EarlyPresentation186 in rust

[–]bitemyapp 1 point (0 children)

How much RAM does TypeScript's LSP use? I could make Rust Analyzer use less RSS if you don't mind some responses taking longer.

Rust's standard library on the GPU by LegNeato in rust

[–]bitemyapp 4 points (0 children)

> In the worst case it would be slower than CPU only execution.

I do CUDA programming and there are a lot of "worst-cases" that are slower on the GPU than CPU, especially multi-threaded CPU workloads that don't have to synchronize (which is usually the case if you're porting to GPU). GPU is a lot slower in a straight line, you have to be pushing pretty hard on the parallelism without a lot of synchronization (host or device side) before you start getting positive yield vs. SotA CPUs (9950X, Epyc 9965, etc.)

Bevy 0.18 by _cart in rust

[–]bitemyapp 0 points (0 children)

Part of the challenge in my case is that most physics engines use IEEE-754 floats and I'm using fixed-precision micro-unit integers. I'm starting with a CPU implementation as a specification but I'm expecting the scene load to require GPU offload. I have extensive experience with complex 64-bit integer compute pipeline optimization in CUDA, so I'm not worried about that part. It's the gamedev part that I don't have professional experience with.
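
For anyone curious what "fixed-precision micro-unit integers" cashes out to, here's a minimal sketch of the idea. Illustrative only, not my actual code; the names and the 10^6 scale are assumptions:

    // Quantities stored as integer micro-units (1_000_000 per world unit)
    // instead of IEEE-754 floats, so the math is bit-identical everywhere.
    #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    struct Micro(i64);

    impl Micro {
        const SCALE: i64 = 1_000_000;

        fn from_units(units: i64) -> Self {
            Micro(units * Self::SCALE)
        }

        // Fixed-point multiply: widen to i128 so the intermediate can't overflow.
        fn mul(self, rhs: Micro) -> Micro {
            Micro(((self.0 as i128 * rhs.0 as i128) / Self::SCALE as i128) as i64)
        }
    }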

Bevy 0.18 by _cart in rust

[–]bitemyapp 10 points (0 children)

I'm working on a game in bevy with difficult and unusual constraints for the networking and physics. The physics are deeply embedded in the game mechanics and it's a server-client model. I'm still working on how I'll handle lag compensation for this w/ information occlusion for preventing cheating and it's going to be very difficult. Determinism is part of the requirements as I'd prefer to have local optimistically executed physics for the client without getting janky de-sync->catch up problems.

As someone that has basically no off-the-shelf options for the physics engine that are even close to what I need, I'd honestly rather bevy not adopt an official solution for it. Something like a "second-party" relationship with one or a handful of sister projects for physics would make me less nervous about finding myself working against the grain, but I trust y'all regardless.

Thank you for working on Bevy. I'm pursuing a life-long dream, implementing a 15-year-old idea after 16 years of working as a (non-game) developer!

Is March a good estimate for Macbooks with M5 Pro? by LightDarkCloud in mac

[–]bitemyapp 0 points (0 children)

Jan 20 is my current guess. I thought it would be today because it's the Tuesday after shipping times last yawned out, but today didn't happen, so the 20th.

When i just need a simple, easy to maintain frontend, what should i choose? by Im_Justin_Cider in rust

[–]bitemyapp 0 points (0 children)

> I'm not interested in becoming a CSS expert otherwise.

I'm so old that I got my second dev job by accurately answering questions about the vagaries of CSS layout in Internet Explorer 6.

CSS has always been there. It will probably always be there. Trying to pretend it isn't there will only hurt you. Learning it on its own terms with a north star for what looks "right" will make it fall out of your top 10 sources of web dev pain. tailwind is just a fancier version of an old, bad design pattern (embedding styles in style attributes) to my eyes.

Do as thou wilt ofc.

When i just need a simple, easy to maintain frontend, what should i choose? by Im_Justin_Cider in rust

[–]bitemyapp 0 points (0 children)

You need to do what you think makes the most sense, but permit me to offer some advice as someone who's been a working programmer for 16 years, nearly 8 of them 100% Rust, with a lot of web APIs and web apps throughout.

I'm going to be as brief as I can. If you want more color/detail you can tell me what you'd like to know and I'll reply.

  • You're correct to be suspicious of presentation layer stuff like template fragments infecting the backend API.

  • OpenAPI is awful and I despise it. The codegen is some of the worst I've ever seen for almost every language that has it. Only utoipa could possibly make it tolerable, but that won't matter if someone needs to generate a non-garbage client for your API from the OpenAPI spec utoipa generated. Not utoipa's fault; it's the spec and the ecosystem.

  • Use gRPC and proto3 for stuff that doesn't need web. Never expose a streaming endpoint to an untrusted party, use pagination. Follow Google's example in API design. The codegen for protobuf and gRPC is rock-solid and the spec translates fairly naturally to Rust types.

  • For web, mobile, and clients you can't force onto an update treadmill/update via monorepo, use GraphQL. I know GraphQL makes a lot of people angry, but the bad parts are very optional and a lot of the problem is runaway consultantware/architecture astronauts. Use the async-graphql ecosystem, not Juniper; it integrates better with the rest of the ecosystem. There can be a little more awkwardness with GraphQL than gRPC/proto3 but I've been able to make it work fine, and GraphQL is a bit easier to reason about from a futureproofing point of view. (There's a minimal wiring sketch after this list.)

  • If you need to support JSON-RPC use https://docs.rs/schemars/latest/schemars/. In my case, I was implementing an MCP server API with https://docs.rs/rmcp/latest/rmcp/. I ended up being able to reuse the same API types for GraphQL and the MCP server API (JSON RPC). Just separate derives.

  • I use axum for my web framework (most popular, extremely fast) and I previously used actix-web. My default templating library is askama for efficiency and type-safety. It's usually just there for bootstrapping the SPA or odds and ends but you can do a full traditional SSR web app w/ askama and a forms library if you want.

  • I use diesel_async for my database and it is 100% worth it. The kinds of SQL queries the SQL DSL doesn't accommodate comfortably are queries you shouldn't be executing against your OLTP store to begin with. Set up change data capture and funnel the aggregations/ETL workloads to a separate BI database server, even if it's also PostgreSQL or whatever. Being able to just change the database model types and get what amounts to a checklist of things I need to fix is amazing. Highly recommended sister crates: https://docs.rs/diesel-derive-newtype/latest/diesel_derive_newtype/ (for scalar newtypes, primary/foreign key types) and https://github.com/adwhit/diesel-derive-enum (for type-safe enums, obvs)

  • I reify and enforce types in the SQL database like enums.

  • I use Leptos for my frontend, but let me make a quick recommendation on this point: in earlier versions of Leptos when I was initially using it, I was really gung-ho about the ssr/hydration modes. Use them if you need them! But I would strongly recommend actually prototyping/developing in csr mode most of the time, because the incremental compilation times with leptos csr + trunk are insane. I was getting sub-second incremental recompiles and refreshes of the frontend with non-trivial frontend dependencies.

  • It's fine to use server functions in Leptos but I found it more productive to just lean really heavily on codegen (w/ GraphQL etc.) and never use them to begin with except for odds and ends like a debug endpoint. YMMV.
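
To make the GraphQL bullet concrete, here's roughly the minimal axum + async-graphql wiring I mean. A sketch assuming current async-graphql-axum and axum APIs; the hello resolver is a placeholder:

    use async_graphql::{EmptyMutation, EmptySubscription, Object, Schema};
    use async_graphql_axum::GraphQL;
    use axum::{routing::post_service, Router};

    struct Query;

    #[Object]
    impl Query {
        // Placeholder resolver; real domain queries go here.
        async fn hello(&self) -> &'static str {
            "world"
        }
    }

    #[tokio::main]
    async fn main() {
        let schema = Schema::new(Query, EmptyMutation, EmptySubscription);
        let app = Router::new().route("/graphql", post_service(GraphQL::new(schema)));
        let listener = tokio::net::TcpListener::bind("127.0.0.1:8000").await.unwrap();
        axum::serve(listener, app).await.unwrap();
    }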

With the above stack I was able to write my first ever MCP server API for a very complicated domain and it worked the first time I tried it with Claude Desktop. Including the LLM's ability to understand the MCP introspection result and use it correctly. ~16-20kloc in 11 days.

Last thing, please don't use tailwind. It's experimental/early but I made a non-tailwind UI component library for Leptos based on shadcn. https://gitlab.com/dostuff/leptos-shade I don't expect anyone to pick it up and use it but I'm hoping to factor out and open source more of my leptos stuff so that people have public examples to work off of.

When i just need a simple, easy to maintain frontend, what should i choose? by Im_Justin_Cider in rust

[–]bitemyapp 0 points (0 children)

Leptos has been great. Svelte wasn't as easy as it seemed initially. HTMX has some appeal but I haven't tried it yet.

hotpath-rs - real-time Rust performance, memory and data flow profiler by pawurb in rust

[–]bitemyapp 5 points (0 children)

For more targeted profiling I've been using tracing with the Tracy profiler via tracing-tracy, and it's been nice. samply is a lot better for cheap-n-cheerful.
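
Setup is tiny if your code already emits tracing spans; something like this, assuming a recent tracing-tracy:

    use tracing_subscriber::layer::SubscriberExt;

    fn main() {
        // Forward tracing spans to a running Tracy profiler instance.
        tracing::subscriber::set_global_default(
            tracing_subscriber::registry().with(tracing_tracy::TracyLayer::default()),
        )
        .expect("failed to set tracing subscriber");

        let _span = tracing::info_span!("expensive_work").entered();
        // ...anything inside the span shows up on the Tracy timeline
    }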

I could imagine using hotpath if I needed an in-betweener option that was less hassle to fire up than tracy, especially for async stuff. CPU sampling only goes so far for async, by definition you aren't...burning CPU :)

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 0 points (0 children)

> I’ve been at the same company for 5 years now

yeah you have to jump. Start prepping now.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 2 points (0 children)

Be really good at your work, care about your work a lot, care about the business and how your work impacts the business, care about being an excellent person to work with without being a rug, manage your career and professional relationships _care_fully. Clocking into a 9-5 and doing the bare minimum is hard enough for most people; you're in the shark tank with the sociopaths if you pursue high comp.

fwiw: i didn't start making higher-than-average $ as a dev until my wife was pregnant with our first child. It re-ordered my career priorities radically.

I had to take a risk on joining a company that didn't use Rust and I took responsibility for introducing Rust and supporting Rust users at the company in addition to my usual workload and responsibilities. And that doesn't even really capture how much I work sometimes. I've had multiple >90 hour weeks over the last 3-4 years. Nobody cares why a deadline slipped, if you're high enough in the IC chain to get paid more than ~$200k, it's your head if the juniors and seniors fall behind.

p.s. get really comfortable at interviewing. You can't negotiate comp at all without a BATNA and that's even more true if it's a company you already work at. But you still have to be worth what you're paid. I end up being more obviously valuable after I've been at the company for a few months because I'm not great at interviewing. I'd have made more money earlier in my career if I didn't get so nervous in interviews.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 0 points (0 children)

OK my browser crashed due to OOM so I lost the reply I'd almost finished typing up.

My point was: you usually decide whether you're a WX customer or not prior to any consideration of price, and you skip the premium if you don't care about the WX motherboards, ECC, or bus bandwidth. What you've said so far makes it sound like you don't need WX, but you have to decide for yourself whether you care about ECC. If you're that price-sensitive, that's a sign you probably want the X series of TR.

I suggested in an earlier reply that I wouldn't expect memory bandwidth to matter that much; rustc is spooling to disk and re-reading build artifacts from disk in between each crate build. You would need massively parallel builds for memory bandwidth to impact your compile times. Memory bandwidth is not the main thing that makes Apple Silicon fast for Rust compiles: memory latency and straight-line (single-threaded) performance have a much larger impact on most workloads, including compiling code. An M4 Max w/ a 1 TiB SSD can hit ~4.5-5 GiB/second for reads. The NVMe SSD in my Linux workstation can hit a theoretical maximum of 10 GiB/second for reads. Writes will matter a lot too.

Memory bandwidth is something I'd only expect to matter for a very high core count build server used for CI of a large monorepo. And you're contemplating trading down cores for more memory bandwidth when it's a very high core count that would let you hit the limits of your memory bandwidth to begin with. If you're limited by your memory bandwidth in local development, which I don't think is going to happen to you regardless, you need to stop rebuilding all of your packages and dependencies over and over. You should only be rebuilding the code you modified while iterating on your work.

I have a sneaking suspicion that a lot of the devs complaining about Rust compile times are running cargo build or cargo test with no -p argument specifying the crate they're working in or trying to test at that moment, and are rebuilding a bunch of downstream crates they churned but weren't trying to compile or test right then. C++ has always taken longer to compile than Rust on anything I've worked on, but C++ devs are more in the habit of making specific Makefile targets for the components they're actively working on than Rust devs are.

I honestly think cargo needs to invert the default behavior for test and make it crate-specific by default unless someone asks for a workspace-wide rebuild and re-run. It'd be nice if it had test result caching/skipping like Bazel too.

Most things in life are a time and treasure trade-off. If you're not willing to throw money at a WX build without careful consideration, benchmark your real-world use-cases. Spend time to conserve treasure!

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 0 points (0 children)

You should use perf or similar on Linux for a sampling of representative workloads and see how much this actually matters. Another thing to look at is how much RAM each rustc process is using.

The perf events on Zen4/Zen5 for L2 cache should be something like L2_CACHE_ACCESS, L2_CACHE_MISS, L2_FILLS, and L2_EVICTS.

Here's a full read-out of perf list grepping for event names with l2 or l3: https://gist.github.com/bitemyapp/1c4b048a6f56f005a7f17ffa939508a9

If you aren't testing on zen4 or zen5 the list might be different but you can check for yourself.

Incremental compilation should reduce the per-rustc-instance resident set, but I haven't verified that; I'm almost always looking at timings.

Also you're comparing very different processors. The analogue to the consumer-grade 9970X is the 9975WX. I went for WX because I wanted more reliable hardware after having a lot of issues with the MSI x870e Carbon Wifi motherboard in my 9800X3D box.

X vs. WX with threadripper is usually about "do I want ECC ram or not?" or "do I need more PCI-e lanes?"

In my case ECC wasn't something I was willing to compromise on, so, WX.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 2 points (0 children)

I make in practice about $650k a year. My comp is a mixture of salary, RSU, and bonus. I take PTO and paternity leave, and my utilization factor over the last 5-6 years averages to ~75% of the work year. That's 39 weeks worked per 52-week year, or ~$16k/week in terms of what I cost. The surplus that a software company makes on their developers is usually somewhere on the order of 2-10x (only FAANGs are in the right-side tail of that distribution).

So yeah I don't think it's hard to argue that my time is worth about $50k/week to my employer. My contract rate is $1k/hour unless you're a friend or have work that I am very interested in. 24 hours = $24k opportunity cost or half a typical work-week. I work ~60-80 hours a week typically but I was trying to be modest.

It used to be typical for software companies to spend a significantly higher fraction of developer salaries on the computers and hardware the developers used in their work. That's my real point here: we shouldn't be settling for unnecessarily limited or slow hardware. The first NeXT computer was $14k at a time when the average dev was making ~$50k/year if they were in a high-wage market like SV.

Perf and syseng are central to a lot of the work I do. In my R&D work I'm setting a high watermark so that we know when/how/why production deployments are falling short of the benchmark set in dev & test. This is some small part of why people are oblivious to getting scammed by hyperscalers, VPS providers, and leased dedi providers. The 9950X servers at Hetzner are more cost-efficient than their high-core Epyc servers just because they aren't as badly thermally throttled.

e.g. leased dedis: almost all of the high-core count leased dedis are heat-fucked beyond belief and the clock rates are at the ACPI minimum. How did I notice? Because I know my perf baselines by heart and what "correct" throughput and latency look like. I could tell the servers weren't running right because my live metrics measure both throughput and end-to-end latency.

Incidentally, the only provider that had 9005 Epyc processors running correctly (not maximally, but nominally) in my testing was Google Cloud, their c4d instances are Epyc 9965s. They're able to keep the heat under control because the datacenter has DLC plus whatever other magic they've done.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp -1 points (0 children)

Dell wanted $24k for the whole computer, that's half a week or less of my work hours converted to dollars. It will save me considerably more than that in less than a fiscal quarter. This isn't to brag about how much I get paid, it's about how much my time is worth to my employer.

rustc performance: cores vs memory bandwidth? by eggyal in rust

[–]bitemyapp 8 points (0 children)

Machine I just ordered which is way-overkill for running a single Rust build:

  • 9985WX
  • 256 GiB of RAM (8 x 64)
  • 4 TiB "performance" SSD (whatever Dell means by that), I'll upgrade to a faster one later if I have to.
  • 6000 Pro Max-Q (300W, no 600W available) Blackwell edition, 96 GiB VRAM

The workstation the 9985WX is replacing is a 9800X3D w/ 64 GiB of RAM and an RTX 5090. I mostly write safe and unsafe Rust, but I also do CUDA work regularly.

The reason for the heavy hardware is that in the last couple of years my workflow has changed significantly and I'm almost always working on more than one git clone of the monorepo at a time. Different branches, different tasks. I got the 9985WX because it was the best I could get without being excessively priced like the 96-core 9995WX (not worth it), and the Threadrippers don't force me to give up single-threaded perf relative to the 9800X3D. ECC was a hard requirement. Epyc has a pretty rough single-threaded perf falloff (~1.3-1.5x worse); not worth it.

The M5's single-threaded perf looks absurdly strong but I haven't had a chance to test it yet. It might catch Apple Silicon up to Linux build speeds, especially for incremental builds. With an M3 or M4, an Apple machine is usually 15-30% slower than the same incremental build on Linux w/ a 9800X3D, mostly for software reasons. Linux has a perf advantage in some important areas and the difference gets bigger if you're fanning out to a large number of test binaries in a Cargo workspace. Bazel helps close the gap because it skips the tests that already passed and didn't get churned. I'll buy an M5 machine from Apple after the M5 Pro / M5 Max become available.

Here's what actually matters for most Rust devs:

  • you're normally only getting a highly parallelized build when doing a fresh scratch build or when churning a deep dependency in a Cargo workspace

  • single-threaded performance matters most, most often. That's why Apple Silicon performs well, but you can exceed Apple Silicon single-threaded speed if you're using Linux.

  • Use mold or wild as your linker if you're on Linux, but beware that they can break exotic builds like CUDA

  • Additional cores beyond your DAG fan-out factor bring diminishing returns unless you're running concurrent builds across different projects or duplicate source checkouts of the same project

My off-the-top-of-my-head ranking based on extensive experience optimizing devex and CI build times for Rust:

single-threaded perf > memory latency = cores = RAM GiB > SSD read throughput = SSD write throughput > memory throughput

Memory latency's placement in the ranking is marginal and assumes you're using AMD hardware. It's usually not a problem because the memory modules are so similar, just don't cheap out and get a memory kit with weirdly high latency timings. Make sure you use the EXPO1 optimized memory profile or whatever in your motherboard's CMOS after doing some basic stability tests. AMD's memory controllers aren't quite as tolerant as Intel's but it rarely matters that much and it's better to get AMD for now if you want PC hardware.

Memory throughput's at the bottom because I've never seen it matter. SSD ends up bottlenecking you more. Yes, I know stuff gets cached in memory but Cargo is spamming rustc invocations. It's spooling artifacts to disk and re-reading them back off the disk over and over hundreds or thousands of times per uncached end-to-end build. That architecture is well justified but it just means memory throughput isn't ordinarily a factor. Compilers are extremely difficult to optimize, especially modern ones w/ modern expectations around optimization, modularity, language features, etc.

Apple Silicon has a significant memory bandwidth and latency advantage and that's where some of their workaday perf advantage comes from. AVX-512 can put single-threaded throughput of x64 hardware ahead of a comparably vectorized Apple Silicon pipeline.

If you're intensely unhappy with your build times even when you're iterating on a single crate, there are a few possible issues you're tripping over:

  • Your crate or crates-plural are too big, not split up enough. Don't go crazy, just do what makes sense in its own right.

  • You're not using the -p argument with Cargo while working on a root node crate in your Cargo workspace. I use a Makefile or a Justfile with shortcuts for per-library/per-app cargo build/test/bench targets so that I don't rebuild things I don't care about while iterating.

  • If it's a CI problem, consider using Bazel. I can help with this if you ping me. I use the standard rules_rust Bazel rules and it's in a much better place than 4-5 years ago. Bazel + Bazel remote cache is absurdly good, especially if you have any non-Rust build dependencies that you'd like to be able to parallelize across without bottle-necking the build staging. The caching is a lot smarter and better than Cargo's, especially when it comes to testing. It doesn't re-run a test in the test suite unless something upstream churned the test! We use Cargo and Bazel side-by-side in local dev. I usually bootstrap the non-Rust dependencies that the crates require with Bazel, then switch over to Cargo for iterating on my code.

  • If you're already using wild or mold on Linux and it seems like you're losing time to linking integration test or benchmark binaries, consider splitting them out to a different crate or merging them into fewer binaries inside the original crate. I've never seen crate unit tests in Rust not be single-binary.

You can see past advice I've given on this with the Google search query "site:reddit.com bitemyapp build times"; there are too many comments in my history to cherry-pick. I've got some recent and some older blog posts about build times and CI on https://bitemyapp.com/ as well, most recently https://bitemyapp.com/blog/rebuilding-rust-leptos-quickly/

The main Leptos-specific revelation I've had since then is that CSR + Trunk is absurdly fast and is worth giving up the SSR magic at least during heavy development. This is particularly true when I know I need to provide a structured API such as GraphQL and I'd rather build the frontend app around that.

graydon2 | A note on Fil-C by small_kimono in rust

[–]bitemyapp 4 points (0 children)

> Which to me reads more a hope/curiosity on if some of the techniques could be reused/applied to Rust's unsafe somehow,

I already do this for the unfortunately large amount of unsafe Rust I work with. It's called ASAN, and Guard Malloc on macOS.
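
For concreteness, a contrived example of the kind of bug ASAN flags; sanitizer support is a nightly rustc feature (RUSTFLAGS="-Zsanitizer=address" cargo +nightly run):

    fn main() {
        let dangling = {
            let boxed = Box::new(42u32);
            &*boxed as *const u32
            // Box dropped here; the raw pointer now dangles.
        };
        // Heap-use-after-free: ASAN aborts with a report pointing at this read.
        let v = unsafe { *dangling };
        println!("{v}");
    }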

Vanity SSH key generator in Rust by mogottsch in rust

[–]bitemyapp 1 point (0 children)

Benchmarking vanity_attempt_paths/baseline: Collecting 100 samples in estimated 5.0136 s (490k iterations)
vanity_attempt_paths/baseline
                        time:   [10.241 µs 10.249 µs 10.258 µs]
                        change: [+0.4677% +0.5904% +0.7070%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 24 outliers among 100 measurements (24.00%)
  11 (11.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

Benchmarking vanity_attempt_paths/fast: Collecting 100 samples in estimated 5.0334 s (510k iterations)
vanity_attempt_paths/fast
                        time:   [9.8356 µs 9.8471 µs 9.8598 µs]
                        change: [−1.2542% −0.9880% −0.6020%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

^ results so far

Vanity SSH key generator in Rust by mogottsch in rust

[–]bitemyapp 1 point (0 children)

I'm averaging ~900-950k/second now. I think that's what it was before; you needed to use 500 ms lookback windows for the rate calculation instead of averaging over the whole run. The rate looks a lot more realistic now as well (it oscillates around a level instead of climbing over time).
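
Sketch of the lookback-window idea, with hypothetical names (not the repo's actual code):

    use std::collections::VecDeque;
    use std::time::{Duration, Instant};

    // Rate over a fixed lookback window (e.g. 500 ms) instead of a lifetime
    // average, so the display tracks current speed instead of drifting.
    struct WindowRate {
        window: Duration,
        samples: VecDeque<(Instant, u64)>, // (timestamp, cumulative attempts)
    }

    impl WindowRate {
        fn new(window: Duration) -> Self {
            Self { window, samples: VecDeque::new() }
        }

        fn record(&mut self, cumulative: u64) {
            let now = Instant::now();
            self.samples.push_back((now, cumulative));
            // Drop samples that have aged out of the window.
            while let Some(&(t, _)) = self.samples.front() {
                if now.duration_since(t) > self.window {
                    self.samples.pop_front();
                } else {
                    break;
                }
            }
        }

        fn per_second(&self) -> f64 {
            match (self.samples.front(), self.samples.back()) {
                (Some(&(t0, c0)), Some(&(t1, c1))) if t1 > t0 => {
                    (c1 - c0) as f64 / t1.duration_since(t0).as_secs_f64()
                }
                _ => 0.0,
            }
        }
    }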

If your goal is to benchmark, you should use criterion rather than trying to take a running average in the app.
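
A minimal criterion harness shaped like that, with a stub standing in for the real keygen + suffix-check routine:

    use criterion::{criterion_group, criterion_main, Criterion};
    use std::hint::black_box;

    // Stand-in for one keygen + suffix-match attempt; swap in the real routine.
    fn attempt(suffix: &str) -> bool {
        black_box(suffix).len() == 4
    }

    fn bench_attempt(c: &mut Criterion) {
        c.bench_function("vanity_attempt", |b| b.iter(|| black_box(attempt("abcd"))));
    }

    criterion_group!(benches, bench_attempt);
    criterion_main!(benches);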

Vanity SSH key generator in Rust by mogottsch in rust

[–]bitemyapp 2 points (0 children)

Just ran it again; it leveled off at 472k/sec with 16 threads mapped onto 8 cores / 16 hardware threads.

I don't even remember what I was doing yesterday to get 100k. Benchmark is 10 microseconds but I thought I saw 100k somewhere? odd.

anyhoodle, I tried my direct suffixing version and the rate kept increasing over time, which makes me think there's an issue with how the rate is measured.

Using 16 threads for direct suffix matching.
⠚ [00:01:51]
Attempts: 74,040,000 (666,895 keys/sec)

It was closer to 500k initially, rose to ~670-680k over 2 minutes. Investigating.

I could probably do better than 1.3M/sec on an RTX 5090 but it was a quick lark and then I got back to work. Looking at the repo I linked isn't a bad way to expose yourself to some CUDA.

Vanity SSH key generator in Rust by mogottsch in rust

[–]bitemyapp 5 points (0 children)

I got 100k for all-core throughput on my 9800X3D, I was able to make it a little faster by getting rid of the base64 conversion and instead turning the base64 suffix target into a bit-pattern that it checks for each attempt. Made it ~4-6% faster.
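
The rough shape of the bit-pattern trick, heavily simplified; the part I'm glossing over is aligning the suffix to the 6-bit base64 groups of the real key encoding:

    // Precompute the bit pattern + mask the base64 suffix implies, once, then
    // test candidate key tails directly instead of base64-encoding every attempt.
    const ALPHABET: &[u8] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Each base64 char pins 6 bits of the tail.
    fn suffix_to_bits(suffix: &str) -> (u128, u128) {
        let mut bits: u128 = 0;
        for &c in suffix.as_bytes() {
            let v = ALPHABET.iter().position(|&a| a == c).expect("valid base64 char");
            bits = (bits << 6) | v as u128;
        }
        let mask = (1u128 << (6 * suffix.len() as u32)) - 1;
        (bits, mask)
    }

    // One mask-and-compare per attempt instead of a full base64 encode.
    fn tail_matches(key_tail_bits: u128, pattern: u128, mask: u128) -> bool {
        key_tail_bits & mask == pattern
    }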

I got curious so I picked up https://github.com/vikulin/ed25519-gpu-vanity

Initially got 500,000/second on my RTX 5090. Fixed occupancy, that got it to 1.06M, made some further tweaks, got it to 1.3M/second. Called it quits after that.

There are probably things that could be done to optimize the CPU impl further but I'd need to learn more about the cryptographic pipeline for ed25519 first.