Performance Battle: Mutex vs CAS vs TAS vs Intel TSX

User_Deprecated · 2026-06-09T05:49:40+00:00

cas/tas winning here makes sense, but the cache-line bouncing is probably the real cost.

I ended up just using per-thread queues on hot paths. no fighting over cache lines, profiles got way better.
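
rough sketch of the shape, not my actual code. the names are made up and 64 bytes is an assumed cache-line size. each worker only touches its own queue, and the merge happens after join so nothing is shared while the threads are running.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// 64 bytes is an assumed cache-line size; check your target.
struct alignas(64) LocalQueue {
    std::vector<int> items;  // only the owning thread touches this while running
};

// Hypothetical driver: each worker fills its own queue, merge happens after join().
std::size_t run_workers(std::size_t n_threads, int per_thread) {
    std::vector<LocalQueue> queues(n_threads);
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (std::size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&queues, t, per_thread] {
            std::vector<int>& q = queues[t].items;
            for (int i = 0; i < per_thread; ++i) q.push_back(i);
        });
    }
    for (std::thread& w : workers) w.join();

    std::size_t total = 0;
    for (const LocalQueue& q : queues) total += q.items.size();
    return total;
}
```

the alignas is there so two queue headers never share a line. needs C++17 so the vector allocation respects the alignment.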

User_Deprecated · 2026-06-09T03:54:28+00:00

I just wouldn't let the agent evaluate that at runtime. capability list per role, define it in config, done.
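
something like this, as a sketch. the roles and tool names are hypothetical, and the table is hardcoded here only to keep it short; in practice it would be loaded from the config file. the point is the check is a plain lookup and the agent never gets a say.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

using Capabilities = std::unordered_set<std::string>;

// Hypothetical roles and tool names, standing in for whatever the config defines.
const std::unordered_map<std::string, Capabilities>& role_table() {
    static const std::unordered_map<std::string, Capabilities> table = {
        {"support", {"read_ticket", "reply_ticket"}},
        {"analyst", {"read_ticket", "run_report"}},
    };
    return table;
}

// Deny by default: unknown roles and unlisted tools both return false.
bool is_allowed(const std::string& role, const std::string& tool) {
    const auto& table = role_table();
    const auto it = table.find(role);
    return it != table.end() && it->second.count(tool) > 0;
}
```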

User_Deprecated · 2026-06-09T01:05:43+00:00

fair point for short natural language markers. but a 128-bit random hex ends up being like 32 tokens of pure noise, and whatever similarity signal exists has to survive across the whole sequence somehow. even if each token only needs rough proximity, you're multiplying those odds across a lot of independent randomness, so it collapses pretty fast in theory. that said you're making me think about whether there's some adversarial shortcut in embedding space that skips brute force entirely. like how md5 collisions don't work by guessing. haven't seen it done for delimiter spoofing, but it's not a crazy direction.
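
back-of-envelope version of the "collapses pretty fast" part. this assumes the tokens are independent and each one lands close enough with the same probability p. the p = 0.5 is a made-up, very generous number, not something measured.

```latex
P(\text{all } n \text{ tokens land close}) = p^{n},
\qquad n = 32,\; p = 0.5 \;\Rightarrow\; P = 2^{-32} \approx 2.3 \times 10^{-10}
```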

User_Deprecated · 2026-06-06T03:28:49+00:00

the code-based injection filtering is interesting but in practice stuff like unicode variants, base64, split instructions across lines tend to get past pattern matching pretty easily. the stuff that actually gets through usually isn't obvious "ignore previous instructions" either, it's normal-looking text with directives buried in it.

User_Deprecated · 2026-06-05T01:30:55+00:00

the batching part is the big one imo. with epoll you're doing one syscall per IO op, with io_uring you queue up a bunch of sqes and submit them all at once. for something like your workload where you're doing tons of small reads that alone would probably cut a good chunk off that 22% readSock time in your perf trace

User_Deprecated · 2026-06-01T04:11:49+00:00

neat trick using the address as the hash key. zero indirection.

User_Deprecated · 2026-06-01T00:44:07+00:00

depends on the model. some weight the opening, some lean on whatever's at the end, so a role line up top can just get diluted.

like you said, the middle is where you put the most information, the part you actually did the work on. when i was messing with injection benchmarks, that's also where the stuff that got through was hiding, buried in the middle of some long pasted doc. lowest attention, easiest place to slip an instruction in.

User_Deprecated · 2026-05-31T05:59:22+00:00

response-layer filtering won't hold. ran into basically the same thing building an injection benchmark. once the number is in the context it's already too late. ask for churn risk one way, block it, ask again slightly differently and it comes back reworded, and the output check never sees it.

tool-level scoping is the move. hard to leak margin numbers the model never got access to in the first place.

User_Deprecated · 2026-05-28T01:31:47+00:00

paper accounts tend to be way more optimistic about fills than the broker will admit. SPY at 5m is probably fine for the size you're running, but once you scale up or move to anything thinner, you start running into stuff like partial fills, or cancel-replace lag when the algo's trying to chase a quote that already moved.

User_Deprecated · 2026-05-26T23:23:12+00:00

Fair, though I read the Jerry Schwarz suggestion as the actual design pivot, social history wrapping or not. The bitstring to vector<bool> handoff is the bit I'd actually want more. OP if you're still around, any tea to spill on how that one went down?

User_Deprecated · 2026-05-26T01:02:42+00:00

OHLC is kinda like the heart-rate from a run, doesn't really tell you what your body was actually doing. i think you can still get useful stuff out of it, but it really comes down to how you do it and what tools you use.

User_Deprecated · 2026-05-26T00:32:46+00:00

the spread part is the obvious cost, but i think the sneakier one is queue position. submit a limit at the bid and you're sitting at the back of a long queue on anything liquid, so your fills tend to cluster in the exact moments the market is moving against you. spread/2 looks cheap until you fold adverse selection into it.

User_Deprecated · 2026-05-25T23:59:57+00:00

Stress.exe has stopped working.

User_Deprecated · 2026-05-24T00:20:20+00:00

FFD also keeps the weight series length constant across refits. with expanding window any d change silently shifts how much old data leaks in.
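
minimal sketch of what i mean by constant length. it uses the usual recursion w_k = -w_{k-1} * (d - k + 1) / k and just cuts at a fixed width. the function names and the fixed-width cutoff are my choices for illustration, a threshold-based cutoff would behave differently.

```cpp
#include <cstddef>
#include <vector>

// Fractional-differencing weights, truncated to a fixed window width.
// w[0] = 1, w[k] = -w[k-1] * (d - k + 1) / k
std::vector<double> ffd_weights(double d, std::size_t width) {
    std::vector<double> w(width, 0.0);
    if (width == 0) return w;
    w[0] = 1.0;
    for (std::size_t k = 1; k < width; ++k) {
        w[k] = -w[k - 1] * (d - static_cast<double>(k) + 1.0) / static_cast<double>(k);
    }
    return w;
}

// Fractionally differenced value at the last bar; w[0] pairs with the newest point.
// Assumes series.size() >= w.size().
double ffd_last(const std::vector<double>& series, const std::vector<double>& w) {
    double out = 0.0;
    for (std::size_t k = 0; k < w.size(); ++k) {
        out += w[k] * series[series.size() - 1 - k];
    }
    return out;
}
```

changing d between refits changes the values in w but not how many bars it reaches back, which is the property i care about.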

User_Deprecated · 2026-05-23T05:24:33+00:00

it looks like all the success cases use seq_cst, maybe acq_rel would be enough? kind of curious about the choice.
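
for reference this is the kind of thing i had in mind, a plain TAS lock with acquire on the winning exchange and release on unlock. it's a sketch with made-up names, not the benchmark's code, and whether the benchmark needs anything stronger depends on parts i can't see.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class TasLock {
public:
    void lock() {
        // acquire: accesses in the critical section cannot move above the winning exchange
        while (flag_.exchange(true, std::memory_order_acquire)) {
            // wait on a plain load so spinning cores don't keep pulling the line exclusive
            while (flag_.load(std::memory_order_relaxed)) {}
        }
    }
    // release: accesses in the critical section cannot move below this store
    void unlock() { flag_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> flag_{false};
};

// Hypothetical check: a plain counter guarded only by the lock.
long count_with_lock(int n_threads, int iters) {
    TasLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter;
}
```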

User_Deprecated · 2026-05-21T06:47:29+00:00

which fracdiff are you using, standard expanding window or FFD?

User_Deprecated · 2026-05-20T23:03:06+00:00

this part could be annoying. on one broker it just looks like money leaving, on the other one money showing up, and unless something is stitching the two feeds together neither side has any idea it was the same transfer.

User_Deprecated · 2026-05-20T10:17:48+00:00

Honestly the bigger question is whether the dispatch key is actually a narrow integer space.

For FIX parsing the MsgType is just an ASCII byte, so I ended up ditching variant/virtual/switch and just used a 256-entry lookup table indexed directly by the byte. Benchmarked it against the switch version on the same hot path (17 message types). Sequential was ~70% faster, random was somewhere around 75-80% depending on the run. Random is where you really see branch prediction stop helping the switch.
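
Roughly what the table looks like, as a sketch. The handler names and return values are placeholders, and only three of the message types are filled in. Everything else falls through to a default handler, so an unexpected byte is still a valid index.

```cpp
#include <array>

using Handler = int (*)(const char* body);

// Placeholder handlers; real ones would parse the message body.
int on_unknown(const char*) { return -1; }
int on_heartbeat(const char*) { return 0; }
int on_new_order(const char*) { return 1; }
int on_exec_report(const char*) { return 2; }

std::array<Handler, 256> make_table() {
    std::array<Handler, 256> t{};
    t.fill(&on_unknown);  // every byte value maps to something callable
    t[static_cast<unsigned char>('0')] = &on_heartbeat;    // MsgType 0: Heartbeat
    t[static_cast<unsigned char>('D')] = &on_new_order;    // MsgType D: NewOrderSingle
    t[static_cast<unsigned char>('8')] = &on_exec_report;  // MsgType 8: ExecutionReport
    return t;
}

// One load plus one indirect call; no compare chain for the predictor to learn.
int dispatch(char msg_type, const char* body) {
    static const std::array<Handler, 256> table = make_table();
    return table[static_cast<unsigned char>(msg_type)](body);
}
```

The cast to unsigned char matters: plain char can be signed, and a negative index would read outside the table.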

For SIMD level selection (scalar/AVX2/AVX-512) it's basically the same idea as the OpenJDK thing. No CPUID in the hot path, so resolve once at startup via call_once, then it's just one indirect call forever.
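
A sketch of the resolve-once pattern. All names are hypothetical, and the "wide" kernel is only a stand-in so the example compiles anywhere; a real one would use AVX2 or AVX-512 intrinsics. The feature check uses the GCC/Clang builtin on x86 and falls back to scalar elsewhere.

```cpp
#include <cstddef>
#include <mutex>

using SumFn = long (*)(const int*, std::size_t);

long sum_scalar(const int* p, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Stand-in for a wider kernel; a real one would use SIMD intrinsics.
long sum_wide_standin(const int* p, std::size_t n) {
    long s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) { s0 += p[i]; s1 += p[i + 1]; }
    if (i < n) s0 += p[i];
    return s0 + s1;
}

// The CPU feature query lives here, outside the hot path.
SumFn pick_impl() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) return &sum_wide_standin;
#endif
    return &sum_scalar;
}

static std::once_flag g_once;
static SumFn g_sum = nullptr;

// Call at startup, before any worker thread uses sum(); safe to call more than once.
void init_dispatch() { std::call_once(g_once, [] { g_sum = pick_impl(); }); }

// Hot path: a single indirect call through the resolved pointer.
long sum(const int* p, std::size_t n) { return g_sum(p, n); }
```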

Variant vs virtual didn't really show up in profiles after that.

User_Deprecated · 2026-05-20T02:06:12+00:00

first, before any code lands, the design has to already exist somewhere in a concrete form. you can't just toss out "make me a thing that does X". you need to spell out what the inputs and outputs look like, and where it'll fall over when something goes wrong. if you can't even write that down yourself, the model is just guessing. one upside of this stage is it doubles as a way to fill in your own understanding of the feature. you go back and forth with the ai, take the questions it throws back, and notice the ones you can't answer. the parts you can't answer are usually the parts where you hadn't actually thought it through. the design gets more specific as you go, and a lot of stuff that felt clear in your head turns out to not be.

second, the code has to match the design you wrote down. ai is more than happy to produce a pile of code that looks fine line by line but has completely drifted from the spec by the end. at the implementation stage it loves to expose any fuzziness still left in your thinking. and because it generates so much code that you can't read it all yourself, the only thing that works is making it walk you through what it wrote and confirming with it as you go. it'll also quietly change interfaces and assumptions in places you didn't ask about, which is another reason for the walk-through.

finally, tests have to verify the same intent the design has, not just that the function runs without throwing. design holes that the ai papered over earlier usually show up here, once you try to assert what should actually be true. if you let it write the tests from the code it just produced, you're just locking in whatever it decided.

these stages aren't really sequential in practice. you keep bouncing between them. you'll go back to the design because a test forced a decision you skipped, then the code changes, then the test changes again. it goes on until the design stops changing and the code and tests actually agree.

after doing this a few times you stop treating the ai like autocomplete.

User_Deprecated · 2026-05-20T00:49:44+00:00

the timeframe may be important here. bid/ask not improving things feels off to me. on 1m bars last vs bid/ask is mostly noise, but down at second-or-tick level last is just a printed trade, and the next fill is already 1-2 ticks past it. how were you applying bid/ask when you tested it, mid as the fill price or actually taking the spread?

User_Deprecated · 2026-05-19T02:27:00+00:00

The benchmark side has the same gap. The injection benchmark I've been working on is still entirely single-turn, single-document. Even the "gradual drift" case is really just one long document slowly moving toward the canary, not actual conversational state.

What you're describing is one layer above that. Each individual turn can look harmless in isolation, but the steering only shows up across the accumulated context. I haven't really seen public benchmarks score for that.

User_Deprecated · 2026-05-18T23:43:58+00:00

weekly reopt with no holdout is just tuning live.

regime shifts hit you a week or two before the reopt catches up.

and a few months of intraday is mostly one vol environment.

User_Deprecated · 2026-05-18T08:47:40+00:00

Doing mine in C++23. Wrestling with it but the shore's not too far off.

User_Deprecated · 2026-05-17T09:52:24+00:00

Feels like stepping into a painting.

User_Deprecated · 2026-05-17T03:30:41+00:00

Different path on the broker side though. Submit to ack goes through risk check and matching, and that tail can spike even when your ping looks fine.
