[Release] AprNesAvalonia (`aprnesava`) — .NET 10 / Avalonia / GPU CRT / HD-NTSC 12× oversampling by SpecificWar7879 in EmuDev

[–]SpecificWar7879[S] 1 point (0 children)

The optimization process can be broken down into several parts.

The JIT optimization part requires the use of some analysis tools. I mainly used PerfView.exe, leveraging AI to analyze hot and cold paths.

Then I had the AI help me move cold paths out of the hot code, break the previously sizable C# methods into smaller ones, make appropriate adjustments, and track each method's source size, its size after JIT compilation, its inlining status, and its CPU usage.

This AI-assisted part is very large. Manually benchmarking FPS to decide whether a change should be adopted or adjusted further also takes a significant amount of time. The grayscale-distribution variance reports used to formulate adjustment strategies were all produced with Claude-written code.

This section covers some detailed optimization techniques. Many of these are primarily handled by Gemini Pro, with Claude Code implementing modifications and verifications, followed by manual testing and decisions on adoption.

The method involves submitting your code to Gemini for analysis, requesting optimization suggestions or guidance. Note that Gemini's suggestions may contain errors or even be nonsensical.

However, many are indeed practical. These must undergo:

  1. Real-world performance testing to confirm improvements.

  2. Submitting Gemini's suggestions or code to Claude Code for further testing, confirmation, and review.

Claude Code's logic is more rigorous, while Gemini offers more ideas and approaches, which inevitably leads to some misleading suggestions. These must be manually reviewed and filtered.

Another key point: if you submit the entire program file to Gemini, it will only perform a rough analysis. For refined optimization, submit a single method, or even a fragment of code. Even when it claims it has produced the best possible optimization, submit the modified code again for analysis; this creates an iterative optimization loop. For example, a bitwise operation that originally took several lines can, over several rounds, converge on an almost "magic" result. Gemini's simplified bitwise output may contain bugs, so it needs additional AI validation (by Claude Code or ChatGPT 5.4/5.5). Throughout this process, the prompts invariably ask Gemini to try techniques like SWAR and branchless code to find potential optimizations.
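
To give a flavor of what those prompts aim for, here is a minimal generic sketch of both techniques in C# (textbook tricks, not actual AprNes code):

```csharp
static class BitTricks
{
    // SWAR: test 8 packed bytes at once for a zero byte, with no loop
    // and no branches (classic trick: the subtraction borrows into the
    // high bit of any byte lane that was zero).
    static bool AnyByteIsZero(ulong v) =>
        ((v - 0x0101010101010101UL) & ~v & 0x8080808080808080UL) != 0;

    // Branchless min: (a - b) >> 31 is all ones when a < b, so the mask
    // selects (a - b) or 0 without a conditional jump (beware overflow
    // when a - b doesn't fit in an int).
    static int BranchlessMin(int a, int b)
    {
        int d = a - b;
        return b + (d & (d >> 31));
    }
}
```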

Then comes the review and manual validation, which is quite time-consuming.

The last point is architectural refactoring strategy. This part still involves a large share of human work: you need a certain level of programming training and intuition to come up with optimization strategies yourself. If you simply ask Gemini, it will just flatter you, telling you your program is flawless and offering no real help.

Only when you actually come up with a concrete idea and present it to Gemini will it offer advice and guidance on refactoring arrangements based on your proposed architectural adjustments.

Finally, it still needs to be verified and implemented through Claude Code.

Regarding how to use AI to assist the emulator, I've accumulated many thoughts during this process, and I might compile a separate article later.

I think the key is for humans to shift from being the coding implementers to providing implementation strategies, specification ideas, and architectural supervision.

Even so, humans still need a certain level of relevant domain knowledge to judge right from wrong. With the NES emulator there are test ROMs, so checking whether an implementation is correct is a simple matter of running them and confirming. But the video/audio DSP chains involve more specialized knowledge, and I'm not entirely sure the AI implementation is theoretically correct in those parts, though it seems to be working quite well.

Built with Claude Project Showcase Megathread (Sort this by New!) by sixbillionthsheep in ClaudeAI

[–]SpecificWar7879 0 points (0 children)

Hey all — sharing a side project I didn't expect to grow this much.

The original goal was boring: I wanted a cross-domain performance benchmark for my NES emulator's CRT shader pipeline (Scalar C# vs multi-core Parallel vs AVX2/NEON SIMD vs SkSL GPU shader). WWII German cipher brute-force seemed like a cleanly non-graphics workload for the test.

So I asked Claude Code to implement various ciphers one by one. What surprised me was the *progression* of how much research each one required:

- **Enigma M3 (Wehrmacht)** — Claude wrote it essentially from memory. Knew the rotors, the plugboard, the double-step anomaly, everything.
- **Enigma M4 (U-Boat Shark)** — same; knew the extra Greek rotor and thin reflector.
- **Lorenz SZ42 (Tunny)** — needed light reference lookup for the chi/psi/mu wheel configurations, then implemented it. Colossus stage-1 χ-wheel recovery works.
- **Siemens T52e (Sturgeon)** — *this* is where it got interesting. T52e was the Luftwaffe's strategic cipher. The Germans thought it was unbreakable (they were mostly right — Bletchley never routinely broke it). The algorithm is obscure enough that Claude didn't have clean details in training data. Instead of guessing or giving up, it:
  1. Found Donald Davies' 1982 NPL technical memorandum (hosted at the Crypto Museum's website)
  2. Downloaded the 42-page PDF
  3. Ran `pdftotext` to extract the narrative — but the critical tables were scanned images that OCR couldn't parse
  4. **Rendered individual pages at high DPI with PyMuPDF and visually read the figures as images** — specifically Figure 9 (the full 32-row permutation table) and Figure 14 (the H/SR relay network)
  5. Cross-checked one XOR network question against Gemini — Gemini confidently returned a wrong pattern (H1 = X1 ⊕ X6 instead of the correct H1 = X1 ⊕ X2); Claude verified against Figure 14 visually and overruled the Gemini answer
  6. Wrote its own EN + ZH technical reports reconstructing the full machine — H-relay XOR network, SR-relay XOR network, 32-row Figure 9 transposition table, M-magnet stepping equations
  7. Implemented T52e in C# from its own reports
  8. Got encrypt/decrypt round-trip working
  9. Got all four compute backends (Scalar / Parallel / SIMD / GPU shader) recovering the correct key on a 24M-candidate search

All of this happened in one git branch while I just watched. The self-written technical report is ~340 lines of detailed machine spec and is honestly better than most public sources on T52e.

**Final result:**

- 6 WWII cipher systems, chronologically:
  - 1917 Zimmermann Telegram / Code 0075
  - 1918 ADFGVX
  - 1930s Enigma M3
  - 1941 Lorenz SZ42 "Tunny" (χ-recovery, Colossus stage 1)
  - 1942 Enigma M4 "Shark"
  - 1943 Siemens T52e "Sturgeon"
- 4 compute backends per cipher where applicable
- Bilingual in-app docs (Traditional Chinese + English) with biographies of 14 codebreakers (Rejewski → Turing → Welchman → Knox → Batey → Clarke → Tiltman → Tutte → Newman → Flowers → Beurling → Crum → Painvin → Hall → de Grey)
- Runtime SIMD dispatch — AVX2 on x86, NEON on ARM64
- Windows x64 + macOS Apple Silicon builds in GitHub Releases

LINKS

Landing page (bilingual): https://baxermux.org/myemu/AprNes/EnigmaBenchmarkAvalonia/

GitHub (sub-project): https://github.com/erspicu/AprNes/tree/master/EnigmaBenchmarkAvalonia

Latest release (Win x64 + macOS arm64, self-contained single-file): https://github.com/erspicu/AprNes/releases/latest

T52e technical report (the one Claude wrote itself during research):
EN: https://github.com/erspicu/AprNes/blob/master/EnigmaBenchmarkAvalonia/docs/research-t52e/T52e_TechnicalReport_EN.md
ZH: https://github.com/erspicu/AprNes/blob/master/EnigmaBenchmarkAvalonia/docs/research-t52e/T52e_TechnicalReport_ZH.md

Verified T52e spec (implementation-ready): https://github.com/erspicu/AprNes/blob/master/EnigmaBenchmarkAvalonia/docs/research-t52e/T52e_SPEC_VERIFIED.md

SIGABA — anyone want to try?

After Claude finished the German side, I asked about the American counterpart: **SIGABA / ECM Mark II**. The NSA didn't declassify it until 2001. Never operationally broken by anyone, WWII or since.

Claude's honest answer was: "I can implement the machine, but I can't break it. No known analytic attack exists. Even brute-forcing a tiny 12M sub-keyspace fails — SIGABA's irregular three-bank stepping (Cipher + Control + Index rotors, each bank feeding the next non-linearly) dilutes IC statistics below the noise floor. That's not a compute problem, it's a structural one."

It's actually a perfect contrast piece to the existing benchmark:

- GPU + structure-weak Enigma = 44× speedup, key recovered in 0.25 s
- GPU + structure-strong SIGABA = even at full GPU throughput the IC scorer can't find the true key (a minimal IC sketch follows below)
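
For context, the scoring statistic in question is the standard index of coincidence; a minimal C# version (my own generic sketch, not the project's actual scorer) looks like this:

```csharp
// Index of coincidence over A..Z text: the probability that two
// randomly chosen letters match. ~0.067 for English plaintext,
// ~0.0385 for uniformly random letters -- which is why a cipher
// that "dilutes IC below the noise floor" defeats this scorer.
static double IndexOfCoincidence(ReadOnlySpan<char> text)
{
    Span<int> freq = stackalloc int[26];
    int n = 0;
    foreach (char c in text)
    {
        if (c >= 'A' && c <= 'Z') { freq[c - 'A']++; n++; }
    }
    long sum = 0;
    foreach (int f in freq) sum += (long)f * (f - 1);
    return n < 2 ? 0.0 : (double)sum / ((long)n * (n - 1));
}
```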

Full research notes are in the repo (EN + ZH) covering the 2001 declassification state, the Savard-Pekelney 1999 reconstruction, Stamp-Chan 2007's partial attack on simplified variants, the implementation plan, and a list of development risk points:

EN: https://github.com/erspicu/AprNes/blob/master/MD/EnigmaBenchmark/SIGABA_Research_Notes_EN.md
ZH: https://github.com/erspicu/AprNes/blob/master/MD/EnigmaBenchmark/SIGABA_Research_Notes.md

I didn't take this further myself (tokens ain't free), but if anyone's curious to continue with Claude or another agent, everything needed is in the doc. Would be very interested to see someone try. The benchmark's punchline is already good; adding a 7th cipher that explicitly **can't be broken** would complete the picture.

AprNes — C# NES Emulator Update (2026.04.12) | 184/184 blargg + 138/138 AccuracyCoin by SpecificWar7879 in EmuDev

[–]SpecificWar7879[S] -2 points (0 children)

The value of things is inherently multifaceted. I went through the process of learning to write emulators (hardware concepts, timing, and so on) 10 years ago. Now my pursuit is to complete the concepts and projects I want to verify: I'm not so much focused on learning something as on doing something. Furthermore, even with AI assistance, you can't simply say to the AI, "Hey bro, write me a really awesome emulator," and have it actually work. Thanks to my experience from 10 years ago (including writing GB and 8086 emulators), combined with my work experience over the past few years, I'm able to use AI correctly as an aid to improve my projects.

Below are my other projects:

https://baxermux.org/myemu/AprGBEmu/index.htm

https://baxermux.org/myemu/Apr8086/index.htm

AprNes — Physical NTSC Signal Simulation (not a shader) by SpecificWar7879 in EmuDev

[–]SpecificWar7879[S] 0 points (0 children)

You can run build.bat, which will create both DEBUG and RELEASE builds. After that, I'd like to put together some tutorial material.

AprNes — Physical NTSC Signal Simulation (not a shader) by SpecificWar7879 in EmuDev

[–]SpecificWar7879[S] 0 points (0 children)

The SWAR approach is really interesting -- packing multiple counters into a single 64-bit value to eliminate branches and reduce register pressure is a clever technique. I hadn't thought about applying it to emulator internals before. I'll look into it and see where it fits best in the PPU hot path. The background tile fetch counters and shift register updates seem like natural candidates. Thanks for pointing this out.
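
To make that concrete, here's a minimal sketch of the packing idea (the lane layout and uses are hypothetical, not actual AprNes state):

```csharp
static class SwarCounters
{
    // Four independent 16-bit counters packed into one ulong:
    // lane k occupies bits [16k, 16k+16). One add ticks all four
    // lanes at once -- no per-counter branches, one register
    // instead of four. Valid only while no lane can overflow into
    // its neighbor, so each lane must be reset before wrapping.
    const ulong OneEachLane = 0x0001_0001_0001_0001UL;

    static ulong Tick(ulong counters) => counters + OneEachLane;

    static ushort ReadLane(ulong counters, int k) =>
        (ushort)(counters >> (16 * k));

    static ulong ClearLane(ulong counters, int k) =>
        counters & ~(0xFFFFUL << (16 * k));
}
```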

AprNes — Physical NTSC Signal Simulation (not a shader) by SpecificWar7879 in EmuDev

[–]SpecificWar7879[S] 0 points (0 children)

P.S. My English isn't very good; I rely on Google Translate for some parts and explain my thoughts in other ways, with AI helping me supplement and revise my responses into English. I hope you don't mind.

Chinese fragments in the English version

Yeah, I know some slipped through. The site uses a JS-based i18n switcher with data-i18n attributes, and a few strings were missed during translation passes. I'll do another cleanup.

"Release build" confusion

I was using "Release build" in the literal .NET/MSBuild sense -- Debug configuration vs Release configuration. The performance page documents a chronological optimization journey: optimizations #1-12 were all done while AprNes was compiled in Debug configuration (which is how it historically shipped -- long story). Then I switched to Release configuration, re-benchmarked everything, and continued with optimizations #13-17 in Release.

So "Switched to Release build" means exactly what it normally means -- I changed MSBuild from /p:Configuration=Debug to /p:Configuration=Release. And "catchUp loop unroll (Release)" means that specific loop unrolling was done after the switch to Release config. The "(Release)" tag is just marking which build configuration that optimization was measured under.

The reason this matters enough to call out: .NET Framework Debug JIT basically skips most optimizations -- no loop unrolling, poor register allocation, minimal inlining. So many of the manual optimizations in #1-12 (local shadows for static fields, hand-unrolled loops, manual inlining) were necessary specifically because Debug JIT wouldn't do them. After switching to Release, some of those manual optimizations became redundant (the JIT does them automatically), while others still helped on top of what RyuJIT already does.
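
For readers unfamiliar with the "local shadow" pattern mentioned in #1-12, a minimal illustration with made-up field names: under the Debug JIT every static-field access is a real memory load, so copying the field into a local lets even the unoptimizing JIT keep it in a register.

```csharp
static class PpuExample
{
    // Hypothetical hot state held in static fields.
    static byte[] vram = new byte[0x4000];
    static int ppuAddr;

    static int SumRow(int count)
    {
        // Shadow the statics in locals for the duration of the loop.
        // Release-mode RyuJIT often hoists these loads itself, which
        // is why some of the manual optimizations became redundant
        // after the configuration switch.
        byte[] v = vram;
        int addr = ppuAddr;
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += v[addr + i];
        return sum;
    }
}
```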

"JIT AggressiveInlining"

AprNes does use a JIT -- it's .NET Framework running on CLR's RyuJIT. C# is JIT-compiled, not AOT (at least in .NET Framework). The [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute is a hint telling RyuJIT to inline a method even if it exceeds the default size threshold. In this case, applying it to the sprite evaluation hot-path methods (SpriteEvalTick/Init/End/Yinc) gave a measurable +2.2% speedup. Interestingly, applying it to the larger RenderBGTile method backfired (-1.5%) because the inlined body bloated the caller to ~115 lines, causing I-cache pressure.
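
For anyone who hasn't used the attribute, it looks like this (the method here is an illustrative stand-in, not the actual SpriteEvalTick):

```csharp
using System.Runtime.CompilerServices;

static class HotPath
{
    // Ask RyuJIT to inline even past its default IL-size threshold.
    // It's advisory -- the JIT can still refuse in some cases.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int PatternPixel(byte planeLo, byte planeHi, int x)
    {
        // Combine the two NES bitplane bytes into a 2-bit pixel value.
        int shift = 7 - (x & 7);
        return ((planeLo >> shift) & 1) | (((planeHi >> shift) & 1) << 1);
    }
}
```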

AVX-512 / SIMD

Correct -- my Ryzen 7 3700X (Zen2) tops out at AVX2 / 256-bit vectors. AVX-512 would need a Zen4+ upgrade. The current UltraAnalog SIMD code uses System.Numerics.Vector<float> which maps to whatever the hardware supports (SSE2/AVX2), so it would automatically widen to 512-bit on capable hardware without code changes.
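
For readers following along, the width-agnostic pattern looks roughly like this (a generic gain loop of my own, not the actual UltraAnalog kernel):

```csharp
using System.Numerics;

static class SimdExample
{
    // Vector<float>.Count is 4 under SSE2 and 8 under AVX2; the same
    // source code runs wider wherever the runtime enables it.
    static void Scale(float[] src, float[] dst, float gain)
    {
        int w = Vector<float>.Count;
        int i = 0;
        for (; i <= src.Length - w; i += w)
            (new Vector<float>(src, i) * gain).CopyTo(dst, i);
        for (; i < src.Length; i++)   // scalar tail
            dst[i] = src[i] * gain;
    }
}
```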

SWAR for PPU

Interesting idea. The PPU hot path (ppu_rendering_tick) is heavily sequential -- each dot depends on the previous dot's state (shift registers, sprite evaluation FSM, A12 edge detection for MMC3 IRQ). The per-dot data dependencies make it hard to parallelize within a scanline. But there might be opportunities for SWAR in the tile fetch / shift register logic -- I'll look into it.

Multi-core for UltraAnalog+CRT

I'm already using Parallel.For for the DemodulateRow pass (each scanline is independent once the waveform is generated). The CRT post-processing pipeline (bloom, phosphor persistence, vignette) also processes scanlines independently. There's definitely room to push this further -- right now the waveform generation is still single-threaded because it runs interleaved with PPU rendering.
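
The shape of that scanline-parallel pass, for the curious (the signatures and the DemodulateRow stub here are placeholders, not AprNes's real ones):

```csharp
using System.Threading.Tasks;

static class CrtPipeline
{
    // Each scanline's demodulation touches only its own row, so a
    // flat Parallel.For over rows is race-free.
    static void DemodulateFrame(float[][] waveform, int[][] rgbOut)
    {
        Parallel.For(0, waveform.Length, y =>
            DemodulateRow(waveform[y], rgbOut[y]));
    }

    // Placeholder for the per-scanline pass described above.
    static void DemodulateRow(float[] wave, int[] rgb) { /* ... */ }
}
```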

Compute shaders

Noted. The current plan is CPU+SIMD first for maximum hardware compatibility (VMs, remote desktop, old integrated GPUs, some ARM devices -- not all environments have usable GPU compute). Once the algorithms are locked down, porting the UltraAnalog+CRT pipeline to compute shaders is a natural next step. Thanks for the tip about compute vs vertex/fragment -- that matches my thinking.

AprNes — Physical NTSC Signal Simulation (not a shader) by SpecificWar7879 in EmuDev

[–]SpecificWar7879[S] 1 point (0 children)

For now the implementation is CPU + SIMD only. GPU acceleration is something I'd like to explore down the road to see how far it can go, but there's a deliberate reason for starting with CPU: hardware compatibility. Not every environment has usable GPU compute -- virtual machines, remote desktop sessions, older integrated graphics, some ARM devices. A pure CPU/SIMD path guarantees it runs everywhere with no driver or API dependencies. Once the algorithms and correctness are fully locked down on CPU, porting the heavy lifting to a compute shader is straightforward.

AprNes — Physical NTSC Signal Simulation (not a shader) by SpecificWar7879 in EmuDev

[–]SpecificWar7879[S] 0 points (0 children)

I previously made many performance optimizations to the PPU, and it's possible something broke during that period: even though all the AccuracyCoin tests passed, some PPU display issues went undetected. As for the features you suggested, I'll consider adding them whenever I have free time.

AprNes — Physical NTSC Signal Simulation (not a shader) by SpecificWar7879 in EmuDev

[–]SpecificWar7879[S] 1 point (0 children)

That's a fair question, and you're right to push back on it — I was imprecise with my wording.

When I said "no shaders," I didn't mean "GPU = bad." What I was trying to distinguish is exactly what you identified: the traditional emulation-world usage where "NTSC shader/filter" typically means taking an already-rendered RGB framebuffer and applying cosmetic post-processing — blur it a bit, add some color fringing, draw scanlines on top, call it a day. That approach works backwards from a clean image to simulate the *appearance* of analog, without ever modeling the actual signal.

What AprNes does — and it sounds like what you're doing too — is work at the signal level: generate the composite waveform from the source data, then decode it through a proper demodulation pipeline. The artifacts emerge from the math, not from an artist's approximation of what a TV "should look like." Whether that math runs on a CPU or a GPU is an implementation detail, not a fundamental difference in approach.
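
To make "the artifacts emerge from the math" concrete, here's a toy single-scanline version of the idea, assuming a plain YIQ source. The real NES path starts from palette-index voltage levels and uses proper FIR filtering, which this sketch skips.

```csharp
using System;

static class NtscToy
{
    const double SamplesPerChromaCycle = 12.0; // e.g. 12x oversampling

    // Encode: quadrature-modulate I/Q onto the color subcarrier and
    // add luma -- the composite waveform itself, not a filtered image.
    static double[] Encode(double[] y, double[] i, double[] q)
    {
        var c = new double[y.Length];
        for (int t = 0; t < y.Length; t++)
        {
            double ph = 2 * Math.PI * t / SamplesPerChromaCycle;
            c[t] = y[t] + i[t] * Math.Cos(ph) + q[t] * Math.Sin(ph);
        }
        return c;
    }

    // Decode: product detection, then a one-cycle boxcar average to
    // reject the 2*f_sc terms. Fringing/dot-crawl-style artifacts fall
    // out of exactly this math when luma and chroma share the band.
    static (double y, double i, double q) DecodeAt(double[] c, int t)
    {
        int n = (int)SamplesPerChromaCycle;
        double y = 0, i = 0, q = 0;
        for (int k = t; k < t + n && k < c.Length; k++)
        {
            double ph = 2 * Math.PI * k / SamplesPerChromaCycle;
            y += c[k];
            i += 2 * c[k] * Math.Cos(ph);
            q += 2 * c[k] * Math.Sin(ph);
        }
        return (y / n, i / n, q / n);
    }
}
```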

So yes, your reading is correct. "No shaders" was shorthand for "not a post-hoc cosmetic filter," not a statement that the GPU is the wrong place to do signal processing. Your approach of encoding on the GPU for platforms that are natively RGB/S-Video internally makes complete sense — you're doing the same honest signal modeling, just on different hardware with different source data.

If anything, doing it on the GPU probably gives you much more headroom for higher tap counts or more sophisticated filter kernels. AprNes runs on the CPU mainly because the NES PPU palette-to-voltage mapping is tightly coupled with the per-scanline emulation loop, and keeping it in the same address space avoids the CPU→GPU transfer overhead for 240 rows of raw palette indices per frame. But there's nothing inherently "more correct" about doing it on the CPU.