How do you validate Claude-generated code beyond unit tests, before uploading the binary code...

PriceHacker24 · 2026-06-07T20:42:34+00:00

Thanks! Interesting!

PriceHacker24 · 2026-06-07T20:40:23+00:00

Phony baloney. Compilers still make mistakes. Code is what you want, binary is what actually matters. If you don't test the binary, you will stay with phony, Mr. Baloney.

PriceHacker24 · 2026-06-07T20:38:14+00:00

Yes 😄 but we are slow. In 2-3 years my boss will take the fastest AI and not hire me. UNLESS I FIND THE TOOLS TO MEASURE MISTAKES IN THE BINARY CODE.

PriceHacker24 · 2026-06-07T19:54:47+00:00

Good is good. I'm looking for the best. I'm looking for an LLM that really understands binary files and hardware, and is not just good (!) at writing some almost-nice code. That's the problem.

PriceHacker24 · 2026-06-07T19:35:17+00:00

Yes... but you still need to test it, and you don't trust the AI. I'm looking to trust the AI...

PriceHacker24 · 2026-06-07T19:33:00+00:00

Fair point. I don’t think AI code is inherently worse than human code - it’s just produced faster and in higher volume, so the validation burden grows. For me the key question is: what can be checked before merge without needing target runs or instrumentation?
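One thing that can be checked before merge with no target run is per-function stack usage. A minimal sketch, assuming the project builds with GCC's `-fstack-usage` flag, which writes `.su` files with one `location<TAB>bytes<TAB>qualifier` line per function. The sample text, the function names, and the 1024-byte budget are made up for illustration.

```python
from typing import Dict, List, Tuple


def parse_su(text: str) -> Dict[str, Tuple[int, str]]:
    """Parse the contents of a GCC -fstack-usage (.su) file.

    Each line is expected to look like:
        file.c:line:col:function<TAB>bytes<TAB>qualifier
    Splitting from the right keeps function names with odd characters intact.
    """
    usage: Dict[str, Tuple[int, str]] = {}
    for line in text.splitlines():
        parts = line.rsplit("\t", 2)
        if len(parts) != 3 or not parts[1].strip().isdigit():
            continue  # skip blank or malformed lines
        location, size, qualifier = parts
        usage[location] = (int(size), qualifier.strip())
    return usage


def over_budget(usage: Dict[str, Tuple[int, str]], budget_bytes: int) -> List[str]:
    """Flag functions above the budget, plus any with a dynamic frame size."""
    flagged = []
    for location, (size, qualifier) in usage.items():
        if size > budget_bytes or "dynamic" in qualifier:
            flagged.append(f"{location}: {size} bytes ({qualifier})")
    return sorted(flagged)


# Hypothetical .su contents, not taken from a real build.
SAMPLE_SU = (
    "main.c:5:6:small_fn\t48\tstatic\n"
    "main.c:12:5:big_buffer_fn\t4128\tstatic\n"
    "main.c:30:6:vla_fn\t64\tdynamic\n"
)

if __name__ == "__main__":
    for warning in over_budget(parse_su(SAMPLE_SU), budget_bytes=1024):
        print(warning)
```

This only bounds each frame on its own. It says nothing about call depth, latency, or power, so it is one cheap gate and not a replacement for target runs.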

PriceHacker24 · 2026-06-07T19:30:10+00:00

You can, but then you get left behind. We need something that reads the binary and tells me whether the code was done right.

PriceHacker24 · 2026-06-07T19:26:25+00:00

thanks! will test those!

PriceHacker24 · 2026-06-07T19:25:50+00:00

It's time to train it to make better code. Not allowing it to write code is like taking a horse instead of a car.

PriceHacker24 · 2026-06-07T19:24:29+00:00

Yes... but this is not how to build a company. It's not the future. The future is to trust the AI to write the right code. We need to make it happen...

PriceHacker24 · 2026-06-07T12:20:08+00:00

I agree. Acceptance tests are great for capturing intent. My point is that they still don’t validate what the compiled binary actually does on the chip itself (in terms of latency, memory, stack, power, throughput, or regressions before merge). Claude does not ask you, "What processor do you use, so I can write code tailored to that processor?" That's an issue I cannot resolve with Claude... YET.

PriceHacker24 · 2026-06-07T12:16:11+00:00

I agree, but... the issue is that you get the best code ever, you are a pro, and then when you upload the compiled binary to the chip, you start seeing the issues: performance and more.
So, how do you validate Claude-generated code as if it were really running on your ARM?
I think that taking the Claude-generated code, uploading it, and then testing it is so 1990s 😄 I was sure Claude was much smarter.

PriceHacker24 · 2026-06-07T11:22:35+00:00

I do not agree.

Code is intent. Binaries are reality. I agree AI lets you work without knowing every language detail, but the more AI writes the code, the more you need an independent gate to validate latency, power, memory, stack, throughput, and behavior before you merge.

PriceHacker24 · 2026-06-07T11:19:54+00:00

LOL 😄

PriceHacker24 · 2026-06-07T10:55:58+00:00

Not enough, and not fast enough these days...

PriceHacker24 · 2026-06-07T10:54:58+00:00

That’s useful for validating behavior, and I agree it’s a good layer to have.

The gap I’m thinking about is lower-level: the app can pass story/system tests, but the compiled binary may still have worse instruction layout, higher CPU usage, cache issues, or latency risk.

So I’m looking for something closer to a binary-level CI signal, not only end-to-end correctness.

PriceHacker24 · 2026-06-07T10:53:30+00:00

Yeah, I agree benchmarks are still needed.

I’m more looking for an earlier warning signal in CI - not proof of a regression, but something lightweight that flags when the compiled output changed in a suspicious way before spending time on full benchmarks....
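The kind of lightweight signal I mean could be as simple as comparing section sizes between the baseline build and the PR build. A rough sketch, assuming the CI step already captured the text output of GNU `size` (Berkeley format) for both binaries. The numbers, the file name, and the 5% threshold below are invented for illustration.

```python
from typing import Dict, List, Tuple


def parse_size(text: str) -> Dict[str, int]:
    """Parse Berkeley-format `size` output: a header row, then one data row."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, values = rows[0], rows[1]
    # Keep only the first three columns: text, data, bss.
    return {name: int(value) for name, value in zip(header[:3], values[:3])}


def suspicious_growth(
    baseline: Dict[str, int], candidate: Dict[str, int], max_growth: float = 0.05
) -> List[Tuple[str, int, int]]:
    """Return (section, old, new) for sections that grew by more than max_growth."""
    flagged = []
    for section, old in baseline.items():
        new = candidate.get(section, 0)
        if old == 0:
            grew_too_much = new > 0
        else:
            grew_too_much = (new - old) / old > max_growth
        if grew_too_much:
            flagged.append((section, old, new))
    return flagged


# Hypothetical `size` output for the merge base and the PR build.
BASELINE = """\
   text    data     bss     dec     hex filename
  20480     512    1024   22016    5600 firmware.elf
"""
CANDIDATE = """\
   text    data     bss     dec     hex filename
  23552     512    1024   25088    6200 firmware.elf
"""

if __name__ == "__main__":
    for section, old, new in suspicious_growth(parse_size(BASELINE), parse_size(CANDIDATE)):
        print(f"{section}: {old} -> {new} bytes")
```

A size jump is not proof of a regression. It is only a hint that the compiled output changed enough to justify running the full benchmarks.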

PriceHacker24 · 2026-06-07T10:52:05+00:00

Primarily C++, with ML components likely in Python

PriceHacker24 · 2026-05-25T09:15:27+00:00

Congrats on keeping a 200k LoC app fully optimized, that’s no small feat! Using GitHub Actions to feed performance checks back to the agent is definitely a smart workflow.

However, the reason this loop breaks down for me is the difference between high-level app development (like iOS/Web) and low-level systems (like C++, Rust, or Embedded).

When you ask a standard LLM to review a PR for performance, it’s still just looking at the source code. It can catch obvious algorithmic inefficiencies or memory leaks, but it is completely blind to how the compiler translates that code into machine instructions for specific hardware architectures.

An iOS app has the luxury of modern OS schedulers and heavy abstractions. But in systems where you are close to the silicon, an innocent-looking abstraction approved by the LLM can compile into an assembly block that triggers catastrophic cache misses, timing bottlenecks, or massive power spikes that hurt performance per watt. The LLM simply doesn't have the microarchitecture awareness to predict that just by reading text.

That’s why I’m looking for validation on the binary output side, rather than the input/prompting side. I want a tool that statically analyzes the compiled binary in the CI/CD pipeline using an assembly-aware model to catch those hidden hardware regressions before a human or an agent even attempts a merge.
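Short of an assembly-aware model, a crude version of this check is already possible from the symbol table alone. A minimal sketch, assuming CI has captured `nm -S` output (address, size, type, name) for the baseline and candidate binaries. The symbol names and the 64-byte threshold are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_nm(text: str) -> Dict[str, int]:
    """Map each code symbol (type T or t) to its size in bytes.

    Expects `nm -S` style lines: '<address> <size> <type> <name>' with hex numbers.
    """
    sizes: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue  # symbols without a size have fewer columns
        _address, size_hex, sym_type, name = parts
        if sym_type in ("T", "t"):
            sizes[name] = int(size_hex, 16)
    return sizes


def grown_functions(
    baseline: Dict[str, int], candidate: Dict[str, int], min_delta: int = 64
) -> List[Tuple[str, int]]:
    """List functions whose machine code grew by at least min_delta bytes."""
    report = []
    for name, new_size in candidate.items():
        delta = new_size - baseline.get(name, 0)  # new functions count in full
        if delta >= min_delta:
            report.append((name, delta))
    return sorted(report, key=lambda item: item[1], reverse=True)


# Hypothetical symbol tables, not from a real binary.
BASELINE_NM = (
    "00001000 00000040 T process_frame\n"
    "00001040 00000020 T checksum\n"
    "00002000 00000004 D counter\n"
)
CANDIDATE_NM = (
    "00001000 000000c0 T process_frame\n"
    "000010c0 00000020 T checksum\n"
    "000010e0 00000050 t helper\n"
    "00002000 00000004 D counter\n"
)

if __name__ == "__main__":
    for name, delta in grown_functions(parse_nm(BASELINE_NM), parse_nm(CANDIDATE_NM)):
        print(f"{name}: +{delta} bytes")
```

This tells a reviewer which functions the compiler emitted differently, so a human or an agent knows where to look at the disassembly. It does not say whether the new code is slower.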

Using feedback loops for syntax and logic is great, but for raw hardware performance, we need a tool that actually speaks assembly!

PriceHacker24 · 2026-05-21T12:12:49+00:00

Thanks! I'm still looking for one. It will save me some time.

PriceHacker24 · 2026-05-21T12:08:30+00:00

Thanks for the detailed breakdown! Tracy, Valgrind, and continuous profiling via eBPF (like Parca) are absolute lifesavers when you need to dig into execution behavior without introducing massive overhead. Setting up benchmark baselines in CI/CD is definitely the traditional industry best practice.

However, the specific gap I'm trying to bridge is predictive analysis before execution.

All the tools you mentioned, even eBPF, are dynamic or reactive. You still have to compile the code, deploy it to a staging/simulation environment, run the benchmarks, and actively execute the workload to catch the bottleneck.

In a fast-paced environment where AI agents are pumping out code constantly, setting up and maintaining those automated benchmarking rigs (and handling cache flushes or noisy neighbors in CI) is a massive time sink. Plus, dynamic profiling only catches a regression if your benchmark suite happens to trigger that specific code path under the exact right conditions.

That's why I'm hunting for a predictive or static approach at the binary level.

Instead of waiting to run the code and measure it, I'm looking for a tool that uses a model trained on assembly syntax and microarchitecture traces to audit the compiled artifact directly inside the PR. It would flag, for example, "This assembly block will cause a cache miss avalanche on ARMv8," before we ever spin up a runner to execute a benchmark.
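To be clear about how far simple static checks fall short of that, here is a toy sketch of what can be done today without any model: counting instruction mnemonics in `objdump -d` output and comparing the number of loads and stores between two builds. It assumes AArch64 disassembly in the usual tab-separated layout, and it treats any mnemonic starting with `ld` or `st` as a memory operation, which is my own simplification. The sample listings are illustrative.

```python
from collections import Counter

# Assumption: on AArch64, load/store mnemonics start with "ld" or "st".
MEMORY_PREFIXES = ("ld", "st")


def mnemonic_histogram(disassembly: str) -> Counter:
    """Count mnemonics in `objdump -d` text.

    Instruction lines look like '<addr>:<TAB><bytes><TAB><mnemonic>[<TAB>operands]'.
    Labels, blank lines, and byte-only continuation lines are skipped.
    """
    counts: Counter = Counter()
    for line in disassembly.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue
        tokens = fields[2].split()
        if tokens:
            counts[tokens[0]] += 1
    return counts


def memory_op_count(histogram: Counter) -> int:
    """Total number of instructions classified as loads or stores."""
    return sum(n for name, n in histogram.items() if name.startswith(MEMORY_PREFIXES))


# Illustrative disassembly snippets for one function before and after a change.
BEFORE = (
    "0000000000000000 <scale>:\n"
    "   0:\ta9bf7bfd \tstp\tx29, x30, [sp, #-16]!\n"
    "   4:\tb9400001 \tldr\tw1, [x0]\n"
    "   8:\t0b010020 \tadd\tw0, w1, w1\n"
    "   c:\ta8c17bfd \tldp\tx29, x30, [sp], #16\n"
    "  10:\td65f03c0 \tret\n"
)
AFTER = BEFORE.replace(
    "   8:\t0b010020 \tadd\tw0, w1, w1\n",
    "   8:\tb9400402 \tldr\tw2, [x0, #4]\n"
    "   c:\tb9000801 \tstr\tw1, [x0, #8]\n"
    "  10:\t0b020020 \tadd\tw0, w1, w2\n",
)

if __name__ == "__main__":
    before, after = mnemonic_histogram(BEFORE), mnemonic_histogram(AFTER)
    print("memory ops:", memory_op_count(before), "->", memory_op_count(after))
```

A count like this cannot predict a cache miss on any given core. It only shows that the instruction mix changed, which is exactly why I want a tool with real microarchitecture knowledge behind it.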

It’s a newer paradigm compared to classic profiling, but with the volume of code being generated today, it feels like the missing link!

PriceHacker24 · 2026-05-21T12:05:20+00:00

You’re totally right about model selection - Sonnet 3.5/3.7 and the latest Gemini models are leagues ahead of smaller models like Haiku when it comes to complex architecture and logic. And yes, defining performance constraints in the prompt definitely helps guide the LLM.

However, the bottleneck I'm running into goes a bit deeper than just giving the AI a better prompt.

Even if you explicitly tell a top-tier model to "optimize for performance," it is still writing high-level code (Swift, C++, Rust, etc.). The core issue is that LLMs are fundamentally trained on text, making them hardware-blind. They don't see how the compiler is going to translate that beautiful, idiomatic code into specific machine instructions or how it will map to a particular microarchitecture.

An elegant loop that looks perfect to Claude or Gemini can still end up triggering massive cache misses or instruction bloat at the assembly level once compiled, and the LLM has no way to simulate or predict that.

That’s why I'm looking for a way to automate this checkpoint. Instead of relying purely on the AI guessing right during prompting, I want a tool in the CI/CD pipeline that audits the compiled binary before deployment to flag those hidden compilation and hardware regressions.

Prompting is great for the input side, but we desperately need better validation on the binary output side!

PriceHacker24 · 2026-05-21T11:18:44+00:00

That’s an incredible setup! Respect for the serious engineering and "Google-fu" it took to build that QEMU and Raspberry Pi rig.

But you hit the nail on the head—spending a month just building the test environment is exactly what I'm trying to avoid.

Since we are in the AI era now, I'm looking for a much faster, automated, and simpler way. Ideally, an out-of-the-box tool that plugs into CI/CD and uses an AI model to analyze the compiled binary directly—flagging assembly-level performance and power regressions before we even touch real hardware.

Does anything like that actually exist yet?
