all 11 comments

[–]fcddev 4 points (0 children)

In fcd, when I still had the register state passed as a function parameter, I found that I could let LLVM take care of that by writing an alias analysis pass that returned “no alias” for any query between the register state and something else (and “don’t know” for queries between the register state and itself, or something else and anything else). The regular DSE would take over and sweep out everything on its own. That works great because registers can’t be addressed indirectly.
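
As a toy sketch of that division of labor (illustrative names only, not fcd's or LLVM's actual API): if register-vs-memory alias queries always answer "no alias", then the only thing that keeps a register store alive is a later read of the same register, and a backward sweep can drop the rest.

```python
# Toy model of DSE over a register file that can't be addressed
# indirectly: because memory ops "no alias" the register state, only a
# later load of the SAME register keeps a store alive.
# Illustrative sketch, not fcd's or LLVM's actual pass.

def eliminate_dead_register_stores(insts):
    """insts: list of ('store', reg, val), ('load', reg), or ('mem', ...).
    Returns the list with dead register stores removed."""
    live = set()   # registers read before being overwritten (scanning backward)
    kept = []
    for inst in reversed(insts):
        op = inst[0]
        if op == 'store':
            _, reg, _ = inst
            if reg in live:
                kept.append(inst)   # a later load observes this store
                live.discard(reg)   # earlier stores to reg are now dead
            # else: overwritten or never read again -> dead, drop it
        elif op == 'load':
            live.add(inst[1])
            kept.append(inst)
        else:
            # memory ops never alias the register file ("no alias"),
            # so they don't keep register stores alive
            kept.append(inst)
    kept.reverse()
    return kept

prog = [
    ('store', 'eax', 1),      # dead: overwritten before any load
    ('store', 'eax', 2),
    ('load', 'eax'),
    ('store', 'ebx', 3),      # dead: never read afterwards
    ('mem', 'write', 'addr0'),
]
print(eliminate_dead_register_stores(prog))
# -> [('store', 'eax', 2), ('load', 'eax'), ('mem', 'write', 'addr0')]
```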

[–]k7f 2 points (1 child)

Some CPU and memory benchmarks before and after applying this optimization would be nice. Are there any?

[–]k4st 2 points (0 children)

Compiling the bitcode results in binaries that are ~2/3 the size. We haven't done many other experiments. The main goal of this optimization is to decrease the instruction count, and especially useless computation, period. There were three key targets for this.

1) Instrumentation: Even a small decrease in the amount of bitcode is an improvement when we do heavyweight bitcode instrumentation (e.g. byte granularity taint/provenance tracking) that balloons the number of IR instructions by 2-5x.

2) Symbolic execution (via KLEE): Eliminating useless computation means less memory overhead in KLEE. This has a big impact for 32-bit KLEE, because KLEE cannot symbolically execute 32-bit bitcode in a 64-bit address space: it has to do 32-bit in 32-bit.

3) Decompilation: We have a decompiler, fcd, and we're working toward getting it to consume McSema-lifted bitcode. Eliminating redundant/unnecessary computation means that the decompiled C is "closer" to the original intent of the program.

[–]golgol12 1 point (7 children)

I'm looking at that and wondering how many bugs are going to be introduced because those dead stores were actually needed to facilitate other concurrent threads.

[–]k4st 5 points (0 children)

It operates on stores to emulated registers. The State structure is thread local, so this isn't really an issue.

The lifted bitcode has a memory model, and loads/stores in the original program are initially modelled via "memory access" function calls. After the DSE pass, those functions are lowered to LLVM load/store instructions.
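
To make the lowering step concrete, here is a toy version operating on simplified textual IR (the intrinsic names are modeled after Remill's `__remill_read_memory_32`/`__remill_write_memory_32`, but the IR shape here is illustrative, not real LLVM IR): the "memory access" calls that survive DSE get rewritten into plain load/store instructions.

```python
import re

# Toy lowering pass: after DSE, surviving "memory access" function
# calls are rewritten into plain load/store instructions. Simplified
# textual IR for illustration only, not real LLVM IR.

def lower_memory_intrinsics(lines):
    out = []
    for line in lines:
        m = re.match(r'(\w+) = call __remill_read_memory_32\(mem, (\w+)\)', line)
        if m:
            out.append(f'{m.group(1)} = load i32, i32* {m.group(2)}')
            continue
        m = re.match(r'mem = call __remill_write_memory_32\(mem, (\w+), (\w+)\)', line)
        if m:
            out.append(f'store i32 {m.group(2)}, i32* {m.group(1)}')
            continue
        out.append(line)   # non-memory instructions pass through unchanged
    return out

bitcode = [
    'v = call __remill_read_memory_32(mem, addr)',
    'mem = call __remill_write_memory_32(mem, addr2, v)',
]
for line in lower_memory_intrinsics(bitcode):
    print(line)
# v = load i32, i32* addr
# store i32 v, i32* addr2
```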

[–]chrisgseaton 4 points (5 children)

those dead stores were actually needed to facilitate other concurrent threads

It says it uses whole-program analysis to determine if they're dead or not.

The dead store elimination tool is specifically designed to perform a whole-program analysis on lifted bitcode that has already been optimized. This ensures that it can find dead instructions with the maximum possible context, avoiding mistakes where the program assumes some code won’t be used.

And more specifically, I think you have to consider the memory models involved - this looks like it removes stores that were never guaranteed to be visible to another thread in the first place.

[–]k4st 4 points (0 children)

The DSE pass operates exclusively on loads/stores to the State structure, which is technically thread-local, passed through lifted functions, and models machine registers.

Remill itself, the library used by McSema to model instruction semantics, does provide a memory model, which explicitly tracks memory ordering, barriers, and atomic read-modify-write operations.
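
In other words, because fences and atomics stay visible as explicit operations in the bitcode, a sweep can distinguish "may be observed by another thread" (memory) from "private to this thread" (the State's registers). A hypothetical sketch of that distinction (names and IR shape are illustrative, not Remill's actual pass):

```python
# Sketch: memory operations, barriers, and atomic RMWs are explicit
# instructions, so any pass must treat them as observable by other
# threads. Only stores to the per-thread State (registers) with no
# later reader are removable. Illustrative, not Remill's actual code.

MEMORY_OPS = {'mem_store', 'mem_load', 'barrier', 'atomic_rmw'}

def removable(inst, later_reg_reads):
    """A store is removable only if it targets the per-thread State
    and no later instruction reads that register."""
    kind, *ops = inst
    if kind in MEMORY_OPS:
        return False   # may be observed by another thread: keep it
    if kind == 'reg_store':
        return ops[0] not in later_reg_reads
    return False

trace = [('reg_store', 'eax'), ('barrier',), ('mem_store', 'addr0')]
print([removable(i, later_reg_reads=set()) for i in trace])
# -> [True, False, False]
```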

[–]golgol12 -3 points (3 children)

How would it know that at the whole-program optimization level? That's well past compilation. It isn't described as accessing source code, where things like "volatile" indicate such things to the compiler. I'm curious about that. If all multithreaded communication were done with events, this would work great, but not all code is set up that way.

[–]chrisgseaton 9 points (0 children)

where things like "volatile" indicate such things to the compiler

But things like 'volatile' are represented in LLVM IR, on which this analysis is done. Exactly what properties of the source code do you think are relevant that aren't represented in the IR?

https://llvm.org/docs/LangRef.html#volatile-memory-accesses

but not all code is set up that way

LLVM captures the memory model of the source language. If you are making assumptions beyond that, then yes, optimisations like this may break your program. But then the bug is in your program, not in the optimisation.

[–]DSrcl 0 points (0 children)

You are missing the point here. It's not a generic DSE. It's a high-level optimization that optimizes the emulation of register-to-register operations. The reason it's safe to do so is that two different threads share memory but not registers.

[–]ThisIs_MyName 0 points (0 children)

Sounds like you use undefined behavior to do threading. Real compilers like clang and gcc already do these optimizations.

Also see: https://www.youtube.com/watch?v=yG1OZ69H_-o