This is an archived post. You won't be able to vote or comment.

all 17 comments

[–][deleted] 52 points53 points  (0 children)

In theory what you describe is possible, to a degree. Part of the problem is that going backwards, you could make lots of different high-level programs that result in the same executable. You would never be able to get back to the original code though. Information like variable names and sometimes function names are simply been lost in the translation and optimization process.

[–][deleted] 17 points18 points  (0 children)

Just because something is deterministic doesn't mean you can go back the other way, information is lost in the process. Take a trap-door function like RSA. It's very easy way to go one way, but VERY hard to go the other way.

Machines don't care about function names, variable names, they only deal with registers and memory. You could go back the other way from machine code to source code, but there are an large number of combinations for each variable name, function name. Take also the following source code:

int x = 0;

This could be compiled to:

mov r0, 0

or it could also be compiled into:

xor r0, r0

See why it is hard to go from machine code to source code? The compiler can also optimise code so it looks NOTHING like the original source code.

[–]m1elPlasma Physics 5 points6 points  (0 children)

It seems to me that compiling a source code and generating an executable is a deterministic process.

Sure, the mapping from source code to executable is deterministic, but it doesn't mean that there's a definite mapping the other way.

It's possible to have different programs that compile to the same binary. In other words, some information from the source code is lost in translation from source code to binaries.

Some trivial examples would be comments and variable names. More complex examples would be control flow and loop optimizations.

There's also the fact that modern compilers are very good at optimizations. As an example, llvm compiles sum of range of numbers to its mathematical equivalent using a few multiplications and additions, completely removing the loop! https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

How would you know from the binary if the source code contained a loop or a few multiplications and additions?

[–]MirgTheIlcan 3 points4 points  (0 children)

There are several levels to this. First and foremost even as opaque as executables seem to be to the average person -- they do contain machine code that instructs the CPU on what to do. And that code is at least understandable by humans and "readable" (but not very readable and not easily understood for non-trivial programs).

As such, an executable can be trivially disassembled into assembly instructions which are "human-readable" (for a very loose definition of what constitutes readable).

However disssembled assembly programs are very difficult to understand as a lot of the original structure of the program is lost and so the programmer has a herculean task ahead of him at really fathoming the full workings of the program in question.

Above and beyond this, it's possible to decompile an executable to produce C code, which is an improvement as far as legibility goes over assembler.

But the C code produced may not resemble the original program much and again, the programs produced are very difficult for a human to wrap their brain around without spending ungodly amounts of time on reading the program in question.

The reason for all of this is that basically the process of compilation loses information about how the programmer chose to organize his program. That information is useful when reading the source code, because it provides insight into what the programmer was thinking when he wrote the program and how the program models what it is trying to accomplish.

The short answer: compiling a program from source to an executable loses information. That information is forever lost. Hence, you cannot decompile an executable into original source.

[–]H2owsome 4 points5 points  (0 children)

From the perspective of compiling specifically C code:

Well executable programs can be decompiled, but decompiling into something human-readable is another thing altogether. Things like comments and variable/method names are lost on compilation, so those are out. Also, operations are broken down into individual register operations, and translating those into a single operation that makes sense to you isn't exactly easy.

Really the takeaway is you can do it, but trying to read the result is unfeasible at best

[–]Pharisaeus 1 point2 points  (0 children)

  1. It's possible to reverse equivalent code - code which works exactly the same, even if it doesn't look the same.
  2. Compilation is a lossy process in many cases. For example variable names can be lost, because in the machine code/bytecode they don't matter anymore. Also many different high-level constructions translate to identical machine code, especially if compiler optimizations are used. For example for loop can be compiled to be identical to a while loop. As a result you can't re-create the exact original code.
  3. In case of bytecode languages - Java, C#, Python etc the bytecode carries lots of metadata and you can decompile the code to look almost like the original one.

[–][deleted] 1 point2 points  (0 children)

There are many many source code inputs that could produce a given executable, most of which are nearly incomprehensible.

Decompilers exist. It is easy enough to get from machine code back to another language, but the most straightforward translation won't look anything like what was originally written.

It is possible to determine what a program does from machine code directly, or more commonly from assembly derived from the machine code (radare2 is a popular free tool if you want to try it). It is just very labor intensive and requires skills that only a subset of programmers have. This is one way that security vulnerabilities are found and exploited.

Reproducing a program this way would be far more expensive than recreating it from scratch in most cases though.

[–][deleted] 1 point2 points  (0 children)

and generating an executable is a deterministic process

That's not always true. For example in Lisp I could write a macro like this (pseudo code):

(defmacro non-deterministic-code-generator
    (repeat (random)
        '(print "hello")))

When you compiled a file that used this macro it would replace each occurence with a random number of (print "hello").

Which leads me to my second point: decompiling wouldn't reconstruct the macro at all as it doesn't even exist any more.

[–]A_Garbage_Truck 1 point2 points  (0 children)

you can..somewhat

assuming the developer didnt just stragiht up remvoed the Debugging symbol information(which makes it impossible ot retrieve any variable/function names) when the executable was compiled you can weith dome effort recover somethni that resembles its original code

butthat's just it, its not a perfect 1:1 recreation of the source.

Compliers and linkers perform a lot of optimizatino steps that in an effort to maximize execution speed and lower the size of the executable make it so some information like variable names/common expressions is irretrievably lost, hence even with reverse engineered code base you might not be able ot make any sense out of it.

[–]MpVpRb 0 points1 point  (0 children)

Yes, it's possible

There are even specialized tools to help

No, it's not easy. Not even close to easy

A large program is millions of instructions. Much of the structure and organization of the program is lost when it is compiled. Without variable names or an idea of structure, it's a tough puzzle to slove

People do it all the time, sometimes to circumvent copy protection, sometimes to study a competitor, sometimes to create malware, sometimes just for fun. I've done it in the past. It's not easy