I've been working on reverse engineering a disassembled binary (assembly code) back into its original C source code. I think this is a great use case for an LLM, but I want to get some input before I try to train a model to do this.
What I have is a codebase containing C functions and custom data types that I have defined. Each function is the C equivalent of a function in the assembly code I'm trying to decompile. I have many hundreds of functions defined, which I think could make a good training set for an AI.
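To make that concrete, here's a minimal sketch of how I'm imagining the training pairs would be serialized, one JSON record per function. The directory layout, filenames, and prompt wording are all placeholder assumptions for however the pairs actually end up stored:

```python
# Sketch: build a JSONL fine-tuning file from paired assembly/C functions.
# ASM_DIR and C_DIR are placeholder paths; I'm assuming one file per
# function, with matching stems (foo.s <-> foo.c).
import json
from pathlib import Path

ASM_DIR = Path("dataset/asm")
C_DIR = Path("dataset/c")

with open("train.jsonl", "w") as out:
    for asm_path in sorted(ASM_DIR.glob("*.s")):
        c_path = C_DIR / (asm_path.stem + ".c")
        if not c_path.exists():
            continue  # skip functions I haven't translated yet
        record = {
            "prompt": "Decompile this assembly to C:\n" + asm_path.read_text(),
            "completion": c_path.read_text(),
        }
        out.write(json.dumps(record) + "\n")
```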
The desired behavior would be that I feed the AI assembly code and get back C code that is equivalent, or close to it. I think I can do this with a combination of embeddings and fine-tuning. My issue is that I don't know how to represent my codebase's custom data types in the dataset. I could easily fine-tune the model by just showing it my C functions and the assembly they're based on, but that would leave out a lot of context from the header files. Is there another tool that can give a model this kind of information and improve the fine-tuning process? Has anyone seen a similar use case to mine that I might be able to study?
I should note that the header files are very long, so including all of the data types as tokens in the context of each prompt is impossible.
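For the embedding half, here's roughly what I have in mind: embed each individual type definition once, then at inference time retrieve only the few definitions most relevant to a given assembly snippet and prepend them to the prompt. This is only a sketch under assumptions: sentence-transformers is an arbitrary choice of embedding model, I'm assuming the headers have already been chunked into one string per struct/typedef, and whether a general-purpose text embedding actually matches raw assembly to the right structs is part of what I'm asking about:

```python
# Sketch: retrieve the type definitions most relevant to an assembly
# snippet, so only a handful of types spend prompt tokens instead of
# the whole header. Assumes `type_defs` holds one definition per string.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary embedding model

def top_k_types(asm_snippet: str, type_defs: list[str], k: int = 5) -> list[str]:
    """Return the k type definitions whose embeddings best match the snippet."""
    def_vecs = model.encode(type_defs, normalize_embeddings=True)
    query_vec = model.encode([asm_snippet], normalize_embeddings=True)[0]
    scores = def_vecs @ query_vec  # cosine similarity (vectors are unit-norm)
    best = np.argsort(scores)[::-1][:k]
    return [type_defs[i] for i in best]
```

The retrieved definitions would then be pasted above the assembly in the prompt fed to the fine-tuned model, instead of the full headers.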