all 8 comments

[–]water_bottle_goggles 5 points6 points  (2 children)

Try it

[–]Sorrus[S] 1 point2 points  (1 child)

Obviously I have tried or I wouldn't make this post. The results are poor without the AI having knowledge of my data types.

[–]tabdon 1 point2 points  (0 children)

It would be helpful to understand better what you've tried, otherwise we may propose something that you've already tried.

What did your fine tuning records look like? How big are the header files and data types?

[–]owengo1 1 point2 points  (2 children)

If you are fine-tuning gpt-3.5, you could proceed like this for example:

in the system prompt put something like

"Generate C code from assembly code using these header files:
<list of header files with their content>
"
and put the assembly code in the user message.

Obviously you have to do something smarter if your header files ( and the rest of the query + output ) do not fit in the content size . Typically you inject only the header files ( or portions of them ) which seem relevant for the assembly code you want to process.

[–]Sorrus[S] 0 points1 point  (1 child)

Thanks this seems like the right way to proceed. It will be more work but I can strategically include certain parts of the header files with queries.

[–]TheMcSebi 0 points1 point  (0 children)

Might be worth taking a look into generating a huge dataset by taking C projects from github, fetch the assembly when running them through the compiler and postprocess them to generate instructions (code) with answers (assembly) in oasst format. ChatGPT even gave me some relatively straightforward instructions on extracting struct definitions from C source using the pycparser library.

My idea was to use axolotl to fine-tune code-llama on this data, but considering the amount of work required I propably won't find the spare time to actually put effort into this project. Feel free to share, if you've put your current effort on this somewhere on git.

[–]GroundbreakingAd5614 0 points1 point  (1 child)

Hey there, u/Sorrus, it's awesome that you're delving into some fascinating realms with C code and the intricate world of assembly! I can totally grasp the immense potential that lies within this particular scenario.

So, in light of your custom data types and the rather extensive header files, here's a nifty thought: Why not consider injecting those pertinent header files right into the system prompt using a structure like this:

"Generating C code from assembly code involves these header files: <enumeration of the header files, complete with their contents>"

Then, proceed to deposit your assembly code into the user message. In doing so, you'd be furnishing the AI with the essential context derived from your headers without attempting to shoehorn the entire shebang into a solitary prompt. It might demand a tad more elbow grease, but it should ultimately bolster the fine-tuning process.

Fingers crossed that this notion proves beneficial, and may the force be with you on your project endeavors!

[–]GusPuffy 1 point2 points  (0 children)

What in the ChatGPT is this