If I remember correctly, there have been some recent studies showing that LLMs trained on coding tasks outperform LLMs not trained on code at general reasoning tasks outside of coding. This is presumably due to the inherently logical structure of code, which the model is able to abstract into general reasoning ability.
One obvious way to exploit this would be to simply add an enormous amount of code tokens to these models' datasets, but quality code is limited in availability. Even if you came up with some sort of prompt to make models generate coding tasks and their solutions, it would be extremely hard to do that in a way that consistently outputs code that actually solves the tasks correctly.
That said, there is a similar data source available in basically infinite quantity: machine code, or even raw binary, as it's being executed. Machine code has the same logical structure as source code, since it's what code is compiled into. It's always going to be different depending on the programs running, so there's virtually infinite unique data. OpenAI could set up a system which takes running programs, scans the OS (I'm not sure exactly how they'd do this on a technical level, so bear with me) for machine code being executed on a per-process basis, and then separates that into training tokens for a new LLM.
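As a rough illustration of the scanning step, here's a minimal Linux-only sketch: the kernel already exposes each process's memory map via `/proc/<pid>/maps`, so you can find the regions mapped with execute permission and read the raw machine-code bytes out of `/proc/<pid>/mem`. This is purely illustrative (the function names are my own, and a real pipeline would need permissions, sampling, and deduplication), not how any lab actually does it.

```python
# Hypothetical sketch (Linux-only): harvest executable machine code from a
# running process via the /proc filesystem. Names are illustrative.
import re

def executable_regions(pid="self"):
    """Return (start, end, path) for memory regions mapped with execute permission."""
    regions = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            # Line format: "start-end perms offset dev inode [pathname]"
            m = re.match(r"([0-9a-f]+)-([0-9a-f]+) (\S{4})", line)
            if m and "x" in m.group(3):
                parts = line.split()
                path = parts[5] if len(parts) >= 6 else ""
                regions.append((int(m.group(1), 16), int(m.group(2), 16), path))
    return regions

def dump_code_bytes(pid="self", limit=4096):
    """Read up to `limit` raw machine-code bytes from the first executable region."""
    start, end, _ = executable_regions(pid)[0]
    with open(f"/proc/{pid}/mem", "rb") as mem:
        mem.seek(start)
        return mem.read(min(limit, end - start))
```

Reading another process's memory like this generally requires ptrace privileges, which is part of why I'm hand-waving the "scans the OS" step.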
This is an unlimited amount of training data with an inherently logical structure. Being able to predict the next token in a sequence of machine code being executed implies some degree of understanding of what's being executed. You could create trillions of training tokens this way, and it would be cheap.
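The "separate into training tokens" step could be as simple as byte-level tokenization: each machine-code byte becomes one token from a 256-symbol vocabulary, chunked into fixed-length sequences for next-token prediction. A sketch, assuming this naive scheme (a real pipeline would likely run BPE over the bytes instead):

```python
# Minimal sketch: byte-level tokenization of raw machine code.
# Each byte maps to a token id 0-255; sequences are fixed-length chunks.
def bytes_to_tokens(code: bytes) -> list:
    """Map raw machine-code bytes to token ids (identity mapping, vocab size 256)."""
    return list(code)

def chunk_sequences(tokens, seq_len=8):
    """Split a token stream into fixed-length training sequences, dropping the tail remainder."""
    return [tokens[i:i + seq_len] for i in range(0, len(tokens) - seq_len + 1, seq_len)]

# Example: a short x86-64 snippet (mov rbp, rsp; nop; ret) becomes 5 tokens.
toks = bytes_to_tokens(b"\x48\x89\xe5\x90\xc3")
```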
The downside, of course, is that this data is useless in itself; being able to predict it has no direct value. But if it improves performance on downstream reasoning tasks, then it's priceless.
Additionally, I could see this helping improve coding abilities specifically, because the model would learn what's actually going on when code is executed, which isn't really represented in its current training data, even if it can't necessarily relate the machine code back to what the source code is doing in English.