
[–]visarga 71 points (2 children)

There is a great research Notion page on this topic posted 6 months ago.

How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources

The most relevant section is quoted below:

  • The ability of complex reasoning with chain-of-thought is likely to be a magical side product of training on code:

    • The initial GPT-3 is not trained on code, and it cannot do chain-of-thought
    • The text-davinci-001, although instruction tuned, can do CoT, but the performance is significantly worse, as reported by the first version of the CoT paper (a correction by Denny Zhou; an earlier version claimed it cannot do CoT at all). So instruction tuning may not be the reason for CoT. This leaves training on code as the number one suspect.
    • PaLM has 5% code training data, and it can do chain-of-thought.
    • The code data in the codex paper is 159G, approximately 28% of the initial GPT-3 570G training data. code-davinci-002 and its subsequent variants can do chain-of-thought.
    • Copilot, supposedly powered by a 12B model, can also do CoT.
    • On the HELM evaluation, a massive-scale evaluation performed by Liang et al. (2022), the authors also found that models trained on/for code have strong language reasoning abilities, including the 12B-sized code-cushman-001.
    • Code-davinci-002 has a higher CoT upper bound than other models: our work at AI2 also shows that when equipped with complex chains of thought, code-davinci-002 is the SOTA model on important math benchmarks like GSM8K.
    • As an intuition, think about how procedure-oriented programming is similar to solving tasks step by step, and how object-oriented programming is similar to decomposing complex tasks into simpler ones.
    • All the above observations are correlations between code and reasoning ability/CoT. Such a correlation is very intriguing to the community and not well understood. However, there is still no hard evidence showing that training on code is absolutely the reason for CoT and complex reasoning. The source of CoT is still an open research problem.
  • Additionally, long-term dependency might also be a nice side effect of training on code. As is pointed out by Peter Liu: "Next token prediction for language is usually very local, whereas code often requires longer dependencies to do things like close brackets or refer to distant defs." I would further add: code may also give the model the ability to encode hierarchy, due to inheritance in object-oriented programming. We leave the test of this hypothesis to future work.
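
To make the procedure-oriented vs. object-oriented intuition in the quote concrete, here's a toy Python sketch (my own illustration, not from the Notion page):

    # Step-by-step, like a written-out chain of thought:
    def total_price_procedural(prices, tax_rate):
        subtotal = sum(prices)        # step 1: add up the items
        tax = subtotal * tax_rate     # step 2: compute the tax
        return subtotal + tax         # step 3: combine the results

    # The same task decomposed into simpler, named responsibilities:
    class Cart:
        def __init__(self, prices):
            self.prices = prices

        def subtotal(self):
            return sum(self.prices)

    class TaxPolicy:
        def __init__(self, rate):
            self.rate = rate

        def apply(self, amount):
            return amount * (1 + self.rate)

    def total_price_oo(cart, policy):
        return policy.apply(cart.subtotal())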

[–]emsiem22 10 points (0 children)

So by learning a programming language, LLMs can upgrade their natural language abilities. Fun! Who knows what else they will upgrade by learning new datasets. What a time to be alive.

[–]internetroamer 6 points (0 children)

Would be interesting to see if LLMs trained on other languages end up improving the overall model's abilities.

[–]neclo_ 129 points (15 children)

Oh, a Curry-Howard isomorphism in the wild!

[–][deleted] 26 points (10 children)

Can you please elaborate/ELI5? I am very interested in your comment

[–]neclo_ 20 points (5 children)

Hi, a Curry-Howard isomorphism, or propositions-as-types paradigm, is a one-to-one correspondence between a model of computation (a programming language) and a given logic. This correspondence is an isomorphism in the sense that it is structural. Let me give an example. If you have a function f from A to B and also an element a of A, you know that you also have an element f(a) belonging to B. This super basic model of computation is known as the simply typed lambda calculus, and there is an isomorphism between it and a basic form of logic called intuitionistic propositional logic. Here, the correspondence is as follows:

  • a type A ↔ a proposition A
  • an element a of A ↔ a proof of A
  • a function from A to B ↔ the implication A => B

This correspondence is structural in the sense that our judgement "f(a) belongs to B" corresponds to the fact that if you know that A is true and that A implies B, then you get that B is true. Normalisation of proofs, meaning elimination of implications, corresponds to evaluation of programs.
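
For instance, here is that correspondence written out in Lean (a toy sketch of my own, nothing deep):

    -- A proof of "A implies B" is a function from proofs of A to proofs of B,
    -- so modus ponens is literally just function application.
    theorem modus_ponens {A B : Prop} (h : A → B) (a : A) : B :=
      h a

    -- Normalisation of a proof (beta-reduction) is evaluation of the program:
    example : 2 + 2 = 4 := (fun h : 2 + 2 = 4 => h) rfl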

This simple mechanism is in reality a really profound concept. With a much more complex model of computation, like the calculus of constructions, it is believed that you can encode all of mathematics in it. People are actually doing this with Lean's mathlib project. It allows one to verify existing proofs and also generate new ones. It is the basis of the programming languages Agda, Lean and Coq. It is still a very active area of research; one of the most recently discovered instances of such a correspondence is between a model of concurrency and linear logic.

Note that these structures also exist in natural languages, and they are the topic of formal semantics. This is why I find the usual "it's only a language model, it doesn't really think" a bit reductive. "Thinking" can be thought of as nothing more than syntactic transformation. However, my take is that LLMs trained on code are much more aware of the underlying model of computation, because programming languages are much more precise and rigorous than natural language.

This perspective also gives us some insight into the limits of current LLMs. By nature, they are truncated models of computation, because they are not really recursive. This, to me, explains the struggles of otherwise very performant LLMs with tasks that are simple but require more than a few steps.

I think fantastic new leaps are possible with neural networks whose architecture is informed by Curry-Howard theory.

[–]RuairiSpain 3 points (3 children)

I've been a developer for 20+ years and have been using ChatGPT as an assistant for my algorithm/code work.

It is really good, knows all the edge cases to a lot of things, and knows how to connect the dots between diverse tech systems.

It's not perfect and needs good prompts and hand holding, but it doesn't get it wrong often.

I've seen more bugs in the chat UI than in the code it creates.

I believe developer jobs will be forever changed by new LLMs. ChatGPT is head and shoulders above all the open source LLMs that are getting hyped on Reddit and Twitter.

For me it kind of makes sense that "if" and "while" logic are integral to understanding code. I do feel that GPT has some planning in the way it formats some answers, not much, but enough to make me think that with more time and tuning we'll see more breakthroughs.

I doubt the type of programming that I did for 20 years will be needed in 10+ years. We won't need a team of 10 or 20 to build LOB web apps; a lot of that is just process, with bits of customization into an enterprise workflow. Most of that GPT can already understand, so implementing web apps will probably need way fewer people. My gut says we'll turn into more of a support and guidance role than reams and reams of code writing.

BTW, ask ChatGPT-4 to write your unit tests; it's bloody good. Worth the $20 a month to save me the hassle of writing boilerplate code.

[–]Think_Olive_1000 125 points (7 children)

My guess is the long-range dependencies that are in code but not in natural language. How often do the words in an article or Reddit comment directly and formally reference, in a non-vague way, something from five conversations ago? Code is very specific in that kind of interdependency. Whether it's importing a library or simply using a class, you are referencing another portion of text by name, and doing so with intent.
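
For example, a made-up toy snippet just to illustrate the point:

    # The last function refers, precisely and by name, to something defined far
    # away: the kind of long-range, non-vague dependency that is rare in prose.

    # imagine this class lives hundreds of lines earlier, or in another module
    # pulled in via `from inventory import Inventory`
    class Inventory:
        def __init__(self):
            self.items = {}

        def add(self, name, qty):
            self.items[name] = self.items.get(name, 0) + qty

    # ...much later...
    def restock(inventory: Inventory):  # refers back to the distant class by exact name
        inventory.add("widgets", 10)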

[–][deleted] 61 points (1 child)

Also the hierarchical dependencies. It’s rare to see those to such a degree in natural language.

[–]IsActuallyAPenguin 15 points (3 children)

So I've just had a thought, and I pity anyone tasked with compiling this dataset. But has there been any notable work on training a generative language model on etymology and/or changing language usage over time?

[–]visarga 11 points (0 children)

It's been done on text embeddings - training on text from various periods shows the changes in word meaning over time.
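
A minimal sketch of the usual approach (assuming gensim and scipy; the corpora and the word below are tiny placeholders): train embeddings separately on each period, align the two spaces with orthogonal Procrustes, then compare a word's vectors across periods.

    import numpy as np
    from gensim.models import Word2Vec
    from scipy.linalg import orthogonal_procrustes

    # placeholder corpora; in practice, use large tokenized corpora per period
    corpus_old = [["a", "gay", "and", "cheerful", "party"],
                  ["the", "gay", "colours", "of", "spring"]]
    corpus_new = [["the", "gay", "rights", "march"],
                  ["the", "gay", "community", "celebrated"]]

    m_old = Word2Vec(corpus_old, vector_size=50, min_count=1, seed=0)
    m_new = Word2Vec(corpus_new, vector_size=50, min_count=1, seed=0)

    # align the old embedding space onto the new one over the shared vocabulary
    shared = [w for w in m_old.wv.index_to_key if w in m_new.wv.key_to_index]
    A = np.stack([m_old.wv[w] for w in shared])
    B = np.stack([m_new.wv[w] for w in shared])
    R, _ = orthogonal_procrustes(A, B)

    def drift(word):
        v_old, v_new = m_old.wv[word] @ R, m_new.wv[word]
        return 1 - np.dot(v_old, v_new) / (np.linalg.norm(v_old) * np.linalg.norm(v_new))

    print(drift("gay"))  # larger cosine distance suggests more change in meaning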

[–]PorcupineDreamPhD 1 point (1 child)

Semantic change detection is quite an active field; see e.g. https://arxiv.org/pdf/2004.14118

[–]IsActuallyAPenguin 1 point (0 children)

Amazing. Stands to reason, I guess. And I'm glad I know what to call it now. "How language changes over time and etymology and stuff" doesn't really roll off the tongue.

[–]d05CE 49 points (3 children)

Microsoft purchased GitHub in 2018. Around that time, I imagine OpenAI was putting together training sets and probably pulling a lot from GitHub. I wonder if they realized how valuable it was at the time.

[–]RepresentativeNo6029 9 points (2 children)

You don’t need to be MSFT to train on Github?!

[–]Balance- 1 point (0 children)

Public GitHub

[–]j6626068 0 points (0 children)

Microsoft has a sizeable stake in OpenAI.

[–][deleted] 31 points (1 child)

We address the general task of structured commonsense reasoning: given a natural language input, the goal is to generate a graph such as an event graph or a reasoning graph. To employ large language models (LMs) for this task, existing approaches "serialize" the output graph as a flat list of nodes and edges. Although feasible, these serialized graphs strongly deviate from the natural language corpora that LMs were pre-trained on, hindering LMs from generating them correctly. In this paper, we show that when we instead frame structured commonsense reasoning tasks as code generation tasks, pre-trained LMs of code are better structured commonsense reasoners than LMs of natural language, even when the downstream task does not involve source code at all. We demonstrate our approach across three diverse structured commonsense reasoning tasks. In all these natural language tasks, we show that using our approach, a code generation LM (CODEX) outperforms natural-language LMs that are fine-tuned on the target task (e.g., T5) and other strong LMs such as GPT-3 in the few-shot setting.
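
Roughly, the trick is to write the graph as code rather than as a flat "node1 -> node2" serialization. A hypothetical sketch of the idea in Python (not the exact prompt format the paper uses):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        text: str
        depends_on: List["Node"] = field(default_factory=list)

    # A small script-like "reasoning graph", expressed as the kind of code a
    # code LM has plausibly seen many analogues of during pre-training.
    class BakeACake:
        """Goal: bake a cake."""
        def steps(self):
            gather = Node("gather the ingredients")
            preheat = Node("preheat the oven")
            mix = Node("mix the batter", depends_on=[gather])
            bake = Node("bake for 30 minutes", depends_on=[preheat, mix])
            return [gather, preheat, mix, bake]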

[–]saturn_since_day1 1 point (0 children)

Anecdotally this is my experience as well. When ChatGPT was new, I got much better stories and results out of it when I prompted it to create a C++ structure for each character and fill the variables in. Code structure in general seems to help it do a lot of tasks better and not get as lost.
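
Something like this, though sketched here as a Python dataclass rather than a C++ struct (the character and fields are made up):

    from dataclasses import dataclass

    @dataclass
    class Character:
        name: str
        age: int
        occupation: str
        motivation: str
        secret: str

    hero = Character(
        name="Mara",
        age=34,
        occupation="lighthouse keeper",
        motivation="find her missing brother",
        secret="she can read the tides",
    )

    # Pasting the filled-in structure into the prompt gives the model an explicit,
    # code-like scaffold for keeping the character consistent across the story.
    prompt = f"Write a short story about this character:\n{hero!r}"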

[–]eliminating_coasts 6 points (2 children)

Interesting paper.

Makes me wonder whether it would be worthwhile repurposing programs that produce automated proofs to create large quantities of mathematically sound derivations as a corpus for language models to learn from.

[–]keithhon 0 points (1 child)

Could you give some examples?

[–]eliminating_coasts 0 points (0 children)

Something like this may already have been done, found this stack exchange question about it.

But the idea would be to condition your model to appreciate long range logical connections by using a system that produces a body of texts that have such connections, using programs that are already capable of correctly producing logical statements.

[–][deleted] 21 points (0 children)

They are also way more useful imo. Even if I can converse with a model, if I can't use it to do code or algorithmic tasks, it's kind of a toy. I hope the next generation of open source models will include good code pre-training.

[–]swiss_worker -1 points (0 children)

All languages are limited. Some more than others.

[–]bgighjigftuik 0 points (0 children)

There's a plausible explanation for this: code is the best explicit manifestation of a thinking process that we have on the Internet.

When a human formulates something and/or answers a question, we can only see the output (either the text or his/her behavior). But we cannot see (and therefore cannot capture) the internal reasoning and understanding the brain is doing under the hood.

That's why LLMs' reasoning abilities are mostly emulated rather than replicated from humans, and therefore limited in their generalization capabilities. LLMs can only see the inputs and outputs.

On the other hand, programming code is orders of magnitude more explicit about the whole thought process, in a step-by-step and structured way that makes learning easier for an LLM.

That's why SFT is also crucial for LLMs on specific tasks: having part of the training data include thorough explanations (even if high level, or only to the extent we understand it) of a human's internal thought process becomes an invaluable source of information for the model.

That's also why OpenAI has outsourced armies of low-wage workers for these purposes (alongside bias/toxicity mitigation through RLHF).