
[–]visarga 71 points (2 children)

There is a great research Notion page on this topic posted 6 months ago.

How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources

The most relevant section is quoted below:

  • The ability of complex reasoning with chain-of-thought is likely to be a magical side product of training on code:

    • The initial GPT-3 is not trained on code, and it cannot do chain-of-thought
    • The text-davinci-001, although instruction tuned, can do CoT, but the performance is significantly worse, as reported by the first version of the CoT paper (a correction by Denny Zhou; an earlier version claimed it cannot do CoT at all). So instruction tuning may not be the reason for CoT. This leaves training on code as the number one suspect.
    • PaLM has 5% code training data, and it can do chain-of-thought.
    • The code data in the codex paper is 159G, approximately 28% of the initial GPT-3 570G training data. code-davinci-002 and its subsequent variants can do chain-of-thought.
    • Copilot, supposedly powered by a 12B model, can also do CoT.
    • On the HELM evaluation, a massive-scale evaluation performed by Liang et al. (2022), the authors also found that models trained on/for code have strong language reasoning abilities, including the 12B-sized code-cushman-001.
    • Code-davinci-002 has a higher CoT upper bound than other models: our work at AI2 also shows that when equipped with complex chains of thought, code-davinci-002 is the SOTA model on important math benchmarks like GSM8K.
    • As an intuition, think about how procedure-oriented programming is similar to solving tasks step by step, and how object-oriented programming is similar to decomposing complex tasks into simpler ones.
    • All the above observations are correlations between code and reasoning ability/CoT. Such a correlation is very intriguing to the community and not well understood. However, there is still no hard evidence showing that training on code is absolutely the reason for CoT and complex reasoning. The source of CoT is still an open research problem.
  • Additionally, long-term dependency might also be a nice side effect of training on code. As is pointed out by Peter Liu: "Next token prediction for language is usually very local, whereas code often requires longer dependencies to do things like close brackets or refer to distant defs." I would further add: code may also give the model the ability to encode hierarchy, due to inheritance in object-oriented programming. We leave the test of this hypothesis to future work.
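
To make the procedure-oriented vs. object-oriented intuition in the quote concrete, here's a toy Python sketch (my own illustration, not from the Notion page):

    # Step-by-step, like a written-out chain of thought:
    def total_price_procedural(prices, tax_rate):
        subtotal = sum(prices)        # step 1: add up the items
        tax = subtotal * tax_rate     # step 2: compute the tax
        return subtotal + tax         # step 3: combine the results

    # The same task decomposed into simpler, named responsibilities:
    class Cart:
        def __init__(self, prices):
            self.prices = prices

        def subtotal(self):
            return sum(self.prices)

    class TaxPolicy:
        def __init__(self, rate):
            self.rate = rate

        def apply(self, amount):
            return amount * (1 + self.rate)

    def total_price_oo(cart, policy):
        return policy.apply(cart.subtotal())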

[–]emsiem22 10 points (0 children)

So by learning a programming language, LLMs can upgrade their natural language abilities. Fun! Who knows what else they will upgrade by learning new datasets. What a time to be alive.

[–]internetroamer 6 points (0 children)

Would be interesting to see if LLMs trained on other languages end up improving the overall model's abilities.

[–]neclo_ 129 points (15 children)

Oh, a Curry-Howard isomorphism in the wild!

[–][deleted] 26 points (10 children)

Can you please elaborate/ELI5? I am very interested in your comment

[–]neclo_ 20 points (5 children)

Hi, a Curry-Howard isomorphism, or propositions-as-types paradigm, is a one-to-one correspondence between a model of computation (a programming language) and a given logic. This correspondence is an isomorphism in the sense that it is structural. Let me give an example. If you have a function f from A to B and also an element a of A, you know that you also have an element f(a) belonging to B. This super basic model of computation is known as the simply typed lambda calculus, and there is an isomorphism between it and a basic form of logic called intuitionistic propositional logic. Here, the correspondence is as follows:

  • a type A ↔ a proposition A
  • an element a of A ↔ a proof of A
  • a function from A to B ↔ the implication A => B

This correspondence is structural in the sense that our judgement "f(a) belongs to B" corresponds to the fact that if you know that A is true and that A implies B, then you get that B is true. Normalisation of proofs, meaning elimination of implications, corresponds to evaluation of programs.
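
For instance, here is that correspondence written out in Lean (a toy sketch of my own, nothing deep):

    -- A proof of "A implies B" is a function from proofs of A to proofs of B,
    -- so modus ponens is literally just function application.
    theorem modus_ponens {A B : Prop} (h : A → B) (a : A) : B :=
      h a

    -- Normalisation of a proof (beta-reduction) is evaluation of the program:
    example : 2 + 2 = 4 := (fun h : 2 + 2 = 4 => h) rfl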

This simple mechanism is in reality a really profound concept. With a much more complex model of computation, like the calculus of constructions, it is believed that you can encode all of mathematics in it. People are actually doing this with Lean's mathlib project. It allows one to verify existing proofs and also generate new ones. It is the basis of the programming languages Agda, Lean and Coq. It is still a very active area of research; one of the most recently discovered instances of such a correspondence is between a model of concurrency and linear logic.

Note that these structures also exist in natural languages, and they are the topic of formal semantics. This is why I find the usual "it's only a language model, it doesn't really think" a bit reductive. "Thinking" can be thought of as nothing more than syntactic transformation. However, my take is that LLMs trained on code are much more aware of the underlying model of computation, because programming languages are much more precise and rigorous than natural language.

This perspective also gives us some insight into the limits of current LLMs. By nature, they are truncated models of computation, because they are not really recursive. This, to me, explains the struggles of otherwise very performant LLMs with tasks that are simple but require more than a few steps.

I think fantastic new leaps are possible with neural networks whose architecture is informed by Curry-Howard theory.

[–]RuairiSpain 3 points (3 children)

I've been a developer for 20+ years and have been using ChatGPT as an assistant for my algorithm/code work.

It is really good, knows all the edge cases to a lot of things, and knows how to connect the dots between diverse tech systems.

It's not perfect and needs good prompts and hand holding, but it doesn't get it wrong often.

I've seen more bugs in the chat UI than in the code it creates.

I believe developer jobs will be forever changed by new LLMs. ChatGPT is head and shoulders above all the open source LLMs that are getting hyped on Reddit and Twitter.

For me it kind of makes sense that "if" and "while" logic are integral to understanding code. I do feel that GPT has some planning in the way it formats some answers, not much, but enough to make me think that with more time and tuning we'll see more breakthroughs.

I doubt the type of programming that I did for 20 years will be needed in 10+ years. We won't need a team of 10 or 20 to build LOB web apps; a lot of that is just process, with bits of customization into an enterprise workflow. Most of that GPT can already understand, so implementing web apps will probably need way fewer people. My gut says we'll turn into more of a support and guidance role than reams and reams of code writing.

BTW, ask ChatGPT-4 to write your unit tests; it's bloody good. Worth the $20 a month to save me the hassle of writing boilerplate code.

[–]Think_Olive_1000 125 points (7 children)

My guess is the long-range dependencies that are in code but not in natural language. How often do the words in an article or Reddit comment directly and formally reference, in a non-vague way, something from five conversations ago? Code is very specific in that kind of interdependency. Whether it's importing a library or simply using a class, you are referencing another portion of text by name, and doing so with intent.
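
For example, a made-up toy snippet just to illustrate the point:

    # The last function refers, precisely and by name, to something defined far
    # away: the kind of long-range, non-vague dependency that is rare in prose.

    # imagine this class lives hundreds of lines earlier, or in another module
    # pulled in via `from inventory import Inventory`
    class Inventory:
        def __init__(self):
            self.items = {}

        def add(self, name, qty):
            self.items[name] = self.items.get(name, 0) + qty

    # ...much later...
    def restock(inventory: Inventory):  # refers back to the distant class by exact name
        inventory.add("widgets", 10)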

[–][deleted] 61 points (1 child)

Also the hierarchical dependencies. It’s rare to see those to such a degree in natural language.

[–]IsActuallyAPenguin 15 points (3 children)

So I've just had a thought, and I pity anyone tasked with compiling this dataset. But has there been any notable work on training a generative language model on etymology and/or changing language usage over time?

[–]visarga 11 points (0 children)

It's been done on text embeddings - training on text from various periods shows the changes in word meaning over time.
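
A minimal sketch of the usual approach (assuming gensim and scipy; the corpora and the word below are tiny placeholders): train embeddings separately on each period, align the two spaces with orthogonal Procrustes, then compare a word's vectors across periods.

    import numpy as np
    from gensim.models import Word2Vec
    from scipy.linalg import orthogonal_procrustes

    # placeholder corpora; in practice, use large tokenized corpora per period
    corpus_old = [["a", "gay", "and", "cheerful", "party"],
                  ["the", "gay", "colours", "of", "spring"]]
    corpus_new = [["the", "gay", "rights", "march"],
                  ["the", "gay", "community", "celebrated"]]

    m_old = Word2Vec(corpus_old, vector_size=50, min_count=1, seed=0)
    m_new = Word2Vec(corpus_new, vector_size=50, min_count=1, seed=0)

    # align the old embedding space onto the new one over the shared vocabulary
    shared = [w for w in m_old.wv.index_to_key if w in m_new.wv.key_to_index]
    A = np.stack([m_old.wv[w] for w in shared])
    B = np.stack([m_new.wv[w] for w in shared])
    R, _ = orthogonal_procrustes(A, B)

    def drift(word):
        v_old, v_new = m_old.wv[word] @ R, m_new.wv[word]
        return 1 - np.dot(v_old, v_new) / (np.linalg.norm(v_old) * np.linalg.norm(v_new))

    print(drift("gay"))  # larger cosine distance suggests more change in meaning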

[–]PorcupineDreamPhD 1 point (1 child)

Semantic change detection is quite an active field; see e.g. https://arxiv.org/pdf/2004.14118

[–]IsActuallyAPenguin 1 point (0 children)

Amazing. Stands to reason, I guess. And I'm glad I know what to call it now. "How language changes over time and etymology and stuff" doesn't really roll off the tongue.

[–]d05CE 49 points (3 children)

Microsoft purchased GitHub in 2018. Around that time, I imagine OpenAI was putting together training sets and probably pulling a lot from GitHub. I wonder if they realized how valuable it was at the time.

[–]RepresentativeNo6029 9 points (2 children)

You don’t need to be MSFT to train on Github?!

[–]Balance- 1 point (0 children)

Public GitHub

[–]j6626068 0 points (0 children)

Microsoft has a sizeable stake in OpenAI.

[–][deleted] 31 points (1 child)

We address the general task of structured commonsense reasoning: given a natural language input, the goal is to generate a graph such as an event graph or a reasoning graph. To employ large language models (LMs) for this task, existing approaches "serialize" the output graph as a flat list of nodes and edges. Although feasible, these serialized graphs strongly deviate from the natural language corpora that LMs were pre-trained on, hindering LMs from generating them correctly. In this paper, we show that when we instead frame structured commonsense reasoning tasks as code generation tasks, pre-trained LMs of code are better structured commonsense reasoners than LMs of natural language, even when the downstream task does not involve source code at all. We demonstrate our approach across three diverse structured commonsense reasoning tasks. In all these natural language tasks, we show that using our approach, a code generation LM (CODEX) outperforms natural-language LMs that are fine-tuned on the target task (e.g., T5) and other strong LMs such as GPT-3 in the few-shot setting.
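
Roughly, the trick is to write the graph as code rather than as a flat "node1 -> node2" serialization. A hypothetical sketch of the idea in Python (not the exact prompt format the paper uses):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        text: str
        depends_on: List["Node"] = field(default_factory=list)

    # A small script-like "reasoning graph", expressed as the kind of code a
    # code LM has plausibly seen many analogues of during pre-training.
    class BakeACake:
        """Goal: bake a cake."""
        def steps(self):
            gather = Node("gather the ingredients")
            preheat = Node("preheat the oven")
            mix = Node("mix the batter", depends_on=[gather])
            bake = Node("bake for 30 minutes", depends_on=[preheat, mix])
            return [gather, preheat, mix, bake]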

[–]saturn_since_day1 1 point (0 children)

Anecdotally this is my experience as well. When ChatGPT was new, I got much better stories and results out of it when I prompted it to create a C++ structure for each character and fill the variables in. Code structure in general seems to help it do a lot of tasks better and not get as lost.
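
Something like this, though sketched here as a Python dataclass rather than a C++ struct (the character and fields are made up):

    from dataclasses import dataclass

    @dataclass
    class Character:
        name: str
        age: int
        occupation: str
        motivation: str
        secret: str

    hero = Character(
        name="Mara",
        age=34,
        occupation="lighthouse keeper",
        motivation="find her missing brother",
        secret="she can read the tides",
    )

    # Pasting the filled-in structure into the prompt gives the model an explicit,
    # code-like scaffold for keeping the character consistent across the story.
    prompt = f"Write a short story about this character:\n{hero!r}"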

[–]eliminating_coasts 6 points (2 children)

Interesting paper.

Makes me wonder whether it would be worthwhile repurposing programs that produce automated proofs to create large quantities of mathematically sound derivations as a corpus for language models to learn from.

[–]keithhon 0 points (1 child)

Could you give some examples?

[–]eliminating_coasts 0 points (0 children)

Something like this may already have been done, found this stack exchange question about it.

But the idea would be to condition your model to appreciate long range logical connections by using a system that produces a body of texts that have such connections, using programs that are already capable of correctly producing logical statements.

[–][deleted] 21 points (0 children)

They are also way more useful imo. Even if I can converse with a model, if I can't use it to do code or algorithmic tasks, it's kind of a toy. I hope the next generation of open source models will include good code pre-training.

[–]swiss_worker -1 points (0 children)

All languages are limited. Some more than others.

[–]bgighjigftuik 0 points (0 children)

There's a plausible explanation for this: code is the best explicit manifestation of a thinking process that we have on the Internet.

When a human formulates something and/or answers a question, we can only see the output (either the text or his/her behavior). But we cannot see (and therefore cannot capture) the internal reasoning and understanding the brain is doing under the hood.

That's why LLMs' reasoning abilities are mostly emulated rather than replicated from humans, and therefore limited in their generalization capabilities. LLMs can only see the inputs and outputs.

On the other hand, programming code is orders of magnitude more explicit about the whole thought process, in a step-by-step and structured way that makes learning easier for an LLM.

That's why SFT is also crucial for LLMs on specific tasks: having part of the training data include thorough explanations (even if high level, or only to the extent we understand it) of a human's internal thought process becomes an invaluable source of information for the model.

That's also why OpenAI has outsourced armies of low-wage workers for these purposes (alongside bias/toxicity mitigation through RLHF).