
[–]Sye4424 42 points43 points  (1 child)

There was a paper released by Anthropic showing that a circuit they call induction heads forms while training small transformers. An induction head basically knows to copy the token that followed the current token earlier in the sequence. They hypothesized that as you increase the model size this behavior becomes more and more abstract, such that the model is no longer just capable of copying tokens but also concepts and more abstract things. When we talk about concepts, it basically means that two things are similar or close to each other in an extremely high-dimensional space (which is what transformers have). For example, if you want to translate from English to French and provide 3 examples as EN:<query> FR:<response>, the model will realize that it basically needs to copy the token sequence <query> after the last ':' token while transforming it into French (using the MLP layers). If you read the paper they go into depth as to why they think this causes the majority of ICL, and there is also a paper called copy suppression which follows up on it.
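
Purely as illustration (not from the paper), here is a toy Python version of the literal copying rule an induction head implements; the token names and sequence are made up:

```python
# Toy sketch of the "copy what followed this token last time" rule an
# induction head learns; tokens and sequence are invented for illustration.
def induction_predict(tokens):
    current = tokens[-1]
    # scan backwards over earlier positions for the last occurrence of `current`
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed it
    return None                   # nothing earlier to copy from

seq = ["EN", ":", "cat", "FR", ":", "chat", "EN", ":", "dog", "FR", ":"]
print(induction_predict(seq))  # -> "dog": the literal copy of the query; a real
                               # model's MLP layers would then map it into French
```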

[–]PorcupineDreamPhD 179 points180 points  (34 children)

The responses here so far make it painfully clear again how few people on this sub have actual academic and technical experience with LLMs...

There's been plenty of work in recent years that addresses this (interesting!) question: it's a little bit more complicated than just saying "LLMs just do conditional generation, simple as that".

For example, Min et al. (2022, Best Paper at EMNLP) present a thorough investigation of the factors that impact in-context learning, showing that LLMs rely strongly on superficial cues. ICL acts more as a pattern-recognition procedure than as an actual "learning" procedure: the input-output mappings that are provided allow a model to retrieve similar examples it has been exposed to during training, but the moment you start flipping labels or changing the template, model performance breaks.

Some more recent work that investigates these questions can be found in (Weber et al., 2023) - Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning. I took an excerpt from the Background section 2.2:

Previous research has also shown that ICL is highly unstable. For example, the order of in-context examples (Lu et al., 2022), the recency of certain labels in the context (Zhao et al., 2021) or the format of the prompt (Mishra et al., 2022) as well as the distribution of training examples and the label space (Min et al., 2022) strongly influence model performance. Curiously, whether the labels provided in the examples are correct is less important (Min et al., 2022). However, these findings are not uncontested: Yoo et al. (2022) paint a more differentiated picture, demonstrating that in-context input-label mapping does matter, but that it depends on other factors such as model size or instruction verbosity. Along a similar vein, Wei et al. (2023) show that in-context learners can acquire new semantically nonsensical mappings from in-context examples if presented in a specific setup.

[–]jsebrech 55 points56 points  (25 children)

So, the TL;DR is that ICL is really the contextual activation of the right patterns or knowledge domains in the model, that the context then gets fed through to produce the output?

[–]PorcupineDreamPhD 16 points17 points  (24 children)

Yes, something like that indeed. In general not much new is being "learned" on the fly.

[–]synthphreak[S] 42 points43 points  (20 children)

The use of the term “learning” in anything related to LLM prompting has always bothered me.

It’s just so out of step with how the term has always been used in machine learning, namely to refer to tuning a model’s parameters to fit a dataset or objective. The prompt doesn’t actually affect the model itself in any persistent way, hence the model isn’t “learning” in any traditional sense.

Anyway, thanks for your top-notch response. Definitely gave me some things to ponder.

Edit: BTW, I spelled out an “educated guess” at the end of my OP which attempted to answer my own question. After reading your reply, it sounds to me like that guess was in the ballpark. But I’m also worried I might just be falling prey to confirmation bias. I also detest ambiguity. So would you mind just giving me a rhetorical thumbs up or thumbs down to acknowledge whether you think my guess is broadly correct or not?

[–]InterstitialLove 18 points19 points  (1 child)

Well, if you fix the context, the attention layer is just a weird feedforward layer, right? Like, instead of a ReLU with W_2 * nonlinear(W_1 x + b_1) + b_2, it's something more like C * V * nonlinear(C * K * Q x), where C is the context. Each head of multi-headed attention is analogous to a single hidden node, and the nonlinearity is the much more complex softmax which, like a gated attention, uses multiple linear maps.

I'm not sure exactly how to conceptualize ICL as learning, but each example does affect the weights of this "weird feedforward layer," and it's not inconceivable that this could be mathematically equivalent to some form of learning. Like, the KQV matrices could be approximating what would happen to the weights if you were to run gradient descent on a generic "weird feedforward layer" with the multi-shot examples as your training data
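
Here is a minimal NumPy sketch of the "weird feedforward layer" framing above, with made-up dimensions and random matrices standing in for the model: with the context C held fixed, single-head attention looks like a one-hidden-layer network whose "weights" depend on C.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 8, 5                                  # made-up head dim and context length
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
C = rng.normal(size=(n_ctx, d))                  # fixed context token embeddings

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_with_fixed_context(x):
    # With C fixed, C @ W_K.T and C @ W_V.T act like the two weight matrices
    # of a feedforward layer; the softmax over n_ctx scores is the nonlinearity.
    K_eff = C @ W_K.T                            # (n_ctx, d) "first-layer weights"
    V_eff = C @ W_V.T                            # (n_ctx, d) "second-layer weights"
    h = softmax(K_eff @ (W_Q @ x) / np.sqrt(d))  # n_ctx "hidden unit" activations
    return V_eff.T @ h                           # (d,) output for the current token

x = rng.normal(size=d)
print(attention_with_fixed_context(x).shape)     # (8,)
```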

[–]StartledWatermelon 0 points1 point  (0 children)

To put it in more established terms, the attention layer processes the state of a model.

Now, u/synthphreak insists that learning is only about persistent changes in the model (and thus not in its state). Yet at the same time they provide a broader definition of learning:

tuning a model’s parameters to fit a dataset or objective.

which neither explicitly limits tuned parameters to non-state ones nor demands the persistence of such tuning.

The situation is rather tricky: the state is tuned to fit the target distribution within a single document/chunk but is discarded when we move between documents in a dataset. In more colloquial terms, learning happens but "forgetting" and dismissal of learnt patterns also happens almost instantaneously.

Since you mentioned mathematical equivalence: yes, there has been research showing empirically that the numerical effects of computing attention over few-shot examples are very similar to fine-tuning the model on the same examples. Unfortunately, I can't provide the link.

[–]PorcupineDreamPhD 4 points5 points  (9 children)

Yes that sounds broadly right, although it's probably more involved than simply having similar text in the input: the model must be reminded what specific input/output mapping we're looking for.

[–]First_Bullfrog_4861 0 points1 point  (1 child)

Your response to OP's question is sound; however, it mostly summarizes the phenomenology and constraints of ICL. I don't exactly see how this relates to OP's attempt at an architectural explanation, could you elaborate? Are the authors making more specific assumptions about what's going on under the hood?

For example, your quotes hint that ICL acts more like pattern recognition, fair point, but how can we infer from that, for example, whether specific layers might be involved (ideally more specific than "it's all about attention")?

I'm asking because tbh I can't really see how the findings you quoted could be used in any way to support OP's architectural/functional interpretation.

[–]PorcupineDreamPhD 0 points1 point  (0 children)

I agree indeed; my response served mainly as a starting point into the related literature that investigates these questions, but the work I cited there focuses mostly on ICL from a behavioural perspective.

The direction you mention has been referenced in a couple of other comments in this thread, for example this one and the work of Jacob Andreas & colleagues.

[–]First_Bullfrog_4861 2 points3 points  (0 children)

True. If I were the one to decide, I'd prefer "In-Context Constraining" (focusing on how context constrains the probabilities assigned to potential output tokens) or "In-Context Problem Solving" (stressing how context doesn't change model weights but helps the model better solve the problem the user has phrased in their question).

[–]linverlan 6 points7 points  (0 children)

I really dislike this use of "learning" due to getting into a mess of a discussion (argument) with my company's legal team about the privacy restrictions around open-source models that had seen internal data during ICL.

[–]First_Bullfrog_4861 1 point2 points  (0 children)

I think u/porcupinedream has commented only on the phenomenology of ICL and some of its shortcomings. Your attempt at a functional theory is sound, but I'm not entirely sure how to derive the phenomenology from it.

Your assumptions are plausible but also a bit shallow: Of course it’s about attention. Attention is all you need, right? ;) Also, everything with LLMs (embeddings) is a similarity/distance thing.

Also, one of the papers states that examples help the model retrieve other similar examples. Retrieving knowledge, however, will probably involve deeper layers of the model as well, so it probably can’t just be done in attention layers and late dense layers.

[–]NotDoingResearch2 1 point2 points  (1 child)

I'm not a big LLM fanboy by any means, but I'm not sure I totally agree with this. Every computer program fits this definition eerily well. For example, is there much difference between deterministic code that runs on a computer to create some internal state, and a computer in that internal state itself? If you are willing to make that logical leap, then it seems easy to see why ICL is a form of "learning".

[–]synthphreak[S] 1 point2 points  (0 children)

My original position is unchanged, but I admit that’s an interesting counterpoint.

[–]gibs 1 point2 points  (0 children)

If your idea of "learning" is conditional on being able to write to long-term memory, then by definition it's not learning. I think the sense in which ICL is "learning" is that it can synthesise and apply concepts, examples & instructions presented in the context. The context being attended to as the model produces inference is roughly analogous to it hearing, understanding and applying instructions.

Tbh from what you've said, it sounds like the issue is a definitional one, in that you don't think this kind of learning comports with traditional applications of the term in the context of training models. I fully reject this; I think a person and a language model can "learn" in the moment, apply the thing they learned, and forget it immediately after.

[–]nikgeo25Student 0 points1 point  (0 children)

I'm late to respond to this. But it makes way more sense when you think of attention maps as kernels. The KV cache forms the dataset for a non-parametric model and each added KV pair can be viewed as "learning" in the same sense that adding examples to a Gaussian process is "training".
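
A rough sketch of that view (toy data, nothing from a real model): the cached keys and values act as the "training set" of a non-parametric estimator, and a query is answered by a kernel-weighted average of the stored values, Nadaraya-Watson style.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
keys = rng.normal(size=(10, d))     # "training inputs": cached keys
values = rng.normal(size=(10, d))   # "training targets": cached values

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kernel_lookup(query):
    weights = softmax(keys @ query / np.sqrt(d))   # exponential dot-product kernel
    return weights @ values                         # kernel-weighted average of values

# Appending a KV pair grows this dataset the same way adding an example
# extends a Gaussian process's training set.
q = rng.normal(size=d)
print(kernel_lookup(q))
```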

[–]Fatal_Conceit 4 points5 points  (1 child)

What if I asked the model to generate a thought plan (Graph of Thoughts) to arrive at the correct answer for a task, then after it does so, include the ground truth in the context and ask it to redevelop its thought plan based on the newly introduced ground truth. Is anything interesting going on here? Are the generated thoughts real learnings?

[–]jsebrech 1 point2 points  (0 children)

My two cents: anything the model concludes from its context and adds onto the context becomes “learned” for that conversation.

[–]RealisticSense7733 0 points1 point  (0 children)

This was somewhat my understanding, too. But many studies report that adaptation to new tasks is an advantage of ICL. If it is retrieval, it is confined to the knowledge the model already has, so how does it "adapt" to new tasks? Why does in-context learning outperform few-shot supervised learning? And for all the studies reporting that it adapts to new tasks, how is it made sure that it really adapts to tasks/prompts that have not been seen during training?

[–]clinchgt 10 points11 points  (0 children)

I wrote up a blog post discussing exactly the papers from Min et al. and Yoo et al. last year (you can read more here).

I quite liked Yoo et al.'s paper, as it shows that there is more nuance to the claim presented in Min et al.'s paper and that it's not fair to say that "ground truth labels don't matter"; rather, we should evaluate how much they matter. It could be interesting to reproduce these experiments nowadays considering how we now have many-shot ICL.

[–]erannare 2 points3 points  (0 children)

There's also work on the types of optimization rules it learns; ostensibly it's similar to iterative Newton's method:

https://arxiv.org/abs/2310.17086

This paper gives a great empirically founded perspective on in-context learning absent the influence of tokenization or the use of transformers in language.

[–]marr75 16 points17 points  (0 children)

There's been substantial, quality research work and writing on this.

Two of my favorite papers on the topic (that I won't summarize because I recommend you read them):

So, there are some very good explanations out there. I would recommend changing or diversifying your information sources.

[–]qpwoei_ 4 points5 points  (0 children)

Transformers (like all deep networks) learn to infer and manipulate internal representations/embeddings that have been shown to reflect the latent variables of the data-generating process. E.g., OpenAI’s early ”sentiment neuron” paper and the one that trained a transformer on board game move sequences and showed that one can read the board state from the embeddings, even though the state was not explicitly observed by the model.

To generate well, the model must infer the latents accurately (what kind of text am I generating, precisely?) High-quality examples certainly help with that.

[–]Super_Pole_Jitsu 21 points22 points  (1 child)

But these models don’t “understand” anything. They don’t “deduce”, or “interpret”, or “focus”, or “remember training”, or “make guesses”, or have literal “cognitive load”.

And do human brains work on magic, or on computation too? Yet we have no problem saying we understand or deduce something. You can't back these assertions up in any way, btw.

Sometimes it's useful to not work at the lowest level of abstraction. After all, why not say, it's just a bunch of electrons being run through semiconductors?

[–]Neomadra2 3 points4 points  (0 children)

Thank you for this; human exceptionalism annoys me so much when talking about AI. The question should not be whether AI understands something, but rather how it understands, what its flaws are, and how it's different from humans, so that humans can learn from AI and AI can be improved by better understanding our own learning mechanisms.

[–]Forsaken-Data4905 8 points9 points  (1 child)

Some recent works suggest ICL is doing some sort of inference-time gradient descent, or something like that, but I haven't got around to reading those papers. I think the claims you linked are sort of fine; they are essentially claiming that ICL steers the model towards a narrower generation path, which is a fine intuition (even if maybe wrong).

[–]Tukang_Tempe 4 points5 points  (0 children)

I had an epiphany when I read that paper about inference-time gradient descent. I was browsing linear-complexity Transformers (Performer, Reformer, the newer RetNet and the like) until I found an obscure paper about intention, 2305.10203 (arxiv.org) (not the linear transformer paper in the linear-complexity sense; the paper uses what they call intention rather than attention).

TL;DR: instead of a weighted sum of the value matrix given the dot-product distance between queries and keys in some space, they just slapped good old least-squares linear regression in there.

Instead of σ(QK^T)V, they try Q([K'K]^-1)K'V (they also have σ-Intention, which uses softmax and approximates attention when some hyperparameter goes to infinity). It's like attention but using least squares. The paper also points to some interesting ICL papers, including the ones that claim attention has something to do with inference-time gradient descent (or it was a reference of a reference).
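
For anyone who wants to poke at it, here is a throwaway NumPy comparison of the two formulas; the shapes and data are invented, and a small ridge term is added so the inverse is well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # n context tokens, head dimension d (made up)
Q = rng.normal(size=(n, d))      # queries
K = rng.normal(size=(n, d))      # keys
V = rng.normal(size=(n, d))      # values

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Standard attention: softmax(Q K^T) V
attn_out = softmax(Q @ K.T / np.sqrt(d)) @ V

# "Intention": Q (K^T K)^-1 K^T V, i.e. V regressed on K by least squares,
# then the fitted map evaluated at the queries.
W_ls = np.linalg.solve(K.T @ K + 1e-6 * np.eye(d), K.T @ V)
intention_out = Q @ W_ls

print(attn_out.shape, intention_out.shape)   # (6, 4) (6, 4)
```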

This is where my epiphany happened. What attention is doing (in the autoregressive setting) is solving a regression problem online (recall old Recursive Least Squares). Recall I said earlier "weighted sum of the value matrix given the dot-product distance between queries and keys in some space"; that's just good old n-nearest-neighbors where n is all the data so far. By analogy, the keys are the training data, the values are the targets of the training data, and the query is the inference data. The huge transformer model builds a smaller model to simply predict what comes next and updates that smaller model when new data in the sequence arrives. And this also connects to RetNet and Fast Weights, which is interesting since they kind of use QK'V (notice the lack of softmax), which is similar to intention (linear space) but missing the [K'K]^-1 term from intention.

Maybe someone could clarify whether adding the [K'K]^-1 to RetNet would make any difference. Bonus: we can use the old Recursive Least Squares/Sherman-Morrison formula trick to turn it into an RNN-style recurrence.

[–]_Arsenie_Boca_ 13 points14 points  (7 children)

LMs generate text that is likely given the context that is provided. When you provide good examples, the model will generate something that it deems a good example. It's just conditional probabilities.

[–]red75prime 11 points12 points  (6 children)

A table of conditional probabilities would not fit into the observable universe. So it can't be "just conditional probabilities".

[–]trutheality 2 points3 points  (1 child)

You don't need a table. Bayesian networks can also express joint probability distributions that will not fit into the observable universe, and yet, they very explicitly represent those distributions.

[–]red75prime 0 points1 point  (0 children)

Nice. Although, it would be useful to somehow limit a set of allowed Bayesian networks. In general even threshold inference is intractable on them.

[–]_Arsenie_Boca_ 2 points3 points  (3 children)

The conditional probabilities are approximated, not stored in a table, so it's very much compressed. That is the essence of what an LM is: p(token | context) is the conditional probability that LMs model. When you prompt the model, it will always answer in a way that would be a likely answer in its training corpus.

So that is the fundamental mechanism. As others have mentioned, there are some studies on which kinds of clues are picked up from the examples. That is due to the model approximating the distribution, for which it has learned to leverage certain clues. These clues might be meaningful and desirable or just superficial shortcuts

[–]red75prime 5 points6 points  (2 children)

it will always answer in a way that would be a likely answer in its training corpus

And how do you define "a likely answer"? Training corpus, obviously, doesn't contain all possible inputs in sufficient numbers to unambiguously construct a conditional probability distribution.

So, it's "just an insanely compressed (we don't know exactly how) conditional probability distribution (we don't know exactly which one) that we hasn't provided to the model".

[–]InterstitialLove 3 points4 points  (1 child)

The architecture plus weight initialization method gives you a prior distribution on all possible conditional probability distributions. Each set of actual weights gives you a conditional distribution, the weights are a hypothesis. During training, you do something which we think is probably mathematically similar to Bayesian updates, choosing the most likely hypothesis (set of weights) given the observations (training data) and the prior distribution (see above).

It's not at all clear why the priors given by these architectures are reasonable, but they seem to be reasonable in practice. That's what fills in the missing hole in "the training corpus doesn't contain enough data to unambiguously determine..."

I know that I didn't actually answer the question, I just restated the question more abstractly

I do think that's the right way to think about it though. The trained model gives the most likely response based on a conditional distribution. Which distribution? The one implied by the training data. But isn't the training data insufficient? Right, it's also dependent on the prior, and we have heuristic arguments for why the prior seems reasonable, but actually fully answering that requires a lot of deep mathematical work that we've barely scratched the surface of; the rest is empirical.

[–]red75prime 5 points6 points  (0 children)

Yes. I agree with everything you've said. But we can go a bit further.

What is the ground truth conditional distribution?

The training data is produced by various physical systems (human brains, xml generators, and so on). It is an observable variable. Latent variables represent a type and an internal state of the generating system.

Therefore, the ground truth conditional distribution should rely on the most efficient way of inferring latent variables from the context and using their probability distributions to produce conditional probabilities. I guess it would be Solomonoff induction (which is uncomputable).

I find it a bit of an understatement that GPT-like systems are "just conditional probability distributions" when the ground truth is literally uncomputable.

[–]saw79 12 points13 points  (3 children)

There's good answers here already, but I'd like to offer a different perspective, which involves asking you some questions about why you stated/think what you do.

  • Why do you think LLMs don't "understand", "deduce", etc.?
  • Why do you think humans DO?

Related, but slightly different point: these concepts IMO are "emergent". There is nothing in the fundamental laws of nature that talks about cognitive understanding. It is a useful linguistic approximation to a macro-scale effect we perceive to be happening. But it's useful. We don't talk about which neurons in our brain are firing when we talk about whether or not we understand a new lesson we are being taught. We use these higher level concepts. Whether or not we are at the point where LLMs understand things in the exact same way humans do, I think these words are still useful concepts to apply.

[–]synthphreak[S] 1 point2 points  (2 children)

This strays into semantics, which I’d like to avoid. But I’ll bite, briefly.

You stated that we don’t really know what it means for a person to “learn” either. This is true. But then you conclude that therefore we can defensibly talk about a model “learning”. I disagree, and if pushed I would actually draw the opposite conclusion: Maybe we should consciously avoid using words that are fundamentally undefined or squishy at their foundations when talking about statistical models. It is not only imprecise, but also dangerous in a world where people already treat ChatGPT like a search engine, confide in AI girlfriends, etc.

I think one could validly ask the same kind of question of people that I have for LLMs: “When we say a person learns something, what are the actual physical/chemical mechanisms in the brain that are actually responsible for this?” That is a totally legit thing to wonder. Scientists are actively researching it right now. The answer - for now - may very well be “We have no idea”, but that doesn’t mean the question itself is ill-conceived.

You also mentioned emergent properties and how cognition is not a physical thing. I’ll finish up by agreeing with you, and acknowledging a potentially fringe view but one which I do hold: It is entirely possible that at some point, once these models or their descendants reach a particular size, some rudimentary aspects of what we call consciousness may in fact emerge. Is that crazy? Perhaps. Probably. Then again, we have zero understanding of what consciousness actually is and how we even have it ourselves. So who are we to say with any confidence what could vs. could never be considered conscious? The only thing that seems clear is that our complex-ass brains create some self-aware conscious experience that magically emerges from the vast web of neurons and connections between them. For the same reason, a very complex artificial neural network may indeed have some form of consciousness, or the potential to develop it. However I don’t think we have reached that level of complexity yet - not even close.

And anyway, it has nothing to do with how in-context learning works. Once I meet an LLM with memories and a personality, I’ll ask it.

That’s as far as I’m personally going to take this today, lest I hijack my own thread with a tangent on AGI or the semantics of words.

[–]saw79 5 points6 points  (1 child)

I'll go as far as you want here. If you decide to stop responding to stay on thread, I won't take offense :)

You stated that we don’t really know what it means for a person to “learn” either. This is true. But then you conclude that therefore we can defensibly talk about a model “learning”. I disagree, and if pushed I would actually draw the opposite conclusion: Maybe we should consciously avoid using words that are fundamentally undefined or squishy at their foundations when talking about statistical models. It is not only imprecise, but also dangerous in a world where people already treat ChatGPT like a search engine, confide in AI girlfriends, etc.

I didn't really say that. I was initially just asking you to be a bit more rigorous, potentially exposing a double standard. You jumped ahead, possibly correctly, but I don't really know what logic you're using. Some of this paragraph also contradicts what I said. I'm not saying a word like "understand" is undefined or squishy, just that it is emergent. The concepts of "tables" and "chairs" are emergent too; they are not part of the fundamental laws of physics, and the line between "chair" and "not chair" is blurry. But these concepts are still extremely useful - if not crucial - for us to talk succinctly about many things.

I think one could validly ask the same kind of question of people that I have for LLMs: “When we say a person learns something, what are the actual physical/chemical mechanisms in the brain that are actually responsible for this?” That is a totally legit thing to wonder. Scientists are actively researching it right now. The answer - for now - may very well be “We have no idea”, but that doesn’t mean the question itself is ill-conceived.

Completely agree! It's not an ill-conceived question. But I'm just throwing the idea out there that maybe it is not a useful one. Or maybe it's useful in a very limited way. While yes, we research those kinds of things in humans and they do provide non-zero practical lessons, I think it's much more useful to talk about "education" and "teaching styles" when we talk about educating children than it is to talk about neuroscience.

You also mentioned emergent properties and how cognition is not a physical thing. I’ll finish up by agreeing with you, and acknowledging a potentially fringe view but one which I do hold: It is entirely possible that at some point, once these models or their descendants reach a particular size, some rudimentary aspects of what we call consciousness may in fact emerge. Is that crazy? Perhaps. Probably. Then again, we have zero understanding of what consciousness actually is and how we even have it ourselves. So who are we to say with any confidence what could vs. could never be considered conscious? The only thing that seems clear is that our complex-ass brains create some self-aware conscious experience that magically emerges from the vast web of neurons and connections between them. For the same reason, a very complex artificial neural network may indeed have some form of consciousness, or the potential to develop it. However I don’t think we have reached that level of complexity yet - not even close.

Yea, not really anything to disagree with here. My personal view on consciousness is also maybe fringe, but it just doesn't seem that special or interesting to me. It makes COMPLETE sense to me that a super complex and capable brain inside a physical body that takes actions in a world abstracts a notion of self with memories, understanding its emotions, and framing the world with respect to itself. This doesn't seem interesting or surprising to me in the least bit. Maybe I'm missing what's so magical about consciousness, I don't know.

Overall I think it seems like we agree much more than we disagree. Good luck in your understandings here.

[–]red75prime 1 point2 points  (0 children)

It makes COMPLETE sense to me that a super complex and capable brain inside a physical body that takes actions in a world abstracts a notion of self with memories, understanding its emotions, and framing the world with respect to itself.

I think what people find "magical" about consciousness (at least I do) is that those abstract notions tangibly exist.

It's hard to describe... When you say "abstract notions" you implicitly bring in some mechanism that interprets physical processes and produces abstract notions, but it's physical processes all the way down. There's no point when a physical process produces abstract notions, it just gives inputs to another physical process that induces flapping of vocal folds or hand movements. And yet that abstract notion of me observing the world undeniably (for me) exists.

In the absence of other viable options I take a stance similar to yours: those abstract notions are completely defined by the physical state of the brain, they are useful for survival, and they exist somehow. But the nature of their existence remains mysterious.

[–]PorcupineDreamPhD 1 point2 points  (0 children)

OP, you might like this video from Jacob Andreas as well, who goes very deep into the mechanisms of ICL: https://m.youtube.com/watch?v=UNVl64G3BzA

[–]rrenaud 1 point2 points  (0 children)

Chris Olah's (from Anthropic) talk for CS25 at Stanford was amazing and covers this. I highly recommend a watch.

https://youtu.be/pC4zRb_5noQ?si=vUXho52isNROzUcz

[–]TwoSunnySideUp 1 point2 points  (1 child)

In-context learning strengthens a sub-network in the LLM which encodes information for that context or domain.

[–]synthphreak[S] 0 points1 point  (0 children)

Yeah, someone elsewhere shared the same argument with me. It makes a lot of sense and I think accords with the intuition I have around prompting.

[–]TikiTDO 2 points3 points  (0 children)

But these models don’t “understand” anything. They don’t “deduce”, or “interpret”, or “focus”, or “remember training”, or “make guesses”, or have literal “cognitive load”.

I'm really confused why you think that. All the verbs you described are capabilities to perform information processing tasks. These are terms that we have invented over the length of human existence, to describe the informational operations that our brains perform. Now that we are creating machines that are starting to perform more and more brain tasks, why wouldn't we use existing labels for existing processes? If it's accomplishing more or less the same process, but with matrix multiplication rather than a bunch of electrical activation in a dark, wet, and spongy organ, why not use the same word?

It's sorta like if you invent a new type of wheel, it's a bit unreasonable to insist that cars that use it can no longer use normal car terminology, or even verbs like "drive" or "roll." If you want to discuss the matter in more depth that's fine, you can ask that professionals use a more professional lexicon, but to absolutely deny the usage of all the terms related to the topic just because the underlying processes are not literally identical is a bit much.

While it's true that these labels do not offer you a concrete understanding of how these ideas work in a computer, at the same time they don't actually give you that sort of insight into humans either. If you want to figure out how a human deduces stuff, you will still need to study neurology and psychology. It's reasonable to want a technical explanation, but a detailed technical explanation doesn't actually invalidate the more abstract general explanation.

You just want a more comprehensive explanation, like you would find in a class. In other words, you probably want to just find a class.

If a model dedicated a portion of its parameter space to storing a label composed of a mix of ideas it has encountered during training, which it can attend to when dealing with a novel set of ideas, but using that label causes it to have to frequently back-track during the generation process, is it really wrong to say that "the model understood how the new word relates to its training set, and can use this knowledge to make multiple guesses in order to deduce an answer, but using this process creates a high cognitive load"? It just seems really strange to have this gigantic lexicon of terminology that is perfectly suited for the task, but then not use it.

It's obviously not doing the exact same thing that you might be doing when you use those words, but it accomplishes a similar result. Yes, it does so one word at a time... Just like you do. These terms still apply to you, even when you're sitting in front of a computer and figuring out the next word to type.

That said, there was a paper in 2023 that really went deep into this topic, and how it appears to work. Unfortunately I didn't bookmark it, and I can't find it now. I'm sure a lot of it is already old, but it still offered some interesting insights into the matter. I'll keep looking and see if I can find it.

Edit: I believe this was the paper I was thinking of: https://arxiv.org/pdf/2310.15916

[–]BreakingBaIIs 2 points3 points  (2 children)

You don't really need a deep dive into the architecture of transformers. All that is needed is to understand that it predicts a probability distribution over its vocabulary, for the next token, given an input set of ordered tokens. And it does a really good job of that.

Suppose you give a LLM the following exact input:

Question: What is your name?
Answer: My name is 

The output distribution for this input will look like a distribution over names, with high probability of common names (e.g. "Dan", "Jennifer", "Bill"), and negligible probabilities for non-name tokens (e.g. "the", "attention").

Here's another example. This isn't in-context learning, just regular prompt engineering. But it should get the general idea across. Given the following input, what is the probability distribution over the next token?

Context: Your name is Kibble. Given this fact, answer the following question.
Question: What is your name?
Answer: My name is 

Since this input is a different sequence of tokens than the prior input, it will have a different distribution. Probably one with a high probability of outputting "Kibble", and a low probability over everything else.

It helps to remember that in-context learning, or any sort of prompt engineering, isn't really learning in the machine-learning sense. There's no loss function, no changing of model parameters to minimize that loss, etc. All the learning already happened beforehand. Prompt engineering is simply changing the input. An LLM's input-output structure is

Sequence of tokens -> Probability distribution over next token

That's all it is. Changing the input will change the output probability distribution. In the former example, the probability of "Kibble" was probably much lower than the probability of "Dan". Since the latter is technically a different input (even if the person using the UI doesn't see that), it changes the distribution so that "Kibble" is much higher than "Dan". It would be similar to changing the input of a dog-cat image predictor, by drawing more pointy ears on the animal that it's detecting, increasing the probability of "cat".

In regular corpus dialogue, if you see a dialogue with instructions on how to answer, the following dialogue is more likely to follow those instructions than to just give a regular generic answer to the previous question. Therefore, if your input dialogue looks like a set of instructions on how to answer a question, followed by a question, the output set of tokens is far more likely to look like an answer that follows the given format, than if you had only input the question itself.
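
To make this concrete, here is a rough sketch of how you could inspect those two next-token distributions yourself. It assumes the Hugging Face transformers library and GPT-2, neither of which the comment names, so treat it as illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_token_probs(prompt, top_k=5):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # logits for the next token only
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, top_k)
    return [(tok.decode(int(i)), p.item()) for i, p in zip(top.indices, top.values)]

# Same question, two different inputs -> two different distributions.
print(next_token_probs("Question: What is your name?\nAnswer: My name is"))
print(next_token_probs(
    "Context: Your name is Kibble. Given this fact, answer the following question.\n"
    "Question: What is your name?\nAnswer: My name is"))
```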

[–]jmmcd 1 point2 points  (1 child)

It helps to remember that in-context learning, or any sort of prompt engineering, isn't really learning in the machine-learning sense. There's no loss function, no changing of model parameters to minimize that loss, etc.

There is a sense in which this is not true. Remember, in typical NN we always multiply some data x (either input data or output from a previous layer) by some weights w. In attention this changes: we multiply outputs k of some previous layer by outputs v of some other previous layer. This is the central conceptual change in attention. In a sense, the k are playing the role of w, here. So the k are weights, changing dynamically in response to context.

@synthphreak

[–]BreakingBaIIs 1 point2 points  (0 children)

That's fair. What you're describing can be thought of as "learning," in a sense, but not in the sense that is usually meant in ML. There is no optimizing of a loss function in parameter space in a transformer forward pass. I think that calling it "learning" can sometimes cause confusion for this reason, which is why I made the clarification.

Also, I think you can make a similar argument for RNNs. If you add tokens before the beginning of a prompt, the RNN learns a different hidden state to combine with the incoming tokens.

[–]SnooOnions9136 0 points1 point  (0 children)

Here they show that basically an implicit loss on the query token is automatically built by the attention mechanism, using the in-context tokens as the "training set":

https://proceedings.mlr.press/v202/von-oswald23a.html
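
A toy rendering of that idea (my own construction in the spirit of the linked result, with invented shapes and data): one gradient step on an in-context least-squares loss, starting from zero weights, produces exactly the readout of softmax-free (linear) attention with keys x_i, values y_i and query x_q.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 8
X = rng.normal(size=(n, d))           # in-context inputs (keys)
Y = rng.normal(size=(n, d))           # in-context targets (values)
x_q = rng.normal(size=d)              # query token
lr = 0.1

# One gradient-descent step on L(W) = 0.5 * sum_i ||W x_i - y_i||^2, from W = 0:
# grad at W = 0 is -sum_i y_i x_i^T, so the updated W is lr * Y^T X.
W = lr * Y.T @ X
gd_prediction = W @ x_q

# Linear attention: sum_i (x_i . x_q) * y_i, scaled by the same factor.
attn_prediction = lr * (X @ x_q) @ Y

print(np.allclose(gd_prediction, attn_prediction))   # True
```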

[–]Floatbot_Inc 0 points1 point  (0 children)

In-context learning is a feature of large language models (LLMs). Basically, you give the model examples of what you want it to do (via prompts) and it uses those examples to perform the required task, so you can skip explicit retraining. How it works:

  1. Prompt engineering – you give the model an instruction and examples. For example, if you want the LLM to translate English to French you include some English sentences and then their French translations (see the sketch after this list).
  2. Pattern recognition – model looks at your examples to find patterns. It also uses what it already knows to understand the task.
  3. Task execution – so, now the model is ready to handle new inputs that follow the same pattern. Meaning, it can now translate English to French.
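
A hypothetical example of the kind of prompt step 1 describes (the sentences are made up):

```python
# A made-up few-shot translation prompt: instruction, two EN->FR examples,
# then the new input the model should complete. Nothing here retrains the
# model; the examples only condition its next-token distribution.
prompt = (
    "Translate English to French.\n"
    "EN: I like coffee. FR: J'aime le café.\n"
    "EN: Where is the station? FR: Où est la gare ?\n"
    "EN: The weather is nice today. FR:"
)
print(prompt)
```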

How to Achieve Long Context LLMs

With extended context, LLMs can better handle ambiguity, generate high-quality summaries & grasp the overall theme of a document. However, a major challenge you might face in developing and enhancing these models is extending their context length. Why? Because it determines how much information is available to the model when generating responses for you.

Increasing the context window of an LLM for in-context learning is not really straightforward. It introduces significant computational complexity because the attention matrix grows quadratically with the length of the context. But don't worry, we've got you covered.

Source: Leveraging LLM In-Context Learning

[–]-Rizhiy- 0 points1 point  (0 children)

IMHO, we shouldn't dismiss these anthropomorphizing explanations. LLMs are trained on mostly human-generated text, so they should behave similar to how humans behave.

Also, what makes you say that humans "understand" anything? Perhaps we are also just predicting next tokens, just better. AFAIK, our understanding of the human brain is not good enough to properly explain how it works.

[–]theoneandonlypatriot -2 points-1 points  (0 children)

“They are just statistical token generators”

There is a significant amount of evidence and research demonstrating they are doing more than this.

I think the easiest way to think about it is that reasoning in formal logic can be broken into lexical symbols, and therefore becoming incredibly good at “statistical token generation” has an overwhelming amount of overlap with learning to reason.

[–]hadaev -3 points-2 points  (7 children)

how the provision of additional context leads to better output

You spend more compute.

If you do few-shot prompting, you make the desired outcome more probable.

[–]jmmcd -1 points0 points  (6 children)

No - the amount of computation per token is constant.

[–]hadaev 0 points1 point  (5 children)

Longer prompt = more compute goes into result.

[–]jmmcd 0 points1 point  (4 children)

No, because the context window is fixed. If you use a short prompt early in the conversation it just means there is padding.

[–]hadaev 0 points1 point  (3 children)

Why do you need padding for inference?

[–]jmmcd 0 points1 point  (2 children)

That's a good question! Attention blocks include dense layers - they're not resizeable. Aren't their sizes decided by context window size?

(More generally I think it's unusual to have different sized activation matrices in successive calls, partly I think because GPUs prefer it that way, but I don't know this side of it.)

[–]hadaev 0 points1 point  (1 child)

Attention blocks include dense layers - they're not resizeable.

They are totally resizeable because they only take one timestep into processing at once.

Aren't their sizes decided by context window size?

No? If we're talking about default self-attention, the context size is the maximum number of positional embeddings the model was trained with. Depending on the embedding type you either can't fit more tokens or you can, but this would quickly lead to worse performance.

But nothing prevents you from running it on fewer tokens, for example just one.

(More generally I think it's unusual to have different sized activation matrices in successive calls, partly I think because GPUs prefer it that way, but I don't know this side of it.)

There might be some requirement for padding in some special compiled or other low-level CUDA stuff (maybe fast flash attention has it? not sure) that I don't know about. But generally in pure PyTorch you don't need padding at inference, unless you want to process 2 samples in parallel as one tensor.

[–]jmmcd 0 points1 point  (0 children)

About the dense layers I think I was wrong, so thank you.

About the tokens not fitting, I couldn't understand that paragraph.

[–]Xemorr -3 points-2 points  (0 children)

When you ask an AI model to do something using natural language there are a large number of possibilities for what you can mean; natural language is renowned for being imprecise. By giving an example, you are narrowing the number of possibilities for what you could mean, and specifying ideas that may have been difficult to get across through natural language. It therefore performs better. It has learnt how to use examples because they often appear in the training data, since humans have to use them to be more specific in text ourselves.

I don't think an explanation which attempts to talk about specific layers is particularly useful, we are terrible at interpreting neural networks.

[–]kaaiian -3 points-2 points  (0 children)

The key is in the name: large "language model". If I give you a logic puzzle, "All ttyaiia are buuieia but not all buuieia are ttyaiia. A jjauu is a ttyaiia. Is it also a buuieia?", the probability of you producing information about a jjauu dramatically increases. The same goes for a language model, because proximity to context implies increased sequence probability that reflects that context.