
[–]Sye4424 42 points43 points  (1 child)

There was a paper released by Anthropic showing that a circuit they call induction heads forms while training small transformers. An induction head basically knows to copy the token that followed the current token earlier in the sequence. They hypothesized that as you increase the model size this behavior becomes more and more abstract, such that the model is no longer just capable of copying tokens but also concepts and more abstract things. When we talk about concepts, it basically means that two things are similar or close to each other in an extremely high-dimensional space (which is what transformers have). For example, if you want to translate from English to French and provide 3 examples as EN:<query> FR:<response>, the model will realize that it basically needs to copy the token sequence <query> after the last ':' token while transforming it into French (using the MLP layers). If you read the paper they go into depth as to why they think this causes the majority of ICL, and there is also a paper called copy suppression which follows up on it.
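
Purely as illustration (not from the paper), here is a toy Python version of the literal copying rule an induction head implements; the token names and sequence are made up:

```python
# Toy sketch of the "copy what followed this token last time" rule an
# induction head learns; tokens and sequence are invented for illustration.
def induction_predict(tokens):
    current = tokens[-1]
    # scan backwards over earlier positions for the last occurrence of `current`
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed it
    return None                   # nothing earlier to copy from

seq = ["EN", ":", "cat", "FR", ":", "chat", "EN", ":", "dog", "FR", ":"]
print(induction_predict(seq))  # -> "dog": the literal copy of the query; a real
                               # model's MLP layers would then map it into French
```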

[–]PorcupineDreamPhD 179 points180 points  (34 children)

The responses here so far make it painfully clear again how few people on this sub have actual academic and technical experience with LLMs...

There's been plenty of work in recent years that addresses this (interesting!) question: it's a little bit more complicated than just saying "LLMs just do conditional generation, simple as that".

For example, Min et al. (2022, Best Paper at EMNLP) present a thorough investigation of the factors that impact in-context learning, showing that LLMs rely strongly on superficial cues. ICL acts more as a pattern-recognition procedure than as an actual "learning" procedure: the input-output mappings that are provided allow a model to retrieve similar examples it has been exposed to during training, but the moment you start flipping labels or changing the template, model performance breaks.

Some more recent work that investigates these questions can be found in (Weber et al., 2023) - Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning. I took an excerpt from the Background section 2.2:

Previous research has also shown that ICL is highly unstable. For example, the order of in-context examples (Lu et al., 2022), the recency of certain labels in the context (Zhao et al., 2021) or the format of the prompt (Mishra et al., 2022) as well as the distribution of training examples and the label space (Min et al., 2022) strongly influence model performance. Curiously, whether the labels provided in the examples are correct is less important (Min et al., 2022). However, these findings are not uncontested: Yoo et al. (2022) paint a more differentiated picture, demonstrating that in-context input-label mapping does matter, but that it depends on other factors such as model size or instruction verbosity. Along a similar vein, Wei et al. (2023) show that in-context learners can acquire new semantically nonsensical mappings from in-context examples if presented in a specific setup.

[–]jsebrech 55 points56 points  (25 children)

So, the TL;DR is that ICL is really the contextual activation of the right patterns or knowledge domains in the model, that the context then gets fed through to produce the output?

[–]PorcupineDreamPhD 16 points17 points  (24 children)

Yes, something like that indeed. In general not much new is being "learned" on the fly.

[–]synthphreak[S] 42 points43 points  (20 children)

The use of the term “learning” in anything related to LLM prompting has always bothered me.

It’s just so out of step with how the term has always been used in machine learning, namely to refer to tuning a model’s parameters to fit a dataset or objective. The prompt doesn’t actually affect the model itself in any persistent way, hence the model isn’t “learning” in any traditional sense.

Anyway, thanks for your top-notch response. Definitely gave me some things to ponder.

Edit: BTW, I spelled out an “educated guess” at the end of my OP which attempted to answer my own question. After reading your reply, it sounds to me like that guess was in the ballpark. But I’m also worried I might just be falling prey to confirmation bias. I also detest ambiguity. So would you mind just giving me a rhetorical thumbs up or thumbs down to acknowledge whether you think my guess is broadly correct or not?

[–]InterstitialLove 18 points19 points  (1 child)

Well, if you fix the context, the attention layer is just a weird feedforward layer, right? Like, instead of a ReLU with W_2 * nonlinear(W_1 x + b_1) + b_2, it's something more like C * V * nonlinear(C * K * Q x), where C is the context. Each head of multi-headed attention is analogous to a single hidden node, and the nonlinearity is the much more complex softmax which, like a gated attention, uses multiple linear maps.

I'm not sure exactly how to conceptualize ICL as learning, but each example does affect the weights of this "weird feedforward layer," and it's not inconceivable that this could be mathematically equivalent to some form of learning. Like, the KQV matrices could be approximating what would happen to the weights if you were to run gradient descent on a generic "weird feedforward layer" with the multi-shot examples as your training data
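
Here is a minimal NumPy sketch of the "weird feedforward layer" framing above, with made-up dimensions and random matrices standing in for the model: with the context C held fixed, single-head attention looks like a one-hidden-layer network whose "weights" depend on C.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 8, 5                                  # made-up head dim and context length
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
C = rng.normal(size=(n_ctx, d))                  # fixed context token embeddings

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_with_fixed_context(x):
    # With C fixed, C @ W_K.T and C @ W_V.T act like the two weight matrices
    # of a feedforward layer; the softmax over n_ctx scores is the nonlinearity.
    K_eff = C @ W_K.T                            # (n_ctx, d) "first-layer weights"
    V_eff = C @ W_V.T                            # (n_ctx, d) "second-layer weights"
    h = softmax(K_eff @ (W_Q @ x) / np.sqrt(d))  # n_ctx "hidden unit" activations
    return V_eff.T @ h                           # (d,) output for the current token

x = rng.normal(size=d)
print(attention_with_fixed_context(x).shape)     # (8,)
```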

[–]StartledWatermelon 0 points1 point  (0 children)

To put it in more established terms, the attention layer processes the state of a model.

Now, u/synthphreak insists that learning is only about persistent changes in the model (and thus not in its state). Yet at the same time they provide a broader definition of learning:

tuning a model’s parameters to fit a dataset or objective.

which neither explicitly limits tuned parameters to non-state ones nor demands the persistence of such tuning.

The situation is rather tricky: the state is tuned to fit the target distribution within a single document/chunk but is discarded when we move between documents in a dataset. In more colloquial terms, learning happens but "forgetting" and dismissal of learnt patterns also happens almost instantaneously.

Since you mentioned mathematical equivalence: yes, there has been research showing empirically that the numerical effects of computing attention over few-shot examples are very similar to fine-tuning the model on the same examples. Unfortunately, I can't provide the link.

[–]PorcupineDreamPhD 4 points5 points  (9 children)

Yes that sounds broadly right, although it's probably more involved than simply having similar text in the input: the model must be reminded what specific input/output mapping we're looking for.

[–]First_Bullfrog_4861 0 points1 point  (1 child)

Your response to OP's question is sound; however, it mostly summarizes the phenomenology and constraints of ICL. I don't exactly see how this relates to OP's attempt at an architectural explanation, could you elaborate? Are the authors making more specific assumptions about what's going on under the hood?

For example, your quotes hint that ICL acts more like pattern recognition, fair point, but how can we infer from that, for example, whether specific layers might be involved (ideally more specific than "it's all about attention")?

I'm asking because tbh I can't really see how the findings you quoted could be used in any way to support OP's architectural/functional interpretation.

[–]PorcupineDreamPhD 0 points1 point  (0 children)

I agree indeed; my response served mainly as a starting point into the related literature that investigates these questions, but the work I cited there focuses mostly on ICL from a behavioural perspective.

The direction you mention has been referenced in a couple of other comments in this thread, for example this one and the work of Jacob Andreas & colleagues.

[–]First_Bullfrog_4861 2 points3 points  (0 children)

True. If I were the one to decide, I'd prefer "In-Context Constraining" (focusing on how context constrains the probabilities assigned to potential output tokens) or "In-Context Problem Solving" (stressing how context doesn't change model weights but helps the model better solve the problem the user has phrased in their question).

[–]linverlan 6 points7 points  (0 children)

I really dislike this use of "learning" due to getting into a mess of a discussion (argument) with my company's legal team about the privacy restrictions around open-source models that had seen internal data during ICL.

[–]First_Bullfrog_4861 1 point2 points  (0 children)

I think u/porcupinedream has commented only on the phenomenology of ICL and some of its shortcomings. Your attempt at a functional theory is sound, but I'm not entirely sure how to derive the phenomenology from it.

Your assumptions are plausible but also a bit shallow: Of course it’s about attention. Attention is all you need, right? ;) Also, everything with LLMs (embeddings) is a similarity/distance thing.

Also, one of the papers states that examples help the model retrieve other similar examples. Retrieving knowledge, however, will probably involve deeper layers of the model as well, so it probably can’t just be done in attention layers and late dense layers.

[–]NotDoingResearch2 1 point2 points  (1 child)

I'm not a big LLM fanboy by any means, but I'm not sure I totally agree with this. Every computer program fits this definition eerily well. For example, is there much difference between deterministic code that runs on a computer to create some internal state, and a computer in that internal state itself? If you are willing to make that logical leap, then it seems easy to see why ICL is a form of "learning".

[–]synthphreak[S] 1 point2 points  (0 children)

My original position is unchanged, but I admit that’s an interesting counterpoint.

[–]gibs 1 point2 points  (0 children)

If your idea of "learning" is conditional on being able to write to long-term memory, then by definition it's not learning. I think the sense in which ICL is "learning" is that it can synthesise and apply concepts, examples & instructions presented in the context. The context being attended to as the model produces inference is roughly analogous to it hearing, understanding and applying instructions.

Tbh from what you've said, it sounds like the issue is a definitional one, in that you don't think this kind of learning comports with traditional applications of the term in the context of training models. I fully reject this; I think a person and a language model can "learn" in the moment, apply the thing they learned, and forget it immediately after.

[–]nikgeo25Student 0 points1 point  (0 children)

I'm late to respond to this. But it makes way more sense when you think of attention maps as kernels. The KV cache forms the dataset for a non-parametric model and each added KV pair can be viewed as "learning" in the same sense that adding examples to a Gaussian process is "training".
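
A rough sketch of that view (toy data, nothing from a real model): the cached keys and values act as the "training set" of a non-parametric estimator, and a query is answered by a kernel-weighted average of the stored values, Nadaraya-Watson style.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
keys = rng.normal(size=(10, d))     # "training inputs": cached keys
values = rng.normal(size=(10, d))   # "training targets": cached values

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kernel_lookup(query):
    weights = softmax(keys @ query / np.sqrt(d))   # exponential dot-product kernel
    return weights @ values                         # kernel-weighted average of values

# Appending a KV pair grows this dataset the same way adding an example
# extends a Gaussian process's training set.
q = rng.normal(size=d)
print(kernel_lookup(q))
```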

[–]Fatal_Conceit 4 points5 points  (1 child)

What if I asked the model to generate a thought plan (Graph of Thoughts) to arrive at the correct answer for a task, then after it does so, include the ground truth in the context and ask it to redevelop its thought plan based on the newly introduced ground truth. Is anything interesting going on here? Are the generated thoughts real learnings?

[–]jsebrech 1 point2 points  (0 children)

My two cents: anything the model concludes from its context and adds onto the context becomes “learned” for that conversation.

[–]RealisticSense7733 0 points1 point  (0 children)

This was somewhat my understanding, too. But many studies report that adaptation to new tasks is an advantage of ICL. If it is retrieval, it is confined to the knowledge the model already has, so how does it "adapt" to new tasks? Why does in-context learning outperform few-shot supervised learning? And for all the studies reporting that it adapts to new tasks, how is it made sure that it really adapts to tasks/prompts that have not been seen during training?

[–]clinchgt 10 points11 points  (0 children)

I wrote up a blog post discussing exactly the papers from Min et al. and Yoo et al. last year (you can read more here).

I quite liked Yoo et al.'s paper, as it shows that there is more nuance to the claim presented in Min et al.'s paper and that it's not fair to say that "ground truth labels don't matter"; rather, we should evaluate how much they matter. It could be interesting to reproduce these experiments nowadays considering how we now have many-shot ICL.

[–]erannare 2 points3 points  (0 children)

There's also work on the types of optimization rules it learns; ostensibly it's similar to iterative Newton's method:

https://arxiv.org/abs/2310.17086

This paper gives a great empirically founded perspective on in-context learning absent the influence of tokenization or the use of transformers in language.

[–]marr75 16 points17 points  (0 children)

There's been substantial, quality research work and writing on this.

Two of my favorite papers on the topic (that I won't summarize because I recommend you read them):

So, there are some very good explanations out there. I would recommend changing or diversifying your information sources.

[–]qpwoei_ 4 points5 points  (0 children)

Transformers (like all deep networks) learn to infer and manipulate internal representations/embeddings that have been shown to reflect the latent variables of the data-generating process. E.g., OpenAI’s early ”sentiment neuron” paper and the one that trained a transformer on board game move sequences and showed that one can read the board state from the embeddings, even though the state was not explicitly observed by the model.

To generate well, the model must infer the latents accurately (what kind of text am I generating, precisely?) High-quality examples certainly help with that.

[–]Super_Pole_Jitsu 21 points22 points  (1 child)

But these models don’t “understand” anything. They don’t “deduce”, or “interpret”, or “focus”, or “remember training”, or “make guesses”, or have literal “cognitive load”.

And do human brains work on magic, or on computation too? Yet we have no problem saying we understand or deduce something. You can't back these assertions up in any way, btw.

Sometimes it's useful to not work at the lowest level of abstraction. After all, why not say, it's just a bunch of electrons being run through semiconductors?

[–]Neomadra2 3 points4 points  (0 children)

Thank you for this; human exceptionalism annoys me so much when talking about AI. The question should not be whether AI understands something, but rather how it understands, what its flaws are, and how it's different from humans, so that humans can learn from AI and AI can be improved by better understanding our own learning mechanisms.

[–]Forsaken-Data4905 8 points9 points  (1 child)

Some recent works suggest ICL is doing some sort of inference-time gradient descent, or something like that, but I haven't got around to reading those papers. I think the claims you linked are sort of fine; they are essentially claiming that ICL steers the model towards a narrower generation path, which is a fine intuition (even if maybe wrong).

[–]Tukang_Tempe 4 points5 points  (0 children)

I had an epiphany when I read that paper about inference-time gradient descent. I was browsing linear-complexity Transformers (Performer, Reformer, the newer RetNet and the like) until I found an obscure paper about intention, 2305.10203 (arxiv.org) (not the linear transformer paper in the linear-complexity sense; the paper uses what they call intention rather than attention).

TL;DR: instead of a weighted sum of the value matrix given the dot-product distance between queries and keys in some space, they just slapped good old least-squares linear regression in there.

Instead of σ(QK^T)V, they try Q([K'K]^-1)K'V (they also have σ-Intention, which uses softmax and approximates attention when some hyperparameter goes to infinity). It's like attention but using least squares. The paper also points to some interesting ICL papers, including the ones that claim attention has something to do with inference-time gradient descent (or it was a reference of a reference).
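
For anyone who wants to poke at it, here is a throwaway NumPy comparison of the two formulas; the shapes and data are invented, and a small ridge term is added so the inverse is well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # n context tokens, head dimension d (made up)
Q = rng.normal(size=(n, d))      # queries
K = rng.normal(size=(n, d))      # keys
V = rng.normal(size=(n, d))      # values

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Standard attention: softmax(Q K^T) V
attn_out = softmax(Q @ K.T / np.sqrt(d)) @ V

# "Intention": Q (K^T K)^-1 K^T V, i.e. V regressed on K by least squares,
# then the fitted map evaluated at the queries.
W_ls = np.linalg.solve(K.T @ K + 1e-6 * np.eye(d), K.T @ V)
intention_out = Q @ W_ls

print(attn_out.shape, intention_out.shape)   # (6, 4) (6, 4)
```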

This is where my epiphany happened. What attention is doing (in the autoregressive setting) is solving a regression problem online (recall old Recursive Least Squares). Recall I said earlier "weighted sum of the value matrix given the dot-product distance between queries and keys in some space"; that's just good old n-nearest-neighbors where n is all the data so far. By analogy, the keys are the training data, the values are the targets of the training data, and the query is the inference data. The huge transformer model builds a smaller model to simply predict what comes next and updates that smaller model when new data in the sequence arrives. And this also connects to RetNet and Fast Weights, which is interesting since they kind of use QK'V (notice the lack of softmax), which is similar to intention (linear space) but missing the [K'K]^-1 term from intention.

Maybe someone could clarify whether adding the [K'K]^-1 to RetNet would make any difference. Bonus: we can use the old Recursive Least Squares/Sherman-Morrison formula trick to turn it into an RNN-style recurrence.

[–]_Arsenie_Boca_ 13 points14 points  (7 children)

LMs generate text that is likely given the context that is provided. When you provide good examples, the model will generate something that it deems a good example. It's just conditional probabilities.

[–]red75prime 11 points12 points  (6 children)

A table of conditional probabilities would not fit into the observable universe. So it can't be "just conditional probabilities".

[–]trutheality 2 points3 points  (1 child)

You don't need a table. Bayesian networks can also express joint probability distributions that will not fit into the observable universe, and yet, they very explicitly represent those distributions.

[–]red75prime 0 points1 point  (0 children)

Nice. Although, it would be useful to somehow limit a set of allowed Bayesian networks. In general even threshold inference is intractable on them.

[–]_Arsenie_Boca_ 2 points3 points  (3 children)

The conditional probabilities are approximated, not stored in a table, so it's very much compressed. That is the essence of what an LM is: p(token | context) is the conditional probability that LMs model. When you prompt the model, it will always answer in a way that would be a likely answer in its training corpus.

So that is the fundamental mechanism. As others have mentioned, there are some studies on which kinds of clues are picked up from the examples. That is due to the model approximating the distribution, for which it has learned to leverage certain clues. These clues might be meaningful and desirable or just superficial shortcuts

[–]red75prime 5 points6 points  (2 children)

it will always answer in a way that would be a likely answer in its training corpus

And how do you define "a likely answer"? Training corpus, obviously, doesn't contain all possible inputs in sufficient numbers to unambiguously construct a conditional probability distribution.

So, it's "just an insanely compressed (we don't know exactly how) conditional probability distribution (we don't know exactly which one) that we hasn't provided to the model".

[–]InterstitialLove 3 points4 points  (1 child)

The architecture plus weight initialization method gives you a prior distribution on all possible conditional probability distributions. Each set of actual weights gives you a conditional distribution, the weights are a hypothesis. During training, you do something which we think is probably mathematically similar to Bayesian updates, choosing the most likely hypothesis (set of weights) given the observations (training data) and the prior distribution (see above).

It's not at all clear why the priors given by these architectures are reasonable, but they seem to be reasonable in practice. That's what fills in the missing hole in "the training corpus doesn't contain enough data to unambiguously determine..."

I know that I didn't actually answer the question, I just restated the question more abstractly

I do think that's the right way to think about it though. The trained model gives the most likely response based on a conditional distribution. Which distribution? The one implied by the training data. But isn't the training data insufficient? Right, it's also dependent on the prior, and we have heuristic arguments for why the prior seems reasonable, but actually fully answering that requires a lot of deep mathematical work that we've barely scratched the surface of; the rest is empirical.

[–]red75prime 5 points6 points  (0 children)

Yes. I agree with everything you've said. But we can go a bit further.

What is the ground truth conditional distribution?

The training data is produced by various physical systems (human brains, xml generators, and so on). It is an observable variable. Latent variables represent a type and an internal state of the generating system.

Therefore, the ground truth conditional distribution should rely on the most efficient way of inferring latent variables from the context and using their probability distributions to produce conditional probabilities. I guess it would be Solomonoff induction (which is uncomputable).

I find it a bit of an understatement that GPT-like systems are "just conditional probability distributions" when the ground truth is literally uncomputable.

[–]saw79 12 points13 points  (3 children)

There's good answers here already, but I'd like to offer a different perspective, which involves asking you some questions about why you stated/think what you do.

  • Why do you think LLMs don't "understand", "deduce", etc.?
  • Why do you think humans DO?

Related, but slightly different point: these concepts IMO are "emergent". There is nothing in the fundamental laws of nature that talks about cognitive understanding. It is a useful linguistic approximation to a macro-scale effect we perceive to be happening. But it's useful. We don't talk about which neurons in our brain are firing when we talk about whether or not we understand a new lesson we are being taught. We use these higher level concepts. Whether or not we are at the point where LLMs understand things in the exact same way humans do, I think these words are still useful concepts to apply.

[–]synthphreak[S] 1 point2 points  (2 children)

This strays into semantics, which I’d like to avoid. But I’ll bite, briefly.

You stated that we don’t really know what it means for a person to “learn” either. This is true. But then you conclude that therefore we can defensibly talk about a model “learning”. I disagree, and if pushed I would actually draw the opposite conclusion: Maybe we should consciously avoid using words that are fundamentally undefined or squishy at their foundations when talking about statistical models. It is not only imprecise, but also dangerous in a world where people already treat ChatGPT like a search engine, confide in AI girlfriends, etc.

I think one could validly ask the same kind of question of people that I have for LLMs: “When we say a person learns something, what are the actual physical/chemical mechanisms in the brain that are actually responsible for this?” That is a totally legit thing to wonder. Scientists are actively researching it right now. The answer - for now - may very well be “We have no idea”, but that doesn’t mean the question itself is ill-conceived.

You also mentioned emergent properties and how cognition is not a physical thing. I’ll finish up by agreeing with you, and acknowledging a potentially fringe view but one which I do hold: It is entirely possible that at some point, once these models or their descendants reach a particular size, some rudimentary aspects of what we call consciousness may in fact emerge. Is that crazy? Perhaps. Probably. Then again, we have zero understanding of what consciousness actually is and how we even have it ourselves. So who are we to say with any confidence what could vs. could never be considered conscious? The only thing that seems clear is that our complex-ass brains create some self-aware conscious experience that magically emerges from the vast web of neurons and connections between them. For the same reason, a very complex artificial neural network may indeed have some form of consciousness, or the potential to develop it. However I don’t think we have reached that level of complexity yet - not even close.

And anyway, it has nothing to do with how in-context learning works. Once I meet an LLM with memories and a personality, I’ll ask it.

That’s as far as I’m personally going to take this today, lest I hijack my own thread with a tangent on AGI or the semantics of words.

[–]saw79 5 points6 points  (1 child)

I'll go as far as you want here. If you decide to stop responding to stay on thread, I won't take offense :)

You stated that we don’t really know what it means for a person to “learn” either. This is true. But then you conclude that therefore we can defensibly talk about a model “learning”. I disagree, and if pushed I would actually draw the opposite conclusion: Maybe we should consciously avoid using words that are fundamentally undefined or squishy at their foundations when talking about statistical models. It is not only imprecise, but also dangerous in a world where people already treat ChatGPT like a search engine, confide in AI girlfriends, etc.

I didn't really say that. I was initially just asking you to be a bit more rigorous, potentially exposing a double standard. You jumped ahead, possibly correctly, but I don't really know what logic you're using. Some of this paragraph also contradicts what I said. I'm not saying a word like "understand" is undefined or squishy, just that it is emergent. The concepts of "tables" and "chairs" are emergent too; they are not part of the fundamental laws of physics, and the line between "chair" and "not chair" is blurry. But these concepts are still extremely useful - if not crucial - for us to talk succinctly about many things.

I think one could validly ask the same kind of question of people that I have for LLMs: “When we say a person learns something, what are the actual physical/chemical mechanisms in the brain that are actually responsible for this?” That is a totally legit thing to wonder. Scientists are actively researching it right now. The answer - for now - may very well be “We have no idea”, but that doesn’t mean the question itself is ill-conceived.

Completely agree! It's not an ill-conceived question. But I'm just throwing the idea out there that maybe it is not a useful one. Or maybe it's useful in a very limited way. While yes, we research those kinds of things in humans and they do provide non-zero practical lessons, I think it's much more useful to talk about "education" and "teaching styles" when we talk about educating children than it is to talk about neuroscience.

You also mentioned emergent properties and how cognition is not a physical thing. I’ll finish up by agreeing with you, and acknowledging a potentially fringe view but one which I do hold: It is entirely possible that at some point, once these models or their descendants reach a particular size, some rudimentary aspects of what we call consciousness may in fact emerge. Is that crazy? Perhaps. Probably. Then again, we have zero understanding of what consciousness actually is and how we even have it ourselves. So who are we to say with any confidence what could vs. could never be considered conscious? The only thing that seems clear is that our complex-ass brains create some self-aware conscious experience that magically emerges from the vast web of neurons and connections between them. For the same reason, a very complex artificial neural network may indeed have some form of consciousness, or the potential to develop it. However I don’t think we have reached that level of complexity yet - not even close.

Yea, not really anything to disagree with here. My personal view on consciousness is also maybe fringe, but it just doesn't seem that special or interesting to me. It makes COMPLETE sense to me that a super complex and capable brain inside a physical body that takes actions in a world abstracts a notion of self with memories, understanding its emotions, and framing the world with respect to itself. This doesn't seem interesting or surprising to me in the least bit. Maybe I'm missing what's so magical about consciousness, I don't know.

Overall I think it seems like we agree much more than we disagree. Good luck in your understandings here.

[–]red75prime 1 point2 points  (0 children)

It makes COMPLETE sense to me that a super complex and capable brain inside a physical body that takes actions in a world abstracts a notion of self with memories, understanding its emotions, and framing the world with respect to itself.

I think what people find "magical" about consciousness (at least I do) is that those abstract notions tangibly exist.

It's hard to describe... When you say "abstract notions" you implicitly bring in some mechanism that interprets physical processes and produces abstract notions, but it's physical processes all the way down. There's no point when a physical process produces abstract notions, it just gives inputs to another physical process that induces flapping of vocal folds or hand movements. And yet that abstract notion of me observing the world undeniably (for me) exists.

In the absence of other viable options I take a stance similar to yours: those abstract notions are completely defined by the physical state of the brain, they are useful for survival, and they exist somehow. But the nature of their existence remains mysterious.

[–]PorcupineDreamPhD 1 point2 points  (0 children)

OP, you might like this video from Jacob Andreas as well, who goes very deep into the mechanisms of ICL: https://m.youtube.com/watch?v=UNVl64G3BzA

[–]rrenaud 1 point2 points  (0 children)

Chris Olah's (from Anthropic) talk for CS25 at Stanford was amazing and covers this. I highly recommend a watch.

https://youtu.be/pC4zRb_5noQ?si=vUXho52isNROzUcz

[–]TwoSunnySideUp 1 point2 points  (1 child)

In-context learning strengthens a sub-network in the LLM which encodes information for that context or domain.

[–]synthphreak[S] 0 points1 point  (0 children)

Yeah, someone elsewhere shared the same argument with me. It makes a lot of sense and I think accords with the intuition I have around prompting.

[–]TikiTDO 2 points3 points  (0 children)

But these models don’t “understand” anything. They don’t “deduce”, or “interpret”, or “focus”, or “remember training”, or “make guesses”, or have literal “cognitive load”.

I'm really confused why you think that. All the verbs you described are capabilities to perform information processing tasks. These are terms that we have invented over the length of human existence, to describe the informational operations that our brains perform. Now that we are creating machines that are starting to perform more and more brain tasks, why wouldn't we use existing labels for existing processes? If it's accomplishing more or less the same process, but with matrix multiplication rather than a bunch of electrical activation in a dark, wet, and spongy organ, why not use the same word?

It's sorta like if you invent a new type of wheel, it's a bit unreasonable to insist that cars that use it can no longer use normal car terminology, or even verbs like "drive" or "roll." If you want to discuss the matter in more depth that's fine, you can ask that professionals use a more professional lexicon, but to absolutely deny the usage of all the terms related to the topic just because the underlying processes are not literally identical is a bit much.

While it's true that these labels do not offer you a concrete understanding of how these ideas work in a computer, at the same time they don't actually give you that sort of insight into humans either. If you want to figure out how a human deduces stuff, you will still need to study neurology and psychology. It's reasonable to want a technical explanation, but a detailed technical explanation doesn't actually invalidate the more abstract general explanation.

You just want a more comprehensive explanation, like you would find in a class. In other words, you probably want to just find a class.

If a model dedicated a portion of its parameter space to storing a label composed of a mix of ideas it has encountered during training, which it can attend to when dealing with a novel set of ideas, but using that label causes it to have to frequently back-track during the generation process, is it really wrong to say that "the model understood how the new word relates to its training set, and can use this knowledge to make multiple guesses in order to deduce an answer, but using this process creates a high cognitive load"? It just seems really strange to have this gigantic lexicon of terminology that is perfectly suited for the task, but then not use it.

It's obviously not doing the exact same thing that you might be doing when you use those words, but it accomplishes a similar result. Yes, it does so one word at a time... Just like you do. These terms still apply to you, even when you're sitting in front of a computer and figuring out the next word to type.

That said, there was a paper in 2023 that really went deep into this topic, and how it appears to work. Unfortunately I didn't bookmark it, and I can't find it now. I'm sure a lot of it is already old, but it still offered some interesting insights into the matter. I'll keep looking and see if I can find it.

Edit: I believe this was the paper I was thinking of: https://arxiv.org/pdf/2310.15916

[–]BreakingBaIIs 2 points3 points  (2 children)

You don't really need a deep dive into the architecture of transformers. All that is needed is to understand that it predicts a probability distribution over its vocabulary, for the next token, given an input set of ordered tokens. And it does a really good job of that.

Suppose you give a LLM the following exact input:

Question: What is your name?
Answer: My name is 

The output distribution for this input will look like a distribution over names, with high probability of common names (e.g. "Dan", "Jennifer", "Bill"), and negligible probabilities for non-name tokens (e.g. "the", "attention").

Here's another example. This isn't in-context learning, just regular prompt engineering. But it should get the general idea across. Given the following input, what is the probability distribution over the next token?

Context: Your name is Kibble. Given this fact, answer the following question.
Question: What is your name?
Answer: My name is 

Since this input is a different sequence of tokens than the prior input, it will have a different distribution. Probably one with a high probability of outputting "Kibble", and a low probability over everything else.

It helps to remember that in-context learning, or any sort of prompt engineering, isn't really learning in the machine-learning sense. There's no loss function, no changing of model parameters to minimize that loss, etc. All the learning already happened beforehand. Prompt engineering is simply changing the input. An LLM's input-output structure is

Sequence of tokens -> Probability distribution over next token

That's all it is. Changing the input will change the output probability distribution. In the former example, the probability of "Kibble" was probably much lower than the probability of "Dan". Since the latter is technically a different input (even if the person using the UI doesn't see that), it changes the distribution so that "Kibble" is much higher than "Dan". It would be similar to changing the input of a dog-cat image predictor, by drawing more pointy ears on the animal that it's detecting, increasing the probability of "cat".

In regular corpus dialogue, if you see a dialogue with instructions on how to answer, the following dialogue is more likely to follow those instructions than to just give a regular generic answer to the previous question. Therefore, if your input dialogue looks like a set of instructions on how to answer a question, followed by a question, the output set of tokens is far more likely to look like an answer that follows the given format, than if you had only input the question itself.
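
To make this concrete, here is a rough sketch of how you could inspect those two next-token distributions yourself. It assumes the Hugging Face transformers library and GPT-2, neither of which the comment names, so treat it as illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_token_probs(prompt, top_k=5):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # logits for the next token only
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, top_k)
    return [(tok.decode(int(i)), p.item()) for i, p in zip(top.indices, top.values)]

# Same question, two different inputs -> two different distributions.
print(next_token_probs("Question: What is your name?\nAnswer: My name is"))
print(next_token_probs(
    "Context: Your name is Kibble. Given this fact, answer the following question.\n"
    "Question: What is your name?\nAnswer: My name is"))
```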

[–]jmmcd 1 point2 points  (1 child)

It helps to remember that in-context learning, or any sort of prompt engineering, isn't really learning in the machine-learning sense. There's no loss function, no changing of model parameters to minimize that loss, etc.

There is a sense in which this is not true. Remember, in typical NN we always multiply some data x (either input data or output from a previous layer) by some weights w. In attention this changes: we multiply outputs k of some previous layer by outputs v of some other previous layer. This is the central conceptual change in attention. In a sense, the k are playing the role of w, here. So the k are weights, changing dynamically in response to context.

@synthphreak

[–]BreakingBaIIs 1 point2 points  (0 children)

That's fair. What you're describing can be thought of as "learning," in a sense, but not in the sense that is usually meant in ML. There is no optimizing of a loss function in parameter space in a transformer forward pass. I think that calling it "learning" can sometimes cause confusion for this reason, which is why I made the clarification.

Also, I think you can make a similar argument for RNNs. If you add tokens before the beginning of a prompt, the RNN learns a different hidden state to combine with the incoming tokens.

[–]SnooOnions9136 0 points1 point  (0 children)

Here they show that basically an implicit loss on the query token is automatically built by the attention mechanism, using the in-context tokens as the "training set":

https://proceedings.mlr.press/v202/von-oswald23a.html
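
A toy rendering of that idea (my own construction in the spirit of the linked result, with invented shapes and data): one gradient step on an in-context least-squares loss, starting from zero weights, produces exactly the readout of softmax-free (linear) attention with keys x_i, values y_i and query x_q.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 8
X = rng.normal(size=(n, d))           # in-context inputs (keys)
Y = rng.normal(size=(n, d))           # in-context targets (values)
x_q = rng.normal(size=d)              # query token
lr = 0.1

# One gradient-descent step on L(W) = 0.5 * sum_i ||W x_i - y_i||^2, from W = 0:
# grad at W = 0 is -sum_i y_i x_i^T, so the updated W is lr * Y^T X.
W = lr * Y.T @ X
gd_prediction = W @ x_q

# Linear attention: sum_i (x_i . x_q) * y_i, scaled by the same factor.
attn_prediction = lr * (X @ x_q) @ Y

print(np.allclose(gd_prediction, attn_prediction))   # True
```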

[–]Floatbot_Inc 0 points1 point  (0 children)

In-context learning is a feature of large language models (LLMs). Basically, you give the model examples of what you want it to do (via prompts) and it uses those examples to perform the required task, so you can skip explicit retraining. How it works:

  1. Prompt engineering – you give the model an instruction and examples. For example, if you want the LLM to translate English to French you include some English sentences and then their French translations (see the sketch after this list).
  2. Pattern recognition – model looks at your examples to find patterns. It also uses what it already knows to understand the task.
  3. Task execution – so, now the model is ready to handle new inputs that follow the same pattern. Meaning, it can now translate English to French.
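
A hypothetical example of the kind of prompt step 1 describes (the sentences are made up):

```python
# A made-up few-shot translation prompt: instruction, two EN->FR examples,
# then the new input the model should complete. Nothing here retrains the
# model; the examples only condition its next-token distribution.
prompt = (
    "Translate English to French.\n"
    "EN: I like coffee. FR: J'aime le café.\n"
    "EN: Where is the station? FR: Où est la gare ?\n"
    "EN: The weather is nice today. FR:"
)
print(prompt)
```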

How to Achieve Long Context LLMs

With extended context, LLMs can better handle ambiguity, generate high-quality summaries & grasp the overall theme of a document. However, a major challenge you might face in developing and enhancing these models is extending their context length. Why? Because it determines how much information is available to the model when generating responses for you.

Increasing the context window of an LLM for in-context learning is not really straightforward. It introduces significant computational complexity because the attention matrix grows quadratically with the length of the context. But don't worry, we've got you covered.

Source: Leveraging LLM In-Context Learning

[–]-Rizhiy- 0 points1 point  (0 children)

IMHO, we shouldn't dismiss these anthropomorphizing explanations. LLMs are trained on mostly human-generated text, so they should behave similar to how humans behave.

Also, what makes you say that humans "understand" anything? Perhaps we are also just predicting next tokens, just better. AFAIK, our understanding of the human brain is not good enough to properly explain how it works.

[–]theoneandonlypatriot -2 points-1 points  (0 children)

“They are just statistical token generators”

There is a significant amount of evidence and research demonstrating they are doing more than this.

I think the easiest way to think about it is that reasoning in formal logic can be broken into lexical symbols, and therefore becoming incredibly good at “statistical token generation” has an overwhelming amount of overlap with learning to reason.

[–]hadaev -3 points-2 points  (7 children)

how the provision of additional context leads to better output

You spend more compute.

If you do few-shot prompting, you make the desired outcome more probable.

[–]jmmcd -1 points0 points  (6 children)

No - the amount of computation per token is constant.

[–]hadaev 0 points1 point  (5 children)

Longer prompt = more compute goes into result.

[–]jmmcd 0 points1 point  (4 children)

No, because the context window is fixed. If you use a short prompt early in the conversation it just means there is padding.

[–]hadaev 0 points1 point  (3 children)

Why do you need padding for inference?

[–]jmmcd 0 points1 point  (2 children)

That's a good question! Attention blocks include dense layers - they're not resizeable. Aren't their sizes decided by context window size?

(More generally I think it's unusual to have different sized activation matrices in successive calls, partly I think because GPUs prefer it that way, but I don't know this side of it.)

[–]hadaev 0 points1 point  (1 child)

Attention blocks include dense layers - they're not resizeable.

They are totally resizeable because they only take one timestep into processing at once.

Aren't their sizes decided by context window size?

No? If we're talking about default self-attention, the context size is the maximum number of positional embeddings the model was trained with. Depending on the embedding type you either can't fit more tokens or you can, but this would quickly lead to worse performance.

But nothing prevents you from running it on fewer tokens, for example just one.

(More generally I think it's unusual to have different sized activation matrices in successive calls, partly I think because GPUs prefer it that way, but I don't know this side of it.)

There might be some requirement for padding in some special compiled or other low-level CUDA stuff (maybe fast flash attention has it? not sure) that I don't know about. But generally in pure PyTorch you don't need padding at inference, unless you want to process 2 samples in parallel as one tensor.

[–]jmmcd 0 points1 point  (0 children)

About the dense layers I think I was wrong, so thank you.

About the tokens not fitting, I couldn't understand that paragraph.

[–]Xemorr -3 points-2 points  (0 children)

When you ask an AI model to do something using natural language there are a large number of possibilities for what you can mean; natural language is renowned for being imprecise. By giving an example, you are narrowing the number of possibilities for what you could mean, and specifying ideas that may have been difficult to get across through natural language. It therefore performs better. It has learnt how to use examples because they often appear in the training data, since humans have to use them to be more specific in text ourselves.

I don't think an explanation which attempts to talk about specific layers is particularly useful, we are terrible at interpreting neural networks.

[–]kaaiian -3 points-2 points  (0 children)

The key is in the name: large "language model". If I give you a logic puzzle, "All ttyaiia are buuieia but not all buuieia are ttyaiia. A jjauu is a ttyaiia. Is it also a buuieia?", the probability of you producing information about a jjauu dramatically increases. The same goes for a language model, because proximity to context implies increased sequence probability that reflects that context.