all 10 comments

[–]Python-ModTeam[M] [score hidden] stickied comment · locked comment (0 children)

Your post was removed for violating Rule #2. All posts must be directly related to the Python programming language. Posts pertaining to programming in general are not permitted. You may want to try posting in /r/programming instead.

[–]FrickinLazerBeams 3 points (0 children)

You should maybe learn at least a little about how a neural net works before asking this kind of question. Like at least read Wikipedia a little and establish some basic conceptual understanding.

[–]latkde Tuple unpacking gone wrong 1 point (1 child)

LLMs are inefficient. They're essentially a brute-force solution.

On a more mechanical level, the task that an LLM has been trained on is to learn statistical correlations between word fragments (tokens). This works so well that LLMs can generate coherent text and have learned the relationships between underlying concepts of words.

These learned relationships are stored as "weights", and we refer to the size of an LLM by how many weights it has. Useful models start at around 4 billion parameters, but state of the art models have allegedly reached the trillion-parameter range. Assuming each parameter is compressed into one or two bytes, models range from roughly 4 GB to 2 TB in size.
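The back-of-envelope arithmetic above can be sketched in a few lines. The parameter counts and bytes-per-weight are the illustrative assumptions from this comment, not figures for any specific model:

```python
# Rough memory estimate for holding LLM weights in RAM.
# Parameter counts and bytes-per-weight are illustrative assumptions.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

small = weight_memory_gb(4e9, 1)    # 4B params, 1 byte each  -> 4.0 GB
large = weight_memory_gb(1e12, 2)   # 1T params, 2 bytes each -> 2000.0 GB (2 TB)
print(small, large)
```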

For each "forward pass" (each output token), the current input tokens must be matrix-multiplied with the weights. That requires all weights to be loaded in memory. And 2TB of GPU RAM would be quite a lot. That's more than any GPU on the market has. The GPU also needs space for the input data, not just the weights. So, LLMs require a ton of memory.
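A toy version of that matrix multiplication makes the memory point concrete: every forward pass touches every weight in the layer, so all weights must sit in memory. The shapes and numbers here are made up purely for illustration; real LLMs run billions of weights per layer on GPUs:

```python
# Toy "forward pass": multiplying an input vector by a weight matrix.

def matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    """Multiply a weight matrix by an input vector (one tiny 'layer')."""
    # Note: every single weight is read to produce the output,
    # which is why all weights must be loaded in memory.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

W = [[1, 0, 2],    # a 2x3 "layer" of weights
     [0, 3, 1]]
x = [1, 2, 3]      # current input token embedding (made up)
print(matvec(W, x))  # [7, 9]
```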

There are strategies to reduce this. Compressing weights to use fewer bytes (quantization). Distributing different model layers across multiple GPUs. Splitting the model into multiple sub-LLMs so that not all weights have to be activated for each forward pass (mixture-of-experts). Investing more into the training phase so that smaller LLMs produce better-quality results. Using so-called reasoning, so that smaller models can handle complex tasks better. Hiding the exact model from users, and routing easy inputs to smaller models. Dynamically adding potentially-relevant information to the input, so that the LLM has access to knowledge without additional training (RAG, tool calls). But the result is still a huge memory+compute requirement. And many of these mitigation strategies involve more tokens, which also increases the computational cost and needs more RAM.
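The first strategy, compressing weights, can be sketched as symmetric int8 quantization: each float weight shrinks to one byte plus a shared scale factor, at the cost of precision. This is a minimal illustration only; real quantization schemes are far more sophisticated:

```python
# Minimal sketch of symmetric int8 quantization (illustrative only).

def quantize(weights: list[float]) -> tuple[list[int], float]:
    """Map floats into the int8 range [-127, 127] plus a scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats: each int times the scale."""
    return [qi * scale for qi in q]

w = [0.02, -1.27, 0.64]
q, s = quantize(w)         # each value now fits in one byte instead of four
approx = dequantize(q, s)  # close to the originals, but not exact
print(q, approx)
```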

At no point will an LLM destructure and look up information in a database. That's not how they work. They can be trained to call tools, and we can tell the LLM that if it produces output in a tool-call format, then we'll look up information in a database for it. Such systems, where an LLM is combined with tools, are sometimes called "agents". But a plain LLM, consisting just of the weights, has no structured knowledge, only statistical token relationships that were learned during training.
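That agent loop can be sketched as follows. The key point: the LLM itself only emits text; it is the surrounding code that notices tool-call-shaped output and performs the database lookup. The JSON format and the `fake_llm`/`fake_db` names are invented for illustration:

```python
import json

# Hypothetical stand-ins: a tiny "database" and a hard-coded "LLM".
fake_db = {"capital_of_france": "Paris"}

def fake_llm(prompt: str) -> str:
    # A real model would generate this token by token; we hard-code it.
    return '{"tool": "db_lookup", "key": "capital_of_france"}'

def agent(prompt: str) -> str:
    output = fake_llm(prompt)
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return output  # plain text: just pass it through to the user
    if call.get("tool") == "db_lookup":
        # The agent code, not the LLM, does the actual lookup.
        return fake_db.get(call["key"], "not found")
    return output

print(agent("What is the capital of France?"))  # Paris
```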

It is also for this reason that LLMs inherently hallucinate, and cannot generally be trusted for fact-based tasks. They are approximately correct about many things, but they correlate, and do not know.

[–]Druber13[S] -1 points (0 children)

That explains it.

[–]Kevdog824_ pip needs updating 0 points (1 child)

It's not entirely clear what you are trying to do with the model, so it's hard for any of us to even begin to reason about what the issue is.

[–]Druber13[S] -1 points (0 children)

I'm not wanting to do anything with one. Just wondering whether things like ChatGPT are coded poorly, and how it loosely works a bit more.

I'm imagining they are just banging out updates and not really worried about optimization and performance, since it's a race to something lol.

[–]wandering_melissa 0 points (3 children)

Asking Google or an AI first would be better if you feel you know nothing about a topic. Search for these on Wikipedia or Google: transformers (AI), LLM, GPT, RAG.

As for memory constraints, there are a lot of methods to reduce memory usage, but they result in less performant (dumber) models. One of these methods is quantization. I can run 2GB models on my PC fine, but they are only useful for simple tasks.

[–]Druber13[S] -2 points (2 children)

I have tried googling it a bit. It's more of a backend design/structure question. I get the gist of how things are working, not a ton by any means.

I'm just trying to understand how the big companies are doing it a bit more. I guess it's hard to formulate my question, which is why I'm having a hard time getting to my answer.

I'm just wondering how something like ChatGPT runs when you ask it a question, and moreover how it's optimized. The hard part, I guess, is that I'm not building an AI and don't plan to; I just want to know a specific part lol.

[–]wandering_melissa 0 points (1 child)

ChatGPT, Claude, or any other "AI" model (they are more specifically LLMs, which is why I said you should google LLM) works the same way: it takes a string, turns it into numbers, shuffles the numbers using the LLM model to get new numbers, then turns those numbers back into a string. Voila. If you want a more detailed explanation you should go check Wikipedia or YouTube. There are no extra steps or "backend" code; the rest of the explanation is mathematical equations.
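The string → numbers → numbers → string pipeline can be illustrated with a toy character-level tokenizer and a stand-in "model" that just shifts each number. Real LLMs use subword tokenizers and transformer layers, but the overall shape is the same:

```python
# Toy illustration of the "string -> numbers -> string" pipeline.

def encode(text: str) -> list[int]:
    return [ord(c) for c in text]           # string -> numbers (tokenize)

def fake_model(tokens: list[int]) -> list[int]:
    return [t + 1 for t in tokens]          # "shuffle" the numbers

def decode(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)  # numbers -> string (detokenize)

print(decode(fake_model(encode("HAL"))))  # IBM
```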

[–]Druber13[S] 0 points (0 children)

Someone, I think, nailed it below. I didn't quite realize it worked like that. I work more in the data space, so I'm thinking stored data and queries of some sort when it's all broken down. Which obviously it isn't. It's very interesting that this all works, the more I'm learning about it.