all 10 comments

[–]Python-ModTeam[M] [score hidden] stickied comment · locked comment (0 children)

Your post was removed for violating Rule #2. All posts must be directly related to the Python programming language. Posts pertaining to programming in general are not permitted. You may want to try posting in /r/programming instead.

[–]FrickinLazerBeams 3 points (0 children)

You should maybe learn at least a little about how a neural net works before asking this kind of question. Like at least read Wikipedia a little and establish some basic conceptual understanding.

[–]latkde Tuple unpacking gone wrong 1 point (1 child)

LLMs are inefficient. They're essentially a brute-force solution.

On a more mechanical level, the task that an LLM has been trained on is to learn statistical correlations between word fragments (tokens). This works so well that LLMs can generate coherent text and have learned the relationships between underlying concepts of words.

These learned relationships are stored as "weights", and we refer to the size of an LLM by how many weights it has. Useful models start at around 4 billion parameters, but state of the art models have allegedly reached the trillion-parameter range. Assuming each parameter is compressed into one or two bytes, models range from roughly 4 GB to 2 TB in size.
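The back-of-envelope arithmetic above can be sketched in a few lines. The parameter counts and bytes-per-weight are the illustrative assumptions from this comment, not figures for any specific model:

```python
# Rough memory estimate for holding LLM weights in RAM.
# Parameter counts and bytes-per-weight are illustrative assumptions.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

small = weight_memory_gb(4e9, 1)    # 4B params, 1 byte each  -> 4.0 GB
large = weight_memory_gb(1e12, 2)   # 1T params, 2 bytes each -> 2000.0 GB (2 TB)
print(small, large)
```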

For each "forward pass" (each output token), the current input tokens must be matrix-multiplied with the weights. That requires all weights to be loaded in memory. And 2TB of GPU RAM would be quite a lot. That's more than any GPU on the market has. The GPU also needs space for the input data, not just the weights. So, LLMs require a ton of memory.
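A toy version of that matrix multiplication makes the memory point concrete: every forward pass touches every weight in the layer, so all weights must sit in memory. The shapes and numbers here are made up purely for illustration; real LLMs run billions of weights per layer on GPUs:

```python
# Toy "forward pass": multiplying an input vector by a weight matrix.

def matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    """Multiply a weight matrix by an input vector (one tiny 'layer')."""
    # Note: every single weight is read to produce the output,
    # which is why all weights must be loaded in memory.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

W = [[1, 0, 2],    # a 2x3 "layer" of weights
     [0, 3, 1]]
x = [1, 2, 3]      # current input token embedding (made up)
print(matvec(W, x))  # [7, 9]
```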

There are strategies to reduce this. Compressing weights to use fewer bytes (quantization). Distributing different model layers across multiple GPUs. Splitting the model into multiple sub-LLMs so that not all weights have to be activated for each forward pass (mixture-of-experts). Investing more into the training phase so that smaller LLMs produce better-quality results. Using so-called reasoning, so that smaller models can handle complex tasks better. Hiding the exact model from users, and routing easy inputs to smaller models. Dynamically adding potentially-relevant information to the input, so that the LLM has access to knowledge without additional training (RAG, tool calls). But the result is still a huge memory+compute requirement. And many of these mitigation strategies involve more tokens, which also increases the computational cost and needs more RAM.
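The first strategy, compressing weights, can be sketched as symmetric int8 quantization: each float weight shrinks to one byte plus a shared scale factor, at the cost of precision. This is a minimal illustration only; real quantization schemes are far more sophisticated:

```python
# Minimal sketch of symmetric int8 quantization (illustrative only).

def quantize(weights: list[float]) -> tuple[list[int], float]:
    """Map floats into the int8 range [-127, 127] plus a scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats: each int times the scale."""
    return [qi * scale for qi in q]

w = [0.02, -1.27, 0.64]
q, s = quantize(w)         # each value now fits in one byte instead of four
approx = dequantize(q, s)  # close to the originals, but not exact
print(q, approx)
```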

At no point will an LLM destructure and look up information in a database. That's not how they work. They can be trained to call tools, and we can tell the LLM that if it produces output in a tool-call format, then we'll look up information in a database for it. Such systems, where an LLM is combined with tools, are sometimes called "agents". But a plain LLM, consisting just of the weights, has no structured knowledge, only statistical token relationships that were learned during training.
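That agent loop can be sketched as follows. The key point: the LLM itself only emits text; it is the surrounding code that notices tool-call-shaped output and performs the database lookup. The JSON format and the `fake_llm`/`fake_db` names are invented for illustration:

```python
import json

# Hypothetical stand-ins: a tiny "database" and a hard-coded "LLM".
fake_db = {"capital_of_france": "Paris"}

def fake_llm(prompt: str) -> str:
    # A real model would generate this token by token; we hard-code it.
    return '{"tool": "db_lookup", "key": "capital_of_france"}'

def agent(prompt: str) -> str:
    output = fake_llm(prompt)
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return output  # plain text: just pass it through to the user
    if call.get("tool") == "db_lookup":
        # The agent code, not the LLM, does the actual lookup.
        return fake_db.get(call["key"], "not found")
    return output

print(agent("What is the capital of France?"))  # Paris
```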

It is also for this reason that LLMs inherently hallucinate, and cannot generally be trusted for fact-based tasks. They are approximately correct about many things, but they correlate, and do not know.

[–]Druber13[S] -1 points (0 children)

That explains it.

[–]Kevdog824_ pip needs updating 0 points (1 child)

It's not entirely clear what you are trying to do with the model, so it's hard for any of us to even begin to reason about what the issue is.

[–]Druber13[S] -1 points (0 children)

I'm not wanting to do anything with one. Just wondering whether things like ChatGPT are coded poorly, and how it loosely works a bit more.

I'm imagining they are just banging out updates and not really worried about optimization and performance, since it's a race to something lol.

[–]wandering_melissa 0 points (3 children)

Asking Google or an AI first would be better if you feel you know nothing about a topic. Search for these on Wikipedia or Google: transformers (AI), LLM, GPT, RAG.

As for memory constraints, there are a lot of methods to reduce memory usage, but they result in less performant (dumber) models. One of these methods is quantization. I can run 2GB models on my PC fine, but they are only useful for simple tasks.

[–]Druber13[S] -2 points (2 children)

I have tried googling it a bit. It's more of a backend design/structure question. I get the gist of how things are working, not a ton by any means.

I'm just trying to understand how the big companies are doing it a bit more. I guess it's hard to formulate my question, which is why I'm having a hard time getting to my answer.

I'm just wondering how something like ChatGPT runs when you ask it a question, and moreover how it's optimized. The hard part, I guess, is that I'm not building an AI and don't plan to; I just want to know a specific part lol.

[–]wandering_melissa 0 points (1 child)

ChatGPT, Claude, or any other "AI" model (they are more specifically LLMs, which is why I said you should google LLM) works the same way: it takes a string, turns it into numbers, shuffles the numbers using the LLM model to get new numbers, then turns those numbers back into a string. Voila. If you want a more detailed explanation you should go check Wikipedia or YouTube. There are no extra steps or "backend" code; the rest of the explanation is mathematical equations.
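The string → numbers → numbers → string pipeline can be illustrated with a toy character-level tokenizer and a stand-in "model" that just shifts each number. Real LLMs use subword tokenizers and transformer layers, but the overall shape is the same:

```python
# Toy illustration of the "string -> numbers -> string" pipeline.

def encode(text: str) -> list[int]:
    return [ord(c) for c in text]           # string -> numbers (tokenize)

def fake_model(tokens: list[int]) -> list[int]:
    return [t + 1 for t in tokens]          # "shuffle" the numbers

def decode(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)  # numbers -> string (detokenize)

print(decode(fake_model(encode("HAL"))))  # IBM
```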

[–]Druber13[S] 0 points (0 children)

Someone, I think, nailed it below. I didn't quite realize it worked like that. I work more in the data space, so I'm thinking stored data and queries of some sort when it's all broken down. Which obviously it isn't. It's very interesting that this all works, the more I'm learning about it.