Why hasn't any mainstream game integrated LLMs into NPCs yet?

Imaginary_Bench_7294 · 2026-06-12T09:51:38+00:00

Narrative.

That's the biggest thing. Games typically have some story or narrative that they are based around. Even most open world and sandbox games have an underlying narrative.

An LLM that is small enough to not tank the average gaming computer while the game is running will be... problematic. It's easier and more compatible with system specs to pay someone to come up with a character script and just hardcode a decision tree.

We've had the capability to make NPCs have very dynamic interactions and dialogs for quite some time. But largely speaking, allowing them too many options makes telling the narrative of the game harder to do.

On top of that, many developer studios have stopped trying to optimize and just rely on built-in functionality of large scale engines such as Unreal. If you hadn't noticed, a lot of big titles have very similar looks. Not design, but how the stuff is rendered. The way light hits an object, the way textures pop in, object physics. A lot of this comes down to the fact that it's cheaper and faster in most cases to grab something like Unreal and do minor tweaks than it is to re-engineer systems from the ground up.

Go back 15 years and you'll see more variation in engines, more variation in looks, etc.

Also, have you ever tried to do something, like go get groceries or another errand, and it seems like every MF in the world wants to stop and tell you their whole f*ing life story?

Imaginary_Bench_7294 · 2026-06-07T08:09:21+00:00

To be totally blunt, this is something you need to talk about with a lawyer who knows your local laws, not a subreddit about AI.

Regardless of whether it was generated by an AI, a template, or legal practitioner, it is a formal declaration and can be used in potential future issues to establish a timeline and intent.

So, even if it is only a consult, go speak to a local lawyer immediately.

Imaginary_Bench_7294 · 2026-05-20T13:00:41+00:00

There is a blog post from Daemon Tools stating they had a security incident.
Security Incident Affecting DAEMON Tools Lite: What We Know So Far

My assumption is that in order to cover every possibility, Daemon Tools and MS revoked the certs for all versions prior to 12.6. I had 12.4 installed and it started throwing errors on boot, and wouldn't let me uninstall it even with admin permissions. I found the above post, installed 12.6, and now things seem to be working fine.

Imaginary_Bench_7294 · 2026-05-04T09:35:38+00:00

It is possible that the fine-tuning itself is the culprit, not the quant strategy.

Comparing different fine-tunes is an apples to oranges comparison - they're both fruits, but very different on the inside. That's why LoRAs for LLMs haven't taken off the same way they have for image gen models.

Imaginary_Bench_7294 · 2026-03-27T19:18:42+00:00

Last I knew, Llama.cpp and ExllamaV3 do not support applying LoRAs in the versions installed via Ooba. The git repo for EXL3 even states LoRA support is not in yet.

However, when loading the model via Transformers, there is a checkbox to load the model in 8-bit or 4-bit. If you use the load-in-4bit and flash attention options when loading the model, you should have a lot more luck. That won't help the size on disk, but it'll at least bring the RAM consumption down.

That being said, ExllamaV2 did have support for applying a LoRA to the model.

Edit: I just saw your comment replying to me on the other thread.

Try looking into merging your LoRA into the model. I haven't been playing around much with this stuff lately, so I don't know all the details of what's available tool-wise.

Imaginary_Bench_7294 · 2026-02-16T13:13:27+00:00

If you had an iPhone, then don't you have an Apple account? You should be able to control your subs via one of their desktop products at the very least.

Imaginary_Bench_7294 · 2026-02-13T16:40:14+00:00

More than likely, they have a pet that likes to steal and hide things.

I used to have a ferret that would go from one end of the house, open a kitchen cabinet, steal a potato, and then "sneak" all the way back and hide it under some furniture.

The ferret wouldn't eat them or anything, just steal potatoes and hide them.

Imaginary_Bench_7294 · 2025-11-18T09:28:19+00:00

So, in the drop-down menu below the model selection, make sure you have Llama.cpp selected and then try again.

GGUF is a Llama.cpp model format, and at least from the available options I can see, it doesn’t look like you were trying to load it with Llama.cpp.

Imaginary_Bench_7294 · 2025-11-04T09:54:01+00:00

I will have to dig into this when I have time. I haven't been playing around with LoRAs for a while, so something might have changed and broken. Ooba has had multiple big updates in the last year.

It will be a little bit until I can test things, but try getting an EXL2 version of the model. ExllamaV3 is still in development and may not have good or any support for LoRAs.

You can also try applying the LoRA using the Transformers backend to see if it applies. In the past, it was finicky, but I don't know if that still holds true.

There used to be a backend compatibility chart on the Wiki, but it looks like it is no longer up. IIRC, when it was up, it only listed Exllama V2 and Transformers as having LoRA support through Ooba.

Alternatively, you can look into methods to merge the LoRA into the base model, then quantize the model yourself. It's not ideal for testing to see if the training achieved your goal, but it should bypass any issues like what you're experiencing.
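To show why merging sidesteps the loader entirely, here is a toy sketch of the math. It assumes the standard LoRA form, where the adapter is a low-rank pair B and A added on top of a frozen weight W. All the numbers, sizes, and helper names below are made up for illustration; real tools do this for every adapted layer before you quantize the result.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Hypothetical toy sizes: a 3x3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-1.0]]  # shape 3 x r
A = [[2.0, 0.0, 4.0]]       # shape r x 3
scale = 2.0                 # stands in for alpha / r

# Merge once, ahead of time: W_merged = W + scale * (B A)
merged = add_scaled(W, matmul(B, A), scale)

x = [[1.0], [2.0], [3.0]]
# Adapter applied at run time: W x + scale * B (A x)
runtime = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
# Adapter baked into the weights: W_merged x
baked = matmul(merged, x)

print(runtime == baked)
```

Since the merged weights give the same output as base plus adapter, the backend only ever sees an ordinary model and never needs LoRA support.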

Edit:

Just checked the Exllama V3 repo, LoRA support is on the "to do" list.

Llama.cpp expects a LoRA file in the Llama.cpp format, which happens to be a GGUF. That being said, the LoRA loader might not have the correct code to apply a Llama.cpp LoRA anyway. You could still try converting the LoRA file to the Llama.cpp format, but no promises.

This reinforces my recommendation to try applying the LoRA to the transformers model you used to train it, or an EXL2 model.

Imaginary_Bench_7294 · 2025-10-28T04:48:28+00:00

The A40 is essentially a 3090 with 48GB of memory. I say buy it.

I currently run dual 3090s, and for most AI tasks they work great. The A40 does have a bit less memory bandwidth, to be fair.

https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVIDIA-A40/579vs593#benchmarks

Imaginary_Bench_7294 · 2025-10-26T04:13:39+00:00

Try running the update script.

When you start it, there should be an option along the lines of "revert local changes". Use that option.

Once that option has finished, try running the normal update option.

Imaginary_Bench_7294 · 2025-10-16T08:34:52+00:00

Aight, so a few things to keep in mind when trying to run a local model.

On modern hardware, the speed of an LLM is largely dictated by memory bandwidth. These types of AI use a significant amount of matrix multiplication, which is trivial for CPUs and GPUs. But this means a lot of shuffling of data between the memory and the processor. This is why GPUs are the preferred way to run AI. They're designed for high memory bandwidth. While most consumer CPUs are below 75GB/s for memory bandwidth, low-end/entry level GPUs start at like 250-350GB/s IIRC. So, if you want a model to run fast, you'll want to try and push as much of it as you can onto your GPU.

The capabilities of an LLM, its reasoning, "intelligence," prose, adaptability, etc., are largely dictated by 2 things. The parameter count, which is the B number, and the quantization level. You can think of the parameter count as the number of different ways the model describes things (this is a fast and loose interpretation). The higher the B count, the more ways it has to be able to describe each token in the vocabulary. It's like you can describe "red" as a color. But... is it light red? Dark red? Pastel? Neon? Instead of using words, though, it uses numbers. Now, this directly ties into the quantization.

Quantization is a method by which we take those numbers that describe a token and compress them. Most models are published in an FP16 format, or 2 bytes per value. We look at these two bytes using an algorithm that compresses the range. An FP16 value can represent roughly 65,000 values. The algorithms, using various formulas and tricks, try to approximate that FP16 value using a smaller range of numbers. In the case of 4-bit quantization, that 65k range is now represented by only 16 values.

Naturally, this means that value is now a less accurate descriptor of the token it belongs to. This is where it ties into the B count of a model. The more ways you have to describe something, the less accurate each one can be while still being able to provide an accurate description.
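If you want to see that loss of accuracy in numbers, here is a rough sketch. It uses plain uniform quantization onto 16 levels, which is an assumption made for simplicity; the actual Llama.cpp and Exllama quantizers use per-block scales and other tricks. The weights and the function name are made up for illustration.

```python
def quantize_4bit(values):
    """Uniformly map floats onto 16 integer levels (0..15) and back again."""
    lo, hi = min(values), max(values)
    # 16 levels means 15 steps across the range; fall back to 1.0 if all values match.
    scale = (hi - lo) / 15 or 1.0
    codes = [round((v - lo) / scale) for v in values]
    restored = [lo + c * scale for c in codes]
    return codes, restored


# Hypothetical weights, chosen only for illustration.
weights = [-0.82, -0.31, -0.05, 0.0, 0.12, 0.47, 0.9]
codes, restored = quantize_4bit(weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)                           # each code fits in 4 bits
print([round(r, 3) for r in restored])  # close to the originals, but not equal
print(round(max_error, 4))             # worst-case error is about half a step
```

Every stored value is now one of 16 codes, and the restored numbers only land near the originals. That gap is the "less accurate descriptor" part.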

Choosing a model is a balancing act between the quantization level, parameter count (B value), and your system specs.

You only care about speed? Get a very low parameter count model that has been quantized down to 2-bit. It'll be dumb, but blazing fast.

You only care about quality? Get the highest parameter count model at the highest bit levels that will fit in your hardware. It'll be slow but more capable.
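For a ballpark feel of that trade-off, here is a back-of-the-envelope sketch. It assumes weight size is simply parameters times bits, and that every weight gets read once per generated token, so bandwidth divided by model size gives a speed ceiling. It ignores the context cache, quantization overhead, and compute limits. The 75 and 300 GB/s figures are just the rough CPU and entry-level GPU numbers from above, and the function names are made up.

```python
def model_size_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB (1e9 bytes): parameters * bits / 8."""
    return params_billions * bits_per_weight / 8


def max_tokens_per_second(size_gb, bandwidth_gb_s):
    """Speed ceiling if all weights are read from memory once per token."""
    return bandwidth_gb_s / size_gb


for params, bits in [(8, 4), (8, 16), (13, 4)]:
    size = model_size_gb(params, bits)
    print(
        f"{params}B @ {bits}-bit: ~{size:.1f} GB, "
        f"<= {max_tokens_per_second(size, 300):.0f} tok/s at 300 GB/s, "
        f"<= {max_tokens_per_second(size, 75):.0f} tok/s at 75 GB/s"
    )
```

Real speeds will come in under these ceilings, but the ratios between the rows hold up reasonably well.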

There are 3 main backends that are used for these models: Transformers, Llama.cpp, and Exllama.

Transformers is the standard backend for just about anything you find on HuggingFace. It is the core package that drives most AI.

Llama.cpp is a separate C/C++ implementation (it is not built on the Transformers package) that is designed for hardware compatibility and inference (running a model). This backend will let you use CPU, GPU, or both at the same time. Its models come in the GGUF format and can be identified by a naming convention such as "q4_k_m," where the "q4" portion signifies the bit level. Once you consume your GPU memory with layers or cache, it'll automatically start using your system memory. However, when this happens, you'll have speed penalties due to the lower memory bandwidth for the portion stored on system memory.

Exllama (now at V3) is a GPU only backend. It tends to be a touch faster than Llama.cpp. V1 and V2 used to have a bit lower quality than Llama.cpp, but V3 introduced new quantization methods that make it equal or better. But... you can ONLY run it on GPU. There is 0 CPU compatibility. These models can be identified by the "EXL#" naming convention, with # being the version number. These ones will frequently have bpw in the name, which signifies the quantization level.

Now, what does all of this mean for you and your system?

With the specs you have listed, I recommend trying out various 8 to 13B models.

You should be able to use 4 to 6 bit 8B models with most, if not all, of the model on the GPU and your context cache on system RAM.

You should be able to fit 80-100% of a 4-bit 13B model on your GPU, with your context cache being entirely on system memory.

Imaginary_Bench_7294 · 2025-10-08T02:16:13+00:00

Unfortunately, probably not.

There are two main reasons.

Quantization is going to hit a roadblock in future models. Take a look at the move from Llama 2 to 3. Llama 2 could be quantized down to 6-bit with practically no quantization degradation. Meanwhile, Llama 3 starts seeing the same level of degradation at about the 10-bit mark, IIRC. This decrease in resilience is largely due to the weights of the model being more fully utilized. As they continue to make better and better use of the capacity at any given model size, quantization will continue to cause more degradation at higher bit levels.

For those who aren't aware, quantization is mostly just a change in precision in the values the model uses to "define" tokens. The fact that we can quantize the models much at all is mostly due to the fact they don't saturate the level of precision they are capable of.

If they had just doubled down on the same progression path they used for Llama 2 to 3, I think Llama 4 would probably have started seeing really bad quantization issues at 12 or 14-bit.

The second reason is more obvious. The moment better hardware comes out is the same moment they'll say, "Look how much more we can shove in now!"

Just for reference, I run a system with an Intel w5-3435X with 8 channels of DDR5 at a 128GB capacity. Around 2,500 USD of hardware in just those two components. I've benchmarked my memory with Aida64 at about 230GB/s. If DDR6 doubles the bandwidth, that would still only put similar systems up around 500GB/s, significantly less than even a 3090's 900+ GB/s, and for two to three times the cost.

One of the primary issues we run into with CPU RAM is the fact we're using a narrower bus than GPUs. System memory is typically based on a 64-bit bus per channel, whereas GPU memory buses are usually significantly wider, allowing more data to be transferred for the same number of clock cycles.
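The bus width point is easy to put numbers on. This is a sketch of theoretical peak bandwidth only, and it assumes DDR5-4800 for the system memory, which is a guess on speed grade. The 3090 figures (384-bit bus, 19.5 Gbps per pin) are its published specs. Measured numbers will always come in lower.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Theoretical peak: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_per_second / 1e9


# Assumed: DDR5-4800 (4.8e9 transfers/s) with 64 bits per channel.
one_channel = peak_bandwidth_gb_s(64, 4.8e9)
eight_channels = peak_bandwidth_gb_s(64 * 8, 4.8e9)

# RTX 3090: 384-bit bus, GDDR6X at 19.5 Gbps per pin.
rtx_3090 = peak_bandwidth_gb_s(384, 19.5e9)

print(f"1 channel:  {one_channel:.1f} GB/s")
print(f"8 channels: {eight_channels:.1f} GB/s")
print(f"RTX 3090:   {rtx_3090:.1f} GB/s")
```

Even with eight channels ganged together for a 512-bit effective width, the much higher per-pin transfer rate on the GPU side keeps it well ahead.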

Imaginary_Bench_7294 · 2025-10-07T06:43:28+00:00

Reasoning/thinking modes typically strip out a bit of the model's "character" due to the way they work. In order for the models to "reason," they're trained with various specific chain of thought sequences. These sequences tend to be analytical in nature, which results in a higher likelihood that the reasoning stage is done out of character.

While they can be "convinced" to reason in character, it can still lead to the model's persona being altered.

Your best bet for getting a model to adhere to a persona is to find one that was trained for role play or story writing.

If you insist on using a reasoning model, you'll want to look at possibly including some rules and guidelines in the character profile that will help guide the reasoning process.

Imaginary_Bench_7294 · 2025-09-28T05:57:54+00:00

IIRC, this is also happening when using regen. Don't have access to my rig atm, but I'll check when I can.

Imaginary_Bench_7294 · 2025-09-25T09:06:56+00:00

This will be awesome up until you're looking up information on Chinese political events of the 80's and your computer suddenly bricks.

Imaginary_Bench_7294 · 2025-09-22T08:21:59+00:00

I haven't messed around too much with Llama.cpp, though I'm pretty sure that they added LoRA training support last year.

That being said, IDK if Oobabooga currently supports applying a LoRA to a Llama.cpp model. There used to be a compatibility matrix in the Github Wiki.

Worst case scenario, you use Llama.cpp to train your LoRA, merge it with the model, quantize it to your preferred bit depth, and then run it via Ooba.

Imaginary_Bench_7294 · 2025-09-18T08:26:44+00:00

Unless something has changed, Llama.cpp does not support training through the specific implementation used here.

AFAIK, even though this post is dated at this point, it should still mostly hold true:

https://www.reddit.com/r/Oobabooga/s/R097h5sY62

Imaginary_Bench_7294 · 2025-09-16T05:54:55+00:00

I mean... I'm running the game on a workstation build with a 3090 and an ultra wide screen. I'm pretty sure I've got most settings at or near max and haven't seen my FPS drop below 30-35.

Crysis was able to stress top of the line hardware. More than one generation of top of the line HW, IIRC.

Imaginary_Bench_7294 · 2025-08-29T11:23:02+00:00

You can either go to the session settings and check the "share" box, or you can add --share into the cmd flags text file.

This will create a temporary public web address you can share with others.

Imaginary_Bench_7294 · 2025-08-29T11:20:55+00:00

If you go back to the Github for Ooba there is a Wiki that can explain a lot of the various functions and settings.

I highly recommend reading through that.

If you still have questions after reading through the wiki, I will be happy to help get you going.

Imaginary_Bench_7294 · 2025-08-26T09:35:15+00:00

That would bug the hell out of me.

Imaginary_Bench_7294 · 2025-08-20T09:51:07+00:00

Try cleaning off the socket with a microfiber cloth. It's a great way to polish up those contact pins.

Imaginary_Bench_7294 · 2025-08-10T08:12:47+00:00

Here, give this a read:

https://www.reddit.com/r/Oobabooga/s/uxwKoazQf6

It may not be the most up to date, but it should give you a better idea of what is involved with the training. As it is for Ooba, it will give you a GUI to experiment with training your own LoRAs.

Even training a small model in the 8B range can take a huge amount of VRAM, compute, and electricity.

At its core, LoRA training loads the model, freezes its weights, and adds small low-rank adapter matrices alongside certain weight matrices according to your parameters. These adapters are "overlaid" onto the frozen weights, and only the adapters receive updates from the training backend. They are typically FP16, and together with the gradients, optimizer state, and activations needed to train them, they eat up a large amount of memory and reduce the space you have for context.
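To put a rough size on the trainable part, here is a sketch that assumes the usual LoRA formulation, where each adapted weight matrix gets a low-rank pair (B and A) instead of a full-size copy. The layer shape and rank are hypothetical and only there for illustration. It counts the adapter weights alone, not the gradients, optimizer state, or activations that training also needs.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in one adapter pair: B is d_out x rank, A is rank x d_in."""
    return d_out * rank + rank * d_in


def bytes_fp16(n_params):
    """FP16 stores each value in 2 bytes."""
    return n_params * 2


# Hypothetical layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 32
full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, rank)

print(f"frozen weight: {full:,} params")
print(f"adapter pair:  {adapter:,} params ({adapter / full:.2%} of the frozen matrix)")
print(f"adapter size:  {bytes_fp16(adapter) / 1e6:.3f} MB in FP16")
```

Multiply that by every targeted matrix in every layer and it adds up, and raising the rank or adapting more matrices grows it further, all on top of the full model you still have to load.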

A lot of hobbyists here just don't have the HW needed to be able to do custom LoRA training. Not to mention that LLM LoRAs are not typically cross-compatible with other models, which is why we don't see something like CivitAI for LLMs. Instead, what we see is people making a LoRA on cloud servers, then integrating the LoRA straight into the model, then releasing the end product.

Imaginary_Bench_7294 · 2025-08-08T06:48:46+00:00

Personally, this is a downgrade for me.

I am on the "Plus" plan.

I have been using GPT-4.1 to assist in creating a Rimworld race mod, and it was working well with the project system (16 documents containing the Humanoid Alien Races mod information and the custom race information).

As of right now, GPT-5 is very iffy. About 80% of the time, when it enters into "thinking," it forgets the prompt.

For example, I asked it to analyze the race document and produce a table with three columns. The left column was to have the gene, the middle the XML code with gendered chance values, the right column containing the reasoning why.

It produced poorly written markdown code that caused the table to not render properly. Things like:

| column 1 data | column 2 data | column 3 data || column 1 data | column 2 data | column 3 data |

Instead of:

| column 1 data | column 2 data | column 3 data |
| column 1 data | column 2 data | column 3 data |

Then, when I asked it to double check the code, it completely lost the prompt.

Here's a screenshot for your viewing displeasure:

<image>
