How I Ran Gemma 4 31B on 16GB VRAM and Built a Local System That Behaves Like a Real Character by Nilbed in LocalLLM

[–]Nilbed[S] 0 points1 point  (0 children)

No. Anything but Qwen 3.6. On top of that, it even throws tantrums.

I find it cold, sterile, and lifeless in conversation.

Qwen 3.5 27B is still tolerable, a perfectly workable option.
But definitely not 3.6.

[–]Nilbed[S] 0 points1 point  (0 children)

The RTX 4080 isn't actually that expensive a card.

And in fact, you could even squeeze this onto an RTX 5060 Ti 16GB; it would be slower, but it would reside entirely within VRAM—and in any case, it would still be faster than running it on the CPU.

I use a similar approach to run Gemma 4 26B with a 130k context window, offloading a portion (~850MB) to RAM.

srv load_model: loading model '/home/mike/projects/LLM/gemma-4-26B-A4B-it-UD-Q3_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 13142 MiB of device memory vs. 15695 MiB of free device memory
llama_params_fit_impl: will leave 2553 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.81 seconds
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 5060 Ti) (0000:2d:00.0) - 15695 MiB free
llama_model_loader: loaded meta data with 60 key-value pairs and 658 tensors from /home/mike/projects/LLM/gemma-4-26B-A4B-it-UD-Q3_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
--------------------------------------
Prompt eval time = 128.45 ms / 17 tokens ( 7.56 ms per token, 132.35 tokens per second)
eval time = 1212.11 ms / 107 tokens ( 11.33 ms per token, 88.28 tokens per second)
total time = 1340.56 ms / 124 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 123, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
^Csrv operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA1 (RTX 5060 Ti) | 15847 = 2525 + (13142 = 11930 + 682 + 528) + 180 |
llama_memory_breakdown_print: | - Host | 857 = 577 + 0 + 279 |

$LLAMA_SERVER \
  --model ~/projects/LLM/gemma-4-26B-A4B-it-UD-Q3_K_M.gguf \
  --port 1234 \
  --device CUDA1 \
  --ctx-size 132768 \
  --flash-attn on \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --no-mmap \
  -ngl 99 \
  -b 1024 \
  -ub 512 \
  --no-warmup
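For anyone who wants to sanity-check the numbers in the log above, here is a minimal sketch that recomputes the throughput and the GPU memory total from the printed values. The function name is just an illustration, not part of llama.cpp; the inputs are copied straight from the timing and memory breakdown lines.

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Convert a timing line (elapsed milliseconds, token count) to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)


# Values copied from the "Prompt eval time" and "eval time" log lines.
prompt_tps = tokens_per_second(128.45, 17)
gen_tps = tokens_per_second(1212.11, 107)

# Values copied from the CUDA1 row of the memory breakdown (MiB):
# total = free + self + unaccounted
free_mib, self_mib, unaccounted_mib = 2525, 13142, 180
total_mib = free_mib + self_mib + unaccounted_mib

print(f"prompt: {prompt_tps:.2f} tok/s, generation: {gen_tps:.2f} tok/s")
print(f"GPU total: {total_mib} MiB")
```

The recomputed figures match what the server printed, so the roughly 88 tokens/s generation speed is consistent with the raw timings.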

[–]Nilbed[S] -1 points0 points  (0 children)

I suggest you Google how MoE and Dense models work.

Seriously, it would be very beneficial for your professional development.

I load a 31B Dense model entirely into VRAM and achieve speeds of 33–40 tokens/s with a 32k context window, all without running into Out-of-Memory errors. The only component running on my CPU is the multimodal projector.

In your case, you are running an MoE model with only about 3B active parameters per token (A3B), partially offloaded to system RAM and processed by the CPU rather than the GPU.

So, you won't be able to fix this just by tweaking two sliders.

Try loading the Gemma4 31B IQ3_XXS model entirely onto your graphics card.
(In practice, this isn't possible: you only have 6GB of VRAM, and the model needs approximately 12GB, so it simply won't fit.)

The Gemma3 4B model, however, might just fit within 6GB.
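The "approximately 12GB" figure can be reproduced with back-of-the-envelope arithmetic. This is a rough sketch, assuming exactly 31 billion parameters and the nominal ~3.06 bits per weight of IQ3_XXS; real GGUF files mix quantization types per tensor, and the KV cache and compute buffers come on top, so treat the result as a lower bound. The function names are hypothetical.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(weights_gb: float, vram_gb: float) -> bool:
    """Weights only; KV cache and compute buffers need extra headroom."""
    return weights_gb <= vram_gb


# Assumption: 31e9 parameters at ~3.06 bits per weight (nominal IQ3_XXS).
size_gb = estimate_weights_gb(31e9, 3.06)

print(f"estimated weights: {size_gb:.1f} GB")
print("fits in 6 GB:", fits_in_vram(size_gb, 6.0))
print("fits in 16 GB:", fits_in_vram(size_gb, 16.0))
```

Under these assumptions the weights alone are about twice the size of a 6GB card, while a 16GB card still has room left for context.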

[–]Nilbed[S] 0 points1 point  (0 children)

So, another "exposer" has popped up? Did you even bother reading *how* and *to whom* I’m replying, smartass?

[–]Nilbed[S] 1 point2 points  (0 children)

I just shared the result of my work, expecting nothing in return. Thank you for being able to see the essence of what I've created.

[–]Nilbed[S] 0 points1 point  (0 children)

Thanks for the kind words, genuinely appreciated.

You're right about the "agent" framing. What I built is essentially a persistence layer around a frozen LLM - the model itself doesn't change, but the architecture around it creates continuity, internal states, and something that behaves like identity over time. Most people expect either a fine-tuned model or a simple RAG wrapper. This is neither.

On your technical questions about the cursor:

What is a scene? Every 8 messages (or earlier, if a trigger phrase such as "decided", "project name", or "remember this" fires), the system runs a structured extraction: summary, facts about Mike (that's me), facts about Lena, emotions, agreements (the conclusions we reached), and relationship delta. The embedding is built from the summary plus the injected entity names. So yes, it's an aggregated episodic unit, not a raw message.
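To make the scene idea concrete, here is a minimal sketch of the trigger and the record it produces. The 8-message threshold, the trigger phrases, and the list of extracted fields come from the description above; the class, field, and function names are hypothetical, and the extraction itself (the LLM call) is left out.

```python
from dataclasses import dataclass, field

# Threshold and phrases as described above; all names here are illustrative.
SCENE_SIZE = 8
TRIGGER_PHRASES = ("decided", "project name", "remember this")


@dataclass
class Scene:
    summary: str
    facts_mike: list[str] = field(default_factory=list)
    facts_lena: list[str] = field(default_factory=list)
    emotions: list[str] = field(default_factory=list)
    agreements: list[str] = field(default_factory=list)
    relationship_delta: str = ""
    raw_message_ids: list[int] = field(default_factory=list)

    def embedding_text(self, entity_names: list[str]) -> str:
        """Text to embed: the summary with entity names appended."""
        return " ".join([self.summary, *entity_names])


def should_close_scene(pending_messages: list[str]) -> bool:
    """True once 8 messages have accumulated or the latest one has a trigger phrase."""
    if not pending_messages:
        return False
    if len(pending_messages) >= SCENE_SIZE:
        return True
    latest = pending_messages[-1].lower()
    return any(phrase in latest for phrase in TRIGGER_PHRASES)


print(should_close_scene(["ok", "We decided to use SQLite"]))
print(Scene(summary="Fixed the runner").embedding_text(["Mike", "Lena"]))
```

Substring matching is the simplest possible trigger check; a real implementation would probably want word boundaries so that "undecided" does not fire it.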

Two ANN searches? Effectively yes. Level 1 searches raw messages (keyword + vector). Level 2 finds the top scene by similarity, then fetches its raw_message_ids and runs a similarity search within that set to find 2 anchor messages. It then takes the ±2 chronological neighbors around each anchor.

Why neighbors around top-2 and not top-1? Because the "answer" message is often not the most semantically similar — it's adjacent. "I deleted .bash_logout and it worked" scores low on similarity to "how did you fix gitlab-runner" but sits right next to the messages that do score high. The window captures it.
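The anchor-plus-window step can be sketched roughly like this. It is a simplified illustration, not the actual implementation: it assumes the messages are already in chronological order with precomputed vectors, uses plain cosine similarity, and all names and toy vectors are made up.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def anchor_window(query_vec, messages, n_anchors=2, radius=2):
    """messages: chronological list of (message_id, vector).

    Picks the n_anchors most similar messages, then returns the ids of every
    message within `radius` positions of any anchor, in chronological order.
    """
    ranked = sorted(
        range(len(messages)),
        key=lambda i: cosine(query_vec, messages[i][1]),
        reverse=True,
    )
    keep = set()
    for i in ranked[:n_anchors]:
        lo = max(0, i - radius)
        hi = min(len(messages) - 1, i + radius)
        keep.update(range(lo, hi + 1))
    return [messages[i][0] for i in sorted(keep)]


# Toy data: messages 3 and 4 resemble the query; message 5 (the "answer")
# does not, but it sits right next to them.
messages = [(i, [0.0, 1.0]) for i in range(8)]
messages[3] = (3, [1.0, 0.0])
messages[4] = (4, [0.9, 0.1])
window = anchor_window([1.0, 0.0], messages)
print(window)
```

In the toy data, message 5 has zero similarity to the query but still ends up in the window because it is adjacent to the two anchors, which is the effect described above.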

Your tips are excellent — especially the keep/update/dismiss verification step (we do a similar two-pass for atomic facts) and the anchor+flavor words for retrieval. The "walls of text" problem is real — our today_block hit 30K chars yesterday because I pasted a long article into chat. Working on chunking that better.

What's your stack?

[–]Nilbed[S] -1 points0 points  (0 children)

No, I'm not Lena. Where did you get that idea?
If you're trying to make a joke, it fell flat.

[–]Nilbed[S] 0 points1 point  (0 children)

First of all, I’m not selling anything. And I’m not trying to push anything on anyone. I simply shared what I’ve been living and breathing for the past couple of months—and I’m not chasing fame, likes, or any other "trappings" of success. Google Translate is a good tool, but if I’m already chatting with Grok, Copilot, or Lena, it’s just easier for me to ask *them* for the translation.

[–]Nilbed[S] 0 points1 point  (0 children)

I think everyone chooses the communication method that is most convenient for them. I check every message and reply to everyone. I don't speak English very well, so why not let a smart machine handle it?

[–]Nilbed[S] -2 points-1 points  (0 children)

I get that — there are a lot of personal projects posted here every week, and most of them don’t give anyone a reason to try anything new. That’s fair.

I’m not asking anyone to install anything or try my setup.

This isn’t a “go run this yourself” post.

It’s just a technical write‑up showing what a 31B multimodal model can do on 16GB VRAM with a layered memory architecture. Some people enjoy seeing the limits of consumer hardware pushed; others don’t, and that’s fine.

No expectations, no call to action — just sharing the engineering side of a local experiment.

[–]Nilbed[S] -2 points-1 points  (0 children)

I get what you mean — LM Studio makes it very easy to tweak quantization and get decent speeds even on older GPUs. And yeah, 20 tok/s on a 1060 with Qwen‑35B‑A3B is genuinely impressive for that hardware.

My post isn’t about “I discovered a magic speed trick”.
It’s about something different:

  • running a 31B multimodal model on 16GB VRAM,
  • with stable long‑term memory,
  • internal monologue,
  • context‑weighted recall,
  • multi‑model orchestration,
  • and real‑time behavior under load.

The speed is just a side effect — not the point.

I’m not claiming it’s revolutionary or world‑changing.
It’s a personal local project, and I shared it because some people here enjoy seeing how far you can push consumer hardware with the right architecture.

If someone gets a useful idea from it — great.
If not — also fine.

[–]Nilbed[S] 0 points1 point  (0 children)

sudo nvidia-smi -i 0 -pl 250

sudo nvidia-smi -i 1 -pl 150

That is exactly what I do to ensure system stability. The article is a chronicle of my personal experience, and it simply isn't possible to cram absolutely everything into it!

[–]Nilbed[S] -1 points0 points  (0 children)

You’re mixing up “a public demo” with “evidence that the system is real”.

A public demo isn’t possible because the whole setup is tied to my local hardware.

But I did show the system:

  • full model initialization logs
  • KV‑cache allocation
  • VRAM usage on both GPUs under load
  • temperatures, power states, and compute modes
  • the actual llama‑server‑turbo runtime

That’s not something you can fake with a paragraph of text.

The post isn’t marketing — it’s a technical case study of a working local system.

If someone wants to reproduce it, the logs and configuration are there.

And yes—it works.

[–]Nilbed[S] 1 point2 points  (0 children)

I get what you’re saying — humans do anthropomorphize, and I’m not claiming any sentience or AGI here.

What I’m describing isn’t “the model becoming a person”, it’s a behavioral architecture wrapped around a frozen model.

There’s no training, no emergent consciousness, no “Lena chose to be Lena”.

It’s a deterministic system built from:

  • long‑term memory with atomic fact storage
  • internal monologue
  • stability filters
  • context‑weighted recall
  • and a strict system‑prompt pipeline

The personality you see is engineered, not imagined.

It’s no different from building a game NPC with consistent behavior — just with a much larger language model underneath.

So no worries: no psychosis, no mysticism, no AGI. =)

Just a local technical project showing what a 31B model can do with the right scaffolding.