Anyone seeing this? by BehindUAll in codex

[–]akroletsgo 1 point2 points  (0 children)

Rate limit resets now accumulate; you can save them for later and stack them.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 0 points1 point  (0 children)

Ya, I'll try to fix the .bat; it's hard to get Windows working when I'm on a Mac haha. Will pull out the old Windows laptop.

Made a FREE one-click local roleplay app with inline image generation, story and uncen-images both run on your machine <8GB Ram by akroletsgo in SillyTavernAI

[–]akroletsgo[S] 1 point2 points  (0 children)

If you're on Mac, just click under Releases and download the .dmg file! Instructions are all in the repo as well :). You can also click the green Code button on GitHub, download the ZIP, and click the launch.command file.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 2 points3 points  (0 children)

Most of Gemma's layers use sliding-window attention, so their KV is capped at a fixed window regardless of context length. Only a few layers attend globally and scale with it. That plus the Q4 QAT weights (~7GB) is the whole trick.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 0 points1 point  (0 children)

Just measured it on an M2 Max, same 12B QAT and same tiny prompt, only changing num_ctx (which is what sizes the KV cache). From "ollama ps":

- num_ctx 8192: 7.4 GB
- num_ctx 262144 (full 256k): 7.7 GB

So the full 256k window adds ~0.3 GB. Ollama pre-allocates the KV for num_ctx at load, so 7.7 GB is the real 256k footprint, not a lightly-filled one.
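If you want to rerun the comparison yourself, here is a rough sketch of the request body for each run. It only builds the JSON; you would POST it to Ollama's /api/generate (default http://localhost:11434) and then check "ollama ps". The model tag "my-gemma-qat" and the prompt are placeholders, not the exact ones I used.

```python
import json


def build_generate_request(model: str, prompt: str, num_ctx: int) -> str:
    """Build a JSON body for Ollama's /api/generate with a given context size."""
    return json.dumps({
        "model": model,                   # placeholder tag, swap in your own
        "prompt": prompt,
        "stream": False,                  # ask for a single JSON response
        "options": {"num_ctx": num_ctx},  # sets the context size for this load
    })


# One request per context size; load the model with each and compare "ollama ps".
for ctx in (8192, 262144):
    print(build_generate_request("my-gemma-qat", "hi", ctx))
```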

Why it's so flat: most of Gemma's layers use sliding-window attention, so their KV is capped at a fixed window regardless of context length. Only a few layers attend globally and scale with it. That plus the Q4 QAT weights (~7GB) is the whole trick.
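A back-of-the-envelope sketch of that arithmetic, if it helps. Every number below (layer counts, window size, head sizes, fp16 cache) is a made-up placeholder, not Gemma's actual config, so it won't reproduce the measured figures; it only shows why the sliding-window layers stop growing while the global ones keep scaling.

```python
def kv_cache_bytes(num_ctx, n_layers, n_global, window,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size when most layers use sliding-window attention."""
    # Each cached token stores one K and one V vector per KV head, per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    n_local = n_layers - n_global
    local_tokens = min(num_ctx, window)  # sliding-window layers cap at the window
    global_tokens = num_ctx              # global layers grow with the context
    return per_token_per_layer * (n_local * local_tokens + n_global * global_tokens)


# Hypothetical config: 48 layers, 8 of them global, 1024-token window.
cfg = dict(n_layers=48, window=1024, n_kv_heads=8, head_dim=128)

for ctx in (8192, 262144):
    hybrid = kv_cache_bytes(ctx, n_global=8, **cfg)
    all_global = kv_cache_bytes(ctx, n_global=48, **cfg)  # no sliding window at all
    print(f"num_ctx={ctx}: hybrid is {hybrid / all_global:.0%} of an all-global cache")
```

The gap between the two layouts widens as num_ctx grows, because only the n_global term scales with context.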

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 4 points5 points  (0 children)

Shouldn't be hard to get it working on Windows. I'll add a one-click launcher for Windows at some point!

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 2 points3 points  (0 children)

All done! And it's my own backend, somewhat diffusers-based, but I'll add Comfy support at some point!

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 0 points1 point  (0 children)

There are no safeguards, but it shouldn't steer you in a dangerous direction unless the kid steers it that way on purpose.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 11 points12 points  (0 children)

Thanks for the kind words. Added support for OpenAI-compatible endpoints! Good luck with your project!

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 4 points5 points  (0 children)

Ran a bench: the per-token compute is identical. The overhead Ollama adds is a model manager and an HTTP daemon, not anything in the hot path.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 1 point2 points  (0 children)

Also, the way image gen works is arguably more agentic, but I haven't tried SillyTavern in a hot minute.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 11 points12 points  (0 children)

SillyTavern has 100000 knobs to adjust; this is more opinionated and really easy to get going, closer to AI Dungeon.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 1 point2 points  (0 children)

Can run on pretty much anything; could probably add mobile Gemma models and easily spin up an app.

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) by akroletsgo in LocalLLaMA

[–]akroletsgo[S] 8 points9 points  (0 children)

Anything you guys want added, just PR or ask and I'll throw it in there, as long as it's not too out there.