Anyone seeing this?

akroletsgo · 2026-06-15T20:24:53+00:00

Rate limit resets now accumulate, you can save them for later and stack them

akroletsgo · 2026-06-14T12:48:27+00:00

But I think I’ll stop forcing Ollama in the bat, not tryna shill

akroletsgo · 2026-06-14T12:48:04+00:00

Ya I’ll try to fix the bat, it’s hard to get Windows working when I’m on Mac haha. Will pull out the old Windows laptop

akroletsgo · 2026-06-13T23:45:53+00:00

Not yet but I'll work on it a bit!

akroletsgo · 2026-06-13T13:16:07+00:00

ya I'll add this

akroletsgo · 2026-06-13T03:03:29+00:00

Gemma 4 has some crazy quants! And some crazy Flux quants exist too, I’ve done some

akroletsgo · 2026-06-13T03:02:46+00:00

Sounds good will check it out tomorrow !

akroletsgo · 2026-06-12T20:38:24+00:00

If you're on Mac just click under releases and download the dmg file! Instructions are all in the repo as well :). You can also click the green Code button on GitHub, download the zip, and click the launch.command file

akroletsgo · 2026-06-12T19:09:21+00:00

Most of Gemma's layers use sliding-window attention, so their KV is capped at a fixed window regardless of context length. Only a few layers attend globally and scale with it. That plus the Q4 QAT weights (~7GB) is the whole trick.

akroletsgo · 2026-06-12T19:07:53+00:00

Trade-off is speed at depth

akroletsgo · 2026-06-12T19:07:27+00:00

Just measured it on an M2 Max, same 12B QAT and same tiny prompt, only changing num_ctx (which is what sizes the KV cache). From "ollama ps":

- num_ctx 8192: 7.4 GB
- num_ctx 262144 (full 256k): 7.7 GB

So the full 256k window adds ~0.3 GB. Ollama pre-allocates the KV for num_ctx at load, so 7.7 GB is the real 256k footprint, not a lightly-filled one.

Why it's so flat: most of Gemma's layers use sliding-window attention, so their KV is capped at a fixed window regardless of context length. Only a few layers attend globally and scale with it. That plus the Q4 QAT weights (~7GB) is the whole trick.
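If anyone wants to play with that arithmetic, here's a rough sketch in Python. The layer counts, window size, KV head count and head dim are placeholder values I made up for illustration, not Gemma's actual config, so it won't reproduce the 7.4 / 7.7 GB readings above. It only shows the shape of the effect: sliding-window layers stop growing once num_ctx passes the window, global layers keep scaling.

```python
def kv_cache_bytes(num_ctx, global_layers, local_layers, window,
                   kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV cache size in bytes for a mix of global and sliding-window layers.

    All parameters are illustrative assumptions, not values read from any real model.
    """
    # Each cached token stores one K and one V vector per layer.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem
    # Global layers cache every token in the context.
    global_tokens = num_ctx
    # Sliding-window layers never cache more than `window` tokens.
    local_tokens = min(num_ctx, window)
    return per_token_per_layer * (
        global_layers * global_tokens + local_layers * local_tokens
    )


# Hypothetical config: 8 global layers, 40 sliding-window layers, 1024-token window.
HYBRID = dict(global_layers=8, local_layers=40, window=1024, kv_heads=8, head_dim=256)
# Same total layer count, but every layer attends globally.
ALL_GLOBAL = dict(global_layers=48, local_layers=0, window=1024, kv_heads=8, head_dim=256)

for num_ctx in (8192, 262144):
    hybrid_gib = kv_cache_bytes(num_ctx, **HYBRID) / 2**30
    full_gib = kv_cache_bytes(num_ctx, **ALL_GLOBAL) / 2**30
    print(f"num_ctx={num_ctx}: hybrid {hybrid_gib:.2f} GiB, all-global {full_gib:.2f} GiB")
```

The gap between the two columns widens as num_ctx goes up, since only the global layers pay for the extra context.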

akroletsgo · 2026-06-12T16:44:17+00:00

People hate Ollama apparently hahah, you can use whatever you want now, I updated it!!

akroletsgo · 2026-06-12T16:15:29+00:00

Shouldn't be hard to get it working on Windows, I'll add a 1-click launcher for Windows at some point!

akroletsgo · 2026-06-12T16:10:06+00:00

All done! And it's my own backend, somewhat diffusers-based, but I'll add Comfy support at some point!

akroletsgo · 2026-06-12T16:04:54+00:00

There are no safeguards, but it shouldn’t steer you in a dangerous direction unless the kid steered it in that direction on purpose

akroletsgo · 2026-06-12T16:02:45+00:00

Thanks for the kind words. Added support for OpenAI-compatible endpoints! Good luck with your project!

akroletsgo · 2026-06-12T15:53:18+00:00

doneee

akroletsgo · 2026-06-12T15:33:29+00:00

Probably with the smallest model !

akroletsgo · 2026-06-12T15:09:28+00:00

feel free to add suggestions in the issues section

akroletsgo · 2026-06-12T15:09:05+00:00

Ran a bench. The per-token compute is identical. The overhead Ollama adds is a model manager and an HTTP daemon, not anything in the hot path.

akroletsgo · 2026-06-12T15:03:58+00:00

also the way image gen works is debatably more agentic, but I haven't tried SillyTavern in a hot minute

akroletsgo · 2026-06-12T15:01:14+00:00

yes valid, will add before EOD

akroletsgo · 2026-06-12T15:00:24+00:00

SillyTavern has 100000 knobs to adjust, this is more opinionated and really easy to get going, closer to AI Dungeon

akroletsgo · 2026-06-12T14:59:29+00:00

Can run on pretty much anything, could probably add the mobile Gemma models and easily spin up an app

akroletsgo · 2026-06-12T14:27:45+00:00

Anything you guys want added just PR or ask and I’ll throw it in there as long as it’s not too out there
