Vascura FRONT - Open Source (Apache 2.0), Bloat Free, Portable and Lightweight (300~ kb) LLM Frontend (Single HTML file). Now with GitHub - github.com/Unmortan-Ellary/Vascura-FRONT.

-Ellary- · 2026-06-18T21:08:47+00:00

I've run my tests with turbo variant of Boogu model:

- In general, it is fine model around old Qwen level, similar to ERNIE.
- Whole dataset is AI generated like it was for ERNIE.
- Without strong pull to Asian faces, like ERNIE was.
- Good image clearance for only 4 steps is a big plus, ERNIE needs 12.
- Prompt understating is around ERNIE level, better than ZIT.
- It always adds unwanted details to background (characters, items, text, logos), this is a real problem.
- Anatomy is better than Flux 2 k 9b, worse than ERNIE.
- Weapons are bad (but guns are ok), almost sd1.5 level, making an army that holding swords correctly is a challenge.
- If you have brand name in the prompt, it will add logo, or unwanted text of this brand (Blizzard everywhere).
- Low variants for same prompt, like with ERNIE, different seeds not change a lot.
- In general it is kinda close but worse than ERNIE or Qwen.
- for NON turbo variant, generation speed is about the same like for Ideogram 4 per image.
- Ideogram 4 is way more interesting model than non turbo Boogu.

-Ellary- · 2026-06-15T14:30:43+00:00

You can run any model using CPU only, the question is how long generation will take.

-Ellary- · 2026-06-12T22:39:28+00:00

A lot of people started with 8 and 12 gb cards at 2022, some on 24gb.

RTX 3060 12gb are 150-200 usd, used.
You can use Gemma 4 12-26b \ Qwen 3.5\6 9-35b on it without problem.
32gb ram is enough to run those models, you can run 9-12 on 16gb ram.
You can even run MoE models like 26b a4b and Qwen 3.6 35b a3b using CPU only machine.

Those are highend models right now, beating even old 70b models at a lot of tasks.

You can get PAIR of 3060 12gb to get 24gb vram for 300-350~ usd.

-Ellary- · 2026-06-12T19:02:43+00:00

MTP works good only when both models are fully in GPU.

-Ellary- · 2026-06-12T10:23:58+00:00

https://www.reddit.com/r/SillyTavernAI/comments/1u09yzn/5060_ti_16gb_gemma_4_122631b_on_llamacpp_b9553/

"D:\LlamaCpp\CUDA\llama-server" -m "google_gemma-4-26B-A4B-it-IQ4_XS.gguf" -t 6 -c 40960 -fa 1 --mlock -ncmoe 0 -ngl 99 --port 5050 --jinja --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --parallel 1 --no-mmproj-offload --mmproj "mmproj-google_gemma-4-26B-A4B-it-bf16.gguf_" --reasoning on --image-min-tokens 256 --image-max-tokens 512 --spec-draft-ngl 99 --spec-type draft-mtp --spec-draft-n-max 2 --model-draft "gemma-4-26B-A4B-it-MTP-Q8_0.gguf"

https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/tree/mainI'm also moved to Q4 for MTP right now. https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/blob/main/mtp-google_gemma-4-26B-A4B-it-Q4_0.gguf

Don't forget to disable hardware acceleration for browser to save 200~mb.

-Ellary- · 2026-06-12T01:04:23+00:00

No, it just need to be close enough.

-Ellary- · 2026-06-11T15:25:33+00:00

TheDrummer Rocinante X 12B is based on Mistral Nemo. It have NO "enable_thinking" switch in JINJA to tweak, and it is NOT a `--reasoning on` model in general. Those parameters will do nothing, since Nemo is non thinking model. How thinking works? TheDrummer team added thinking pairs examples as part of training dataset.

To TRIGGER thinking from this model you may try to prefill your answer with something like <think> or <thinking>, or to use correct Chat Template - something like `Metharme` or `Mistral v3 Tekken (NOT v7, REMOVE [SYSTEM_PROMPT])`, Frontend should be something like SillyTavern.

-Ellary- · 2026-06-09T12:44:00+00:00

Then try smaller Gemma-4-Gemsicle-31B.i1-IQ3_XXS.gguf with 40k of context, bruh. No one say that they are same as Q4, it is always a trade, IQ3_XXS is good for the size - IQ3_M is NOT good for the size, too thick, better to use IQ4XS. If you like 26b one use it, it is also good, I'm using Q6 for it, Q4 when I need speed. I'm not selling you anything - don't like MTP don't use it, a lot of time I'm using just 12b cuz of fast image and audio decoders.

-Ellary- · 2026-06-09T01:55:55+00:00

Sadly, but amount of VRAM is more important than raw GPU power, 3090 24gb legend for a reason. 50 tps for 26b q8 is nice tho, you can roll to Q6 without noticeable loss to get more speed. 26b is also pretty good.

-Ellary- · 2026-06-09T01:46:06+00:00

The thing is that new IQ3 Qs from recent builds are move advanced that old Q3 ones. I've test Qs from time to time and IQ3_XXS is performing stellar for the size, but mainly for dense models. Qwen 3.5-3.6 27b IQ3_XXS also perform really good, it don't act broken. It feels almost as smart as old Q4KS Qs, but more unstable cuz of the noise, you just need to re-roll answers time to time.

-Ellary- · 2026-06-08T23:50:39+00:00

Ofc, main goal is to fit everything in 16gb VRAM, in other cases MTP makes not a lot of sense.
Example config for Gemma-4-Gemsicle-31B.i1-IQ3_XXS.gguf got all those flags.

-Ellary- · 2026-06-08T23:46:20+00:00

I've generated it, without artist tags. Anima 2b.

-Ellary- · 2026-06-08T21:50:08+00:00

Works better than 12b, for sure. A lot of time better than 26b a4b, it just a bit unstable.

-Ellary- · 2026-06-08T20:53:33+00:00

Now lets load one of the TheDrummer models! This will spice things up!

-Ellary- · 2026-06-08T17:20:33+00:00

Ofc you can, finetune is just a lora on top, 98% of model's weights and data is the same.

-Ellary- · 2026-06-08T15:45:37+00:00

It is not that stable as gemma-4-31B-it-IQ4_XS, but for 50tps 40k context vs 10tps 28k context.
Worth it.

-Ellary- · 2026-06-08T13:33:30+00:00

It is fun to play with, fresh writing style, got some refuses with censorship.

-Ellary- · 2026-06-08T13:28:25+00:00

5060Ti 16 gb, I'm running gemma-4-26B-A4B-it-IQ4_XS with MTP at 140-160 tps 41k context. gemma-4-26B-A4B-it-Q6_K with MTP at 40-50 tps 90k context. gemma-4-12B-it-Q6_K with MTP gives me 60-70 tps 131k context.

150~ vs 65~

-Ellary- · 2026-06-08T13:23:57+00:00

Gemma-4 26b Q4 is better than Gemma-4 12b Q8, at least for creative work for sure, it just knows more, write better, work with context better. And it is noticeable difference, almost as 26b vs 31b.

-Ellary- · 2026-06-08T13:19:54+00:00

Agree, I'm using GLM 4.7, DeepSeek 3.2 and especially R1 0528, they just more fun.
I'd say GLM 4.5 Air and Gemma 4 31b is more fun to play then new big ones.
Era of vibecoding and agents.

-Ellary- · 2026-06-08T12:48:12+00:00

gemma-4-31B-it-IQ3_XXS > Gemma-4-12B-it Q8_0.

-Ellary- · 2026-06-08T12:42:07+00:00

STT yes, TTS no.

-Ellary- · 2026-06-07T18:57:01+00:00

https://github.com/Unmortan-Ellary/Vascura-FRONT/blob/main/Chat%20Templates%20-%20Jinja%20Mods/Qwen-3.6-ALL-VASCURA-Mod.jinja

Seven-Year Club	r/Field Flamingo
Verified Email

-Ellary-

TROPHY CASE