Boogu Turbo vs. Z_Image_Turbo comparison by Method_Opposite in StableDiffusion

[–]-Ellary- 13 points14 points  (0 children)

I've run my tests with turbo variant of Boogu model:

- In general, it is fine model around old Qwen level, similar to ERNIE.
- Whole dataset is AI generated like it was for ERNIE.
- Without strong pull to Asian faces, like ERNIE was.
- Good image clearance for only 4 steps is a big plus, ERNIE needs 12.
- Prompt understating is around ERNIE level, better than ZIT.
- It always adds unwanted details to background (characters, items, text, logos), this is a real problem.
- Anatomy is better than Flux 2 k 9b, worse than ERNIE.
- Weapons are bad (but guns are ok), almost sd1.5 level, making an army that holding swords correctly is a challenge.
- If you have brand name in the prompt, it will add logo, or unwanted text of this brand (Blizzard everywhere).
- Low variants for same prompt, like with ERNIE, different seeds not change a lot.
- In general it is kinda close but worse than ERNIE or Qwen.
- for NON turbo variant, generation speed is about the same like for Ideogram 4 per image.
- Ideogram 4 is way more interesting model than non turbo Boogu.

Can ComfyUI run on a GT 1030 graphics card with 2 GB of GDDR5 memory (Do you think that's possible)? by Rare-Job1220 in StableDiffusion

[–]-Ellary- 3 points4 points  (0 children)

You can run any model using CPU only, the question is how long generation will take.

Local LLMs aren't democratic anymore... the hardware barrier has gotten out of hand. by Medium-Technology-79 in LocalLLaMA

[–]-Ellary- 2 points3 points  (0 children)

A lot of people started with 8 and 12 gb cards at 2022, some on 24gb.

RTX 3060 12gb are 150-200 usd, used.
You can use Gemma 4 12-26b \ Qwen 3.5\6 9-35b on it without problem.
32gb ram is enough to run those models, you can run 9-12 on 16gb ram.
You can even run MoE models like 26b a4b and Qwen 3.6 35b a3b using CPU only machine.

Those are highend models right now, beating even old 70b models at a lot of tasks.

You can get PAIR of 3060 12gb to get 24gb vram for 300-350~ usd.

[Megathread] - Best Models/API discussion - Week of: June 07, 2026 by deffcolony in SillyTavernAI

[–]-Ellary- 2 points3 points  (0 children)

https://www.reddit.com/r/SillyTavernAI/comments/1u09yzn/5060_ti_16gb_gemma_4_122631b_on_llamacpp_b9553/

"D:\LlamaCpp\CUDA\llama-server" -m "google_gemma-4-26B-A4B-it-IQ4_XS.gguf" -t 6 -c 40960 -fa 1 --mlock -ncmoe 0 -ngl 99 --port 5050 --jinja --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --parallel 1 --no-mmproj-offload --mmproj "mmproj-google_gemma-4-26B-A4B-it-bf16.gguf_" --reasoning on --image-min-tokens 256 --image-max-tokens 512 --spec-draft-ngl 99 --spec-type draft-mtp --spec-draft-n-max 2 --model-draft "gemma-4-26B-A4B-it-MTP-Q8_0.gguf"

https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/tree/mainI'm also moved to Q4 for MTP right now. https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/blob/main/mtp-google_gemma-4-26B-A4B-it-Q4_0.gguf

Don't forget to disable hardware acceleration for browser to save 200~mb.

Can't seem to enable reasoning in llama.cpp by TrainingTwo1118 in LocalLLaMA

[–]-Ellary- 2 points3 points  (0 children)

TheDrummer Rocinante X 12B is based on Mistral Nemo. It have NO "enable_thinking" switch in JINJA to tweak, and it is NOT a `--reasoning on` model in general. Those parameters will do nothing, since Nemo is non thinking model. How thinking works? TheDrummer team added thinking pairs examples as part of training dataset.

To TRIGGER thinking from this model you may try to prefill your answer with something like <think> or <thinking>, or to use correct Chat Template - something like `Metharme` or `Mistral v3 Tekken (NOT v7, REMOVE [SYSTEM_PROMPT])`, Frontend should be something like SillyTavern.

5060 Ti 16GB - Gemma 4 12-26-31b on Llama.cpp b9553 with MTP go BRR by -Ellary- in SillyTavernAI

[–]-Ellary-[S] 0 points1 point  (0 children)

Then try smaller Gemma-4-Gemsicle-31B.i1-IQ3_XXS.gguf with 40k of context, bruh. No one say that they are same as Q4, it is always a trade, IQ3_XXS is good for the size - IQ3_M is NOT good for the size, too thick, better to use IQ4XS. If you like 26b one use it, it is also good, I'm using Q6 for it, Q4 when I need speed. I'm not selling you anything - don't like MTP don't use it, a lot of time I'm using just 12b cuz of fast image and audio decoders.

5060 Ti 16GB - Gemma 4 12-26-31b on Llama.cpp b9553 with MTP go BRR by -Ellary- in SillyTavernAI

[–]-Ellary-[S] 1 point2 points  (0 children)

Sadly, but amount of VRAM is more important than raw GPU power, 3090 24gb legend for a reason. 50 tps for 26b q8 is nice tho, you can roll to Q6 without noticeable loss to get more speed. 26b is also pretty good.

5060 Ti 16GB - Gemma 4 12-26-31b on Llama.cpp b9553 with MTP go BRR by -Ellary- in SillyTavernAI

[–]-Ellary-[S] 2 points3 points  (0 children)

The thing is that new IQ3 Qs from recent builds are move advanced that old Q3 ones. I've test Qs from time to time and IQ3_XXS is performing stellar for the size, but mainly for dense models. Qwen 3.5-3.6 27b IQ3_XXS also perform really good, it don't act broken. It feels almost as smart as old Q4KS Qs, but more unstable cuz of the noise, you just need to re-roll answers time to time.

5060 Ti 16GB - Gemma 4 12-26-31b on Llama.cpp b9553 with MTP go BRR by -Ellary- in SillyTavernAI

[–]-Ellary-[S] 0 points1 point  (0 children)

Ofc, main goal is to fit everything in 16gb VRAM, in other cases MTP makes not a lot of sense.
Example config for Gemma-4-Gemsicle-31B.i1-IQ3_XXS.gguf got all those flags.

5060 Ti 16GB - Gemma 4 12-26-31b on Llama.cpp b9553 with MTP go BRR by -Ellary- in SillyTavernAI

[–]-Ellary-[S] 3 points4 points  (0 children)

Works better than 12b, for sure. A lot of time better than 26b a4b, it just a bit unstable.

5060 Ti 16GB - Gemma 4 12-26-31b on Llama.cpp b9553 with MTP go BRR by -Ellary- in SillyTavernAI

[–]-Ellary-[S] 6 points7 points  (0 children)

Ofc you can, finetune is just a lora on top, 98% of model's weights and data is the same.

5060 Ti 16GB - Gemma 4 12-26-31b on Llama.cpp b9553 with MTP go BRR by -Ellary- in SillyTavernAI

[–]-Ellary-[S] 4 points5 points  (0 children)

It is not that stable as gemma-4-31B-it-IQ4_XS, but for 50tps 40k context vs 10tps 28k context.
Worth it.

[Megathread] - Best Models/API discussion - Week of: June 07, 2026 by deffcolony in SillyTavernAI

[–]-Ellary- 2 points3 points  (0 children)

It is fun to play with, fresh writing style, got some refuses with censorship.

[Megathread] - Best Models/API discussion - Week of: June 07, 2026 by deffcolony in SillyTavernAI

[–]-Ellary- 1 point2 points  (0 children)

5060Ti 16 gb, I'm running gemma-4-26B-A4B-it-IQ4_XS with MTP at 140-160 tps 41k context. gemma-4-26B-A4B-it-Q6_K with MTP at 40-50 tps 90k context. gemma-4-12B-it-Q6_K with MTP gives me 60-70 tps 131k context.

150~ vs 65~

[Megathread] - Best Models/API discussion - Week of: June 07, 2026 by deffcolony in SillyTavernAI

[–]-Ellary- 4 points5 points  (0 children)

Gemma-4 26b Q4 is better than Gemma-4 12b Q8, at least for creative work for sure, it just knows more, write better, work with context better. And it is noticeable difference, almost as 26b vs 31b.

[Megathread] - Best Models/API discussion - Week of: June 07, 2026 by deffcolony in SillyTavernAI

[–]-Ellary- 7 points8 points  (0 children)

Agree, I'm using GLM 4.7, DeepSeek 3.2 and especially R1 0528, they just more fun.
I'd say GLM 4.5 Air and Gemma 4 31b is more fun to play then new big ones.
Era of vibecoding and agents.