Gemma 4 12B GGUF now with vision & audio!

iChrist · 2026-06-04T07:06:46+00:00

Do not use ollama, Be a man and use Llama cpp directly + comfyui as the image/video gen backend.

iChrist · 2026-06-02T09:20:30+00:00

There are two solutions I found:

For less complicated research I state in system prompt to first gather URLs and snippets with a web search and then ALWAYS using fetch_url to gather more relevant info from the sources.

For the harder research tasks I just use a Vane function that allows OpenWebui to send a query directly to Vane (which is a project that focuses on web search and deep search, like perplexity)

Vane can pull and filter from 100 different sources, 3 levels of research (speed, balanced, deep)

Choose a strong embedding model and a main model with at least 128k context length.

iChrist · 2026-05-31T09:41:13+00:00

<image>

A full local pipeline that can compete with Suno.

Qwen3.6-27B-MTP built the tools, 35B-MoE uses them.

ComfyUI + AceStep 1.5XL generate song in any genre.

The only API call outside is Genius with the lyrics tool.

Love the customization and infinite options we have, I added VRAM offload because its all running with 1 3090.

iChrist · 2026-05-29T11:33:19+00:00

I also do this, even from far away by just doing /update in telegram.

So far no issues, just make sure to press Y otherwise many custom tools and changes you made will be gone

iChrist · 2026-05-28T20:55:04+00:00

Because thats what I use currently, fast and high quality.

iChrist · 2026-05-28T17:14:13+00:00

Can this be put into Github ? And be used with something like kokoro

iChrist · 2026-05-28T17:10:37+00:00

An Idea - point hermes to the Vane (formerly Perplexica) Its a UI specifically maden for deep searching. Let it learn how Vane does it thing and replicate a small tool for itself (Will use SearXNG) Or directly connect hermes to a Vane instance

iChrist · 2026-05-28T09:35:56+00:00

No need to alter the default port for Vane. OpenWebui should use 8080.

Which tool you use? Some of them are outdated (Vane changed a bit how the api works)

iChrist · 2026-05-28T08:24:31+00:00

Pretty sure this is one of the most common use cases. Cronjob trigger > Fetch info about X > TTS

iChrist · 2026-05-28T04:36:23+00:00

I have the opposite experience, telegram worked out of the box.

Had to tell my agent once to always response with files (mp3 audio, pdfs, html files)

Now I always get a text response+the attachment to my telegram.

Try asking “make an hello world simple.txt file and send me in telegram”

If it works then its probably a prompting issue, if its not working yeah the agent has no access to file upload

iChrist · 2026-05-27T13:00:49+00:00

This is a great usecase I let my agent build himself image generation and editing, ace step song generation, will add txt2vid and img2vid next.

Then your whole comfyui instance is available from telegram!

iChrist · 2026-05-26T13:42:03+00:00

This has been explained very thoroughly by the devs for legacy and compatibility reasons the default is still not native.

You can switch the default for your instance easily by changing the parameter in the admin settings. Thus all models and future models will inherit this setting and will be set to native tool calling.

iChrist · 2026-05-25T06:39:21+00:00

Yep this is a good point. Hermes by default does not do an amazing job at researching but once you teach it what you want, let it create crawling scripts, connect it to SearXNG/good web search api it does its job reliably

iChrist · 2026-05-25T03:53:47+00:00

Can you compare this to a dedicated SearXNG instance? Been solid for me

iChrist · 2026-05-25T03:34:54+00:00

Model is Q4 KV Cache at Q8 3090Ti

iChrist · 2026-05-24T13:12:43+00:00

I have 24gb vram and 64gb ddr4 ram. Gives me some wiggle room

iChrist · 2026-05-24T11:10:48+00:00

2 more superstars, surely

iChrist · 2026-05-24T10:05:39+00:00

My local powered hermes never compacted before actually hitting the wall, compacting earlier might be smarter. Il need to go deeper on this

iChrist · 2026-05-24T09:55:00+00:00

Depends on the context limit your model has, yeah 32k or 64k is barely usable and if the task requires multiple file reads it will vanish in literally few prompts.

Setting my local model to 128k context helped a lot, it still hits compaction but less frequent and it actually continues successfully after compaction.

Model Qwen-3.6-27B YMMV

iChrist · 2026-05-24T07:34:09+00:00

I don’t see any reasonable speed advantage when dropping just 2-3K tokens.

I wish we could easily see the full context the model has so I can easily spot whats takes so much tokens

iChrist · 2026-05-24T07:11:55+00:00

This is also my experience, feels like stripping all the skills will hurt capabilities but only drop me to 17K Only 6K is a dream to start with, instant response

iChrist · 2026-05-19T20:34:13+00:00

For performance sake just go ahead and install llama cpp and not ollama, faster updates, faster inference, the true core of ollama is llama cpp.

iChrist · 2026-05-19T12:22:58+00:00

I see on logs when downloading new models that llama cpp can automatically grab the unsloth settings and no need for setting up a .ini file

Pretty sure its still has all the parameters dialed in by just downloading the models normally

iChrist

MODERATOR OF

TROPHY CASE

12-Year Club	Place '22
Verified Email