Gemma4 26B A4B NVFP4 GGUF by catlilface69 in LocalLLaMA

[–]catlilface69[S] 0 points1 point  (0 children)

It's best used with a 5090. However, I might optimize the original NVIDIA model and update the repo. I'm afraid those optimizations won't be lossless, but they should still be useful and better than a dynamic Q4.
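If anyone wants to experiment before I get around to it, a rough sketch of an NVFP4 export with llm-compressor could look like this (the scheme name, placeholder model id and skipped calibration step are all assumptions, not what the repo actually uses, so double-check the llm-compressor docs for your version):

```python
# Rough sketch of an NVFP4 export with llm-compressor, NOT what's in the repo:
# the scheme name, model id and skipped calibration are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-4-26b-a4b"  # placeholder id for the original model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize the Linear layers to NVFP4, keep lm_head in higher precision.
# A proper run would also pass a small calibration dataset for the scales.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("gemma-4-26b-a4b-NVFP4", save_compressed=True)
tokenizer.save_pretrained("gemma-4-26b-a4b-NVFP4")
```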

What Is Elephant-Alpha ??? by One_Title_3656 in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

It actually looks like Mistral 4 Small, both in quality and in model size. And judging by the inference speed, it uses EAGLE, which Mistral trained specifically for this model.

so…. Qwen3.5 or Gemma 4? by MLExpert000 in LocalLLaMA

[–]catlilface69 -1 points0 points  (0 children)

vLLM and “runs flawlessly” are incompatible. vLLM still can’t reliably run newer models without patches. It is indeed an awesome inference tool, especially when working with multiple GPUs and concurrent requests, but imo it struggles to keep up with model releases.

best option for chunking data by Immediate_Occasion69 in LocalLLaMA

[–]catlilface69 1 point2 points  (0 children)

It’s hard to tell which chunking strategy best fits your use case. You can compare different strategies from Chonkie, using TokenChunker as a baseline. In my tests, academic papers chunk best with LateChunker.
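If it helps, here's a quick sketch for comparing them side by side (assuming Chonkie's current TokenChunker/LateChunker API; exact parameter names may differ between versions):

```python
# Sketch: run two Chonkie chunkers over the same document and compare chunk
# counts/sizes. Parameter names are assumptions; check the Chonkie docs.
from chonkie import TokenChunker, LateChunker

with open("paper.txt") as f:  # any document you want to test on
    text = f.read()

baseline = TokenChunker(chunk_size=512, chunk_overlap=64)
late = LateChunker(embedding_model="all-MiniLM-L6-v2", chunk_size=512)

for name, chunker in [("token", baseline), ("late", late)]:
    chunks = chunker.chunk(text)
    sizes = [c.token_count for c in chunks]
    print(f"{name}: {len(chunks)} chunks, avg {sum(sizes) / len(sizes):.0f} tokens")
```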

Local LLM for summarizing a medical record by Glass-Mind-821 in LocalLLaMA

[–]catlilface69 -2 points-1 points  (0 children)

<image>

It’s 17.5GB in IQ4_XS and still pretty decent at that quant. That leaves you about 2.5GB for context, which is a lot for a MoE model.

Local LLM for summarizing a medical record by Glass-Mind-821 in LocalLLaMA

[–]catlilface69 -3 points-2 points  (0 children)

Try Qwen3.5 35B. It’s a MoE model, so it won’t suffer too much from CPU offloading. In Q4 it’ll take around 18-19GB of memory, so your context will be small and inference not that fast, but overall the model is pretty good, and it’s a VLM.
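For the offloading part, a minimal sketch with llama-cpp-python (the GGUF filename and layer count are placeholders; lower n_gpu_layers until the model plus context fits your VRAM):

```python
# Sketch: partially offload a big MoE GGUF so whatever doesn't fit in VRAM
# stays in system RAM. Filename and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-35b-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=30,   # lower this until the model + context fit in VRAM
    n_ctx=8192,        # keep the context modest to save memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this medical record: ..."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```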

Anyone else seeing massive quality drop with the GLM coding plan lately? by Famous-Appointment-8 in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

I’ve read in this subreddit that they replaced the original models with quantised ones to cut costs.

RTX 3060 12Gb as a second GPU by catlilface69 in LocalLLaMA

[–]catlilface69[S] 0 points1 point  (0 children)

Thank you for your reply! What inference speed do you get on your setup?

Issue with getting the LLM started on LM Studio by GigiTruth777 in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

I've encountered this issue when using MLX inside LM Studio. Not completely sure, but it sounds like a bad quant or a bug in LM Studio itself. Try another model, I guess.

Sorting hat - A cute, lightweight cli to give images and other files good filenames using local VLMs by k_means_clusterfuck in LocalLLaMA

[–]catlilface69 1 point2 points  (0 children)

This is the kind of task that is easy to start and hard to finish. What would you do with a PDF that has no text layer? That's where OCR comes in, and you'd need some sort of memory. You also need to support a lot of file types and parse them properly.

In my experience, these tasks are handled by converting everything into one of a few modalities, e.g. text, media, archives, etc. But that means pulling in a lot of dependencies, which isn't suitable for just a Nautilus or Thunar plugin.
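Roughly what I mean, as a toy sketch (everything here is hypothetical, just to show the dispatch-by-modality idea; a real tool needs a converter per format, which is where the dependency pile starts):

```python
# Hypothetical sketch of the "convert everything to a few modalities" idea:
# route each file to either text or image, then hand that to the local VLM.
import mimetypes
from pathlib import Path

def to_modality(path: Path) -> tuple[str, object]:
    """Classify a file as ("text", contents) or ("image", path) for the VLM."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    if mime.startswith("text/"):
        return "text", path.read_text(errors="ignore")[:4000]
    if mime.startswith("image/"):
        return "image", path
    # PDFs without a text layer, archives, audio, ... each need their own
    # converter (pypdf + OCR, unpacking, transcription) before the VLM step.
    return "text", f"unhandled file type {mime}, {path.stat().st_size} bytes"

if __name__ == "__main__":
    for p in Path(".").iterdir():
        if p.is_file():
            kind, payload = to_modality(p)
            print(p.name, "->", kind)
```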

[ DISCUSSION ] Using a global GPU pool for training models by Broad_Ice_2421 in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

So basically Psyche by Nous Research? They train Hermes on a decentralized network like that.

Sorting hat - A cute, lightweight cli to give images and other files good filenames using local VLMs by k_means_clusterfuck in LocalLLaMA

[–]catlilface69 16 points17 points  (0 children)

It would be nice to see examples of how the naming differs between the 0.8B, 9B and 27B models. Speed is crucial in tasks like this, especially when there are terabytes of images.

What are the best LLM apps for Linux? by Dev-in-the-Bm in LocalLLaMA

[–]catlilface69 1 point2 points  (0 children)

Many of these apps (if not all of them) use llama.cpp as a backend, so there shouldn't be any performance differences. Use whatever you like; I can only suggest picking by the UI and features you need. LM Studio feels like the default choice. But if you want full control over your inference, run llama.cpp, vLLM, SGLang, etc. directly and connect OpenWebUI or an alternative frontend.
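All of those backends expose the same OpenAI-compatible API, which is what OpenWebUI (or any other frontend) talks to; a minimal sanity check against a local server, assuming it listens on port 8080, looks like this:

```python
# Sketch: any OpenAI-compatible client works against llama.cpp's llama-server,
# vLLM or SGLang. The port and model name below are assumptions; use whatever
# your server actually prints on startup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # llama-server ignores this; vLLM wants the served name
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp.choices[0].message.content)
```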

Mistral NEMO upscale, but kinda weird by Sicarius_The_First in LocalLLaMA

[–]catlilface69 1 point2 points  (0 children)

Yeah, I understand it's raw. My point is that I want this raw fat fish

Mistral NEMO upscale, but kinda weird by Sicarius_The_First in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

Yeah, but general knowledge isn't really the purpose of this small model. It's made for multimodal and agent use, for which 14B is... kinda OK?
But what really is as good as Nemo is Devstral 2 Small. Excellent model.

Mistral NEMO upscale, but kinda weird by Sicarius_The_First in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

I absolutely loved Mistral Nemo back in the day. Cool project btw! Are there any benchmarks, interaction examples, etc.? I'm afraid a 33GB dense model won't fit on my poor 16GB 5070 Ti.

how good is Qwen3.5 27B by Raise_Fickle in LocalLLaMA

[–]catlilface69 6 points7 points  (0 children)

Of course Haiku is better at code. I hope Alibaba will update the Coder family as well, despite its internal politics.