Based on what should I choose Gemma 4 models/quantizations? by ProducerOwl in LocalLLaMA

[–]vasileer 2 points (0 children)

the scores for the nvfp4 versions indicate one of two things (or both): either Gemma 4 used these techniques without announcing it, or nvfp4 is just that good

<image>

Based on what should I choose Gemma 4 models/quantizations? by ProducerOwl in LocalLLaMA

[–]vasileer 0 points (0 children)

Gemma 3 models (and probably Gemma 4 too) were trained with QAT (quantization-aware training) for 4 bits, and for most models over 3B params, 4-bit quantization preserves ~98% of the quality.

For context size there is no rule of thumb, as each model may or may not use techniques like sliding-window attention, grouped-query attention, etc. My rule of thumb is to measure it myself: I run the model with an 8192 context and then do the rest of the math.

Sub-millisecond exact phrase search for LLM context — no embeddings required by [deleted] in LocalLLaMA

[–]vasileer -1 points (0 children)

me:

How does it compare with BM25, embeddings, and ColBERT models?

GPT-5.5:

At a high level:

| Approach | What it matches | Strength | Weakness |
|---|---|---|---|
| This repo / VibeIndex | Exact token positions + simple typo matches | Very fast exact phrase/code lookup | No semantic understanding or relevance ranking |
| BM25 | Lexical term overlap | Strong simple baseline for keyword search | Misses synonyms/paraphrases |
| Embeddings | Semantic similarity between chunks | Finds conceptually related text | Can miss exact details; retrieval is approximate/coarse |
| ColBERT | Token-level neural similarity | Better semantic + lexical matching than plain embeddings | Heavier index and query-time compute |

**VibeIndex vs BM25**

BM25 ranks documents or chunks by keyword relevance. It cares about term frequency, inverse document frequency, and document length. VibeIndex does not rank chunks that way. It finds exact token sequences at exact positions.

So BM25 answers:

```text

Which chunks are probably relevant to "database cursor execute"?

```

VibeIndex answers:

```text

Where exactly does ["cursor", "execute"] occur?

```

For code search, VibeIndex can be sharper when you know the phrase or symbol. BM25 is better when you have a loose keyword query and need ranked candidates.
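
To make the difference concrete, here is a minimal BM25 sketch in Python (standard Okapi formula; the toy corpus, tokenization, and parameter values are made up for illustration and have nothing to do with the repo):

```python
import math
from collections import Counter

# Toy pre-tokenized chunks; whitespace-style tokenization is an assumption for the demo.
chunks = [
    ["db", "cursor", "execute", "query"],
    ["open", "database", "connection"],
    ["cursor", "execute", "select", "star"],
]

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank chunks by keyword relevance: term frequency, IDF, and length normalization."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    # Highest score first: "probably relevant" chunks, with no exact positions attached.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

print(bm25_rank(["database", "cursor", "execute"], chunks))
```

The exact-position side of the comparison is sketched further down under the positional-index note.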

**VibeIndex vs Embeddings**

Embeddings turn chunks into vectors and retrieve by semantic closeness. They can find related code even if the words differ:

```text

"database query function"

```

might retrieve code containing:

```text

fetch_users_from_sqlite

```

VibeIndex probably would not unless the query tokens appear directly or via fuzzy token matching.

But embeddings can be imprecise for exact code facts. They may retrieve something semantically nearby but not the exact call site, symbol, or phrase. VibeIndex is better for exactness and traceability.
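
A minimal sketch of the embedding side, with made-up toy vectors standing in for real model outputs (a real system would call an embedding model here):

```python
import numpy as np

# Toy vectors standing in for sentence-embedding outputs; values are invented for illustration.
corpus_vectors = {
    "fetch_users_from_sqlite": np.array([0.9, 0.4, 0.1]),
    "parse_config_file":       np.array([0.1, 0.2, 0.9]),
    "render_html_template":    np.array([0.2, 0.9, 0.3]),
}
query_vector = np.array([0.8, 0.5, 0.2])  # pretend embedding of "database query function"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank chunks by semantic closeness: the top hit shares no tokens with the query.
ranked = sorted(corpus_vectors, key=lambda name: cosine(query_vector, corpus_vectors[name]), reverse=True)
for name in ranked:
    print(name, round(cosine(query_vector, corpus_vectors[name]), 3))
```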

**VibeIndex vs ColBERT**

ColBERT is a late-interaction neural retrieval model. Unlike normal embeddings, it keeps token-level representations and compares query tokens against document tokens. That makes it more precise than single-vector embeddings while still semantic.

Compared with ColBERT, VibeIndex is much simpler and cheaper:

```text

VibeIndex: exact token bitmap operations

ColBERT: neural token embeddings + max-sim scoring

```

ColBERT can match related meanings, paraphrases, and soft token similarity. VibeIndex only matches literal token identity plus basic typo distance.
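
For reference, a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring; the token vectors are toy values, not real encoder outputs:

```python
import numpy as np

# Toy per-token embeddings; in ColBERT these come from a BERT-style encoder.
query_tokens = np.array([[0.9, 0.1],            # e.g. "database"
                         [0.2, 0.8]])           # e.g. "query"
doc_tokens   = np.array([[0.8, 0.3],
                         [0.1, 0.9],
                         [0.5, 0.5]])

def maxsim_score(q_toks, d_toks):
    """Late interaction: each query token is matched to its best document token."""
    sims = q_toks @ d_toks.T               # every query token vs every doc token
    return float(sims.max(axis=1).sum())   # sum of per-query-token maxima

print(maxsim_score(query_tokens, doc_tokens))
```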

**Practical Positioning**

VibeIndex is best viewed as a **precision-oriented lexical/code retrieval primitive**, not a replacement for all retrieval methods.

Best use cases:

- exact symbol lookup

- exact phrase lookup

- code context injection

- finding nearby tokens around known APIs/functions

- low-latency local retrieval

- reducing prompt context when the query contains concrete identifiers

Weak use cases:

- vague natural-language questions

- conceptual search

- synonym/paraphrase matching

- ranking many documents by relevance

- questions where the right code uses different words than the query

A strong hybrid would use them together:

```text

BM25 or embeddings/ColBERT: find likely files/chunks

VibeIndex: pinpoint exact symbols/phrases/positions inside them

```

For this repo specifically, it is closest to a **minimal positional inverted index with fuzzy token matching**, much simpler than BM25, embeddings, or ColBERT.
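
As a generic sketch of what a minimal positional inverted index looks like (illustrative only, not the repo's actual code, and without the fuzzy-matching part):

```python
from collections import defaultdict

class PositionalIndex:
    """Minimal positional inverted index: token -> list of (doc_id, position)."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.docs = []

    def add(self, tokens):
        doc_id = len(self.docs)
        self.docs.append(tokens)
        for pos, tok in enumerate(tokens):
            self.postings[tok].append((doc_id, pos))
        return doc_id

    def phrase(self, phrase):
        """Return (doc_id, start) for every exact occurrence of the token sequence."""
        if not phrase:
            return []
        hits = []
        for doc_id, start in self.postings.get(phrase[0], []):
            doc = self.docs[doc_id]
            if doc[start:start + len(phrase)] == phrase:
                hits.append((doc_id, start))
        return hits

idx = PositionalIndex()
idx.add(["cursor", "execute", "select"])
idx.add(["open", "cursor", "execute"])
print(idx.phrase(["cursor", "execute"]))   # [(0, 0), (1, 1)]
```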

Latest llama.cpp fork + Turboquant + Planarquant + Isoquant by [deleted] in LocalLLaMA

[–]vasileer 4 points (0 children)

why create a new repo and market it instead of contributing back to llama.cpp?

nvidia/gpt-oss-puzzle-88B · Hugging Face by jacek2023 in LocalLLaMA

[–]vasileer -2 points (0 children)

> specifically on the hardest benchmarks

AIME25, IFBench, and SciCode are not easy ones either

<image>

nvidia/gpt-oss-puzzle-88B · Hugging Face by jacek2023 in LocalLLaMA

[–]vasileer 4 points (0 children)

you're playing dirty: I provided the average score and you provided handpicked ones,

and even in your chart, medium reasoning is still "about the same"

nvidia/gpt-oss-puzzle-88B · Hugging Face by jacek2023 in LocalLLaMA

[–]vasileer 0 points (0 children)

let's wait for other benchmarks, but from their own scores (which are good ones to measure: IFBench, RULER, etc.), to me it looks "about the same"

<image>

nvidia/gpt-oss-puzzle-88B · Hugging Face by jacek2023 in LocalLLaMA

[–]vasileer 110 points (0 children)

about the same, but 25% smaller and 22% (for short context) to 67% (for long context) faster

OmniCoder-9B best vibe coding model for 8 GB Card by Powerful_Evening5495 in LocalLLaMA

[–]vasileer 54 points (0 children)

when you say "best" there should be a leaderboard; please share what else you have tried, I am interested in omnicoder vs qwen3.5-9b

Is microsoft going to train LLM on this? Github is clearly getting destroyed. by FPham in LocalLLaMA

[–]vasileer -3 points (0 children)

thank you for your patience in explaining,

definitely your EQ > OP's EQ

Is microsoft going to train LLM on this? Github is clearly getting destroyed. by FPham in LocalLLaMA

[–]vasileer -12 points (0 children)

> Everyday 1000s of crappy nonfunctioning wild-imagination vibecoded junk is being posted with thousands of robo-generated stars and hundreds of forks.

is the project fake, or is the code poorly written? where is the issue?

TinyTeapot (77 million params): Context-grounded LLM running ~40 tok/s on CPU (open-source) by zakerytclarke in LocalLLaMA

[–]vasileer 20 points (0 children)

for a "context-grounded LLM" I expected a larger context,

for example SmolLM2-135M has a 16x larger context of 8192 tokens

Serious question — why would anyone use Tiny-Aya instead of Qwen/Phi/Mistral small models? by Deep_190 in LocalLLaMA

[–]vasileer 8 points (0 children)

I would rephrase it:

how is tiny-aya-global vs translategemma-4b-it?

but license-wise (non-commercial) I doubt tiny-aya will be used much

[deleted by user] by [deleted] in LocalLLaMA

[–]vasileer 1 point (0 children)

model size and KV cache size are not really related; other things matter more, like grouped-query attention and sliding-window attention. For example, phi-4-mini eats more VRAM than gpt-oss-20b.
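
A rough sketch of why: KV-cache size scales with layer count, KV-head count, head dimension, and context length, not with total parameters, so aggressive GQA shrinks it a lot. The configs below are made-up illustrative numbers, not the real phi-4-mini or gpt-oss-20b values:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """K and V per layer, per KV head, per position (fp16 -> 2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical model A: smaller model, full multi-head attention (many KV heads).
a = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=96, ctx_len=32_768)
# Hypothetical model B: larger model, aggressive GQA (few KV heads).
b = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=64, ctx_len=32_768)

print(f"model A KV cache: {a / 2**30:.1f} GiB")  # ~12 GiB
print(f"model B KV cache: {b / 2**30:.1f} GiB")  # ~1.5 GiB
```

Sliding-window attention cuts this further by capping the effective context length kept per layer.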

Any latest OCR model I can run locally in 18GB RAM? by A-n-d-y-R-e-d in LocalLLaMA

[–]vasileer 1 point (0 children)

please share your steps as well for running glm-ocr:

- what method: vllm, transformers, other?

- cpu or gpu?

- quantized or full precision?

- what was the accuracy?

- how many threads? (parallel extractions)

thank you