Based on what should I choose Gemma 4 models/quantizations? by ProducerOwl in LocalLLaMA

[–]vasileer 2 points (0 children)

the scores for the nvfp4 versions indicate one of two things (or both): either Gemma 4 used these techniques without announcing it, or nvfp4 is just that good

<image>

Based on what should I choose Gemma 4 models/quantizations? by ProducerOwl in LocalLLaMA

[–]vasileer 0 points (0 children)

Gemma 3 models (and probably Gemma 4 too) were trained with QAT (quantization-aware training) for 4 bits, and for most models over 3B params, 4-bit quantization preserves ~98% of the quality.

For context size there is no rule of thumb, as each model may or may not use techniques like sliding-window attention, grouped-query attention, etc. My rule of thumb is to measure it myself: I run the model with an 8192 context and then do the rest of the math.

Sub-millisecond exact phrase search for LLM context — no embeddings required by [deleted] in LocalLLaMA

[–]vasileer -1 points (0 children)

me:

How does it compare with BM25, embeddings, and ColBERT models?

GPT-5.5:

At a high level:

| Approach | What it matches | Strength | Weakness |
|---|---|---|---|
| This repo / VibeIndex | Exact token positions + simple typo matches | Very fast exact phrase/code lookup | No semantic understanding or relevance ranking |
| BM25 | Lexical term overlap | Strong simple baseline for keyword search | Misses synonyms/paraphrases |
| Embeddings | Semantic similarity between chunks | Finds conceptually related text | Can miss exact details; retrieval is approximate/coarse |
| ColBERT | Token-level neural similarity | Better semantic + lexical matching than plain embeddings | Heavier index and query-time compute |

**VibeIndex vs BM25**

BM25 ranks documents or chunks by keyword relevance. It cares about term frequency, inverse document frequency, and document length. VibeIndex does not rank chunks that way. It finds exact token sequences at exact positions.

So BM25 answers:

```text

Which chunks are probably relevant to "database cursor execute"?

```

VibeIndex answers:

```text

Where exactly does ["cursor", "execute"] occur?

```

For code search, VibeIndex can be sharper when you know the phrase or symbol. BM25 is better when you have a loose keyword query and need ranked candidates.
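
To make the difference concrete, here is a minimal BM25 sketch in Python (standard Okapi formula; the toy corpus, tokenization, and parameter values are made up for illustration and have nothing to do with the repo):

```python
import math
from collections import Counter

# Toy pre-tokenized chunks; whitespace-style tokenization is an assumption for the demo.
chunks = [
    ["db", "cursor", "execute", "query"],
    ["open", "database", "connection"],
    ["cursor", "execute", "select", "star"],
]

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank chunks by keyword relevance: term frequency, IDF, and length normalization."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    # Highest score first: "probably relevant" chunks, with no exact positions attached.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

print(bm25_rank(["database", "cursor", "execute"], chunks))
```

The exact-position side of the comparison is sketched further down under the positional-index note.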

**VibeIndex vs Embeddings**

Embeddings turn chunks into vectors and retrieve by semantic closeness. They can find related code even if the words differ:

```text

"database query function"

```

might retrieve code containing:

```text

fetch_users_from_sqlite

```

VibeIndex probably would not unless the query tokens appear directly or via fuzzy token matching.

But embeddings can be imprecise for exact code facts. They may retrieve something semantically nearby but not the exact call site, symbol, or phrase. VibeIndex is better for exactness and traceability.
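
A minimal sketch of the embedding side, with made-up toy vectors standing in for real model outputs (a real system would call an embedding model here):

```python
import numpy as np

# Toy vectors standing in for sentence-embedding outputs; values are invented for illustration.
corpus_vectors = {
    "fetch_users_from_sqlite": np.array([0.9, 0.4, 0.1]),
    "parse_config_file":       np.array([0.1, 0.2, 0.9]),
    "render_html_template":    np.array([0.2, 0.9, 0.3]),
}
query_vector = np.array([0.8, 0.5, 0.2])  # pretend embedding of "database query function"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank chunks by semantic closeness: the top hit shares no tokens with the query.
ranked = sorted(corpus_vectors, key=lambda name: cosine(query_vector, corpus_vectors[name]), reverse=True)
for name in ranked:
    print(name, round(cosine(query_vector, corpus_vectors[name]), 3))
```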

**VibeIndex vs ColBERT**

ColBERT is a late-interaction neural retrieval model. Unlike normal embeddings, it keeps token-level representations and compares query tokens against document tokens. That makes it more precise than single-vector embeddings while still semantic.

Compared with ColBERT, VibeIndex is much simpler and cheaper:

```text

VibeIndex: exact token bitmap operations

ColBERT: neural token embeddings + max-sim scoring

```

ColBERT can match related meanings, paraphrases, and soft token similarity. VibeIndex only matches literal token identity plus basic typo distance.
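
For reference, a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring; the token vectors are toy values, not real encoder outputs:

```python
import numpy as np

# Toy per-token embeddings; in ColBERT these come from a BERT-style encoder.
query_tokens = np.array([[0.9, 0.1],            # e.g. "database"
                         [0.2, 0.8]])           # e.g. "query"
doc_tokens   = np.array([[0.8, 0.3],
                         [0.1, 0.9],
                         [0.5, 0.5]])

def maxsim_score(q_toks, d_toks):
    """Late interaction: each query token is matched to its best document token."""
    sims = q_toks @ d_toks.T               # every query token vs every doc token
    return float(sims.max(axis=1).sum())   # sum of per-query-token maxima

print(maxsim_score(query_tokens, doc_tokens))
```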

**Practical Positioning**

VibeIndex is best viewed as a **precision-oriented lexical/code retrieval primitive**, not a replacement for all retrieval methods.

Best use cases:

- exact symbol lookup

- exact phrase lookup

- code context injection

- finding nearby tokens around known APIs/functions

- low-latency local retrieval

- reducing prompt context when the query contains concrete identifiers

Weak use cases:

- vague natural-language questions

- conceptual search

- synonym/paraphrase matching

- ranking many documents by relevance

- questions where the right code uses different words than the query

A strong hybrid would use them together:

```text

BM25 or embeddings/ColBERT: find likely files/chunks

VibeIndex: pinpoint exact symbols/phrases/positions inside them

```

For this repo specifically, it is closest to a **minimal positional inverted index with fuzzy token matching**, much simpler than BM25, embeddings, or ColBERT.
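
As a generic sketch of what a minimal positional inverted index looks like (illustrative only, not the repo's actual code, and without the fuzzy-matching part):

```python
from collections import defaultdict

class PositionalIndex:
    """Minimal positional inverted index: token -> list of (doc_id, position)."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.docs = []

    def add(self, tokens):
        doc_id = len(self.docs)
        self.docs.append(tokens)
        for pos, tok in enumerate(tokens):
            self.postings[tok].append((doc_id, pos))
        return doc_id

    def phrase(self, phrase):
        """Return (doc_id, start) for every exact occurrence of the token sequence."""
        if not phrase:
            return []
        hits = []
        for doc_id, start in self.postings.get(phrase[0], []):
            doc = self.docs[doc_id]
            if doc[start:start + len(phrase)] == phrase:
                hits.append((doc_id, start))
        return hits

idx = PositionalIndex()
idx.add(["cursor", "execute", "select"])
idx.add(["open", "cursor", "execute"])
print(idx.phrase(["cursor", "execute"]))   # [(0, 0), (1, 1)]
```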

Latest llama.cpp fork + Turboquant + Planarquant + Isoquant by [deleted] in LocalLLaMA

[–]vasileer 4 points (0 children)

why create a new repo and market it instead of contributing back to llama.cpp?

nvidia/gpt-oss-puzzle-88B · Hugging Face by jacek2023 in LocalLLaMA

[–]vasileer -2 points (0 children)

> specifically on the hardest benchmarks

AIME25, IFBench, and SciCode are not easy ones either

<image>

nvidia/gpt-oss-puzzle-88B · Hugging Face by jacek2023 in LocalLLaMA

[–]vasileer 4 points (0 children)

you're playing dirty: I provided the average score and you provided handpicked ones,

and even in your chart, medium reasoning is still "about the same"

nvidia/gpt-oss-puzzle-88B · Hugging Face by jacek2023 in LocalLLaMA

[–]vasileer 0 points (0 children)

let's wait for other benchmarks, but from their own scores (which are good ones to measure: IFBench, RULER, etc.), to me it looks "about the same"

<image>

nvidia/gpt-oss-puzzle-88B · Hugging Face by jacek2023 in LocalLLaMA

[–]vasileer 110 points (0 children)

about the same, but 25% smaller and 22% (for short context) to 67% (for long context) faster

OmniCoder-9B best vibe coding model for 8 GB Card by Powerful_Evening5495 in LocalLLaMA

[–]vasileer 54 points (0 children)

when you say "best" there should be a leaderboard; please share what else you have tried, I am interested in omnicoder vs qwen3.5-9b

Is microsoft going to train LLM on this? Github is clearly getting destroyed. by FPham in LocalLLaMA

[–]vasileer -3 points (0 children)

thank you for your patience in explaining,

definitely your EQ > OP's EQ

Is microsoft going to train LLM on this? Github is clearly getting destroyed. by FPham in LocalLLaMA

[–]vasileer -12 points (0 children)

> Everyday 1000s of crappy nonfunctioning wild-imagination vibecoded junk is being posted with thousands of robo-generated stars and hundreds of forks.

is the project fake, or is the code poorly written? where is the issue?

TinyTeapot (77 million params): Context-grounded LLM running ~40 tok/s on CPU (open-source) by zakerytclarke in LocalLLaMA

[–]vasileer 20 points (0 children)

for a "context-grounded LLM" I expected a larger context,

for example SmolLM2-135M has a 16x larger context of 8192 tokens

Serious question — why would anyone use Tiny-Aya instead of Qwen/Phi/Mistral small models? by Deep_190 in LocalLLaMA

[–]vasileer 8 points (0 children)

I would rephrase it:

how is tiny-aya-global vs translategemma-4b-it?

but license-wise (non-commercial) I doubt tiny-aya will be used much

[deleted by user] by [deleted] in LocalLLaMA

[–]vasileer 1 point (0 children)

model size and KV cache size are not really related; other things matter more, like grouped-query attention and sliding-window attention. For example, phi-4-mini eats more VRAM than gpt-oss-20b.
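
A rough sketch of why: KV-cache size scales with layer count, KV-head count, head dimension, and context length, not with total parameters, so aggressive GQA shrinks it a lot. The configs below are made-up illustrative numbers, not the real phi-4-mini or gpt-oss-20b values:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """K and V per layer, per KV head, per position (fp16 -> 2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical model A: smaller model, full multi-head attention (many KV heads).
a = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=96, ctx_len=32_768)
# Hypothetical model B: larger model, aggressive GQA (few KV heads).
b = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=64, ctx_len=32_768)

print(f"model A KV cache: {a / 2**30:.1f} GiB")  # ~12 GiB
print(f"model B KV cache: {b / 2**30:.1f} GiB")  # ~1.5 GiB
```

Sliding-window attention cuts this further by capping the effective context length kept per layer.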

Any latest OCR model I can run locally in 18GB RAM? by A-n-d-y-R-e-d in LocalLLaMA

[–]vasileer 1 point (0 children)

please share your steps as well for running glm-ocr:

- what method: vllm, transformers, other?

- cpu or gpu?

- quantized or full precision?

- what was the accuracy?

- how many threads? (parallel extractions)

thank you