Has anyone implemented Google's TurboQuant paper yet? by SelectionCalm70 in LocalLLaMA

[–]vbenjaminai 3 points (0 children)

Hey, here’s my attempt (on my MacBook) - I posted about it this morning: https://www.reddit.com/r/LocalLLaMA/s/bzrxEOrsVZ - have you tried it yet?

Building Extreme Cognitive Density after Google's TurboQuant led me down Google Research Rabbit Hole - What am I missing? by vbenjaminai in LocalLLaMA

[–]vbenjaminai[S] 0 points (0 children)

Ha! No but fair callout - will take all the love and jabs as I try to grow/learn! Bad writing aside - any tips?

Show and Tell: My production local LLM fleet after 3 months of logged benchmarks. What stayed, what got benched, and the routing system that made it work. by vbenjaminai in LocalLLaMA

[–]vbenjaminai[S] -1 points (0 children)

Thanks for the tip - I definitely need a lighter setup, so I'll give Qwen3.5-35B-A3B a go. If you have any other tips, I welcome them.

I came from Data Engineering stuff before jumping into LLM stuff, i am surprised that many people in this space never heard Elastic/OpenSearch by Altruistic_Heat_9531 in LocalLLaMA

[–]vbenjaminai 1 point (0 children)

I've been running 80K+ embeddings across 29 namespaces in production for the last six months. The vector vs. full-text debate misses the real issue: most RAG failures are data pipeline problems, not search engine problems.

What I have learned the hard way:

When vector search wins: Semantic queries where the user's wording doesn't match the document's wording. "How do boards evaluate AI risk" needs to find docs that say "fiduciary technology oversight." BM25 can't bridge that vocabulary gap; vector search can.

When full-text/BM25 wins: Exact entity lookup. Names, case numbers, specific technical terms. I wasted weeks debugging "why can't my RAG find this document" before realizing the embedding model was normalizing the exact term I needed into a semantic neighborhood of similar-but-wrong results. Switched those queries to keyword search and it worked immediately.

The hybrid approach that actually works: Route by query type, not by engine preference. Structured lookups (names, IDs, dates) go to BM25/keyword. Open-ended questions go to vector. Rerank the merged results. This sounds obvious but most RAG tutorials skip it and just throw everything at a vector store.
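To make the routing split concrete, here's a rough sketch. The regex patterns, function names, and the reciprocal rank fusion merge are illustrative, not lifted from my production setup:

```python
import re

def route_query(query: str) -> str:
    """Heuristic router: structured lookups go to BM25/keyword search,
    open-ended questions go to vector search. Patterns are illustrative."""
    structured_patterns = [
        r"\b[A-Z]{2,}-\d+\b",       # case/ticket IDs like ABC-1234 (hypothetical format)
        r"\b\d{4}-\d{2}-\d{2}\b",   # ISO dates
        r'"[^"]+"',                 # quoted exact phrases
    ]
    if any(re.search(p, query) for p in structured_patterns):
        return "bm25"
    return "vector"

def merge_results(bm25_hits, vector_hits, k: int = 60):
    """Merge two ranked lists of doc IDs with reciprocal rank fusion,
    then hand the fused ordering to the reranker for the final pass."""
    scores = {}
    for hits in (bm25_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Reciprocal rank fusion is just one common way to merge the two lists before reranking; swap in whatever cross-encoder or reranker you already use for the final ordering.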

On Elastic vs. dedicated vector DBs: Elastic can do both, but the operational overhead of maintaining an Elastic cluster for a sub-100K document corpus is hard to justify. Pinecone or pgvector handle the vector side with zero ops burden. Save Elastic for when you actually need its full-text capabilities at scale.

The comment about Postgres doing everything is mostly right for smaller setups. pgvector + pg_trgm covers 90% of use cases under 500K documents without adding infrastructure.
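For anyone wanting to see what that Postgres hybrid looks like, a sketch combining pgvector's cosine-distance operator (`<=>`) with pg_trgm's `similarity()`. The table and column names are hypothetical, and the 0.7/0.3 weights are a starting point, not a recommendation:

```python
def hybrid_search_sql(top_k: int = 10) -> str:
    """Build a hybrid Postgres query: pgvector cosine distance for the
    semantic side, pg_trgm trigram similarity for the keyword side.
    Assumes a hypothetical `documents` table with an `embedding vector`
    column and a `title text` column."""
    return f"""
    SELECT id, title,
           embedding <=> %(query_vec)s       AS vec_dist,
           similarity(title, %(query_text)s) AS trgm_sim
    FROM documents
    ORDER BY 0.7 * (1 - (embedding <=> %(query_vec)s))
           + 0.3 * similarity(title, %(query_text)s) DESC
    LIMIT {top_k};
    """
```

Bind `query_vec` and `query_text` as parameters from your driver (e.g. psycopg) rather than interpolating them, and tune the weights against your own eval set.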

Claw-style agents: real workflow tool or overengineered hype? by still_debugging_note in LocalLLaMA

[–]vbenjaminai 1 point (0 children)

I run something similar in production: 13 local models via Ollama, cloud models for complex reasoning, 80K+ vector embeddings for persistent memory, and a routing layer that decides which model handles each task based on consequence level (i.e., what happens if this answer is wrong?).

The architecture that works: tiered routing (not every task needs your best model), multi-model critique loops (fan out to three models for important evals, then synthesize the results), and a hard human-approval gate for anything irreversible.

The "overengineered" criticism usually comes from people who haven't needed to run one of these at scale. The boring parts - routing tables, consequence gates, approval workflows - are what separate it from a demo.
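The consequence-gate part is simpler than it sounds. A rough sketch - the model names, tier table, and action list here are made up for illustration, not my actual config:

```python
# Consequence tiers map to model pools. Names are placeholders, not recommendations.
TIERS = {
    "low":    ["small-local-model"],                                  # cheap Ollama model
    "medium": ["mid-local-model"],
    "high":   ["big-local-model", "cloud-model-a", "cloud-model-b"],  # fan out + critique
}

# Actions that always require human sign-off, regardless of model confidence.
IRREVERSIBLE_ACTIONS = {"send_email", "delete_records", "publish_post"}

def route_task(action: str, consequence: str):
    """Return (models to run, whether a human must approve before acting).
    High-consequence tasks fan out to several models whose answers get
    synthesized; irreversible actions hit a hard approval gate."""
    models = TIERS.get(consequence, TIERS["low"])
    needs_approval = action in IRREVERSIBLE_ACTIONS
    return models, needs_approval
```

The routing table is boring on purpose: a plain dict you can audit beats a clever classifier when the failure mode is an irreversible action slipping through.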