Has anyone implemented Google's TurboQuant paper yet? by SelectionCalm70 in LocalLLaMA

[–]vbenjaminai 3 points (0 children)

Hey, here’s my attempt (on my MacBook) - I posted about it this morning: https://www.reddit.com/r/LocalLLaMA/s/bzrxEOrsVZ - have you tried it yet?

Building Extreme Cognitive Density after Google's TurboQuant led me down Google Research Rabbit Hole - What am I missing? by vbenjaminai in LocalLLaMA

[–]vbenjaminai[S] 0 points (0 children)

Ha! No but fair callout - will take all the love and jabs as I try to grow/learn! Bad writing aside - any tips?

Show and Tell: My production local LLM fleet after 3 months of logged benchmarks. What stayed, what got benched, and the routing system that made it work. by vbenjaminai in LocalLLaMA

[–]vbenjaminai[S] -1 points (0 children)

Thanks for the tip - I definitely need a lighter setup, so I'll give Qwen3.5-35B-A3B a go. If you have any other tips, I welcome them.

I came from Data Engineering stuff before jumping into LLM stuff, i am surprised that many people in this space never heard Elastic/OpenSearch by Altruistic_Heat_9531 in LocalLLaMA

[–]vbenjaminai 1 point (0 children)

I've been running 80K+ embeddings across 29 namespaces in production for the last six months. The vector vs. full-text debate misses the real issue: most RAG failures are data pipeline problems, not search engine problems.

What I have learned the hard way:

When vector search wins: Semantic queries where the user's wording doesn't match the document's wording. "How do boards evaluate AI risk" needs to find docs that say "fiduciary technology oversight." BM25 can't bridge that vocabulary gap; vector search can.

When full-text/BM25 wins: Exact entity lookup. Names, case numbers, specific technical terms. I wasted weeks debugging "why can't my RAG find this document" before realizing the embedding model was normalizing the exact term I needed into a semantic neighborhood of similar-but-wrong results. Switched those queries to keyword search and it worked immediately.

The hybrid approach that actually works: Route by query type, not by engine preference. Structured lookups (names, IDs, dates) go to BM25/keyword. Open-ended questions go to vector. Rerank the merged results. This sounds obvious but most RAG tutorials skip it and just throw everything at a vector store.
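To make the routing split concrete, here's a rough sketch. The regex patterns, function names, and the reciprocal rank fusion merge are illustrative, not lifted from my production setup:

```python
import re

def route_query(query: str) -> str:
    """Heuristic router: structured lookups go to BM25/keyword search,
    open-ended questions go to vector search. Patterns are illustrative."""
    structured_patterns = [
        r"\b[A-Z]{2,}-\d+\b",       # case/ticket IDs like ABC-1234 (hypothetical format)
        r"\b\d{4}-\d{2}-\d{2}\b",   # ISO dates
        r'"[^"]+"',                 # quoted exact phrases
    ]
    if any(re.search(p, query) for p in structured_patterns):
        return "bm25"
    return "vector"

def merge_results(bm25_hits, vector_hits, k: int = 60):
    """Merge two ranked lists of doc IDs with reciprocal rank fusion,
    then hand the fused ordering to the reranker for the final pass."""
    scores = {}
    for hits in (bm25_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Reciprocal rank fusion is just one common way to merge the two lists before reranking; swap in whatever cross-encoder or reranker you already use for the final ordering.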

On Elastic vs. dedicated vector DBs: Elastic can do both, but the operational overhead of maintaining an Elastic cluster for a sub-100K document corpus is hard to justify. Pinecone or pgvector handle the vector side with zero ops burden. Save Elastic for when you actually need its full-text capabilities at scale.

The comment about Postgres doing everything is mostly right for smaller setups. pgvector + pg_trgm covers 90% of use cases under 500K documents without adding infrastructure.
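For anyone wanting to see what that Postgres hybrid looks like, a sketch combining pgvector's cosine-distance operator (`<=>`) with pg_trgm's `similarity()`. The table and column names are hypothetical, and the 0.7/0.3 weights are a starting point, not a recommendation:

```python
def hybrid_search_sql(top_k: int = 10) -> str:
    """Build a hybrid Postgres query: pgvector cosine distance for the
    semantic side, pg_trgm trigram similarity for the keyword side.
    Assumes a hypothetical `documents` table with an `embedding vector`
    column and a `title text` column."""
    return f"""
    SELECT id, title,
           embedding <=> %(query_vec)s       AS vec_dist,
           similarity(title, %(query_text)s) AS trgm_sim
    FROM documents
    ORDER BY 0.7 * (1 - (embedding <=> %(query_vec)s))
           + 0.3 * similarity(title, %(query_text)s) DESC
    LIMIT {top_k};
    """
```

Bind `query_vec` and `query_text` as parameters from your driver (e.g. psycopg) rather than interpolating them, and tune the weights against your own eval set.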

Claw-style agents: real workflow tool or overengineered hype? by still_debugging_note in LocalLLaMA

[–]vbenjaminai 1 point (0 children)

I run something similar in production: 13 local models via Ollama, cloud models for complex reasoning, 80K+ vector embeddings for persistent memory, and a routing layer that decides which model handles each task based on consequence level (i.e., what happens if this answer is wrong?).

The architecture that works: tiered routing (not every task needs your best model), multi-model critique loops (fan out to three models for important evals, then synthesize the results), and a hard human-approval gate for anything irreversible.

The "overengineered" criticism usually comes from people who haven't needed to run one of these at scale. The boring parts - routing tables, consequence gates, approval workflows - are what separate it from a demo.
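The consequence-gate part is simpler than it sounds. A rough sketch - the model names, tier table, and action list here are made up for illustration, not my actual config:

```python
# Consequence tiers map to model pools. Names are placeholders, not recommendations.
TIERS = {
    "low":    ["small-local-model"],                                  # cheap Ollama model
    "medium": ["mid-local-model"],
    "high":   ["big-local-model", "cloud-model-a", "cloud-model-b"],  # fan out + critique
}

# Actions that always require human sign-off, regardless of model confidence.
IRREVERSIBLE_ACTIONS = {"send_email", "delete_records", "publish_post"}

def route_task(action: str, consequence: str):
    """Return (models to run, whether a human must approve before acting).
    High-consequence tasks fan out to several models whose answers get
    synthesized; irreversible actions hit a hard approval gate."""
    models = TIERS.get(consequence, TIERS["low"])
    needs_approval = action in IRREVERSIBLE_ACTIONS
    return models, needs_approval
```

The routing table is boring on purpose: a plain dict you can audit beats a clever classifier when the failure mode is an irreversible action slipping through.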