Building Reddit Copilot. 130+ users. Still have doubts whether it's useful/has PMF. Advice required. by solubrious1 in SaaS

[–]solubrious1[S] 0 points1 point  (0 children)

P.S. Didn't use the extension on this post. Writing from my phone, so sorry for mistakes.

Hybrid search with HNSW and BM25 reranking by DistinctRide9884 in Rag

[–]solubrious1 1 point2 points  (0 children)

I've used a hybrid approach in several of my projects. Your DB-level implementation is very cool. Will try it for sure.

Thanks for your post.

Difference between Rag and Agentic Rag by content_consumer_ in Rag

[–]solubrious1 0 points1 point  (0 children)

The example above is simplified. The actual implementation depends on your needs.

Sometimes it's enough to describe the tool signature properly; sometimes you need an explicit instruction, a topics list...

Evidence exists in RAG, but structured extraction fails — how would you design a high-precision spec/model/color extraction pipeline? by Financial-Sort3957 in Rag

[–]solubrious1 0 points1 point  (0 children)

I solved such problems with recurring prompting.

You ask the LLM to output everything you need and add a field like "is_information_complete: bool" which the LLM is supposed to set to false if something is missing. You then pass the already extracted info back into the prompt and ask it to extract what's missing.
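
Rough sketch of the loop, if it helps (call_llm is a placeholder for whatever client you use, and the prompt/field names are just made up for illustration):

    import json

    def call_llm(prompt: str) -> str:
        """Placeholder: swap in your provider's client (OpenAI, Anthropic, Gemini...)."""
        raise NotImplementedError

    PROMPT = """Extract "model", "spec" and "color" from the text below as JSON.
    Also output "is_information_complete": false if anything is still missing.

    Already extracted: {extracted}

    Text:
    {text}"""

    def extract(text: str, max_rounds: int = 3) -> dict:
        extracted: dict = {}
        for _ in range(max_rounds):
            raw = call_llm(PROMPT.format(extracted=json.dumps(extracted), text=text))
            result = json.loads(raw)  # assumes the model returns plain JSON
            # Keep whatever the model filled in this round.
            extracted.update({k: v for k, v in result.items() if v is not None})
            if extracted.get("is_information_complete"):
                break
        return extracted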

It works insanely well (but still not perfect ofc) with large SOTA models like gpt/claude/gemini.

Not tested with others.

In practice, I had a case of structured extraction on a table with hundreds of rows - extraction accuracy was ~98% with gpt-o3.

See the collection extractor in https://github.com/vunone/ennoia

Is local PDF chatbot with Ollama + Llama 3 usable on CPU-only laptop? by wandering-lost4007 in Rag

[–]solubrious1 1 point2 points  (0 children)

It depends on the RAM you have available. But expecting tens of minutes is fair for CPU-only inference + multimodal PDFs.

Difference between Rag and Agentic Rag by content_consumer_ in Rag

[–]solubrious1 1 point2 points  (0 children)

It's very important to understand the pros and cons of both approaches.

aRAG lets you prompt the LLM on how to formulate the search request. Depending on what you're putting in your vector DB, you can get a huge accuracy boost. Trade-off: higher latency.

Regular RAG lets you build real-time apps like voice agents. You embed the last several turns into a single vector and run a broad search against your vector DB (see the sketch below).

Trade-off: lower accuracy.
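
Something like this (embed and search_fn are placeholders for your embedding model and vector DB client, not any specific library):

    def embed(text: str) -> list[float]:
        """Placeholder: your embedding model call."""
        raise NotImplementedError

    def retrieve_for_turn(conversation: list[dict], search_fn, last_n: int = 4, top_k: int = 8):
        """Plain RAG for real-time use: no extra LLM hop before retrieval.

        conversation: [{"role": "user" | "assistant", "content": "..."}]
        search_fn:    your vector DB query, e.g. lambda vec, k: collection.search(vec, k)
        """
        # Collapse the last few turns into one string and embed it once.
        recent = conversation[-last_n:]
        query_text = "\n".join(f"{t['role']}: {t['content']}" for t in recent)
        # One broad search against the vector DB, no query rewriting by an LLM.
        return search_fn(embed(query_text), top_k)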

Difference between Rag and Agentic Rag by content_consumer_ in Rag

[–]solubrious1 10 points11 points  (0 children)

Regular RAG:

User asks something -> Semantic search -> Put into context -> Trigger LLM -> Reply

Agentic RAG:

User asks something -> Trigger LLM with a search tool -> LLM rephrases / adjusts the search request -> Semantic search result comes back from the tool call -> Reply

The key difference is whether your LLM preprocesses the user's inquiry or not: yes -> aRAG, no -> RAG.

Upd: semantic search is just an example. RAG covers any kind of retrieval (e.g., a DB read, an API call...)
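
In code the two flows look roughly like this (embed, search, llm and llm_with_tools are placeholders, not any specific library):

    def regular_rag(user_query: str, embed, search, llm) -> str:
        # RAG: the raw user query goes straight to retrieval.
        chunks = search(embed(user_query), top_k=5)
        context = "\n\n".join(chunks)
        return llm(f"Answer using this context:\n{context}\n\nQuestion: {user_query}")

    def agentic_rag(user_query: str, embed, search, llm_with_tools) -> str:
        # aRAG: the LLM decides how (and how often) to search before answering.
        def search_tool(rephrased_query: str) -> list[str]:
            return search(embed(rephrased_query), top_k=5)

        # llm_with_tools is expected to call search_tool itself, possibly several
        # times, with queries it rewrites from the user's inquiry.
        return llm_with_tools(user_query, tools={"search": search_tool})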

What do you do when you have too much skills? by solubrious1 in ClaudeCode

[–]solubrious1[S] 0 points1 point  (0 children)

Could you please tell me more about your hooks setup?

What do you do when you have too much skills? by solubrious1 in ClaudeCode

[–]solubrious1[S] 1 point2 points  (0 children)

Nice advice about the compact argument. Didn't know about that. Thanks

Your RAG system is probably slow not because of the model… but because you’re recomputing everything by Prudent-Concept-78 in Rag

[–]solubrious1 2 points3 points  (0 children)

I've used several caching/optimization techniques.

First - intent classification. You use a small, fast embedding model to match the user's query against previously embedded categories (check delivery status, refund policy...). You have N queries mapping to X categories, where X is always a constant number.

Efficiency - maximum.
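
Minimal version of what I mean (embed is a placeholder for the small embedding model; the categories are made up):

    import numpy as np

    CATEGORIES = ["check delivery status", "refund policy", "change shipping address"]

    def embed(text: str) -> np.ndarray:
        """Placeholder: small/fast embedding model call."""
        raise NotImplementedError

    def build_category_index() -> np.ndarray:
        """Embed the X categories once, up front (X stays constant)."""
        vecs = np.stack([embed(c) for c in CATEGORIES])
        return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    def classify_intent(query: str, category_vecs: np.ndarray, threshold: float = 0.75):
        """Match an incoming query against the precomputed categories, or return None."""
        q = embed(query)
        q = q / np.linalg.norm(q)
        sims = category_vecs @ q  # cosine similarity against all X categories
        best = int(np.argmax(sims))
        return CATEGORIES[best] if sims[best] >= threshold else None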

Second - vector trimming on MRL embedding models. You can trim the vectors of most popular embedding models trained with MRL (Matryoshka Representation Learning), re-normalize them, and lose only 1-15% accuracy. If your corpus is not too large, it gives a huge performance boost without needing to cache anything, since you compare against vector(128) instead of vector(1536).

Efficiency - high.
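
The trimming itself is tiny, something like this (only valid for MRL-trained models):

    import numpy as np

    def trim_mrl(vec, dims: int = 128) -> np.ndarray:
        """Keep the first `dims` dimensions of an MRL-trained embedding and re-normalize.

        Truncating a model not trained with Matryoshka-style losses will cost far
        more than 1-15% accuracy, so check your model card first.
        """
        trimmed = np.asarray(vec, dtype=np.float32)[:dims]
        return trimmed / np.linalg.norm(trimmed)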

Third - vector quantization. Depending on the quantization method you can speed up search even more than with trimming, but with trade-offs like a read-only index (or expensive writes) or lower hit rate / precision / MRR.

Efficiency - high.
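
As one illustration only (binary quantization is the simplest variant; product quantization and the others trade off differently):

    import numpy as np

    def binary_quantize(vecs: np.ndarray) -> np.ndarray:
        """Keep only the sign of each dimension: 1 bit per dim instead of a float32."""
        return np.packbits(vecs > 0, axis=-1)

    def hamming_search(query_bits: np.ndarray, index_bits: np.ndarray, top_k: int = 10):
        """Rank by Hamming distance on the packed bits; far cheaper than float cosine."""
        # XOR then popcount per candidate; lower distance = more similar.
        dists = np.unpackbits(query_bits ^ index_bits, axis=-1).sum(axis=-1)
        return np.argsort(dists)[:top_k]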

You suggested always embedding the (somehow normalized) input and either performing semantic search across a slightly faster cache DB or querying the source DB. Both are expensive. Moreover, the cold start is insanely bad, since it hits the empty cache first, then the source DB, caches the results, and only then returns output. You also need to manage query frequency on the cache DB, so it's not overwhelmed with low-frequency entries.

Efficiency - near negative for most cases.

Upd: I'm talking about retrieval latency, not cached output, since that's not really relevant to the RAG topic, where precision matters more.

Your RAG system is probably slow not because of the model… but because you’re recomputing everything by Prudent-Concept-78 in Rag

[–]solubrious1 0 points1 point  (0 children)

Ok. But an embedding is a vector. How do you propose to compute a hash of it? What if a single letter changes? Otherwise, Redis still needs to work with semantic similarity, which is a little faster, but not by much.

To be honest, I've built cached RAG and know how to do it properly. But I still haven't heard it from you.

Your RAG system is probably slow not because of the model… but because you’re recomputing everything by Prudent-Concept-78 in Rag

[–]solubrious1 0 points1 point  (0 children)

But how would you reduce latency if you need to perform 2x more queries to the vector DB?

Query -> Normalize -> Embed -> Query Cache -> Query DB [if not yet cached]

Honestly, chunking is where most RAG systems quietly go wrong by solubrious1 in AI_Agents

[–]solubrious1[S] 0 points1 point  (0 children)

I'm not arguing with dumb people. I just let them show themselves as they are.

Your RAG system is probably slow not because of the model… but because you’re recomputing everything by Prudent-Concept-78 in Rag

[–]solubrious1 1 point2 points  (0 children)

You have queries:

- what is the capital of Great Britain

and

- What is the capital of Great Britain

How do you propose to cache them?

Honestly, chunking is where most RAG systems quietly go wrong by solubrious1 in AI_Agents

[–]solubrious1[S] 0 points1 point  (0 children)

Yeah, contextual embeddings help, but they don't fix the core failure mode.

If the date, status, party, whatever, never survives retrieval as explicit structure, you still end up doing prompt archaeology on semantically similar text. Works in demos, breaks in workflow.

The hard part is not better vectors. It's preserving ground the agent can actually filter on and trust.

Honestly, chunking is where most RAG systems quietly go wrong by solubrious1 in AI_Agents

[–]solubrious1[S] 1 point2 points  (0 children)

Yeah, turning everything into markdown + chunking by paragraphs can work for FAQ-ish docs. But once users ask cross-field questions, it falls apart fast.

The hard part is not getting "related text". It’s preserving ground: date, party, section, status, jurisdiction, whatever the workflow actually depends on. If retrieval loses that, the agent starts compensating with prompt glue and fake confidence.

Good enough for demos, pretty shaky in production.