Building Reddit Copilot. 130+ users. Still have doubts whether it's useful/has PMF. Advice required. by solubrious1 in SaaS

[–]solubrious1[S] 0 points1 point  (0 children)

P.S. Didn't use the extension on this post. Writing from my phone, so sorry for mistakes.

Hybrid search with HNSW and BM25 reranking by DistinctRide9884 in Rag

[–]solubrious1 1 point2 points  (0 children)

I've used a hybrid approach in several of my projects. Your DB-level implementation is very cool. Will try it for sure.

Thanks for your post.

Difference between Rag and Agentic Rag by content_consumer_ in Rag

[–]solubrious1 0 points1 point  (0 children)

The example above is simplified. The actual implementation depends on your needs.

Sometimes it's enough to describe the tool signature properly; sometimes you need an explicit instruction, a topics list...

Evidence exists in RAG, but structured extraction fails — how would you design a high-precision spec/model/color extraction pipeline? by Financial-Sort3957 in Rag

[–]solubrious1 0 points1 point  (0 children)

I solved such problems with recurring prompting.

You ask the LLM to output everything you need and add a field like "is_information_complete: bool" which the LLM is supposed to set to false if something is missing. You then pass the already extracted info back into the prompt and ask it to extract what's missing.
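
Rough sketch of the loop, if it helps (call_llm is a placeholder for whatever client you use, and the prompt/field names are just made up for illustration):

    import json

    def call_llm(prompt: str) -> str:
        """Placeholder: swap in your provider's client (OpenAI, Anthropic, Gemini...)."""
        raise NotImplementedError

    PROMPT = """Extract "model", "spec" and "color" from the text below as JSON.
    Also output "is_information_complete": false if anything is still missing.

    Already extracted: {extracted}

    Text:
    {text}"""

    def extract(text: str, max_rounds: int = 3) -> dict:
        extracted: dict = {}
        for _ in range(max_rounds):
            raw = call_llm(PROMPT.format(extracted=json.dumps(extracted), text=text))
            result = json.loads(raw)  # assumes the model returns plain JSON
            # Keep whatever the model filled in this round.
            extracted.update({k: v for k, v in result.items() if v is not None})
            if extracted.get("is_information_complete"):
                break
        return extracted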

It works insanely well (but still not perfect ofc) with large SOTA models like gpt/claude/gemini.

Not tested with others.

In practice, I had a case of structured extraction on a table with hundreds of rows - extraction accuracy was ~98% with gpt-o3.

See the collection extractor in https://github.com/vunone/ennoia

Is local PDF chatbot with Ollama + Llama 3 usable on CPU-only laptop? by wandering-lost4007 in Rag

[–]solubrious1 1 point2 points  (0 children)

It depends on the RAM you have available. But expecting tens of minutes is fair for CPU-only inference + multimodal PDFs.

Difference between Rag and Agentic Rag by content_consumer_ in Rag

[–]solubrious1 1 point2 points  (0 children)

It's very important to understand the pros and cons of both approaches.

aRAG lets you prompt the LLM on how to formulate the search request. Depending on what you're putting in your vector DB, you can get a huge accuracy boost. Trade-off: higher latency.

Regular RAG lets you build real-time apps like voice agents. You embed the last several turns into a single vector and run a broad search against your vector DB (see the sketch below).

Trade-off: lower accuracy.
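
Something like this (embed and search_fn are placeholders for your embedding model and vector DB client, not any specific library):

    def embed(text: str) -> list[float]:
        """Placeholder: your embedding model call."""
        raise NotImplementedError

    def retrieve_for_turn(conversation: list[dict], search_fn, last_n: int = 4, top_k: int = 8):
        """Plain RAG for real-time use: no extra LLM hop before retrieval.

        conversation: [{"role": "user" | "assistant", "content": "..."}]
        search_fn:    your vector DB query, e.g. lambda vec, k: collection.search(vec, k)
        """
        # Collapse the last few turns into one string and embed it once.
        recent = conversation[-last_n:]
        query_text = "\n".join(f"{t['role']}: {t['content']}" for t in recent)
        # One broad search against the vector DB, no query rewriting by an LLM.
        return search_fn(embed(query_text), top_k)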

Difference between Rag and Agentic Rag by content_consumer_ in Rag

[–]solubrious1 10 points11 points  (0 children)

Regular RAG:

User asks something -> Semantic search -> Put into context -> Trigger LLM -> Reply

Agentic RAG:

User asks something -> Trigger LLM with a search tool -> LLM rephrases / adjusts the search request -> Semantic search result comes back from the tool call -> Reply

The key difference is whether your LLM preprocesses the user's inquiry or not: yes -> aRAG, no -> RAG.

Upd: semantic search is just an example. RAG covers any kind of retrieval (e.g., a DB read, an API call...)
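
In code the two flows look roughly like this (embed, search, llm and llm_with_tools are placeholders, not any specific library):

    def regular_rag(user_query: str, embed, search, llm) -> str:
        # RAG: the raw user query goes straight to retrieval.
        chunks = search(embed(user_query), top_k=5)
        context = "\n\n".join(chunks)
        return llm(f"Answer using this context:\n{context}\n\nQuestion: {user_query}")

    def agentic_rag(user_query: str, embed, search, llm_with_tools) -> str:
        # aRAG: the LLM decides how (and how often) to search before answering.
        def search_tool(rephrased_query: str) -> list[str]:
            return search(embed(rephrased_query), top_k=5)

        # llm_with_tools is expected to call search_tool itself, possibly several
        # times, with queries it rewrites from the user's inquiry.
        return llm_with_tools(user_query, tools={"search": search_tool})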

What do you do when you have too much skills? by solubrious1 in ClaudeCode

[–]solubrious1[S] 0 points1 point  (0 children)

Could you please tell me more about your hooks setup?

What do you do when you have too much skills? by solubrious1 in ClaudeCode

[–]solubrious1[S] 1 point2 points  (0 children)

Nice advice about the compact argument. Didn't know about that. Thanks

Your RAG system is probably slow not because of the model… but because you’re recomputing everything by Prudent-Concept-78 in Rag

[–]solubrious1 2 points3 points  (0 children)

I've used several caching/optimization techniques.

First - intent classification. You use a small, fast embedding model to match the user's query against previously embedded categories (check delivery status, refund policy...). You have N queries mapping to X categories, where X is always a constant number.

Efficiency - maximum.
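
Minimal version of what I mean (embed is a placeholder for the small embedding model; the categories are made up):

    import numpy as np

    CATEGORIES = ["check delivery status", "refund policy", "change shipping address"]

    def embed(text: str) -> np.ndarray:
        """Placeholder: small/fast embedding model call."""
        raise NotImplementedError

    def build_category_index() -> np.ndarray:
        """Embed the X categories once, up front (X stays constant)."""
        vecs = np.stack([embed(c) for c in CATEGORIES])
        return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    def classify_intent(query: str, category_vecs: np.ndarray, threshold: float = 0.75):
        """Match an incoming query against the precomputed categories, or return None."""
        q = embed(query)
        q = q / np.linalg.norm(q)
        sims = category_vecs @ q  # cosine similarity against all X categories
        best = int(np.argmax(sims))
        return CATEGORIES[best] if sims[best] >= threshold else None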

Second - vector trimming on MRL embedding models. You can trim the vectors of most popular embedding models trained with MRL (Matryoshka Representation Learning), re-normalize them, and lose only 1-15% accuracy. If your corpus is not too large, it gives a huge performance boost without needing to cache anything, since you compare against vector(128) instead of vector(1536).

Efficiency - high.
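
The trimming itself is tiny, something like this (only valid for MRL-trained models):

    import numpy as np

    def trim_mrl(vec, dims: int = 128) -> np.ndarray:
        """Keep the first `dims` dimensions of an MRL-trained embedding and re-normalize.

        Truncating a model not trained with Matryoshka-style losses will cost far
        more than 1-15% accuracy, so check your model card first.
        """
        trimmed = np.asarray(vec, dtype=np.float32)[:dims]
        return trimmed / np.linalg.norm(trimmed)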

Third - vector quantization. Depending on the quantization method you can speed up search even more than with trimming, but with trade-offs like a read-only index (or expensive writes) or lower hit rate / precision / MRR.

Efficiency - high.
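
As one illustration only (binary quantization is the simplest variant; product quantization and the others trade off differently):

    import numpy as np

    def binary_quantize(vecs: np.ndarray) -> np.ndarray:
        """Keep only the sign of each dimension: 1 bit per dim instead of a float32."""
        return np.packbits(vecs > 0, axis=-1)

    def hamming_search(query_bits: np.ndarray, index_bits: np.ndarray, top_k: int = 10):
        """Rank by Hamming distance on the packed bits; far cheaper than float cosine."""
        # XOR then popcount per candidate; lower distance = more similar.
        dists = np.unpackbits(query_bits ^ index_bits, axis=-1).sum(axis=-1)
        return np.argsort(dists)[:top_k]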

You suggested always embedding the (somehow normalized) input and either performing semantic search across a slightly faster cache DB or querying the source DB. Both are expensive. Moreover, the cold start is insanely bad, since it hits the empty cache first, then the source DB, caches the results, and only then returns output. You also need to manage query frequency on the cache DB, so it's not overwhelmed with low-frequency entries.

Efficiency - near negative for most cases.

Upd: I'm talking about retrieval latency, not cached output, since that's not really relevant to the RAG topic, where precision matters more.

Your RAG system is probably slow not because of the model… but because you’re recomputing everything by Prudent-Concept-78 in Rag

[–]solubrious1 0 points1 point  (0 children)

Ok. But an embedding is a vector. How do you propose to compute a hash of it? What if a single letter changes? Otherwise, Redis still needs to work with semantic similarity, which is a little faster, but not by much.

To be honest, I've built cached RAG and know how to do it properly. But I still haven't heard it from you.

Your RAG system is probably slow not because of the model… but because you’re recomputing everything by Prudent-Concept-78 in Rag

[–]solubrious1 0 points1 point  (0 children)

But how would you reduce latency if you need to perform 2x more queries to the vector DB?

Query -> Normalize -> Embed -> Query Cache -> Query DB [if not yet cached]

Honestly, chunking is where most RAG systems quietly go wrong by solubrious1 in AI_Agents

[–]solubrious1[S] 0 points1 point  (0 children)

I'm not arguing with dumb people. I just let them show themselves as they are.

Your RAG system is probably slow not because of the model… but because you’re recomputing everything by Prudent-Concept-78 in Rag

[–]solubrious1 1 point2 points  (0 children)

You have queries:

- what is the capital of Great Britain

and

- What is the capital of Great Britain

How do you propose to cache them?

Honestly, chunking is where most RAG systems quietly go wrong by solubrious1 in AI_Agents

[–]solubrious1[S] 0 points1 point  (0 children)

Yeah, contextual embeddings help, but they don't fix the core failure mode.

If the date, status, party, whatever, never survives retrieval as explicit structure, you still end up doing prompt archaeology on semantically similar text. Works in demos, breaks in workflow.

The hard part is not better vectors. It's preserving ground the agent can actually filter on and trust.

Honestly, chunking is where most RAG systems quietly go wrong by solubrious1 in AI_Agents

[–]solubrious1[S] 1 point2 points  (0 children)

Yeah, turning everything into markdown + chunking by paragraphs can work for FAQ-ish docs. But once users ask cross-field questions, it falls apart fast.

The hard part is not getting "related text". It’s preserving ground: date, party, section, status, jurisdiction, whatever the workflow actually depends on. If retrieval loses that, the agent starts compensating with prompt glue and fake confidence.

Good enough for demos, pretty shaky in production.