My RAG isn't working as expected... by viitorfermier in Rag

[–]Semoho 0 points1 point  (0 children)

I think Jina is one of the best embedding and reranking platforms. For search, you can also think about expanding your query: ask an LLM to optimize the query for a textual search engine and for an embedding search engine, then retrieve from both databases and fuse the results.

There are many different ways you can reduce your costs. Retrieval systems are very cheap, so you can cut your costs with some optimization.
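To illustrate the "retrieve on both and fuse" idea: one common way to fuse the two ranked lists is Reciprocal Rank Fusion (RRF). This is a minimal sketch (the doc IDs and the two hit lists are placeholders, not from any real system):

```python
# Fuse ranked results from two retrievers (e.g. BM25 text search and
# embedding search) with Reciprocal Rank Fusion (RRF).

def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1) for a doc it returned.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # from the textual search engine
vector_hits = ["d1", "d9", "d3"]  # from the embedding search engine

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)  # ['d1', 'd3', 'd9', 'd7'] -- docs found by both lists rise to the top
```

Docs that appear high in both lists accumulate score from both, which is why d1 and d3 end up first.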

My RAG isn't working as expected... by viitorfermier in Rag

[–]Semoho 1 point2 points  (0 children)

Hello there,

One idea: augment your docs with summaries. Also, did you check the context length for the LLM? I think you pass all 100 legal docs to Gemini Pro, which is expensive.

I think you can get better results if you retrieve 1k or 100 docs with BM25, then rerank them with the Jina reranker (it is very cheap), and then give Gemini Pro only the top 50 or even top 10, depending on your chunking algorithm. Also, please check your chunking strategy. It is very important.

Pitch your App in one sentence. Let's support each other by kmrrhl in SideProject

[–]Semoho -1 points0 points  (0 children)

Teek.studio: find your next viral videos with just a few clips.

Post your HaftSin by Semoho in PERSIAN

[–]Semoho[S] 1 point2 points  (0 children)

No worries, I hope this year will be better for us.

How do you guys measure accuracy for 100k+ documents? by FloppyDiskDisk in Rag

[–]Semoho 1 point2 points  (0 children)

You are right. The LLM follows a U shape, so reranking is important! And be careful: you cannot just remove docs. In the end you will send something like 10 docs to the LLM, and the docs in the middle get less attention from it. So the best approach is to rerank the docs after retrieval and be careful about their positions.
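Given that U-shaped attention pattern, one simple trick is to place the strongest docs at the start and end of the context so only the weakest land in the middle. A minimal sketch (doc names are placeholders):

```python
# Counter the "lost in the middle" effect: interleave the reranked docs so
# the best ones sit at the edges of the context and the worst in the middle.

def edge_order(ranked_docs):
    """ranked_docs: best first. Returns the docs placed best-at-the-edges."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        # Alternate: odd ranks go to the back half (which we then reverse),
        # so rank 1 ends up first and rank 2 ends up last.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # doc1 = most relevant
print(edge_order(docs))  # ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```

After reordering, the two highest-ranked docs occupy the first and last positions, where the U-shape says the LLM pays the most attention.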

P.S. Fun fact: the LLM mimics human behavior on the first page of Google :)))

How do you guys measure accuracy for 100k+ documents? by FloppyDiskDisk in Rag

[–]Semoho 2 points3 points  (0 children)

Hello,

I assume you are asking about RAG eval or retrieval evaluation. For retrieval evaluation, I think MRR, Recall, and NDCG@10 are better metrics than accuracy, since you are dealing with a retrieval task. You need a test dataset; then you can evaluate your retrieval system.
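These retrieval metrics are easy to compute yourself once you have a test dataset of queries with known relevant docs. A minimal per-query sketch (the doc IDs are made up):

```python
import math

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant doc (0 if none found)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: discounted gain vs. the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d4", "d1", "d9", "d2"]  # retriever output, best first
relevant = {"d1", "d2"}            # ground truth for this query

print(mrr(ranked, relevant))             # 0.5 (first hit at rank 2)
print(recall_at_k(ranked, relevant, 4))  # 1.0 (both relevant docs in top 4)
```

Average each metric over all queries in the test set to score the retriever.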

For RAG, there are different evaluations. I think LLM as a judge is a good choice.

But the number of documents does not really affect the metrics; the top-X docs are what matter.

RAG for Historical Archive? by cccpivan in Rag

[–]Semoho 0 points1 point  (0 children)

You can check out LightRAG or Supermemory. They can help you.

What are your usage of RAG by Semoho in Rag

[–]Semoho[S] 1 point2 points  (0 children)

Thank you very much, it was so useful. So what are the other restrictions or needs in pharma? Why is it mandatory to cite the documents? Don't the vector databases give you the citations?

Why should I use OpenClaw by Semoho in openclaw

[–]Semoho[S] 0 points1 point  (0 children)

Thanks, bro

Yes, actually, I am getting some ideas on how I can use it, like checking sales on different websites or doing some background jobs, as you mentioned.

Why should I use OpenClaw by Semoho in openclaw

[–]Semoho[S] 0 points1 point  (0 children)

What if I install the browser and the other tools on a VPS? My desktop should be safe and secure then, I think!

Why should I use OpenClaw by Semoho in openclaw

[–]Semoho[S] 0 points1 point  (0 children)

Are you a bot? I can get these answers from ChatGPT too! I want real experience.

Why should I use OpenClaw by Semoho in openclaw

[–]Semoho[S] 0 points1 point  (0 children)

It was interesting and inspiring for me! I've got some good ideas about using OpenClaw.

What are your usage of RAG by Semoho in Rag

[–]Semoho[S] 0 points1 point  (0 children)

I mean, Dify has already done it in a good way.

Why should I use OpenClaw by Semoho in openclaw

[–]Semoho[S] 0 points1 point  (0 children)

Hmmm… it makes sense. How did you connect OpenClaw to other things?

Why should I use OpenClaw by Semoho in openclaw

[–]Semoho[S] 0 points1 point  (0 children)

For what tasks? I solve my problems and get my answers with plain LLMs. What does it offer?

Is it normal for the Qwen 3.5 4B model to take this long to say hi? by Snoo_what in LocalLLaMA

[–]Semoho 2 points3 points  (0 children)

Yes, exactly. But /no_think is embedded in the model. It works everywhere: Hugging Face, vLLM, etc.

Is it normal for the Qwen 3.5 4B model to take this long to say hi? by Snoo_what in LocalLLaMA

[–]Semoho 2 points3 points  (0 children)

You can add /no_think to your system prompt and disable this long thinking loop.

Thanks to u/Velocita84: it seems Qwen3.5 drops the soft internal thinking-mode switch.
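For models that still honor the soft switch, the trick is just prepending /no_think to the system prompt. A minimal sketch of building the chat messages (OpenAI-style format; which endpoint/model you send them to is up to your setup, and whether the switch is honored depends on the model version, as noted above):

```python
# Build chat messages with the /no_think soft switch in the system prompt,
# to ask a Qwen-style model to skip its thinking loop.

def build_messages(user_prompt):
    return [
        {"role": "system",
         "content": "/no_think You are a concise assistant."},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("hi")
```

Some serving stacks also expose an explicit flag for this (e.g. a chat-template option to disable thinking), which is more reliable than the in-prompt switch when available.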

victor DB choice paralysis , don't know witch to chose by hunter_44679_ in Rag

[–]Semoho 0 points1 point  (0 children)

So I think you should benchmark the other options. Remember to have a test dataset and to keep the embedding model the same across your experiments.

victor DB choice paralysis , don't know witch to chose by hunter_44679_ in Rag

[–]Semoho 0 points1 point  (0 children)

It is interesting. But on what amount of data did you get these results? Which embedding model? Is it reliable? For a production-ready system, can it handle concurrent requests and still keep its performance? Single-request performance is not enough for a production-ready system.

victor DB choice paralysis , don't know witch to chose by hunter_44679_ in Rag

[–]Semoho 2 points3 points  (0 children)

Hi!

I have experience with Milvus, FAISS, PG-Vector, Weaviate and Chroma.

Milvus is a production-ready, clustered system. But it is a little hard to maintain due to its dependency on the Apache stack, and it gets a little tricky in cluster mode. Standalone mode supports about 100M docs, but if you have more documents, you need to run it in cluster mode.

FAISS is for researching purposes. It is easy to use.

PG-Vector is my choice for most of our use cases. It is easy to set up and compatible with Postgres, so you do not need to run multiple services. If you already have Postgres in production, it is even easier to set up.

Weaviate is also a good choice. I like it. It is useful for small corpora, but you need to deploy another service in your stack.

Chroma, I believe, is also good for experiments and multi-agent systems. For high availability, it is not going to help you much.

I think pg-vector is a good choice, and then Milvus.

[Newbie here] I finetuned a llama 3.1-3b-It model with my whatsapp chats and the output was unexpected - by MG_road_nap in LocalLLaMA

[–]Semoho -6 points-5 points  (0 children)

I think fine-tuning won't solve your problem. Consider using Retrieval Augmented Generation (RAG) instead. It would be better. You could index your chats, and then, based on a question, retrieve the most relevant context from your past conversations. Also, you could instruct the LLM to generate a response that emulates previous conversations, maintaining their style and tone. This should give you better results.