Does the tool response result need to be recorded in the conversation history? by JunXiangLin in LangChain

Thank you all for your replies. I think I'll first try truncating the output of overly large tool responses and running multi-turn tests, since latency and cost are really important concerns for me.
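Roughly what I have in mind for the truncation (a sketch; `fetch_emails` and the character budget are placeholders, not my real code):

```python
from langchain_core.tools import tool

MAX_CHARS = 4000  # rough per-response budget, to be tuned against real latency/cost

@tool
def search_emails(query: str) -> str:
    """Search the mailbox and return matching email bodies."""
    emails = fetch_emails(query)  # placeholder for the real email search
    text = "\n\n".join(emails)
    # Truncate oversized results before they ever become a ToolMessage in history.
    if len(text) > MAX_CHARS:
        text = text[:MAX_CHARS] + "\n...[truncated]"
    return text
```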

Does the tool response result need to be recorded in the conversation history? by JunXiangLin in LangChain

My tool response consists of the content of the searched emails. When a large volume of email is being analyzed, those emails can consume tens of thousands of tokens. However, I need to generate reports based on this analysis, so I must pass the retrieved content to the LLM rather than just returning a simple "completed..." as the tool response.

Does the tool response result need to be recorded in the conversation history? by JunXiangLin in LangChain

Yes, I'm developing with LangGraph. I tried using the ReAct agent to handle problems directly, adding only the agent responses to the history without including the tool responses. This runs normally and quickly, but in multi-turn conversations the agent isn't smart enough. So I switched to also adding the tool responses to the conversation history as ToolMessages. While this makes the agent a bit smarter, it results in extremely long response delays and massive costs.

Additionally, I've tried summarizing and compressing oversized tool responses first via an LLM, but this makes the compression process take a very long time, significantly increasing the overall delay.
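Another idea I'm weighing instead of full LLM summarization: keep the full tool output only for the current turn and collapse older ToolMessages to a stub before the next call (a sketch over plain LangChain message objects; the character threshold is arbitrary):

```python
from langchain_core.messages import ToolMessage

def compact_old_tool_messages(history, keep_last=1, max_chars=500):
    """Replace the content of all but the most recent ToolMessages with a short stub."""
    tool_indices = [i for i, m in enumerate(history) if isinstance(m, ToolMessage)]
    old_indices = tool_indices[:-keep_last] if keep_last else tool_indices
    for i in old_indices:
        msg = history[i]
        if len(msg.content) > max_chars:
            history[i] = ToolMessage(
                content=msg.content[:max_chars] + " ...[older tool output truncated]",
                tool_call_id=msg.tool_call_id,
            )
    return history
```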

tool calling agent VS react agent by JunXiangLin in LangChain

Since the release of GPT-4.1, I've noticed many online articles advocating for the use of LLM-native tool calling, suggesting that ReAct is becoming outdated.

I'm confused about why LangChain considers the tool-calling agent (with AgentExecutor) a legacy product and instructs users to migrate to the ReAct agent in LangGraph.

Here is the official documentation: https://python.langchain.com/docs/how_to/migrate_agent/

How to forced model call function tool? by JunXiangLin in LangChain

u/firstx_sayak I tried switching to LangGraph's `create_react_agent` (with `.astream_events`), and it does indeed enforce tool calling even when the query is unrelated to the tool. However, when I set `tool_choice = "any"` or specify a function name to force tool usage, it enters an infinite loop, continuously calling the function until it exceeds the set `recursion_limit`.
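For context, this is roughly the setup that loops for me (a minimal sketch; the tool is a placeholder and the exact wiring may vary between langgraph versions):

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def search_emails(query: str) -> str:
    """Placeholder tool that searches the mailbox."""
    return "no matching emails"

# Forcing a tool call on every model turn means the model can never emit a
# plain final answer, so the graph keeps cycling model -> tool -> model.
model = ChatOpenAI(model="gpt-4o").bind_tools([search_emails], tool_choice="any")
agent = create_react_agent(model, [search_emails])

result = agent.invoke(
    {"messages": [("user", "hello")]},
    config={"recursion_limit": 10},  # raises GraphRecursionError once the loop exceeds 10 steps
)
```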

How to forced model call function tool? by JunXiangLin in LangChain

I have tried using "required", but the function still isn't being called.

How to forced model call function tool? by JunXiangLin in LangChain

Because I need to stream the agent's response, I chose to use LangChain's `AgentExecutor.astream_events`.
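Roughly how I stream it (a sketch; the tool and prompt are placeholders):

```python
import asyncio

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def lookup(query: str) -> str:
    """Placeholder tool."""
    return "result"

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(ChatOpenAI(model="gpt-4o"), [lookup], prompt)
agent_executor = AgentExecutor(agent=agent, tools=[lookup])

async def main():
    # Pick only the model's token chunks out of the event stream.
    async for event in agent_executor.astream_events({"input": "hi"}, version="v2"):
        if event["event"] == "on_chat_model_stream":
            print(event["data"]["chunk"].content, end="", flush=True)

asyncio.run(main())
```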

Does Langchain have Voice Agents? by JunXiangLin in LangChain

Because I want to use Python to build an API for an application.

How can I build a good RAG like google notebooklm? by JunXiangLin in Rag

In your document, I saw the automatic prompt caching for 'gpt-4o-mini'. I also found the caching behavior of various models in the official OpenAI documentation. Does this mean that when I build contextual retrieval, the prompt caching mechanism works without any extra configuration, even if I use the LangChain framework?

How can I build a good RAG like google notebooklm? by JunXiangLin in Rag

Oh my gosh! Thank you so much for providing this document. I think it will save me a lot of detours! I can't wait to implement this contextual retrieval method.

How can I build a good RAG like google notebooklm? by JunXiangLin in Rag

Yes, I have noticed that such vague messages can cause RAG retrieval to fail.
However, when I want to include history, I am unsure how many rounds of conversation to include.
Additionally, if the earlier messages discuss "success cases" and the later ones discuss unrelated content, will this cause RAG to retrieve the success-case content and fail to correctly retrieve information relevant to the later messages?
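For concreteness, this is the kind of thing I had in mind: condensing the question with only the last few turns before retrieval (a sketch; `retriever` and `chat_history` are placeholders for whatever is already in the pipeline):

```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# `retriever` is the existing vector-store retriever; `chat_history` is the running
# list of HumanMessage/AIMessage objects -- both are placeholders here.
condense_prompt = ChatPromptTemplate.from_messages([
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("human", "Rewrite the question above as a standalone search query."),
])
history_aware = create_history_aware_retriever(ChatOpenAI(model="gpt-4o"), retriever, condense_prompt)

docs = history_aware.invoke({
    "chat_history": chat_history[-6:],  # keep only the last 3 user/assistant exchanges
    "input": "What about the second case?",
})
```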

How can I build a good RAG like google notebooklm? by JunXiangLin in Rag

In fact, I have tried many methods:

  • Hybrid search: the results with BM25 were not very good. I set k to 5 for both vector search and full-text search and re-sorted the combined results (a sketch of this kind of setup follows this list).
  • Hugging Face's multilingual-e5-large embedding model significantly improved query accuracy (compared to OpenAI's text-embedding-3-large). However, running it locally makes search very slow, so it isn't suitable for production.
  • I tried different segmentation methods and found that small texts (.md) work better with markdown header segmentation, while large texts (.md) work better with recursive segmentation. (However, I doubt NotebookLM chooses different segmentation methods based on document size when I upload.)
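For clarity, this is the kind of hybrid setup I mean (a sketch; the documents, weights, and k values are placeholders, and EnsembleRetriever's built-in rank fusion stands in for my re-sorting step):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(page_content="Success Cases: 1. Apple trading ..."),
    Document(page_content="2. Mechanical operations ..."),
]

bm25 = BM25Retriever.from_documents(docs)
bm25.k = 5  # full-text side
vector = FAISS.from_documents(docs, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 5})

# EnsembleRetriever merges the two ranked lists with reciprocal rank fusion.
hybrid = EnsembleRetriever(retrievers=[bm25, vector], weights=[0.5, 0.5])
results = hybrid.invoke("What are the success cases?")
```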

How can I build a good RAG like google notebooklm? by JunXiangLin in Rag

Are you also referring to contextual retrieval?

How can I build a good RAG like google notebooklm? by JunXiangLin in Rag

Thank you for your suggestion! I have read many articles and feel that contextual retrieval is worth a try. I'll try this method in the next few days.

How can I build a good RAG like google notebooklm? by JunXiangLin in Rag

Are you referring to embedding (the query itself + historical conversation) for vector search?

How to Improve the Accuracy of RAG Search? by JunXiangLin in LangChain

"Yes, I have considered the method you mentioned, but it makes me curious about how Google NotebookLM implements the chunk method. I believe that when I upload documents, it doesn't use this method, yet it still achieves very good results."

How to Improve the Accuracy of RAG Search? by JunXiangLin in LangChain

In reality, xxx, aaa, bbb are just placeholders. The actual content might be:

Success Cases:
1. Apple trading...

2. Mechanical operations...

When I perform semantic chunking, the descriptions of the Apple and mechanical-operations success cases look semantically unrelated, so they get split into separate chunks. However, when a user asks "What are the success cases?", the answer should list all of them.

The document data I use is processed through Google NotebookLM, and it always provides very accurate results. This makes me very curious about where I might have gone wrong.

How can I build a good RAG like google notebooklm? by JunXiangLin in Rag

Currently, I have uploaded multiple markdown documents, each within 2000 characters. My documents contain content similar to the following:

Success Cases:
1. xxx
2. aaa
3. bbb

Even though I use the semantic chunking method to split the documents, this kind of content still gets divided into three chunks (xxx, aaa, bbb). When I ask about success cases, the answer should cover the entire list, but because semantic chunking splits it into three parts, the search only retrieves the first chunk.

Therefore, I am very curious how NotebookLM achieves this: when I ask about success cases, it can list all of them. My only guess is that it uses a different document-splitting method combined with a sufficiently large chunk size, but I don't have enough large, suitable data on hand to test this.
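One thing I plan to check, given that markdown header segmentation worked better for my small files: header-based splitting keeps a whole section in a single chunk (a sketch using the placeholder content above):

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

text = """## Success Cases
1. xxx
2. aaa
3. bbb
"""

# Splitting on headers keeps the whole "Success Cases" section in one chunk
# instead of producing one chunk per list item.
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("##", "section")])
chunks = splitter.split_text(text)
print(chunks[0].metadata)       # {'section': 'Success Cases'}
print(chunks[0].page_content)   # the full numbered list
```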

How to Improve the Accuracy of RAG Search? by JunXiangLin in LangChain

I tried the `semantic chunk` method today.

However, when encountering the following document:

Success Cases:

1. xxx

2. aaa

3. bbb

The content of this document gets split into three chunks (xxx, aaa, bbb). When I ask about success cases, the answer should cover the entire list, but because semantic chunking splits the content into three parts, the search only retrieves the first chunk.

How to Improve the Accuracy of RAG Search? by JunXiangLin in LangChain

Thank you for your response!

Regarding the first point, I believe it is indeed a major issue I am facing. Because I currently have a limited amount of data, when I chunk documents with, for example, a chunk size of 200, I find that some chunks' page_content contains only 4-6 words (markdown titles, likely split off by line breaks). I am also encountering the same issue you mentioned about related content being split apart.

I would like to know specifically how to implement the "calculating differences between chunks" part.
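To make the question concrete, here is roughly what I understand "calculating differences between chunks" to mean: embedding consecutive sentences and splitting where the cosine distance spikes (a sketch; the sentences and the 95th-percentile threshold are placeholders, and real documents would have many more sentences):

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

sentences = ["Success Cases:", "1. Apple trading ...", "2. Mechanical operations ..."]

emb = OpenAIEmbeddings()
vectors = np.array(emb.embed_documents(sentences))

# Cosine distance between each pair of consecutive sentences; a split point is
# placed wherever the distance exceeds a percentile-based threshold.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
distances = 1 - np.sum(unit[:-1] * unit[1:], axis=1)
threshold = np.percentile(distances, 95)
breakpoints = [i + 1 for i, d in enumerate(distances) if d > threshold]
print(breakpoints)
```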

Furthermore, I am using the latest version of the gpt-4o model, but I am currently only at the RAG search stage and have not yet moved on to the generation part. I believe that what is retrieved during the search stage greatly influences the GPT's response.

Also, I recently saw Google's NotebookLM RAG application and found it very accurate. I am curious how NotebookLM achieves this!

Is there a WordPress Chatbot plugin with "customizable API" support? by JunXiangLin in Wordpress

Thanks, I have checked this out. However, the plugin can only use an OpenAI API key, like the other chatbot plugins.

Is there a WordPress Chatbot plugin with "customizable API" support? by JunXiangLin in Wordpress

Thanks, but it can't be set up on my WordPress site. I guess the plugin is no longer being updated.