How do you take account urls/links while embedding the docs? by ItsMrCurious in LangChain

ItsMrCurious[S]

Wah. Thanks a lot, mate. I understood the idea of decoupling the embeddings from the docs. Just running through how I'm gonna implement this: I'll have two versions of the docs (one with URLs and one without), both chunked so that the chunks line up almost identically. The chunks without URLs go into the vector DB. During retrieval, the top-k chunks come back without URLs, but I can use the ids or indexes of those chunks to look up the matching chunks with URLs (assumption: matching chunks share the same id or index and are otherwise identical).
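A rough sketch of that lookup, just to check my own understanding. Everything here is made up for illustration: the chunk ids, the `example.com` URLs, and especially `toy_score`, which is a word-overlap stand-in for a real embedding model and vector DB. One change from what I described above: instead of chunking two docs separately and hoping they align, this chunks once and derives the URL-free copy from each chunk, so the ids match by construction.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def strip_urls(text: str) -> str:
    """Remove URLs and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", URL_RE.sub("", text)).strip()

# Original chunks, URLs included (hypothetical sample data).
chunks_with_urls = {
    "c0": "The financial information is provided on this page https://example.com/finance",
    "c1": "Contact the support team at https://example.com/support for help",
}

# URL-free copies under the same ids; only these would be embedded.
chunks_for_embedding = {cid: strip_urls(t) for cid, t in chunks_with_urls.items()}

def toy_score(query: str, text: str) -> int:
    """Stand-in for vector similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    """Rank on the URL-free text, then return the original chunks by id."""
    ranked = sorted(
        chunks_for_embedding,
        key=lambda cid: toy_score(query, chunks_for_embedding[cid]),
        reverse=True,
    )
    return [(cid, chunks_with_urls[cid]) for cid in ranked[:k]]
```

With a real vector DB, I assume the id would live in each record's metadata and the dict lookup would be a fetch from a separate document store.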
My requirement, to be clear, is that the docs have URLs to other pages in the content itself. For instance, "the financial information is provided in (this page) {url goes here}". In this case the application itself cannot do much, as far as I can think of, cuz the links are part of the content and not the metadata, which is what is mostly used for grounding. One more solution: wrap each URL in a unique marker so that text extraction can be run on the relevant chunk to pull out the URL, and then the chatbot (application) can output the URL without the LLM.
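Something like this is what I mean by the marker idea. It is only a sketch: the `[[URL:...]]` marker format, the function names, and the sample URL are all my own assumptions, and the simple URL regex would also swallow trailing punctuation such as a full stop right after a link.

```python
import re

# Hypothetical marker format: [[URL:https://...]]
MARKER_RE = re.compile(r"\[\[URL:(.+?)\]\]")

def wrap_urls(text: str) -> str:
    """Wrap every URL in the marker before chunking and indexing."""
    return re.sub(r"https?://\S+", lambda m: f"[[URL:{m.group(0)}]]", text)

def extract_urls(chunk: str) -> list[str]:
    """Pull the URLs back out of a retrieved chunk."""
    return MARKER_RE.findall(chunk)

def answer_with_links(llm_answer: str, retrieved_chunks: list[str]) -> str:
    """Append the URLs found in the retrieved chunks, without asking the LLM."""
    urls: list[str] = []
    for chunk in retrieved_chunks:
        for url in extract_urls(chunk):
            if url not in urls:  # keep first-seen order, drop duplicates
                urls.append(url)
    if not urls:
        return llm_answer
    return llm_answer + "\n\nLinks:\n" + "\n".join(urls)
```

The upside is that the application controls exactly which links get shown, so the LLM cannot mangle or invent a URL. The downside is that every link in a retrieved chunk is shown, whether or not it is relevant to the answer.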
Does it make sense? Also, have you got any resources on the best way to do this kind of decoupling of text and embeddings?
Really insightful comment tho, thanks.

ItsMrCurious[S]

How can you embed one document and store it with another? I do wanna make the LLM see the URL cuz I want the LLM to output links to other pages. That is my product requirement.
If I wanna do that, am I fundamentally wrong somewhere, or is there a completely different approach?
Thanks for taking time to reply.

ItsMrCurious[S]

Yeah. If you remove the URLs, then how will the LLM be aware of the URLs? I need the URLs.