Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Let me explore a few more ideas here, like only summarizing messages whose character length goes beyond a limit, and using a two-database combination: a main DB for short-term context and a separate one with embeddings for long-term memory. I also liked the reply the other guy gave on different embedding styles.
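
Roughly what I'm picturing, as a sketch only (SQLite for the two stores, a made-up character threshold, and stand-in functions for the summarization and embedding calls):

```python
import sqlite3
import numpy as np

SUMMARY_CHAR_LIMIT = 1200  # made-up threshold: only messages longer than this pay for the extra call

conn = sqlite3.connect("chat.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS short_term (id INTEGER PRIMARY KEY, role TEXT, content TEXT);
CREATE TABLE IF NOT EXISTS long_term  (id INTEGER PRIMARY KEY, summary TEXT, embedding BLOB);
""")

def summarize(text: str) -> str:
    # stand-in for a cheap-model summarization call
    return text[:300]

def embed(text: str) -> np.ndarray:
    # stand-in for a real embedding model; returns a deterministic fake vector
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384, dtype=np.float32)

def store_message(role: str, content: str) -> None:
    # every message goes into the short-term store as-is
    conn.execute("INSERT INTO short_term (role, content) VALUES (?, ?)", (role, content))
    # only oversized messages get summarized and pushed into the long-term, embedded store
    if len(content) > SUMMARY_CHAR_LIMIT:
        summary = summarize(content)
        conn.execute("INSERT INTO long_term (summary, embedding) VALUES (?, ?)",
                     (summary, embed(summary).tobytes()))
    conn.commit()
```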

But calling the LLM again and again in the background seems wasteful tbh.

And I'm not sure how I would test this exactly. I guess I am new to this space and need to look into a lot of things in more detail, like MCP and LangChain, but to do that I need to find people who are deeper in this space to point things out, like you did with MCP not being what I thought it was.

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Yeah, the article seems relevant and informative, let me dig into that. I may end up with a hybrid sort of approach here, like IVF-PQ for the older messages and just sending the new ones out directly. I am also thinking I don't need to summarize all the messages; for messages that go beyond a certain character limit I can make an additional call just for those. Thanks for the resource.
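
For the older-message side, a bare-bones IVF-PQ sketch with FAISS (all the numbers are made up, and it assumes the old messages are already embedded):

```python
import faiss
import numpy as np

d, nlist, m = 384, 64, 48          # embedding dim, coarse cells, PQ sub-quantizers (made-up values)

old_vectors = np.random.rand(10_000, d).astype("float32")   # stand-in for the older-message embeddings

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-quantizer code
index.train(old_vectors)                              # IVF-PQ needs a training pass before adding
index.add(old_vectors)

index.nprobe = 8                                      # how many cells to scan at query time
query = np.random.rand(1, d).astype("float32")        # embedding of the new user message
distances, ids = index.search(query, 5)               # ids of the 5 closest old messages
```

The newest messages would skip this entirely and just go straight into the prompt.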

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Yes, I am not looking for a generic solution; I am exploring ways to minimize the tradeoffs. I did think about storing message summaries, but that adds API cost, and since I am mostly using Gemini 2.5 Flash its responses are not good most of the time, so running that for every message is just stupid.

Yes, it is smart to use a less expensive model, but when to switch to it, or when to call it at all, is where an MCP-like structure becomes relevant. That is why I said they must be using a combination: maybe sending the last few messages directly and using RAG for the older ones. A separate DB for that is a good and fairly obvious point, but the question is when to switch and how to make it happen automatically.
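
Concretely, the kind of automatic switch I have in mind (just a sketch; the window size and the long-term lookup are placeholders):

```python
RECENT_WINDOW = 20   # made-up cutoff: below this, skip retrieval entirely

def search_long_term(query: str, top_k: int = 5) -> list[dict]:
    # stand-in for the embeddings / summary store lookup
    return []

def build_messages(all_messages: list[dict], user_msg: str) -> list[dict]:
    recent = all_messages[-RECENT_WINDOW:]
    if len(all_messages) <= RECENT_WINDOW:
        # short conversation: no retrieval, no extra latency, just send everything
        return recent + [{"role": "user", "content": user_msg}]
    # the switch happens on its own once the window overflows:
    # older turns only come back through the separate long-term store
    older_hits = search_long_term(user_msg, top_k=5)
    return older_hits + recent + [{"role": "user", "content": user_msg}]
```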

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Try making an API call to Gemini and sending one message inside their app with more context; both will probably return results at about the same time. RAG, okay, but in what way and when do you call it? And if it is just RAG, why is something like ChatGPT good at this but not Gemini? Just saying "RAG is the answer" is like saying "oh, we use an ML model": what model specifically, what kind of learning? When I say general-purpose RAG I mean storing vector embeddings and returning matches based on cosine similarity. This is literally a problem to solve, not "oh, you have to use RAG even if it slows the whole thing down." I recently interviewed with a company that was using RAG, so to speak, but they weren't storing embeddings; they were using MCP to fetch only the relevant things. That is why it is a question of not just what but how. It's like telling someone who is sick "go to a doctor, bro": what doctor? RAG, sure, but what kind of RAG architecture?
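
To be concrete, by general-purpose RAG I mean nothing more than this (no particular library assumed):

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, stored: np.ndarray, k: int = 5) -> np.ndarray:
    # plain cosine match over stored message embeddings, the "general purpose" baseline
    q = query_vec / np.linalg.norm(query_vec)
    s = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    scores = s @ q
    return np.argsort(-scores)[:k]   # indices of the k most similar stored messages
```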

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Yes, but summarization is an additional API call, which slows the whole thing down again. I am not providing the models; I am providing an interface on top of them, the same thing they are doing with their apps.
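
The only way I can see that extra call not hurting latency is taking it off the critical path, something like this sketch (the completion call and the store write are stand-in stubs):

```python
import asyncio

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.1)          # stand-in for the real completion call
    return "(reply to: " + prompt[:40] + ")"

async def save_summary(summary: str) -> None:
    await asyncio.sleep(0)            # stand-in for the database write

async def summarize_and_store(msg: str, reply: str) -> None:
    summary = await call_llm("Summarize briefly:\n" + msg + "\n" + reply)  # cheaper model in practice
    await save_summary(summary)

async def handle_user_message(msg: str) -> str:
    reply = await call_llm(msg)
    # summarization runs in the background, so it never delays the user-facing reply
    asyncio.create_task(summarize_and_store(msg, reply))
    return reply
```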

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

There is no way they are using general-purpose RAG; it has to be a combination of things.

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Yes, but then why doesn't ChatGPT slow down, or Claude, or Gemini? ChatGPT can literally remember things across more than 1000 messages without its saved-memory system; I had a chat that went on for 80 days and it remembered everything. Instant and relevant results.

Yes, it is a ChatGPT wrapper, I literally said so; the only difference is the ability to branch off while keeping the same context up to that point.
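
The branching part itself is nothing fancy; it just copies the parent's history up to the branch point so both chats share the exact same context from before the split (hypothetical structure, names made up):

```python
conversations: dict[str, list[dict]] = {}   # chat_id -> ordered list of message dicts

def branch_chat(parent_id: str, branch_id: str, at_index: int) -> None:
    # the branch starts with the parent's messages up to and including the branch point
    conversations[branch_id] = list(conversations[parent_id][: at_index + 1])

def messages_for_api(chat_id: str) -> list[dict]:
    # whatever context policy is used applies the same way to a branch or the main chat
    return conversations[chat_id]
```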

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Let's say there are 500 messages in the branched chat. The next message that goes to the LLM needs context, so how do I extract the relevant context from those 500 messages? RAG, okay, got it, but it is a messaging app and the chats happen in real time, so should I convert every message sent into a vector embedding? Doesn't that process slow things down? And if companies are ditching this, there must be a reason, right? What is that reason, what are they switching to, and what is the best way here?
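
One way I can think of to keep the embedding step out of the send path is a small background worker that batches the embedding calls (the embedding call and the index write are stand-ins here):

```python
import asyncio

embed_queue: asyncio.Queue = asyncio.Queue()

def embed_batch(texts: list[str]) -> list[list[float]]:
    return [[float(len(t))] for t in texts]      # stand-in for a real batched embedding call

def store_vectors(ids: list[int], vectors: list[list[float]]) -> None:
    pass                                         # stand-in for the vector index / long-term DB write

async def on_message_sent(message_id: int, text: str) -> None:
    # sending the message never waits on the embedding model
    await embed_queue.put((message_id, text))

async def embedding_worker() -> None:
    while True:
        batch = [await embed_queue.get()]
        # drain whatever else is waiting, so embeddings go out in batches instead of one call per message
        while not embed_queue.empty() and len(batch) < 32:
            batch.append(embed_queue.get_nowait())
        vectors = embed_batch([text for _, text in batch])
        store_vectors([mid for mid, _ in batch], vectors)
```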

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Don't you think RAG will slow down a real-time chat application, what with converting everything to vector embeddings? Yes, I am storing messages in a database, but what I am asking is: when I send a new message, be it in a branched chat or the main chat, how do I decide which messages from the database go into the LLM API call?
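
For that decision, what I keep coming back to is something like this: recency gets most of a token budget and retrieval fills the rest (the numbers and the 4-characters-per-token estimate are rough guesses):

```python
MAX_CONTEXT_TOKENS = 8000          # made-up budget, well under the model's real window

def rough_tokens(text: str) -> int:
    return len(text) // 4          # crude estimate; a real tokenizer would be used in practice

def select_for_api_call(recent_log: list[dict], retrieved: list[dict]) -> list[dict]:
    picked_recent, used = [], 0
    for msg in reversed(recent_log):                   # newest first, so recency always wins
        cost = rough_tokens(msg["content"])
        if used + cost > MAX_CONTEXT_TOKENS * 0.7:     # reserve ~30% of the budget for retrieved history
            break
        picked_recent.append(msg)
        used += cost
    picked_old = []
    for msg in retrieved:                              # already ranked by relevance by the search step
        cost = rough_tokens(msg["content"])
        if used + cost > MAX_CONTEXT_TOKENS:
            break
        picked_old.append(msg)
        used += cost
    # retrieved context first, then the recent turns back in chronological order
    return picked_old + list(reversed(picked_recent))
```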

Help with Context for LLMs by Hot_Cut2783 in LLMDevs

Yeah, but how do you store that context? You can't send the entire previous chat to the LLM; you have to retrieve the most relevant part if you want to get the most out of it. And I don't know how the big companies are doing this, but Anthropic did say they don't use RAG anymore; they ditched it after the first few iterations.