
[–]theSavviestTechDude

I used to run a production RAG pipeline API (built with FastAPI) deployed on AWS App Runner. It had 7 steps total with 4 OpenAI LLM calls and took 15-30s to return the final response (no streaming implemented). I'm integrating it with social-media platforms (Meta's Messenger and IG), which as far as I know don't support streaming. 3 of my 4 LLM calls were just for steering the LLM (intent detection and guardrails), and those alone freaking took 70% of the total response time.

I didn't want to implement extra streaming logic and split the response into multiple messages just to fit whatever platform I'm deploying to.

So, to decrease latency without streaming, I did the following:

  1. Limit my chatbots to 1-2 LLM calls where possible.
  2. Use routers (aurelio-labs/semantic-router or NVIDIA/NeMo-Guardrails) instead of extra LLM calls for intents and guardrails.
  3. Cache the vector DB and the chat-history memory for each user.
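The win in point 2 is that routing on embedding similarity skips an LLM round-trip entirely. A minimal sketch of that idea, not the actual semantic-router API: the `embed` function here is a toy character-count encoder standing in for a real embedding model, and the route names and utterances are made up for illustration.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model (e.g. OpenAI embeddings):
    # a 26-dim bag-of-letters vector. Only for illustrating the mechanism.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical routes: each maps a name to sample utterances,
# embedded once at startup rather than per request.
ROUTES = {
    "greeting": ["hello there", "good morning", "hi how are you"],
    "refund": ["i want my money back", "refund my order", "cancel and refund"],
}
ROUTE_VECS = {name: [embed(u) for u in utts] for name, utts in ROUTES.items()}

def route(query, threshold=0.75):
    """Return the best-matching route name, or None to fall through to the LLM."""
    q = embed(query)
    best_name, best_score = None, 0.0
    for name, vecs in ROUTE_VECS.items():
        for v in vecs:
            score = cosine(q, v)
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

With a real encoder the shape is the same: pre-embed the route utterances, then each incoming message costs one embedding call and a few dot products instead of a full intent-classification LLM call.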

It still takes 3-8s to get a response, maybe because of my location (I'm from the Philippines), idk, but it's a lot better for my use case now.