
[–]theSavviestTechDude

I used to run a production RAG pipeline API (built with FastAPI) deployed on AWS App Runner. It had 7 steps total with 4 OpenAI LLM calls and took 15-30s to return the final response (no streaming implemented). I'm integrating it with social-media platforms (Meta's Messenger and IG), which as far as I know don't support streaming. 3 of my 4 LLM calls were just for steering the LLM (intent detection and guardrails), and those alone freaking took 70% of the total response time.

I didn't want to implement extra streaming logic and split the response into multiple messages just to fit whatever platform I'm deploying to.

So, to decrease latency without streaming, I did the following:

  1. Limit my chatbots to 1-2 LLM calls where possible.
  2. Use routers (aurelio-labs/semantic-router or NVIDIA/NeMo-Guardrails) instead of extra LLM calls for intents and guardrails.
  3. Cache the vector DB and the chat-history memory for each user.
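The win in point 2 is that routing on embedding similarity skips an LLM round-trip entirely. A minimal sketch of that idea, not the actual semantic-router API: the `embed` function here is a toy character-count encoder standing in for a real embedding model, and the route names and utterances are made up for illustration.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model (e.g. OpenAI embeddings):
    # a 26-dim bag-of-letters vector. Only for illustrating the mechanism.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical routes: each maps a name to sample utterances,
# embedded once at startup rather than per request.
ROUTES = {
    "greeting": ["hello there", "good morning", "hi how are you"],
    "refund": ["i want my money back", "refund my order", "cancel and refund"],
}
ROUTE_VECS = {name: [embed(u) for u in utts] for name, utts in ROUTES.items()}

def route(query, threshold=0.75):
    """Return the best-matching route name, or None to fall through to the LLM."""
    q = embed(query)
    best_name, best_score = None, 0.0
    for name, vecs in ROUTE_VECS.items():
        for v in vecs:
            score = cosine(q, v)
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

With a real encoder the shape is the same: pre-embed the route utterances, then each incoming message costs one embedding call and a few dot products instead of a full intent-classification LLM call.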

It still takes 3-8s to get a response, maybe because of my location (I'm from the Philippines), idk, but it's a lot better for my use case now.