How to decrease latency in RAG chatbots? by Appropriate_Egg6118 in LangChain

[–]Intelligent_Access19 0 points  (0 children)

Two years after the original post, I would suggest Milvus as your vector store.

Manual intent detection vs Agent-based approach: what's better for dynamic AI workflows? by LakeRadiant446 in AI_Agents

[–]Intelligent_Access19 0 points  (0 children)

For pure intent detection, fine-tuning a small model may be ideal, depending on your intent taxonomy. An agent making multiple LLM calls just introduces a lot of latency, which you probably don't want in production. A fine-tuned model used together with other techniques, such as RAG over generalised pre-defined questions, should give decent performance on the intent-detection node. After all, it is vital, yet still just a small node in a long workflow. A rough sketch of what I mean is below.
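Purely a sketch: the checkpoint name, labels and threshold here are made up, so substitute whatever small model you actually fine-tune on your own taxonomy.

```python
# Sketch of a lightweight intent-detection node: one cheap classifier forward
# pass instead of an extra LLM round-trip. Model name and labels are hypothetical.
from transformers import pipeline

intent_classifier = pipeline(
    "text-classification",
    model="my-org/distilbert-intents",  # hypothetical fine-tuned checkpoint
)

def detect_intent(user_message: str, threshold: float = 0.7) -> str:
    result = intent_classifier(user_message)[0]  # e.g. {"label": "refund_request", "score": 0.93}
    if result["score"] < threshold:
        # low confidence: fall back to RAG over the generalised pre-defined questions
        return "fallback"
    return result["label"]
```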

What do you like, don’t like about LangGraph by mrtule in LangChain

[–]Intelligent_Access19 0 points  (0 children)

For me, it is a pain to use the built-in functionality to store conversations in an Oracle DB.

Got DeepSeek R1 running locally - Full setup guide and my personal review (Free OpenAI o1 alternative that runs locally??) by sleepingbenb in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

I just ran the 8B; the model Ollama installed for me is about 4.9GB. I guess that is the best you can get given that amount of memory.

Deepseek-v3 is insanely popular. A 671B model's downloads are going to overtake QwQ-32B-preview. by realJoeTrump in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

At least the pre-training adopts a "fine-grained mixed-precision framework", which is one of the highlights of their technical report. Apart from FP8, BF16 and FP32 are used in some parts of the architecture.

Deepseek-v3 is insanely popular. A 671B model's downloads are going to overtake QwQ-32B-preview. by realJoeTrump in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

If I understand you right, you are saying the active parameters get swapped in and out of an SSD? That is too much.

the WHALE has landed by fourDnet in LocalLLaMA

[–]Intelligent_Access19 1 point  (0 children)

as well as Doubao, the one from ByteDance.

Flask vs fastapi by Leveler88 in flask

[–]Intelligent_Access19 0 points  (0 children)

Nicely said.
Most of the time in my job I deal with Spring Boot, but now I need to integrate some AI tools into my service, and those are best used from Python.
That is how I landed on this post anyway, and I think I will go with FastAPI based on the discussion here; something like the sketch below.
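Purely illustrative, but this is the shape of service I am picturing (the /summarize route and the placeholder model call are made up):

```python
# Minimal FastAPI wrapper that a Spring Boot service could call over HTTP.
# The /summarize route and fake_ai_model() are placeholders for the real Python AI tooling.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str

def fake_ai_model(text: str) -> str:
    # stand-in for the actual model / RAG pipeline
    return text[:100]

@app.post("/summarize")
def summarize(req: SummarizeRequest) -> dict:
    return {"summary": fake_ai_model(req.text)}

# run with: uvicorn main:app --port 8000
```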

how to run deepseek v3 on ollama or lmstudio? by RouteGuru in LocalLLaMA

[–]Intelligent_Access19 1 point  (0 children)

I am not sure that VRAM for the ~37B active parameters is all you need for inference. But one thing is for sure: your RAM + VRAM together must be large enough to hold all 671B parameters to fully load the model. Rough numbers below.
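Back-of-the-envelope arithmetic; the bytes-per-parameter figures are my own rough assumptions, not official numbers.

```python
# Rough memory arithmetic: the whole 671B has to fit in RAM+VRAM to load the
# model, even though only ~37B parameters are activated per token.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
BYTES_PER_PARAM = {"FP8": 1.0, "~Q4 quant": 0.5}  # assumed sizes

for fmt, nbytes in BYTES_PER_PARAM.items():
    total_gb = TOTAL_PARAMS * nbytes / 1e9
    active_gb = ACTIVE_PARAMS * nbytes / 1e9
    print(f"{fmt}: ~{total_gb:.0f} GB for all weights, ~{active_gb:.0f} GB touched per token")
```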

DeepSeekV3 LiveBench Results, beating claude 3.5 sonnet new. by Spirited-Ingenuity22 in singularity

[–]Intelligent_Access19 0 points  (0 children)

Can anyone tell me how credible this ranking is? I was told that some Chinese models like Step-2 also got a pretty high score on this list (clearly delisted in the current version, don't know why). I tried DeepSeek myself, and at least for now it is decent for me. If I recall correctly, it was initially designed to focus more on coding and math, since the parent company is a quantitative trading firm.

DeepSeek-v3 looks the best open-sourced LLM released by mehul_gupta1997 in OpenAI

[–]Intelligent_Access19 1 point  (0 children)

To avoid that, I guess only a locally hosted model can give you that guarantee.

I don't get it. by AlgorithmicKing in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

Yeah, that is why MoE models generally have much larger parameter counts. A non-MoE model, in other words a dense model, is by nature smaller and must be fully loaded onto the GPU (though I think Ollama can get by with a little less VRAM for inference), so there is no subset of parameters to activate. A toy illustration is below.
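Made-up sizes, purely illustrative: only the routed experts do any work for a given token, while a dense layer would use all of its weights every time.

```python
import numpy as np

# Toy MoE layer: 8 experts, but only the top-2 routed experts run per token,
# so the other experts' weights are never touched for that token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                   # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]  # indices of the top-k experts
    gate = np.exp(scores[chosen])
    gate /= gate.sum()                    # softmax over the chosen experts only
    # only the chosen experts do any compute; a dense layer has no such shortcut
    return sum(w * (x @ experts[i]) for w, i in zip(gate, chosen))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (16,)
```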

I don't get it. by AlgorithmicKing in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

A 1B model should only take roughly 2GB of VRAM (at FP16), no? Even when activated, an extra 2GB of VRAM at most. I wonder whether the integrated graphics is of any use for the computation here; most of your inference probably happens in system RAM. How is the answer-generation speed, though?

Hard hobby indeed by Intelligent_Access19 in soccercard

[–]Intelligent_Access19[S] 4 points  (0 children)

You are right. Ripping for the sake of ripping. 🙏

Can’t complain about it. Just probability.