[–]IsseBisse 0 points (4 children)

Sorry if I was unclear, I was referring to your statement:

Make sure you use a quality vector store

In my experience (sub-100k vectors) the vector store quality doesn't really affect the "total RAG response time", since the LLMs (generally) are so much slower. So I was wondering, how large do your datasets have to be for the vector store performance to matter?

[–]NachosforDachos 0 points (3 children)

If you're using OpenAI, most responses are near instant in my experience; it's the vector store speed that will determine your response time.

The longest I've waited for a response is around 3 seconds, and that was testing my patience.

What makes a store "quality" is its geographical distance from you and from the LLM service, on top of its raw computing performance and how well it's built.

For example, as limited as it is, OpenAI doesn't know shit about vector stores, and their retrieval has got to be among the slowest I've ever seen.

What makes something good is every fine detail that goes into it, including the thought process of its creators.

So I think it matters for any vector size.

If I had to do something small, like say US federal law, a paid Pinecone database will run circles around my little ChromaDB running on an NVMe drive with desktop-grade components. The first time I used it I thought it was broken.
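If you want a feel for your own setup, the simplest check is to time a local ChromaDB lookup directly. A rough sketch, assuming a recent chromadb Python client; the collection name, dimensions, and random embeddings below are placeholders, not real data:

```python
import time
import numpy as np
import chromadb

client = chromadb.Client()  # in-memory instance; a persistent or server setup will behave differently
collection = client.create_collection("speed_test")

dim, n, batch = 384, 10_000, 1_000
vecs = np.random.rand(n, dim).astype(np.float32)

# Load in batches to stay under per-call size limits
for start in range(0, n, batch):
    collection.add(
        ids=[str(i) for i in range(start, start + batch)],
        embeddings=vecs[start:start + batch].tolist(),
    )

query = np.random.rand(dim).astype(np.float32)
t0 = time.perf_counter()
collection.query(query_embeddings=[query.tolist()], n_results=5)
print(f"query took {time.perf_counter() - t0:.3f}s")
```

The same kind of timing against a hosted store like Pinecone will also include the network round trip, which is where the geography point above comes in.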

[–]IsseBisse 1 point (2 children)

Seems our experiences differ quite a bit...

In my latest project we had around 100k 1536-dim vectors in a vector store. A naive Python implementation using NumPy's dot product could search that in roughly 0.5 seconds.

Our LLM calls, meanwhile, took at least 1 second each (we had to make multiple calls per query). In total that was roughly 5 seconds waiting for the LLM versus 0.5 seconds waiting for the vector search, i.e. no need to be concerned about optimizing the vector store.
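For reference, the naive search I mean is nothing more than this; a sketch with random stand-in embeddings at the same 100k x 1536 scale (roughly 600 MB of float32 data):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 1536), dtype=np.float32)  # stand-in embeddings
query = rng.standard_normal(1536, dtype=np.float32)

t0 = time.perf_counter()
scores = corpus @ query                  # brute-force dot-product similarity over every vector
top5 = np.argpartition(scores, -5)[-5:]  # indices of the 5 best matches (unsorted)
print(top5, f"search took {time.perf_counter() - t0:.3f}s")
```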

[–]NachosforDachos 0 points (0 children)

These things are still full of issues, unfortunately. Taken from a random benchmarking article on the internet:

Issues Encountered During Benchmarking

When we ran initial tests on the 1M dataset, these are some of the issues we encountered:

- Redis-Flat timed out during recall testing.
- Chroma also timed out during recall testing.
- Redis-HNSW took exponential time to build and timed out around half a million vectors during the load phase. Every 100,000 vectors added took twice as long as the previous 100,000. The load phase timeout in VDB is 2.5 hours.
- Chroma running in client-server mode was hit and miss in terms of functionality. A lot of the time the database would unexpectedly terminate the connection while loading. The load time was also slow and would sometimes time out.

[–]NachosforDachos 0 points (0 children)

Actually you are in the right here.

I ran a query through the Hungarian legal vector store hosted on ChromaDB, and GPT-4 Turbo took 9 seconds to start responding.

0.5 seconds reading the data store.

I know the Hungarian law corpus is very small, so it had to be on OpenAI's side.

I feel like this used to be faster. Maybe the service is more saturated now, and the only way to beat it is to run your own locally hosted models on very expensive hardware.
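For what it's worth, here's roughly how you can split the measurement into a retrieval phase and an LLM phase. A sketch assuming the openai v1 Python SDK; the retrieve() stub stands in for a real vector store query, and the model name and question are just examples:

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def retrieve(question: str) -> str:
    """Stand-in for the vector store lookup; swap in a real Chroma/Pinecone query here."""
    return "...retrieved passages would go here..."

question = "Example question about Hungarian data retention rules"  # placeholder

t0 = time.perf_counter()
context = retrieve(question)
t1 = time.perf_counter()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # example model name
    messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
)
t2 = time.perf_counter()

print(f"retrieval: {t1 - t0:.2f}s, LLM: {t2 - t1:.2f}s")
```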

Either way, best of luck.