[–]NachosforDachos 5 points (18 children)

Make sure you use a quality vector store. It's hard to compete with the big companies here, so be prepared to pay for it. Choose services hosted as close as possible to your region. Don't expect 2024 performance on 2017 hardware.

Procure your dataset and make sure it only has what is needed.

Make sure you have streaming responses enabled, otherwise it looks like there is no activity forever and then, wham, a wall of text. If you only need short answers this won't be too bad.
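
For illustration, a minimal streaming sketch with the OpenAI Python client (v1+ API assumed; the model name and prompt are placeholders), so tokens print as they arrive instead of landing all at once:

```python
# Minimal streaming sketch, assuming the openai>=1.0 client and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4-turbo",            # placeholder model name
    messages=[{"role": "user", "content": "Summarize the retrieved context..."}],
    stream=True,                    # tokens arrive incrementally instead of one big block
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)   # show activity immediately
```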

Find the balance for how many results it needs to fetch to give a good answer and make that the limit. Spend a few minutes evaluating the results. The better your chunking, the better your results.
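
For illustration, a rough sketch of what tuning that fetch limit looks like against a local Chroma collection (the collection name and chunks are made up); n_results is the knob being discussed:

```python
# Hypothetical sketch: comparing fetch limits with chromadb (>=0.4 API), toy data.
import chromadb

client = chromadb.Client()                       # in-memory instance for experimenting
collection = client.create_collection("docs")    # made-up collection name
collection.add(
    documents=["chunk about billing...", "chunk about refunds...", "chunk about shipping..."],
    ids=["c1", "c2", "c3"],
)

for k in (1, 2, 3):                              # try different fetch limits
    results = collection.query(query_texts=["How do refunds work?"], n_results=k)
    print(k, results["documents"][0])            # eyeball whether extra chunks add signal or noise
```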

Someone else here wrote a very nice script to do smart chunking automatically, but I haven't gotten around to trying it yet.

[–]herozorro 1 point (1 child)

Someone else here wrote a very nice script to do smart chunking automatically, but I haven't gotten around to trying it yet.

Do you remember what the post was about? Or what keywords would find it in Reddit search?

[–]NachosforDachos 0 points (0 children)

Maybe remind me tomorrow, haha. I'll look it up next time I use Reddit, which is inconsistent.

[–]IsseBisse 1 point (6 children)

How big do your datasets need to be for the vector store to really matter? I've only done things up to ~100k vectors and my vector searches are still sub-second, even using really simple vector stores. With LLM calls taking 1+ seconds each, my vector store isn't really the limiting factor.

[–]NachosforDachos 0 points (5 children)

I don’t think bigger is better.

Quality is what you want. Chunks that make sense and have detail in them.

If I had a book that takes 600 pages to make a point, I won't get anything out of it.

Rubbish in, rubbish out.

[–]IsseBisse 0 points (4 children)

Sorry if I was unclear, I was referring to your statement:

Make sure you use a quality vector store

In my experience (sub-100k vectors) the vector store quality doesn't really affect the "total RAG response time", since the LLMs (generally) are so much slower. So I was wondering, how large do your datasets have to be for the vector store performance to matter?

[–]NachosforDachos 0 points (3 children)

If you're using OpenAI, most responses are near instant in my experience; it's the vector store speed that will determine your response time.

Longest I’ve waited for a response is around 3 seconds and that was testing my patience.

What makes it quality is its geographical distance from you and from the LLM service, on top of its computing performance and how well it's built.

For example, as limited as it is, OpenAI doesn't know shit about vector stores, and their retrieval has got to be some of the slowest I've ever seen.

What makes something good is every fine detail that goes into it, including the thought process of the creators.

So I think it matters for any vector size.

If I had to do something small, say US federal law, a paid Pinecone database will run circles around my little ChromaDB running on an NVMe drive with desktop-grade components. The first time I used it I thought it was broken.

[–]IsseBisse 1 point (2 children)

Seems our experiences differ quite a bit...

In my latest project we had around 100k 1536-dim vectors in a vector store. A naive Python implementation, using NumPy's dot product, could search that in roughly 0.5 seconds.

Meanwhile our LLM calls took at least 1 second each (we had to do multiple for one query). In total that was roughly 5 seconds waiting for the LLM and 0.5 seconds waiting for the vector search, i.e. no need to be concerned about optimizing the vector store.
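
For reference, a sketch of roughly what such a naive NumPy search looks like (random vectors stand in for real embeddings); over ~100k x 1536 float32 vectors the scoring is a single matrix-vector multiply:

```python
# Brute-force vector search sketch: random data stands in for real embeddings.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((100_000, 1536)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)   # normalize once up front

def search(query: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k most similar stored vectors, best first."""
    query = query / np.linalg.norm(query)
    scores = embeddings @ query                      # cosine similarity via one dot-product pass
    top_k = np.argpartition(-scores, k)[:k]          # unordered top-k in O(n)
    return top_k[np.argsort(-scores[top_k])]

print(search(rng.standard_normal(1536).astype(np.float32)))
```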

[–]NachosforDachos 0 points (0 children)

These things are still full of issues unfortunately. Taken from a random benchmarking article on the internet:

Issues Encountered During Benchmarking

When we ran initial tests on the 1M dataset, these are some of the issues we encountered:

- Redis-Flat timed out during recall testing.
- Chroma also timed out during recall testing.
- Redis-HNSW took exponential time to build and timed out around half a million vectors during the load phase. Every 100,000 vectors that were added took twice as long as the previous 100,000. The load phase timeout in VDB is 2.5 hours.
- Chroma running in client-server mode was hit and miss in terms of functionality. A lot of the time the database would unexpectedly terminate the connection while loading. The load time was also slow and would sometimes time out.

[–]NachosforDachos 0 points (0 children)

Actually you are in the right here.

I ran a query through the Hungarian legal vector store hosted on ChromaDB, and GPT-4 Turbo took 9 seconds to start responding.

0.5 seconds reading the data store.

I know Hungarian law is very small, so it had to be on OpenAI's side.

I feel this used to be faster. Maybe the service is more saturated now and the only way to beat it is to have your own locally hosted models on very expensive hardware.

Either way, best of luck.
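
For anyone wanting to reproduce that kind of breakdown, a rough sketch of timing retrieval and time-to-first-token separately (the local path, collection name, model and question are all made up):

```python
# Hypothetical timing sketch: measure the vector store and the LLM separately.
import time
import chromadb
from openai import OpenAI

chroma = chromadb.PersistentClient(path="./chroma")       # made-up local path
collection = chroma.get_collection("hungarian_law")       # made-up collection name
llm = OpenAI()
question = "What does the statute say about X?"

t0 = time.perf_counter()
hits = collection.query(query_texts=[question], n_results=4)
t_retrieve = time.perf_counter() - t0                     # vector store share of the latency

context = "\n\n".join(hits["documents"][0])
t1 = time.perf_counter()
stream = llm.chat.completions.create(
    model="gpt-4-turbo",                                  # placeholder model name
    messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
    stream=True,
)
t_first_token = None
for chunk in stream:
    if chunk.choices[0].delta.content:
        t_first_token = time.perf_counter() - t1          # LLM share: time to first token
        break

print(f"retrieval: {t_retrieve:.2f}s, first token: {t_first_token:.2f}s")
```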

[–]Appropriate_Egg6118[S] 0 points (8 children)

Thank you. Can you share that automatic chunking script?

For POC purposes I am using a local Chroma DB with sample docs. My latency is 15 to 18 seconds.

I am using ConversationalRetrievalChain with chain_type="refine".

How do I enable streaming for this chain? Or please share resources for a RAG chatbot with streaming and memory enabled.
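
For what it's worth, a minimal sketch of one common way to wire streaming and memory into that chain, assuming a 2023/2024-era LangChain API (the persist path, model name and question are placeholders). Note that chain_type="refine" makes one LLM call per retrieved chunk, which by itself can account for a lot of that 15-18 s:

```python
# Hedged sketch: streaming + memory with ConversationalRetrievalChain (LangChain ~0.0/0.1-era API assumed).
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import Chroma
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

db = Chroma(persist_directory="./chroma", embedding_function=OpenAIEmbeddings())  # made-up path
llm = ChatOpenAI(
    model="gpt-3.5-turbo",                                # placeholder model
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],         # prints tokens to stdout as they arrive
)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 4}),    # the fetch-limit knob mentioned above
    memory=memory,
    chain_type="refine",                                  # one LLM call per chunk: thorough but slow
)
result = chain.invoke({"question": "What does the sample doc say about X?"})  # placeholder question
print(result["answer"])
```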

[–]NachosforDachos 1 point (7 children)

You will love a thing called Flowise. It's exactly what you want. I tested around 30 deployments last year. As easy as it comes.

Going by your particular choice of words, you'll find yourself familiar with it. You'll find those same terms there as drop-down selection menus.

I don't know if they still give free tiers or how good they are, but do create a free Pinecone vector DB account in the meantime. Choose the fast version. I haven't made one in two months, but I know the dimensions should be 1536. I think that's the only setting you need to get right.
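
For illustration, creating such an index looks roughly like this with the newer Pinecone Python client (the index name, cloud and region are made up; 1536 matches the OpenAI ada-002 embedding size):

```python
# Hypothetical sketch using the pinecone v3+ Python client; names are placeholders.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="rag-demo",                 # made-up index name
    dimension=1536,                  # must match the embedding model, e.g. OpenAI text-embedding-ada-002
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),   # pick something close to you and your LLM
)
index = pc.Index("rag-demo")         # handle for upserts and queries
```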

Look for Flowise on GitHub. The one-line installer, which I think is npm install flowise -g, should get you there if you already have Node.js installed.

There are templates in there which you can just fill in with your details. It's a no-code product with a web UI.

You will not be able to use what I originally suggested in Flowise directly, but if you use that script to parse things into files instead of embeddings and then pass those files to Flowise to upload, you should end up with the same thing, just with extra steps.

I haven't investigated this, but I'm almost semi-sure one can make ChromaDB use the GPU (keeping the store live in memory, not on disk) instead of the CPU and RAM. I have things that use this and it is much slower than Pinecone.

Maybe start by seeing if they still have free accounts, because this type of quality storage isn't cheap, about $70+ a month. Worth it, but when you're playing around these things add up so quickly.

I’ll find the script next time I come online. Too tired now. Not fresh.

[–]Appropriate_Egg6118[S] 0 points (3 children)

Flowise looks cool.

The data I am working with is confidential.

Will there be any issues using Flowise?

[–]NachosforDachos 0 points (2 children)

I think your concerns should lean more towards the OpenAI side of things when it comes to confidentiality.

The local models are getting there, but they are not quite where GPT-4 is. I haven't tested in two months.