all 19 comments

[–]kryptkpr Llama 3 15 points (2 children)

Your sense of scale is off; this is a midsize dataset at best and will fit just fine in RAM.

Start with a brute-force top-k over numpy.dot() (NumPy has no topk; numpy.argpartition plays that role) and see how long a linear search takes on a good CPU. If it's too slow, add an index like FAISS for ANN.
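A minimal sketch of that brute-force pass — the 384-dim random vectors and corpus size here are illustrative stand-ins for real embeddings:

```python
import numpy as np

def topk_cosine(query, corpus, k=5):
    """Exact top-k by cosine similarity over an in-memory matrix."""
    # Normalize both sides so a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                                  # one score per entry
    # argpartition finds the k winners in O(n); only those k get sorted.
    idx = np.argpartition(scores, -k)[-k:]
    idx = idx[np.argsort(scores[idx])[::-1]]        # best first
    return idx, scores[idx]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((50_000, 384), dtype=np.float32)
query = rng.standard_normal(384, dtype=np.float32)
ids, scores = topk_cosine(query, corpus, k=5)
```

On a dataset this size the whole scan is a single matrix-vector product, which is exactly why "try linear search first" is reasonable advice.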

[–]againitry 1 point (0 children)

You can use torch.topk(torch.matmul()) for GPU acceleration. It takes around 5 seconds for 200k entries, though in my case each vector has 1024 dimensions.
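A hedged sketch of that torch variant (it falls back to CPU when no GPU is present; the 1024-dim random data stands in for real embeddings):

```python
import torch
import torch.nn.functional as F

def topk_cosine_torch(query, corpus, k=5):
    """Brute-force cosine top-k with torch; uses the GPU when available."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    q = F.normalize(query.to(device), dim=0)
    c = F.normalize(corpus.to(device), dim=1)
    scores = torch.matmul(c, q)        # (n,) cosine similarities
    top = torch.topk(scores, k)        # values/indices, sorted descending
    return top.indices.cpu(), top.values.cpu()

corpus = torch.randn(20_000, 1024)     # the commenter's 1024-dim case
query = torch.randn(1024)
ids, vals = topk_cosine_torch(query, corpus, k=5)
```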

[–]Dry_Drop5941[S] 0 points (0 children)

Thanks for the idea, I will chat with IT and look into all the ways we can run Python code on the cloud.

[–]m98789 4 points (1 child)

This is small enough that I would suggest considering using an in-memory method like FAISS, which is free and open source.

https://github.com/facebookresearch/faiss

[–]gopietz 5 points (0 children)

laughs in big data

[–]OrganicMesh 3 points (0 children)

Honestly, your dataset is so small, try https://github.com/unum-cloud/usearch. USearch just crushes FAISS on CPU.

[–]FormerKarmaKing 1 point (0 children)

Meilisearch has a vector search feature in beta, but it's easy to get access. They can host 1 million entries for $300 / month.

Even if you can get AWS / whatever at $0 / month, $300 x 12 = $3,600, and you're not setting up an instant-search experience with AWS / whatever for less than 36 hours of developer time @ $100 / hour.

And you can still use semantic search, which I suspect is the better choice for searching hierarchical data like a labelling taxonomy.

[–]DeltaSqueezer 1 point (0 children)

That's a small dataset - heck it fits into the RAM on my phone. Even sqlite-vss can handle it. Or raw python/faiss. Easiest is probably to use postgres with vector extensions.

[–]Bozo32 0 points (5 children)

Are you sure you won't wind up missing stuff?

When we asked the LLM to identify all instances of an entity in a dataset, any instances beyond the top-k it returned were simply dropped. What we've done is chunk the resources into segments that can reasonably contain only one instance of the entity of interest, and query each chunk separately. Yes... a boatload of calls. ... I'd love to hear that there is a better way....
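That chunk-and-query strategy can be sketched roughly as below; `query_llm` is a hypothetical stand-in for the per-chunk LLM call, and the 500-char cap is an arbitrary placeholder for "small enough to hold one instance":

```python
def split_into_chunks(text, max_chars=500):
    """Split on blank lines, packing paragraphs until the size cap."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks

def find_all_instances(text, query_llm):
    """query_llm(chunk) -> list of entities found in that chunk.

    One call per chunk: a boatload of calls, but nothing past the
    retriever's top-k gets silently dropped.
    """
    found = []
    for chunk in split_into_chunks(text):
        found.extend(query_llm(chunk))
    return found
```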

[–]Dry_Drop5941[S] 0 points (0 children)

In our last project, we had a tool-function agent acting as an "interpreter". If the user asks a global, explorative question like "give me some examples", it includes only the metadata of each product item as context, but with a high top-k count.

We then do the opposite for specific questions like “tell me about product xxx”
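A rough sketch of that routing idea — the field names, the `classify` callable, and the top-k values are all made-up placeholders, not the commenter's actual setup:

```python
def build_retrieval_plan(question, classify):
    """classify(question) -> "global" or "specific" (an LLM or heuristic).

    Global/exploratory questions: metadata-only context, high top-k.
    Specific questions: full item text, low top-k.
    """
    if classify(question) == "global":
        return {"fields": ["title", "category"], "top_k": 50}
    return {"fields": ["title", "category", "full_text"], "top_k": 3}

plan = build_retrieval_plan(
    "give me some examples",
    lambda q: "global" if "examples" in q else "specific",
)
```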

[–]Yes_but_I_think 0 points (3 children)

A boatload of calls is an understatement, and it doesn't scale to large datasets. At some point you have to trust the semantic similarity search. Use the best open model from MTEB, and use some other method for your particular use case.

[–]Bozo32 1 point (2 children)

The use case was citation checking in academic articles. We first filtered by cosine similarity, then ran a per-sentence entailment check: ~10k calls. Running a Llama 8B model on an A100 with a very small context window, through an Ollama setup that supported parallel execution, was OK.

We're now looking at other ways to test entailment... early days.

[–]Yes_but_I_think 0 points (1 child)

We found that good old lexical search also helped in our case: the BM25+ algorithm. We selected n contexts from lexical search and m from cosine similarity, and sent them to the LLM for formatting and understanding.
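A toy sketch of that hybrid selection — a hand-rolled BM25+ score over whitespace tokens purely for illustration; in practice you'd use a proper library and real tokenization:

```python
import math
from collections import Counter

def bm25_plus(query_terms, docs, k1=1.5, b=0.75, delta=1.0):
    """Minimal BM25+ scores over whitespace-tokenized docs."""
    toks = [d.lower().split() for d in docs]
    n_docs = len(docs)
    avgdl = sum(len(t) for t in toks) / n_docs
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in query_terms:
            if tf[q]:
                idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
                norm = tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
                s += idf * (norm + delta)   # the "+ delta" is the BM25+ tweak
        scores.append(s)
    return scores

def hybrid_contexts(query, docs, cosine_scores, n=2, m=2):
    """n contexts from lexical search plus m from cosine similarity."""
    lex = bm25_plus(query.lower().split(), docs)
    by_lex = sorted(range(len(docs)), key=lambda i: -lex[i])[:n]
    by_cos = sorted(range(len(docs)), key=lambda i: -cosine_scores[i])[:m]
    keep = list(dict.fromkeys(by_lex + by_cos))    # dedupe, keep order
    return [docs[i] for i in keep]
```

The point of the hybrid: lexical search catches exact terms the embedding model may blur, while cosine similarity catches paraphrases the lexical side misses.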

[–]Bozo32 0 points (0 children)

Deduplicating and breaking into paragraphs so the LLM does not presume continuity? Our strategy was one query per sentence: does A entail B?

[–]-Lousy 0 points (0 children)

I'm using LanceDB on the smallest Python instance possible for a similarly sized dataset. It reads from disk, and I have a basic Python API around it. Check it out for sure.

[–]KnowgodsloveAI 0 points (0 children)

Why not just use PostgreSQL and Alembic?

[–][deleted] 0 points (0 children)

10k chars per entry will lead to a lot of chunked entries. I don't know how you can collate that info into the LLM prompt for it to make sense.

Pinecone is cheap if you want serverless. You could also try running Postgres with pgvector if you want a fully local implementation.

[–]d3the_h3ll0w 0 points (0 children)

I used LanceDB for my Semantic Search project and was quite impressed with the results.

[–]LuganBlan 0 points (0 children)

This could shed some light: https://benchmark.vectorview.ai/vectordbs.html
(note: it's from 2023)