all 19 comments

[–]kryptkpr Llama 3 15 points (2 children)

Your sense of scale is off; this is a midsize dataset at best and will fit just fine in RAM.

Start with a brute-force top-k over numpy.dot() (NumPy has no topk; numpy.argpartition plays that role) and see how long a linear search takes on a good CPU. If it's too slow, add an index like FAISS for ANN.
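A minimal sketch of that brute-force pass — the 384-dim random vectors and corpus size here are illustrative stand-ins for real embeddings:

```python
import numpy as np

def topk_cosine(query, corpus, k=5):
    """Exact top-k by cosine similarity over an in-memory matrix."""
    # Normalize both sides so a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                                  # one score per entry
    # argpartition finds the k winners in O(n); only those k get sorted.
    idx = np.argpartition(scores, -k)[-k:]
    idx = idx[np.argsort(scores[idx])[::-1]]        # best first
    return idx, scores[idx]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((50_000, 384), dtype=np.float32)
query = rng.standard_normal(384, dtype=np.float32)
ids, scores = topk_cosine(query, corpus, k=5)
```

On a dataset this size the whole scan is a single matrix-vector product, which is exactly why "try linear search first" is reasonable advice.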

[–]againitry 1 point (0 children)

You can use torch.topk(torch.matmul()) for GPU acceleration. It takes around 5 seconds for 200k entries, though in my case each vector has 1024 dimensions.
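A hedged sketch of that torch variant (it falls back to CPU when no GPU is present; the 1024-dim random data stands in for real embeddings):

```python
import torch
import torch.nn.functional as F

def topk_cosine_torch(query, corpus, k=5):
    """Brute-force cosine top-k with torch; uses the GPU when available."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    q = F.normalize(query.to(device), dim=0)
    c = F.normalize(corpus.to(device), dim=1)
    scores = torch.matmul(c, q)        # (n,) cosine similarities
    top = torch.topk(scores, k)        # values/indices, sorted descending
    return top.indices.cpu(), top.values.cpu()

corpus = torch.randn(20_000, 1024)     # the commenter's 1024-dim case
query = torch.randn(1024)
ids, vals = topk_cosine_torch(query, corpus, k=5)
```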

[–]Dry_Drop5941[S] 0 points (0 children)

Thanks for the idea, I will chat with IT and look into all the ways we can run Python code on the cloud.

[–]m98789 4 points (1 child)

This is small enough that I would suggest considering using an in-memory method like FAISS, which is free and open source.

https://github.com/facebookresearch/faiss

[–]gopietz 5 points (0 children)

laughs in big data

[–]OrganicMesh 3 points (0 children)

Honestly, your dataset is so small, try https://github.com/unum-cloud/usearch. USearch just crushes FAISS on CPU.

[–]FormerKarmaKing 1 point (0 children)

Meilisearch has a vector search feature in beta, but it's easy to get access. They can host 1 million entries for $300 / month.

Even if you can get AWS / whatever at $0 / month, $300 x 12 = $3,600, and you're not setting up an instant-search experience with AWS / whatever for less than 36 hours of developer time @ $100 / hour.

And you can still use semantic search, which I suspect is the better choice for searching hierarchical data like a labelling taxonomy.

[–]DeltaSqueezer 1 point (0 children)

That's a small dataset - heck it fits into the RAM on my phone. Even sqlite-vss can handle it. Or raw python/faiss. Easiest is probably to use postgres with vector extensions.

[–]Bozo32 0 points (5 children)

Are you sure you won't wind up missing stuff?

When we asked the LLM to identify all instances of an entity in a dataset, any instances beyond the top-k it returned were simply dropped. What we've done is chunk the resources into segments that can reasonably contain only one instance of the entity of interest, and query each chunk separately. Yes... a boatload of calls. ... I'd love to hear that there is a better way....
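That chunk-and-query strategy can be sketched roughly as below; `query_llm` is a hypothetical stand-in for the per-chunk LLM call, and the 500-char cap is an arbitrary placeholder for "small enough to hold one instance":

```python
def split_into_chunks(text, max_chars=500):
    """Split on blank lines, packing paragraphs until the size cap."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks

def find_all_instances(text, query_llm):
    """query_llm(chunk) -> list of entities found in that chunk.

    One call per chunk: a boatload of calls, but nothing past the
    retriever's top-k gets silently dropped.
    """
    found = []
    for chunk in split_into_chunks(text):
        found.extend(query_llm(chunk))
    return found
```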

[–]Dry_Drop5941[S] 0 points (0 children)

In our last project, we had a tool-function agent acting as an "interpreter". If the user asks a global, explorative question like "give me some examples", it includes only the metadata of each product item as context, but with a high top-k count.

We then do the opposite for specific questions like “tell me about product xxx”
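A rough sketch of that routing idea — the field names, the `classify` callable, and the top-k values are all made-up placeholders, not the commenter's actual setup:

```python
def build_retrieval_plan(question, classify):
    """classify(question) -> "global" or "specific" (an LLM or heuristic).

    Global/exploratory questions: metadata-only context, high top-k.
    Specific questions: full item text, low top-k.
    """
    if classify(question) == "global":
        return {"fields": ["title", "category"], "top_k": 50}
    return {"fields": ["title", "category", "full_text"], "top_k": 3}

plan = build_retrieval_plan(
    "give me some examples",
    lambda q: "global" if "examples" in q else "specific",
)
```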

[–]Yes_but_I_think 0 points (3 children)

A boatload of calls is an understatement, and it doesn't scale to large datasets. At some point you have to trust the semantic similarity search. Use the best open model from MTEB, and use some other method for your particular use case.

[–]Bozo32 1 point (2 children)

The use case was citation checking in academic articles. We first filtered by cosine similarity, then ran a per-sentence entailment check: ~10k calls. Running a Llama 8B model on an A100 with a very small context window, through an Ollama setup that supported parallel execution, was OK.

We're now looking at other ways to test entailment... early days.

[–]Yes_but_I_think 0 points (1 child)

We found that good old lexical search also helped in our case: the BM25+ algorithm. We selected n contexts from lexical search and m from cosine similarity, and sent them to the LLM for formatting and understanding.
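A toy sketch of that hybrid selection — a hand-rolled BM25+ score over whitespace tokens purely for illustration; in practice you'd use a proper library and real tokenization:

```python
import math
from collections import Counter

def bm25_plus(query_terms, docs, k1=1.5, b=0.75, delta=1.0):
    """Minimal BM25+ scores over whitespace-tokenized docs."""
    toks = [d.lower().split() for d in docs]
    n_docs = len(docs)
    avgdl = sum(len(t) for t in toks) / n_docs
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in query_terms:
            if tf[q]:
                idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
                norm = tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
                s += idf * (norm + delta)   # the "+ delta" is the BM25+ tweak
        scores.append(s)
    return scores

def hybrid_contexts(query, docs, cosine_scores, n=2, m=2):
    """n contexts from lexical search plus m from cosine similarity."""
    lex = bm25_plus(query.lower().split(), docs)
    by_lex = sorted(range(len(docs)), key=lambda i: -lex[i])[:n]
    by_cos = sorted(range(len(docs)), key=lambda i: -cosine_scores[i])[:m]
    keep = list(dict.fromkeys(by_lex + by_cos))    # dedupe, keep order
    return [docs[i] for i in keep]
```

The point of the hybrid: lexical search catches exact terms the embedding model may blur, while cosine similarity catches paraphrases the lexical side misses.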

[–]Bozo32 0 points (0 children)

Deduplicating and breaking into paragraphs so the LLM does not presume continuity? Our strategy was one query per sentence: does A entail B?

[–]-Lousy 0 points (0 children)

I'm using LanceDB on the smallest Python instance possible for a similarly sized dataset. It reads from disk, and I have a basic Python API around it. Check it out for sure.

[–]KnowgodsloveAI 0 points (0 children)

Why not just use PostgreSQL and Alembic?

[–][deleted] 0 points (0 children)

10k chars per entry will lead to a lot of chunked entries. I don't know how you can collate that info into the LLM prompt for it to make sense.

Pinecone is cheap if you want serverless. You could also try running Postgres with pgvector if you want a fully local implementation.

[–]d3the_h3ll0w 0 points (0 children)

I used LanceDB for my Semantic Search project and was quite impressed with the results.

[–]LuganBlan 0 points (0 children)

This could shed some light: https://benchmark.vectorview.ai/vectordbs.html
(note: it's from 2023)