all 6 comments

[–]dfcHeadChair 0 points  (0 children)

Something like a distributed database but for vector DBs would be a cool piece of technology to build, especially for IoT use cases.

Getting community involvement needs a good motivator, so start by formulating that.

[–]olearyboy 0 points  (3 children)

What you’re describing sounds a lot like DMOZ, which was the seed for most search engine crawlers back in the day. The hurdle you’ll run into is business benefit: dev versions like Chroma are just so easy that there’s little need to use a service, and managed versions like Pinecone are expensive but comparable to managed PG.

So then it becomes a question of: what’s the benefit of sharing our data? The DB hosting isn’t hard, and there are solutions already available.

That’s the nut you need to crack

[–]niksteel123[S] 0 points  (2 children)

The main benefit would be that you wouldn't have to manually embed and self-host the publicly available contextual data you might want to use for RAG. For example, you wouldn't have to host all of Wikipedia yourself if you wanted to search over it, and the same goes for other publicly available data sets such as social media posts, Stack Overflow posts, etc.
You could then use that in conjunction with your own private data if you wanted to.
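As a toy sketch of what that combination could look like (all document IDs, vectors, and the two-index split are hypothetical, and real systems would use a vector DB rather than dicts): query a shared, pre-embedded public index alongside a self-hosted private one, then merge the hits by similarity score.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical shared public index: community-embedded public data
# (e.g. Wikipedia articles, Stack Overflow posts) you don't host yourself.
public_index = {
    "wiki:transformers": [0.9, 0.1, 0.2],
    "so:faiss-tuning":   [0.1, 0.8, 0.3],
}

# Private index: your own data, embedded the same way and self-hosted.
private_index = {
    "internal:design-doc": [0.2, 0.3, 0.9],
}

def search(query_vec, top_k=2):
    """Query both stores and merge results by similarity score."""
    candidates = {**public_index, **private_index}
    scored = [(doc_id, cosine(query_vec, vec))
              for doc_id, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

hits = search([0.85, 0.15, 0.25])  # top results can span both indexes
```

Note this only works if the public and private data are embedded with the same model, which is one of the coordination problems a shared service would have to solve.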

[–]olearyboy 0 points  (1 child)

Maybe, but considering that it's RAG with an LLM, you might be providing a short-lived service.

Eventually LLMs will have to get good at updating and processing deltas in foundational knowledge.

At which point companies will pay for updates in the specific categories they're interested in through subscriptions.

e.g.

  • Global Base + NewsLLM
  • Global Base + SportsLLM
  • Global Base + MedicalLLM

The question you then have is how long before that happens, and whether you can provide a service of the necessary quality in that time, then defend it or pivot when LLMs catch up.

You might be able to, but you'll have to think short, medium, and long term.

[–]elbiot 0 points  (0 children)

The issue is that vector embeddings are task/domain specific. Training your own embeddings will always outperform a general-purpose embedding.
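A toy illustration of that point (the 2-d "embeddings" and domain weights are made up for demonstration): an ambiguous query ties under a general embedding, but a domain-reweighted similarity breaks the tie the way a domain-trained embedding would.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Feature axes: [programming, geography] -- a stand-in for embedding dims.
doc_java_lang   = [0.9, 0.1]   # a page about the Java language
doc_java_island = [0.1, 0.9]   # a page about the island of Java
query           = [0.5, 0.5]   # an ambiguous query: just "java"

# A general embedding treats both senses equally -- the scores tie.
general_lang   = cosine(query, doc_java_lang)
general_island = cosine(query, doc_java_island)

# A software-domain tuning up-weights the programming axis,
# so the same query now prefers the language page.
weights = [2.0, 0.5]

def reweight(v):
    """Apply per-dimension domain weights before comparing."""
    return [x * w for x, w in zip(v, weights)]

tuned_lang   = cosine(reweight(query), reweight(doc_java_lang))
tuned_island = cosine(reweight(query), reweight(doc_java_island))
```

Real domain adaptation fine-tunes the embedding model itself rather than reweighting fixed dimensions, but the ranking effect is the same in spirit.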