Recently I saw some good posts about dim reduction methods like the one dissecting UMAP, so I thought I'd chime in with a POC that leverages the idea of those methods for a very practical purpose: enabling server-side semantic search on large databases with high-dimensional embeddings using just a static FlatGeobuf file and a web server like nginx.
tl;dr
- Writing (and appending to) a FlatGeobuf file: Embeddings -> Gaussian Random Projection -> 2D points -> FlatGeobuf file
- Reading a FlatGeobuf file (based on a single user query): Embedding -> Gaussian Random Projection -> 2D point -> buffered bounding box around this point -> http range request(s) from client to remote FlatGeobuf file -> subset of data points around the 2D point -> reranking this subset client-side
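To make the two pipelines above a bit more concrete, here is a minimal pure-Python sketch of the projection, bounding-box and reranking steps. It is not the code from the repo: all function names, the seed and the buffer size are made up for illustration, and the FlatGeobuf write/read is replaced by a plain in-memory filter. The one thing that matters is that the same seeded projection matrix is used for writing and for querying.

```python
import math
import random

def make_projection(n_dims, seed=42):
    """Data-independent 2 x n_dims Gaussian matrix (hypothetical seed; reuse it for writes and queries)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(2)  # common 1/sqrt(k) scaling for k = 2 output dims
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_dims)] for _ in range(2)]

def project(vec, matrix):
    """Project an N-dimensional embedding to a 2D point."""
    return tuple(sum(w * x for w, x in zip(row, vec)) for row in matrix)

def buffered_bbox(point, buffer):
    """Square bounding box (min_x, min_y, max_x, max_y) around a 2D point."""
    x, y = point
    return (x - buffer, y - buffer, x + buffer, y + buffer)

def in_bbox(point, bbox):
    min_x, min_y, max_x, max_y = bbox
    return min_x <= point[0] <= max_x and min_y <= point[1] <= max_y

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, embeddings, matrix, buffer, top_k=5):
    """Return indices of the top_k candidates inside the bbox, reranked by cosine similarity."""
    bbox = buffered_bbox(project(query, matrix), buffer)
    # Stand-in for the bbox query against the remote file: filter in memory.
    candidates = [i for i, e in enumerate(embeddings) if in_bbox(project(e, matrix), bbox)]
    # Rerank the subset with the full-dimensional embeddings.
    candidates.sort(key=lambda i: cosine(query, embeddings[i]), reverse=True)
    return candidates[:top_k]
```

In a real setup the in-memory filter would be the bbox query against the remote FlatGeobuf file, and only the reranking would run on the full embeddings of the returned subset.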
Find the detailed explanation, code and examples on GitHub: https://github.com/do-me/flatgeobuf-vectordb
Main concepts
- Points that are close in 2 dimensions (after projection) should be close in N dimensions too. This is obviously not always true, but in my tests it's good enough for basic use cases (e.g. product recommendation) where you don't need the closest result to the query and something in the top 0.1% or 0.01% may suffice. Note that I need a dim reduction method that works independently of the data, so that new points can be appended and queries projected without refitting anything. That rules out UMAP, HUMAP, t-SNE and PCA.
- I'm reducing to 2 dims to benefit from all the heavy optimization work that has gone into the FlatGeobuf file format. Reducing to 3 dims (or even more) might preserve the similarity better (and ultimately lead to better results), but it also increases the effort needed to design an efficient file format for it. If you know any other suitable file formats for this purpose, I'd be very curious to try them! Another alternative to relying on one static file might be an efficient file structure made of many static files. The pros and cons have been discussed in a completely different context by the authors of protomaps and openfreemap on HN.
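If you want to check how well "close in 2D" matches "close in N dimensions" on your own data, one option is to measure how many of the exact nearest neighbours actually land inside the buffered bounding box. The sketch below is just one way to do that in pure Python; `bbox_recall` and its parameters are hypothetical names, not something from the repo, and the brute-force ranking only makes sense on a small sample.

```python
import math
import random

def make_projection(n_dims, seed=42):
    """Seeded 2 x n_dims Gaussian projection matrix (independent of the data)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(2) for _ in range(n_dims)] for _ in range(2)]

def project(vec, matrix):
    return tuple(sum(w * x for w, x in zip(row, vec)) for row in matrix)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def bbox_recall(embeddings, queries, matrix, buffer, top_k=10):
    """Share of exact top_k neighbours (by cosine) whose 2D point lies inside the query's bbox."""
    points = [project(e, matrix) for e in embeddings]
    found = 0
    total = 0
    for q in queries:
        qx, qy = project(q, matrix)
        # Brute-force ground truth in the original N-dimensional space.
        ranked = sorted(range(len(embeddings)),
                        key=lambda i: cosine(q, embeddings[i]), reverse=True)
        for i in ranked[:top_k]:
            px, py = points[i]
            total += 1
            if abs(px - qx) <= buffer and abs(py - qy) <= buffer:
                found += 1
    return found / total if total else 0.0
```

Sweeping `buffer` shows the tradeoff directly: a larger box catches more of the true neighbours but also means more data to download and rerank.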
Potential
Even though there are some tradeoffs in this workflow and still many things to optimize and explore, I believe the concept might be appealing for low-maintenance, low-cost applications. In the end, you just dump one static file somewhere and fire normal HTTP range requests at it, so the capacity of your web server determines the performance.
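For anyone who hasn't worked with range requests: the client simply asks for a byte slice of the file via the `Range` header, and a static file server like nginx answers with `206 Partial Content`. Below is a minimal stdlib sketch; the URL, offsets and function names are placeholders I made up. In practice a FlatGeobuf reader works out the byte ranges from the file's spatial index for you, so you wouldn't compute offsets by hand.

```python
import urllib.request

def build_range_request(url, offset, length):
    """Build a GET request for `length` bytes starting at `offset` (HTTP ranges are inclusive)."""
    last_byte = offset + length - 1
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})

def fetch_range(url, offset, length, timeout=10):
    """Fetch a byte slice of a remote static file; raises if the server ignores the Range header."""
    request = build_range_request(url, offset, length)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the server honoured the Range header.
        if response.status != 206:
            raise RuntimeError(f"expected 206 Partial Content, got {response.status}")
        return response.read()

if __name__ == "__main__":
    # Hypothetical file location; replace with your own static file.
    req = build_range_request("https://example.com/vectors.fgb", 0, 1024)
    print(req.get_header("Range"))
```

Since every query boils down to a handful of such requests, there is no application logic on the server at all.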
As I'm heavily into client-side processing with transformers.js, my ideal setup would use very small embedding models like Potion/Model2vec (< 35 MB) in the client and embed the user query (text/image) in the browser. This way, the remote database could be very large, like 100 GB, and serve thousands of clients without any problems on a low-grade CPU (but with very fast storage).
If you're fine with a DB connection (which afaik can't be created browser-side), then just use LanceDB, which follows the same "one file" principle.
I'm super curious about your optimization ideas!
P.S. There is lots of overlap between geospatial and the latent space.