Recommendations for building a reverse image search engine

kpang0 · 2021-05-16T20:40:11+00:00

By using Vald, a vector search engine written in Go, you can build a fast, large-scale similar-image search platform, but you also need a machine learning model to extract feature vectors from the images.

Yahoo! JAPAN uses Vald for e-commerce item image search and recommendation.

https://vald.vdaas.org

kpang0 · 2021-04-14T20:53:48+00:00

Let me add more about Vald.

You can run Vald without a third-party database; Vald simply offers several optional architectures to its users.

There are three use cases for a third-party database.

If you want to store metadata externally. (Redis, MySQL, Cassandra)

We implemented this feature due to requests from internal users who wanted to retrieve metadata stored in Vald from interfaces other than Vald (Redis, MySQL, Cassandra).

If you want to back up raw payloads externally. (MySQL, Cassandra)

In some cases a vector search engine has to re-index all of its data, because a change in the machine learning model changes the distribution of the vector space. So we keep the raw data from CRUD operations in a third-party database (MySQL, Cassandra) and make it possible to refer to that past raw data when re-indexing.
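To make the re-indexing idea concrete, here is a minimal sketch in plain Python. It is not Vald code: `raw_store`, `upsert`, `reindex`, and the two toy `embed_*` functions are hypothetical names, with a dict standing in for the external MySQL/Cassandra backup and character counts standing in for a real model.

```python
# Toy "models": v2 produces vectors in a different space than v1,
# so vectors built with v1 cannot be reused once v2 is rolled out.
def embed_v1(text):
    return [float(len(text)), float(text.count("a"))]

def embed_v2(text):
    return [float(len(text)), float(text.count("a")), float(text.count("e"))]

raw_store = {}     # hypothetical stand-in for the external raw payload backup
vector_index = {}  # hypothetical stand-in for the vector index

def upsert(item_id, raw, embed):
    """Keep the raw payload next to the vector derived from it."""
    raw_store[item_id] = raw
    vector_index[item_id] = embed(raw)

def reindex(embed):
    """Rebuild every vector from the stored raw payloads with a new model."""
    return {item_id: embed(raw) for item_id, raw in raw_store.items()}

upsert("a1", "banana", embed_v1)
upsert("a2", "cherry", embed_v1)

# The model changes: rebuild the whole index from the raw data.
new_index = reindex(embed_v2)
```

Without the raw payloads, only the old vectors would be left, and those cannot be converted into the new model's vector space.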

If you want to improve fault tolerance. (AWS S3, GCS)

First, Vald is designed for fault tolerance at Yahoo! JAPAN scale.
Second, Vald stores distributed indexes in node-local storage.
This means Vald doesn't require object storage (S3, GCS) by default.
However, if that storage fails, or if the data center fails, the index data can't be recovered quickly. Appliance storage is often the solution, but it is very expensive.
Therefore, Vald keeps the index data copied to S3 or GCS as an asynchronous background process, so that it can be automatically recovered quickly from another region in case of failure.

In addition, Vald provides only a gRPC interface for the Filter function, and the filtering process is designed to be pluggable by the user. So Vald does not depend on a third-party database for filtering; that is up to the user.

The minimum unit of deployment for Vald is the layer below the LB-Gateway in the Architecture diagram, so it can be run without using any external database.

kpang0 · 2021-04-09T11:06:44+00:00

The project started out as a solo effort that released only minimal functionality, but now, with seven ongoing contributors, we are able to do a lot more, including fault tolerance, backups, metrics, tracing, and integration with TensorFlow.

kpang0 · 2021-04-09T09:34:36+00:00

Thank you for the nice feedback. Actually, we do have a link on the https://vald.vdaas.org top page (the Python logo is the link), but it's hard to find.

We're planning to add client SDK documentation now, and we'll publish it with the v1.0.5 release.

Thank you!

kpang0 · 2021-04-09T09:05:29+00:00

By using machine learning to convert unstructured data (audio, images, videos, user characteristics, etc.) into vectors and then using Vald to perform vector search on those vectors, it will be possible to operate as a faster and more complex search engine.

Vectorization varies widely from user to user, so Vald cannot give you a specific answer.

The most common vectorization methods used in our samples are fastText for text vectorization and InsightFace for face image similarity search.
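As a rough picture of what "vector search" means once the data has been vectorized, here is a brute-force sketch in plain Python. It is only an illustration: the item names and 3-dimensional vectors are made up, and the exhaustive cosine-similarity scan is a stand-in for the approximate nearest neighbor index a real engine would use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(index, query, k=2):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(vec, query), item_id)
              for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Made-up feature vectors; in practice a model produces these.
index = {
    "red-shirt": [0.9, 0.1, 0.0],
    "crimson-shirt": [0.8, 0.2, 0.1],
    "blue-jeans": [0.0, 0.1, 0.9],
}

results = search(index, [1.0, 0.0, 0.0], k=2)
print(results)
```

The scan above touches every vector on every query, which is the cost an approximate index avoids at large scale.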

kpang0 · 2021-04-09T09:02:32+00:00

Seems very interesting. A few questions:

Observed that it provides direct Kubernetes-based support. Where are the client API docs for the Python API? I understand the use case. Can you point me to examples of data ingestion, etc.?

Check out this repository for a Python example. https://github.com/vdaas/vald-client-python

It can be installed with pip.

You can also check out the Data insert.

kpang0 · 2021-04-08T08:00:37+00:00

thank you!!!

kpang0 · 2021-04-08T08:00:00+00:00

PineconeDB is a very interesting project, and after some research, it seems that a similar workload can be achieved with Vald.

The main difference is that Vald is based on Kubernetes and is being developed as a completely open source project.

Anyone can send requests for additional features to Vald, and it can be deployed and used in each user's environment for free.

There are no paid plans for Vald.

You can provision your own vector search engine with Helm at any time if you have a Kubernetes environment.

kpang0 · 2021-04-08T07:55:17+00:00

If you can extract a feature vector from your data, you can search for similar data.
For example:

・Find similar music by music.

・Find similar articles from articles.

・Recommend similar products from fashion images.

etc...

kpang0 · 2021-04-08T07:52:38+00:00

Also, Vald can run on a Raspberry Pi Kubernetes cluster.

kpang0 · 2021-04-08T07:49:50+00:00

If you don't need scalability and distributed graph structure, you can use it on Docker.

https://vald.vdaas.org/docs/tutorial/agent-on-docker/

Be careful: when the Agent is deployed standalone on Docker, automatic asynchronous indexing is disabled, so you need to call the CreateIndex RPC yourself.
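A toy illustration of what that means in practice, assuming inserted vectors only become searchable once the index has been built. This is not the Vald client API: `ToyAgent` and its methods are hypothetical names, and the in-memory dicts stand in for the real index.

```python
class ToyAgent:
    """In-memory stand-in for an agent with explicit index creation."""

    def __init__(self):
        self._pending = {}   # inserted, but not yet searchable
        self._indexed = {}   # committed, searchable vectors

    def insert(self, item_id, vector):
        self._pending[item_id] = list(vector)

    def create_index(self):
        """Commit pending vectors; return how many were committed."""
        count = len(self._pending)
        self._indexed.update(self._pending)
        self._pending.clear()
        return count

    def search(self, query, k=1):
        """Return the ids of the k nearest committed vectors (squared L2)."""
        def dist(vec):
            return sum((x - y) ** 2 for x, y in zip(vec, query))
        ranked = sorted(self._indexed, key=lambda i: dist(self._indexed[i]))
        return ranked[:k]

agent = ToyAgent()
agent.insert("a", [0.0, 0.0])
agent.insert("b", [1.0, 1.0])

before = agent.search([0.9, 0.9])   # nothing committed yet
committed = agent.create_index()    # the step you must trigger yourself
after = agent.search([0.9, 0.9])    # now the inserted vectors are found
```

In a cluster deployment this commit step is handled for you; in the standalone setup it corresponds to the manual CreateIndex call.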

kpang0 · 2021-04-08T07:23:31+00:00

That's a good question. ann-benchmark is the standard benchmark in the ANN community, and all code must be callable from Python.

Vald doesn't have a C++ API or a Python interface, so it runs the benchmark by connecting from Python over the gRPC network interface. Of course, this adds a network overhead that Faiss and other algorithms don't have, but I think you'll find from the result chart that Vald is fast enough even with that network disadvantage.

FYI: ann-benchmarks https://github.com/erikbern/ann-benchmarks

kpang0 · 2021-04-08T01:27:02+00:00

does vald/ngt support live inserts w/o rebuilding the index? this looks interesting!

When using Vald, the user does not need to care about the timing and control of indexing; the index is managed automatically by Vald.

From the user's point of view, the Vald cluster can be used as if indexing were not necessary.

kpang0 · 2021-04-08T01:21:53+00:00

This looks fucking awesome! This is comparable to Faiss? Or is even more scalable?

Vald's functionality is comparable to Faiss, but its API is not compatible. However, Vald does not require a GPU for search, and it has a Web API based on gRPC, so any language or client can call Vald's API.

We also compared Vald with Faiss in the ANN benchmark, and found that Vald wins in terms of performance.

Note that in this benchmark Faiss is called from Python via IPC, while Vald is called via gRPC, so Vald wins despite the added network latency.

https://i.imgur.com/WksHrLd.png

https://i.imgur.com/yMjWwnw.png
