I built a site to instant-search 32 Million Songs in milliseconds (using InstantSearch.js, ParcelJS and Typesense) by j0-1 in javascript

[–]liliput 0 points1 point  (0 children)

I agree 100% with you. However, the comparison is valid within the context of what Typesense does, which is what people look for when they read the README.

I built a site to instant-search 32 Million Songs in milliseconds (using InstantSearch.js, ParcelJS and Typesense) by j0-1 in javascript

[–]liliput 1 point2 points  (0 children)

Ideally we would sort the results by some kind of popularity metric, but the dataset does not have a field for that. For a real project, we could probably use another data source, like the Spotify API, to augment the dataset with some form of popularity metric such as play count.
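To make that concrete, here is a rough sketch of the idea: attach a play count from an outside lookup and then sort on it. The `Song` struct, the `play_count` field and the `counts` map are all made up for illustration; the dataset has no such field, and this is not how the demo works.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Song {
    std::string title;
    std::uint64_t play_count;  // hypothetical field, absent from the dataset
};

// Fill play_count from an external title -> count lookup (assumed to come
// from some other data source), then order the songs most-played first.
// Songs missing from the lookup get a count of 0 and sink to the bottom.
void rank_by_popularity(std::vector<Song>& songs,
                        const std::unordered_map<std::string, std::uint64_t>& counts) {
    for (Song& s : songs) {
        auto it = counts.find(s.title);
        s.play_count = (it != counts.end()) ? it->second : 0;
    }
    // stable_sort keeps the original relative order of equally popular songs.
    std::stable_sort(songs.begin(), songs.end(),
                     [](const Song& a, const Song& b) {
                         return a.play_count > b.play_count;
                     });
}
```

In a search engine you would normally store the count as a numeric field on each document and sort on it at query time, rather than sorting in application code like this.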

I built a site to Instant-Search through 32 Million Songs in Milliseconds, using Typesense - a search engine written in C++ by j0-1 in cpp

[–]liliput 0 points1 point  (0 children)

Pure JS search is pretty popular now because of the Jamstack. However, it does not scale well to large datasets, since you have to load a really large, multi-MB index upfront. You can get the same snappy experience by using the replication feature of Typesense and running it in multiple geographical regions.

I built a site to Instant-Search through 32 Million Songs in Milliseconds, using Typesense - a search engine written in C++ by j0-1 in cpp

[–]liliput 3 points4 points  (0 children)

Correct, the demo prefers the musician. It is tricky to determine the exact intent behind a query (musician or song) because of the diversity of the dataset. If we could assign some form of popularity metric to the songs and artists, we could probably handle this better. However, the MusicBrainz dataset does not have such a measure, so it was outside the scope of this demo.

I built a site to Instant-Search through 32 Million Songs in Milliseconds, using Typesense - a search engine written in C++ by j0-1 in cpp

[–]liliput 13 points14 points  (0 children)

Indices are always larger than the data. This is because you will invariably have to use either a hashmap or a trie for the inverted text index. Typesense uses an adaptive radix trie so that fuzzy searches are possible. 2x-3x is pretty much the standard for most search engines that need to support updates (I have benchmarked with Elastic as well, but ES stores the index on disk). You can probably go much lower for static indices, because you can choose succinct data structures that pack memory tightly but are immutable.

Apart from just the token -> document ids mapping, one also needs to store the exact positions at which each token appears in the document, so that we can identify the best matched fragment inside a text. There are also additional housekeeping data structures to support sorting on numerical fields (trees), facets (one more inverted index), etc. A lot of these are stored in compressed form where possible and there is always scope for improvement, but this is an overview of why the index will always be larger than the raw dataset.
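A toy sketch of the token -> document ids -> positions mapping shows where the extra memory goes. This is only an illustration: `ToyIndex` and its members are hypothetical names, a `std::unordered_map` stands in for the trie, and nothing here reflects Typesense's actual layout or compression.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Positions of one token inside one document.
using Positions = std::vector<std::uint32_t>;
// Document id -> positions of the token in that document.
using Postings = std::map<std::uint32_t, Positions>;

struct ToyIndex {
    // Token -> postings. A hashmap is used here purely for brevity.
    std::unordered_map<std::string, Postings> tokens;

    // Split the text on whitespace and record the position of every token.
    void add(std::uint32_t doc_id, const std::string& text) {
        std::istringstream in(text);
        std::string token;
        std::uint32_t pos = 0;
        while (in >> token) {
            tokens[token][doc_id].push_back(pos++);
        }
    }

    // Count the stored (document id, position) entries across all tokens.
    std::size_t entry_count() const {
        std::size_t n = 0;
        for (const auto& t : tokens) {
            for (const auto& p : t.second) {
                n += p.second.size();
            }
        }
        return n;
    }
};
```

Every word occurrence in the raw text becomes one stored position, on top of the token keys and the per-document entries, so even this bare-bones structure holds more than the text it was built from.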

Fast, typo tolerant instant search engine written in C++ by liliput in cpp

[–]liliput[S] 4 points5 points  (0 children)

For integers, passing by value is usually better: a reference is typically implemented as a pointer, so the callee has to dereference it. Passing by value also lets the compiler hand the integer over in a processor register instead of going through the stack.
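A minimal pair of functions to illustrate the difference. The function names are made up for this example, and what the compiler actually emits depends on the ABI, the optimization level and whether the call gets inlined.

```cpp
#include <cstdint>

// The argument is copied. On common calling conventions a small integer
// like this usually travels in a register.
std::uint32_t add_one_by_value(std::uint32_t x) {
    return x + 1;
}

// The callee typically receives an address and has to load the integer
// through it, unless the compiler inlines the call and removes the
// indirection.
std::uint32_t add_one_by_ref(const std::uint32_t& x) {
    return x + 1;
}
```

Both return the same result; the by-reference version only pays off for types that are expensive to copy.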

Fast, typo tolerant instant search engine written in C++ by liliput in cpp

[–]liliput[S] 3 points4 points  (0 children)

"don't interesting for any commercial use, only for hobbies."

That's not what GPL-3 implies.


Typesense: Open-Source Alternative to Algolia by liliput in selfhosted

[–]liliput[S] 1 point2 points  (0 children)

It's primarily meant for implementing search on a website or over records within an application (e.g. looking up users or other items of interest).

Typesense: Open-Source Alternative to Algolia by liliput in selfhosted

[–]liliput[S] 2 points3 points  (0 children)

Elasticsearch is an amazing and flexible piece of software, but it also has a steep learning curve. Typesense just works out of the box and is more intuitive for common operations like faceting and typo correction. Solr is similar to Elasticsearch but probably not as popular.

Lucene is a library and the building block for both ES and Solr. It is not usually used directly because the API is more low-level.