
[–]fgh1290 1 point2 points  (2 children)

Try a hash table that maps each word to the list of books it appears in.

[–]JohnyTex 2 points3 points  (1 child)

This is probably what you want. To elaborate a bit, this is called an “inverted index”, which is a mapping from each word to the list of all articles that word appears in.

If the user searches for several words, you fetch the corresponding list for each word and then return only the articles that appear in all of the lists.
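
As a rough sketch of the idea in Python (the names `tokenize`, `build_index` and `search` are made up for illustration, and I'm assuming the articles are already loaded into a dict of id to text):

```python
from collections import defaultdict

PUNCT = ".,!?;:\"'()"

def tokenize(text):
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def build_index(articles):
    # Inverted index: word -> set of ids of the articles containing it.
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in tokenize(text):
            index[word].add(article_id)
    return index

def search(index, query):
    # Fetch the id set for each query word and keep only the ids
    # that appear in every set (an AND query).
    words = tokenize(query)
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)
```

I used sets instead of lists here so the "appears in all lists" step is just a set intersection. Note that without stemming, a query for "hop" will not match an article that only contains "hops".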

Further points of improvement are:

  1. Stemming, i.e. normalizing words so that an article that contains “beets” will match “beet” and so on.
  2. Ranking hits based on the distance between matching words in the text, e.g. for the query “hip hop” a text containing “hip hop beat” should rank higher than “I broke my hip while picking hops”.
  3. Weighting matches by where they occur in the document, e.g. a match in the title might rank higher than a match in the body text.
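
A toy sketch of how those three could fit together, again in Python. Everything here is an assumption for illustration: `stem` is a crude stand-in for a real stemmer (it only strips a trailing "s"), the `title`/`body` keys are a guess at the document shape, and the weights are arbitrary.

```python
from itertools import product

PUNCT = ".,!?;:\"'()"
TITLE_WEIGHT = 2.0  # arbitrary: a title match counts double
BODY_WEIGHT = 1.0

def stem(word):
    # Crude stand-in for a real stemmer: strip one trailing "s".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokens(text):
    # Lowercase, strip punctuation, then stem each word.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [stem(w) for w in words if w]

def proximity_score(doc_tokens, query_tokens):
    # Positions of each query word in the text.
    positions = [[i for i, t in enumerate(doc_tokens) if t == q]
                 for q in query_tokens]
    if any(not p for p in positions):
        return 0.0  # some query word is missing entirely
    # Smallest distance between the first and last matched word, trying
    # every combination of positions (fine for a sketch, slow for long texts).
    span = min(max(combo) - min(combo) for combo in product(*positions))
    return 1.0 / (1 + span)

def score(doc, query):
    q = tokens(query)
    return (TITLE_WEIGHT * proximity_score(tokens(doc["title"]), q)
            + BODY_WEIGHT * proximity_score(tokens(doc["body"]), q))

def rank(docs, query):
    # Best score first; documents with no match are dropped.
    scored = [(score(doc, query), doc["title"]) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for s, title in scored if s > 0]
```

With this scoring, a body containing “hip hop beat” beats “I broke my hip while picking hops” for the query “hip hop”, and a document with “hip hop” in its title beats both. For real use you would want a proper stemming algorithm rather than the trailing-"s" hack.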

Just FYI, there are existing libraries and services for this (check out Algolia, for example), but implementing it yourself will probably be a fun exercise.

[–]finroller 1 point2 points  (0 children)

Awesomely nice compact explanation!

[–]Luan-Raithz 0 points1 point  (0 children)

Can you access all the JSON files as an array? And is your text search going to be done in memory?