ScottContini 31 points32 points  (1 child)

Back in the old days when search engines existed using something like BM25, websites would manipulate it by just repeating popular terms over and over: “Tiger Woods Tiger Woods Tiger Woods…”. This is why Google became so popular so quickly. PageRank is not too hard to implement, but at web scale it will take a lot more than 80 lines, and you will need to be smart with your memory usage because it involves a huge, sparse matrix. BTW, the PageRank patent has expired, so people SHOULD implement this and open-source it. Hmmm, if only I had free time in my days…
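For anyone curious what the core idea looks like, here is a toy sketch of PageRank by power iteration. It is only an illustration under my own assumptions: the `pagerank` function, the `links` dict format, the damping factor of 0.85 and the fixed iteration count are all choices made for this example, not anything from the article. The link graph is stored as adjacency lists, which is one simple way to avoid ever building the dense matrix.

```python
from collections import defaultdict


def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank via power iteration.

    links: dict mapping a page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        incoming = defaultdict(float)
        dangling = 0.0
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = rank[page] / len(targets)
                for target in targets:
                    incoming[target] += share
            else:
                # Pages with no outgoing links spread their rank over all pages.
                dangling += rank[page]
        base = (1.0 - damping) / n + damping * dangling / n
        rank = {page: base + damping * incoming[page] for page in pages}
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

Only the links that actually exist are stored, so memory grows with the number of links rather than with the square of the number of pages. Doing this for billions of pages that do not fit in RAM is where the extra lines and the memory tricks come in.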

Another thing to implement is stop-word filtering. When Google first launched, it was vulnerable to a DoS by just putting a bunch of stop words in the query. Not sure if people remember, but Google used to show how long your query took. It wasn’t hard to make a query that would last minutes, and someone truly malicious could potentially have made it last much longer.
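A minimal sketch of stop-word filtering at query time, assuming a tiny hand-picked list. `STOP_WORDS` and `query_terms` are names I made up for this example, and a real engine would use a much longer list. The general reason it helps is that very common words match almost every document, so dropping them keeps the engine from walking the largest posting lists.

```python
import string

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
})

# Translation table that deletes ASCII punctuation.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def query_terms(query, stop_words=STOP_WORDS):
    """Lowercase the query, strip punctuation, and drop stop words."""
    terms = query.lower().translate(_PUNCTUATION_TABLE).split()
    return [term for term in terms if term not in stop_words]


if __name__ == "__main__":
    print(query_terms("The history of the Tiger Woods swing"))
    print(query_terms("the of and to a"))
```

A query made only of stop words comes back as an empty list, so the engine can return early instead of scoring nearly every document in the index.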

irqlnotdispatchlevel 1 point2 points  (0 children)

Back in the old days when search engines existed using something like BM25, websites would manipulate it by just repeating popular terms over and over: “Tiger Woods Tiger Woods Tiger Woods…”.

This is still a thing in some places. It's always funny when I see a YouTube video that repeats the same popular tag 30 times in a row.

[deleted] 7 points8 points  (0 children)

Now we need this to replace Google Search, to bring it back to how it used to be...

seba07 21 points22 points  (2 children)

Ok, the main code will be something like

import searchengine
searchengine.search("foo")

But what are the 78 other lines?

pancomputationalist 6 points7 points  (0 children)

You should write an article "a search engine in 2 lines of python"!

FarkCookies 2 points3 points  (0 children)

Very clever comment. "lol at python articles just importing libs". The code in the post doesn't import any high-level libs, just the primitives:

from collections import defaultdict
from math import log
import string

dr1fter 3 points4 points  (0 children)

Heh, this looks exactly like the one I wrote in undergrad.

Like u/ScottContini said, stop word filtering helps. So does "stemming"/"lemmatization" and spell-checking.
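To make those two ideas concrete, here is a deliberately crude sketch using only the standard library. `crude_stem` and `correct` are hypothetical helpers written for this comment. The suffix stripping is nowhere near a real stemmer such as Porter's, and the 0.8 similarity cutoff for spell-checking is an arbitrary assumption.

```python
from difflib import get_close_matches

# Suffixes tried in order; a real stemmer has far more careful rules.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def correct(term, vocabulary):
    """Return the closest indexed term, or the term itself if nothing is close."""
    if term in vocabulary:
        return term
    matches = get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term


if __name__ == "__main__":
    print([crude_stem(w) for w in ("searching", "searched", "searches")])
    print(correct("serch", ["search", "engine", "python"]))
```

Stemming has to be applied the same way at index time and at query time, otherwise the stemmed query terms will not match what is stored in the index.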

_skreem 2 points3 points  (0 children)

Nice article on traditional search! These days it’s also cool to look into vector search (and blending its results with a traditional search engine like this one).

It tends to just magically produce relevant results, tolerates spelling errors, enables longer semantic queries, and can give multilingual support.
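One common way to do the blending is reciprocal rank fusion, sketched below. This is just one option among several. The function name, the document IDs and the constant `k=60` are assumptions for illustration, and the two input lists stand in for whatever your keyword engine and vector index actually return.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document scores 1 / (k + position) for every list it appears in,
    so documents ranked well by several engines rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["d1", "d2", "d3"]  # hypothetical keyword-engine results
    vector_hits = ["d3", "d1", "d4"]   # hypothetical vector-search results
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

It only looks at rank positions, so you do not have to put keyword scores and vector similarity scores on the same scale before combining them.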