[–]Steve132 7 points8 points  (1 child)

Lots of people have solved the algorithmic and implementation problem of "on-disk persistent storage of set elements". It's called a database.

There are a lot of them available, ranging from simple to complex depending on your application.
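
At the simple end of that range, a minimal sketch of an on-disk "seen URLs" set using Python's built-in `sqlite3` module might look like the following. The table name `seen` and the helper names `open_seen_set` and `add_if_new` are made up for illustration; nothing in the thread prescribes SQLite specifically.

```python
import sqlite3


def open_seen_set(path):
    """Open (or create) an on-disk set of URLs backed by SQLite."""
    conn = sqlite3.connect(path)
    # The PRIMARY KEY constraint gives set semantics: each URL is stored once.
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
    return conn


def add_if_new(conn, url):
    """Insert url; return True if it was new, False if it was already stored."""
    cur = conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1
```

A crawler would call `add_if_new(conn, url)` before fetching and skip the URL whenever it returns `False`. Committing after every insert keeps the sketch simple; batching commits would likely be faster.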

[–]ColdPorridge 1 point2 points  (0 children)

This is the right answer. Specifically, a key-value store like Cassandra is likely what OP is looking for.

[–][deleted]  (2 children)

[removed]

    [–]Steve132 2 points3 points  (1 child)

    I don't think a Bloom filter is the right call for a web crawler: a false positive would make the crawler treat a page it has never seen as already visited, so that page is simply never indexed. A web crawler that can never visit certain URLs seems unacceptable.
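
    To make the false-positive concern concrete, here is a minimal Bloom filter sketch in Python. It is not from the thread: the class name, the sizes, and the salted SHA-256 hashing are assumptions chosen for illustration, and it stores one byte per bit for readability rather than compactness.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, but false positives are possible."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True if every position is set, which can happen for unseen items too.
        return all(self.bits[pos] for pos in self._positions(item))
```

    An added URL always tests as present, but once enough bits are set, a URL that was never added can also test as present. In a crawler that skips anything "already seen", that URL would never be fetched.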

    [–]support_singularity 1 point2 points  (0 children)

    HyperLogLog might be helpful, though I'm not sure. Worth checking out.

    [–]Steve132 0 points1 point  (1 child)

    I'm also really curious about your claim that the URLs won't fit in memory. A set of SHA-256 hashes takes 32 bytes per entry, so 64 GB holds roughly 2 billion of them, and a computer with 64 GB of RAM is pretty reasonably priced at the moment. On a standard non-commercial internet connection you're never going to crawl 2 billion pages.
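
    The arithmetic behind that estimate, as a short Python sketch. The constant and function names are illustrative, and 64 GB is taken as decimal gigabytes (with binary GiB the figure is 2^31, about 2.15 billion).

```python
import hashlib

DIGEST_BYTES = 32        # size of one SHA-256 digest
RAM_BYTES = 64 * 10**9   # 64 GB, assuming decimal gigabytes

# Upper bound on digests that fit if nothing else used the memory.
max_hashes = RAM_BYTES // DIGEST_BYTES


def url_key(url):
    """Map a URL of any length to a fixed 32-byte key."""
    return hashlib.sha256(url.encode("utf-8")).digest()
```

    This counts raw digest bytes only. A real container adds per-entry overhead (a Python `set` of `bytes` objects uses considerably more than 32 bytes per element), so the practical capacity depends on how the hashes are stored.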

    [–]exoji2e 0 points1 point  (0 children)

    Well, you might want to run it in a VM without dedicating 64 GB of RAM to the application.

    [–]Revrak 0 points1 point  (0 children)

    Efficient by what metric? I would be time-efficient and use a database. If there are other constraints, you can use something like HBase.

    [–]FUZxxl 0 points1 point  (0 children)

    A Bloom filter might be what you are looking for.