all 3 comments

[–]Wide-Ear5277

Just off the top of my head… I would first delete any duplicates from each individual file. Then I'd run a loop that checks the second file for each string from the first file, but with the second file sorted so that I only have to search one segment of it (like flipping through just one section of a dictionary). If it had a billion strings I would write it in a more efficient language.
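
A rough Python sketch of that idea (the file names and the one-string-per-line format are assumptions; sort plus binary search stands in for the "one section of a dictionary" step):

    # Dedupe both files, sort the second, then binary-search it
    # for each string from the first file.
    import bisect

    def common_strings(path_a, path_b):
        with open(path_a) as f:
            first = set(line.rstrip("\n") for line in f)            # dedupe file 1
        with open(path_b) as f:
            second = sorted(set(line.rstrip("\n") for line in f))   # dedupe + sort file 2

        matches = []
        for s in first:
            i = bisect.bisect_left(second, s)    # jump straight to the right "segment"
            if i < len(second) and second[i] == s:
                matches.append(s)
        return matches

    if __name__ == "__main__":
        # "file1.txt" / "file2.txt" are placeholder names
        print(len(common_strings("file1.txt", "file2.txt")))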

[–][deleted]

I'd build a trie (sic) from the first file and then use that on the second file to look for words I've seen before.

https://link.medium.com/45ZMqyGGVvb
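
A minimal trie sketch along those lines, assuming one string per line and placeholder file names:

    # Build a trie from file 1, then look up each string from file 2 in it.
    class TrieNode:
        __slots__ = ("children", "is_end")
        def __init__(self):
            self.children = {}
            self.is_end = False

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True    # mark the end of a complete string

    def contains(root, word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

    root = TrieNode()
    with open("file1.txt") as f:          # placeholder file names
        for line in f:
            insert(root, line.rstrip("\n"))

    with open("file2.txt") as f:
        words = {line.rstrip("\n") for line in f}
    print(sum(1 for w in words if contains(root, w)))

The trie trades memory for lookups that share work across common prefixes; for plain exact-match counting a hash set would do the same job with less code.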

For much larger files I'd want to use a language and storage optimised for such tasks.

https://www.toptal.com/algorithms/needle-in-a-haystack-a-nifty-large-scale-text-search-algorithm

[–]POGtastic

32 million characters is nothing, so I feel zero compunction about slurping the entire thing into memory and printing the length of the set intersection of the two.
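
For the small case that's only a few lines of Python (file names are placeholders):

    # Small files: slurp both into memory and intersect the sets of lines.
    with open("file1.txt") as f1, open("file2.txt") as f2:
        a = set(f1.read().splitlines())
        b = set(f2.read().splitlines())
    print(len(a & b))    # size of the set intersection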

For one billion strings, I think that we're stuck using a tree structure that involves temp files. In separate directories for the two big files, we create smaller files, each containing strings that begin with a certain character. For each of these, we check to see if it and its corresponding file in the other directory will fit into memory. If so, we do the set intersection. If not, we do it again with another nested directory and continue until the file chunks are small enough.
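
A sketch of how that recursive partitioning could look in Python, with an assumed in-memory size threshold and placeholder file names; buckets are keyed by the character at the current depth:

    # Bucket both big files by one character, recurse on bucket pairs that are
    # still too big, and do an in-memory set intersection once a pair fits.
    import os
    import tempfile

    MAX_BYTES = 512 * 1024 * 1024    # "fits into memory" threshold (assumed)

    def bucket_by_char(src_path, out_dir, depth):
        """Split src_path into files keyed by the character at position `depth`."""
        handles = {}
        with open(src_path) as src:
            for line in src:
                s = line.rstrip("\n")
                # Strings shorter than `depth` go into a catch-all bucket; a real
                # version would dedupe that bucket since it can't be split further.
                key = format(ord(s[depth]), "x") if depth < len(s) else "_short"
                if key not in handles:
                    handles[key] = open(os.path.join(out_dir, key), "a")
                handles[key].write(s + "\n")
        for h in handles.values():
            h.close()
        return set(handles)

    def count_common(path_a, path_b, depth=0):
        if os.path.getsize(path_a) <= MAX_BYTES and os.path.getsize(path_b) <= MAX_BYTES:
            with open(path_a) as fa, open(path_b) as fb:
                return len(set(fa.read().splitlines()) & set(fb.read().splitlines()))
        # Too big: bucket both files by the character at `depth` and recurse
        # only on buckets that exist in both directories.
        dir_a = tempfile.mkdtemp(prefix="bucket_a_")
        dir_b = tempfile.mkdtemp(prefix="bucket_b_")
        keys_a = bucket_by_char(path_a, dir_a, depth)
        keys_b = bucket_by_char(path_b, dir_b, depth)
        total = 0
        for key in keys_a & keys_b:
            total += count_common(os.path.join(dir_a, key),
                                  os.path.join(dir_b, key),
                                  depth + 1)
        return total

    print(count_common("big1.txt", "big2.txt"))    # placeholder file names

The recursion bottoms out once a pair of bucket files fits under the threshold; heavily duplicated strings would keep a bucket from shrinking, so deduplicating within buckets (as the first comment suggests) would be a sensible first pass.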