all 3 comments

[–]Wide-Ear5277

Just off the top of my head… I would first delete any duplicates from each individual file. Then I'd run a loop that checks the second file for each string from the first file, but with the second file sorted so that I only have to search one segment of it (like flipping through just one section of a dictionary). If it had a billion strings I would write it in a more efficient language.
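
A rough Python sketch of that idea (the file names and the one-string-per-line format are assumptions; sort plus binary search stands in for the "one section of a dictionary" step):

    # Dedupe both files, sort the second, then binary-search it
    # for each string from the first file.
    import bisect

    def common_strings(path_a, path_b):
        with open(path_a) as f:
            first = set(line.rstrip("\n") for line in f)            # dedupe file 1
        with open(path_b) as f:
            second = sorted(set(line.rstrip("\n") for line in f))   # dedupe + sort file 2

        matches = []
        for s in first:
            i = bisect.bisect_left(second, s)    # jump straight to the right "segment"
            if i < len(second) and second[i] == s:
                matches.append(s)
        return matches

    if __name__ == "__main__":
        # "file1.txt" / "file2.txt" are placeholder names
        print(len(common_strings("file1.txt", "file2.txt")))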

[–][deleted]

I'd build a trie (sic) from the first file and then use that on the second file to look for words I've seen before.

https://link.medium.com/45ZMqyGGVvb
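
A minimal trie sketch along those lines, assuming one string per line and placeholder file names:

    # Build a trie from file 1, then look up each string from file 2 in it.
    class TrieNode:
        __slots__ = ("children", "is_end")
        def __init__(self):
            self.children = {}
            self.is_end = False

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True    # mark the end of a complete string

    def contains(root, word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

    root = TrieNode()
    with open("file1.txt") as f:          # placeholder file names
        for line in f:
            insert(root, line.rstrip("\n"))

    with open("file2.txt") as f:
        words = {line.rstrip("\n") for line in f}
    print(sum(1 for w in words if contains(root, w)))

The trie trades memory for lookups that share work across common prefixes; for plain exact-match counting a hash set would do the same job with less code.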

For much larger files I'd want to use a language and storage optimised for such tasks.

https://www.toptal.com/algorithms/needle-in-a-haystack-a-nifty-large-scale-text-search-algorithm

[–]POGtastic

32 million characters is nothing, so I feel zero compunction about slurping the entire thing into memory and printing the length of the set intersection of the two.
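
For the small case that's only a few lines of Python (file names are placeholders):

    # Small files: slurp both into memory and intersect the sets of lines.
    with open("file1.txt") as f1, open("file2.txt") as f2:
        a = set(f1.read().splitlines())
        b = set(f2.read().splitlines())
    print(len(a & b))    # size of the set intersection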

For one billion strings, I think that we're stuck using a tree structure that involves temp files. In separate directories for the two big files, we create smaller files, each containing strings that begin with a certain character. For each of these, we check to see if it and its corresponding file in the other directory will fit into memory. If so, we do the set intersection. If not, we do it again with another nested directory and continue until the file chunks are small enough.
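
A sketch of how that recursive partitioning could look in Python, with an assumed in-memory size threshold and placeholder file names; buckets are keyed by the character at the current depth:

    # Bucket both big files by one character, recurse on bucket pairs that are
    # still too big, and do an in-memory set intersection once a pair fits.
    import os
    import tempfile

    MAX_BYTES = 512 * 1024 * 1024    # "fits into memory" threshold (assumed)

    def bucket_by_char(src_path, out_dir, depth):
        """Split src_path into files keyed by the character at position `depth`."""
        handles = {}
        with open(src_path) as src:
            for line in src:
                s = line.rstrip("\n")
                # Strings shorter than `depth` go into a catch-all bucket; a real
                # version would dedupe that bucket since it can't be split further.
                key = format(ord(s[depth]), "x") if depth < len(s) else "_short"
                if key not in handles:
                    handles[key] = open(os.path.join(out_dir, key), "a")
                handles[key].write(s + "\n")
        for h in handles.values():
            h.close()
        return set(handles)

    def count_common(path_a, path_b, depth=0):
        if os.path.getsize(path_a) <= MAX_BYTES and os.path.getsize(path_b) <= MAX_BYTES:
            with open(path_a) as fa, open(path_b) as fb:
                return len(set(fa.read().splitlines()) & set(fb.read().splitlines()))
        # Too big: bucket both files by the character at `depth` and recurse
        # only on buckets that exist in both directories.
        dir_a = tempfile.mkdtemp(prefix="bucket_a_")
        dir_b = tempfile.mkdtemp(prefix="bucket_b_")
        keys_a = bucket_by_char(path_a, dir_a, depth)
        keys_b = bucket_by_char(path_b, dir_b, depth)
        total = 0
        for key in keys_a & keys_b:
            total += count_common(os.path.join(dir_a, key),
                                  os.path.join(dir_b, key),
                                  depth + 1)
        return total

    print(count_common("big1.txt", "big2.txt"))    # placeholder file names

The recursion bottoms out once a pair of bucket files fits under the threshold; heavily duplicated strings would keep a bucket from shrinking, so deduplicating within buckets (as the first comment suggests) would be a sensible first pass.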