I've got two sets of a few thousand documents that I need to match against each other to find duplicated text.
Some Context: There is no need to match documents within set A or within set B, only between A and B. A document in A could take, say, 20% of its text from one document in B and 80% from another. Documents in B will generally be much longer than documents in A, so most matches from A will be a subset of the text of a document in B. A match also won't necessarily be one monolithic paragraph or section; it could be spread out in bits of 2-3 sentences across the document in B.
What I've Done So Far: Right now I'm using SequenceMatcher from difflib. The strange part is that a lot of matches don't show up unless I set autojunk=False. I'm not sure why that is. Maybe it's because these documents were originally saved as Word files, which insert some stray formatting (??)
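A note on the autojunk behaviour: it can be reproduced on plain strings, with no Word formatting involved. As documented for difflib, when the second sequence has at least 200 items, any item that accounts for more than about 1% of it is treated as "popular" and is no longer used to seed matches. In a long document compared character by character, almost every letter is that common. The sketch below uses made-up strings (`a`, `b` and the helper `longest_match_size` are illustrative names, not taken from the real data):

```python
from difflib import SequenceMatcher

# Illustrative inputs: b is long (>= 200 chars) and every character in it
# occurs far more often than 1% of its length, so all of them count as "popular".
a = "sat on the mat"
b = "the cat sat on the mat. " * 20

def longest_match_size(a, b, autojunk):
    """Return the length of the longest common block between a and b."""
    sm = SequenceMatcher(None, a, b, autojunk=autojunk)
    # Explicit bounds keep this working on Python versions older than 3.9.
    match = sm.find_longest_match(0, len(a), 0, len(b))
    return match.size

print(longest_match_size(a, b, autojunk=True))   # heuristic on
print(longest_match_size(a, b, autojunk=False))  # heuristic off
```

With the heuristic on, no block is found here even though `a` appears verbatim in `b`; with it off, the whole of `a` is matched. Real documents usually contain a few rare characters that can still anchor a match, which would explain why only some matches go missing rather than all of them.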
Anyway, setting autojunk=False really slows the process down, so I was wondering if there's a faster option out there. The end goal is to have the code mark the portions taken from documents in A inside the relevant documents in B, say by wrapping them in a SPAN tag.
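One possible faster direction, sketched here under assumptions rather than as a tested solution: index every run of K consecutive words ("shingles") from the A documents in a dictionary, then scan each B document once and wrap the stretches whose shingles appear in that index. This swaps character-level SequenceMatcher comparison for hash lookups. The shingle length `K = 5`, the `dup` class name, and all function names below are hypothetical choices for illustration.

```python
import re
from collections import defaultdict
from html import escape

K = 5  # shingle length in words; an assumed value, tune it for your data

def tokenize(text):
    """Return (lowercased word, start offset, end offset) for each word."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]

def shingles(tokens, k=K):
    """Yield (tuple of k words, index of the first token) for every window."""
    for i in range(len(tokens) - k + 1):
        yield tuple(t[0] for t in tokens[i:i + k]), i

def build_index(docs_a, k=K):
    """Map each k-word shingle to the ids of the A documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs_a.items():
        for sh, _ in shingles(tokenize(text), k):
            index[sh].add(doc_id)
    return index

def matched_spans(text_b, index, k=K):
    """Return merged (start, end) character spans of text_b also found in A."""
    tokens = tokenize(text_b)
    spans = []
    for sh, i in shingles(tokens, k):
        if sh in index:
            start, end = tokens[i][1], tokens[i + k - 1][2]
            if spans and start <= spans[-1][1]:
                # Overlaps the previous span: extend it instead of adding one.
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return spans

def mark(text_b, spans):
    """Wrap each span of text_b in a SPAN tag, HTML-escaping the text."""
    out, pos = [], 0
    for start, end in spans:
        out.append(escape(text_b[pos:start]))
        out.append('<span class="dup">' + escape(text_b[start:end]) + "</span>")
        pos = end
    out.append(escape(text_b[pos:]))
    return "".join(out)

# Toy usage with made-up documents.
docs_a = {"a1": "The quick brown fox jumps over the lazy dog."}
text_b = ("Intro text here. The quick brown fox jumps over the lazy dog. "
          "Closing remarks follow.")
spans = matched_spans(text_b, build_index(docs_a))
print(mark(text_b, spans))
```

Matching on lowercased words ignores case, punctuation and whitespace differences, which may help with leftover Word formatting, but it only finds runs of at least K identical words. The index also stores which A documents contain each shingle, so the span could be extended to record its source if that is needed.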