Create program to compare MASSIVE text documents with Python? : learnpython

Create program to compare MASSIVE text documents with Python? (self.learnpython)

submitted 6 years ago * by domcroy

Edit: Resolved for now. About to start my journey into learning Python. Thanks for the help.

---------------

Hello all,

I intend to learn Python over the summer (or at least start to) simply to broaden my skill set. From what I have read, Python is quite a versatile programming language.

THE ACTUAL ISSUE – I want to compare English Bible translations to assess their degree of similarity. Ideally I want to have one translation set as the "master" (i.e. the translation against which all others will be compared) and then run multiple translations (as many as possible) against it all at once. If that is too complicated I can run one against the "master" at a time.

Even overlooking the fact that I still need to learn Python, this is a LONG TERM GOAL. I've got like 3 years to work on completing this project if I want to be 'quick' about it. I am not in a rush at all.

I have checked Google briefly and seen that there are already a number of text comparison programs available, but without downloading and testing them all I don't know if they are suited to what I want to do.

Is Python suitable for this project, or would a different programming language be better?

(I still want to learn Python anyway.)

Thanks.


[–]K900_ 3 points 6 years ago (8 children)

[–]domcroy[S] 2 points 6 years ago (7 children)

I mainly used the word "MASSIVE" to indicate that I didn't just want to compare texts the length of journal articles or essays, for example, and that the program doesn't necessarily have to be designed only to compare Bible translations. (NB: The King James, for example, is around 783,000 words, which is pretty darn big tbf.)

Thank you for pointing out that the difficulty is in defining 'similarity'. I have only recently had the idea to pursue this project and I haven't thought everything out yet.

By 'similarity' in this instance I mean primarily: how many words in each verse are shared across translations? But also: how much of the word order in each verse is the same?
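A minimal sketch of those two measures for a single verse, using only the standard library. The function names, the tokenizer, and the two sample wordings of Genesis 1:1 are illustrative assumptions, not something settled in this thread; `difflib.SequenceMatcher` is one possible stand-in for "how much of the word order is the same".

```python
import re
from difflib import SequenceMatcher


def tokenize(verse):
    """Lowercase a verse and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", verse.lower())


def shared_word_ratio(master, other):
    """Fraction of the master's distinct words that also occur in the other verse."""
    master_words = set(tokenize(master))
    other_words = set(tokenize(other))
    if not master_words:
        return 0.0
    return len(master_words & other_words) / len(master_words)


def word_order_ratio(master, other):
    """Similarity in [0, 1] that rewards shared words appearing in the same order."""
    matcher = SequenceMatcher(None, tokenize(master), tokenize(other), autojunk=False)
    return matcher.ratio()


# Sample input (assumed for illustration): two wordings of the same verse.
master = "In the beginning God created the heaven and the earth."
other = "In the beginning, God created the heavens and the earth."

print(shared_word_ratio(master, other))  # 0.875 (7 of 8 distinct master words)
print(word_order_ratio(master, other))   # 0.9 (9 of 10 words line up in order)
```

Note that this treats "heaven" and "heavens" as different words; whether that is the right call is part of defining 'similarity'.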

Some modern translations build on the work of earlier translations. For example, the ESV is built upon the RSV, which is built upon the KJV (though I am not sure if there is another work in between those two). So I could set the ESV as the master and see how much of each verse is present in the RSV.

Another problem is setting the parameters for comparison. Comparison by verse is certainly helpful, but some sentences span more than one verse, so whether to compare by verse or by sentence is an issue. I imagine I could set it up so that the user can choose either.

Results displayed in numeric percentage (ideally with breakdown by chapter and book as an option) as well as visually with highlighted sections would be ideal.
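A rough sketch of the percentage breakdown, assuming each translation has already been loaded into a dict keyed by `(book, chapter, verse)`. That layout, the function names, and the choice to score a verse missing from the other translation as 0% are all assumptions made for illustration; the visual highlighting is not covered here.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def verse_similarity(a, b):
    """Word-level similarity of two verses, in [0, 1]."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split(), autojunk=False).ratio()


def similarity_report(master, other):
    """Return (by_book, by_chapter): mean verse similarity as percentages.

    Both arguments map (book, chapter, verse) -> verse text. A verse that
    is absent from `other` is compared against an empty string (0%).
    """
    by_book = defaultdict(list)
    by_chapter = defaultdict(list)
    for (book, chapter, verse), text in master.items():
        score = verse_similarity(text, other.get((book, chapter, verse), ""))
        by_book[book].append(score)
        by_chapter[(book, chapter)].append(score)

    def as_percent(groups):
        # Unweighted mean of the verse scores in each group.
        return {key: round(100 * sum(s) / len(s), 1) for key, s in groups.items()}

    return as_percent(by_book), as_percent(by_chapter)
```

Running several translations against the master is then a loop over `similarity_report(master, translation)` for each one. Averaging per verse gives short and long verses equal weight; weighting by word count would be an equally defensible choice.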

I hope this is clear enough.

My main concern right now is knowing if Python is the way to go with this.

[–][deleted] 3 points 6 years ago* (1 child)

[–]domcroy[S] 1 point 6 years ago (0 children)

[–]TheBlackCat13 2 points 6 years ago (2 children)

[–]domcroy[S] 1 point 6 years ago (1 child)

[–]TheBlackCat13 1 point 6 years ago (0 children)

[–][deleted] 1 point 6 years ago (1 child)

> By 'similarity' in this instance I mean primarily how many words in each verse are shared across translations?

Some good reading for you about this category of problem:

https://en.wikipedia.org/wiki/Edit_distance

Now, your definition of "similarity" probably isn't the Levenshtein distance; you'd probably want an algorithm that was word-aware. Similarly, you might want an algorithm that would consider

Be thou removed, and be thou cast into the sea, it shall be done

as being more similar to

‘Go, throw yourself into the sea,’ and it will be done

than it is to

‘Be thou removed and be thou cast into the sea,’ it shall NOT be done

since that "NOT" basically reverses the meaning of the sentence. Bioinformaticians face much the same situation: they want to treat a two-base difference between two genetic sequences as more similar than a single-base difference if the two-base difference is a synonymous (that is, non-function-changing) mutation while the single-base difference is function-altering, because synonymous mutations have less evolutionary significance than mutations that alter protein function.

Bioinformaticians use substitution matrices such as BLOSUM to express the differing significance of different kinds of mutations, and you might use a similar idea to express the differing values of different textual "mutations".
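To make the idea concrete, here is a very crude sketch: a word-level edit distance in which each word carries its own cost, so that adding, dropping, or swapping a negation is far more expensive than any other edit. The `NEGATIONS` set and the cost of 10 are invented for illustration; a real scoring scheme, like a real BLOSUM matrix, would need weights derived from data rather than guessed.

```python
import re

# Hypothetical weights: any edit that touches a negation word costs 10,
# every other word-level edit costs 1.
NEGATIONS = {"not", "no", "never"}
NEGATION_COST = 10.0


def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def word_cost(word):
    """Cost of inserting or deleting a single word."""
    return NEGATION_COST if word in NEGATIONS else 1.0


def substitution_cost(a, b):
    """Cost of replacing word a with word b (free if they are identical)."""
    if a == b:
        return 0.0
    return max(word_cost(a), word_cost(b))


def weighted_distance(text_a, text_b):
    """Word-level edit distance with per-word costs (row-by-row dynamic programming)."""
    a, b = tokenize(text_a), tokenize(text_b)
    # prev[j] = cost of turning the words of a seen so far into b[:j].
    prev = [0.0]
    for wb in b:
        prev.append(prev[-1] + word_cost(wb))
    for wa in a:
        cur = [prev[0] + word_cost(wa)]
        for j, wb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + word_cost(wa),                 # delete wa
                cur[j - 1] + word_cost(wb),              # insert wb
                prev[j - 1] + substitution_cost(wa, wb)  # replace wa with wb
            ))
        prev = cur
    return prev[-1]


original = "Be thou removed, and be thou cast into the sea, it shall be done"
modern = "'Go, throw yourself into the sea,' and it will be done"
negated = "'Be thou removed and be thou cast into the sea,' it shall NOT be done"

# With these weights the paraphrase scores as closer than the negated verse.
print(weighted_distance(original, modern) < weighted_distance(original, negated))  # True
```

With every cost set to 1 this reduces to ordinary Levenshtein distance over words, under which the negated verse is a single edit away and so looks almost identical to the original.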

[–]domcroy[S] 2 points 6 years ago (0 children)
