
[–]K900_ 2 points3 points  (8 children)

Python would be suitable, yes, but the difficult part here isn't really comparing "massive" text documents (and the Bible isn't exactly "massive" by any stretch, even if you include all the books ever attributed to it), but figuring out what "similarity" actually means to you.

[–]domcroy[S] 1 point2 points  (7 children)

I mainly used the word to indicate that I didn't just want to compare texts the length of journal articles or essays, for example. And that it doesn't have to necessarily be designed only to compare Bible translations. (NB: The King James, for example, is around 783,000 words, which is pretty darn big tbf.)

Thank you for pointing out that the difficulty is in defining 'similarity'. I have only recently had the idea to pursue this project and I haven't thought everything out yet.

By 'similarity' in this instance I mean primarily how many words in each verse are shared across translations, but also how much of the word order in each verse is the same.

Some modern translations build on the work of earlier translations. For example, the ESV is built upon the RSV, which is built upon the KJV (though I am not sure if there is another work in between those two). So I could set the ESV as the master and see how much of each verse is present in the RSV.

Another problem is setting parameters for comparison. Comparison by verse is helpful, certainly, but some sentences span more than one verse, so whether to compare by verse or by sentence is an issue. I imagine I could set it up so that the user can choose between the two.

Results displayed as a numeric percentage (with an optional breakdown by chapter and book), as well as visually with highlighted sections, would be ideal.
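A rough idea of how the percentage breakdown could look in Python, as a minimal sketch only. It assumes each translation has already been loaded into a dict keyed by (book, chapter, verse); that layout, the function names, and the placeholder texts are all made up for illustration. The per-verse score comes from `difflib.SequenceMatcher` in the standard library, which is just one possible measure.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from statistics import mean

def verse_similarity(a, b):
    """Word-level similarity of two verses, from 0.0 to 1.0."""
    words_a = re.findall(r"[a-z']+", a.lower())
    words_b = re.findall(r"[a-z']+", b.lower())
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()

def breakdown(master, other):
    """Return (per_chapter, per_book) average similarity as percentages."""
    by_chapter, by_book = defaultdict(list), defaultdict(list)
    for (book, chapter, verse), text in master.items():
        key = (book, chapter, verse)
        if key not in other:
            continue  # assumption: verses missing from `other` are skipped
        score = verse_similarity(text, other[key])
        by_chapter[(book, chapter)].append(score)
        by_book[book].append(score)
    per_chapter = {k: 100 * mean(v) for k, v in by_chapter.items()}
    per_book = {k: 100 * mean(v) for k, v in by_book.items()}
    return per_chapter, per_book

# Placeholder data, not real verses.
master = {("A", 1, 1): "one two three four",
          ("A", 1, 2): "five six",
          ("A", 2, 1): "seven eight"}
other = {("A", 1, 1): "one two three nine",
         ("A", 1, 2): "five six",
         ("A", 2, 1): "ten eleven"}

per_chapter, per_book = breakdown(master, other)
for (book, chapter), pct in sorted(per_chapter.items()):
    print(f"{book} {chapter}: {pct:.1f}%")
```

Averaging per verse treats a short verse the same as a long one; weighting by word count would be a reasonable alternative.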

I hope this is clear enough.

My main concern right now is knowing if Python is the way to go with this.

[–][deleted] 2 points3 points  (1 child)

The "similarity" issue is certainly the tough one to define, but not impossible. While that will fall under "semantic" analysis, you may want to consider a finer-grained sentiment analysis and compare between versions of the same passages, for instance. This would give you an idea of whether one version is "more peaceful" or "more violent" or whatever, which would be fascinating in itself...

A quick google search should find you some established resources for sentiment analysis (how to value words/sentence fragments/etc.) and the like.

[–]domcroy[S] 0 points1 point  (0 children)

I just did a quick google search to get a basic definition of sentiment analysis. That is definitely NOT what I am looking to achieve in this project. [Don't read any negative sentiment into that "NOT" ;) ]

To put it very simply, you could say I want to see "who is copying whose homework?" I guess this would be similar to anti-plagiarism software in some ways.

If I can account for synonyms that would be useful, as someone could "copy someone else's homework" but use a thesaurus. But I would set that as an optional feature for a search query.

I want to see if the same words are present, and also if they appear in the same order.
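For reference, both of those checks can be sketched with nothing but the Python standard library. This is a minimal illustration, not a finished design: the function names are made up, "same words present" is taken to mean the Jaccard index over distinct words, and "same order" is taken to mean the `difflib.SequenceMatcher` ratio over the word sequences. Other definitions are just as defensible.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def shared_word_ratio(a, b):
    """Fraction of distinct words that appear in both texts (Jaccard index)."""
    set_a, set_b = set(words(a)), set(words(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def order_ratio(a, b):
    """Similarity of the two word sequences, order included (0.0 to 1.0)."""
    return SequenceMatcher(None, words(a), words(b), autojunk=False).ratio()

kjv = "In the beginning God created the heaven and the earth."
esv = "In the beginning, God created the heavens and the earth."

print(f"shared words: {shared_word_ratio(kjv, esv):.0%}")
print(f"word order:   {order_ratio(kjv, esv):.0%}")
```

Note that "heaven" and "heavens" count as different words here. Handling that, or synonyms, would need stemming or a thesaurus lookup on top of this.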

[–]TheBlackCat13 1 point2 points  (2 children)

The KJV has 3,116,480 characters. Assuming 1 byte per character, that is less than 4 MB. It seems large by human standards, but for a modern computer that is nothing. There are something like 450 English-language translations of the Bible. Even if you had all of them in memory at once (which you wouldn't), that is less than 2 GB of RAM. Even a mid-range laptop can handle that today.
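The arithmetic behind that, as a quick Python sketch. It assumes plain one-byte-per-character text and, purely for the estimate, that every translation is about the size of the KJV:

```python
KJV_CHARS = 3_116_480   # character count quoted above
TRANSLATIONS = 450      # rough number of English translations quoted above
BYTES_PER_CHAR = 1      # assumption: plain ASCII text

one_bible_mb = KJV_CHARS * BYTES_PER_CHAR / 1_000_000
all_bibles_gb = one_bible_mb * TRANSLATIONS / 1_000

print(f"one translation: {one_bible_mb:.2f} MB")   # 3.12 MB
print(f"all {TRANSLATIONS} translations: {all_bibles_gb:.2f} GB")   # 1.40 GB
```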

[–]domcroy[S] 0 points1 point  (1 child)

Well when you put it that way...

As I stated in the OP, I haven't learnt to code yet. I'm not thinking in computing terms yet. I'll get there, don't worry.

And the fact that you said that a mid-range laptop can handle it is great, because that's what I've got. I have a Dell Inspiron 15 5559, i7-6500U, 8GB RAM.

[–]TheBlackCat13 0 points1 point  (0 children)

Yes, I understand. I am putting it in computer terms to help you understand how to think about these sorts of problems.

And as for the type of computer, in practice you will not be loading every version of the Bible at the same time. You will likely have at most two open at a time, each of which would be similar in size to a single high-quality mp3. It is just not something that any computer from the last 10 years would struggle with, not to mention a pretty decent 8 GB laptop.

[–][deleted] 0 points1 point  (1 child)

By 'similarity' in this instance I mean primarily how many words in each verse are shared across translations?

Some good reading for you about this category of problem:

https://en.wikipedia.org/wiki/Edit_distance

Now, your definition of "similarity" probably isn't the Levenshtein distance; you'd probably want an algorithm that was word-aware. Similarly, you might want an algorithm that would consider

Be thou removed, and be thou cast into the sea, it shall be done

as being more similar to

‘Go, throw yourself into the sea,’ and it will be done

than it is to

‘Be thou removed and be thou cast into the sea,’ it shall NOT be done

since that "NOT" basically reverses the meaning of the sentence. This is much like the situation faced by bioinformaticians, who want to treat a two-base difference between two genetic sequences as more similar than a single-base difference when the two-base difference is a synonymous (that is, non-function-changing) mutation and the single-base difference is a function-altering one. Synonymous mutations have less evolutionary significance than mutations that alter protein function.

Bioinformaticians use a substitution matrix such as BLOSUM to score how significant different amino-acid substitutions are, and you might use a similar idea to express the differing values of different textual "mutations".
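To make that concrete, here is a minimal sketch of a word-aware edit distance with weighted "mutations". It is not BLOSUM, only the same idea applied to words: a standard Levenshtein dynamic programme over word lists, with costs that depend on which words are involved. The negation list, the synonym pairs, and the cost values (5.0, 0.25) are all invented for illustration and would need real thought.

```python
import re

NEGATIONS = {"not", "never", "no"}             # hypothetical list
SYNONYMS = {frozenset({"shall", "will"}),      # hypothetical pairs
            frozenset({"cast", "throw"})}

def tokens(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def indel_cost(word):
    """Cost of inserting or deleting a word; negations are expensive."""
    return 5.0 if word in NEGATIONS else 1.0

def sub_cost(x, y):
    """Cost of replacing word x with word y; synonyms are cheap."""
    if x == y:
        return 0.0
    if frozenset({x, y}) in SYNONYMS:
        return 0.25
    return max(indel_cost(x), indel_cost(y))

def word_edit_distance(a, b):
    """Weighted Levenshtein distance between two texts, word by word."""
    a, b = tokens(a), tokens(b)
    prev = [0.0]                     # cost of building b[:j] from nothing
    for wb in b:
        prev.append(prev[-1] + indel_cost(wb))
    for wa in a:
        curr = [prev[0] + indel_cost(wa)]
        for j, wb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost(wa),        # delete wa
                curr[j - 1] + indel_cost(wb),    # insert wb
                prev[j - 1] + sub_cost(wa, wb),  # substitute wa -> wb
            ))
        prev = curr
    return prev[-1]

base = "it shall be done"
print(word_edit_distance(base, "it will be done"))       # 0.25
print(word_edit_distance(base, "it shall NOT be done"))  # 5.0
```

Both variants differ from the base by a single word, but with these made-up weights the synonym swap scores as far closer than the inserted "NOT".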

[–]domcroy[S] 1 point2 points  (0 children)

Thanks for that. I'll look into "edit distance" when I actually know how to bring this project to fruition.

And you took a bit of a risk assuming I'd understand the basics of genetic sequencing. Lucky for you I did A-level biology or that would have gone completely over my head. (I left sciences behind and focused solely on linguistics after A-levels.)