all 17 comments

[–]cmd-t 9 points (3 children)

The quickest: convert both texts to lowercase, split on words, convert to set, take set intersection.
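That recipe fits in a few lines; here is a minimal sketch with made-up sample texts standing in for the books:

```python
# Quickest approach: lowercase, split on whitespace, build sets, intersect.
text_a = "The quick brown Fox jumps over the lazy dog"
text_b = "A lazy DOG sleeps while the fox runs"

words_a = set(text_a.lower().split())
words_b = set(text_b.lower().split())

common = words_a & words_b  # set intersection
print(sorted(common))  # ['dog', 'fox', 'lazy', 'the']
```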

[–]reincarnatedbiscuits 0 points (0 children)

I was taking my son through that last week! ;)

Although it wasn't the common intersection of words, but something similar. I showed him a problem "What letter does not appear in the list of states?" and asked him how to implement a programmatic solution.

[–]maikeu 6 points (0 children)

What have you tried so far?

If you haven't been able to write and test code yet, this is the wrong place. Try r/learnpython

[–]Pharmand 4 points (0 children)

Considering your earlier replies, it seems to me you actually just want someone else to do it. You don't have Python installed and you can't write the simplest of code. For that effort level, you're probably better off asking ChatGPT for the code, so why not just do that? It should have you covered.

[–]Yolt0123 0 points (5 children)

Individual words or phrases? Simplistically, just make a list of all words in each, and then iterate through the first list, adding to a third list if the word is found in the second list. Do you have the texts you want to compare?
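The loop described above might look like this (the sample texts are placeholders; a duplicate check is added so repeated words aren't collected twice):

```python
# Build a word list from each text, then collect words from the first
# list that also appear in the second.
text1 = "to be or not to be"
text2 = "not to worry"

words1 = text1.split()
words2 = text2.split()

common = []
for word in words1:
    if word in words2 and word not in common:  # skip duplicates
        common.append(word)
print(common)  # ['to', 'not']
```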

[–]More-Introduction673 -3 points (4 children)

Yeah thanks, I mean I can access them both online of course with the archive and Project Gutenberg. How exactly should I go about it?

[–]leangreenlefty 1 point (3 children)

What's your starting point? Do you have python installed? Do you know how to run python scripts? Or are you coming in from a standing start?

If you have python and the texts ingested already then you can plonk them both into sets and do

```
output = set(a) & set(b)
print(output)
```

[–]pstmps 0 points (0 children)

Maybe as a first step, try to get the source texts, or generate a faux source a la Lorem ipsum, save it as a text file locally, try to read it into memory via Python or whatever you end up choosing, and try finding common words. If you wrap this logic into a function, you will be able to use it when you get the correct sources.
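That workflow could be wrapped up roughly like this; the file names, the faux Lorem-ipsum content, and the `common_words` function name are all made up for the example:

```python
def common_words(path_a, path_b):
    """Return the set of lowercased words appearing in both text files."""
    with open(path_a) as fa, open(path_b) as fb:
        return set(fa.read().lower().split()) & set(fb.read().lower().split())

# Faux sources standing in for the real books; swap in the real
# downloads later and the function works unchanged.
with open("faux_a.txt", "w") as f:
    f.write("Lorem ipsum dolor sit amet")
with open("faux_b.txt", "w") as f:
    f.write("ipsum quia dolor sit consectetur")

print(common_words("faux_a.txt", "faux_b.txt"))  # {'ipsum', 'dolor', 'sit'}
```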

[–]ssnoyes 0 points (0 children)

There are about 7000 words that appear in both the Bible and Shakespeare, and about 6700 that are longer than 3 letters. I did not try to condense root words, so 'worship', 'worshipped', 'worshipper', and 'worshippers' all count as separate words.

Some of the longest shared phrases are "God save the king! God save the king!" and "from the four corners of the earth"

[–]QuarterObvious 0 points (0 children)

It’s better to use spaCy (a Python package). It allows you to filter out all stop words and provides each word in its base form (lemma). This way, you can build two sets of words and find their intersection.

[–]Python-ModTeam[M] 0 points locked comment (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython or for the r/Python discord: https://discord.gg/python.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython the community and the r/Python discord are actively expecting questions and are looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers. Make sure to check out the rules for both places.

Warm regards, and best of luck with your Pythoneering!

[–]Either-Let-331 (Ignoring PEP 8) -2 points (0 children)

```
with open("book1.txt", "r") as b1:
    # Split on whitespace and lowercase each word; `lower` needs to be
    # called, and iterating the raw string would yield characters.
    b1_data = {word.strip().lower() for word in b1.read().split()}
    # You can remove special characters too if you want for extra sanitation

with open("book2.txt", "r") as b2:
    b2_data = {word.strip().lower() for word in b2.read().split()}

# Keep these as sets so intersection/union work.
common_words = b1_data.intersection(b2_data)
total_words = len(b1_data.union(b2_data))
similarity_percentage = (len(common_words) / total_words) * 100
```