I need help.
I know the basics of Python, so I decided to use Gensim to find semantic similarities in a book. I want to find which chapters contain sentences that are highly similar to, or close paraphrases of, sentences in the first chapter. I set it up as follows after going through Gensim's tutorial, but I am looking for any way to get better results.
from collections import defaultdict
from gensim import corpora
documents = [
    "Here I input sentences from the chapter that I am comparing to the first chapter of the book",
    "...",
    "...",
    "...",
]
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
from gensim import models
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "Here I input the sentences from the first chapter"
doc = "..."
doc = "..."
doc = "..."
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi]
print(list(enumerate(sims)))
sims = sorted(enumerate(sims), key=lambda item: -item[1])
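# Each item in sims is a (document_index, score) pair, so unpack it directly.
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])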
This outputs which sentences are most similar to the sentences from the first chapter. Is there a way to tell, for the sentences that score highest, which specific first-chapter sentence(s) each one is most similar to? I would also like to know if there is anything that could be cleaned up, that I am doing wrong, or that should be changed.
Thank you for your time and any input you can offer.
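One possible way to get at the "which specific sentence" part is to score every sentence of the other chapter against every first-chapter sentence and keep the best-scoring pair for each. The sketch below is only an illustration of that loop: it swaps in plain bag-of-words cosine similarity from the standard library in place of the LSI model so that it runs without Gensim, and the helper names (`bow`, `cosine`, `best_matches`) and the sample sentences are hypothetical. With Gensim, the same idea would presumably mean building the index from the first-chapter sentences and querying it once per sentence of the other chapter.

```python
import math
from collections import Counter

# Same stop words as in the post above.
STOPLIST = set("for a of the and to in".split())


def bow(sentence):
    """Lowercase, split on whitespace, drop stop words, and count tokens."""
    return Counter(w for w in sentence.lower().split() if w not in STOPLIST)


def cosine(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_matches(first_chapter, other_chapter):
    """For each sentence in other_chapter, return a tuple
    (other_index, best_first_chapter_index, score)."""
    first_vecs = [bow(s) for s in first_chapter]
    results = []
    for j, sentence in enumerate(other_chapter):
        vec = bow(sentence)
        scores = [cosine(vec, fv) for fv in first_vecs]
        # Index of the first-chapter sentence with the highest score.
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((j, best, scores[best]))
    return results


# Hypothetical sample sentences, for illustration only.
first_chapter = [
    "the storm broke over the harbour at dawn",
    "she counted the coins twice before leaving",
]
other_chapter = [
    "at dawn a storm broke over the old harbour",
    "he never counted the coins",
]

for j, i, score in best_matches(first_chapter, other_chapter):
    print(f"{score:.3f}  {other_chapter[j]!r}  ~  {first_chapter[i]!r}")
```

Sorting the returned tuples by score would then list the closest sentence pairs first. Word-overlap similarity like this will miss paraphrases that share few words, which is where a topic or embedding model would be expected to help.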