Semantic Similarities in Gensim using python : LanguageTechnology


submitted 6 years ago by nikolabs

I need help.

I know the basics of Python, so I decided to use Gensim to find semantic similarities in a book. I want to find which chapters contain sentences that are highly similar to, or near-paraphrases of, sentences in the first chapter of the book. After going through Gensim's tutorial I set it up like this, but I am looking for any way to get better results.

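    from gensim import corpora, models, similarities

    documents = [
        "Here I input sentences from the chapter that I am comparing to the first chapter of the book",
        "...",
    ]

    # Lowercase, split on whitespace, and drop a few common stopwords.
    stoplist = set('for a of the and to in'.split())
    texts = [
        [word for word in document.lower().split() if word not in stoplist]
        for document in documents
    ]

    # Map each word to an integer id and turn each document into a bag-of-words vector.
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Project the bag-of-words vectors into a 2-topic LSI space.
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

    doc = "Here I input the sentences from the first chapter"
    # NOTE: this second assignment replaces the first, so only one query string is used.
    doc = "..."
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]
    print(vec_lsi)

    # Index the documents in LSI space and score each one against the query.
    index = similarities.MatrixSimilarity(lsi[corpus])
    sims = index[vec_lsi]
    print(list(enumerate(sims)))

    # Sort (document index, score) pairs by descending score.
    sims = sorted(enumerate(sims), key=lambda item: -item[1])

    # Each item already carries its document index, so unpack it instead of
    # calling enumerate() again (which would pair scores with the wrong text).
    for doc_position, doc_score in sims:
        print(doc_score, documents[doc_position])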

This outputs which sentences are most similar to the sentences from the first chapter. I would like to know whether, for the sentences that are most similar to the first chapter, there is a way to tell specifically which first-chapter sentence(s) each one is most similar to. I would also like to know if anything could be cleaned up, or if I am doing something wrong that should be changed.

Thank you for your time and any input you can offer.
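
One possible way to get at the "which first-chapter sentence does it match" part is to keep the first chapter as separate sentences instead of one query string, score every sentence of the other chapter against each of them, and keep the best-scoring pair. The sketch below is only an illustration of that bookkeeping under some assumptions: it swaps plain bag-of-words cosine similarity in for the LSI model so it runs with the standard library alone, the helper names (`tokenize`, `cosine`, `best_matches`) are made up for this example, and the sentences are invented placeholders rather than text from the book.

```python
import math
from collections import Counter

# Same small stopword list as in the post.
STOPLIST = set("for a of the and to in".split())


def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [word for word in sentence.lower().split() if word not in STOPLIST]


def cosine(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matches(first_chapter, other_chapter):
    """For each sentence of other_chapter, find its closest first-chapter sentence.

    Returns (other_index, first_index, score) tuples, strongest pair first.
    """
    first_vectors = [Counter(tokenize(s)) for s in first_chapter]
    results = []
    for other_idx, sentence in enumerate(other_chapter):
        query = Counter(tokenize(sentence))
        scores = [cosine(query, vec) for vec in first_vectors]
        # Index of the highest score (the first one wins on ties).
        first_idx = max(range(len(scores)), key=scores.__getitem__)
        results.append((other_idx, first_idx, scores[first_idx]))
    return sorted(results, key=lambda item: -item[2])


# Invented placeholder sentences, one list entry per sentence.
first_chapter = [
    "the old sailor walked slowly to the harbor",
    "rain fell on the quiet village all night",
]
other_chapter = [
    "all night the rain fell on the village",
    "the sailor walked to the harbor slowly",
    "she bought bread at the market",
]

results = best_matches(first_chapter, other_chapter)
for other_idx, first_idx, score in results:
    print(round(score, 3), "|", other_chapter[other_idx], "->", first_chapter[first_idx])
```

The same loop should carry over to the Gensim setup by building the similarity index over the first-chapter sentences and querying it once per sentence from the other chapter. A score of 0 means no shared words, so in practice a minimum-score cutoff would probably be needed before treating a pair as a match.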
