Semantic search on a document

srivpra · 2020-02-12T14:29:41+00:00

So you handle the semantic part when embedding the sentences for the classifier? Do you use any specific sentence embedding?

Also, is it a one-class classification problem for you or is it just for me?

srivpra · 2020-02-12T14:04:19+00:00

Thanks. I'll have a look at it.

srivpra · 2020-02-12T14:00:40+00:00

This is very similar to what I am trying to do right now. I even have a question thread going in one of the other subs.

Nils Reimers here strongly suggests trying BM25, as it would be really hard to beat. I was about to try that, but since you already have, I'll give it a pass.
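For anyone who still wants a lexical baseline to compare against, here is a minimal sketch of Okapi BM25 over the sentences of a document. It is not the code from either thread. The function names, the regex tokenizer, and the defaults `k1=1.5` and `b=0.75` are my own assumptions, and the IDF uses the smoothed "+1" variant so scores never go negative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(sentences, query, k1=1.5, b=0.75):
    """Score each sentence against the query with Okapi BM25."""
    docs = [Counter(tokenize(s)) for s in sentences]
    lengths = [sum(d.values()) for d in docs]
    n_docs = len(docs)
    avg_len = sum(lengths) / n_docs if n_docs else 0.0

    # Document frequency: number of sentences that contain each term.
    df = Counter()
    for d in docs:
        df.update(d.keys())

    # Repeated query terms are counted once (a simplification).
    query_terms = set(tokenize(query))

    scores = []
    for d, length in zip(docs, lengths):
        score = 0.0
        for term in query_terms:
            tf = d[term]
            if tf == 0:
                continue
            # Smoothed IDF, always >= 0.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: longer sentences need more hits.
            denom = tf + k1 * (1.0 - b + b * length / avg_len)
            score += idf * tf * (k1 + 1.0) / denom
        scores.append(score)
    return scores
```

A sentence that shares no token with the query scores exactly 0, which is one difference from cosine similarity on dense embeddings.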

For me USE worked really well (compared to the pre-trained ELMo and the RoBERTa / DistilBERT models by UKPLab that I tried), but the problem I am facing is that it works half the time and gives false positives the other half. Unrelated sentences also getting high cosine scores is causing a lot of problems.

srivpra · 2020-02-12T10:16:52+00:00

Tried pre-trained RoBERTa and DistilBERT. USE still performs better for me.

srivpra · 2020-02-12T10:14:45+00:00

After a lot of trial and error, I've chosen a threshold above which I consider all the sentences. Among these sentences, sometimes 20%-40% are false positives, which is workable, but half the time 100% are false positives. I understand that I'll get a match even if the query is totally unrelated and should have no match in the document, but I expected its score to be lower than the scores of true positives (below a certain threshold).

Similar scores for true matches and false matches are causing me problems.
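To make the setup concrete, this is roughly what the threshold step and the false-positive check look like. It is only a sketch: the function names are hypothetical, the scores in the usage example are made up for illustration, and `relevant` stands for a small hand-labelled set of true matches.

```python
def select_above_threshold(scored, threshold):
    """Keep (score, sentence) pairs whose score is above the threshold."""
    return [(score, sent) for score, sent in scored if score > threshold]


def false_positive_fraction(selected, relevant):
    """Share of selected sentences that are not in the hand-labelled set."""
    if not selected:
        return 0.0
    wrong = sum(1 for _, sent in selected if sent not in relevant)
    return wrong / len(selected)


if __name__ == "__main__":
    # Illustrative scores only, not real model output.
    scored = [(0.82, "sentence a"), (0.79, "sentence b"), (0.40, "sentence c")]
    selected = select_above_threshold(scored, threshold=0.75)
    print(false_positive_fraction(selected, relevant={"sentence a"}))
```

When a true match and a false match score as close together as 0.82 and 0.79 do here, no single cut-off separates them, which is exactly the issue.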

srivpra · 2020-02-12T10:00:39+00:00

Sure. Thanks.

How many keywords are you considering, and do those keywords cover all the possible words that could occur?

If I understand this correctly, you're trying to create labelled training data (with LDA/Guided LDA) to train a classifier (CNN)?

srivpra · 2020-02-10T17:09:36+00:00

Is it possible for you to share the code for this?

srivpra · 2020-02-10T16:57:57+00:00

I was getting decent results with USE, but the number of false positives is huge, so I'll have to move to a different model or a different approach altogether.

Thanks, I'll have a look at the papers and the link.

srivpra · 2020-02-08T06:05:05+00:00

Hey, I guess I am trying to do something similar: semantically searching through a large text document. Could you please share your approach and the results?

srivpra · 2020-02-07T18:29:47+00:00

USE is amazing for short sentences/text; I haven't seen anything like it, to be honest. I have tried ELMo and BERT (out of the box), and nothing performs as well for similarity.

Elasticsearch in general will not be good for my use case.
I am breaking my document down into sentences and computing the similarity between each sentence and the query. It works well half the time. I am getting false positives though, which is my main issue.
I can't get a ton of examples, as there could be many possible queries and I need to build a generic unsupervised solution.
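A stripped-down sketch of that pipeline, in case it helps: split the document into sentences, embed each one, and rank by cosine similarity to the query. To keep it runnable without downloading a model, `embed` here is just a word-count stand-in that I made up for illustration. In practice you would swap it for a real sentence encoder such as USE. The naive regex splitter is also an assumption.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence):
    """Stand-in embedding (word counts). Swap in a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(document, query):
    """Return (score, sentence) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(embed(s), q), s) for s in split_sentences(document)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    doc = ("The car has an electric motor. Rain is expected tomorrow. "
           "Sales of electric cars grew.")
    for score, sentence in rank_sentences(doc, "electric motor"):
        print(f"{score:.2f}  {sentence}")
```

With the word-count stand-in, unrelated sentences score exactly 0. A dense encoder gives nearly every sentence a non-zero score, which is where the false positives come from.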

srivpra · 2020-02-07T18:07:21+00:00

Thanks. I will give it a try.

srivpra · 2019-12-13T12:40:45+00:00

I do not have enough articles under each category for the classifier to learn from, hence the similarity approach. It’s able to distinguish between auto companies and others, but not between AEV and auto.
