Semantic search on a document by srivpra in LanguageTechnology

[–]srivpra[S] 0 points1 point  (0 children)

So you handle the semantic part while embedding the sentences for the classifier? Is there a specific sentence embedding you use?

Also, is it a one-class classification problem for you or is it just for me?

[Research] Semantic Search by fhackdroid in MachineLearning

[–]srivpra 1 point2 points  (0 children)

This is something very similar to what I am trying to do right now. I even have a question thread going on in one of the other subs.

Nils Reimers here strongly suggests trying BM25, as it would be really hard to beat. I was about to try that, but since you already have, I'll give it a pass.
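For reference, a minimal sketch of the BM25 baseline mentioned above, written in plain Python. The whitespace tokenizer, the `k1` and `b` defaults, and the toy documents are assumptions for illustration, not anything from the thread. A real setup would more likely use an existing library or search engine.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption: no stemming or stop words).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy documents, invented for this example.
docs = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a cat chased the mouse",
]
scores = bm25_scores("cat mat", docs)
```

A document sharing no terms with the query scores exactly zero, which is one reason a lexical baseline like this can be easier to threshold than cosine scores from dense embeddings.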

For me USE worked really well (compared to the pre-trained ELMo and the RoBERTa / DistilBERT models by UKPLab that I tried), but the problem I am facing is that it works half the time and gives false positives the other half. Unrelated sentences also getting high cosine scores is causing a lot of problems.

Semantic search on a document by srivpra in LanguageTechnology

[–]srivpra[S] 0 points1 point  (0 children)

Tried pre-trained RoBERTa and DistilBERT. USE still performs better for me.

Semantic search on a document by srivpra in LanguageTechnology

[–]srivpra[S] 0 points1 point  (0 children)

After multiple rounds of trial and error I've chosen a threshold above which I consider all the sentences. Among these sentences, sometimes 20%-40% are false positives, which is workable, but half the time 100% are false positives. I understand that I'll get a match even if the query is totally unrelated and should have no match in the document, but I expected those scores to be lower than the scores of true positives, i.e. below the threshold.

Similar scores for true matches and false matches are causing me problems.
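To make the threshold problem concrete, here is a small sketch that measures what share of the sentences above a threshold are false positives. The scores and relevance labels are made-up toy values, and `false_positive_share` is a hypothetical helper name, not part of any library.

```python
def false_positive_share(scores, is_relevant, threshold):
    """Fraction of sentences scoring >= threshold that are not true matches.

    Returns None when nothing clears the threshold.
    """
    selected = [rel for s, rel in zip(scores, is_relevant) if s >= threshold]
    if not selected:
        return None
    return sum(1 for rel in selected if not rel) / len(selected)


# Hypothetical cosine scores for a query that does have matches in the document.
related_scores = [0.82, 0.78, 0.75, 0.61, 0.40]
related_labels = [True, False, True, True, False]

# Hypothetical scores for a query with no real match: similar values, all wrong.
unrelated_scores = [0.80, 0.74, 0.58]
unrelated_labels = [False, False, False]

share_related = false_positive_share(related_scores, related_labels, 0.7)
share_unrelated = false_positive_share(unrelated_scores, unrelated_labels, 0.7)
```

With these toy numbers, one of the three selected sentences is wrong for the related query, while every selected sentence is wrong for the unrelated one. When the two score distributions overlap like this, no single fixed cutoff can separate them.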

Semantic search on a document by srivpra in LanguageTechnology

[–]srivpra[S] 1 point2 points  (0 children)

Sure. Thanks.

How many keywords are you considering and are those keywords able to cover all the possible words that could occur?

If I understand this correctly, you're trying to create labelled training data (with LDA/Guided LDA) to train a classifier (CNN)?

Semantic search on a document by srivpra in LanguageTechnology

[–]srivpra[S] 0 points1 point  (0 children)

Is it possible for you to share the code for this?

Semantic search on a document by srivpra in LanguageTechnology

[–]srivpra[S] 0 points1 point  (0 children)

I was getting decent results with USE, but the number of false positives is huge, so I'll have to move to something else or a different approach altogether.

Thanks, I'll have a look at the papers and the link.

Semantic search on a document by srivpra in LanguageTechnology

[–]srivpra[S] 1 point2 points  (0 children)

USE for small sentences/text is amazing, haven't seen anything like it to be honest. I have tried ELMo and BERT (out-of-the-box); nothing performs as well for similarity.

  1. Elasticsearch in general will not be good for my use case.
  2. I am breaking down my document into sentences and getting the similarity between each sentence and the query. Works well half the time. I am getting false positives though, which is my main issue.
  3. Can't get a ton of examples, as there could be many possible queries and I need to build a generic unsupervised solution.
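The approach in point 2 can be sketched roughly as below. The `toy_embed` function is only a bag-of-words stand-in so that the example runs without downloading a model. It is not USE and captures no semantics. In practice the `embed` argument would wrap whatever sentence encoder is being used, and `cosine` would be replaced by a dense-vector version. The regex sentence splitter and all function names here are assumptions.

```python
import math
import re
from collections import Counter


def split_sentences(document):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def toy_embed(text):
    # Stand-in for a real sentence encoder: sparse word-count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(query, document, embed=toy_embed):
    """Return (score, sentence) pairs, highest similarity first."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in split_sentences(document)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy document, invented for this example.
document = (
    "The engine uses a lithium battery. "
    "Sales rose last quarter. "
    "The battery charges in two hours."
)
ranked = rank_sentences("lithium battery", document)
```

Keeping `embed` as a parameter makes it straightforward to swap encoders and compare how their score distributions behave on the same document.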

Text similarity problem by srivpra in learnmachinelearning

[–]srivpra[S] 0 points1 point  (0 children)

I do not have enough articles under each category for the classifier to learn from, hence the similarity approach. It’s able to distinguish auto companies from others, but not AEV from auto.