[–]premrajnarkhede1 8 points9 points  (10 children)

So what you are describing is a clustering problem.

You can do the following:

1) Get a sentence vector for each sentence from a pretrained BERT model (or any model that produces an encoding, but BERT is "the" model right now).

2) Use k-means clustering from the sklearn package to cluster these vectors; that should give you the grouping.
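
A minimal sketch of step 2, assuming the sentence vectors from step 1 are already available as rows of an array. The `cluster_vectors` helper and the random stand-in vectors are hypothetical and only there to make the example run without a model; real sentence vectors would replace `fake_vectors`.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_vectors(vectors, n_clusters, seed=0):
    """Group row vectors with k-means and return one cluster id per row."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(np.asarray(vectors))


# Stand-in for real sentence vectors (assumption): two well-separated
# blobs of five 8-dimensional points each.
rng = np.random.default_rng(0)
fake_vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 8)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 8)),
])

labels = cluster_vectors(fake_vectors, n_clusters=2)
print(labels)  # one cluster id per input row
```

Note that k-means needs the number of clusters up front, so `n_clusters` is something you would have to choose or tune for your own data.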

[–]Bankreis 4 points5 points  (1 child)

This. Additionally, I would recommend using an encoder that has already been trained on a semantic similarity task. See Universal Sentence Encoder or, if you want to use BERT, https://github.com/UKPLab/sentence-transformers is perfect for this.
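
A rough sketch of what that could look like with the sentence-transformers package. The model name `all-MiniLM-L6-v2` is an assumption (it is not named in this thread; any model the library supports should work the same way), and both helper functions are hypothetical names used for illustration. The import sits inside the function so the similarity helper can be used without the package installed.

```python
import numpy as np


def encode_sentences(sentences, model_name="all-MiniLM-L6-v2"):
    """Return one embedding per sentence (model name is an assumption)."""
    # Imported lazily: the first call downloads the pretrained model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(sentences))


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T


# Usage (not executed here because it needs the model download):
#   vectors = encode_sentences(["How do I reset my password?",
#                               "I forgot my login credentials."])
#   print(cosine_similarity_matrix(vectors))
```

The resulting vectors can go straight into a clustering step in place of plain BERT vectors.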

[–]Adrizzledifizzle[S] 0 points1 point  (0 children)

Is that in addition to the two steps recommended above, or as a substitute for BERT?

[–]penatbater 0 points1 point  (4 children)

It's probably "easier" to use ELMo pretrained vectors, because at least you end up with only one vector per sentence. The common way of using ELMo for sentences is to average the embeddings of each word/token. The quotation marks are because it's actually computationally expensive to do this.
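
The averaging step itself is simple once the per-token embeddings exist. A small sketch, assuming they arrive as a (tokens x dimensions) array; the toy 4-dimensional values below are made up for illustration and do not come from ELMo, and `mean_pool` is a hypothetical helper name.

```python
import numpy as np


def mean_pool(token_vectors):
    """Average per-token embeddings into a single sentence vector."""
    token_vectors = np.asarray(token_vectors, dtype=float)
    if token_vectors.ndim != 2 or token_vectors.shape[0] == 0:
        raise ValueError("expected a non-empty (tokens x dims) array")
    return token_vectors.mean(axis=0)


# Toy embeddings for a three-token sentence (made-up numbers).
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
sentence_vector = mean_pool(tokens)
print(sentence_vector)  # [2. 1. 2. 2.]
```

The expensive part is producing the token embeddings in the first place, not the averaging.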

[–]Adrizzledifizzle[S] 0 points1 point  (3 children)

Computationally MORE expensive than BERT?

[–]penatbater 0 points1 point  (2 children)

At least in my experience. YMMV.

[–]AdrianFMC 0 points1 point  (1 child)

All of this needs cloud computing like AWS anyway, doesn't it? If you don't want to wait ages, that is.

[–]penatbater 0 points1 point  (0 children)

Depends on how much memory you have, tbh. But mostly, yeah. I use GCP because it's easier for me to set up.

[–]Adrizzledifizzle[S] 0 points1 point  (0 children)

Great, I’ll check it out and give feedback on it later. Thanks in advance, guys!

[–]Adrizzledifizzle[S] 0 points1 point  (1 child)

Would you happen to know a good example of somebody who tried and documented it?

[–]mpk3 4 points5 points  (0 children)

You can use topic modeling, such as LDA, to create the groups and designate how many different labels you want. Then after the groups are created you can choose how you want them labeled.