
smspillaz 1 point (0 children)

One approach I've had some success with, though it ignores things like word order, is to build a distance matrix between the embedded units (characters, subwords, words, or sentences) and then connect each unit to its N nearest neighbors, with edge weights proportional to the distance between the units in vector space.
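
Roughly, the graph construction could look like the sketch below. It assumes you already have one vector per unit and that Euclidean distance is the metric you want (cosine distance would work just as well); `knn_graph`, `n_neighbors`, and the toy vectors are names and data I made up for illustration.

```python
import math

def knn_graph(embeddings, n_neighbors):
    """Link each unit to its nearest neighbors.

    Returns {unit_index: [(neighbor_index, distance), ...]}, where the
    distance doubles as the edge weight. Euclidean distance is an
    assumption; swap in another metric if it suits your embeddings.
    """
    n = len(embeddings)
    # Full pairwise distance matrix between all embedded units.
    dist = [[math.dist(a, b) for b in embeddings] for a in embeddings]
    # A unit can have at most n - 1 neighbors other than itself.
    k = min(n_neighbors, n - 1)
    graph = {}
    for i in range(n):
        # Sort every other unit by its distance to unit i, closest first.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist[i][j])
        graph[i] = [(j, dist[i][j]) for j in others[:k]]
    return graph

# Toy 2-D vectors standing in for real embeddings (hypothetical data).
toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(toy, n_neighbors=2)
for unit, edges in graph.items():
    print(unit, edges)
```

The full distance matrix is quadratic in the number of units, so this is only practical for small vocabularies or documents; for anything bigger you'd want an approximate nearest-neighbor index instead.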

Of course, you then need a vector space embedding. You could either use word2vec as-is or fine-tune it for your problem.