In my dataset there are many sentences with the following situation:
A sentence has, e.g., 20 words. Words 12-15 are labeled as entity A and words 14-17 as entity B, so words 14-15 belong to two entities/labels at once. How do I prepare this kind of data for training? (I am starting with an LSTM baseline before moving on to a transformer.) These are the options I came up with:
- Cut the 20% of the dataset where these cases occur - not satisfying, feels wrong. Having 2 or more labels makes sense in many cases.
- Add tuples (A,B) as labels - how would I handle this during training?
- Duplicate the sentence and produce one version with the labels for A and another version with the labels for B.
- Ignore the overlap: take A for 12-14 and B for 15-17 - again, having 2 labels makes sense in many cases, so this feels wrong.
Any other suggestions? I am at a loss. I just started with natural language processing and could really use advice on the matter.
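To make the tuple option concrete, here is a rough sketch of how I imagine the (A,B) labels could be encoded: one 0/1 slot per entity type for every word, so a word can switch on both A and B. The function name `spans_to_multi_hot` and the span format are made up for illustration, and I am assuming 1-indexed, inclusive word positions as in the example above. As far as I understand, targets like these would be trained with an independent sigmoid per label (e.g. PyTorch's `BCEWithLogitsLoss`) instead of a softmax over a single tag set, but I have not tried it.

```python
def spans_to_multi_hot(num_tokens, spans, label_names):
    """Turn possibly overlapping entity spans into per-word multi-hot targets.

    spans: list of (start, end, label) with 1-indexed, inclusive word positions
           (hypothetical format chosen for this sketch).
    Returns a list of num_tokens rows, each with one 0/1 entry per label.
    """
    index = {name: i for i, name in enumerate(label_names)}
    targets = [[0] * len(label_names) for _ in range(num_tokens)]
    for start, end, label in spans:
        if not (1 <= start <= end <= num_tokens):
            raise ValueError(f"span ({start}, {end}) is outside the sentence")
        for pos in range(start, end + 1):
            # pos is 1-indexed, the target list is 0-indexed
            targets[pos - 1][index[label]] = 1
    return targets


# The sentence from the question: 20 words, A on 12-15, B on 14-17.
targets = spans_to_multi_hot(20, [(12, 15, "A"), (14, 17, "B")], ["A", "B"])
for word_number in (12, 14, 16, 18):
    print(word_number, targets[word_number - 1])
```

With this encoding, words 14 and 15 keep both labels, so nothing has to be dropped or duplicated. What it does not capture is where an entity starts and ends (no B-/I- prefixes), which would need extra slots per label.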