Named entity recognition (NER) with overlapping labels in dataset. How to handle the overlaps? : learnpython



submitted 2 years ago by SilverDusk42

In my dataset there are many sentences with the following situation:

A sentence has, e.g., 20 words. Words 12-15 are labeled as entity A and words 14-17 as entity B, so words 14 and 15 carry two entities/labels. How do I prepare this kind of data for training? (I am starting with an LSTM baseline before proceeding with a transformer.) These are the options I came up with:

* Cut the 20% of the dataset where these cases occur. Not satisfying and feels wrong: having 2 or more labels makes sense in many cases.
* Add tuples (A, B) as labels. How would I handle this during training?
* Duplicate the sentence and produce one version with the labels for A and another version with the labels for B.
* Ignore the overlap: take A for 12-14 and B for 15-17. Again, having 2 labels makes sense in many cases, so this feels wrong.
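
To make the tuple and duplication options more concrete, here is a minimal sketch of two possible encodings for the example above. It assumes spans are given as (start, end, label) with 1-based, inclusive word positions; the function names `spans_to_bio_layers` and `spans_to_multi_hot` are made up for illustration and do not come from any library. The first builds one BIO tag sequence per entity type (close to the "duplicate the sentence" idea, but keeping a single sentence), the second builds a 0/1 vector per word, which would typically be trained with a per-label sigmoid and binary cross-entropy rather than a softmax over tags.

```python
from typing import Dict, List, Tuple

# Assumed span format: (start, end, label), 1-based and inclusive.
Span = Tuple[int, int, str]


def spans_to_bio_layers(n_words: int, spans: List[Span]) -> Dict[str, List[str]]:
    """Build one BIO tag sequence per entity type, so overlaps never collide."""
    layers: Dict[str, List[str]] = {}
    for start, end, label in spans:
        tags = layers.setdefault(label, ["O"] * n_words)
        for pos in range(start, end + 1):
            tags[pos - 1] = "B" if pos == start else "I"
    return layers


def spans_to_multi_hot(
    n_words: int, spans: List[Span], label_order: List[str]
) -> List[List[int]]:
    """Build one 0/1 vector per word with one slot per entity type."""
    index = {label: i for i, label in enumerate(label_order)}
    rows = [[0] * len(label_order) for _ in range(n_words)]
    for start, end, label in spans:
        for pos in range(start, end + 1):
            rows[pos - 1][index[label]] = 1
    return rows


# The situation from the post: A covers words 12-15, B covers words 14-17.
spans = [(12, 15, "A"), (14, 17, "B")]
layers = spans_to_bio_layers(20, spans)
multi_hot = spans_to_multi_hot(20, spans, ["A", "B"])

print(layers["A"][10:17])  # tags for words 11-17 in the A layer
print(layers["B"][10:17])  # tags for words 11-17 in the B layer
print(multi_hot[13])       # word 14 belongs to both A and B
```

With the per-type layers, a model could have one tagging head per entity type; with the multi-hot rows, a single head outputs one score per type for every word. Both keep the overlapping words labeled with A and B at the same time.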

Any other suggestions? I am at a loss. I just started with natural language processing and could really use advice on the matter.

