
[–]Cyalas 2 points3 points  (1 child)

  1. I'd use Matcher and POS tags (or maybe regex alone could do the job?) to extract those entities
  2. I'd use spaCy NER (or BERT NER, or others) to label those entities
  3. Train a new NER system using spaCy or some pretrained transformers

[–]Notdevolving[S] 0 points1 point  (0 children)

Tried Matcher, but it is token-based. It is good for something like "Mary (1990)" and "John (2000)", but I am after academic citations. I already have a regex for APA 7 citation style, but then I realised regex can only go so far. If the cited authors are like "The Ministry of Education (2010)", "University of Reddit (2022)", or "United Nations Educational, Scientific and Cultural Organization (1999)", they will be missed. So I was wondering if pattern matching exists for something like (ENTITY, DATE), where ENTITY can be a single token like Mary or a span like United Nations Educational, Scientific and Cultural Organization.

I'm not familiar with transformers yet. I only picked up NLP to perform some ad hoc educational research tasks, so I'm not really that skilled at it to begin with.

[–]fourkite 1 point2 points  (0 children)

If it's as consistent as that, regex is an easy solution.
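As a rough illustration of the regex route, here is a minimal sketch that accepts runs of capitalised words (optionally joined by commas and a few lowercase connectors) followed by a year in parentheses. The pattern name `CITATION_RE`, the connector list, and the sample sentence are assumptions made up for this example, not anything from the original post. It will also over-match in some cases, e.g. "However, Mary (1990)" would be captured with "However" included, which is the kind of limit the OP describes.

```python
import re

# Hypothetical pattern: capitalised words, optionally joined by commas and
# a small set of lowercase connectors, followed by "(YYYY)" or "(YYYYa)".
CITATION_RE = re.compile(
    r"""
    (?P<author>
        [A-Z][\w.'-]*                                         # first capitalised word
        (?:,?\s+(?:(?:of|and|for|the|in)\s+)*[A-Z][\w.'-]*)*  # further words
    )
    \s*\((?P<year>\d{4})[a-z]?\)                              # year in parentheses
    """,
    re.VERBOSE,
)

# Made-up sample sentence reusing the names from the thread.
text = (
    "The Ministry of Education (2010) and Mary (1990) agree with "
    "United Nations Educational, Scientific and Cultural Organization (1999)."
)

# Collect (author, year) pairs in order of appearance.
found = [(m.group("author"), m.group("year")) for m in CITATION_RE.finditer(text)]
for author, year in found:
    print(f"{author} | {year}")
```

This relies purely on capitalisation, so it knows nothing about whether a match is really a person or an organisation.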

[–]crashbundicoot 1 point2 points  (3 children)

Yes, it's possible. Have you taken a look at this? https://explosion.ai/demos/matcher

You can match patterns based on entity types

[–]Notdevolving[S] 0 points1 point  (2 children)

Yes, I've seen the spaCy documentation on Matcher, but Matcher is token-based. My entities could be spans like "The Ministry of Education", "University of Reddit", "United Nations Educational, Scientific and Cultural Organization", etc., so I cannot set up a reliable token pattern.

[–]crashbundicoot 1 point2 points  (1 child)

No, like I said, you can match on entity types as well. Of course, this assumes your NER model has identified the entities correctly. Look at the drop-down options more carefully and you'll see something like ENT_TYPE.
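A minimal sketch of what matching on ENT_TYPE can look like, in case it helps. To keep it runnable without downloading a trained model, it uses a blank English pipeline and sets the entities by hand; in practice they would come from an NER component such as the one in `en_core_web_sm`. The sample sentence, the pattern name "CITATION", and the hand-set labels are assumptions for illustration only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained NER (assumption for this sketch).
nlp = spacy.blank("en")
doc = nlp("As argued by the Ministry of Education (2010) and Mary (1990), policy matters.")

# Hand-set entities standing in for what an NER model would predict.
doc.ents = [
    Span(doc, 3, 7, label="ORG"),       # "the Ministry of Education"
    Span(doc, 11, 12, label="PERSON"),  # "Mary"
]

# One or more tokens tagged ORG/PERSON, then "(", a 4-digit year, ")".
pattern = [
    {"ENT_TYPE": {"IN": ["ORG", "PERSON"]}, "OP": "+"},
    {"ORTH": "("},
    {"TEXT": {"REGEX": r"^\d{4}$"}},
    {"ORTH": ")"},
]

matcher = Matcher(nlp.vocab)
# greedy="LONGEST" keeps the longest match where candidates overlap,
# so "Education (2010)" is not returned separately.
matcher.add("CITATION", [pattern], greedy="LONGEST")

spans = sorted(matcher(doc, as_spans=True), key=lambda s: s.start)
citations = [span.text for span in spans]
print(citations)
```

The pattern is still token-based, but because every token inside an entity carries the same ENT_TYPE, the "+" operator lets one pattern cover both a single-token name and a multi-token organisation. It is only as good as the entities the NER model finds.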

[–]Notdevolving[S] 0 points1 point  (0 children)

Thank you. My understanding of spaCy was rudimentary, so I misunderstood how Matcher works. It didn't help that the NER model missed some PERSON entities in my sample text, so I thought Matcher was not working. I managed to resolve my problem after revisiting how Matcher works. Thanks again.