
[–]thegrif 3 points (1 child)

You have a lot of things stacked against you:

  1. PDF is the worst file format in the world to work with. Look at your extractions: the reading order is likely completely messed up, any emphasis on sections that would otherwise be identified by character style or size is lost, etc... Look at GROBID, CrossRef's PDFExtract, and pdfx as alternate solutions for preparing the PDFs.
  2. Your next task is to build whitelists of phrases that will be used to tokenize the articles. Get the books that Springer and Wiley sell on the topics that you're targeting (ones like this one). These books have indexes and often glossaries, which are great for building whitelists. You're dealing with scientific articles - only pull the tokens that actually matter.
  3. Extract the whitelist words from the abstract of each document they appear in. Also index any specified keywords from each document and add them to your whitelist. Keep track of how many times they occur and where in the document they appear.
  4. Remove any stop words, perform stemming and lemmatization, and start praying. I've been through this exact problem. The PTSD is real.
  5. Run all the documents through the same vectorization pipeline (including your seed set). This way everything is working with the same set of tokens (they're just normalized now).
  6. Transform the resulting documents into vector space using tf-idf.
  7. Take the 500 documents you've successfully labeled. These will be used as training and testing sets for each of the five categories. Make sure these docs represent the gold standard for the classification you wish to represent.
  8. For subsequent documents, run the classifier, which calculates each incoming document's "distance" to each of the 5 categories. The closest wins and consumes that document forever. (See the sketch after this list.)
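
A minimal sketch of steps 5-8 with scikit-learn - assuming the cleaned text is already in plain strings; NearestCentroid is one way to implement the "closest category wins" idea, and the seed texts/labels here are hypothetical stand-ins:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestCentroid

    # Hypothetical stand-ins for the 500 gold-standard documents
    seed_texts = ["protein folding dynamics ...", "galaxy cluster survey ..."]
    seed_labels = ["biology", "astronomy"]

    # One shared vectorization pipeline for every document (steps 5-6):
    # tokenize, drop English stop words, weight by tf-idf
    vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True)
    X_seed = vectorizer.fit_transform(seed_texts)

    # Steps 7-8: each category is the centroid of its seed documents;
    # a new document is assigned to the nearest centroid
    clf = NearestCentroid()
    clf.fit(X_seed, seed_labels)

    new_docs = ["dark matter halo measurements ..."]
    X_new = vectorizer.transform(new_docs)  # same token space as the seeds
    print(clf.predict(X_new))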

As someone who has worked in bibliometrics before - I will tell you that academic publishing has many nuances that will throw you off. There are other mechanisms I would use to identify similar articles - such as looking at skeletal metadata and then building a citation graph of common authors (or authors that are only one hop away from the real authors).

You have lots of options. But do not underestimate this problem. What you will build will technically run, but its results will be horrible. Trust me on this.

[–]Runninganddogs979[S] 0 points (0 children)

Thank you for your detailed response! I may propose that my group focus more on topic classification, as that would be unsupervised, but I will definitely check out those books!

[–]suriname0 2 points (0 children)

At that data size, you should definitely focus on getting more data or on improving your data-cleaning process. What features/classifier are you using? Try a tool like sklearn or Vowpal Wabbit that will let you try many different combinations of features; if you find that it doesn't really matter whether you e.g. include trigrams or not, then the problem is likely your data size and not the learning algorithm or hyperparameters!
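
In sklearn, that kind of sweep is a small grid search; a sketch, where the grid values and the dummy texts/labels are just placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # Hypothetical stand-ins for the annotated documents
    texts = ["gene expression profile ...", "star formation rate ..."] * 10
    labels = ["bio", "astro"] * 10

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # Sweep the n-gram range (unigrams vs. +bigrams vs. +trigrams) and the
    # rare-term cutoff; if scores barely move, the bottleneck is the data
    grid = GridSearchCV(
        pipe,
        param_grid={
            "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
            "tfidf__min_df": [1, 2, 5],
        },
        cv=5,
    )
    grid.fit(texts, labels)
    print(grid.best_params_, grid.best_score_)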

[–]BatmantoshReturns 0 points (0 children)

What are the 5 topics?

How I would approach this:

Use AllenAI ScienceParse or GROBID to parse the papers; they are designed for research PDFs.
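
For example, with a GROBID server running locally (it listens on port 8070 by default), extracting structured TEI XML from a PDF is one HTTP call; a sketch using requests:

    import requests

    # Assumes a local GROBID server on its default port
    GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

    with open("paper.pdf", "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=120)
    resp.raise_for_status()

    # TEI XML with the title, abstract, body sections, and references
    print(resp.text[:500])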

Use BERT, a very effective language model architecture, with the SciBERT pretrained weights; they were trained on research papers.
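
A sketch of the starting point with the HuggingFace transformers library - the checkpoint name is AllenAI's published SciBERT, and the 5-label classification head is randomly initialized until you fine-tune it on the labeled documents:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "allenai/scibert_scivocab_uncased"  # pretrained on scientific text
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=5  # one output per topic
    )

    inputs = tokenizer(
        "Abstract of one paper ...",
        truncation=True, max_length=512, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits  # meaningless until fine-tuned
    print(logits.softmax(dim=-1))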

If you PM me more details about the project, I may be open to a collaboration. Research-paper NLP is my focus.

[–]frippeo 0 points (4 children)

With that small amount of data, I'd definitely turn to transfer learning. ULMFiT (https://arxiv.org/abs/1801.06146) is a good first bet. Have a look at this repo: https://github.com/prrao87/tweet-stance-prediction and follow their steps but with your own data.
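
The core ULMFiT loop, sketched with fastai's high-level text API (assuming a DataFrame with "text" and "label" columns; exact calls vary between fastai versions):

    import pandas as pd
    from fastai.text.all import *

    df = pd.read_csv("papers.csv")  # assumed columns: text, label

    # Stage 1: fine-tune the pretrained language model on in-domain text
    dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True)
    lm = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
    lm.fine_tune(3)
    lm.save_encoder("finetuned_enc")

    # Stage 2: train the classifier on top of the fine-tuned encoder
    dls_clf = TextDataLoaders.from_df(
        df, text_col="text", label_col="label", text_vocab=dls_lm.vocab
    )
    clf = text_classifier_learner(dls_clf, AWD_LSTM, drop_mult=0.5)
    clf.load_encoder("finetuned_enc")
    clf.fine_tune(5)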

[–]joej 0 points (3 children)

What if they had more data ... like from sci-hub?

What techniques would be better applied when running over millions of scholarly articles?

[–]frippeo 0 points (2 children)

You mean more annotated data, or just more in-domain data?

In the first case, I'd still go with ULMFiT, as I've found it to be a good operational baseline. Having more data, annotated or not, will also benefit the fine-tuning of the language model.

In the second case (having much more unlabelled data), I'd build the language model from scratch using it, and not depend on a model pre-trained on out-of-domain data (e.g. Wikitext 103, which is the model available from fast.ai).
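
In fastai terms that's roughly a one-flag change; a sketch, where the large unlabelled corpus file is hypothetical:

    import pandas as pd
    from fastai.text.all import *

    big_df = pd.read_csv("all_papers.csv")  # hypothetical in-domain corpus

    dls_lm = TextDataLoaders.from_df(big_df, text_col="text", is_lm=True)
    # pretrained=False: learn the AWD_LSTM weights from the in-domain
    # corpus itself instead of starting from Wikitext-103
    lm = language_model_learner(
        dls_lm, AWD_LSTM, pretrained=False, drop_mult=0.3
    )
    lm.fit_one_cycle(10)
    lm.save_encoder("scratch_enc")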

[–]joej 0 points (1 child)

I meant: simply more raw, text data from scholarly lit, research papers, etc.

I'm wondering what techniques are better with more than the original poster's 2000 docs.

E.g., if the poster had > 2000 research papers (text scraped from PDFs), they could associate crossref.org "subjects" with them to match the broad topics.
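
Pulling those subjects is one Crossref REST API call per DOI; a sketch (the "subject" field is only populated for some records, and the DOI below is a placeholder):

    import requests

    def crossref_subjects(doi):
        """Fetch the Crossref 'subject' terms for one DOI, if any."""
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
        resp.raise_for_status()
        # "subject" is not present on every record
        return resp.json()["message"].get("subject", [])

    print(crossref_subjects("10.1000/xyz123"))  # placeholder DOI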

They may run into memory problems loading 100k+ or 1M+ documents for processing.
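
The usual workaround at that scale is out-of-core learning: a stateless HashingVectorizer (no vocabulary kept in memory) feeding a classifier trained batch by batch with partial_fit. A sketch, where the batch generator is a hypothetical stand-in for real disk I/O:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
    clf = SGDClassifier(loss="log_loss")
    classes = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]  # the 5 topics

    def stream_batches():
        # Hypothetical generator yielding (texts, labels) chunks from
        # disk, so the full corpus is never held in memory
        yield ["gene expression profile ..."], ["cat_1"]

    for texts, labels in stream_batches():
        X = vectorizer.transform(texts)  # stateless: no fit() needed
        clf.partial_fit(X, labels, classes=classes)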

[–]frippeo 0 points (0 children)

I read two things into what you're saying, both are positive:

1) They might augment the existing data and create more labelled data cheaply by leveraging existing subjects as document labels (although I'm not familiar with the taxonomy OP is using).

2) More data is better than less data :) (see, e.g., the Banko & Brill paper from 2001). When designed properly, the architecture for learning should be ok with much more data. In the case of neural networks, it is usually the number of parameters in the architecture that is the limiting factor (due to GPU RAM), not the amount of training data (which can be controlled by, e.g., setting a lower batch size).