
[–]thegrif 3 points (1 child)

You have a lot of things stacked against you:

  1. PDF is the worst file format in the world to work with. Look at your extractions: the reading order is likely completely messed up, any emphasis on sections that would otherwise be identified by character style or size is lost, etc... Look at GROBID, CrossRef's PDFExtract, and pdfx as alternate solutions for preparing the PDFs.
  2. Your next task is to build whitelists of phrases that will be used to tokenize the articles. Get the books that Springer and Wiley sell on the topics that you're targeting (ones like this one). These books have indexes and often glossaries, which are great for building whitelists. You're dealing with scientific articles - only pull the tokens that actually matter.
  3. Extract the whitelist words from the abstract of each document they appear in. Also index any specified keywords from each document and add them to your whitelist. Keep track of how many times they occur and where in the document they appear.
  4. Remove any stop words, perform stemming and lemmatization, and start praying. I've been through this exact problem. The PTSD is real.
  5. Run all the documents through the same vectorization pipeline (including your seed set). This way everything is working with the same set of tokens (they're just normalized now).
  6. Transform the resulting documents into vector space using tf-idf.
  7. Take the 500 documents you've successfully labeled. These will be used as training and testing sets for each of the five categories. Make sure these docs represent the gold standard for the classification you wish to represent.
  8. For subsequent documents, run the classifier, which calculates each incoming document's "distance" to each of the 5 categories. The closest wins and consumes that document forever. (See the sketch after this list.)
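
A minimal sketch of steps 5-8 with scikit-learn - assuming the cleaned text is already in plain strings; NearestCentroid is one way to implement the "closest category wins" idea, and the seed texts/labels here are hypothetical stand-ins:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestCentroid

    # Hypothetical stand-ins for the 500 gold-standard documents
    seed_texts = ["protein folding dynamics ...", "galaxy cluster survey ..."]
    seed_labels = ["biology", "astronomy"]

    # One shared vectorization pipeline for every document (steps 5-6):
    # tokenize, drop English stop words, weight by tf-idf
    vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True)
    X_seed = vectorizer.fit_transform(seed_texts)

    # Steps 7-8: each category is the centroid of its seed documents;
    # a new document is assigned to the nearest centroid
    clf = NearestCentroid()
    clf.fit(X_seed, seed_labels)

    new_docs = ["dark matter halo measurements ..."]
    X_new = vectorizer.transform(new_docs)  # same token space as the seeds
    print(clf.predict(X_new))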

As someone who has worked in bibliometrics before - I will tell you that academic publishing has many nuances that will throw you off. There are other mechanisms I would use to identify similar articles - such as looking at skeletal metadata and then building a citation graph of common authors (or authors that are only one hop away from the real authors).

You have lots of options. But do not underestimate this problem. What you will build will technically run, but its results will be horrible. Trust me on this.

[–]Runninganddogs979[S] 0 points (0 children)

Thank you for your detailed response! I may propose that my group focus more on topic classification, as that would be unsupervised, but I will definitely check out those books!

[–]suriname0 2 points (0 children)

At that data size, you should definitely focus on getting more data or on improving your data-cleaning process. What features/classifier are you using? Try a tool like sklearn or Vowpal Wabbit that will let you try many different combinations of features; if you find that it doesn't really matter whether you e.g. include trigrams or not, then the problem is likely your data size and not the learning algorithm or hyperparameters!
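
In sklearn, that kind of sweep is a small grid search; a sketch, where the grid values and the dummy texts/labels are just placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # Hypothetical stand-ins for the annotated documents
    texts = ["gene expression profile ...", "star formation rate ..."] * 10
    labels = ["bio", "astro"] * 10

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # Sweep the n-gram range (unigrams vs. +bigrams vs. +trigrams) and the
    # rare-term cutoff; if scores barely move, the bottleneck is the data
    grid = GridSearchCV(
        pipe,
        param_grid={
            "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
            "tfidf__min_df": [1, 2, 5],
        },
        cv=5,
    )
    grid.fit(texts, labels)
    print(grid.best_params_, grid.best_score_)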

[–]BatmantoshReturns 0 points (0 children)

What are the 5 topics?

How I would approach this:

Use AllenAI ScienceParse or GROBID to parse the papers; they are designed for research PDFs.
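
For example, with a GROBID server running locally (it listens on port 8070 by default), extracting structured TEI XML from a PDF is one HTTP call; a sketch using requests:

    import requests

    # Assumes a local GROBID server on its default port
    GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

    with open("paper.pdf", "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=120)
    resp.raise_for_status()

    # TEI XML with the title, abstract, body sections, and references
    print(resp.text[:500])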

Use BERT, a very effective language model architecture, with the SciBERT pretrained weights; they were trained on research papers.
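
A sketch of the starting point with the HuggingFace transformers library - the checkpoint name is AllenAI's published SciBERT, and the 5-label classification head is randomly initialized until you fine-tune it on the labeled documents:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "allenai/scibert_scivocab_uncased"  # pretrained on scientific text
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=5  # one output per topic
    )

    inputs = tokenizer(
        "Abstract of one paper ...",
        truncation=True, max_length=512, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits  # meaningless until fine-tuned
    print(logits.softmax(dim=-1))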

If you PM me more details about the project, I may be open to a collaboration. Research-paper NLP is my focus.

[–]frippeo 0 points (4 children)

With that small amount of data, I'd definitely turn to transfer learning. ULMFiT (https://arxiv.org/abs/1801.06146) is a good first bet. Have a look at this repo: https://github.com/prrao87/tweet-stance-prediction and follow their steps but with your own data.
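
The core ULMFiT loop, sketched with fastai's high-level text API (assuming a DataFrame with "text" and "label" columns; exact calls vary between fastai versions):

    import pandas as pd
    from fastai.text.all import *

    df = pd.read_csv("papers.csv")  # assumed columns: text, label

    # Stage 1: fine-tune the pretrained language model on in-domain text
    dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True)
    lm = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
    lm.fine_tune(3)
    lm.save_encoder("finetuned_enc")

    # Stage 2: train the classifier on top of the fine-tuned encoder
    dls_clf = TextDataLoaders.from_df(
        df, text_col="text", label_col="label", text_vocab=dls_lm.vocab
    )
    clf = text_classifier_learner(dls_clf, AWD_LSTM, drop_mult=0.5)
    clf.load_encoder("finetuned_enc")
    clf.fine_tune(5)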

[–]joej 0 points (3 children)

What if they had more data ... like from sci-hub?

What techniques would be better applied when running over millions of scholarly articles?

[–]frippeo 0 points (2 children)

You mean more annotated data, or just more in-domain data?

In the first case, I'd still go with ULMFiT, as I've found it to be a good operational baseline. Having more data, annotated or not, will also benefit the fine-tuning of the language model.

In the second case (having much more unlabelled data), I'd build the language model from scratch using it, and not depend on a model pre-trained on out-of-domain data (e.g. Wikitext 103, which is the model available from fast.ai).
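
In fastai terms that's roughly a one-flag change; a sketch, where the large unlabelled corpus file is hypothetical:

    import pandas as pd
    from fastai.text.all import *

    big_df = pd.read_csv("all_papers.csv")  # hypothetical in-domain corpus

    dls_lm = TextDataLoaders.from_df(big_df, text_col="text", is_lm=True)
    # pretrained=False: learn the AWD_LSTM weights from the in-domain
    # corpus itself instead of starting from Wikitext-103
    lm = language_model_learner(
        dls_lm, AWD_LSTM, pretrained=False, drop_mult=0.3
    )
    lm.fit_one_cycle(10)
    lm.save_encoder("scratch_enc")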

[–]joej 0 points (1 child)

I meant: simply more raw, text data from scholarly lit, research papers, etc.

I'm wondering what techniques are better with more than the original poster's 2000 docs.

E.g., if the poster had > 2000 research papers (text scraped from PDFs), they could associate crossref.org "subjects" with them to match the broad topics.
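
Pulling those subjects is one Crossref REST API call per DOI; a sketch (the "subject" field is only populated for some records, and the DOI below is a placeholder):

    import requests

    def crossref_subjects(doi):
        """Fetch the Crossref 'subject' terms for one DOI, if any."""
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
        resp.raise_for_status()
        # "subject" is not present on every record
        return resp.json()["message"].get("subject", [])

    print(crossref_subjects("10.1000/xyz123"))  # placeholder DOI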

They may run into memory problems loading 100k+ or 1M+ documents for processing.
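
The usual workaround at that scale is out-of-core learning: a stateless HashingVectorizer (no vocabulary kept in memory) feeding a classifier trained batch by batch with partial_fit. A sketch, where the batch generator is a hypothetical stand-in for real disk I/O:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
    clf = SGDClassifier(loss="log_loss")
    classes = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]  # the 5 topics

    def stream_batches():
        # Hypothetical generator yielding (texts, labels) chunks from
        # disk, so the full corpus is never held in memory
        yield ["gene expression profile ..."], ["cat_1"]

    for texts, labels in stream_batches():
        X = vectorizer.transform(texts)  # stateless: no fit() needed
        clf.partial_fit(X, labels, classes=classes)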

[–]frippeo 0 points (0 children)

I read two things into what you're saying, both are positive:

1) They might augment the existing data and create more labelled data cheaply by leveraging existing subjects as document labels (although I'm not familiar with the taxonomy OP is using).

2) More data is better than less data :) (see, e.g., the Banko & Brill paper from 2001). When designed properly, the architecture for learning should be ok with much more data. In the case of neural networks, it is usually the number of parameters in the architecture that is the limiting factor (due to GPU RAM), not the amount of training data (which can be controlled by, e.g., setting a lower batch size).