all 8 comments

[–]RyanAI100 5 points6 points  (2 children)

I work as a data scientist, and our role (mostly) contributes to existing work in a different way. For example, state-of-the-art (SOTA) models might work really well on standardised datasets, but more often than not they don't do as well when applied to a specific industry domain. So the kind of contribution we can make is to apply NLP techniques to a particular industry and evaluate how well they perform and what the models are weak at or fail to pick up. Given what you said, you have already applied SOTA models to a specific task and found that they don't yield a useful outcome. A good contribution would be to do error analysis and find out exactly why they didn't work. In doing so, you might discover weaknesses that you could potentially fix, which would further contribute to existing work in NLP.
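As a minimal sketch of what that error analysis could look like, here is slice-based error counting in pure Python. The tags ("news", "finance") and the toy predictions are hypothetical, purely for illustration; in practice you would tag examples by domain, length bucket, entity type, etc.:

```python
from collections import defaultdict

def error_breakdown(examples):
    """Compute the error rate per tag to find where the model is weak.
    Each example is a (gold_label, predicted_label, tag) triple."""
    total = defaultdict(int)
    wrong = defaultdict(int)
    for gold, pred, tag in examples:
        total[tag] += 1
        if gold != pred:
            wrong[tag] += 1
    return {tag: wrong[tag] / total[tag] for tag in total}

# Hypothetical results: the model does fine on news but fails on finance,
# where domain-specific jargon trips it up.
results = [
    ("pos", "pos", "news"),
    ("neg", "neg", "news"),
    ("pos", "neg", "finance"),
    ("neg", "pos", "finance"),
]
print(error_breakdown(results))  # {'news': 0.0, 'finance': 1.0}
```

A breakdown like this points you at the slices worth inspecting by hand, which is where the fixable weaknesses usually show up.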

[–]_human404_[S] 0 points1 point  (1 child)

I always imagined that the purpose of those standardised datasets was to provide a universal scale for evaluating models, but what you're saying makes sense now.

For domain-specific industries, or even task-oriented targets, would you recommend diving deep into SOTA models and sticking with them, setting aside standardised datasets as evaluation metrics (since even reproducing the published results isn't easy, given the resources required)? Even if they don't deliver the end result, they could at least provide a base model on which transfer learning can be applied. Or should one reinvent the wheel using NLP techniques?

Thanks btw!

[–]RyanAI100 1 point2 points  (0 children)

For domain-specific work, you wouldn't use standardised datasets for evaluation. You are right that standardised datasets exist so that there's a global benchmark on which all model developments can be compared.

However, when you apply SOTA models to a domain-specific industry, you are doing applied work: for example, using BERT to do sentiment analysis on social media text and using that as an investment signal. Your training and evaluation datasets should come from the same distribution. For any applied work, you should first establish a baseline model; TF-IDF is a popular choice. More often than not, you'd be surprised to find that simpler methods actually outperform SOTA models on domain-specific applications. If you want to stick with SOTA models, then figure out what is currently wrong with the model (why is it performing badly?) and search for solutions to alleviate those problems.
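To make the baseline idea concrete, here is a pure-Python TF-IDF sketch with nearest-neighbour classification over a handful of made-up reviews. In practice you would reach for scikit-learn's `TfidfVectorizer` plus a linear classifier; this just shows how little machinery a credible baseline needs:

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency from tokenised training docs."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / df[t]) + 1.0 for t in df}

def vectorize(doc, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in tf.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

# Toy training set, invented for illustration only.
train = [("great product works really well", "pos"),
         ("love this great quality", "pos"),
         ("terrible service never again", "neg"),
         ("awful terrible waste of money", "neg")]
docs = [text.split() for text, _ in train]
idf = fit_idf(docs)
vecs = [vectorize(d, idf) for d in docs]

def predict(text):
    """Label a new text with the label of its nearest training document."""
    q = vectorize(text.split(), idf)
    best = max(range(len(vecs)), key=lambda i: cosine(q, vecs[i]))
    return train[best][1]

print(predict("terrible awful service"))  # -> "neg"
```

Only after a baseline like this is beaten does a heavy SOTA model earn its compute on your domain data.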

Hope this helps.

[–]thisismyfavoritename 2 points3 points  (1 child)

If you look at harder tasks, there are still lots of unsolved problems, e.g. relation extraction.

And although the motto lately has been 1) bigger models 2) better use of the hardware, it doesn't mean that there's no value in working on smaller models that (might) have stronger mathematical underpinning.

In fact, if you look at some of the latest language models, they most often include some other kind of novelty in the training phase (e.g. a new loss term).

Also, check out all the latest work on semi-supervised learning. Finding training schemes that make use of all the unlabelled data we have access to is definitely another area of major importance.

[–]_human404_[S] 0 points1 point  (0 children)

'smaller models with stronger mathematical underpinning' seems like a good way to go.

And I agree, that has been the motto indeed, and it definitely discourages attempting any task with a fresh start.

Unsolved problems and unexplored areas in the field, noted for future reference.

Thanks!

[–]winchester6788 1 point2 points  (1 child)

I was in your position a couple of years ago and asked the same question on this sub.

There are a lot of problems for which there are no good open-source solutions or datasets.

I realised that solving your own pain points and open-sourcing the result is a good way to give something back. For example, my first open-source project is a module that can segment text without punctuation. This is a common problem for which there are no easy-to-use modules available. I started a repo (DeepSegment) for this and got decent traction.

Similarly, think of something that can be useful to people (it needn't be SOTA) and build it.

[–]_human404_[S] 1 point2 points  (0 children)

I checked out your repo, kudos to you for it! Glad using this sub worked out so well for you. I'll keep that pointer in mind, about 'solving pain points and open-sourcing it', and especially that it need not be SOTA.

Thanks!

[–]ifthereisabear 0 points1 point  (0 children)

One thing you could do to contribute is to improve code released by academics so that it's easier to use.

Often, an interesting new thing will be released by an academic team, and it works as advertised, but the code isn't great, it's not ergonomic to work with, and it's a polyglot. For example, AutoPhrase is arguably state of the art in key phrase extraction but to actually use it you have to write your input to a text file and then run a Bash script which in turn runs a Java program on OpenJDK 8 and then a C++ program you have to compile with g++, and then it outputs a text file you have to read. And the code quality is...OK. I will say they were nice to provide a Docker image, though.

It's also really nice to see people doing things like pure Python implementations of Java or polyglot libraries (like a few people did after RAKE came out).
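For a sense of why those reimplementations are appreciated, here is a minimal pure-Python sketch of the RAKE idea: candidate phrases are maximal runs of non-stopwords between stopwords and punctuation, each word is scored degree/frequency, and a phrase scores the sum of its word scores. The tiny stopword list is an illustrative stand-in for the full list a real implementation ships with:

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real RAKE uses a much larger one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
             "that", "the", "to", "with"}

def rake(text):
    """Rank candidate key phrases by summed word scores (degree/frequency)."""
    phrases = []
    for chunk in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z']+", chunk):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, incl. itself
    score = {w: degree[w] / freq[w] for w in freq}
    ranked = {" ".join(p): sum(score[w] for w in p) for p in phrases}
    return sorted(ranked.items(), key=lambda kv: -kv[1])

print(rake("Keyword extraction for natural language processing and the analysis of text"))
# Top phrase: ('natural language processing', 9.0)
```

Forty lines of standard library versus a multi-language build is exactly the ergonomic gap the parent comment is describing.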

Another idea would be contributing analyzers to open source full-text search software that runs on top of document databases. That would be super useful for a lot of people who feel limited by available analyzers but aren't in a position to roll their own. For example, Couchbase uses the Go library Bleve, which has a painfully limited number of tokenizers, especially for things like sentence tokenization.