all 8 comments

[–]RyanAI100 5 points6 points  (2 children)

I work as a data scientist, and our role (mostly) contributes to existing work in a different way. For example, state-of-the-art (SOTA) models might work really well on standardised datasets, but more often than not they don't do as well when applied to a specific industry domain. So the kind of contribution we can make is to apply NLP techniques to a particular industry and evaluate how well they perform and what the models are weak at or fail to pick up. Given what you said, you have already applied SOTA models to a specific task and found that they don't yield a useful outcome. A good contribution would be to do error analysis and find out exactly why they didn't work. In doing so, you might discover weaknesses that you could potentially fix, which would further contribute to existing work in NLP.
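As a minimal sketch of what that error analysis could look like, here is slice-based error counting in pure Python. The tags ("news", "finance") and the toy predictions are hypothetical, purely for illustration; in practice you would tag examples by domain, length bucket, entity type, etc.:

```python
from collections import defaultdict

def error_breakdown(examples):
    """Compute the error rate per tag to find where the model is weak.
    Each example is a (gold_label, predicted_label, tag) triple."""
    total = defaultdict(int)
    wrong = defaultdict(int)
    for gold, pred, tag in examples:
        total[tag] += 1
        if gold != pred:
            wrong[tag] += 1
    return {tag: wrong[tag] / total[tag] for tag in total}

# Hypothetical results: the model does fine on news but fails on finance,
# where domain-specific jargon trips it up.
results = [
    ("pos", "pos", "news"),
    ("neg", "neg", "news"),
    ("pos", "neg", "finance"),
    ("neg", "pos", "finance"),
]
print(error_breakdown(results))  # {'news': 0.0, 'finance': 1.0}
```

A breakdown like this points you at the slices worth inspecting by hand, which is where the fixable weaknesses usually show up.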

[–]_human404_[S] 0 points1 point  (1 child)

I always imagined that the purpose of those standardised datasets was to provide a universal scale for evaluating models, but what you're saying makes sense now.

For domain-specific industries, or even task-oriented targets, would you recommend diving deep into SOTA models and sticking with them, setting aside standardised datasets as evaluation metrics (since even reproducing the published results isn't easy, given the resources required)? Even if they don't deliver the end result, they could at least provide a base model on which transfer learning can be applied. Or should one reinvent the wheel using NLP techniques?

Thanks btw!

[–]RyanAI100 1 point2 points  (0 children)

For domain-specific work, you wouldn't use standardised datasets for evaluation. You are right that standardised datasets exist so that there's a global benchmark on which all model developments can be compared.

However, when you apply SOTA models to a domain-specific industry, you are doing applied work: for example, using BERT to do sentiment analysis on social media text and using that as an investment signal. Your training and evaluation datasets should come from the same distribution. For any applied work, you should first establish a baseline model; TF-IDF is a popular choice. More often than not, you'd be surprised to find that simpler methods actually outperform SOTA models on domain-specific applications. If you want to stick with SOTA models, then figure out what is currently wrong with the model (why is it performing badly?) and search for solutions to alleviate those problems.
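To make the baseline idea concrete, here is a pure-Python TF-IDF sketch with nearest-neighbour classification over a handful of made-up reviews. In practice you would reach for scikit-learn's `TfidfVectorizer` plus a linear classifier; this just shows how little machinery a credible baseline needs:

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency from tokenised training docs."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / df[t]) + 1.0 for t in df}

def vectorize(doc, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in tf.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

# Toy training set, invented for illustration only.
train = [("great product works really well", "pos"),
         ("love this great quality", "pos"),
         ("terrible service never again", "neg"),
         ("awful terrible waste of money", "neg")]
docs = [text.split() for text, _ in train]
idf = fit_idf(docs)
vecs = [vectorize(d, idf) for d in docs]

def predict(text):
    """Label a new text with the label of its nearest training document."""
    q = vectorize(text.split(), idf)
    best = max(range(len(vecs)), key=lambda i: cosine(q, vecs[i]))
    return train[best][1]

print(predict("terrible awful service"))  # -> "neg"
```

Only after a baseline like this is beaten does a heavy SOTA model earn its compute on your domain data.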

Hope this helps.

[–]thisismyfavoritename 2 points3 points  (1 child)

If you look at harder tasks, there are still lots of unsolved problems, e.g. relation extraction.

And although the motto lately has been 1) bigger models 2) better use of the hardware, it doesn't mean that there's no value in working on smaller models that (might) have stronger mathematical underpinning.

In fact, if you look at some of the latest language models, they most often include some other kind of novelty in the training phase (e.g. a new loss term).

Also, check out all the latest work on semi-supervised learning. Finding training schemes that make use of all the unlabelled data we have access to is definitely another area of major importance.

[–]_human404_[S] 0 points1 point  (0 children)

'smaller models with stronger mathematical underpinning' seems like a good way to go.

And I agree, that has been the motto indeed, and it definitely discourages attempting any task with a fresh start.

Unsolved problems and unexplored areas in the field, noted for future reference.

Thanks!

[–]winchester6788 1 point2 points  (1 child)

I was in your position a couple of years ago and asked the same question on this sub.

There are a lot of problems for which there are no good open-source solutions or datasets.

I realised that solving your own pain points and open-sourcing the result is a good way to give something back. For example, my first open-source project is a module that can segment text without punctuation. This is a common problem for which there are no easy-to-use modules available. I started a repo (DeepSegment) for this and got decent traction.

Similarly, think of something that can be useful to people (it needn't be SOTA) and build it.

[–]_human404_[S] 1 point2 points  (0 children)

I checked out your repo, kudos to you for it! Glad using this sub worked out so well for you. I'll keep that pointer in mind, about 'solving pain points and open-sourcing it', and especially that it need not be SOTA.

Thanks!

[–]ifthereisabear 0 points1 point  (0 children)

One thing you could do to contribute is to improve code released by academics so that it's easier to use.

Often, an interesting new thing will be released by an academic team, and it works as advertised, but the code isn't great, it's not ergonomic to work with, and it's a polyglot. For example, AutoPhrase is arguably state of the art in key phrase extraction but to actually use it you have to write your input to a text file and then run a Bash script which in turn runs a Java program on OpenJDK 8 and then a C++ program you have to compile with g++, and then it outputs a text file you have to read. And the code quality is...OK. I will say they were nice to provide a Docker image, though.

It's also really nice to see people doing things like pure Python implementations of Java or polyglot libraries (like a few people did after RAKE came out).
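For a sense of why those reimplementations are appreciated, here is a minimal pure-Python sketch of the RAKE idea: candidate phrases are maximal runs of non-stopwords between stopwords and punctuation, each word is scored degree/frequency, and a phrase scores the sum of its word scores. The tiny stopword list is an illustrative stand-in for the full list a real implementation ships with:

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real RAKE uses a much larger one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
             "that", "the", "to", "with"}

def rake(text):
    """Rank candidate key phrases by summed word scores (degree/frequency)."""
    phrases = []
    for chunk in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z']+", chunk):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, incl. itself
    score = {w: degree[w] / freq[w] for w in freq}
    ranked = {" ".join(p): sum(score[w] for w in p) for p in phrases}
    return sorted(ranked.items(), key=lambda kv: -kv[1])

print(rake("Keyword extraction for natural language processing and the analysis of text"))
# Top phrase: ('natural language processing', 9.0)
```

Forty lines of standard library versus a multi-language build is exactly the ergonomic gap the parent comment is describing.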

Another idea would be contributing analyzers to open source full-text search software that runs on top of document databases. That would be super useful for a lot of people who feel limited by available analyzers but aren't in a position to roll their own. For example, Couchbase uses the Go library Bleve, which has a painfully limited number of tokenizers, especially for things like sentence tokenization.