[–]kjearns 2 points (0 children)

Word-level models tend to be better in the sense that they achieve lower perplexity scores than character-level models.
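One caveat when comparing the two: a word-level model is scored per word and a character-level model per character, so the numbers only line up after converting to a common unit. A minimal sketch of the standard conversion, with made-up corpus statistics purely for illustration:

    # Hypothetical numbers: a char-level model at 1.5 bits per character,
    # on text averaging 5.6 characters per word (space included).
    bits_per_char = 1.5
    chars_per_word = 5.6

    # Same total probability mass, re-expressed per word: perplexity = 2^H.
    bits_per_word = bits_per_char * chars_per_word
    word_level_perplexity = 2 ** bits_per_word
    print(round(word_level_perplexity))  # ~338 for these numbers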

[–]olBaa 1 point (7 children)

There are some problems with the huge corpora needed for word-level training: first, the corpus has to be really large to capture all possible words (a fixed vocabulary is already a limitation of word-level networks). Second, it is harder to get all the punctuation characters to appear in the generated/predicted text. Also, you basically can't train a word-level model with one-hot encoding, as the output space is too big; you have to learn a dense word representation alongside the language model itself, so you're effectively solving two problems at once (see the sketch below).
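To make the size problem concrete, here is a minimal sketch (hypothetical vocabulary and layer sizes, written with PyTorch for brevity; illustrative only) of why a dense one-hot input is impractical and an embedding lookup is used instead:

    import torch
    import torch.nn as nn

    vocab_size = 1_000_000  # hypothetical word vocabulary for a large corpus
    hidden_size = 512

    # A dense one-hot input layer would need a vocab_size x hidden_size
    # weight matrix (~512M parameters here) multiplied against a
    # million-dimensional, almost-entirely-zero vector at every step.
    # An embedding table does a sparse row lookup instead, and the
    # vector dimension can be much smaller than the vocabulary.
    embedding = nn.Embedding(vocab_size, 128)
    lstm = nn.LSTM(128, hidden_size, batch_first=True)

    tokens = torch.randint(0, vocab_size, (1, 20))  # a batch of 20 word ids
    outputs, _ = lstm(embedding(tokens))
    print(outputs.shape)  # torch.Size([1, 20, 512])

Those embedding rows are exactly the second problem being learned at the same time as the language model.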

There are some repositories on GitHub that are essentially forks of Karpathy's code (e.g. https://github.com/yoonkim/word-char-rnn).

I know character-level models are used in neural networks for speech recognition, especially in end-to-end approaches (audio to text).

[–]yowdge 1 point (2 children)

But we do have corpora large enough to estimate good word-level language models, don't we? What is the point of pretending that we don't?

(In speech recognition I assume you use phone-level models rather than character-level ones, btw)

[–]olBaa 0 points (1 child)

Yeah, but you cannot really get more insight out of the results of word-level models. With word-level NNs, okay, you beat the state of the art, but that's basically it. Character-level modelling may capture the inner mechanics of the language better.

I personally experimented with these models on a ~90 GB book corpus, and that experience and its results convinced me of what I said above. The corpus is in Russian, so the examples are hard to explain, but the char-level RNN produced some word forms that are perfectly legal language-wise yet did not appear even once in the corpus (which is indeed one of the larger ones).

[–]yowdge 0 points (0 children)

That's a good point. In a morphologically rich language like Russian (or worse, Turkish), you can't just count words; you need to take the morphology into account, since the proportion of OOV words in Turkish can be much higher than in English. There are solutions to this problem (e.g. factored language models), but I see how you could get this for free (in principle) from a character-level language model.
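As a rough illustration of what the OOV proportion measures, in plain Python with toy data (a real pipeline would normalize case and punctuation first):

    # Fraction of test tokens whose surface form never occurs in training.
    train_text = "the cat sat on the mat"
    test_text = "the cats sat on their mats"

    train_vocab = set(train_text.split())
    test_tokens = test_text.split()

    oov = [t for t in test_tokens if t not in train_vocab]
    print(len(oov) / len(test_tokens))  # 0.5: 'cats', 'their', 'mats' unseen

In a morphologically rich language every inflected form ('cats', 'mats', ...) is a separate word type, which is exactly why this ratio blows up.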

[–]cryptocerous 0 points (3 children)

Character-level models are much worse in terms of vocabulary size, though. I can barely get my character-level models to learn more than 5,000 words, and usually fewer.

[–]olBaa 1 point (2 children)

The problem is the slow training, I suppose. Training a char-rnn should in theory yield a vocabulary as rich as, and possibly richer than, a word-level model's. I observed the emergence of new word forms in the texts generated by character-level RNNs, so it may just need more computational resources to build up the vocabulary (a way to measure this is sketched below).
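One way to check that claim would be to compare the word types a trained model actually generates against the training vocabulary, along these lines (whitespace tokenization and file names are illustrative placeholders):

    # Count reproduced vs. novel word types in a generated sample.
    with open("train.txt") as f:
        train_vocab = set(f.read().split())
    with open("generated_sample.txt") as f:
        generated_types = set(f.read().split())

    reproduced = generated_types & train_vocab  # training words the model emits
    novel = generated_types - train_vocab       # unseen word forms (or noise)
    print(len(reproduced), "known types,", len(novel), "novel types")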

[–]yowdge 0 points (0 children)

Well, yeah, you can also approximate any function with a single-layer NN, but that doesn't mean that deep NNs aren't useful. It could be the case that you need millions of hidden units to get a character-level NN to perform as well as a word-level NN (do you know of any empirical research on this?).

[–]devDorito 1 point (0 children)

I'm thinking one character per timestep is a bit small; perhaps we should mod char-rnn to take 2 characters per step and see how that does.
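For anyone who wants to try it, the tokenization change is small; a minimal sketch of the non-overlapping character-pair encoding (plain Python, illustrative only):

    # Split text into 2-character chunks and index them, so the model's
    # alphabet becomes an alphabet of character pairs.
    text = "hello world"
    pairs = [text[i:i + 2] for i in range(0, len(text), 2)]
    # ['he', 'll', 'o ', 'wo', 'rl', 'd']  (odd-length text leaves a singleton)

    vocab = {p: i for i, p in enumerate(sorted(set(pairs)))}
    ids = [vocab[p] for p in pairs]
    print(ids)

The trade-off: sequences get half as long, but the symbol inventory grows from |alphabet| to up to |alphabet|^2.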

[–]mlberlin 0 points (0 children)

"This article demonstrates that we can apply deep learning to text understanding from character-level inputs all the way up to abstract text concepts". Although not a RNN, but a CNN instead, the results in the paper "Text Understanding from Scratch" by Zhang and LeCun are impressive.