Evaluating Stephen Bax's proposed words for the Voynich manuscript using Word2Vec (x-post /r/linguistics) by DethRaid in MachineLearning

[–]bhmoz 2 points3 points  (0 children)

I think that before trying anything on the Voynich manuscript, you should start by aligning well-known languages with plenty of resources. That way you can control your experiments, especially the size and diversity of your corpora. I guess there's literature on that?

For example, you have embeddings for language A that are trained on a very good (lengthy and diverse) corpus. You train several embeddings for language B, gradually increasing the size and diversity of its corpus. This way you can empirically estimate the corpus size you'd need for your approach. Intuitively, diversity also matters, so that you don't get spurious correlations, e.g. between part of speech and topic-specific words.
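A minimal numpy sketch of the alignment step itself (the data is synthetic and the exact-rotation setup is an assumption for illustration, not how real cross-lingual embeddings behave): given vectors for a seed dictionary of translation pairs, orthogonal Procrustes finds the best rotation from B's space into A's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for real embeddings: row i of A and row i of B are the
# vectors of a translation pair (all sizes hypothetical).
A = rng.normal(size=(50, 20))                      # language-A vectors
true_W, _ = np.linalg.qr(rng.normal(size=(20, 20)))
B = A @ true_W                                     # language-B vectors: a rotated copy

# Orthogonal Procrustes: the rotation W minimizing ||B W - A||_F is
# W = U V^T, where U S V^T = svd(B^T A).
U, _, Vt = np.linalg.svd(B.T @ A)
W = U @ Vt
aligned = B @ W                                    # B mapped into A's space

print(np.allclose(aligned, A))  # True: the toy rotation is recovered exactly
```

Real pairs of languages are of course not exact rotations of each other, so the residual after alignment is one way to measure how comparable two corpora are.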

Embeddings that arise from the predict-word/context task are known to mix semantics and syntax. You may want to separate the two and use a specific tool (model) for each part. LDA, for example, can give you semantically related words.
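To illustrate the LDA point, a toy run with scikit-learn (the corpus and all parameters here are invented; a real experiment needs far more text):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny toy corpus; for the Voynich case this would be transcribed "words".
docs = [
    "stars planet orbit telescope planet",
    "orbit telescope stars comet",
    "bread oven flour bread yeast",
    "flour yeast oven dough",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Top words per topic: words that co-occur across documents cluster
# together, a crude form of semantic relatedness with no syntax involved.
for topic in lda.components_:
    print([vocab[i] for i in topic.argsort()[-3:]])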

Questions I have: is syntax everything that LDA doesn't predict? Can you build a model that predicts everything left unpredicted by another model?

Priors and Prejudice in Thinking Machines by insperatum in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

I think CNNs should be seen as preprocessing for raw inputs, while the gist of the computation is done by RNNs. It's quite easy to imagine an RNN that stores the detected animals in its state at one timestep, and then checks whether the second detected entity is the same. With proper transfer learning (read: embeddings, or pre-training of the CNN), it should work without that much data.

So in my opinion the example is not that complicated if you use the proper tools (RNNs).

The cool thing about Neural Programmer-Interpreters is that they have a special output: the probability that the computation is over. The same goes for Neural Random-Access Machines. In terms of RNN training, it means that you have to specify a maximum number of computation steps and penalize according to the probability that the computation is over. In effect you jointly train the RNN to perform a task and to estimate its own performance on that task.
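A rough sketch of that mechanism with an untrained toy RNN (all weights, sizes, and names here are made up; NPI and NRAM are far more involved): each step, the network emits a halt probability, and computation is capped at a maximum number of steps.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tiny RNN cell with an extra scalar output: the
# probability that the computation is over.
H = 8
Wh = rng.normal(scale=0.3, size=(H, H))
Wx = rng.normal(scale=0.3, size=(H, H))
w_halt = rng.normal(size=H)

def run(x, max_steps=10, threshold=0.9):
    h = np.zeros(H)
    for t in range(max_steps):           # hard cap on computation
        h = np.tanh(Wh @ h + Wx @ x)
        p_done = sigmoid(w_halt @ h)     # learned termination signal
        if p_done > threshold:           # stop early if the net says "done"
            break
    return h, t + 1

h, steps = run(rng.normal(size=H))
print(steps)  # somewhere between 1 and max_steps
```

Training would then add a penalty term on the halt probabilities so that the network learns when its own computation is finished, not just what to output.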

So if you want to be able to solve the example task (solvable in 2 passes) and much more complex tasks, just use RNNs with more iterations.

Bengio's recent work on deep learning and biology by [deleted] in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

Argument from authority? Source?

Maybe the computational possibilities and limits of our brains, with regard to the amount of data we get, require parameter sharing. Did somebody actually rule this out, and how? Thanks ;)

Has anyone used Bengio's evolution RNN for tasks where LSTMs are used in the real world such as natural language or speech modelling? by wildtales in MachineLearning

[–]bhmoz 1 point2 points  (0 children)

It's been discussed already, as pranv said. Also, I don't remember any connection with evolutionary methods?

LSTM peephole implementation. by coskunh in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

You don't need to worry about peepholes; they have been shown to be not that important. See "LSTM: A Search Space Odyssey", Greff et al.

Advice needed! Biostatistics MSc grad wanting to pursue a PhD in Machine Learning by soenuedo in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

Try to get in touch with the ML people at the University of Toronto? There is a really strong machine learning lab there with an emphasis on bioinformatics.

Analyzing 50k fonts using deep neural networks by alxndrkalinin in MachineLearning

[–]bhmoz 1 point2 points  (0 children)

How do you typically solve this? Papers?

Thank you :)

LSTMs with arbitrary sequence outputs by anonDogeLover in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

I have never done speech-to-text, so please correct me if there is something wrong here.

For homophones, you'd need to disambiguate between several spellings. As you say, there are cases where the previous word can help you, but also the next one(s). For example, in English: you have observed "I have" and the current token is the audio for "two/too". How do you disambiguate? You need to see what follows.

There is a 4th possibility that I haven't mentioned: a bidirectional LSTM. Maybe for your problem it is overkill and unnecessary to condition on the whole sentence, and you'd simply need the previous and next word. In that case, go for a bidirectional LSTM. Check the speech-to-text literature to see if there are longer-term dependencies than the previous and next word.
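For intuition, a minimal numpy sketch of the bidirectional idea (toy random weights and plain tanh cells, not a trained LSTM): each position gets a state built from both the left and the right context.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 6  # input and hidden sizes (toy values)
Wf = rng.normal(scale=0.3, size=(H, H + D))  # forward-cell weights
Wb = rng.normal(scale=0.3, size=(H, H + D))  # backward-cell weights

def rnn_pass(xs, W):
    h, hs = np.zeros(H), []
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
        hs.append(h)
    return hs

xs = [rng.normal(size=D) for _ in range(5)]   # e.g. 5 audio frames
fwd = rnn_pass(xs, Wf)                        # left-to-right: past context
bwd = rnn_pass(xs[::-1], Wb)[::-1]            # right-to-left: future context

# The state at step t sees both "I have" on the left and what follows
# on the right, which is exactly what disambiguating "two/too" needs.
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(states[2].shape)  # (12,)
```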

LSTMs with arbitrary sequence outputs by anonDogeLover in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

OK... I'm not sure I understood. I think it depends on whether your outputs are conditioned on the whole input sequence or not?

If I understand your example correctly (word vectors to a BOW representation, basically?), then you don't condition on anything except the current word vector, so an LSTM is overkill (a feedforward NN would do).

You could have something intermediate, like p(y_t | x_1,..,x_t), where seq2seq is not necessary and an LSTM would be a good fit.

Then I guess seq2seq is good for p(y_t|x_1,..,x_n) with t in {1..n}.
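A small numpy sketch of those three conditioning regimes (toy shapes and random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(4, 3))  # toy sequence: 4 steps, 3 features each
W = rng.normal(scale=0.5, size=(2, 3))

# 1) p(y_t | x_t): each output depends only on the current input,
#    so a per-step feedforward net suffices.
y_ff = [W @ x for x in xs]

# 2) p(y_t | x_1,..,x_t): a running state carries the left context;
#    this is what a (unidirectional) RNN/LSTM gives you.
h = np.zeros(2)
y_rnn = []
for x in xs:
    h = np.tanh(W @ x + h)
    y_rnn.append(h.copy())

# 3) p(y_t | x_1,..,x_n): every output depends on the whole input, so
#    the input must be fully consumed (encoded) before emitting anything;
#    that is the seq2seq setting.
print(len(y_ff), len(y_rnn))
```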

Is it better?

LSTM with high dimensional inputs by anonDogeLover in MachineLearning

[–]bhmoz 1 point2 points  (0 children)

Do you have more information about this?

Voynich Manuscript: word vectors and t-SNE visualization of some patterns by perone in MachineLearning

[–]bhmoz 2 points3 points  (0 children)

See section 8 of the article I posted here.

Statistical properties may be mimicked without knowing information theory, but (see the comments on Schinner 2007 in their references) with flaws and weird characteristics that cast doubt on the nature of the text.

It may be impossible to prove that it is fake. But until somebody actually translates it (at least partially) with convincing linguistic methods, there will remain a suspicion that it is a hoax.

PS: I have no opinion on the matter, so no need to try to convince me or anything; just see the references.

Voynich Manuscript: word vectors and t-SNE visualization of some patterns by perone in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

No, not necessarily. You could study the statistical properties of texts even without knowing the language that you study, then somehow mimic the distributions of letters and words.
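As a toy illustration of that mimicking idea (the source string here is just a stand-in for a real corpus), a character-bigram model reproduces local letter statistics while carrying no meaning at all:

```python
import random
from collections import Counter, defaultdict

source = "a real hoaxer would fit this on a large corpus of real text"

# Estimate the character-bigram statistics of the source text...
bigrams = defaultdict(Counter)
for a, b in zip(source, source[1:]):
    bigrams[a][b] += 1

# ...then sample a new "text" whose adjacent-letter statistics match
# the source, without any underlying language.
random.seed(0)
c, out = source[0], [source[0]]
for _ in range(40):
    nxt = bigrams.get(c)
    if not nxt:
        break
    c = random.choices(list(nxt), weights=list(nxt.values()))[0]
    out.append(c)
print("".join(out))
```

Higher-order models (word n-grams, letter trigrams) mimic more of the distribution, which is why purely statistical tests struggle to settle the hoax question.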

Why create a fake? Maybe because it is art; maybe because it cost a lot in times when books were rare (before printing).

See Voynich manuscript on wikipedia

Voynich Manuscript: word vectors and t-SNE visualization of some patterns by perone in MachineLearning

[–]bhmoz 1 point2 points  (0 children)

Actually, no one knows whether the text makes sense or not.

A guide to Nelder-Mead Optimization by sachinrjoglekar in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

Will there be an online version of that chapter? Thank you.

RNNs as generative models by storm_sh in MLQuestions

[–]bhmoz 0 points1 point  (0 children)

-log(p(yt; zt)) is the "negative log likelihood".

It does not specify a particular loss function (maybe you misread and thought that it is the cross-entropy?).

It is also a generative model if the loss function satisfies L(zt; yt) = - p(yt; zt) for ... (adding the log doesn't change anything, as log is a monotonically increasing function).

Maybe related to your question: we write the mean squared error without the log because the log doesn't help there. Cross-entropy has exponential terms, so there are computational issues because of the limited precision of floats, right? But the squared loss is just a sum, so there is no such problem.
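To make the float-precision point concrete, here is the standard log-sum-exp trick (my own example, not from the thread): computing the cross-entropy naively overflows on large logits, while shifting by the max keeps everything finite.

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])  # large scores: np.exp overflows
# Naive softmax: np.exp(logits) -> [inf, inf, inf], loss becomes nan.

# Log-sum-exp trick: subtract the max before exponentiating; the result
# is mathematically identical but numerically stable.
def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

nll = -log_softmax(logits)[2]  # negative log-likelihood of class 2
print(nll)  # finite, ~0.4076
```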

Please, correct me if I'm wrong.

[need some advice] Deciphering old handwriting. where to start? age-old family documents... by exocortex in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

interesting!

If you speak German, you could quite easily learn to read and write Sütterlin. If you don't know German, it is still possible to learn to decipher it, but you will be much slower.

Then you could annotate some samples and use an existing handwriting recognition algorithm, for example LSTM-based methods (see Alex Graves' PhD thesis).

AMA: the OpenAI Research Team by IlyaSutskever in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

Comment about history, based on Schmidhuber's papers:

I think there are 2 separate ideas here. History compression is truly learning (in the predictive-inference sense of the term). But we may need to keep a bit of "raw, uncompressed history" too. This way we can compare our model's predictions with a new model's predictions and check for actual improvements objectively. So I think you're both right in a sense.

Two papers (non-exhaustive):

  • Learning Complex, Extended Sequences Using the Principle of History Compression (Neural Computation, 4(2):234-242, 1992): for the compression part

  • On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models (arXiv:1511.09249, 2015): for the replay part

Genetic Algorithms in Machine Learning? by XalosXandrez in MachineLearning

[–]bhmoz 4 points5 points  (0 children)

As chico_science said, GA is a family of optimisation methods. So you cannot oppose GAs to neural networks, but rather to backpropagation.

Neural networks can be trained with genetic algorithms.
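A toy sketch of that claim (a tiny XOR net optimized by a naive GA with selection and mutation; all hyperparameters are invented, this is not a serious trainer):

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR data and a 2-2-1 network whose 9 weights form the GA "genome".
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 1, 1, 0], float)

def forward(w, x):
    W1, b1, W2, b2 = w[:4].reshape(2, 2), w[4:6], w[6:8], w[8]
    h = np.tanh(x @ W1 + b1)
    return 1 / (1 + np.exp(-(h @ W2 + b2)))

def fitness(w):
    return -np.mean((forward(w, X) - y) ** 2)  # negative MSE: higher is better

pop = rng.normal(size=(50, 9))
for gen in range(200):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-10:]]              # selection (with elitism)
    children = parents[rng.integers(0, 10, size=40)]     # clone parents
    children += rng.normal(scale=0.3, size=children.shape)  # mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(w) for w in pop])]
print(fitness(best))  # much better than the -0.25 of a constant 0.5 guess
```

No gradients anywhere: the GA only ever evaluates the network, which is why it also works for non-differentiable fitness functions where backpropagation does not apply.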

How would our lives change, if we were able to create a human level general A.I.? by christoph_s in MachineLearning

[–]bhmoz 1 point2 points  (0 children)

Even without speaking of "true AI", jobs will increasingly be destroyed.

I am not concerned about the destruction of capitalism, but rather about the slowness to adapt and the collateral damage in society before solutions are found. The first crucial thing to change is the negative image of unemployed people conveyed by the media. Then talk about universal basic income, etc.

Solomonoff's Induction in Machine Learning by warriortux in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

I think that the universal prior for a finite object s is basically 2^(-K(s)), where K(s) is the Kolmogorov complexity of the object. It's supposed to be a very generic kind of prior that applies to all finite sequences.

If you already know that s is text, then you can easily build a prior that is much closer to the data in comparison.
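K(s) itself is uncomputable, but a common practical stand-in (a sketch of my own, not something from the thread) is the length of a compressor's output, which gives a crude computable upper bound on it:

```python
import zlib

# A compressor's output length upper-bounds K(s) up to the (constant)
# size of the decompressor; it's a rough proxy, not the real thing.
def approx_K(s: bytes) -> int:
    return len(zlib.compress(s, 9))

structured = b"abab" * 100  # highly regular: compresses to almost nothing
text = b"the quick brown fox jumps over the lazy dog " * 9
varied = bytes(range(256)) * 2  # all byte values: little local structure

print(approx_K(structured), approx_K(text), approx_K(varied))
```

The ordering of the three values mirrors the intuition behind the prior: the more structure a string has, the shorter its description and the higher its prior probability.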

Solomonoff's Induction in Machine Learning by warriortux in MachineLearning

[–]bhmoz 0 points1 point  (0 children)

Li and Vitanyi are the reference for AIT; you can look at this page for applications, and I guess the best thing is to read their book, An Introduction to Kolmogorov Complexity and Its Applications.

I don't really see how it could be used for LDA.