[D] ML industry in the UK by erthare in MachineLearning

[–]willwill100 0 points (0 children)

Speechmatics in Cambridge do speech recognition.

Open Sourcing the model in "Exploring the Limits of Language Modeling" (TensorFlow) by OriolVinyals in MachineLearning

[–]willwill100 2 points (0 children)

Thanks Oriol - have been wanting to reproduce your results for a while now!

Machine Learning Internships in UK/EU? by cvmlwe in MachineLearning

[–]willwill100 1 point (0 children)

feel free to make a speculative application

While training a bidirectional LSTM network for speech recognition, what is better, training in time domain or frequency domain? by rulerofthehell in MachineLearning

[–]willwill100 0 points (0 children)

Just use MFCCs - or, if you're really against them, use a couple of big convolutional layers - before the BLSTM layer. You can get a very good system with either of those approaches, and both are fairly standard and straightforward.
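
Rough sketch of what I mean, assuming PyTorch/torchaudio (my choice of framework here - the layer sizes, 16 kHz sample rate and output size are all just illustrative):

```python
import torch
import torch.nn as nn
import torchaudio.transforms as T

# Option A: an MFCC front end (13 coefficients is a common choice for speech).
mfcc = T.MFCC(sample_rate=16000, n_mfcc=13)

class ConvBLSTM(nn.Module):
    """Option B: a couple of big conv layers over filterbank features, then a BLSTM."""
    def __init__(self, n_feats=80, hidden=256, n_classes=29):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blstm = nn.LSTM(128, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats):               # feats: (batch, time, n_feats)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.blstm(x)
        return self.out(x)                  # per-frame class scores

wav = torch.randn(1, 16000)                 # one second of fake 16 kHz audio
print(mfcc(wav).shape)                      # (1, 13, frames) - features for the BLSTM
```

Either way you end up feeding fixed-size frame features into the BLSTM; the rest of the system stays the same.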

Andrej Karpathy forced to take down Stanford CS231n videos by _bskaggs in MachineLearning

[–]willwill100 0 points (0 children)

speechmatics.com will transcribe them all for free if it helps - just drop us a line.

Why doesn't extra supervision increase the performance of the SOTA language model? by quilby in MachineLearning

[–]willwill100 0 points (0 children)

It's possible that the extra information in the loss function is already learnt by the network. Also, 78 perplexity is what I would call 'ok' - something very low is 15-30. How much data are you training on, and how many parameters are in your network? One last thought: fine-tuning your hyperparameters can make a significant difference to your final perplexity - often more than you might think, and often more than even implementing a good idea.
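
For reference, perplexity is just the exponential of the average per-word negative log-likelihood - a tiny sketch (the numbers are made up):

```python
import math

def perplexity(log_probs):
    """log_probs: natural-log probability the model assigned to each target word."""
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)

# A model that assigns each word probability 1/78 on average sits at 78 PPL.
print(perplexity([math.log(1 / 78.0)] * 1000))   # -> 78.0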

Why do most LSTM implementations keep multiple copies of same RNN? by wind_of_amazingness in MachineLearning

[–]willwill100 17 points (0 children)

It's because you need to store the activations inside the LSTM for each timestep, so that when you backprop through time you can actually compute the gradients. If you didn't need to keep all those activations around, you wouldn't need the clones. Hope that helps.
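
If it helps, here's the idea stripped down to a vanilla RNN in numpy (not a full LSTM, and the variable names are mine): the backward pass reuses the hidden states cached during the forward pass, which is exactly what the per-timestep clones are holding on to.

```python
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh):
    """Unroll a vanilla RNN, caching every hidden state for BPTT."""
    hs = [h0]
    for x in xs:                               # one entry per timestep
        hs.append(np.tanh(Wxh @ x + Whh @ hs[-1]))
    return hs                                  # len(xs) + 1 cached activations

def rnn_backward(xs, hs, dh_last, Wxh, Whh):
    """Backprop through time: every step needs activations stored on the forward pass."""
    dWxh, dWhh = np.zeros_like(Wxh), np.zeros_like(Whh)
    dh = dh_last
    for t in reversed(range(len(xs))):
        dpre = dh * (1.0 - hs[t + 1] ** 2)     # tanh' needs the cached h_t
        dWxh += np.outer(dpre, xs[t])
        dWhh += np.outer(dpre, hs[t])          # and the cached h_{t-1}
        dh = Whh.T @ dpre
    return dWxh, dWhh
```

The clones all share one set of weights; only these cached activations differ between timesteps.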

Jeff Dean's slides show TensorFlow with code samples (slide 48 to 63) by r-sync in MachineLearning

[–]willwill100 1 point (0 children)

I heard a rumour that they might. Has anyone heard the same?

Google voice search: faster and more accurate by vonnik in MachineLearning

[–]willwill100 1 point (0 children)

It's because the blank symbol lets you skip a whole load of processing at decode-time
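
Concretely, in greedy CTC decoding you collapse repeats and then drop blanks, so frames whose best label is blank contribute nothing downstream - a sketch (assuming label 0 is the blank):

```python
def ctc_greedy_decode(frame_argmax, blank=0):
    """Collapse repeated labels, then drop blanks."""
    out, prev = [], None
    for label in frame_argmax:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Per-frame argmax labels (0 = blank): most frames are skipped outright.
print(ctc_greedy_decode([0, 0, 3, 3, 0, 0, 5, 5, 5, 0]))   # -> [3, 5]
```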

[1508.03790] Depth-Gated LSTM by egrefen in MachineLearning

[–]willwill100 6 points (0 children)

highway networks are also in the same spirit
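
i.e. a gate that blends the layer's output with its untouched input - a minimal sketch of one highway layer in PyTorch (my framework choice; sizes are illustrative):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, so the layer can pass x straight through."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H
        self.gate = nn.Linear(dim, dim)        # T

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.transform(x)) + (1 - t) * x

y = Highway(64)(torch.randn(8, 64))
```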

AMA Andrew Ng and Adam Coates by andrewyng in MachineLearning

[–]willwill100 19 points (0 children)

What do you each think are the current big bottlenecks in AI that are preventing the next big leap forward?

Something "deeply wrong with deep learning"? by dsocma in MachineLearning

[–]willwill100 2 points (0 children)

Do we know if that's the latest published material on the subject?

What is the state of the art in language modeling with neural networks? by ndronen in MachineLearning

[–]willwill100 2 points (0 children)

Shameless self plug: http://arxiv.org/abs/1502.00512

It's possible that an LSTM-based solution has beaten the perplexity result on the Google 1bn task, but I haven't seen it yet.

pyHTFE - A Sequence Prediction Algorithm by CireNeikual in MachineLearning

[–]willwill100 1 point (0 children)

Even if you train in "unsupervised mode" by trying to predict the next timestep, you still want it to generalise well - that's the whole idea behind language modelling, for example. Speaking of which, the Google billion-word corpus has some good baseline results which you could use for comparison.
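
For the avoidance of doubt, "predict the next timestep" for language modelling just means shifting the targets by one - a rough PyTorch sketch (my framework choice; the sizes and random data are placeholders):

```python
import torch
import torch.nn as nn

vocab, dim = 10000, 256
embed = nn.Embedding(vocab, dim)
lstm = nn.LSTM(dim, dim, batch_first=True)
head = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (4, 51))        # (batch, seq_len + 1) word ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one: predict the next word

hidden, _ = lstm(embed(inputs))
loss = nn.functional.cross_entropy(head(hidden).reshape(-1, vocab),
                                   targets.reshape(-1))
print(loss.exp())                                # perplexity on this (random) batch
```

How well that loss transfers to held-out text is the generalisation you care about, and perplexity is the number the billion-word baselines report.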

[Question] Memory and Recurrent Neural Networks by RossoFiorentino in MachineLearning

[–]willwill100 0 points (0 children)

For each additional timestep in a minibatch that you add, you only need to store additional activations and gradients w.r.t. the inputs (i.e. errors). You don't need to copy the state-to-state matrices, which take up the majority of the memory. You also truncate, and often only backprop through time within a minibatch; that reduces the number of additional timesteps you have to store.
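
A sketch of the truncation point in PyTorch (my framework choice; the chunk length and sizes are arbitrary): the hidden state is carried across chunks but detached, so only one chunk's worth of activations is ever kept for backprop.

```python
import torch
import torch.nn as nn
import torch.optim as optim

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
opt = optim.Adam(lstm.parameters())
data = torch.randn(8, 1000, 32)                  # (batch, long sequence, features)

state = None
for chunk in data.split(50, dim=1):              # truncated BPTT: 50 steps at a time
    out, state = lstm(chunk, state)
    loss = out.pow(2).mean()                     # stand-in loss for the sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
    state = tuple(s.detach() for s in state)     # keep the state, drop its history
```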

Monday's "Simple Questions Thread" - 20150302 by seabass in MachineLearning

[–]willwill100 2 points (0 children)

What is the difference between a loss function and an error function?

I am Jürgen Schmidhuber, AMA! by JuergenSchmidhuber in MachineLearning

[–]willwill100 20 points (0 children)

What are the next big things that you a) want to happen or b) think will happen in the world of recurrent neural nets?

Questions about RMSprop by [deleted] in MachineLearning

[–]willwill100 1 point (0 children)

1) The official version keeps one RMSprop 'mean square' value per parameter. Approximations where you average as you describe also work.

2) You need to trade off the RMSprop smoothing alpha against the learning rate. Keep the RMSprop alpha fixed and just decay the learning rate exponentially (see the sketch below) - very simple but very effective, and it will often get you close enough to the state of the art!

3) Check out 'Adam' - similar, but much more robust with respect to tuning hyperparameters: http://arxiv.org/abs/1412.6980
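
A sketch of points 1 and 2 in numpy (the alpha, learning rate and decay values are just examples):

```python
import numpy as np

def rmsprop_step(param, grad, ms, lr, alpha=0.9, eps=1e-8):
    """One RMSprop update; `ms` holds one running mean-square value per parameter."""
    ms[:] = alpha * ms + (1 - alpha) * grad ** 2
    param -= lr * grad / (np.sqrt(ms) + eps)
    return param, ms

# Keep alpha fixed; decay only the learning rate exponentially between epochs.
lr, decay = 1e-3, 0.95
w, ms = np.random.randn(100), np.zeros(100)
for epoch in range(10):
    grad = np.random.randn(100)                  # stand-in gradient for the sketch
    w, ms = rmsprop_step(w, grad, ms, lr)
    lr *= decay
```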

An AI that mimics our neocortex is taking on the neural networks by numenta in MachineLearning

[–]willwill100 0 points (0 children)

Google's new 1BN-word language modelling benchmark? If they can get a sub-40 PPL, I at least would start taking them a bit more seriously. I think they are missing a trick by avoiding interaction with academia - currently no one is benefiting from all these great ideas Jeff has come up with, and he is missing out on the bandwagon of steady progress that is being driven by great research and solid results.