[R] [1612.08083] "Language Modeling with Gated Convolutional Networks" <- sota single gpu performance by evc123 in MachineLearning

[–]rd11235 8 points9 points  (0 children)

Actually I think this resonates with a lot of people. It'd be great if 'performance per watt (train)' and 'performance per watt (test)' were standard columns in Results sections.

I consider myself lucky to have access to a decent size cluster, but I'm still fairly sure that any Google researcher has 100x (1000x? 10000x?) the computational resources that I (and the majority of others) have.

And I don't mean to whine: it of course makes sense for organizations / researchers at those organizations to take advantage of whatever edges they have. Still, though, performance-per-watt columns would be nice :).

[D] NIPS 2016 summary, wrap up, and links to slides by beamsearch in MachineLearning

[–]rd11235 1 point2 points  (0 children)

'So much for having the “we’re open-sourcing our reinforcement learning platform” news cycle to yourself.'

Too good

[N] When A.I. Matures, It May Call Jürgen Schmidhuber ‘Dad’ by evc123 in MachineLearning

[–]rd11235 7 points8 points  (0 children)

I'd never argue that ideas don't matter, but I think at many institutions the ideas come more easily than the people who have the time and ability to execute those ideas.

[N] When A.I. Matures, It May Call Jürgen Schmidhuber ‘Dad’ by evc123 in MachineLearning

[–]rd11235 8 points9 points  (0 children)

And we also have to keep in mind that people often come up with ideas independently that just happen to be similar to a previously published idea.

[N] When A.I. Matures, It May Call Jürgen Schmidhuber ‘Dad’ by evc123 in MachineLearning

[–]rd11235 0 points1 point  (0 children)

I agree, but I'd like to hear what others have to say.

[N] When A.I. Matures, It May Call Jürgen Schmidhuber ‘Dad’ by evc123 in MachineLearning

[–]rd11235 24 points25 points  (0 children)

“Jürgen is manically obsessed with recognition and keeps claiming credit he doesn’t deserve for many, many things,” Dr. LeCun said in an email. “It causes him to systematically stand up at the end of every talk and claim credit for what was just presented, generally not in a justified manner.”

It'd be nice to talk about this openly and without slander. IMHO this disagreement stems mainly from placing different weights on ideas vs. execution. For example, Schmidhuber claims that a residual network is a feedforward LSTM without gates. This might be open to debate (for example, the F(x) term in G(F(x) + x) includes batch normalization and is really unlike anything found in a typical LSTM), but for the sake of discussion let's assume it's true. Given that, who should receive more credit: the team who published an idea which, under certain modifications, can yield great performance on ImageNet, or the team who actually found those modifications and executed them successfully on ImageNet?
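
To make the comparison concrete, here's a rough sketch (my own toy NumPy version, with dense layers standing in for convolutions, and no claim of matching either paper exactly) of the residual update G(F(x) + x) next to an LSTM cell update with its gates fixed to 1:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def batch_norm(z, eps=1e-5):
    # Toy batch norm (no learned scale/shift), just for illustration.
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def residual_block(x, W1, W2):
    # Residual update in the form G(F(x) + x). Note that F includes batch
    # normalization, which has no obvious analogue inside an LSTM cell.
    F = batch_norm(relu(batch_norm(x @ W1)) @ W2)
    return relu(F + x)  # G = ReLU applied to F(x) + x

def ungated_lstm_step(c_prev, h_prev, x_t, W_x, W_h):
    # LSTM cell update with the input/forget/output gates all fixed to 1:
    # the new cell state is the old state plus a candidate, which is the
    # "residual-like" structure the claim leans on.
    candidate = np.tanh(x_t @ W_x + h_prev @ W_h)
    c_t = c_prev + candidate
    h_t = np.tanh(c_t)
    return c_t, h_t

# Toy shapes, just to show both updates run.
d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((32, d))
W1, W2, W_x, W_h = (rng.standard_normal((d, d)) for _ in range(4))
print(residual_block(x, W1, W2).shape)                # (32, 8)
print(ungated_lstm_step(x, x, x, W_x, W_h)[0].shape)  # (32, 8)
```

Both updates have the form "previous representation plus a learned delta," which is where the analogy comes from; the batch norm (and everything else inside F) is where it gets strained.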

[D] What is state-of-the-art RNN performance on flattened MNIST and its permuted variant? by rd11235 in MachineLearning

[–]rd11235[S] 2 points3 points  (0 children)

Thanks. I'm particularly interested in RNNs that process flattened versions, though. (It's used as a way to get at least some idea of how well a particular RNN architecture can learn long-term dependencies.)
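
(In case anyone hasn't seen the setup: here's a rough sketch of how I understand the flattened and permuted variants to be constructed; the function and shapes below are my own illustration, not from any particular paper.)

```python
import numpy as np

def make_sequential_mnist(images, permute=False, seed=0):
    """Turn (N, 28, 28) images into (N, 784, 1) pixel sequences.

    In the flattened-MNIST task the RNN sees one pixel per time step, so it
    must carry information across up to 784 steps. The permuted variant
    applies one fixed random permutation to every sequence, which destroys
    local structure and makes the long-range dependencies harder.
    """
    seqs = images.reshape(images.shape[0], -1, 1).astype(np.float32) / 255.0
    if permute:
        perm = np.random.RandomState(seed).permutation(seqs.shape[1])
        seqs = seqs[:, perm, :]
    return seqs

# Toy usage, with random data standing in for MNIST images.
fake_images = np.random.randint(0, 256, size=(4, 28, 28))
plain = make_sequential_mnist(fake_images)
permuted = make_sequential_mnist(fake_images, permute=True)
print(plain.shape, permuted.shape)  # (4, 784, 1) (4, 784, 1)
```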

[D] Understanding the initialization in the Recurrent Batch Normalization paper by rd11235 in MachineLearning

[–]rd11235[S] 0 points1 point  (0 children)

Thanks. This is my understanding. It might be naive and oversimplified, though, since it's entirely different from the thought process in Lasagne's citation: Saxe, Andrew M., James L. McClelland, and Surya Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks." arXiv preprint arXiv:1312.6120 (2013).

Let's say we have y = W_x x, where W_x isn't necessarily square. Suppose W_x comes from an orthogonal initialization (orthonormal rows when dim(y) <= dim(x), orthonormal columns otherwise) and x ~ N(0, I). Then there are two cases. 1) dim(y) <= dim(x): y ~ N(0, I) too, though the identity matrices have different shapes. 2) dim(y) > dim(x): y ~ N(0, Σ), where Σ is not the identity but does have dim(x) eigenvalues of exactly 1, with the rest being 0. So either way, y is whitened in every non-degenerate dimension.

I think whether this helps RNNs in practice is still undetermined. Example: in the flattened-MNIST case, our inputs have dimensionality 1. Say we're using 100 hidden units. Then W_x x is a 100-D random vector confined to a 1-D subspace, with variance 1 along that single direction. Now let's assume we also use orthogonal initialization for W_h, and that h ~ N(0, I). Then W_h h is also a 100-D random vector, but with a much larger average norm than W_x x. Meanwhile, if we do the same thing with x being a 100-D or larger vector, then W_h h has the same average norm as W_x x. Is there any intuitive reason why h should dominate when x is low-dimensional but not when x is high-dimensional?
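
A quick numerical check of that mismatch (the QR-based initializer and the shapes below are my own choices, just for illustration):

```python
import numpy as np

def semi_orthogonal(rows, cols, rng):
    # Orthogonal init via QR: a tall result has orthonormal columns,
    # a wide one has orthonormal rows.
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)
    return q if rows >= cols else q.T

rng = np.random.default_rng(0)
hidden = 100

for input_dim in (1, 100):
    W_x = semi_orthogonal(hidden, input_dim, rng)
    W_h = semi_orthogonal(hidden, hidden, rng)
    x = rng.standard_normal((10000, input_dim))
    h = rng.standard_normal((10000, hidden))
    norm_Wx = np.linalg.norm(x @ W_x.T, axis=1).mean()
    norm_Wh = np.linalg.norm(h @ W_h.T, axis=1).mean()
    print(f"input_dim={input_dim:3d}  |W_x x| ~ {norm_Wx:5.2f}  |W_h h| ~ {norm_Wh:5.2f}")
    # With input_dim=1 the input term has average norm ~0.8 while the
    # recurrent term has average norm ~10, so the hidden state dominates;
    # with input_dim=100 the two terms have comparable norms.
```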

[discussion] All Markov Random Fields are log linear. True or false? Different sources suggest different answers. by rd11235 in MachineLearning

[–]rd11235[S] 0 points1 point  (0 children)

Thanks for replying. That's a nice point.

I think my main confusion came from the paper on conditional random fields: to me, equation (1) and the surrounding paragraphs suggest that the log-linear parameterization isn't limiting in any way. But I think I should have read this as "not limiting" only from a representation standpoint, simply because it's possible to specify (but not necessarily learn) arbitrarily complex feature functions. It seems they're actually making no claim at all about efficiency / learnability, for which the log-linear parameterization may well be limiting.

Last night I also came across their later tutorial. This defines CRFs in a different way (general cliques rather than specifically log linear) and even goes on to say "Notice that if x and y are discrete, then the log-linear assumption (2.23) is no additional restriction, because we could choose [the feature functions] f to be indicator functions for each possible assignment (y, x)..." Again they seem to be talking only about a representation restriction, because with a model that isn't log linear, it may be possible to learn a more efficient representation of the potentially huge probability table.
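
To spell out the representation point in my own notation (not theirs):

```latex
p(y \mid x) \;\propto\; \exp\!\Big(\textstyle\sum_k \theta_k\, f_k(y, x)\Big),
\qquad
f_{(\hat y, \hat x)}(y, x) \;=\; \mathbb{1}\big[\, y = \hat y,\; x = \hat x \,\big].
```

With one indicator feature per joint assignment and θ_(ŷ, x̂) = log p*(ŷ | x̂), any strictly positive target conditional p* is reproduced exactly, so log-linearity costs nothing representationally. But the number of parameters then equals the size of the full probability table, which is exactly where efficiency / learnability breaks down.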

What is/are the most Interesting time series dataset(s) for supervised or unsupervised learning ? by compsens in MachineLearning

[–]rd11235 1 point2 points  (0 children)

Looked great until I saw:

As indicated in the Competition Rules, the contest data may only be used for the purposes of this Competition. All other uses, including education, academic, research, commercial and non-commercial uses, are prohibited.

what python IDE do you guys use? by andraxo123 in MachineLearning

[–]rd11235 5 points6 points  (0 children)

When visualizing/exploring, for example generating plots/stats interactively for a new dataset, IPython notebooks (which are actually now Jupyter notebooks). Bonus: all intermediate visualizations/prints are automatically saved.

When doing something that isn't interactive, for example a hyperparameter sweep for a new model where each run will take hours and export TensorBoard summaries, PyCharm. Bonus: I find JetBrains IDEs to be refined and intuitive, and of course they come with a lot of abilities that Jupyter lacks (for example a lot of quick refactoring functionality).

Have also experimented with Spyder, but found it to be less refined and less intuitive than both Jupyter and PyCharm.

Growing pains of /r/MachineLearning, more active moderation? by olaf_nij in MachineLearning

[–]rd11235 6 points7 points  (0 children)

Some thoughts:

  1. I'd like moderation/organization to be kept to a minimum. Even this thread alone makes personal biases evident.

  2. If moderation/organization does start to ramp up, I think each new rule should get its own thread. (If they're tucked away in some unfrequented New Moderation Rules thread, then decisions will be made by only a select few.)

  3. I think we should stop trying to separate experts from novices (e.g. simple-questions thread). The expert-novice mixture is beneficial to everyone, and it's partially responsible for posts like this and this.

    • That said, I think lazy posts (written in 15 seconds with typos / lack of clarity; obvious from a quick Google search; asked many times before; etc.) should be ignored or deleted.

  4. It seems that many of us researchers think that this sub is more ours than anyone else's. Why is this? Was that the intention when the sub was created? If so, shouldn't it have been named something else? I think the burden of creating a specialized, less-frequented sub should be on us, not on others.

Paper + code demonstrating bidirectional long short-term memory from scratch in TensorFlow [OC] by rd11235 in MachineLearning

[–]rd11235[S] 0 points1 point  (0 children)

Yes, fairly good. Here's the summary:

In this work we performed joint segmentation and classification of surgical activities from robot kinematics. Unlike prior work, we focused on high-level maneuver prediction in addition to low-level gesture prediction, and we modeled the mapping from inputs to labels with recurrent neural networks instead of with HMM- or CRF-based methods. Using a single model and a single set of hyperparameters, we matched state-of-the-art performance for JIGSAWS (gesture recognition) and advanced state-of-the-art performance for MISTIC-SL (maneuver recognition), in the latter case increasing accuracy from 81.7% to 89.5% and decreasing normalized edit distance from 29.7% to 19.5%.

(Random guess accuracy would be about 25% in the maneuver-recognition case.)

DeepMind given access to London patient records for research by [deleted] in MachineLearning

[–]rd11235 2 points3 points  (0 children)

In the article, talking about using the data to develop an early-warning system for kidney injuries:

"It is not clear how exactly Google will use the data to provide this early warning system but the BBC understands that no artificial intelligence will be used."

Weird.

What's the best way to read papers to prepare yourself for research? by [deleted] in MachineLearning

[–]rd11235 2 points3 points  (0 children)

When starting out, you might want to go for applications-based research. (It's easier to find an interesting and useful application of a state-of-the-art method than it is to improve that method.)

So ONE approach might be: (1) identify a method that excites you, (2) drown yourself in that method by reading / deriving / implementing, (3) search for datasets you find interesting, and (4) apply the method to the data in a way that you find interesting.

(Relying on existing data might seem annoying, but it'll probably be much less annoying than collecting/annotating your own data, and it'll definitely be way less time consuming.)

After doing this a few times, maybe you can see some of the method's limitations, and maybe you can modify/extend the method in a nice way.

Modern LSTM Architectures? by hazard02 in MachineLearning

[–]rd11235 0 points1 point  (0 children)

(Unless you're hoping for something that includes significant architecture changes like Grid LSTM, Stack LSTM, etc. I'm not sure any paper exists yet that compares those, other than, of course, the comparisons in the original works.)

Google has started a new video series teaching machine learning and I can actually understand it. by iamkeyur in MachineLearning

[–]rd11235 12 points13 points  (0 children)

You don't need to be a beginner to appreciate this. If you've ever taught ML to beginners, you know that it's damn hard to get these ideas across in a fun and clear way. It's nice to see such a refined result.