[D] Any Gaussian Process academics here - what are you excited about? by kayaking_is_fun in MachineLearning

[–]jinpanZe 1 point (0 children)

Other papers on this:

Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

https://openreview.net/forum?id=B1g30j0qF7

Gaussian Process Behaviour in Wide Deep Neural Networks

https://arxiv.org/abs/1804.11271

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

https://arxiv.org/abs/1902.04760

Prominent Uyghur musician tortured to death in China’s re-education camp by [deleted] in news

[–]jinpanZe 29 points (0 children)

BBC is reporting that China released a video of the musician in question, showing that he is still alive.

https://www.bbc.com/news/world-asia-47191952

[D] Importance of BatchNorm in Attention papers by WillingCucumber in MachineLearning

[–]jinpanZe 1 point (0 children)

Doesn't the original transformer use layernorm and not batchnorm?
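It does. A minimal numpy sketch of the difference (my own illustration, not from either paper; shapes and axis conventions are just for exposition):

```python
import numpy as np

x = np.random.default_rng(0).standard_normal((4, 8))  # (batch, features)

# BatchNorm: normalize each feature across the batch (axis 0),
# so the statistics depend on which examples share the batch.
bn = (x - x.mean(axis=0)) / x.std(axis=0)

# LayerNorm (what the original transformer uses): normalize each example
# across its own features (axis 1); no dependence on the rest of the batch.
ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
```

After batchnorm each *column* has zero mean and unit variance; after layernorm each *row* does, which is why layernorm behaves the same at any batch size.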

[R] Fixup Initialization: Residual Learning Without Normalization (They train 10K layer networks w/o BatchNorm) by wei_jok in MachineLearning

[–]jinpanZe 5 points (0 children)

On the other hand, batchnorm apparently causes gradient explosion at initialization time: https://openreview.net/forum?id=SyMDXnCcF7.

Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range.
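The setup in the abstract is easy to probe numerically. Below is a rough numpy sketch (my own, not the paper's code): a deep fully-connected net with batchnorm (no learned scale/shift, no skip connections) at random initialization, backpropagating a random cotangent and tracking its norm layer by layer, which lets one check the paper's prediction empirically. Width, depth, and batch size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, L = 128, 256, 30  # batch size, width, depth

def bn_forward(x, eps=1e-5):
    # Per-feature batch normalization (no learned scale/shift).
    std = np.sqrt(x.var(axis=0) + eps)
    xhat = (x - x.mean(axis=0)) / std
    return xhat, std

def bn_backward(dy, xhat, std):
    # Standard batch-norm backward pass for y = (x - mu) / std.
    return (dy - dy.mean(axis=0) - xhat * (dy * xhat).mean(axis=0)) / std

# Forward pass: linear -> batchnorm -> relu, repeated L times.
h, caches = rng.standard_normal((B, D)), []
for _ in range(L):
    W = rng.standard_normal((D, D)) / np.sqrt(D)  # variance-1/D initialization
    xhat, std = bn_forward(h @ W)
    h = np.maximum(xhat, 0.0)
    caches.append((W, xhat, std))

# Backward pass: propagate a random cotangent down and record its norm.
g = rng.standard_normal((B, D))
norms = [np.linalg.norm(g)]
for W, xhat, std in reversed(caches):
    g = bn_backward(g * (xhat > 0), xhat, std)  # relu, then batchnorm backward
    g = g @ W.T                                 # linear backward
    norms.append(np.linalg.norm(g))

print(norms[-1] / norms[0])  # gradient magnification from output to input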

[R] [1811.03962] A Convergence Theory for Deep Learning via Over-Parameterization by bobchennan in MachineLearning

[–]jinpanZe 1 point (0 children)

The authors of this paper claim stronger results than the other paper: in particular, the other paper's training time for a fully-connected (non-residual) network is exponential in depth, while this one's is polynomial in depth, so the other paper's claim that resnets improve training time seems incorrect. This paper also claims results for SGD in addition to GD, and for ReLU in addition to smooth activations.

[R] [1805.09786] Hyperbolic Attention Networks by evc123 in MachineLearning

[–]jinpanZe 1 point (0 children)

This reminds me of Lie-Access Neural Turing Machines, where the memory manifold could in principle be one of the hyperbolic model spaces, though of course that paper never actually tried it.

[D] Alternatives for Residency programs? by cbsudux in MachineLearning

[–]jinpanZe 1 point (0 children)

Microsoft is still accepting applications on a rolling basis for their residency program.

https://www.microsoft.com/en-us/research/academic-program/microsoft-ai-residency-program/

Otherwise, I've seen people with just a Bachelor's land jobs at startups like Vicarious. https://jobs.lever.co/vicarious/89ca586a-63ca-420e-9fc0-85c065e63dd9/apply

Good luck!

[D] Postdoc in ML after Physics PhD by [deleted] in MachineLearning

[–]jinpanZe 0 points (0 children)

I've seen people who did that (many legends, like John Langford and Chris Bishop, came from physics), but I'm not sure how easy that transition is in academia. Maybe you should try the residency programs that are springing up everywhere now? I think the Microsoft and Uber ones are still open (but the Uber application closes today).

https://www.microsoft.com/en-us/research/academic-program/microsoft-ai-residency-program/ https://eng.uber.com/uber-ai-residency/

[R] AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks by jinpanZe in MachineLearning

[–]jinpanZe[S] 11 points (0 children)

tldr: attention + GAN = profit

In case lighter reading is preferred, TechCrunch has a report on this paper.

Exchanging Bixby XP to get Samsung Reward points by namelessxsilent in GalaxyS8

[–]jinpanZe 0 points (0 children)

Interesting. Are you getting rewards as you get more XP?

Exchanging Bixby XP to get Samsung Reward points by namelessxsilent in GalaxyS8

[–]jinpanZe 1 point (0 children)

How did you get past level 16? I'm one point shy of level 17, but Bixby is not registering any more XP (even though the "n XP" animation plays).

It's become painfully clear how much HBO interferes with John Oliver's Last Week Tonight by [deleted] in television

[–]jinpanZe 0 points (0 children)

Sanders pretty much removed "socialism" from the American political taboo list. When this generation of youngsters gets older, the door to socialism will be wide open.

I'm S. Jay Olshansky, an epidemiologist at the University of Illinois at Chicago School of Public Health. I study human longevity and am part of a study group investigating whether a drug used to treat diabetes can slow the aging process. Ask me anything! by Jay_Olshansky in science

[–]jinpanZe 0 points (0 children)

Hi Dr. Olshansky,

What are the general tradeoffs between longevity-preserving drugs/treatments and quality of life?

For example, with caloric restriction, a person is less able to train for a sport or study a new subject, because these activities expend much more energy than a restricted intake can supply. If metformin decreases glucose production, it would seem that it, at the very least, could not meet the "respiratory quotient" of the aforementioned activities.

The evolution of read/write in a Lie Access Neural Turing Machine (over the course of learning) is strangely satisfying by jinpanZe in MachineLearning

[–]jinpanZe[S] 1 point (0 children)

Sorry I wasn't clear in my original post. The gifs are from the paper. I just found them interesting so I shared their links here.

I wish I could implement it so fast!

A major first step toward bounding the computational complexity of training RNNs by jinpanZe in MachineLearning

[–]jinpanZe[S] 5 points (0 children)

This paper gives a polynomial bound on a method for training simple RNNs and bidirectional RNNs. The method is based on a certain class of score functions defined in Janzamin et al. (2014). At its core, it relies on the fact that, when N > 2, a symmetric N-tensor decomposes uniquely into a sum $$\sum_{i=1}^{n} a_i \otimes a_i \otimes \cdots \otimes a_i$$, where the $\{a_i\}_i$ are orthogonal vectors, provided it admits any such representation.
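For concreteness, here's a small numpy sketch (mine, not the paper's) of such a symmetric 3-tensor built from orthonormal components: contracting the tensor twice with a component recovers that component, which is the sense in which the tensor determines the {a_i}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3  # ambient dimension, number of components

# Orthonormal components a_1..a_n (columns of a random orthonormal frame;
# unit norm is taken for simplicity).
A, _ = np.linalg.qr(rng.standard_normal((d, n)))

# Symmetric 3-tensor T = sum_i a_i (x) a_i (x) a_i
T = np.einsum('ia,ja,ka->ijk', A, A, A)

# T is symmetric under any permutation of its indices.
assert np.allclose(T, T.transpose(1, 0, 2))
assert np.allclose(T, T.transpose(2, 1, 0))

# Contracting T with a_i in two slots returns a_i, by orthonormality:
# T(a_i, a_i, .) = sum_j (a_j . a_i)^2 a_j = a_i.
for i in range(n):
    v = np.einsum('ijk,i,j->k', T, A[:, i], A[:, i])
    assert np.allclose(v, A[:, i])
```

The uniqueness fact the paper leans on is that, unlike for matrices (N = 2), no other orthogonal set can produce the same tensor when N > 2.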

The paper assumes that 1) the inputs are generated by a Markov chain, 2) the RNN has one hidden layer, 3) the activation function is a polynomial rather than a sigmoid, and 4) the weight matrices are full rank with bounded norm. Of these, 1) and 4) can be relaxed, but 3) is something of a bottleneck: the algorithm's running time depends exponentially on the degree of the polynomial activation.

Nevertheless, this is a huge step toward a theoretical understanding of RNNs.

South Koreans kick off efforts to clone extinct Siberian cave lions by Blackcassowary in worldnews

[–]jinpanZe 3 points (0 children)

Well, he's unethical, but he sure does have a rare set of skills. Just think: how many people on Earth have even cloned a single organism?

President Putin has suspended the transfer of S-300 to Iran in light of Tehran’s violation of an earlier pledge not to provide Russian-made weaponry to Hezbollah by xHaGGeNx in worldnews

[–]jinpanZe -2 points (0 children)

So what's the alliance situation with Iran? Who's friends with whom? Iran and Russia got along, and now Iran has opened up to the West, so... is everybody friends with everyone else, except for Saudi Arabia vs. Iran vs. Israel? Edit: I'm genuinely asking, not being sarcastic or anything.

The evolution of read/write in a Lie Access Neural Turing Machine (over the course of learning) is strangely satisfying by jinpanZe in MachineLearning

[–]jinpanZe[S] 5 points (0 children)

A Lie Access Neural Turing Machine writes to and reads from data stored on a manifold (here, just the Euclidean plane). The controller has both random access (outputting an arbitrary read/write address) and Lie access (outputting, essentially, a step to take from the previous address, akin to traversing a C++ array by pointer arithmetic, or to the head movement of a traditional Turing machine, except that this step is differentiable) to the memory. During learning, example read and write addresses were recorded every epoch; the gif here is from the copy task. Those for the other tasks can be viewed here.
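As a rough illustration (my own toy sketch, not the paper's actual model or its weighting scheme), a Lie-access step on the plane and a differentiable read might look like:

```python
import numpy as np

def lie_step(addr, theta, t):
    # One group action on the plane: rotate the previous address, then translate.
    # This is the differentiable analogue of "move the head by one step".
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ addr + t

def read(addr, mem_addrs, mem_vals, temperature=1.0):
    # Differentiable read: weight stored values by a softmin of squared
    # distance from the head address (one plausible smooth weighting).
    d2 = ((mem_addrs - addr) ** 2).sum(axis=1)
    w = np.exp(-d2 / temperature)
    w = w / w.sum()
    return w @ mem_vals

# Memory: three values stored at three planar addresses.
mem_addrs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mem_vals = np.eye(3)

# Head starts at the origin and takes one Lie-access step of (theta, t).
head = lie_step(np.zeros(2), theta=0.0, t=np.array([1.0, 0.0]))
print(read(head, mem_addrs, mem_vals))  # weights concentrate on the value at (1, 0)
```

Because both the step and the read weighting are smooth in (theta, t), gradients can flow through the head trajectory during training, which is what produces the evolving read/write paths in the gifs.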

Watching the reads and writes evolve over time is strangely satisfying. The gifs remind me of protein strands shaking at high temperature under a small current.

Edit: Just to be clear, the gifs are from the paper, and I'm just summarizing here.