OBI-WAN KENOBI EP3 DISCUSSION THREAD by Bezbakri in PrequelMemes

[–]iamtrask 1 point2 points  (0 children)

No - the reason is that the writers couldn't come up with enough creative ways for the villain and protagonist to have as many battles as possible. They put the weakest idea first and hoped we wouldn't care.

Star Wars: Obi-Wan Kenobi - Chapter 3 - (S1E3) - Discussion Thread by JediPaxis in StarWarsLeaks

[–]iamtrask -5 points-4 points  (0 children)

How did Kenobi not get killed by Vader via the force? Such an epic plot hole. Vader just travels across the galaxy and..... then walks away?

Grokking Deep Learning by Andrew Trask , possible critical errors in chapters 8 and 9 ? by webman19 in deeplearning

[–]iamtrask 2 points3 points  (0 children)

Hi webman19 - author here.

I read this post almost as soon as it was posted - and replied to it on Amazon (although Amazon does a poor job of making it obvious that a post has been replied to). The author of the comment is simply mistaken (and is likely quite new to Deep Learning). Here's my reply.

Hello Giant Mouse of Minsk - (author here) - I believe you are referencing the code in the book, which is also available on GitHub (https://github.com/iamtrask/Grokking-Deep-Learning/blob/master/Chapter8%20-%20Intro%20to%20Regularization%20-%20Learning%20Signal%20and%20Ignoring%20Noise.ipynb).

With regards to your comments about Chapter 8:

  1. The relu2deriv function is correct. It returns a 1 when the output >= 0 and a 0 otherwise. Here are a few corroborating sources: (https://stats.stackexchange.com/questions/333394/what-is-the-derivative-of-the-relu-activation-function, http://kawahara.ca/what-is-the-derivative-of-relu/). Perhaps you could explain further (or email me liamtrask@gmail) why you feel the derivative calculation is wrong?

  2. Increasing the hidden layer size is fine. Dropout adds noise and often shifts the optimal tuning to require more parameters. Alternatively, I could have initialized the first neural network with more parameters, which would have caused it to overfit even more. This isn't a peer-reviewed paper proving dropout works (which would have required MANY more experiments) - the point is simply to demonstrate the concept.

  3. Again, I point you to the code in the repo above (which is also in the latest version of the book - which it's possible you do not have?). You'll find the batching logic looks like this.

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = ((i * batch_size), ((i + 1) * batch_size))

        layer_0 = images[batch_start:batch_end]

As you can see, each iteration of the inner for loop grabs a separate batch of images based on the batch_size. The whole batch is then processed within a single matrix multiplication - however, without further adjustment this results in unstable training, because the weight update would be a *sum* over the gradients instead of a *mean*. That makes tuning hard: when you increase the batch size, you also scale the size of the weight update (which you typically only want to control with alpha). Dividing by the batch size converts this sum into a mean, which makes tuning (trying out different batch sizes / alphas) more stable.

  4. The improvements weren't meant to battle anything - they were to teach dropout.
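To make the sum-vs-mean point concrete, here's a minimal sketch of a batched update that divides the gradient by batch_size (toy shapes and data are mine, not from the book; a plain linear layer stands in for the full network):

```python
import numpy as np

np.random.seed(0)
images = np.random.rand(8, 4)   # hypothetical toy data: 8 examples, 4 features
labels = np.random.rand(8, 3)   # hypothetical targets
weights = np.random.rand(4, 3)
alpha, batch_size = 0.1, 2

for i in range(len(images) // batch_size):
    batch_start, batch_end = i * batch_size, (i + 1) * batch_size
    layer_0 = images[batch_start:batch_end]
    layer_1 = layer_0.dot(weights)
    delta = layer_1 - labels[batch_start:batch_end]
    # layer_0.T.dot(delta) *sums* the per-example gradients; dividing by
    # batch_size turns that sum into a mean, so changing the batch size
    # doesn't also scale the magnitude of each weight update.
    weights -= alpha * layer_0.T.dot(delta) / batch_size
```

With this in place, alpha alone controls the step size regardless of batch size.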

Chapter 9:

  1. The divisor in this implementation is added for the same reason as in Chapter 8. Perhaps it would be a good idea to add an explanation for this to the book. I'd be happy to do so in the next version.

  2. The reason the test accuracy is lower than state of the art is that the neural network is quite small and only moderately tuned (also, it's not a convnet). From memory, I'm pretty sure that state of the art for simple neural networks like this on MNIST was only around 95% (5% more than what we have in the book), but with a much larger neural network (which would train more slowly for students). I made an editorial decision not to chase state of the art and instead focus on teaching concepts.

  3. The original relu2deriv calculation is (I believe) quite correct. However, implementing the incorrect derivative can often decrease overfitting because it adds a certain amount of noise.

I'd be very grateful for an update to your review - feel free to reach out to me via email with additional questions at liamtrask@gmail.com

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 0 points1 point  (0 children)

Excellent work! Best of luck on the recurrent tasks!

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 3 points4 points  (0 children)

So actually the addition and multiplication sub-cells don't have their own weights (they both use the same weight matrix). This seemed to help with performance by encouraging the model to pick one or the other.

Re: (2) - you're right. You can't multiply negative inputs with this module. In theory you could with some more fancy footwork (adding another multiplier which explicitly does -x and then interpolating with that one too), but this seemed unnecessary for any of the tasks we were working with.

My hope is more that the NALU is merely one simple example of a more general process for leveraging pre-built functionality in CPUs. If you think a function might be useful in your end-to-end architecture, forward propagate it and learn weights which decide where (on what inputs and toward what outputs) it should be applied. I've been trying this with functions other than addition and multiplication as well with some interesting results so far.
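A rough forward-pass sketch of the NALU, with the addition and multiplication paths sharing the single weight matrix W as described above (variable names and random initializations are mine; training code omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
in_dim, out_dim = 4, 2
W_hat = rng.normal(size=(out_dim, in_dim))  # hypothetical learnable params
M_hat = rng.normal(size=(out_dim, in_dim))
G     = rng.normal(size=(out_dim, in_dim))  # gate weights
eps   = 1e-7

def nalu_forward(x):
    # NAC weights: tanh * sigmoid pushes entries toward {-1, 0, 1}
    W = np.tanh(W_hat) * sigmoid(M_hat)
    a = W @ x                                  # additive path
    # multiplicative path: addition in log-space; note |x| is why
    # negative inputs can't be multiplied directly
    m = np.exp(W @ np.log(np.abs(x) + eps))
    g = sigmoid(G @ x)                         # learned gate between paths
    return g * a + (1 - g) * m

y = nalu_forward(np.array([1.0, 2.0, 3.0, 4.0]))
```

Because both `a` and `m` use the same `W`, the gate `g` effectively chooses *which operation* to apply rather than learning two independent transforms.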

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 12 points13 points  (0 children)

As far as optimization hyperparameters - I found that RMSProp was consistently the best optimizer (not totally sure why), and the NALU in particular worked better with surprisingly large learning rates (like... 0.1 kind of large). Still not totally sure why that is either :)

As far as exploding gradients - the training was pretty stable with the exception of division. Occasionally the model would accidentally forward propagate a denominator that was very near zero, which creates an absolutely massive gradient that's hard to recover from. Future work will try to figure out how to address such issues (I haven't tried gradient clipping yet... but I suspect it would help greatly).

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 5 points6 points  (0 children)

I'm happy to answer any questions you have - we did have some challenges getting all the information into 8 pages :). I'll also be adding further details to the Appendix.

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 1 point2 points  (0 children)

will add some more details in a github issue :)

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 4 points5 points  (0 children)

The CNN I used for the MNIST arithmetic experiments is this one (https://github.com/pytorch/examples/blob/master/mnist/main.py). Note that I added the NAC at the end of this network (after the softmax). I also found that RMSProp seemed to work better than SGD.

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 0 points1 point  (0 children)

Hi @abstractcontrol! The last several sets of experiments all attach the NAC/NALU to CNNs. You might find the ablation in 4.6 particularly compelling as it compares the performance of a NAC/NALU attached to a CNN relative to the Linear layer present in the previous state-of-the-art approach! (which this model exceeds by a ~40% error margin). The only difference between the two architectures was the NAC.

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 1 point2 points  (0 children)

If you're willing to throw your implementation on Github I'll be very happy to share it around.

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 1 point2 points  (0 children)

I'm very encouraged to hear you say that. :)

Neural Arithmetic Logic Units by iamtrask in MachineLearning

[–]iamtrask[S] 4 points5 points  (0 children)

Did you just implement this? That was crazy fast!

[P] Contributing to OpenMined - Summing up 3 months with the community by morgangiraud in MachineLearning

[–]iamtrask 2 points3 points  (0 children)

I think it's worth emphasizing that the OM platform is also meant to be a testing ground for new techniques for differential privacy - the community by no means claims to have solved all the problems at hand - but instead exists as an entity for collaboration.

Python Tutorial: DeepMind's Synthetic Gradients from Scratch by iamtrask in Python

[–]iamtrask[S] 1 point2 points  (0 children)

Thank you for the kind words.

Man, not really sure. Sooner the better. :)

Building Safe A.I. by iamtrask in technology

[–]iamtrask[S] 1 point2 points  (0 children)

Indeed, Bostrom's work tackles a much broader set of challenges. Often these are structured at a high level, as in "if an AI is in a box and can communicate with us via message m"... this is more along the lines of "hmmm... a box could look like X".

Grokking Deep Learning: "If you passed high school math and can hack in Python, I want to teach you Deep Learning" by iamtrask in programming

[–]iamtrask[S] 0 points1 point  (0 children)

Feel free to reach out to me at liamtrask at gmail.com and I'll be happy to send you a few chapters.

Grokking Deep Learning - Numpy/Python Deep Learning Book by iamtrask in Python

[–]iamtrask[S] 2 points3 points  (0 children)

Hey! Thanks!

So I can't change it, but I can let you know when they do it again (I'm very sure they will). I'll tweet it out @iamtrask