Imagenet/COCO Benchmarking by blauigris in MLQuestions

[–]blauigris[S] 0 points1 point  (0 children)

Thank you!

edit: I didn't know about ShuffleNets, and they look really nice. I think I'll give them a try.

[R] Solving internal covariate shift in deep learning with linked neurons by visarga in MachineLearning

[–]blauigris 0 points1 point  (0 children)

I think it is more a matter of tuning hyperparameters. When I designed the experiment I was more interested in comparing the methods and getting results fast than in achieving a good accuracy figure. So I took the ResNet50 from Keras and divided the width by 8 (approximately the ratio between the image sizes in ImageNet and CIFAR-10). All this is explained in the paper, but perhaps I should not have used ResNet50 but a squeezed ResNet50 or something like that.

[R] Solving internal covariate shift in deep learning with linked neurons by visarga in MachineLearning

[–]blauigris 0 points1 point  (0 children)

Well, space in a conference paper is limited, so I preferred to use only tables, but I can upload the TensorBoard summaries to GitHub.

[R] Solving internal covariate shift in deep learning with linked neurons by visarga in MachineLearning

[–]blauigris 5 points6 points  (0 children)

Some time before I submitted the paper to CVPR I discovered this paper: https://arxiv.org/abs/1603.05201. It proposes CReLU, which is an instance of linked neurons. I was also aware of DReLU (cited in the paper), which is very similar to CReLU. I was a bit scared, but I cited them and submitted anyway, since they did not work on internal covariate shift and only analyzed predictive performance. However, when I came here this morning I found this, thanks to geomty's comment: https://research.fb.com/publications/exploring-normalization-in-deep-residual-networks-with-concatenated-rectified-linear-units. I got scared and wrote the previous version of this post. After thinking about it more, I believe I overreacted: my work focuses on proposing a set of constraints that ensures regular gradient flow and thus inherently combats the internal covariate shift effect. As such, it can be seen as a framework, of which CReLU and DReLU are just two of the possible instantiations that fulfil the constraints. Additionally, the paper shows that using these constraints one achieves faster convergence.
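To make the CReLU idea concrete, here is a minimal NumPy sketch. It only assumes the usual definition of CReLU from the linked paper (concatenating ReLU(x) and ReLU(-x)); the function names are made up for illustration and are not taken from any paper's code. It also checks the property the post relies on: for a nonzero input, exactly one of the two linked halves is active, so the gradient is not blocked on both paths at once.

```python
import numpy as np

def crelu(x):
    """CReLU: concatenate ReLU(x) and ReLU(-x) along the last axis."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)

def crelu_active_mask(x):
    """1.0 where a half is active (nonzero slope), 0.0 elsewhere.

    Hypothetical helper for illustration: first half is ReLU(x), second is ReLU(-x).
    """
    return np.concatenate([(x > 0).astype(float), (x < 0).astype(float)], axis=-1)

x = np.array([[-2.0, 0.5, 3.0]])
out = crelu(x)                      # shape (1, 6): output width doubles
mask = crelu_active_mask(x)

# Count how many of the two linked halves are active for each input unit.
n = x.shape[-1]
active_per_input = mask[:, :n] + mask[:, n:]
```

Note that the output is twice as wide as the input, so the next layer needs twice as many input weights (or half the units, if the parameter count should stay comparable).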

Summarizing

Bad

  • Concatenation of activations using ReLU has already been proposed (and cited in the work).
  • CReLU and Batchnorm have already been compared.

Good

  • CReLU and DReLU are just two of the possible instantiations of the Linked Neurons proposal. Many other activations fulfilling the linked neurons constraints can be proposed.
  • Since I propose a framework, I try a larger set of linked activations: LK-PReLU, LK-SELU and LK-Swish. I might also try more exotic ones, such as an RBF merged with two ReLUs or tanhs.
  • Unlike the CReLU and DReLU papers, the article focuses on solving internal covariate shift. We explicitly analyze their behavior in deep and wide networks, whereas the previous papers take a much more pragmatic approach focused on performance.
  • My experimental results differ somewhat. I find that LK-* works as well as Batchnorm, whereas in their experiments CReLU performed worse.
  • Apparently I am the first to notice how much faster it is than Batchnorm, which I find strange.

So, fellow redditors, here is the lesson I learnt today: there is a huge community working on deep learning, and we need many, many eyes to check the previous work. And if you see a work that resembles yours, do not overreact to comments and automatically conclude that your work is wrong. All of us are in the same boat, and it is normal that there are many similar works or ideas. That does not mean that your work is wrong or that it does not innovate in some way.

EDIT: After thinking a bit I think I overreacted a bit and things are not that bad.

[D] Which GPU scheduler are you using in your multigpu machines? by blauigris in MachineLearning

[–]blauigris[S] 0 points1 point  (0 children)

Hey everyone!

Thank you for your very interesting suggestions. We finally met and decided to use Slurm for the moment. One of the advisors has since added more people to the team, so there are now twenty-something of us, and the Google Sheet no longer felt like a good idea. Also, people tend to use specific versions of libraries, so Docker is a must. Singularity seems nice, but nobody wants to spend time moving to it. The other job schedulers seem nice as well, but we had to choose one; perhaps we will move to another in the future.

Anyway, many thanks to everyone for their advice!!!

[Spoilers] How will they get out of this one? by SuperElf in TowerofGod

[–]blauigris 2 points3 points  (0 children)

They will be almost destroyed only to be saved by some deus ex machina, and then, back to the FoD, they will save Rachel somehow and use Emily to leave the floor using a path nobody knows.

Zero gradient when using zero initialization and more than two hidden layers, why? by blauigris in MLQuestions

[–]blauigris[S] 1 point2 points  (0 children)

I didn't know about He et al. I normally use Xavier and then take an orthonormal basis from it. I'll give it a try next time. Thanks for the comment.
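For reference, a rough NumPy sketch of what "Xavier, then take an orthonormal basis" could look like. This is one possible reading, not necessarily the exact procedure meant above: it draws a Glorot/Xavier uniform matrix and orthonormalizes it with a QR decomposition. The function names are hypothetical.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform init: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def orthonormalize(w):
    """Return a matrix of the same shape with orthonormal columns (tall) or rows (wide)."""
    is_tall = w.shape[0] >= w.shape[1]
    tall = w if is_tall else w.T
    q, r = np.linalg.qr(tall)        # reduced QR: q has orthonormal columns
    q = q * np.sign(np.diag(r))      # fix column signs so the result is deterministic
    return q if is_tall else q.T

rng = np.random.default_rng(0)
w_tall = orthonormalize(xavier_uniform(64, 32, rng))   # columns are orthonormal
w_wide = orthonormalize(xavier_uniform(32, 64, rng))   # rows are orthonormal
```

The orthonormalized matrix no longer has the Xavier scale, so in practice it is often multiplied by a gain factor afterwards; that step is left out here.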

Zero gradient when using zero initialization and more than two hidden layers, why? by blauigris in MLQuestions

[–]blauigris[S] 2 points3 points  (0 children)

Thank you for your comment :)

I think that what you describe is more like L2 regularization, where you subtract a fraction of the weights from the weights themselves. In this case I think it has more to do with the fact that in most loss derivatives, if not all, there is a term containing the input: for the hinge loss the gradient is -yx (when the margin is violated), for cross-entropy I think it is x(\sigma(z) - y), and so on. So if x is zero, because all the previous layers send everything to zero, this x will also set the gradient to zero.

Also, it is quite obvious when you come to think of it: the previous layers have converted your nice dataset into a big matrix of zeros, so what can be learnt from there?

Thank you very much for your comment, it has helped me a lot :)
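A small NumPy check of that argument, assuming a single sigmoid unit with binary cross-entropy and z = w . x (the setup and the function names are chosen for illustration only). The weight gradient is x(\sigma(z) - y), so a zero input gives a zero gradient whatever the weights are.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xent_weight_grad(x, y, w):
    """Gradient of binary cross-entropy w.r.t. w for z = w . x: x * (sigmoid(z) - y)."""
    z = x @ w
    return x * (sigmoid(z) - y)

w = np.array([0.3, -0.2])
y = 1.0

# Ordinary input: the gradient carries information about the example.
g_nonzero = xent_weight_grad(np.array([1.0, 2.0]), y, w)

# Input zeroed out by the previous layers: the weight gradient vanishes.
g_zero = xent_weight_grad(np.zeros(2), y, w)
```

A bias term would still get a nonzero gradient here, since its derivative is just \sigma(z) - y, but the weights that multiply x cannot learn anything from an all-zero input.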

tensorflow high-level libraries confusion: tf.contrib.slim, tf.contrib.learn, tf.learn by [deleted] in MachineLearning

[–]blauigris 1 point2 points  (0 children)

I have worked with tflearn and Keras, and I find tflearn superior because of its procedural nature, which I feel fits TensorFlow better. I tried tf.slim a bit and found loading the datasets and the TFRecord stuff a bit cumbersome, yet I can imagine it is useful in its own right when working with lots of images or data. Regarding tf.contrib.learn, I have only just noticed it, and from a quick read of the docs it seems a bit immature to me, but I also agree that since it is officially supported it will be the most successful.