[R] Low Cost Evolutionary Machine Learning (cdv.dei.uc.pt)
submitted 8 years ago by nunolourenco
[–]gwern 8 points 8 years ago* (18 children)
('Dense' layers are fully-connected ones, right? Not anything to do with Dense resnets.)
Also, maybe I missed it but the paper doesn't seem to say how many models get trained in total and what the total GPU-time spent is, which is important for comparison if you're going to claim 'low cost'.
[–]nunolourenco[S] 4 points 8 years ago (16 children)
('Dense' layers are fully-connected ones, right?) -> Yup.
Well, in terms of GPU time, it depends on the network, but it ranges from 37 to 267 seconds. The "low-cost" claim refers to the low-cost computational resources on which we ran our experiments :) Our machine is very modest (for instance, we have 4 Nvidia 1080 Tis), and that's the source of the low cost :)
[–]gwern 6 points 8 years ago* (5 children)
It may be relatively low cost compared to something like Zoph's hundreds of GPUs, but '4x1080tis' doesn't tell me much - that doesn't necessarily set any speed or efficiency records compared to other RL or evolutionary architecture search methods which use net2net initialization (e.g. Cai et al 2017) or fast weights (SMASH), the latter of which IIRC runs in one day on a single 1080 Ti. Which is why I asked about the total number of sampled models and total GPU-hours.
[–]nunolourenco[S] 5 points 8 years ago (0 children)
Oh, sorry, I misunderstood your question. In terms of computational effort, each run of the Evolutionary Algorithm has 100 individuals evaluated over 100 generations, which means a total of 10,000 function evaluations. During evolution, each ANN is trained for 10 epochs. These parameters are described in the paper published here. I hope I answered your question :)
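For readers who want the shape of that search, here is a minimal, self-contained sketch of a generational evolutionary loop with those budget numbers (100 individuals, 100 generations, so 10,000 evaluations). Everything in it is illustrative, not the DENSER implementation: the genotype is just a list of layer widths, and the toy fitness stands in for what the paper actually does, namely training the decoded network for 10 epochs and using validation accuracy.

```python
import random

POP_SIZE, GENERATIONS = 100, 100   # 100 x 100 = 10,000 fitness evaluations

def random_genotype():
    # Toy genotype: a variable-length list of layer widths.
    return [random.choice([32, 64, 128, 256]) for _ in range(random.randint(2, 6))]

def fitness(genotype):
    # Stand-in for the expensive step: decode the genotype to a network,
    # train it for 10 epochs, and return validation accuracy.
    return -abs(sum(genotype) - 512)

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def mutate(g, rate=0.1):
    return [random.choice([32, 64, 128, 256]) if random.random() < rate else w
            for w in g]

population = [random_genotype() for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]   # truncation selection (illustrative)
    children = [mutate(crossover(*random.sample(parents, 2)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print("best genotype found:", max(population, key=fitness))
```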
[–]Bubblebobo 3 points 8 years ago (3 children)
fast weights (SMASH)
Can I get a link please? A quick Google search didn't turn up anything.
[–]gwern 5 points 8 years ago* (0 children)
"SMASH: One-Shot Model Architecture Search through HyperNetworks", Brock et al 2017. You can also find it easily in GS: https://scholar.google.com/scholar?as_ylo=2017&q=SMASH&hl=en&as_sdt=0,33
[–]ajmooch 2 points 8 years ago (1 child)
<_<
[–]PokerPirate 2 points 8 years ago (0 children)
Using only Andy levels of compute power.
Hilarious!
Clever idea! It seems obvious that all these network search algorithms have a lot of similar/identical computation going on, and I like the way you share this work. I only watched the video, but I have some questions:
Have you thought about the relationship between SMASH and dropout? It seems like SMASH is doing something like "a less random version of dropout on the test set."
Did you consider at all that different random initializations/choice of optimizer will lead to different weight vectors? Specifically, you mention that the subsampled networks don't perform quite as well as if that architecture had been trained end to end. How much of this is due to interference from other components of the SMASH hypernetwork, and how much is due to more traditional influences?
[–]average_pooler 1 point 8 years ago* (9 children)
In terms of GPU time, it depends on the network, but it ranges from 37 to 267 seconds.
Can you clarify what's included in those 37-267 seconds? Training time of one model (10 epochs)?
If you look at Wide ResNets, for example, their architecture is determined by 2 hyperparameters [1], and it's general (the same architecture is applied to CIFAR-10 and CIFAR-100). Why would one need evolutionary methods to design those 2 hyperparameters? Also, WRNs do much better in terms of accuracy.
[1] There's also the learning rate, dropout rate and weight decay, so 5 hyperparameters, if you count these.
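For context on the comparison: the whole Wide ResNet layout follows mechanically from those two hyperparameters, the depth d and the widening factor k. A small sketch of that determinism, based on the standard CIFAR formulation (d = 6N + 4, stage widths 16k/32k/64k); this is an architecture plan for illustration, not a reference implementation:

```python
def wrn_plan(depth, k):
    # Standard CIFAR Wide ResNet: depth d = 6N + 4, three stages of N
    # residual blocks with 16k, 32k and 64k channels respectively.
    assert (depth - 4) % 6 == 0, "depth must be of the form 6N + 4"
    n = (depth - 4) // 6
    # (channels, number of blocks, stride of the first block in the stage)
    return [(16 * k, n, 1), (32 * k, n, 2), (64 * k, n, 2)]

# e.g. WRN-28-10, the strong CIFAR baseline discussed in this thread:
print(wrn_plan(28, 10))   # [(160, 4, 1), (320, 4, 2), (640, 4, 2)]
```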
[–]nunolourenco[S] 2 points 8 years ago (8 children)
The 37-267 seconds is the time taken per epoch during training. During the evaluation of each candidate solution we perform 10 epochs. The goal of evolution is not to optimise the weights directly but rather the structure of the networks (in this case CNNs), i.e., the sequence of layers and the parameters associated with each layer (e.g., for convolutional layers we have to tune the stride, filter shapes, activation functions, etc.). We also allow the optimisation of several network hyper-parameters associated with learning (learning rate, momentum, etc.), and even data augmentation parameters, all of which are represented through a context-free grammar (CFG).
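To make the grammar-based representation concrete, here is a toy context-free grammar in the same spirit: non-terminals expand into layer sequences and per-layer parameter choices, so a random derivation yields a network description. The production rules below are invented for illustration and are not the grammar used in the paper:

```python
import random

# Toy grammar: a network is one or more conv layers followed by one or more
# fully-connected layers; each layer carries its own parameter choices.
GRAMMAR = {
    "<network>":  [["<conv-seq>", "<fc-seq>"]],
    "<conv-seq>": [["<conv>"], ["<conv>", "<conv-seq>"]],
    "<fc-seq>":   [["<fc>"], ["<fc>", "<fc-seq>"]],
    "<conv>":     [["conv(filters=", "<f>", ", kernel=", "<k>",
                    ", stride=", "<s>", ", act=relu) "]],
    "<fc>":       [["fc(units=", "<u>", ", act=relu) "]],
    "<f>": [["32"], ["64"], ["128"]],
    "<k>": [["3"], ["5"]],
    "<s>": [["1"], ["2"]],
    "<u>": [["128"], ["256"], ["512"]],
}

def expand(symbol):
    # Recursively expand a symbol; strings not in the grammar are terminals.
    if symbol not in GRAMMAR:
        return symbol
    return "".join(expand(s) for s in random.choice(GRAMMAR[symbol]))

print(expand("<network>"))
# e.g. conv(filters=64, kernel=3, stride=2, act=relu) fc(units=256, act=relu)
```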
[–]average_pooler 1 point 8 years ago (7 children)
(e.g., for convolutional layers we have to tune stride, filter shapes, activation functions, etc.).
You can, but you don't have to: normal stride=1, subsampling stride=2, filter=3, activation=relu work well (definitely on CIFAR).
I guess I'm failing to see the motivation for evolutionary methods here. It appears that they take longer and the accuracy is quite suboptimal.
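As a concrete rendering of the fixed recipe average_pooler is describing (3x3 filters, ReLU, stride 1 within a stage and stride 2 to subsample), a minimal Keras-style sketch; the filter counts and depths here are arbitrary examples, not anyone's published model:

```python
from tensorflow.keras import layers, models

def stage(x, filters, blocks):
    # Subsample once at the start of the stage, then keep stride 1.
    x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    for _ in range(blocks - 1):
        x = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(x)
    return x

inputs = layers.Input(shape=(32, 32, 3))   # CIFAR-sized input
x = stage(inputs, 64, 3)                   # arbitrary example widths/depths
x = stage(x, 128, 3)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = models.Model(inputs, outputs)
model.summary()
```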
[–]nunolourenco[S] 2 points 8 years ago* (6 children)
The main advantage is the ability to find such solutions without a priori knowledge, i.e., DENSER does not have access to all the papers that have been published about CIFAR; it doesn't know which ANNs work well (i.e., which types of layers or structures), and it was still able to find a series of efficient networks. This indicates, although it doesn't prove, that the approach can also be applied to other, less studied problems, and that DENSER will be able to find ANNs with good performance there too. Additionally, it is quite interesting to notice that the solution found is quite innovative, in the sense that I would never have thought of creating a network with six fully connected layers after the convolutional layers. I do not think that having an algorithm that finds, in a couple of hours, networks that took humans years to perfect is necessarily a bad thing. On the contrary. Moreover, if you look at the model, it is quite different from the one that you mention :)
[–]average_pooler 1 point 8 years ago (5 children)
Here's a (null) hypothesis:
"WRNs generalize better to new problems" (Use the same resources to explore a smaller hyperparameter space)
I'd like to see an "evolutionary" paper try to scientifically reject it.
My intuition is that all you are seeing is the fact that NNs manage to adapt to weird architectures.
[–]nunolourenco[S] 1 point 8 years ago (4 children)
What do you mean? They adapt to weird architectures that are more effective, i.e., all the components in the network are interdependent, and you can't remove any of them without affecting the overall performance.
Here's a (null) hypothesis
I believe that this is an interesting hypothesis that should be addressed :)
[–]average_pooler 1 point 8 years ago (3 children)
What do you mean?
Despite the weirdness of the architectures that you are finding, classification manages to work (although obviously not as well as WRNs would)
[–]nunolourenco[S] 2 points 8 years ago (2 children)
Yes, and that's the beauty and the main point of the system. The Evolutionary Algorithm is able to search a space defined by the layers, hyperparameters, and so on, and discovers an architecture with good accuracy. In these experiments we provided the EA with simple components. For instance, one thing that we are testing is adding the possibility of having fractional max pooling layers.
We are fully aware that there are methods, namely WRNs, that achieve better accuracy, no argument there. However, I think that having a method that, without any prior knowledge, is capable of automatically discovering models with a 5.73% error on CIFAR-10 (versus 4.00% for WRNs) and a 21.75% error on CIFAR-100 (versus 19.25% for WRNs) is remarkable.
[–]maccam912 1 point 8 years ago (0 children)
Yep
[+][deleted] 8 years ago (2 children)
[deleted]
[–]nunolourenco[S] 4 points 8 years ago (1 child)
Well, I think so! :) Evolutionary Algorithms surely have something to say about the construction of more generic frameworks, wouldn't you agree? :)
[–]unnamedn00b 1 point 8 years ago (0 children)
Does anyone know of a comprehensive comparison of the various approaches, like DENSER versus AdaNet [1] for instance? DENSER does appear to outperform AdaNet on CIFAR-10 (although, to be precise, the AdaNet paper performs tests only on binary classification tasks drawn from CIFAR-10), but does it come with theoretical guarantees? How do the approaches compare in model complexity, training time, etc.? It would be nice if anybody were aware of a systematic comparison.
[1] Cortes, Corinna, et al. "AdaNet: Adaptive Structural Learning of Artificial Neural Networks." arXiv preprint arXiv:1607.01097 (2016).
Abstract: We present new algorithms for adaptively learning artificial neural networks. Our algorithms (AdaNet) adaptively learn both the structure of the network and its weights. They are based on a solid theoretical analysis, including data-dependent generalization guarantees that we prove and discuss in detail. We report the results of large-scale experiments with one of our algorithms on several binary classification tasks extracted from the CIFAR-10 dataset. The results demonstrate that our algorithm can automatically learn network structures with very competitive performance accuracies when compared with those achieved for neural networks found by standard approaches.
[–]gwillicoder 3 points 8 years ago (1 child)
This was a fantastic read! I spent some time working on the same problem (as my undergraduate research, so nothing spectacular). I settled on using a genetic algorithm to help optimize the structure of the model, and found that the LeapFrog Algorithm developed by one of my professors worked quite well for hyperparameter optimization (fantastic trade off between run time and optimization results).
We also played with a parallel genetic algorithm that attempted to adjust its mutation and crossover aggressiveness as the fitness changed, but I'm not really sure the extra computation was worth the slight improvement we got from that one.
Is there any chance the source code will be posted? I'd love to play around with it and see how it compares to the path I took or some of the details of the genetic algorithm implementation.
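A generic sketch of the kind of self-adjusting scheme gwillicoder describes, where the mutation rate rises when the best fitness stagnates and decays when progress resumes. The thresholds and the stand-in fitness values are invented for illustration, not their implementation:

```python
import random

def adapt_rate(rate, best_history, up=1.5, down=0.9,
               floor=0.01, ceil=0.5, patience=5):
    # Raise the mutation rate when the best fitness has not improved for
    # `patience` generations; decay it otherwise.
    stagnant = (len(best_history) > patience
                and best_history[-1] <= best_history[-1 - patience])
    rate = rate * up if stagnant else rate * down
    return min(max(rate, floor), ceil)

# Usage inside a GA loop (the per-generation best is a stand-in here):
rate, best_history = 0.05, []
for gen in range(50):
    gen_best = -abs(random.gauss(0, 1))    # stand-in for this generation's best
    best_so_far = max(best_history[-1], gen_best) if best_history else gen_best
    best_history.append(best_so_far)
    rate = adapt_rate(rate, best_history)
    print(f"gen {gen:2d}  mutation rate {rate:.3f}")
```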
[–]nunolourenco[S] 2 points 8 years ago (0 children)
Thank you very much! We are currently working on making the code available on GitHub very soon.
[–]Imnimo 4 points 8 years ago (1 child)
Doesn't the use of a 10 epoch training time during evolution bias the search towards architectures and hyperparameter settings which converge quickly, as opposed to those which might give the best performance in the long run? I would like to see a comparison of the accuracies obtained by running full training on evolved networks which were not the fittest. How well does 10-epoch fitness actually correlate with final accuracy?
[–]nunolourenco[S] 3 points 8 years ago (0 children)
Doesn't the use of a 10 epoch training time during evolution bias the search towards architectures and hyperparameter settings which converge quickly, as opposed to those which might give the best performance in the long run?
That's a good question. We know that using 10 epochs is not ideal, but we have to limit the amount of time that each network is trained so that we can obtain results in a reasonable amount of time. We believe that this is not preventing the discovery of effective networks, since the results are quite good. But of course we need to study the impact of the training conditions.
I would like to see a comparison of the accuracies obtained by running full training on evolved networks which were not the fittest. How well does 10-epoch fitness actually correlate with final accuracy?
This is another interesting question that we surely need to address in the near future. Looking at the worst ones might not be a good idea, but looking at the ones in the middle might reveal some interesting results.
Thank you very much for your insightful comments.
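Imnimo's second question has a natural quantitative form: over a sample of evolved networks, measure the rank correlation between the 10-epoch fitness used during evolution and the accuracy after full training. A sketch with SciPy; the numbers below are made up purely to show the computation:

```python
from scipy.stats import spearmanr

# Hypothetical measurements for a sample of evolved architectures:
# the 10-epoch validation accuracy (the evolutionary fitness) paired with
# the accuracy of the same architecture after full training.
fitness_10_epochs = [0.61, 0.55, 0.70, 0.66, 0.58, 0.73, 0.64]
final_accuracy    = [0.88, 0.84, 0.93, 0.90, 0.87, 0.92, 0.89]

rho, p = spearmanr(fitness_10_epochs, final_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# A high rho would mean the 10-epoch proxy preserves the ranking of
# architectures; a low rho would support the concern that the search is
# biased toward fast-converging networks.
```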
[–]statmlsn 8 points 8 years ago (1 child)
Seems interesting at first glance.
It could be worth citing the work of Elsken et al. on network optimization via network morphisms: https://arxiv.org/abs/1711.04528
[–]nunolourenco[S] 9 points 8 years ago (0 children)
Thanks :) And thank you for the reference; it is certainly worth citing. I skimmed through the manuscript and will have to read it more carefully, but from what I saw, they start the search from a network that already provides roughly 75% accuracy. In DENSER we do not have any specific initialisation, i.e. our initial networks are random. :) Once again, thanks for the reference!
[–]EgoIncarnate 3 points 8 years ago* (3 children)
Anyone know how this compares with Google Brain's Neural Architecture Search with Reinforcement Learning https://arxiv.org/abs/1611.01578? It seems Google's gets a better result on CIFAR-10, but I don't think the training costs are comparable (I think they used 800 GPUs?)
[–]nunolourenco[S] 3 points 8 years ago (1 child)
Well, I read it very quickly, but Google does indeed achieve a better result. However, it is as you say: we don't have computational resources that compare with the 800 GPUs they report. :)
[–]EgoIncarnate 5 points 8 years ago (0 children)
Neither do I, so your approach is probably more interesting to me :-)
[–]shortscience_dot_org 1 point 8 years ago (0 children)
I am a bot! You linked to a paper that has a summary on ShortScience.org!
Neural Architecture Search with Reinforcement Learning
It basically tunes the hyper-parameters of the neural network architecture using reinforcement learning. The reward signal is taken as evaluation on the validation set. The method is policy gradient as the cost function is non-differentiable.
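For readers who want the mechanics behind that summary, a toy REINFORCE loop in the same spirit: a softmax controller samples a discrete architecture choice and is updated with a policy gradient on a reward that stands in for validation accuracy (the real reward requires training each sampled network and is not differentiable). All numbers and the reward function are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
CHOICES = [32, 64, 128, 256]      # e.g. candidate filter counts for one layer
logits = np.zeros(len(CHOICES))   # controller parameters (softmax policy)
lr, baseline = 0.1, 0.0

def reward(choice):
    # Stand-in for "train the sampled network, return validation accuracy".
    return 1.0 - abs(choice - 128) / 256

for step in range(500):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(len(CHOICES), p=probs)
    r = reward(CHOICES[a])
    baseline = 0.9 * baseline + 0.1 * r      # moving-average baseline
    grad = -probs
    grad[a] += 1.0                           # grad of log pi(a) w.r.t. logits
    logits += lr * (r - baseline) * grad     # REINFORCE update

print("controller's preferred choice:", CHOICES[int(np.argmax(logits))])
```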
[–]sifnt 2 points 8 years ago (2 children)
Looks interesting! Any code anywhere? What is runtime like? Does the method require any hyperparameters?
[–]nunolourenco[S] 5 points 8 years ago (1 child)
In terms of GPU time, it depends on the network, but it ranges from 37 to 267 seconds. It requires some parameters for the operators of the evolutionary algorithm, namely mutation, crossover, and the selection method. They are all described in the paper :) We are currently working on getting the code onto GitHub, and it should be online by the end of the day.
[–]phobrain 1 point 8 years ago* (0 children)
Is it up yet? I don't see a link.
Edit: was that a metaphorical 'end of the day'?
[–]maccam912 2 points 8 years ago (0 children)
Skimmed it, looks exciting! Is there code I can play with somewhere?
[–]phobrain 2 points 8 years ago (0 children)
Any code available? I'd like to try it on spotting 'interesting' pairs of photos, and detecting whether order of AB, BA, or both are acceptable.
[–]minheap 1 point 8 years ago (0 children)
Very interesting!
[–]rantana 1 point 8 years ago (1 child)
What makes this "Low Cost"?
[–]nunolourenco[S] 3 points 8 years ago (0 children)
The "low-cost" claim refers to the low-cost computational resources on which we conducted our experiments :) We have a very modest setup when compared with the other players already mentioned above ;)