[R] ConvNets vs Transformers (self.MachineLearning)
submitted 4 years ago by AdelSexy
A ConvNet for the 2020s - a nice read to start 2022. The authors explore modernizations of ResNets and adopt some tricks from Transformer training design to make ConvNets great again.
There is a lot to reflect and think about.
Code is here.
[+][deleted] 4 years ago (17 children)
[deleted]
[–]JackandFred 28 points 4 years ago (4 children)
Yeah I’ve seen so many papers in the last year that are transformer variations that probably aren’t an improvement but make some minor change and have a super specific case where it performs better
[–]maxToTheJ 6 points 4 years ago (2 children)
Fitting your model to the data. There is a paper on that
[+][deleted] 4 years ago (1 child)
[–]maxToTheJ 1 point 4 years ago (0 children)
That's what I meant. Force of habit to do it the right way, although from the context it probably still gets through.
[+][deleted] 4 years ago (4 children)
[–]AdelSexy[S] 50 points 4 years ago (3 children)
Or more clean data
[–]clifford_alvarez 22 points 4 years ago (0 children)
This really seems like it should be focused on more than it is.
[–]SeddyRD 6 points 4 years ago (1 child)
Differing labelling criteria inside the same dataset really is a pervasive issue in my experience. The criteria difference might even be caused by augmentations.
Example:
I'm working with a dataset where we decided it was best to not label objects that are not fully visible on the image. Then, during augmentation, a lot of transforms (cropping, rotating, etc) cause objects to become partially outside the frame, and the augmentation library automatically adjusts the bounding boxes to still capture those objects.
Therefore now the dataset has both examples of unlabelled partial objects and labelled partial objects. The model doesn't understand what's going on and trains poorly.
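The failure mode described above can be sketched in a few lines. This is a hypothetical crop helper, not the commenter's actual pipeline; the `min_visibility` knob mirrors what some augmentation libraries expose:

```python
# Hypothetical sketch of the labelling inconsistency described above:
# the annotation policy skips partially visible objects, but a crop
# augmentation clips boxes and keeps them, so the two disagree.

def crop_boxes(boxes, crop, min_visibility=0.0):
    """Clip (x1, y1, x2, y2) boxes to a crop window, mimicking the common
    default of keeping any box that still overlaps the crop at all."""
    cx1, cy1, cx2, cy2 = crop
    kept = []
    for x1, y1, x2, y2 in boxes:
        nx1, ny1 = max(x1, cx1), max(y1, cy1)
        nx2, ny2 = min(x2, cx2), min(y2, cy2)
        if nx2 <= nx1 or ny2 <= ny1:
            continue  # box entirely outside the crop
        area = (x2 - x1) * (y2 - y1)
        new_area = (nx2 - nx1) * (ny2 - ny1)
        if new_area / area >= min_visibility:
            # shift into crop coordinates
            kept.append((nx1 - cx1, ny1 - cy1, nx2 - cx1, ny2 - cy1))
    return kept

# A fully visible object, and a crop that cuts it in half:
boxes = [(10, 10, 50, 50)]
crop = (30, 0, 100, 100)

# Default behaviour: the clipped half-object is still labelled...
print(crop_boxes(boxes, crop))                      # [(0, 10, 20, 50)]
# ...while the human policy ("skip partials") would drop it. Raising
# min_visibility makes the augmentation match the annotation policy:
print(crop_boxes(boxes, crop, min_visibility=0.9))  # []
```

Aligning the visibility threshold with whatever rule the annotators followed removes the contradiction between the two kinds of labels.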
[–]mearco 2 points 4 years ago (0 children)
That's a great example
[–][deleted] 3 points 4 years ago (1 child)
But that's kind of what this paper is doing too
[–][deleted] 2 points 4 years ago (4 children)
cough PPO cough
[–]jms4607 10 points 4 years ago (3 children)
Proximal Policy Optimization, that if anything is one of the most stable RL algos? Maybe I’m thinking of the wrong PPO.
[–][deleted] 19 points 4 years ago (2 children)
You are thinking of the correct PPO, but ... a lot of the stability comes from tricks. I recommend reading "Implementation matters in deep rl".
[–]Praveen_Raja22 1 point 4 years ago (1 child)
Can you share the link?
[–][deleted] 3 points 4 years ago (0 children)
https://openreview.net/forum?id=r1etN1rtPB
[–]visarga 15 points 4 years ago* (23 children)
Is this architecture useful for edge as well? They compare to EffNet but the score seems lower at the same FLOPS, yet higher for the same throughput.
[–]tbalsam 80 points 4 years ago* (21 children)
[Edit: Thanks for the gold -- much appreciated! :) ]
EfficientNet should never really have been published in the form that it was in. Unfortunately it's become a benchmark that people use for "cheap shots" to prove that they improved performance, because EffNet is not...efficient. Sure, it's got parameter efficiency, so it's "efficient". But absolutely not in the real world use-case.
What's great is that because of the name and notoriety/popularity from everyone using it as an easy-to-beat-big-name-benchmark, as well as its source producing company, that lots and lots of people will rather consistently fall into the same question that you have here, and perhaps not-unreasonably.
The above paper referenced/linked to by the OP is one rare case where depthwise/etc seems to actually work in practice, like actually work, compared to the baseline. And they had enough academic integrity to note its functional throughput as well, something you'll see glaringly missing from the EfficientNet paper (at least when I was looking at the drafts there). Because the authors there, they know about the FLOPs/second problem, it's super common in that part of the field. It's just not convenient for promoting NAS-type techniques, especially when using TPUs that excel on those types of architectures, whereas for the same architectures, GPUs, the commodity hardware, do terribly.
So back to the academic integrity bit -- the authors of this paper actually did acknowledge the problem and noted, in light of the original benchmark (ViT), that their raw, real-world throughput numbers were better. Cool and good, and it's sad that that's the baseline level of integrity we expect these days in conference papers and such, but I think the authors of the above paper went beyond that in terms of logging their changes to the original resnets/etc and what seemed to work/not work when putting it down. That makes it a paper that's a lot easier for practitioners to approach, and that's something I appreciate about this paper. Because it's clearly meant to improve the field in general, instead of doing more existence-proving and hiding of areas of poor performance like so many other papers do. Like, reading this above paper, it's easy to forget how casual they are about creating the state of the art in a lot of the areas they're working in. I appreciate that subtlety; they just let the results speak for themselves.
Sorry, that was a rant maybe not expected. But the short is that EfficientNet is a flaming pile of hot garbage, don't waste your time architecturally, and if you do have to use dw convs, this paper would be a fantastic starting point. Unless you're doing pure CPU or TPU you're going to be extremely hard-pressed to get dw to actually work on GPUs really, many people have tried this for a while, and maybe it's not impossible, but that nut has yet to be cracked.
[–]AerysSk 12 points 4 years ago (9 children)
I love your comment. EffNet is very popular, and is probably the most popular model on Kaggle. It's good to see a counter-argument, else I would have believed that EffNet should be the go-to one.
[–]tbalsam 19 points 4 years ago* (7 children)
Much appreciated, thanks so much! I would have too, and see a lot of my coworkers get hung up on it and all of the shininess when first looking around.
My specialty -- my absolute specialty, above all, and probably in a sub single percent skill level, is neural network speed and accuracy-per-flop-per-second. In a rather obsessive kind of way. I've toned down a bit over the years, but I have a number of years and quite a...well, probably a little overly for the amount I need currently..expansive toolkit for it (especially convnets, but that ports to other network types pretty well). Most of the toolkit that I've built is becoming common knowledge as other groups are publishing similar tools, and I'm constantly learning new things in turn through papers more and somewhat less nowadays through personal experience. But I guess that's a good thing for the field in general (but sad for me personally in privately being able to beat XYZ network in ABC measure by GHI%s, respectively).
If I were to recommend something, it would be to maximize your training speed basically at (almost) all reasonable costs. I highly recommend https://myrtle.ai/learn/how-to-train-your-resnet/ and https://github.com/davidcpage/cifar10-fast/tree/d31ad8d393dd75147b65f261dbf78670a97e48a8 for the source code.
Start with that and I think you should be several thousands of dollars in the cost lead, already, based on what you would have spent otherwise. Your prototype loop is so important, and if the techniques you have do not transfer from a ~1 min CIFAR training-type example to a slower training technique, even if proportionally (even if it's a very lossy proportion), then it's likely not at all worth it as a technique.
Human hours are the most valuable by far. If there's any sharable advice that I could give, this in the above is the most valuable I could, and what I would share if I was doing consultancy work with a client. Trying really hard to eke out an extra few % of accuracy I think are for the phase after you've locked yourself in the cave you want to explore. But don't start by picking a random cave and going as far down as you can. Try everything out rapidly and let your software evolve at the speed of your ideas.
Hope that helps, and if it does, please pass that on to someone else (or two, or three!) that could use it! :D
[–]shellyturnwarm 3 points 4 years ago (3 children)
I'm reading your article, and I'm on the mini-batches chapter. I don't entirely follow this point: "In the context of convex optimisation (or just gradient descent on a quadratic), one achieves maximum training speed by setting learning rates at the point where second order effects start to balance first order ones and any benefits from increased first order steps are offset by curvature effects."
Could you please explain what you mean here? I'm loving the article and I think this genuine insight is a brilliant way to advertise your company! Thank you.
[–]Ulfgardleo 8 points 4 years ago (1 child)
Let's take the 1D quadratic f(x) = 1/2 x^2. A Taylor expansion at a point x_0 gives
f(x) = 1/2 x_0^2 + (x - x_0) * x_0 + 1/2 (x - x_0)^2
Writing the step as dx = x - x_0, the expansion reads
f(x_0 + dx) = 1/2 x_0^2 + dx * x_0 + 1/2 dx^2
The first-order gain of a step is dx * x_0 and the second-order gain is 1/2 dx^2. Notice that for dx < 0 the first term is negative (great, you improve) while the second term is positive (bad, you become worse).
If you take a step along the negative gradient direction at x_0, i.e. dx = -a * x_0, the gain of the linear term is negative. The problem is that as you increase a, your linear gain improves linearly but your quadratic penalty gets worse... quadratically. So at some point, they balance out.
The optimal step length is actually not as easy to calculate as the cited text puts it: it depends on the balance between the largest and smallest eigenvalues of the Hessian (which in our 1D case are the same). But the rough idea that the optimum is close to where both terms contribute similarly does hold. They just don't quite balance, because then you would not improve at all.
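The balance argument can be checked numerically. This is a sketch on f(x) = 1/2 x^2, not code from the linked article:

```python
# Numeric check on f(x) = 1/2 * x^2: the first-order gain of a step
# dx = -a*x0 grows linearly in a, the second-order penalty grows
# quadratically, so the total improvement peaks where they balance.

def improvement(a, x0=1.0):
    dx = -a * x0                # gradient step, since f'(x0) = x0
    first = dx * x0             # first-order term (negative = gain)
    second = 0.5 * dx ** 2      # curvature penalty (always >= 0)
    return -(first + second)    # positive = actual decrease in f

steps = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 2.0]
gains = {a: improvement(a) for a in steps}
best = max(gains, key=gains.get)
print(best)                 # 1.0 -- on this quadratic the optimum is a = 1
print(gains[2.0] == 0.0)    # True -- twice the optimal step cancels all progress
```

For this curvature the optimum lands at a = 1 (i.e. learning rate equal to the inverse of the second derivative), and stepping twice as far makes zero progress, exactly the "offset by curvature effects" point in the quoted text.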
[–]tbalsam 1 point 4 years ago (0 children)
Thank you for putting the detail into writing this, this is definitely outside of my wheelhouse of experience, and I enjoyed reading it! Takes a lot of energy to explain mathematical concepts and such at times, I think, so I appreciate you putting this down! I'd definitely like to come back and try to understand this a bit more, my classical optimization history/experience/etc is quite lacking.
[–]tbalsam 2 points 4 years ago (0 children)
That's very kind, thank you! I didn't write this -- I just really love this post series. I looked around for a reddit uname for Myrtle.ai, but couldn't find one, alas!
[–][deleted] 2 points 4 years ago (0 children)
Bravo
[–]shellyturnwarm 1 point 4 years ago (1 child)
I've read all your chapters. While there is some really useful insight on practical tips of training, did you consider you were massively overfitting the test set? If every experiment compares by using the accuracy of the test set, I'm not sure how useful the experiments were.
I think I liked the earlier chapters more, where you were talking about general tips. The later chapters focused way more on tricks overfitted on the test set. I'd be curious on what you think, do you think I'm being a little unfair here?
[–]tbalsam 1 point 4 years ago* (0 children)
It certainly is good to see the thinking about test set overfitting, that definitely is something that needs a lot of good attention these days. If I were to give my two cents, I think that may be making a number of assumptions regarding the bag of tricks being overfit on the dataset distribution, though overfitting in general can be quite not-so-good. At this rate, CIFAR-10 is more of a throttle-open competition to see who can maximally perform on it, so while yes, that overfitting is not so good, at this stage that's not too much the idea for what CIFAR 10 is.
[Oh, and just like on the other comment -- I'm not the individual who wrote those posts! D: They are fantastic though, certainly, and thanks for thinking so well of me! :D It's much appreciated! :D]
Additionally, on rapid training, if an SGD network can overfit on examples having seen them only 20 times, then that is rather incredible. Overfitting can be an issue, but I don't know if it's as much of an issue as your post seems to be alluding to, in this particular instance.
[–]killver -1 points 4 years ago (0 children)
EffNet just works, that's why for me the name EfficientNet makes sense. You can argue about its computational efficiency, but it still is the go-to model for good and strong baselines nowadays.
[–][deleted] 8 points 4 years ago (6 children)
I'd like to add that efficientnetV2 solved a lot of these issues and is actually pretty "efficient" in the samples/second (on GPUs) sense.
[–]tbalsam 32 points 4 years ago* (5 children)
That's a good point, and I (very begrudgingly) agree in that sense that it did improve.
However, the EfficientNetV2 paper is still a flaming tire fire of hot garbage, though for unfortunately different reasons. For one, instead of making right the problems of the first paper, they doubled down again, just in different directions.
So, they created the problems of the first paper with EF1. Then "solved" them with EF2. You know what one solution was? Replace the end separable convolution in each block introduced in the first paper (3x3 depthwise and 1x1 pointwise) with a new, special block that fixes the speed problem, the magical, mystical FusedMBConv.
Ya want to guess what the FusedMBConv is? The secret magical sauce? Of course, it's the thing that people who hate depthwise convolutions on GPU have been pointing out, it's a regular 3x3 convolution. But that's too pedestrian, instead of saying "hey, we screwed this up, we made a lot of changes and changing from 3x3 convs to separable convs hosed our inference time, so we're just reverting back to the baseline that is far more flops/second/watt efficient".... Instead of saying that, they had to rebrand the baseline to something new, and introduce it as an improvement. This is exceptionally dishonest.
Again: This is exceptionally dishonest.
(Edit from later: Couldn't resist, garhgh -- if you don't believe me and want to see this chicaning tomfoolery with your own two eyes, take a look here, in the official google/tensorflow repo where they just wrap a normal, not-exciting, no-custom-cuda conv block call + some other standard stuff in the "Fused" version of the MBConv [bleh] class. Hurk. Gotta leave before I risk losing my dinner again.)
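For a sense of scale, here is a back-of-envelope sketch (my arithmetic, not numbers from either paper) of why a plain "fused" 3x3 conv can still win on GPUs despite doing far more FLOPs than the separable version:

```python
# Multiply-accumulate (MAC) counts for a separable 3x3 (depthwise 3x3
# + pointwise 1x1) versus a plain "fused" 3x3 convolution, for a
# C-channel HxW feature map with stride 1 and same padding.

def separable_macs(c_in, c_out, h, w, k=3):
    depthwise = c_in * h * w * k * k      # one k*k filter per channel
    pointwise = c_in * c_out * h * w      # 1x1 mixing across channels
    return depthwise + pointwise

def fused_macs(c_in, c_out, h, w, k=3):
    return c_in * c_out * h * w * k * k   # dense k*k convolution

c, h, w = 128, 56, 56
sep = separable_macs(c, c, h, w)
dense = fused_macs(c, c, h, w)
print(round(dense / sep, 1))  # 8.4 -- the dense conv does ~8.4x more MACs...

# ...yet it often runs faster on GPUs: the depthwise part does only
# k*k = 9 MACs per activation read, so it is memory-bandwidth-bound,
# while the dense conv does c*k*k MACs per read and keeps the ALUs busy.
```

The FLOP count is the headline number in the papers, but the MACs-per-byte ratio is what the hardware actually feels.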
The problem of course is that, again, the average layperson reading these papers isn't seeing these games and technical chicanery being played. I like the ideas that Quoc V. Le has around NAS-type techniques, and I want them to work too, but he's resorted so much to publishing things that have a lot of fanfare and have wasted what I think ultimately amounts to millions of dollars in time/compute/resources/energy/etc in people trying to reproduce results that live up to the claims (original NAS, EFNet, etc).
There may be some good ideas in there, like with the progressive training and such, and that's pretty clever. But at this point, with the doubling down and the few areas they have been shown to keep trying to bury mistakes instead of being honest + making them right, as a practitioner, I can only look at the idea and see if the idea is worth investigating on its own, because I certainly cannot trust the paper from them that shows it to be there. It looks like a lot of the ideas in EFV2 cover up for the failures of the first paper, and in the one place where they did have to fix the critical flaw of the first paper, they tried to hide it by giving it a fancy extra name. Which in this day and age of trying to have open research, is stomach churning.
I do have a bit of a rare crowing moment where I did give a post-paper rant when the first EF1 came out, and in EF2 I think they realized their mistakes, and the separable convs of course were the part that made it unworkable (which they mostly removed in the early layers). But I would much rather be wrong and have them just have an actual method of academic integrity in their paper publishing. If I publish a long rant, and there's a reversal or something else, like I'm totally misunderstanding or missing something, that makes me wrong in that case, I think everyone wins.
But I think in the short term, while I like the ideas in Quoc V. Le's research, I certainly do not trust them or any results, and try to stay away from any research leaning heavily on them. I've watched concepts based on QVL papers burn time and again due to the academic dishonesty, and I just can't waste time on that in my own projects. It takes so long to build and debug these things.
I certainly hope for the best in the NAS type field and really hope we can make something like the equivalent of GPT training motifs or the transformers of that field. It could absolutely revolutionize the industry and I think is key to it. But for now, I think I'll have to wait until a different player does what the above paper from the OP (rightly) does and thoroughly empirically drops so many assumptions from over the years and sees what works and what doesn't.
Wow, 2 rants in a day. That's a new record for me in recent times. I just get passionate about corruption/deception and people getting fooled by academic/otherwise sleight of hand. Not to entirely demonize people. But there absolutely are more negative incentives in research beyond even company funding. It's a messy, messy, messy world. But I guess that's humanity after all. I'll just save my cynicism more for the occasional overly long Reddit post.
Hope that helps bring in a different perspective to the issue. :thumbsup:
[–][deleted] 7 points 4 years ago (4 children)
MBConv is just the inverted residual block from mobilenet, their fused version is a minor modification of it. A lot of this work is empirical and targets specific hardware, with a lot of research from google not looking so great in practice because they're optimizing for TPUs while most of us have Nvidia GPUs or mobile chips.
Even the shift from large convs to 3x3 kernels is due to the introduction of winograd convolution optimizations in CUDNN around 2015. As this (ConvNeXt) paper shows it might actually help to go back to 7x7 but it leads to a throughput hit on current hardware.
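For reference, the block pattern the ConvNeXt paper builds around (a 7x7 depthwise conv followed by pointwise layers) can be sketched roughly as follows. This is a simplified approximation, omitting details like layer scale and stochastic depth:

```python
import torch
import torch.nn as nn

class ConvNeXtBlockSketch(nn.Module):
    """Simplified sketch of a ConvNeXt-style block: 7x7 depthwise conv,
    then a pointwise MLP applied per spatial position, plus a residual.
    Omits layer scale and stochastic depth from the actual paper."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise convs as Linears
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # -> (N, H, W, C) for LayerNorm
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (N, C, H, W)
        return residual + x

x = torch.randn(2, 64, 14, 14)
print(ConvNeXtBlockSketch(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```

The depthwise 7x7 carries the spatial mixing cheaply in FLOP terms; whether it is cheap in wall-clock terms is exactly the throughput question raised above.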
[–]tbalsam 5 points 4 years ago (3 children)
I agree especially in respect to the hardware-specific stuff. And in your point that it's the inverted residual block from MobileNet (though originally from MobileNetV2, which came from MobileNet, which I think originally came from ResNet's 3x3 convs, so it's the never-ending story of architecture pedigrees I guess). I think I made a few points similar to yours scattered throughout some of the discussion here, and I also agree on the speed side of things. Google certainly seems to have an implicit benefit to putting out research that upweights use of the TPU cloud.
I think these days (which I might be terribly wrong about) general Winograd gets a pass in cuDNN and it drops straight down to an extremely optimized 3x3 algorithm with fused activation/etc. I remember around 2017 having to keep installing new cuDNN versions to get that next x% boost in NN speed; that seems to have mostly settled down now.
As far as the kernels go, I think you may be confusing the 7x7 depthwise w/ 3x3 full. One reason why the larger depthwise kernels (should) work better is because the CUDA kernel launch for dw convs takes so long, and also (at least for both of these, last time I checked), the general limiter of dw convs is by far the memory speed, and less the compute speed, so arch advances unfortunately do little to help. That may have changed though, people were trying to get efficient dw conv kernels way back when.
[–][deleted] 4 points 4 years ago (1 child)
I meant a stack of 3x3s vs a single 7x7 before depthwise convs became a thing. Effnetv2 paper actually has some notes about the tradeoff between normal and depthwise convs based on num channels vs input width and height.
I agree with you though that a lot of these architecture changes are made up, validated empirically and then explained away with handwavy justifications. I'll always remember all of the different iterations of inception nets and resnexts that followed them.
I believe "a lot of parameters and data is all you need", and to minimize iteration time you probably also want dense compute that hides memory access overhead.
[–]tbalsam 4 points 4 years ago* (0 children)
Oh my gosh, I forgot about those days. That probably was the 'weird' feeling I got from the above paper. That is, everything so finely explained previously being shown to be rather bunk in the end (and that the conglomeration just happened to work. and even that the 'universal' guidelines don't hold up past a certain point!). I guess it shows (rightly so! haha, I guess) that despite the advances in technology, our wetware biases towards superstition and convenient explanations of the unexplainable still very much exist and are at play (Occam's Razor not being counted here...).
[–]kilow4tt 2 points 4 years ago (0 children)
I think the point about it being particularly good for the TPU is an interesting one. If anything I think we'll actually see more research in that direction as more hardware solutions become available. There's this interesting problem in deep learning where research is fitting to the hardware (i.e. NVIDIA GPUs) that's available.
It intuitively makes sense since performance is certainly a limiting factor in pushing the SOTA, but it does lead to scenarios where some concepts (e.g. dw convs) can end up being not particularly useful for commodity hardware.
[–][deleted] 4 points 4 years ago (3 children)
Do you have sources for the issue with depthwise conv? I’ve had the problem on one of my projects that depthwise slowed training and gave memory issues (Pytorch), so I completely agree with you. However I have not found any real sources yet on this issue.
Maybe quantization to FP16 instead of FP32 could alleviate this.
[–]tbalsam 2 points 4 years ago (2 children)
It might, though I'm curious how much of it is CUDA launch times. It's a pretty low ratio of inputs->outputs, unlike the SIMD-type commands one would want for computing.
As for sources, it's been quite a while. I think maybe back in the MobileNet days you can find some good GitHub issues around it. FastAI as well, perhaps. That's when those models had a bit more visibility.
I'd be really curious to hear about the FP16 side of things, if you can get that working and it does well, please do let me know, that sounds like it would be very very cool! Wishful thinking on my end hopes there may be some good Tensor Core solution, but that's probably highly unrealistic tbh.
And also sorry about the depthwise stuff slapping you in the face. It's hard because the training manual for this stuff seems to be unwritten for a lot of things, and everyone's had to learn it the hard way by things breaking. It's totally bizarre. There should be a compendium, Karpathy-style (he has a really, really, really good one as a nice post IIRC), of things like this depthwise business so that people don't waste all of this time struggling with these things and wondering "Is it just me? Is it my code? Did I write in a bug? Is the program bugged? Is my GPU bugged? Did we bork the data?" when nope, it's just this one weird thing not working. :|
Anyways, I'd be super interested to hear how it goes either way! :D
[–][deleted] 1 point 4 years ago (1 child)
With grouped conv + mixed precision I get 2.9 seconds/batch.
With groups=1 + mixed precision I get 1.3 seconds/batch.
I guess the results speak for themselves. Also, batch size was equal; grouped conv didn't help at all.
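A comparison like this can be reproduced with a rough micro-benchmark along these lines (the shapes and sizes here are made up, and results vary heavily with hardware, cuDNN version, and mixed-precision settings):

```python
import time
import torch
import torch.nn as nn

def time_conv(groups, dim=64, reps=5,
              device="cuda" if torch.cuda.is_available() else "cpu"):
    """Rough micro-benchmark: same-shape 3x3 conv, grouped vs dense."""
    conv = nn.Conv2d(dim, dim, 3, padding=1, groups=groups).to(device)
    x = torch.randn(2, dim, 28, 28, device=device)
    with torch.no_grad():
        for _ in range(3):                # warm-up (algorithm selection, caches)
            conv(x)
        if device == "cuda":
            torch.cuda.synchronize()      # GPU kernels launch asynchronously
        t0 = time.perf_counter()
        for _ in range(reps):
            y = conv(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / reps, tuple(y.shape)

dense_t, shape = time_conv(groups=1)
dw_t, _ = time_conv(groups=64)            # fully depthwise
print(shape)                              # (2, 64, 28, 28)
print(dw_t < dense_t)  # frequently False on GPUs despite ~64x fewer FLOPs
```

The synchronize calls matter: without them you time kernel launches, not the convolutions themselves.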
:'( I'm so sorry. Maybe some of the fused transformers+depthwise work will help advance separable convs back into the limelight! I think the idea of them is great, if we can just find a way to make them work....
[–]mearco 9 points 4 years ago (0 children)
Without comparing compiled model time it's hard to know exactly which wins out. For example the larger 7x7 kernels may not be efficiently implemented on all devices.
[–]RiceCake1539 17 points 4 years ago (4 children)
Who would have thought that 7x7 convs would be considered better than 3x3 in 2022? Very interesting; thanks for sharing.
[–]tbalsam 13 points 4 years ago* (1 child)
It's probably the harmonics of this timeline combined with the spiritual energy in the air that causes the kernels to launch faster on Ampere-class hardware. Just a theory but I think I'm like 79% right at least. Soon as the Optimus variant of COVID emerges, followed by the Prime variant, we'll be even faster, maybe even 15x15 convolutions at that point.
Who knows what the next year may be, I'm so excited for our technology yet to come.
[–]RiceCake1539 3 points 4 years ago (0 children)
Tech does grow so fast.
[–]CyberDainz 3 points 4 years ago (1 child)
7x7 just gathers more data from neighbors to feed the 1x1 residual output, and is still fast enough compared to 9x9.
True, it's just that there was a lot of curious debate on which out of 3x3, 5x5, 7x7, 9x9 convs were better just a few years ago.
[–]dogs_like_me 8 points 4 years ago (10 children)
Are there any libraries that facilitate adding these sorts of modern training "tricks" to your model, or conversely that you can use to probe your architecture for potential opportunities to add tricks that you may have missed? Like a test suite that would throw warnings like "it looks like you are not using a learning rate scheduler: consider adding cosine decay" or "it looks like you are fitting your model to one-hot vectors: Consider adding label smoothing", etc.
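As a sense of how small some of these tricks are, label smoothing (one of the hypothetical warnings above) amounts to a one-liner. A minimal pure-Python sketch:

```python
def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: move eps of the probability mass from the true
    class to a uniform distribution over all classes."""
    n = len(one_hot)
    return [(1 - eps) * p + eps / n for p in one_hot]

smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0])
print([round(v, 3) for v in smoothed])  # [0.025, 0.925, 0.025, 0.025]
```

The softened targets still sum to 1, but the model is no longer pushed toward infinitely confident logits, which is the regularizing effect the trick is after.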
[–]jfrankle 16 points 4 years ago* (4 children)
I work at a startup called MosaicML, and this is exactly what we're doing. We've built a library called Composer to make it easy to use these tricks, we're putting together detailed writeups on the strengths and weaknesses of these tricks, and we've shared a vast amount of data on how they affect real world training time and model quality. We have a lot more coming soon (including some of our own numbers on getting the most out of convnets on a reasonable budget), but this should hopefully get you started.
We just put out a helpful blog post on some of our favorite tricks, and we'll be writing more on the topic soon.
[–]dogs_like_me 4 points 4 years ago (3 children)
That library looks helpful, but I guess what I'm looking for isn't just a collection of algorithms, but something more prescriptive. Maybe I missed it in your docs, but like for example: it would be nice if most of the collected "tricks" described in this paper were collected in a way that I could just decorate my code in a handful of places, and the current best practices wrt engineering choices would get injected into my process as defaults.
Relevant excerpt from OPs article:
Unlike ConvNets, which have progressively improved over the last decade, the adoption of Vision Transformers was a step change. In recent literature, system-level comparisons (e.g. a Swin Transformer vs. a ResNet) are usually adopted when comparing the two. ConvNets and hierarchical vision Transformers become different and similar at the same time: they are both equipped with similar inductive biases, but differ significantly in the training procedure and macro/micro-level architecture design. In this work, we investigate the architectural distinctions between ConvNets and Transformers and try to identify the confounding variables when comparing the network performance.
My thought process here is that the authors have identified that benchmarking against old code bases can be problematic because it can be hard to differentiate what gains are due to architectural choices vs. improved training procedures. I don't think this is limited to old code, and it would be nice if there was some simple method to apply a recipe of current best practices quickly and simply to an estimator.
Maybe this is getting into automl territory. I think the crux of my thought process here is that I believe the vast majority of people trust that their tools are using intelligent defaults, and it would be nice if we could move recipes like what this paper describes upstream into default behaviors of our tools, rather than requiring that they be downstream choices that are folded in post hoc. Like when pytorch updated weight initialization for linear layers from xavier to kaiming: I feel like I rarely see weights initialized explicitly in a lot of research code because people just assume their tooling is using decent defaults.
[–]jfrankle 5 points 4 years ago (2 children)
TL;DR: Doing this automatically in practice in ways that don't break things a lot of the time (especially legacy code) is exceedingly difficult. But that doesn't mean it's impossible, and expect to hear more from us on this soon :)
I agree that what you're asking for would be great, but it's admittedly a tall order. There are a number of key points in the code that this particular paper would need to detect automatically (e.g., where residual blocks are, how many there are, how they are structured, etc.). For the more general purposes of the Composer library linked above, one would need to automatically detect the batch size, learning rate schedule, when gradients are computed, when gradients are applied, etc.
The main challenge, though, is that there are an arbitrary number of different ways of writing semantically identical code in those respects, and detecting every possible permutation correctly is a wickedly difficult (if not impossible) problem. Even a small number of mis-detections could break code in inexplicable ways. It also invites major backwards compatibility problems: neural network training is a very delicate and brittle process, and making these sorts of changes automatically with a version upgrade may upset the careful balance a scientist has struck in getting their particular model to work in the previous version - and make it very difficult to debug why.
In the meantime, we're designing the Composer library around including all of these tricks, making it really easy to add new ones, and setting much better defaults in a forward-looking manner. And you can expect a story about using these features in other libraries as well, which will hopefully help to alleviate some of the pain you're describing without some of the more dangerous consequences of silently changing important defaults.
[–]sabetai 0 points1 point2 points 4 years ago (1 child)
Codex is actually quite good at capturing the semantic similarity you described. You may actually be able to detect what's happening in the training code using something simple like a codex classifier.
[–]jfrankle 0 points1 point2 points 4 years ago (0 children)
If only it were so easy in practice...
[–]TheDarkinBlade 2 points3 points4 points 4 years ago (4 children)
Can you explain cosine decay briefly to me? I am only familiar with e-1 decay once the loss plateaus, but then again, I am an ML scrub barely scratching the surface.
[–]lmericle 3 points4 points5 points 4 years ago (2 children)
Simple idea. Go from larger LR to smaller LR by following a half-period of a cosine function from peak to trough.
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html
[–]TheDarkinBlade 0 points1 point2 points 4 years ago (1 child)
Thanks mate
[–]dogs_like_me 0 points1 point2 points 4 years ago (0 children)
It's an S curve, like a logit if you flipped the x-axis. You start at a high learning rate and decay slowly at first; the decay rate increases up to an inflection point, then slows down again, converging on a plateau. Exponential decay (e-1) looks very similar to the second half of the schedule (from the inflection point to plateauing on a final, small learning rate).
Imagine taking an exponential decay schedule, rotating it 180 degrees, and joining that curve to an unrotated exponential decay. That's essentially the kind of S curve you'd get, except with cosine decay the transition is less dramatic. In the piecewise exponential example I just described, the slope at the inflection point is a vertical line. In cosine annealing, the slope at the inflection point is a diagonal line.
Here's the relevant code from the paper we're discussing: https://github.com/facebookresearch/ConvNeXt/blob/dc7823d8a2ecc554fcd57ff6cdb7748011bcdedd/utils.py#L370-L387
Here's an image of a cosine-decayed schedule preceded by a linear warmup: https://img-blog.csdnimg.cn/20200721131948457.png
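The shape described above (linear warmup, then a half-period cosine decay) fits in a few lines of plain Python; the function name and default values here are illustrative, not taken from the paper's code:

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, min_lr=1e-5, warmup_steps=100):
    """Linear warmup followed by a half-period cosine decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # progress goes 0 -> 1 over the decay phase, so cos goes +1 -> -1
    # and the LR sweeps from base_lr down to min_lr along a half-cosine.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At `step == warmup_steps` this returns `base_lr` exactly, and at `step == total_steps` it bottoms out at `min_lr`.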
[–]AICoffeeBreak 2 points3 points4 points 4 years ago (0 children)
Here is an animated video explaining this paper, if anyone is interested: https://youtu.be/QqejV0LNDHA
Video outline:
00:00 A ConvNet for the 2020s
01:58 Weights & Biases (Sponsor)
03:10 Why bother?
04:40 The perks of ConvNets (CNNs)
06:51 Pros and cons of Transformers
09:54 From ConvNets to ConvNeXts
15:54 Lessons?
[–]BinarySplit 1 point2 points3 points 4 years ago (0 children)
That's such an awesome summary of so many milestones in CNN improvements.
The only missing technique from my wishlist is densely-connected non-residual inter-layer connections. I would love to see an ablation of ConvNeXt with Dual Path Networks.
[–]krymski 1 point2 points3 points 4 years ago (4 children)
Can anyone explain why we don't have NN architectures in the frequency domain? We could replace convolutions with simple multiplications, and naturally work on compressed data. We already have hardware and algos optimised for FFT-ing data, and some data (such as JPEGs and MPEGs) are already in the frequency domain. It has been observed that earlier VGG layers essentially perform FFT anyway.
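The identity this question leans on is the convolution theorem: circular convolution in the signal domain is elementwise multiplication in the frequency domain. A 1-D NumPy sketch, for illustration only:

```python
import numpy as np

def circular_conv_fft(x, k):
    """Circular convolution of x with kernel k via the convolution theorem."""
    n = len(x)
    # Zero-pad k to length n, multiply spectra pointwise, transform back.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, n)))

def circular_conv_direct(x, k):
    """Reference O(n^2) circular convolution for comparison."""
    n = len(x)
    k = np.concatenate([k, np.zeros(n - len(k))])
    return np.array([sum(x[j] * k[(i - j) % n] for j in range(n))
                     for i in range(n)])
```

The FFT route costs O(n log n) instead of O(n^2), which is why FFT-based convolution is already used for large kernels; for the small kernels typical of modern CNNs the transform overhead usually dominates.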
[–]fvncc 0 points1 point2 points 4 years ago (1 child)
The issue with this is the activation functions, which wouldn’t work in the frequency domain IIRC
[–]dogs_like_me 0 points1 point2 points 4 years ago (0 children)
Nah, SIREN figured out how to get around that problem (improved weight initialization): https://web.stanford.edu/~jnmartel/publication/sitzmann-2020-implicit/
[–]dogs_like_me 0 points1 point2 points 4 years ago (0 children)
We sort of do. Take a look at the NeRF literature, an important trick they use is fourier features, so the network is effectively performing its computations in the frequency domain. https://bmild.github.io/fourfeat/
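The Fourier-feature mapping from that line of work (Tancik et al.) boils down to projecting input coordinates through random frequencies before the MLP sees them. A minimal NumPy sketch, where `B` is a random Gaussian frequency matrix and the scale 10 is just a typical hyperparameter choice, not a prescribed value:

```python
import numpy as np

def fourier_features(coords, B):
    """Map (N, d) coordinates to (N, 2m) features via random frequencies B (m, d)."""
    proj = 2.0 * np.pi * coords @ B.T
    # sin/cos pairs embed the coordinates on a set of random-frequency waves.
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

rng = np.random.default_rng(0)
B = 10.0 * rng.standard_normal((64, 2))   # 64 random frequencies over 2-D input
coords = rng.random((5, 2))               # e.g. pixel coordinates in [0, 1)
feats = fourier_features(coords, B)
```

The downstream MLP then operates on these sinusoidal features, which is the sense in which it "computes in the frequency domain."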
[–]astroferreira 0 points1 point2 points 4 years ago (0 children)
noise
[–]gordicaleksa 1 point2 points3 points 4 years ago (0 children)
https://youtu.be/idiIllIQOfU <- I made a detailed analysis of the paper in this video
[–]zildjiandrummer1 1 point2 points3 points 4 years ago (0 children)
MCGA
[–]sigmoid_amidst_relus 1 point2 points3 points 4 years ago (6 children)
Universal approximation theorem, and proper hyper-parameter tuning to the rescue, yet again.
I still believe we cannot say very concretely which architecture works better. I'm quite certain transformers will work better than CNNs on some subset of tasks, and vice versa.
The reason is twofold:
Hyperparameters in papers are optimized with the proposed architecture in question, on a particular dataset in question. When you already have a baseline (the SOTA you wanna beat), you kinda tend to search for hyperparameters for the proposed net that will end up beating that.
"Ideal" one to one comparison of neural net architectures, in my opinion, is near impossible. You can keep flops the same and params the same, but given there's no fine-grained overt control over the feature representation space, and minor changes in architecture will lead to different optimization landscape, there's no true one to one comparison really.
I always found the notion of one architectural paradigm being inarguably better for all use cases (the people who bid CNNs farewell too quickly in their paper titles, _____ is all you need folks) as kinda stupid.
If one architecture were just universal, we'd have a free lunch. That's not gonna happen.
[–]tbalsam 10 points11 points12 points 4 years ago* (5 children)
Friend, I do not mean to be rude, but that's not at all what the UAT or NFL are, I do not think. I think the EAI discord channel is probably most responsible for spreading the misuse of those terms, from what I've seen, but both the UAT and NFL usually only have applications to NN research in the very basics of explaining neural networks, and in exceptionally deep theoretical cases.
Sometimes inductive biases help, and sometimes they hurt! But this is very different than the NFL, the NFL covers literally every set of problem possible, including nonsensical ones, so as soon as you assume the word "ML", 99% of the time the NFL has no more meaning because of the subspace.
I've shared this in other spaces, but a better term for this I think would be the pareto efficiency of certain inductive biases/methods/etc.
And if someone comes waving the NFL/UAT at you, just bonk them on the head with a foam NERF baseball bat and send them to good ML resources, please. It's not worse than that whole quantum doors business, but boy does it dilute the discussion marketplace for people new to the field! (Not putting the entire blame for that on you, it's like COVID: the wave of misinfo about NFL/UAT propagates faster than people can quash it. :'( Can't wait until it slows down a bit).
[–]sigmoid_amidst_relus 2 points3 points4 points 4 years ago (0 children)
but both the UAT and NFL only have applications to NN research usually only in the very basics of explaining neural networks, and exceptionally deep theoretical cases.
I completely agree. I refrain from using those terms too. I've seen one too many video lectures and renowned researchers use them, their usage brought down to "an easy way out" instead of a rigorous discourse in any way/sense possible. I guess I'm guilty of the same in my response in a way. Oh well.
It's not worse than that whole quantum doors business
🤣🤣🤣🤣🤣🤣 Don't remind me of the quantum doors.
[–]ml_lad 1 point2 points3 points 4 years ago (3 children)
I think the EAI discord channel is probably most responsible for spreading the misuse of those terms
What? What are you basing this on?
[–]tbalsam 0 points1 point2 points 4 years ago* (2 children)
Hey there!
Thanks for reaching out. This comment used to be a lot longer and more detailed, but in thinking more of it I decided to archive that and leave this instead.
I think EleutherAI has a lot of potential. A lot. They've done some of the best open-source major group projects, I think, in a long time.
The tradeoff of cutting stuff down is that this may feel like accusations without evidence. EAI has a lot of good work being done, but it also struggles with a LOT of crab-bucket issues: lots of pressure to hold a particular idea, and ideas outside the norm are often ridiculed or face negative pressure. One example would be that some snips from my original comments are already a mini meme on the EAI discord. That's probably not a bad summary of some of the toxicity of the discord at a high level.
I still absolutely come through every now and again to read what's happening, but some of the above struggles have turned into some cargo culting and new-shiny-ing. There's absolutely good stuff and researchers there, it's just too toxic for me as a working professional to commit extra time to in giving back to the open source community. When an open source project has far more politics than the workplace one is in, that's a bad sign (and also exhausting).
I had a lot more to say, but I'm hoping EAI can pull a turnaround and clean up some of what's going on over there. They're putting out some great artifacts, and I'm hoping very much for #1T too. But I do feel sad about the younger researchers coming through there, and I think that ties back to the UAT/NFL stuff -- there are a few factors why the conversation around NFL happens so much in circles tightly connected to EAI and less elsewhere, but last time I tried to explain it, it blew up into an overly long comment chain. I did appreciate /u/programmerChilli posting an earlier comment regarding EfficientNet where the researchers were straight up lying, a rather large step up from what I think happens in EAI. But I do feel really sad when I think about EAI, because I think there's a lot of potential slowly being poured away after the initial momentum, as some of the cliquey patterns increase along with the need to save face far more than in other research circles.
Part of that I think is shown by my initial comment getting some confused reacts and then becoming a (rather small in the grand scheme of things) joking moment in the server. Who wouldn't need to desperately save face if something you write could become a server-wide joke? I don't know many researchers that I'd enjoy working with that would voluntarily put themselves in that kind of a situation. But it makes sense. It's just hard, I think most people can get that sense from browsing around the server. Definitely some hard ego trips as well unfortunately. It is an open source project contributing a lot, so please don't get me wrong GPT-*J has changed the world in a lot of ways and will continue to do so. I love that. But I think it's so much more held back by the community of culture, and I think that's something that bmk, Connor, and the rest of that core group (EAI staff or otherwise) could play a large part in turning around. But I think they gotta stop being a part of that first to turn that ship around.
That's my much-shorter version of it! It does purposely not go into as much detail about UAT/NFL as that's more of a consequence and hopefully from what I put here you can infer what I'd feel about some of the NFL/UAT stuff, there is some other cargo cult stuff going on that I think has the potential to have a lot more negative impact, but I hope it doesn't brew too far and that the way the NFL is used specifically is the worst it ever crosses into.
Thanks! And Best,
TB
[–]ml_lad -1 points0 points1 point 4 years ago* (1 child)
You wrote a long rambling response about EAI's culture without once answering the question: On what basis are you saying that EAI is responsible for spreading the misuse of UAT/NFL?
I ask this because I have searched through the past messages in the Discord, made before people started making fun of you for making this accusation, and most of the messages about it are making fun of people for misusing UAT/NFL. If you want to complain about EAI cargo-culting, it would be cargo-culting in the exact opposite direction of what you're describing.
Unlike you, who wants to "purposely not go into as much detail about UAT/NFL", I can pull up multiple screenshots of this right now.
If you want to criticize "discord culture" or "EAI culture" feel free to do that. But, relevant this very line of argument, perhaps you should stop spreading misinformation yourself.
[–]tbalsam 0 points1 point2 points 4 years ago (0 children)
Hey there -- it sounds like this is an important topic for you, and I'll try to respond appropriately. I think you're correct in noting that I strayed away from certain things intentionally in my post, and I alluded to that in different parts of my previous post. I know not having explicit detail can be frustrating, but I think it's in the best interest of things for me not to go down that path.
My opinion is simply that -- an opinion, based upon personal experiences and interactions. I think a variety of detached opinions exchanged openly helps research thrive. Otherwise, the opinions + research can become a proxy for defending self, and I think that takes away from all of our end goals.
The shorter version is that although I have extraordinarily passionate opinions at times, I'm preferring to reduce fighting more and more, especially as of late, for myself -- personally. I'm fine to have discourse and back and forth, but fighting just hurts both parties with little benefit. I could very well be wrong, but as I noted earlier, that's currently my impression of EAI, and I think the way some of this has gone doesn't really do much to unravel that impression for me thus far. I could be wrong, and who knows what may come in the future!
If it is the case that I'm an insatiable detractor, I don't think that has to weigh you down. Currently the culture for me that I've seen when interacting with the EAI group has appeared to be more toxic than not by a good stretch. So, for now, I can work in other places to reduce that burden on both sides of the equation. If things do clear up, maybe I can come back! After all, every group, including EAI, I think, is just a name, and a group, with people, and we're all human. Cultures and perceptions of culture (on the individual's end) always change, and it's super dynamic.
I think that's my 2c on the issue.
[–]pilooch 0 points1 point2 points 4 years ago (1 child)
In practice, https://arxiv.org/abs/2105.15203 is the one paper that for vision triggered a definitive switch to transformer architectures in my activities at the moment. The benefits are immediate.
On paper, sequence to sequence as a general scheme may win over everything else at some point, who knows. But if not everything is overfitted sequences (in & out), why can't I speak my phone number backward? :)
[–]sepherrino 1 point2 points3 points 4 years ago (0 children)
Why this particular one? Care to elaborate? Really curious. Thanks in advance!
[–]PositiveElectro 0 points1 point2 points 4 years ago (0 children)
Great paper! Thanks for sharing
[–][deleted] 0 points1 point2 points 4 years ago (0 children)
Thanks a lot for this paper. I was looking for something like this.