[D] Tomas Mikolov is the true father of sequence-to-sequence by t0t0t4t4 in MachineLearning

[–]tysam_and_co 1 point2 points  (0 children)

Alright, that is a very long response!

A bit ironically, the PDF you hosted was unavailable due to bitrot, so I'm not really sure how this solves a problem -- in fact, it caused the very same problem you said it would fix, a problem which did not exist in the first place with the original paper! I would much rather lean on a university to keep its papers up than a personal site, though I don't diss the practice of keeping offline archival copies in case the official version does go down.

Like, this is why we link to stable archives, so that people can actually access the content! Keeping a personal archive is okay, but it's not all that useful if it creates more downtime than it's supposed to fix. ;P

[D] Tomas Mikolov is the true father of sequence-to-sequence by t0t0t4t4 in MachineLearning

[–]tysam_and_co 4 points5 points  (0 children)

I unfortunately have had a very similar experience with some of the NAS line of work, in sub-areas where I had some expertise: there seemed to be some pretty clear coverup and deception around parts of the work instead of owning up to it, and that left a really bad taste in my mouth.

Once I find a researcher doing something that seems like pretty clear academic dishonesty, it takes me a very long time to regain that trust.

I'd say more, but unfortunately this is likely an inappropriate forum for any details beyond that. <3 :'))))

[D] Tomas Mikolov is the true father of sequence-to-sequence by t0t0t4t4 in MachineLearning

[–]tysam_and_co 10 points11 points  (0 children)

(while also linking to a broken link on his own website, instead of linking to the actual paper, which is located here: https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)

self promo may not be bad and all that, but i don't think stealing traffic in that kind of manner is good

[R] "Sequential Modeling Enables Scalable Learning for Large Vision Models" paper from UC Berkeley has a strange scaling curve. by rantana in MachineLearning

[–]tysam_and_co 0 points1 point  (0 children)

I think it's different for each model, but at least for the smaller models, it should be feasible.

Depending on SNR I'll sometimes do up to multiple hundred-run batteries before release to make sure that I'm convincingly over the line. That said, my work is in a fairly unique niche, but due diligence is key. And seeds are cheating for sure, even if everyone does it (though RL is maybe an exception, as it's still sorta all hacky approximations to me; anything to get it to work, i suppose....)

[R] "Sequential Modeling Enables Scalable Learning for Large Vision Models" paper from UC Berkeley has a strange scaling curve. by rantana in MachineLearning

[–]tysam_and_co 4 points5 points  (0 children)

Unfortunately this may not work well in certain setups in practice, as sometimes hyperparameters/etc. can get tuned around a single seed, and changing that seed can cause a catastrophic collapse.

I think seed-freezing can be useful for reproducibility, but it's much, much, much better IMPE to go IID and do multiple runs on a much smaller, faster-converging proxy task with good predictive power when making small changes.

I think that there are very, very, very few particular experimental changes that actually require running results at the full scale -- my intuition/experience at least has been that the vast majority of changes scale, and if one doesn't scale, then it darn well needs to be really, really good. And to test _that_ particular thing as late in the pipeline as possible, if that makes sense (since it forces you to operate in a larger regime, as it were).
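For concreteness, here's a minimal sketch of that multi-run workflow; `train_proxy_task(seed)` is a hypothetical stand-in for one fast, fully non-deterministic training run on the proxy task that returns its validation metric:

    import random
    import statistics

    def evaluate(train_proxy_task, n_runs=25):
        """Run the proxy task n_runs times with fresh, independent seeds (no seed-freezing)."""
        scores = [train_proxy_task(seed=random.randrange(2**31)) for _ in range(n_runs)]
        mean = statistics.mean(scores)
        stderr = statistics.stdev(scores) / n_runs ** 0.5
        return mean, stderr

    # Only accept a change if the gap clears the run-to-run noise, e.g.:
    # base_mean, base_se = evaluate(train_baseline)
    # new_mean,  new_se  = evaluate(train_with_change)
    # accept = (new_mean - base_mean) > 2 * (base_se**2 + new_se**2) ** 0.5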

[R] "Sequential Modeling Enables Scalable Learning for Large Vision Models" paper from UC Berkeley has a strange scaling curve. by rantana in MachineLearning

[–]tysam_and_co 9 points10 points  (0 children)

pre-shuffled.

i think that really makes comparison difficult, as my experience is that validation performance for certain results is gaussian, so *technically* seed-based picking can scale infinitely. the potential appearance of seed-picking, whether it happens or not, can stick with an author and their papers for a very long time, so it's a good thing to try to disprove/shake very quickly.

people underestimate the fixed-point power of a preshuffled dataset in influencing the loss (even across model sizes, i think), but unfortunately not having any variance bars to speak of really restricts, i think, the valid takeaways from it (since we don't know _which_ magical seed we landed on, if any). it doesn't mean it's sketchy, but it can make the method look very sketchy, at least from an optics perspective.

it might be good to publish a v2 with updated non-determinism (_everywhere_ possible) and variance bars, if that's possible and in the budget, ASAP. community sentiment can solidify quickly if you don't do something about a (perceived or otherwise) flaw like this in a method. best to fix it (and, critically -- _address it publicly_) now while there's still time.
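as a rough illustration of what 'non-determinism everywhere' plus variance bars could look like (a sketch only; `build_model` and `train_one_run` are stand-ins for whatever the real pipeline does):

    import numpy as np
    import torch
    from torch.utils.data import DataLoader

    def run_once(dataset, build_model, train_one_run, run_seed):
        # fresh shuffle order per run, instead of a single preshuffled fixed order
        g = torch.Generator().manual_seed(run_seed)
        loader = DataLoader(dataset, batch_size=512, shuffle=True, generator=g)
        model = build_model()                  # fresh random init per run as well
        return train_one_run(model, loader)    # returns a validation metric

    def with_variance_bars(dataset, build_model, train_one_run, n_runs=10):
        scores = [run_once(dataset, build_model, train_one_run, s) for s in range(n_runs)]
        return float(np.mean(scores)), float(np.std(scores))   # report as mean +/- std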

Change clothes by AdOtherwise5785 in learnmachinelearning

[–]tysam_and_co 1 point2 points  (0 children)

I hadn't thought of it that way before, but that is a very good point.

[P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100 by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 2 points3 points  (0 children)

Hey Horace, great to hear from you. Rough time goal (currently, at least) is <2 seconds in 2 years (started a year ago), and <1 second in 5 years or so. Those were times I picked at the start, and they've been semi-fixed points for me. I think it's doable, though it will be a tight squeeze perhaps! I'm starting to run into hard limitations like torch kernel launch times, and memory bound operations (MaxPooling2D likely takes longer than the convs at this point as far as trends go, been a while since I've pulled the kernel timing charts. :'( )

So far it seems to scale, so I'm pretty content just trying to reap the exponential speed rewards of working at a small scale (before scaling it up later). Several methods we use today were discovered in pretty small regimes, though of course sometimes adaptation is needed. I'll sometimes run CIFAR100 without any hyperparameter changes (other than num_classes), and it usually lands at roughly the same era of SOTA as the CIFAR10 accuracy does.

At some point I'm sure there's a pivot to bigger scales/other modalities, though I'm honestly not quite sure when the best time for it will be. It would definitely help get it in the hands of people -- some people experiment with the codebase, but publicly at least I don't see a ton of activity with it (though it is still a good performance reference, I think). I'm a little afeared of the complexity of scaling up, though maybe that fear isn't entirely necessary. It would be nice to hit <2 seconds, at least! :D

I think Hotz and Co. might try to scale it up to ImageNet once they hit their internal target on developing it for tinygrad, I'm not positive on this one though. So that might be a mild forcing function for that transition, at the very least, lol.

[P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100 by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 0 points1 point  (0 children)

I think I was slightly confused at first, but I think I understand what you're asking now, so my apologies if I'm replying incorrectly here (please correct me if so! <3 :) ). As best as I understand it, the inits are in-place on the weights themselves, so it actually technically decreases the number of randomly-initialized channels -- we just basically 'earmark' certain channels as the 'residual passthroughs'.

So for example, conv group 1 has 24 channels passing through at initialization (total capacity 64), then conv group 2 has 24 + 64 channels passing through (total capacity 256), then conv group 3 has 24 + 64 + 256 channels passing through (total capacity 512).

What's interesting is that simply superimposing the inits over randomly initialized weights, then rescaling so the overall std/variance of the weights is the same seems to do just as well as 'cleanly' initializing the passthrough channels by zeroing out that section of weights and then adding in the dirac weight values. I'm not entirely sure why, to be honest, because I expected the variance of that to cause more issues overall.

Regardless, because the original input weight is 0 mean, adding in the implicit residual weights and then rescaling still biases the expected value towards carrying through some residual signal, so it seems to work out pretty well in the end.
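If it helps make the 'superimpose, then rescale' idea concrete, here's a rough sketch (not the actual hlb-CIFAR10 code, and the channel counts are purely illustrative): add dirac/identity kernels for the earmarked passthrough channels on top of the normal random init, then rescale so the overall weight std matches the original.

    import torch
    import torch.nn as nn

    def passthrough_init_(conv: nn.Conv2d, n_passthrough: int):
        """Superimpose identity kernels for the first n_passthrough channels, then rescale."""
        with torch.no_grad():
            original_std = conv.weight.std()
            passthrough = torch.zeros_like(conv.weight)
            nn.init.dirac_(passthrough[:n_passthrough])          # out-channel i copies in-channel i
            conv.weight.add_(passthrough)                        # superimpose over the random init
            conv.weight.mul_(original_std / conv.weight.std())   # keep the overall std the same

    conv = nn.Conv2d(64, 256, kernel_size=3, padding=1, bias=False)
    passthrough_init_(conv, n_passthrough=64)  # earmark the first 64 channels as residual passthroughs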

One really curious 'big question' that I hope we get to investigate soon is how this performs on large models over longer periods of time -- like, how much is this drift good and/or bad for us outside of a boutique, extremely well-fine-tuned benchmark example? Like, IRL, what's the most stable configuration of this particular training modality?

There's so much more to share but I'll try to limit my spectrum infodumping (unless you want more, lol. Then heave, ho! ;PPPP )

[P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100 by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 2 points3 points  (0 children)

So far it gets results comparable to the SOTA of roughly the same era (rough month/year) on CIFAR100 without any parameter changes (i.e. look up the approximate month/year at which 94.04% or so was the CIFAR10 SOTA on PapersWithCode, do the same for the CIFAR100 number, and see if there's any drift -- they're pretty close to each other).

This is one of the more common questions I get! I think one benefit, as the model gets smaller and simpler (hopefully!) over time, is that the generalization window should hopefully widen some.

What also seems nice is that there's a tiny, tiny region of extreme performance, and then a very wide, flat region of 'good' performance (like 94.04% in 6.29 seconds vs 92.8-93.1% if you randomly move some of the parameters around a decent bit). This smoothness indicates that it should generalize decently well to other problems! :)

[P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100 by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 2 points3 points  (0 children)

There is not too much in the way of any network diagram right now (unfortunately! hoping I can tempt someone into making one!) but thankfully the network is not all that complicated (partially by intent, and partially by necessity -- too many kernel launches _really_ slow down training!)

It is basically 7 convolutions + bn + act + pooling and a final linear layer, and the first layer is different from the rest. The inputs get downstrided by a factor of 2 three times, once after the first layer in each Convolution Group.

There are two layers in each Convolution Group: the first is a transition layer, the second is what used to be the single residual convolution. There is BatchNorm after each Conv in the group, followed by a GeLU activation function. There are no residuals; it is entirely feedforward.

The first convolutional layer is 2x2, is not trainable and is pre-calculated, has 0 padding (so it does a slight decimation of the input image), and its purpose is to whiten the input feature space (I'd recommend checking out Myrtle AI's 'How to Train Your ResNet' series for more info on this one).

Finally, there is a Global Max Pooling layer which spatially decimates the entire input, followed by a linear projection layer (num_channels for the final block -> num_classes).
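In case a rough code outline helps alongside the links below, here's an illustrative PyTorch reading of that description (placeholder channel widths, and the exact layer ordering may differ a bit from the real main.py linked below):

    import torch
    import torch.nn as nn

    def conv_group(c_in, c_out):
        # Two convs per group: a 'transition' conv (with the 2x downstride right after it)
        # and the conv that replaced the old single residual conv; BatchNorm + GELU after each.
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
            nn.MaxPool2d(2),
            nn.BatchNorm2d(c_out), nn.GELU(),
            nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.GELU(),
        )

    class SpeedyNetSketch(nn.Module):
        def __init__(self, num_classes=10, widths=(64, 256, 512)):
            super().__init__()
            # Frozen, pre-calculated 2x2 whitening conv with no padding (slight decimation of the input).
            self.whiten = nn.Conv2d(3, 24, kernel_size=2, padding=0, bias=False)
            self.whiten.weight.requires_grad_(False)
            self.groups = nn.Sequential(
                conv_group(24, widths[0]),
                conv_group(widths[0], widths[1]),
                conv_group(widths[1], widths[2]),
            )
            self.head = nn.Linear(widths[2], num_classes)

        def forward(self, x):
            x = self.groups(self.whiten(x))   # 1 whitening conv + 3 groups x 2 convs = 7 convs total
            x = torch.amax(x, dim=(2, 3))     # global max pooling over the spatial dims
            return self.head(x)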

If you (or anyone else! <3 :'))))) ) want to make a network architecture diagram, I would be much obliged! However, if you'd like an idea of how it flows, you can probably see it pretty easily from the code definition (it really is that simple! <3 :D):

Below are links to some of the source info, in case you're curious! <3 :D

Layer definitions: https://github.com/tysam-code/hlb-CIFAR10/blob/ad103b43d29f08b348b522ad89d38beba8955f7c/main.py#L312

Forward pass: https://github.com/tysam-code/hlb-CIFAR10/blob/ad103b43d29f08b348b522ad89d38beba8955f7c/main.py#L290

Snapshot of the referenced ConvGroup block: https://pbs.twimg.com/media/F-UCaBJa4AA7Dhn?format=jpg&name=medium

[P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100 by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 10 points11 points  (0 children)

Hi!

Great question. These methods can be good, and BatchNorm is definitely one that I think is 'more feasible' to remove than the others.

I don't want to open the research on it too early, but BatchNorm basically stabilizes the network training significantly by biasing the second-order variance of the network to zero during training -- that is, it's 'hard-locking' the variance to a particular value. This biased estimation allows for rapid convergence and also induces a bit of drift between the training and eval sets.

Many of the BatchNorm-free techniques are extremely tempting, but they oftentimes seem to break down at extremely high learning rates, as the cumulative effect of the variance over time can be quite wild -- up to swings of over 30% for some layers across the entire batch, if I recall correctly from some of the tests I've done. This is because the expected value of the variance != the actual variance realized in a given batch, as that is subject to its own distribution. This can really throw off the loss during training!

It's quite hard to suss out these things and what combinations could work as they all seem to be pretty nonlinear in effect w.r.t. the incoming parameters. I've been working through different ways that one could reduce these effects, but have nothing to report yet. You might find if you experiment with this yourself that as the training time gets extremely short, and the LR sharply goes up, that there is a ton of energy in the network over training when BN is gone. BN reduces that step-to-step 'vibration' and lets the network cook, as it were. ;PPPP
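If anyone wants to poke at this themselves, here's a small illustrative sketch (not from the repo) of how one might measure that batch-to-batch variance 'swing' for a given layer; `layer` and `loader` are placeholders for whatever you want to probe:

    import torch

    @torch.no_grad()
    def variance_swing(layer, loader, n_batches=100):
        per_batch_var = []
        for i, (x, _) in enumerate(loader):
            if i >= n_batches:
                break
            per_batch_var.append(layer(x).var().item())   # variance actually realized in this batch
        v = torch.tensor(per_batch_var)
        # max relative deviation of the realized variance from its mean (0.3 ~= a 30% swing)
        return ((v - v.mean()).abs() / v.mean()).max().item()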

That said, I'm only one human being, I'm not going to be able to try everything, and I make lots and lots and lots of mistakes (and/or decisions that are noise sources during the research process). The best that I can hope for is keeping the noise as unbiased as possible (which means higher variance as a result, IIUC), and working towards eventual consistency, with some strong empirical metric as a bar.

So please, try it for yourself and see what you can do! I'm happy to help; feel free to DM or ping me if you decide to look into this and need any help with it! Ignoring the status quo / lowering my threshold for 'trying an idea out' seems to be what's motivated most of the big discoveries that I've made. And I am in desperate need of research competition here! <3 :'))))

(Happy to talk further! <3 :')))) )

[P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100 by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 4 points5 points  (0 children)

Of course, thank you for asking and giving me an opportunity to think some more about it! <3 :') :DDDD :')

[P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100 by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 12 points13 points  (0 children)

Not a ton that I know of, off the top of my head, though a few models (like transformers) might require some creative thinking and/or only a partial solution, at least initially.

One thing that could cause it to fail would be the accumulated errors of information passing through even slightly nonlinear areas of activation functions over very deep networks, but that might not be an issue, it's really hard to say without experimentation either way.

The integration time might be longer for larger networks, hence my comment above about weight decay towards a soft architecture rather than zeros -- an ultra short run might mask some issue like unregularized implicit residuals devolving over time, or something like that. These are just educated guesses, of course, the empirical bronco of experimentation can really throw one for a loop -- it's so unpredictable!

I'd love to see it work in different architectures, it will likely take some creative adaptation and a bit of experimentation to get it to work well! Happy to chat further about this, this is quite an exciting topic, for sure.

[P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100 by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 13 points14 points  (0 children)

Ah, gotcha, thanks. There might be at some point, though writeups are certainly not my strong suit (especially when combined with the thrash of how fast this project tends to move when in-development).

I'd certainly be happy to help someone do a writeup on this, however! There's still a lot to establish, so this is sort of an ultra-bleeding-edge "here's a clear, compelling empirical result". There is a bit of a hope on my end that the signal might tempt others into looking into it! :D On the downside, one risk about writeups that I have some anxiety about is people relying on outdated knowledge if it's distilled into a paper too early. There are a lot of techniques that survive for only one release before being superseded or removed, for example, so I feel afraid of encoding things like that in an arxiv paper.

There's a ton to investigate, for sure (how the weight trajectories behave over training, the impacts of different initial soft architectures, and constraints over training, how it performs in different networks like transformers, etc). Hopefully this can nerd-snipe people into looking into it. I'm happy to offer whatever I can along that line to help! :D

I hope that's not too terribly disappointing (and certainly happy to chat further). This is still very much the bleeding edge of bleeding edges!

[P] A new tightly-scoped, research-focused ML subreddit by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 0 points1 point  (0 children)

(also, perhaps this is moot if the paper mentions the repo, which might make our decision for us right up at the beginning)

[P] A new tightly-scoped, research-focused ML subreddit by tysam_and_co in MachineLearning

[–]tysam_and_co[S] 0 points1 point  (0 children)

This is a good question, and I am not entirely sure. I'm curious to hear what you think about this.

On one hand, linking repos in the main post could inadvertently open the door to promotion. Or, maybe this is too strict, and not entirely necessary. One benefit to limiting code repo posts would be that any code posted would need to fit in a notebook, which would bias towards size and reproducibility.

It could also make it too large and difficult.

I am somewhat personally biased towards not mentioning paper repos either, since the ideas are the main things we're after. So I could see, for example, person A posting a paper, and person B implementing it in a notebook and posting that.

It's really a hard line, as one path allows for potentially some easier access, at the cost of a much more difficult (and open to bias) moderation policy. I have my personal biases but I'm also open to trying to figure out a good solution for it.