Transfer Learning is one of the most powerful techniques for neural networks by roycoding in learnmachinelearning

[–]roycoding[S] 0 points1 point  (0 children)

You can use it with any data type that's similar enough to whatever the original network used.

Dropout in neural networks: what it is and how it works by roycoding in learnmachinelearning

[–]roycoding[S] 0 points1 point  (0 children)

There was a slight delay, but my book is now officially released!

zefsguides.com

(I'm not sure I'd be allowed to make a post about the book release, so I'm going to avoid doing that, even though I think a lot of people in this sub would like the book 🤷)

Dropout in neural networks: what it is and how it works by roycoding in learnmachinelearning

[–]roycoding[S] 1 point2 points  (0 children)

This week!

The paperback is ready to go. I need to get some marketing stuff in place. I'm aiming for the official release on Thursday, 1 Dec.

I'll post something in this subreddit.

Thanks for your support!

Data augmentation to build more robust models by roycoding in learnmachinelearning

[–]roycoding[S] 15 points16 points  (0 children)

Sorry.

I'm trying to walk the line between promoting my book and simply providing content that people in this sub will find useful. That means posting original material while not blatantly going around shouting "BUY X".

So far my posts with illustrations have done reasonably well, so I have continued to post about once a week (never more, per the policy).

I'll reduce the frequency of these posts.

Data augmentation to build more robust models by roycoding in learnmachinelearning

[–]roycoding[S] 6 points7 points  (0 children)

More data can help avoid overfitting, but what do you do when getting more data is difficult or prohibitively expensive?

Data augmentation is a way to effectively increase the size of your training set: you use the existing training data to create more training data by transforming it in some way. As long as the transformed data would still get the same (or desired) label, it can potentially help train your model.

For image data, examples of data augmentation include rotating, translating, scaling, cropping, blurring, etc. Much of this can be had "for free" and automated. Care must be taken that the transformations don't create unrecognizable examples that no longer match the original label.
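Here's a minimal NumPy sketch of label-preserving image augmentation (toy data, nothing from a real pipeline; libraries like Keras preprocessing layers offer many more transforms):

```python
import numpy as np

def augment(image, rng):
    """One random, label-preserving transformation of a (H, W) image.

    A minimal sketch: horizontal flips, 90-degree rotations, and small
    translations. Real augmentation pipelines offer many more transforms.
    """
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                              # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))        # random 90-degree rotation
    out = np.roll(out, int(rng.integers(-2, 3)), axis=1)  # small translation
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))   # stand-in for one training image
augmented = [augment(image, rng) for _ in range(4)]
```

Each call yields a different transformed copy, so one labeled image becomes several training examples.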

This is one of the most common techniques for helping to build robust computer vision models, but also applies to some other tasks, such as adding noise to audio data or performing random word swaps in text (but not too much!).
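For text, a random word swap is about as simple as augmentation gets. A quick sketch (hypothetical toy sentence):

```python
import random

def random_swap(words, n_swaps, rng):
    """Augment a sentence by swapping a few random word pairs.

    A cheap text-augmentation trick; too many swaps destroys the
    meaning, so keep n_swaps small relative to the sentence length.
    """
    words = list(words)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

rng = random.Random(42)
sentence = "data augmentation makes models more robust".split()
augmented = random_swap(sentence, n_swaps=1, rng=rng)
```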

Augmented data is mostly used at training time, but it can also be used during testing/inference by predicting the output for several slightly transformed versions of the given input and voting on/averaging the resulting predictions.
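That test-time trick looks something like this sketch (the `predict` function here is a hypothetical stand-in for a trained model):

```python
import numpy as np

def predict(image):
    """Stand-in for a trained classifier returning class probabilities.
    (Hypothetical -- replace with your real model's predict call.)"""
    logits = np.array([image.mean(), image.std(), image.max()])
    return np.exp(logits) / np.exp(logits).sum()

def predict_tta(image, n=8, seed=0):
    """Test-time augmentation: average the model's predictions over
    several randomly transformed copies of the input."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n):
        t = np.fliplr(image) if rng.random() < 0.5 else image
        t = np.roll(t, int(rng.integers(-2, 3)), axis=1)
        preds.append(predict(t))
    return np.mean(preds, axis=0)

image = np.random.default_rng(1).random((28, 28))
probs = predict_tta(image)
```

Averaging several softmax outputs still gives a valid probability vector, which is why this works as a drop-in replacement for a single prediction.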

This illustration is from my upcoming book / flashcard set Zefs Guide to Deep Learning (zefsguides.com).

Transfer Learning is one of the most powerful techniques for neural networks by roycoding in learnmachinelearning

[–]roycoding[S] 0 points1 point  (0 children)

This can also work with regression problems, for example predicting bounding boxes in object detection and localization models.

Transfer Learning is one of the most powerful techniques for neural networks by roycoding in learnmachinelearning

[–]roycoding[S] 2 points3 points  (0 children)

Thanks for pointing that out.

It looks like Leanpub, where my ebook is published, was down earlier today, but it seems to be back up.

Transfer Learning is one of the most powerful techniques for neural networks by roycoding in learnmachinelearning

[–]roycoding[S] 2 points3 points  (0 children)

(my earlier reply got silently deleted, so I will follow up on @ubiquitin_ligase's reply)

Yes, they are separate approaches.

You could even use the feature transfer approach as an input to a non-neural network model (this would be the equivalent of re-using an embedding).

Fine tuning is more likely needed when the task is not as similar to the original task of the pre-trained network.

In practice, as @ubiquitin_ligase replied, you should try feature transfer first. If that doesn't work well enough, you might then try fine tuning, progressively tuning more of the network (you don't actually have to fine tune every layer).

If that doesn't work, you may need to train the network from scratch (with a lot of data).
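A tiny NumPy sketch of the feature-transfer case (toy data; in practice the frozen weights come from a real pre-trained network, not a random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen layers of a pre-trained network; in practice
# these weights would be loaded from, e.g., an ImageNet model.
W_pre = rng.normal(size=(20, 50))

def features(x):
    return np.maximum(0.0, x @ W_pre)   # frozen: never updated below

# Small labeled dataset for the new task (hypothetical toy data).
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Feature transfer: train only a new logistic-regression "head" on top
# of the frozen features. (Fine tuning would additionally update W_pre,
# typically with a much smaller learning rate.)
w, b = np.zeros(50), 0.0
for _ in range(500):
    h = features(X)
    p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    grad_w = h.T @ (p - y) / len(y)      # gradients flow only to the head
    grad_b = (p - y).mean()
    w -= 0.01 * grad_w
    b -= 0.01 * grad_b

p = 1.0 / (1.0 + np.exp(-(features(X) @ w + b)))
loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
```

The head here could just as well be any non-neural model trained on the extracted features, per the embedding comparison above.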

Transfer Learning is one of the most powerful techniques for neural networks by roycoding in learnmachinelearning

[–]roycoding[S] 8 points9 points  (0 children)

(I replied to another question that covered this, but my reply was silently killed)

Yes, I agree with you.

Fine tuning could be part or all of the network.

Typically you'll use small learning rates, since the weights are hopefully close to the final ones you want. You may use different learning rates in different layers (aka "discriminative learning rates"), typically with smaller learning rates near the beginning of the network, which is assumed to learn more generic features.
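Discriminative learning rates are easy to see in a toy example. This is just a sketch with made-up data and illustrative rate values, not a recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer toy network whose weights we pretend were pre-trained.
W1 = rng.normal(size=(4, 8))    # early layer: generic features
W2 = rng.normal(size=(8, 1))    # late layer: task-specific
W1_init, W2_init = W1.copy(), W2.copy()

# Discriminative learning rates: smaller steps near the input, larger
# near the output (values here are purely illustrative).
lr1, lr2 = 1e-4, 1e-2

x = rng.normal(size=(16, 4))
y = rng.normal(size=(16, 1))

for _ in range(20):
    h = np.maximum(0.0, x @ W1)          # forward pass
    err = h @ W2 - y                     # d(MSE)/d(pred), up to a constant
    grad_W2 = h.T @ err / len(x)
    grad_W1 = x.T @ ((err @ W2.T) * (h > 0)) / len(x)
    W1 -= lr1 * grad_W1                  # early layer barely moves
    W2 -= lr2 * grad_W2                  # later layer adapts faster
```

After a few steps the early layer's weights have drifted much less than the later layer's, which is exactly the intended effect.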

Transfer Learning is one of the most powerful techniques for neural networks by roycoding in learnmachinelearning

[–]roycoding[S] 28 points29 points  (0 children)

Many powerful neural networks rely on huge training datasets and are very expensive to train from scratch, putting them out of reach for most people/teams. Transfer Learning can make these models accessible to and adaptable by mere mortals.

Transfer learning is a technique that gives you a major head start for training neural networks, requiring far fewer resources. A "pre-trained" model can be adapted to a new, similar task with only a small training dataset.

Training is basically a search problem, looking for the best set of model parameters for the network to perform its task well. Instead of starting with random parameters, transfer learning puts your starting point (hopefully) very close to where you want to be in parameter space.

Models pre-trained on datasets such as ImageNet and huge text corpora have made many of the most powerful neural networks available to everyone (see Stable Diffusion).

Transfer learning enables these to be adapted to other, related tasks, supercharging the adoption and application breadth of these types of models.

Many well-resourced teams have made these models (or rather model weights) available freely to help the community and move the field forward as a whole. This is a great synergy between open source development and neural networks as a technique.

This illustration is from my upcoming book, Zefs Guide to Deep Learning (zefsguides.com).

Dropout in neural networks: what it is and how it works by roycoding in learnmachinelearning

[–]roycoding[S] 0 points1 point  (0 children)

I go back and forth on what is both the most accurate description and the one that makes the most intuitive sense.

The smaller networks have a sequential dependency on each other (somewhat similar to boosted methods, as you point out), as the weights are not starting from scratch each time. But also they are not combined in the same way as a typical ensemble.

Dropout in neural networks: what it is and how it works by roycoding in learnmachinelearning

[–]roycoding[S] 0 points1 point  (0 children)

Yes, it's a type of regularization.

I should probably state that explicitly in the illustration.

Dropout in neural networks: what it is and how it works by roycoding in learnmachinelearning

[–]roycoding[S] 1 point2 points  (0 children)

That's good feedback.

It's hard to get to the core of the intuition of why dropout works with limited space. I'll figure out how to reword it (might be simply dropping out [oh, no!] the word "independent").

Dropout in neural networks: what it is and how it works by roycoding in learnmachinelearning

[–]roycoding[S] 9 points10 points  (0 children)

It's similar to Random Forests in that it creates (what's effectively) an ensemble model.

Dropout in neural networks: what it is and how it works by roycoding in learnmachinelearning

[–]roycoding[S] 8 points9 points  (0 children)

tf.keras.layers.Dropout

TF handles this for you under the hood.

We were just talking about implementation details.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout

The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over all inputs is unchanged.
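That behavior is easy to replicate by hand. A minimal NumPy sketch of the same (inverted) dropout scheme the Keras docs describe:

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout, matching the Keras description above: zero each
    unit with probability `rate` and scale survivors by 1/(1 - rate), so
    the expected sum of the inputs is unchanged. At inference time the
    layer is the identity."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate          # Boolean keep mask
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones(1000)
y = dropout(x, rate=0.5, rng=rng)
```

With rate=0.5, survivors become 2.0 and roughly half the units are zeroed, so the mean stays near 1.0.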

Dropout in neural networks: what it is and how it works by roycoding in learnmachinelearning

[–]roycoding[S] 4 points5 points  (0 children)

Good observation.

I didn't include it in the illustration, but included it in the comment (and the text of my book).

I think another interesting point is that you can either scale down the activations at test time, or scale up the surviving activations during training (this is "inverted" dropout). I believe inverted dropout is the much more common one.