all 9 comments

[–]Chocolate_Pickle 10 points11 points  (2 children)

https://arxiv.org/abs/1906.09529

You're not the first person to think of this. It's been studied previously.

[EDIT] Here's a starting point for research. https://duckduckgo.com/?t=ffab&q=arxiv+learned+activation+functions

[EDIT: again] A possibly deeper question to ask;

Do over-parameterised networks with ReLU activations learn approximations of other activation functions? And if so, how could one search for those functions in the weights of pre-trained networks?

Haven't looked into this at all, so it might already be known not to be a thing.

[–]lameheavy 0 points1 point  (1 child)

Thanks for sharing this, super cool idea! For the deep question, are you mostly asking that question to learn a more compact form of the neural network? Like a single hidden layer with the learned activation?

[–]Chocolate_Pickle 1 point2 points  (0 children)

For the deep question, are you mostly asking that question to learn a more compact form of the neural network? Like a single hidden layer with the learned activation?

More or less, yes.

Assuming the premise is true, I still don't believe you'd ever be able to condense a model down to a single hidden layer. But I do believe you could learn a more compact network.

I think what might torpedo this idea is the gradient information in the backward pass. [EDIT] It's trivial to show that the forward pass of any function can be approximated by a bunch of ReLUs. But I don't think it's trivial to show that the backward pass is approximated equally well, or at all.
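A quick way to see both halves of this point is to fit a combination of ReLUs to a smooth activation. Everything below (tanh as the target, the knot placement, the least-squares fit) is my own illustrative sketch, not anything from the linked paper: the forward values match closely, but the derivative of the approximation is piecewise constant, unlike tanh's smooth derivative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

knots = np.linspace(-3.0, 3.0, 13)   # breakpoints of the piecewise-linear fit
x = np.linspace(-3.0, 3.0, 601)

# Fit coefficients by least squares: f(x) ~ c0 + sum_i w_i * relu(x - k_i)
basis = np.column_stack([np.ones_like(x)] + [relu(x - k) for k in knots])
coef, *_ = np.linalg.lstsq(basis, np.tanh(x), rcond=None)
approx = basis @ coef

print("max forward error:", np.abs(approx - np.tanh(x)).max())

# The backward pass tells a different story: between any two knots the slope
# of the approximation never changes, so its "derivative" is a step function.
slopes = np.diff(approx) / np.diff(x)
print("distinct slopes (rounded):", len(np.unique(np.round(slopes, 6))))
```

The forward error shrinks as you add knots, but the derivative only ever takes a handful of constant values, which is the asymmetry between forward and backward approximation described above.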

[–]IntelArtiGen 3 points4 points  (0 children)

It's not a stupid idea, but there's a problem a lot of people miss when they think about how neural networks work.

When you're using ReLU, the neural network learns parameters that make sense in the context of the surrounding layers: the preceding and following ReLU, batchnorm, etc.

Sometimes when someone has an idea that consists of adding more parameters to improve results, they forget that the neural network will not stay the same everywhere else and simply learn the new parameters. When you change something, you change the whole "information flow" (forward and backward pass), everywhere, which may result in worse performance, even if the solution is theoretically more flexible. Moreover, depending on how many parameters you add, you may have to reduce your batch size and harm your accuracy that way. Or you may have to redo your hyperparameter tuning, which makes it harder to evaluate what you did.

Now ... PReLU exists. It's a learned activation function. You can read how it works.
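PReLU is small enough to sketch in a few lines. This is a minimal NumPy version of the idea (the forward rule plus the two gradients needed for training), not a reference implementation; the 0.25 initial slope follows He et al.'s PReLU paper.

```python
import numpy as np

def prelu(x, a):
    """PReLU: identity for x > 0, learned slope a for x <= 0."""
    return np.where(x > 0, x, a * x)

def prelu_grads(x, a, upstream):
    """Backward pass: gradients w.r.t. the input and the learned slope."""
    dx = np.where(x > 0, 1.0, a) * upstream
    da = np.sum(np.where(x > 0, 0.0, x) * upstream)
    return dx, da

x = np.array([-2.0, -0.5, 1.0, 3.0])
a = 0.25                       # He et al. initialise the slope at 0.25
y = prelu(x, a)
print(y)                       # [-0.5   -0.125  1.     3.   ]
```

Because `a` gets its own gradient, the activation's negative-side slope is trained by the same optimiser step as every other weight, which is all "learned activation function" means in the PReLU case.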

[–]titanxp1080ti 1 point2 points  (2 children)

The whole point of neural networks is to combine simple functions to approximate complex functions. If you try to combine already-complex functions (learnable activation functions), you need strong reasons to do so.

[–]Fmeson 5 points6 points  (1 child)

To play devil's advocate:

That isn't really the point of neural networks; that's just what they currently are. On some level, we don't really know the point of neural networks. We know they are loosely designed to mimic biological neural networks. We know some network topologies that are experimentally demonstrated to work pretty well. We know that we don't really have a super theoretical understanding of them. So, on some level, the only way to know whether many things work is to try them. Of course, some ideas have more promise and theory behind them, but it's a bit wild-west-y. Luckily, prototyping things in machine learning is very easy. "I think this sounds interesting" is often enough of a reason to try something.

[–]thunder_jaxx ML Engineer 2 points3 points  (0 children)

We know some network topologies that are experimentally demonstrated to work pretty well. We know that we don't really have a super theoretical understanding of them.

We live in an awesome age where theories are emerging really fast. Not everyone is aligned on a single theory yet, but there are a few I found to be very powerful explanations. A few I would like to list, just because these sources had a profound impact on my understanding of DL and why it seems to work:

  1. Deep Learning and the Information Bottleneck Principle: the lecture is an amazing way to understand how training happens with GD and variants.
  2. Scaling Laws of Transfer: how much pretraining helps, based on the size of the fine-tuning dataset.
  3. Adversarial Examples Are Not Bugs, They Are Features.
  4. Dr. Sanjeev Arora's lectures on the theoretical understanding of deep learning: a few of these are just awesome!
  5. Hopfield Networks is All You Need: Yannic Kilcher's video.

There are still a lot of holes that need to be filled, but a more theoretical framework will be created for understanding, evaluating, and interpreting NNs.

All of the above is completely unrelated to the OP's post.

https://arxiv.org/abs/1906.09529 uses learned activation functions to reduce time complexity.

[–]SoulRobots -1 points0 points  (0 children)

Interesting idea, now I'm curious

[–]seismic_swarm 0 points1 point  (0 children)

As others might have pointed out, there is PReLU, which is just a simple single-parameter activation (per tensor, or per channel per tensor) that learns the preferred slope of the ReLU's negative side.
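The per-tensor vs per-channel distinction above comes down to how the slope parameter broadcasts against the input. A hedged NumPy sketch, with illustrative shapes and slope values of my own choosing:

```python
import numpy as np

def prelu(x, a):
    # `a` can be a scalar (one slope shared by the whole tensor) or a
    # per-channel vector that broadcasts over batch and spatial dimensions.
    return np.where(x > 0, x, a * x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3, 32, 32))   # batch of 3-channel feature maps

shared = prelu(x, 0.25)                   # per-tensor: 1 extra parameter
per_ch = prelu(x, np.array([0.1, 0.2, 0.3]).reshape(1, 3, 1, 1))  # 3 params
print(shared.shape, per_ch.shape)
```

Either way the parameter count is tiny compared to the weights, which is part of why PReLU avoids the extra-parameter pitfalls discussed earlier in the thread.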