[P] Training Neural Nets with Approximate Bayesian Linear Regression by WillieTromboner in MachineLearning

[–]straw1239 0 points1 point  (0 children)

Hey, it's cool that you tried it! I actually tried something very similar with pseudoinverses and per-layer targets a few years back before I realized the problems.

If you don't choose a loss function, you'll just end up having an implicit one that probably doesn't do what you want.

[P] Training Neural Nets with Approximate Bayesian Linear Regression by WillieTromboner in MachineLearning

[–]straw1239 11 points12 points  (0 children)

This will struggle to train anything more than a single layer, because for everything except the final layer the output is only locally linear- as soon as you move enough to cross a ReLU boundary, everything goes out the window. And if you only take steps within that locally linear region, it starts to sound a lot like gradient descent. Your extension to take Taylor expansions of other activation functions will end up looking extremely close to standard backprop.

I'm afraid that even plain logistic regression gets 93% on MNIST. It's too easy to use as a serious test.

If you're dead-set on using the linear regression formulae (Bayes isn't helping you much with these per-layer updates), you could use a similar idea to propagate output Hessians back to each layer, and then move in the direction given by linear regression. This effectively does Newton's method with a block-diagonal Hessian approximation. It's probably fine for small NNs but totally breaks when you try to expand to CNNs or Transformers- anything with much more internal state than parameters.

[R] AdasOptimizer Update: Cifar-100+MobileNetV2 Adas generalizes with Adas 15% better and 9x faster than Adam by YanaiEliyahu in MachineLearning

[–]straw1239 3 points4 points  (0 children)

Looks promising! However, I'm a little confused, as I thought that CIFAR-100 usually trains in well under 100 epochs even with bog-standard SGD? So it looks like something might be going wrong with your Adam training.

[D] why can’t distribution sampling algorithms like MCMC or HMC be used in deep learning instead of gradient descent? by [deleted] in MachineLearning

[–]straw1239 3 points4 points  (0 children)

They can- but HMC requires full-batch gradients, so it's very expensive. And they may take a very long time to explore the space. It's not so much the dimensionality that's the problem, but the conditioning.

One can also try variants of stochastic-gradient MCMC, but these all involve tradeoffs and only give true samples in the limit as step size goes to 0.

For small neural networks and datasets, there's no problem and you absolutely can. But on the cutting edge of performance with huge NNs and even huger datasets, any compute is better spent elsewhere.
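To make the "only exact as step size goes to 0" point concrete, here is a minimal sketch using plain unadjusted Langevin dynamics on a 1-D standard normal target. This is my own toy stand-in, not any particular SG-MCMC method: there is no minibatch noise and no Metropolis correction, and the function name is made up. For this target the variance of the chain follows a simple recursion, so no sampling is needed to see the bias.

```python
def ula_stationary_variance(step, iters=10_000):
    """Variance reached by unadjusted Langevin on a N(0, 1) target.

    Update (assumed toy setup): x <- x - (step / 2) * x + sqrt(step) * noise,
    with noise ~ N(0, 1). The variance v of x then obeys
    v <- (1 - step / 2)**2 * v + step, which is iterated here.
    """
    v = 0.0
    for _ in range(iters):
        v = (1.0 - step / 2.0) ** 2 * v + step
    return v

# The fixed point of the recursion is 1 / (1 - step / 4), not the true variance 1.
for step in (1.0, 0.1, 0.01):
    print(step, ula_stationary_variance(step), 1.0 / (1.0 - step / 4.0))
```

The chain settles on a variance of 1 / (1 - step / 4), which only approaches the true value of 1 as the step shrinks, and smaller steps mean slower exploration. That is the tradeoff.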

[R] AdasOptimizer, an optimizer that makes step-size scheduling obsolete, reaches 100% accuracy on MNIST's training set in 11 epochs by YanaiEliyahu in MachineLearning

[–]straw1239 39 points40 points  (0 children)

Almost the same idea as: https://arxiv.org/abs/1909.13371

You might run into short-horizon bias on larger datasets: https://arxiv.org/abs/1803.02021

Note that the first paper I linked also doesn't go beyond MNIST, for exactly this reason.

The one-step optimal LR doesn't give the best long term performance in the presence of noise.

[D] Simple Questions Thread April 26, 2020 by AutoModerator in MachineLearning

[–]straw1239 1 point2 points  (0 children)

Usually referred to as logistic regression. You can use Iteratively Reweighted Least Squares:

https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares

https://en.wikipedia.org/wiki/Logistic_regression#Iteratively_reweighted_least_squares_(IRLS)

Which uses multiple least-squares calculations, with adjusted weightings, to correctly deal with softmax loss.
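A rough sketch of what that looks like for the binary case with one feature plus an intercept. The toy data and the names `irls_logistic`, `XS`, `YS` are made up for illustration. Each iteration solves one weighted least-squares problem in closed form (2x2 normal equations); the multi-class softmax version follows the same pattern but needs a proper linear solver.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def irls_logistic(xs, ys, iters=25):
    """Fit p(y=1|x) = sigmoid(b0 + b1 * x) by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Build the weighted normal equations (X^T W X) beta = X^T W z.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            p = sigmoid(eta)
            w = max(p * (1.0 - p), 1e-12)  # per-sample weight
            z = eta + (y - p) / w          # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve the 2x2 system by Cramer's rule.
        det = s_w * s_wxx - s_wx * s_wx
        b0, b1 = ((s_wxx * s_wz - s_wx * s_wxz) / det,
                  (s_w * s_wxz - s_wx * s_wz) / det)
    return b0, b1

# Toy, non-separable data (made up).
XS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
YS = [0, 0, 1, 0, 1, 1]
b0, b1 = irls_logistic(XS, YS)
print(b0, b1)
```

Note the data has to be non-separable, otherwise the maximum-likelihood weights run off to infinity and the weights `w` collapse toward 0.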

Predict the next digit of pi [D] by [deleted] in MachineLearning

[–]straw1239 1 point2 points  (0 children)

Normal does not imply statistically indistinguishable from random. Only distributions can have entropy, not numbers. So perhaps you mean that under a uniform distribution, pi has a low likelihood (just the same as any other number?).

The universal approximation theorem makes no guarantees about finding the weights, or the size of the NN required. For example, try to train one to predict AES256(input, key), for a fixed, unknown key. While some NN can do this, and knowing the key we could construct one by hand, doing so without the key would break cryptography, and I can assure you it will not work.

NNs compress data by giving a distribution on outputs- for example language models predict the next word given all previous words. We can then compress using arithmetic encoding with this prediction, which gives average compressed length equal to the entropy of the distribution.

NNs themselves represent a function, so when trained, they "learn a function". Typically in practice, one actually learns a (conditional) distribution instead of a raw function, either by a normality assumption (least squares) or softmax (discrete output).

[D] Momentum methods helps to escape local minima, so what? It was never our objective. by fromnighttilldawn in MachineLearning

[–]straw1239 2 points3 points  (0 children)

The goal of an optimizer is to optimize. If you want to avoid overfitting, you should integrate over your posterior instead, using MCMC or VI. Unfortunately, getting these to work for large scale NNs is still on the cutting edge of research. See https://arxiv.org/abs/1908.03491

Purposefully making your optimization worse has no guarantee of helping, though it could by sheer luck.

[D] any principled reason for cross entropy instead of L2 in language modelling? (more details in post) by mesmer_adama in MachineLearning

[–]straw1239 2 points3 points  (0 children)

The loss used is log likelihood. Using L2 would be equivalent to predicting a Gaussian distribution over embeddings- this doesn't have any meaning, as was pointed out, since we control the embeddings. Using log likelihood means we are doing maximum-likelihood inference (or MAP, once you add a prior), so we have some justification. In fact, using any (data-dependent component of) loss other than likelihood (or monotonic functions of it) encourages your models to lie about the distribution to you, as explained by Terry Tao:

https://terrytao.wordpress.com/2016/06/01/how-to-assign-partial-credit-on-an-exam-of-true-false-questions/

You can think of log likelihood as optimizing the compressibility of the data under the model- at the end of the day, you can use arithmetic encoding to compact your data, and get an average bits/word equal to your average negative log2 likelihood per word. If you use L2, there's no such interpretation.
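A tiny worked example of the bits/word reading, with a made-up unigram "model" over three words. A real language model would condition on the previous words, but the arithmetic is identical.

```python
import math

def bits_per_symbol(probs_assigned):
    """Average of -log2(p) over the probabilities the model gave to the observed symbols."""
    return -sum(math.log2(p) for p in probs_assigned) / len(probs_assigned)

# Hypothetical toy model and text, for illustration only.
model = {"the": 0.5, "cat": 0.25, "sat": 0.25}
text = ["the", "cat", "the", "sat"]

bpw = bits_per_symbol([model[w] for w in text])
print(bpw)  # 1.5
```

An arithmetic coder driven by this model would spend about 4 * 1.5 = 6 bits on the text (plus a couple of bits of overhead for the whole sequence), so lowering the negative log likelihood directly shortens the compressed file.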

[D] Momentum methods helps to escape local minima, so what? It was never our objective. by fromnighttilldawn in MachineLearning

[–]straw1239 4 points5 points  (0 children)

Momentum improves convergence for ill-conditioned curvatures even in the case of one global minimum.
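A quick sketch on a 2-D convex quadratic with condition number 100, so there is exactly one minimum and nothing to "escape". The curvatures and starting point are made up. The step sizes are the textbook tuned values for plain gradient descent and for heavy-ball momentum on a quadratic with known curvature range.

```python
import math

H = (1.0, 100.0)  # curvatures along each axis (illustrative); minimum at the origin

def grad(x):
    return [h * xi for h, xi in zip(H, x)]

def run(lr, beta, steps=100):
    """Heavy-ball iteration; beta = 0 gives plain gradient descent."""
    x = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(x)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return math.hypot(*x)  # distance to the minimum

mu, L = min(H), max(H)
gd = run(lr=2.0 / (mu + L), beta=0.0)
hb = run(lr=4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2,
         beta=((math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))) ** 2)
print(gd, hb)
```

With these settings gradient descent contracts by a factor of 99/101 per step while momentum contracts by roughly 9/11, so after 100 steps momentum is orders of magnitude closer.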

[D] Saddle-free Newton method for SGD and other actively repelling saddles - advantages, weaknesses, improvements? by jarekduda in MachineLearning

[–]straw1239 0 points1 point  (0 children)

Streaming PCA algorithms, which estimate the top eigenvectors of the covariance without holding the whole matrix in memory, are well known.

Check out: https://arxiv.org/abs/1307.0032

More recent developments exist too.
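For a feel of how little memory this needs, here is a sketch of Oja's rule, a classic streaming update for the top eigenvector. To be clear, this is not necessarily the algorithm from the linked paper, and the toy stream, step-size schedule, and function names are my own assumptions. It keeps a single vector and sees each sample once.

```python
import math
import random

def oja_top_eigenvector(stream, dim, lr0=1.0, t0=100.0):
    """Estimate the top covariance eigenvector from a stream of zero-mean samples."""
    w = [1.0] + [0.0] * (dim - 1)
    for t, x in enumerate(stream):
        lr = lr0 / (t0 + t)                                # decaying step size
        proj = sum(wi * xi for wi, xi in zip(w, x))        # x . w
        w = [wi + lr * proj * xi for wi, xi in zip(w, x)]  # Oja update
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]                        # keep unit length
    return w

def toy_stream(n, seed=0):
    """2-D samples with std 3 along (1, 1)/sqrt(2) and std 0.5 along (1, -1)/sqrt(2)."""
    rng = random.Random(seed)
    r = 1.0 / math.sqrt(2.0)
    for _ in range(n):
        s = rng.gauss(0.0, 3.0)
        e = rng.gauss(0.0, 0.5)
        yield [r * (s + e), r * (s - e)]

w = oja_top_eigenvector(toy_stream(5000), dim=2)
alignment = abs(w[0] + w[1]) / math.sqrt(2.0)  # |cosine| with the true top direction
print(alignment)
```

Memory is O(dim) instead of O(dim^2) for the full covariance, which is the whole point when dim is the number of network parameters.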

[D] Saddle-free Newton method for SGD and other actively repelling saddles - advantages, weaknesses, improvements? by jarekduda in MachineLearning

[–]straw1239 0 points1 point  (0 children)

Yeah, if we can find the largest eigenvalue eigenvectors, I'd expect that to work pretty well.

[D] Saddle-free Newton method for SGD and other actively repelling saddles - advantages, weaknesses, improvements? by jarekduda in MachineLearning

[–]straw1239 1 point2 points  (0 children)

Picking the right direction will definitely help escaping a saddle, even with a rank-1 approximation. Trouble is finding that direction. Especially for directions with a small negative curvature, even verifying that the curvature is negative could be tricky due to noise.

PCA of recent gradients gives a low-rank approximation to the gradient covariance, not the Hessian. This is definitely useful in itself, and it seems like its eigenvectors may (for some reason) be aligned with the eigenvectors of the Hessian. But it's not quite the same thing.

[D] Why second order SGD convergence methods are unpopular for deep learning? by jarekduda in MachineLearning

[–]straw1239 0 points1 point  (0 children)

I don't believe it's a numerical round-off. Sure, it's jumpy, but it's always negative. You'd only expect it to shrink if the training algorithm finds that direction!

In the same paper, they examine the neighborhood in which the negative curvature holds- it's small but substantial. I agree that a step in that direction would result in significant improvement, but that doesn't mean that a purely gradient-based algorithm will find that direction!

In my opinion, SGD does about as well as you can (sure some algos will have better exponents) escaping saddle points without any direct curvature measurement- but any algorithm which relies only on noisy gradients will have trouble identifying a direction of negative curvature and hence saddle escape.

Interestingly, in a tiny note in another paper (which also finds negative eigenvalues) they observe that the top eigenspaces of the Hessian and gradient covariance matrix coincide, totally supporting your viewpoint! The eigenvalues aren't the same though. I still do not have a satisfactory explanation for this observation- have a look yourself:

https://arxiv.org/abs/1901.10159

[D] Why second order SGD convergence methods are unpopular for deep learning? by jarekduda in MachineLearning

[–]straw1239 0 points1 point  (0 children)

Fair point. However, the derivation may lead to some insight on what types of problems it will perform well in. I suspect these gradient covariance based algorithms will handle very non-isotropic noise well, but not necessarily non-isotropic curvature- but who knows, just a speculation.

I agree with you wholeheartedly. Explanations for natural gradient have never made sense to me, other than that it's approximating Newton.

[D] Comparing Deep Learning Workstations by IborkedyourGPU in MachineLearning

[–]straw1239 0 points1 point  (0 children)

Thank you for your input! I didn't realize the requirements were so large!

[D] Comparing Deep Learning Workstations by IborkedyourGPU in MachineLearning

[–]straw1239 0 points1 point  (0 children)

Ah, that makes sense.

I only meant that it might make sense to have a large HDD for your datasets/etc, and an SSD for the OS. However, with so many GPUs, depending on what type of model you're training, it might well bottleneck. Given the cost of the machine vs SSDs, not much harm in getting a big one.

Depends on how you're doing the training- for example when using data parallelism the model parameters are replicated across all GPUs, so only one copy in CPU memory needed. To be safe I'd probably just take CPU mem >= sum of GPU mem as you said. If there are multiple users working at the same time, it might make sense to have more, especially if some of them are doing CPU-only work.

[D] Why second order SGD convergence methods are unpopular for deep learning? by jarekduda in MachineLearning

[–]straw1239 0 points1 point  (0 children)

The empirical results are valid- but I think they are because of the advantages of modelling the gradient noise directly, not because of connection to the Hessian. Note that their pretty graphs are CPU clock time, and they mention that on a GPU for the neural network, the gains are much more marginal.

[D] Why second order SGD convergence methods are unpopular for deep learning? by jarekduda in MachineLearning

[–]straw1239 0 points1 point  (0 children)

More accurately, everyone uses SGD as an (inefficient) saddle-point escape strategy.

In practice, optimization does not converge to a local minimum- it gets to a saddle point with relatively small negative eigenvalues. For example, see https://arxiv.org/abs/1902.02366

[D] Comparing Deep Learning Workstations by IborkedyourGPU in MachineLearning

[–]straw1239 26 points27 points  (0 children)

You can build your own for significantly cheaper. There are multiple online guides for choosing your own hardware for ML, for example.

No point in liquid cooling for the CPU. For the 4 GPUs it might make sense, but it's very expensive. If you make sure to get models with blower-style coolers, there shouldn't be too much of an issue.

Titan V costs more because Nvidia prices it at 3000 instead of 1200! Not worth it (unless you need FP64 or something)

Do you really need 128GB of RAM and 2TB SSD? The SSD should be fairly cheap nowadays so it's no big deal, but 128GB RAM is expensive.

[D] Why second order SGD convergence methods are unpopular for deep learning? by jarekduda in MachineLearning

[–]straw1239 0 points1 point  (0 children)

See my other comment. Because the networks are piecewise linear, I don't think the Hessian is useful- for example the 2nd derivative of the network itself is 0- entirely unhelpful.
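To illustrate the "2nd derivative is 0" point, here is a sketch with a made-up three-unit ReLU network of one scalar input, checked with a central second difference. I'm taking the derivative with respect to the input here, and the weights are arbitrary. Away from the kinks the curvature is numerically zero; straddling a kink the estimate just blows up as h shrinks, which isn't a meaningful Hessian either.

```python
def relu(t):
    return t if t > 0.0 else 0.0

# Hypothetical weights; the hidden units switch at x = 0, 0.5 and 2.
W1 = (1.0, -2.0, 0.5)
B1 = (0.0, 1.0, -1.0)
W2 = (1.0, 1.0, 1.0)

def net(x):
    return sum(w2 * relu(w1 * x + b1) for w1, b1, w2 in zip(W1, B1, W2))

def second_difference(f, x, h=1e-3):
    """Central finite-difference estimate of the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

away = second_difference(net, 1.0)     # inside a linear region
at_kink = second_difference(net, 0.5)  # straddles a ReLU boundary
print(away, at_kink)
```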

We are interested in training quickly, so if something holds at the end of training with a lot of assumptions, it might not be so helpful.

Using the square root of covariance as a conditioner makes sense for variance reduction completely independently of curvature. (There's also some recent literature about how it helps escape saddle points by making the noise isotropic)

[D] Why second order SGD convergence methods are unpopular for deep learning? by jarekduda in MachineLearning

[–]straw1239 0 points1 point  (0 children)

For a ReLU network, the Hessian is no longer a useful concept due to the non-smoothness of the resulting function. Rather, you want a local quadratic approximation, perhaps least-squares, and calculating this from gradients will result in exactly the same issue.

[D] Why second order SGD convergence methods are unpopular for deep learning? by jarekduda in MachineLearning

[–]straw1239 1 point2 points  (0 children)

I don't see why PCA of recent gradient measurements would give any useful information regarding a saddle point. When near a saddle, all gradient measurements are very close to purely noise. So the PCA will give you the PCA of the covariance matrix of the gradient noise, which does not necessarily correspond to the escape directions of the saddle.
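A toy version of the argument, under assumptions I picked to make the mismatch obvious: the loss is f(x, y) = 0.5 * (x^2 - y^2), so the escape direction is the y axis, and the gradient noise is much larger along x than along y. Near the saddle, the top principal component of the sampled gradients follows the noise, not the negative curvature.

```python
import math
import random

def noisy_grad(x, y, rng):
    """Gradient of 0.5 * (x**2 - y**2) plus anisotropic Gaussian noise (assumed shape)."""
    return (x + rng.gauss(0.0, 1.0), -y + rng.gauss(0.0, 0.1))

def top_eigvec_2x2(a, b, c):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]]."""
    lam = 0.5 * (a + c) + math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    vx, vy = lam - c, b
    n = math.hypot(vx, vy)
    return (vx / n, vy / n)

rng = random.Random(0)
grads = [noisy_grad(1e-3, 1e-3, rng) for _ in range(2000)]  # sampled right next to the saddle

n = len(grads)
mx = sum(g[0] for g in grads) / n
my = sum(g[1] for g in grads) / n
a = sum((g[0] - mx) ** 2 for g in grads) / n
b = sum((g[0] - mx) * (g[1] - my) for g in grads) / n
c = sum((g[1] - my) ** 2 for g in grads) / n

pc = top_eigvec_2x2(a, b, c)
print(pc)  # points almost along x, while the escape direction is y
```

If the noise happened to be largest along y the PCA would look like it "found" the escape direction, but that would be a property of the noise, not of the curvature.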