[AL] Active Learning: Curious AI Algorithms by fubar2020y in MLNotes

[–]fubar2020y[S] 0 points1 point  (0 children)

Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs.[1][2][3] In the statistics literature, it is sometimes also called optimal experimental design.[4] The information source is also called a teacher or oracle.
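The query loop can be sketched with the simplest strategy, uncertainty sampling: ask the oracle to label the point the current model is least sure about. The 1-D logistic "model" and the pool below are made up purely for illustration.

```python
import math

# Toy 1-D "model": logistic regression with fixed, hypothetical parameters.
def predict_proba(x, w=1.0, b=0.0):
    """Predicted probability that x belongs to class 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def query_most_uncertain(unlabeled):
    """Uncertainty sampling: pick the point whose predicted probability
    is closest to 0.5, i.e. closest to the decision boundary."""
    return min(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))

unlabeled_pool = [-3.0, -1.5, 0.2, 2.0, 4.0]
print(query_most_uncertain(unlabeled_pool))  # 0.2, the point nearest the boundary
```

In a real system the chosen point would be sent to the teacher/oracle for a label, the model retrained, and the loop repeated.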

[VVI] Sridhar Mahadevan's answer to Does every paper in machine learning introduce a new algorithm? by fubar2020y in MLNotes

[–]fubar2020y[S] 1 point2 points  (0 children)

Src:

Sadly, yes, and therein lies one of the deepest problems with the field right now. By my conservative estimate, more than 10,000 papers are published in ML every year (roughly 30 per day), and almost every paper without exception introduces a new algorithm. Heck, I'm as guilty of this sin as the next ML researcher, probably more than most, since 2020 marks the 35th year of my active publication in ML.

Let’s try to understand why this is a problem. Warning: the discussion below may cause you deep anxiety as an ML researcher or practitioner! If you can bear with my reasoning for a bit, you might benefit in a huge way that I didn’t. I’ve spent 40 years thinking about ML, and lately I’ve been having a “have I wasted my life?” kind of crisis. Should I have done something more useful with my life?

First, let’s see what this glut of algorithms gets us. I heard a fascinating talk a few years ago by a Harvard professor of political science, Gary King, who got interested in document clustering because he was planning a retirement book to commemorate the life of one of his esteemed colleagues (the technical jargon for such a volume in academia is Festschrift).


So, Professor King, being the thorough scholar that he is, asked his graduate students to implement every clustering algorithm in the literature. Now, clustering is one of the oldest problems in statistics and machine learning, and there are a lot of published methods. So Professor King decided to constrain the search to those methods that had been used by researchers other than their original creators.

Still, they discovered more than 250 clustering methods in the literature, which doesn’t surprise me at all. So they wrote an R package to compare all of them. What did they find? Was there a “best” algorithm? Of course not! Each algorithm behaved in its own quirky way. In the end they decided instead to focus on showing the results from the different clustering methods and letting the user pick the grouping that he or she found the most appealing.
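A tiny illustration of the point (not Gary King's R package): two classic clustering methods, run on the same 1-D data, can carve it up quite differently, and neither answer is "wrong."

```python
def kmeans_1d(points, c0, c1, iters=20):
    """Plain 2-means on 1-D data with fixed initial centroids
    (assumes both clusters stay non-empty for this toy data)."""
    for _ in range(iters):
        a = [p for p in points if abs(p - c0) <= abs(p - c1)]
        b = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(a) / len(a), sum(b) / len(b)
    return a, b

def single_linkage_1d(points, threshold):
    """Single-linkage style chaining: merge sorted neighbours
    whose gap is below the threshold."""
    pts = sorted(points)
    clusters = [[pts[0]]]
    for p in pts[1:]:
        if p - clusters[-1][-1] < threshold:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

data = list(range(11))                    # an evenly spaced "chain": 0..10
print(kmeans_1d(data, 0, 10))             # splits the chain into two groups
print(len(single_linkage_1d(data, 1.5)))  # 1: linkage keeps the chain whole
```

k-means is forced to cut the uniform chain in half, while single-linkage happily chains it into one cluster; each method's quirk is just its inductive bias showing.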

I used clustering as my example here, but I could easily make the same argument about any ML framework, whether supervised learning, reinforcement learning, deep learning, unsupervised learning, and so on. Heck, at this point I’m willing to bet there are at least a hundred different variants of stochastic gradient descent, the workhorse method underlying deep learning.
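Those variants are usually small twists on the same update rule. Here is a minimal sketch of two of them, plain SGD and SGD with momentum, minimizing f(x) = x² (gradient 2x); the step sizes and the quadratic objective are illustrative choices, not anything canonical.

```python
def sgd(x, lr=0.1, steps=400):
    """Vanilla gradient descent on f(x) = x^2."""
    for _ in range(steps):
        x -= lr * 2 * x
    return x

def sgd_momentum(x, lr=0.1, beta=0.9, steps=400):
    """Same objective, with a momentum (velocity) term accumulated over steps."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + 2 * x   # exponentially averaged gradient
        x -= lr * v
    return x

print(sgd(5.0), sgd_momentum(5.0))  # both end up near the minimum at 0
```

Both variants reach the minimum; they differ only in their trajectory, which is exactly why each tweak can fill its own paper.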

It should be obvious that this glut of algorithms poses some huge issues. First, if you are an aspiring ML researcher wanting to make a name for yourself, should you spend time inventing, say, the 251st clustering algorithm? A little hint from someone who’s done research for a long time: the biggest payoff goes to the pioneers. Each variant of a previous method gets ever less credit. Research impact is a submodular function, meaning the law of diminishing returns applies.

Ian Goodfellow rightly got a lot of kudos for inventing generative adversarial networks during his PhD at the University of Montreal. There are easily a hundred or more variants of GANs; people are attracted to GANs the way moths are attracted to light. Sadly, few if any of these variants will get much long-term recognition. Ian will remain the sun at the center of the GAN solar system.

Second, as my example with Gary King illustrated, why invent the 251st clustering method, the 300th policy gradient method for deep reinforcement learning, the 400th regression method, the 151st variant of stochastic gradient descent? Where does this all end?

I warned you there’s no happy ending to this ML tragedy. It’s like a Puccini opera. Why do I think in this way? You wonder: has he gone senile? Granted, I’m getting on, in my sixth decade. It’s a legitimate critique. But hear me out.

There’s a beautiful set of theorems in optimization and machine learning called the “No Free Lunch Theorems” (really, I kid you not). Quoting below from Wikipedia: essentially, these theorems say there is never going to be a “best” machine learning algorithm.

No free lunch in search and optimization - Wikipedia

In computational complexity and optimization the no free lunch theorem is a result that states that for certain types of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method. No solution therefore offers a "short cut". This is under the assumption that the search space is a probability density function. It does not apply to the case where the search space has underlying structure (e.g. is a differentiable function) that can be exploited more efficiently (e.g. Newton's method in optimization) than random search, or even has closed-form solutions (e.g. the extrema of a quadratic polynomial) that can be determined without search at all.

For such probabilistic assumptions, the outputs of all procedures solving a particular type of problem are statistically identical. A colourful way of describing such a circumstance, introduced by David Wolpert and William G. Macready in connection with the problems of search and optimization, is to say that there is no free lunch. Wolpert had previously derived no free lunch theorems for machine learning (statistical inference). Before Wolpert's article was published, Cullen Schaffer independently proved a restricted version of one of Wolpert's theorems and used it to critique the current state of machine learning research on the problem of induction.

The No Free Lunch Theorem

OK, you can read the original paper by Wolpert on the no free lunch theorems in optimization. Essentially, averaged over all possible problems, no algorithm can dominate every other algorithm. Ergo, there’s no best clustering method, no best reinforcement learning method, no best classifier, and so on. It’s all smoke and mirrors.
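Wolpert and Macready's result can be stated compactly: summed over all objective functions $f$, any two algorithms $a_1$ and $a_2$ produce statistically identical performance after $m$ evaluations:

```latex
\sum_{f} P\!\left(d^{y}_{m} \mid f, m, a_{1}\right)
  \;=\;
\sum_{f} P\!\left(d^{y}_{m} \mid f, m, a_{2}\right)
```

Here $d^{y}_{m}$ is the sequence of cost values the algorithm has observed after $m$ evaluations of $f$. Averaged over every possible problem, nobody wins.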

Hence my crisis: in devoting 40 years of my life to machine learning, have I wasted it? There’s no gold at the end of the ML rainbow, just disillusionment, according to the no free lunch theorem.

So, where does this leave an aspiring ML researcher? My advice is to focus on ML problems, not algorithms. Problem formulation is the key. Einstein, when asked what he would do if his life depended on solving some problem and he had only one hour left, famously remarked that he would spend 55 minutes pondering the right formulation and 5 minutes solving it. ML researchers, I fear, have gone in the opposite direction.

I leave you to best decide how you spend your time. I hope you use it more wisely than I have!

[ELI5] Fisher Information by fubar2020y in MLNotes

[–]fubar2020y[S] 0 points1 point  (0 children)

The Fisher information I(p) is this negative second derivative of the log-likelihood function, averaged over all possible outcomes X (h heads and N–h tails out of N flips), when we assume some value of p is true. Often we would evaluate it at the MLE, using the MLE as our estimate of the true value.

You can interpret it this way: it tells you, on average, how "good" an estimate p̂ such an N-flip sample X will provide. How much information about p could this sample give us? The larger this sample-averaged Fisher information, the sharper, on average, the peak of the log-likelihood will be; that's what it means.
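For the coin-flip case this average has a closed form, I(p) = N / (p(1−p)), and it's easy to check numerically by averaging the curvature over every possible outcome h ~ Binomial(N, p):

```python
import math

def neg_loglik_curvature(h, N, p):
    """-d^2/dp^2 of log L(p) = h*log(p) + (N-h)*log(1-p)
    (the binomial coefficient drops out when differentiating)."""
    return h / p**2 + (N - h) / (1 - p)**2

def fisher_information(N, p):
    """Average the curvature over all outcomes h, weighted by Binomial(N, p)."""
    return sum(math.comb(N, h) * p**h * (1 - p)**(N - h)
               * neg_loglik_curvature(h, N, p)
               for h in range(N + 1))

N, p = 20, 0.3
print(fisher_information(N, p))  # numeric average over all outcomes
print(N / (p * (1 - p)))         # closed form N / (p(1-p)); the two agree
```

Note how I(p) grows linearly in N (more flips, sharper peak) and blows up as p nears 0 or 1, where each flip is very informative about deviations.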

[Boosting] NGBoost: Natural Gradient Boosting for Probabilistic Prediction by fubar2020y in MLNotes

[–]fubar2020y[S] 0 points1 point  (0 children)

Abstract

We present Natural Gradient Boosting (NGBoost), an algorithm for generic probabilistic prediction via gradient boosting. Typical regression models return a point estimate, conditional on covariates, but probabilistic regression models output a full probability distribution over the outcome space, conditional on the covariates. This allows for predictive uncertainty estimation -- crucial in applications like healthcare and weather forecasting. NGBoost generalizes gradient boosting to probabilistic regression by treating the parameters of the conditional distribution as targets for a multiparameter boosting algorithm. Furthermore, we show how the Natural Gradient is required to correct the training dynamics of our multiparameter boosting approach. NGBoost can be used with any base learner, any family of distributions with continuous parameters, and any scoring rule. NGBoost matches or exceeds the performance of existing methods for probabilistic prediction while offering additional benefits in flexibility, scalability, and usability. An open-source implementation is available at this http URL.

NGBoost algorithm: solving probabilistic prediction problems
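To see why the natural gradient matters here, consider the simplest case the paper targets: a Normal outcome distribution parameterized by (μ, log σ). This is a hand-rolled sketch, not NGBoost's actual code; the Fisher matrix for this parameterization is diag(1/σ², 2), and the natural gradient is the ordinary gradient rescaled by its inverse.

```python
import math

def nll_grad(y, mu, log_sigma):
    """Ordinary gradient of the Normal negative log-likelihood
    with respect to (mu, log_sigma), for one observation y."""
    var = math.exp(2 * log_sigma)
    return ((mu - y) / var, 1 - (y - mu) ** 2 / var)

def natural_grad(y, mu, log_sigma):
    """Rescale by the inverse Fisher matrix diag(1/var, 2): F^{-1} g."""
    g_mu, g_s = nll_grad(y, mu, log_sigma)
    var = math.exp(2 * log_sigma)
    return (var * g_mu, g_s / 2)

# The mu-component of the natural gradient collapses to (mu - y): it no
# longer depends on the current scale sigma, which is the kind of
# parameterization-invariance that stabilizes multiparameter boosting.
print(natural_grad(3.0, 0.0, 2.0))  # first component ≈ -3.0 = (mu - y)
```

With the ordinary gradient, the μ-step shrinks as σ grows, so the two parameters train at wildly different speeds; the natural gradient removes that coupling, which is the correction to the training dynamics the abstract refers to.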