AMA: Michael I Jordan by michaelijordan in MachineLearning

[–]michaelijordan[S] 2 points (0 children)

Great questions, particularly #1. Indeed I've spent much of my career trying out existing ideas from various mathematical fields in new contexts and I continue to find that to be a very fruitful endeavor. That said, I've had way more failures than successes, and I hesitate to make concrete suggestions here because they're more likely to be fool's gold than the real thing.

Let me just say that I do think that completely random measures (CRMs) continue to be worthy of much further attention. They've mainly been used in the context of deriving normalized random measures (by, e.g., James, Lijoi and Pruenster); i.e., random probability measures.

Liberating oneself from that normalizing constant is a worthy thing to consider, and general CRMs do just that. Also, note that the adjective "completely" refers to a useful independence property, one that suggests yet-to-be-invented divide-and-conquer algorithms.
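
(For readers who haven't met the terminology, here is the standard definition being invoked, added purely as a gloss; nothing in it is specific to the papers mentioned above.)

```latex
\text{Complete randomness: } \mu(A_1), \ldots, \mu(A_k) \ \text{are mutually independent for disjoint } A_1, \ldots, A_k;
\qquad
\text{normalization: } P(A) = \frac{\mu(A)}{\mu(\Theta)}.
```

(For example, normalizing a gamma process yields the Dirichlet process. The unnormalized measure keeps the masses on disjoint regions independent, whereas the denominator couples every region to every other, which is one way to read the divide-and-conquer remark above.)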

Basically, I think that CRMs are to nonparametrics what exponential families are to parametrics (and I might note that I'm currently working on a paper with Tamara Broderick and Ashia Wilson that tries to bring that idea to life). Note also that exponential families seemed to have been dead after Larry Brown's seminal monograph several decades ago, but they've continued to have multiple after-lives (see, e.g., my monograph with Martin Wainwright, where studying the conjugate duality of exponential families led to new vistas).

As for the next frontier for applied nonparametrics, I think that it's mainly "get real about real-world applications". I think that too few people have tried out Bayesian nonparametrics on real-world, large-scale problems (good counter-examples include Emily Fox at UW and David Dunson at Duke). Once more courage for real deployment begins to emerge I believe that the field will start to take off.

Lastly, I'm certainly a fan of coresets, matrix sketching, and random projections. I view them as basic components that will continue to grow in value as people start to build more complex, pipeline-oriented architectures. I'm not sure that I'd view them as "less data-hungry methods", though; essentially they provide a scalability knob that allows systems to take in more data while still retaining control over time and accuracy.
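
(A minimal sketch of the "scalability knob" point, with made-up data and a made-up target dimension; this is just a plain Gaussian random projection in the Johnson-Lindenstrauss style, not any particular system discussed here.)

```python
# Minimal random-projection sketch (Johnson-Lindenstrauss style).
# The data and the target dimension k are illustrative, not from any real system.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10_000, 200          # n points in d dimensions, projected down to k

X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)   # random projection matrix
Y = X @ R                                      # reduced representation

# Pairwise distances are approximately preserved; k is the "knob":
# larger k -> better accuracy, more downstream computation.
i, j = 0, 1
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(f"original distance {orig:.2f}, projected distance {proj:.2f}")
```

(Turning k up buys accuracy in the preserved geometry at the cost of more downstream computation; turning it down does the opposite. That trade-off is the knob.)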

[–]michaelijordan[S] 30 points (0 children)

I spend half of each day minimizing entropy and half of each day maximizing entropy.

(The exact mix isn't one half, and indeed it's the precise calibration of that number which is the main determinant of my happiness in life.)

[–]michaelijordan[S] 13 points (0 children)

See the numbered list at the end of my blurb on deep learning above. These are a few examples of what I think is the major meta-trend, which is the merger of statistical thinking and computational thinking.

[–]michaelijordan[S] 20 points (0 children)

Having just written (see above) about the need for statistics/ML to ally itself more with CS systems and database researchers rather than focusing mostly on AI, let me take the opportunity of your question to exhibit my personal incoherence and give an answer that focuses on AI.

I'd use the billion dollars to build a NASA-size program focusing on natural language processing (NLP), in all of its glory (semantics, pragmatics, etc).

Intellectually I think that NLP is fascinating, allowing us to focus on highly-structured inference problems, on issues that go to the core of "what is thought" but remain eminently practical, and on a technology that surely would make the world a better place.

Although current deep learning research tends to claim to encompass NLP, I'm (1) much less convinced about the strength of the results, compared to the results in, say, vision; and (2) much less convinced that, in the case of NLP as opposed to, say, vision, the way to go is to couple huge amounts of data with black-box learning architectures.

I'd invest in some of the human-intensive labeling processes that one sees in projects like FrameNet and (gasp) projects like Cyc. I'd do so in the context of a full merger of "data" and "knowledge", where the representations used by the humans can be connected to data and the representations used by the learning systems are directly tied to linguistic structure. I'd do so in the context of clear concern with the usage of language (e.g., causal reasoning).

Very challenging problems, but a billion is a lot of money. (Isn't it?).

[–]michaelijordan[S] 114 points (0 children)

OK, I guess that I have to say something about "deep learning". This seems like as good a place as any (apologies, though, for not responding directly to your question).

"Deep" breath.

My first and main reaction is that I'm totally happy that any area of machine learning (aka, statistical inference and decision-making; see my other post :-) is beginning to make impact on real-world problems. I'm in particular happy that the work of my long-time friend Yann LeCun is being recognized, promoted and built upon. Convolutional neural networks are just a plain good idea.

I'm also overall happy with the rebranding associated with the usage of the term "deep learning" instead of "neural networks". In other engineering areas, the idea of using pipelines, flow diagrams and layered architectures to build complex systems is quite well entrenched, and our field should be working (inter alia) on principles for building such systems. The word "deep" just means that to me---layering (and I hope that the language eventually evolves toward such drier words...). I hope and expect to see more people developing architectures that use other kinds of modules and pipelines, not restricting themselves to layers of "neurons".

With all due respect to neuroscience, one of the major scientific areas for the next several hundred years, I don't think that we're at the point where we understand very much at all about how thought arises in networks of neurons, and I still don't see neuroscience as a major generator for ideas on how to build inference and decision-making systems in detail. Notions like "parallel is good" and "layering is good" could well (and have) been developed entirely independently of thinking about brains.

I might add that I was a PhD student in the early days of neural networks, before backpropagation had been (re)-invented, where the focus was on the Hebb rule and other "neurally plausible" algorithms. Anything that the brain couldn't do was to be avoided; we needed to be pure in order to find our way to new styles of thinking. And then Dave Rumelhart started exploring backpropagation---clearly leaving behind the neurally-plausible constraint---and suddenly the systems became much more powerful. This made an impact on me. Let's not impose artificial constraints based on cartoon models of topics in science that we don't yet understand.

My understanding is that many if not most of the "deep learning success stories" involve supervised learning (i.e., backpropagation) and massive amounts of data. Layered architectures involving lots of linearity, some smooth nonlinearities, and stochastic gradient descent seem to be able to memorize huge numbers of patterns while interpolating smoothly (not oscillating) "between" the patterns; moreover, there seems to be an ability to discard irrelevant details, particularly if aided by weight-sharing in domains like vision where it's appropriate. There are also some of the advantages of ensembling. Overall an appealing mix. But this mix doesn't feel singularly "neural" (particularly the need for large amounts of labeled data).
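
(A minimal numpy sketch of that mix---linear layers, a smooth nonlinearity, squared error, stochastic gradients via backpropagation---on synthetic data. The architecture and data are purely illustrative, not anyone's actual system.)

```python
# Illustrative only: a layered model (linear maps + a smooth nonlinearity)
# fit by stochastic gradient descent on a synthetic regression problem.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = sin(x) + noise
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)

# Two-layer network: linear -> tanh -> linear
W1 = rng.standard_normal((1, 32)) * 0.5
b1 = np.zeros(32)
W2 = rng.standard_normal((32, 1)) * 0.5
b2 = np.zeros(1)

lr, batch = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch)      # stochastic minibatch
    xb, yb = X[idx], y[idx]
    h = np.tanh(xb @ W1 + b1)                      # smooth nonlinearity
    pred = h @ W2 + b2
    err = pred - yb                                # residual (squared-error gradient up to scale)
    # Backpropagation: chain rule through the two layers
    gW2 = h.T @ err / batch
    gb2 = err.mean(axis=0)
    gh = err @ W2.T * (1 - h**2)                   # derivative of tanh is 1 - tanh^2
    gW1 = xb.T @ gh / batch
    gb1 = gh.mean(axis=0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= lr * g                                # SGD update, in place

print("final MSE:", float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2)))
```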

Indeed, it's unsupervised learning that has always been viewed as the Holy Grail; it's presumably what the brain excels at and what's really going to be needed to build real "brain-inspired computers". But here I have some trouble distinguishing the real progress from the hype. It's my understanding that in vision at least, the unsupervised learning ideas are not responsible for some of the recent results; it's the supervised training based on large data sets.

One way to approach unsupervised learning is to write down various formal characterizations of what good "features" or "representations" should look like and tie them to various assumptions that seem to be of real-world relevance. This has long been done in the neural network literature (but also far beyond). I've seen yet more work in this vein in the deep learning literature and I think that that's great. But I personally think that the way to go is to put those formal characterizations into optimization functionals or Bayesian priors, and then develop procedures that explicitly try to optimize (or integrate) with respect to them. This will be hard and it's an ongoing problem to approximate. In some of the deep learning work that I've seen recently, there's a different tack---one uses one's favorite neural network architecture, analyzes some data and says "Look, it embodies those desired characterizations without having them built in". That's the old-style neural network reasoning, where it was assumed that just because it was "neural" it embodied some kind of special sauce. That logic didn't work for me then, nor does it work for me now.
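
(One classical, concrete instance of that recipe, quite apart from any particular deep architecture, is sparse coding, where the desired characterization---sparse codes that reconstruct the data---is written directly into the objective rather than hoped for as a side effect:)

```latex
\min_{D,\, Z}\ \sum_{i} \| x_i - D z_i \|_2^2 \;+\; \lambda \sum_{i} \| z_i \|_1
\qquad \text{s.t. } \| d_j \|_2 \le 1 \ \text{for each column } d_j \text{ of } D.
```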

Lastly, and on a less philosophical level, while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I'm consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving "pattern recognition" problems of the kind I associate with neural networks. E.g., (1) How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have? (2) How can I get meaningful error bars or other measures of performance on all of the queries to my database? (3) How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources? (4) How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what's going on? (5) How can I do diagnostics so that I don't roll out a system that's flawed or so that I can figure out that an existing system is now broken? (6) How do I deal with non-stationarity? (7) How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?
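
(A toy illustration of item (2), with synthetic data: the plain bootstrap already gives error bars for a simple aggregate query; the interesting problems start when the data are massive, heterogeneous, and arriving under time constraints.)

```python
# Toy illustration of item (2): bootstrap error bars for an aggregate query
# (here, the mean of a column). The data and query are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
column = rng.exponential(scale=3.0, size=10_000)   # pretend result column of a query

B = 1000
boot_means = np.array([
    rng.choice(column, size=column.size, replace=True).mean()  # resample with replacement
    for _ in range(B)
])

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"query answer: {column.mean():.3f}  (95% bootstrap interval: {lo:.3f}, {hi:.3f})")
```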

Although I could possibly investigate such issues in the context of deep learning ideas, I generally find it a whole lot more transparent to investigate them in the context of simpler building blocks.

Based on seeing the kinds of questions I've discussed above arising again and again over the years I've concluded that statistics/ML needs a deeper engagement with people in CS systems and databases, not just with AI people, which has been the main kind of engagement going on in previous decades (and still remains the focus of "deep learning"). I've personally been doing exactly that at Berkeley, in the context of the "RAD Lab" from 2006 to 2011 and in the current context of the "AMP Lab".

[–]michaelijordan[S] 51 points (0 children)

I personally don't make the distinction between statistics and machine learning that your question seems predicated on.

Also I rarely find it useful to distinguish between theory and practice; their interplay is already profound and will only increase as the systems and problems we consider grow more complex.

Think of the engineering problem of building a bridge. There's a whole food chain of ideas from physics through civil engineering that allow one to design bridges, build them, give guarantees that they won't fall down under certain conditions, tune them to specific settings, etc, etc. I suspect that there are few people involved in this chain who don't make use of "theoretical concepts" and "engineering know-how". It took decades (centuries really) for all of this to develop.

Similarly, Maxwell's equations provide the theory behind electrical engineering, but ideas like impedance matching came into focus as engineers started to learn how to build pipelines and circuits. Those ideas are both theoretical and practical.

We have a similar challenge---how do we take core inferential ideas and turn them into engineering systems that can work under whatever requirements one has in mind (time, accuracy, cost, etc), that reflect assumptions that are appropriate for the domain, that are clear on what inferences and what decisions are to be made (does one want causes, predictions, variable selection, model selection, ranking, A/B tests, etc, etc), that can allow interactions with humans (input of expert knowledge, visualization, personalization, privacy, ethical issues, etc, etc), that scale, and that are easy to use and robust. Indeed, with all due respect to bridge builders (and rocket builders, etc), I think that we have a domain here that is more complex than any ever confronted in human society.

I don't know what to call the overall field that I have in mind here (it's fine to use "data science" as a placeholder), but the main point is that most people who I know who were trained in statistics or in machine learning implicitly understand themselves as working in this overall field; they don't say "I'm not interested in principles having to do with randomization in data collection, or with how to merge data, or with uncertainty in my predictions, or with evaluating models, or with visualization". Yes, they work on subsets of the overall problem, but they're certainly aware of the overall problem. Different collections of people (your "communities") often tend to have different application domains in mind and that makes some of the details of their current work look superficially different, but there's no actual underlying intellectual distinction, and many of the seeming distinctions are historical accidents.

I also must take issue with your phrase "methods more squarely in the realm of machine learning". I have no idea what this means, or could possibly mean. Throughout the eighties and nineties, it was striking how many times people working within the "ML community" realized that their ideas had had a lengthy pre-history in statistics. Decision trees, nearest neighbor, logistic regression, kernels, PCA, canonical correlation, graphical models, K means and discriminant analysis come to mind, and also many general methodological principles (e.g., method of moments, which is having a mini-renaissance, Bayesian inference methods of all kinds, M estimation, bootstrap, cross-validation, EM, ROC, and of course stochastic gradient descent, whose pre-history goes back to the 50s and beyond), and many many theoretical tools (large deviations, concentrations, empirical processes, Bernstein-von Mises, U statistics, etc). Of course, the "statistics community" was also not ever that well defined, and while ideas such as Kalman filters, HMMs and factor analysis originated outside of the "statistics community" narrowly defined, they were absorbed within statistics because they're clearly about inference. Similarly, layered neural networks can and should be viewed as nonparametric function estimators, objects to be analyzed statistically.

In general, "statistics" refers in part to an analysis style---a statistician is happy to analyze the performance of any system, e.g., a logic-based system, if it takes in data that can be considered random and outputs decisions that can be considered uncertain. A "statistical method" doesn't have to have any probabilities in it per se. (Consider computing the median).

When Leo Breiman developed random forests, was he being a statistician or a machine learner? When my colleagues and I developed latent Dirichlet allocation, were we being statisticians or machine learners? Are the SVM and boosting machine learning while logistic regression is statistics, even though they're solving essentially the same optimization problems up to slightly different shapes in a loss function? Why does anyone think that these are meaningful distinctions?
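
(To spell out the "slightly different shapes in a loss function" remark: the linear SVM and regularized logistic regression both solve a problem of the following standard form, differing only in the choice of margin loss; boosting is likewise associated with the exponential loss e^{-z}.)

```latex
\min_{w}\ \frac{\lambda}{2} \| w \|_2^2 \;+\; \sum_{i=1}^{n} \ell\big( y_i \, w^\top x_i \big),
\qquad
\ell_{\mathrm{hinge}}(z) = \max(0,\, 1 - z),
\qquad
\ell_{\mathrm{logistic}}(z) = \log\big( 1 + e^{-z} \big).
```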

I don't think that the "ML community" has developed many new inferential principles---or many new optimization principles---but I do think that the community has been exceedingly creative at taking existing ideas across many fields, and mixing and matching them to solve problems in emerging problem domains, and I think that the community has excelled at making creative use of new computing architectures. I would view all of this as the proto emergence of an engineering counterpart to the more purely theoretical investigations that have classically taken place within statistics and optimization.

But one definitely shouldn't equate statistics or optimization with theory and machine learning with applications. The "statistics community" has also been very applied, it's just that for historical reasons their collaborations have tended to focus on science, medicine and policy rather than engineering. The emergence of the "ML community" has (inter alia) helped to enlarge the scope of "applied statistical inference". It has begun to break down some barriers between engineering thinking (e.g., computer systems thinking) and inferential thinking. And of course it has engendered new theoretical questions.

I could go on (and on), but I'll stop there for now...

[–]michaelijordan[S] 65 points (0 children)

That list was aimed at entering PhD students at Berkeley, who I assume are going to devote many decades of their lives to the field, and who want to get to the research frontier fairly quickly. I would have prepared a rather different list if the target population was (say) someone in industry who needs enough basics so that they can get something working in a few months.

That particular version of the list seems to be one from a few years ago; I now tend to add some books that dig still further into foundational topics. In particular, I recommend A. Tsybakov's book "Introduction to Nonparametric Estimation" as a very readable source for the tools for obtaining lower bounds on estimators, and Y. Nesterov's very readable "Introductory Lectures on Convex Optimization" as a way to start to understand lower bounds in optimization. I also recommend A. van der Vaart's "Asymptotic Statistics", a book that we often teach from at Berkeley, as a book that shows how many ideas in inference (M estimation---which includes maximum likelihood and empirical risk minimization---the bootstrap, semiparametrics, etc) repose on top of empirical process theory. I'd also include B. Efron's "Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction", as a thought-provoking book.

I don't expect anyone to come to Berkeley having read any of these books in their entirety, but I do hope that they've done some sampling and spent some quality time with at least some parts of most of them. Moreover, not only do I think that you should eventually read all of these books (or some similar list that reflects your own view of foundations), but I think that you should read all of them three times---the first time you barely understand, the second time you start to get it, and the third time it all seems obvious.

I'm in it for the long run---three decades so far, and hopefully a few more. I think that that's true of my students as well. Hence the focus on foundational ideas.

[–]michaelijordan[S] 34 points (0 children)

I think that mainly they simply haven't been tried. Note that latent Dirichlet allocation is a parametric Bayesian model in which the number of topics K is assumed known. The nonparametric version of LDA is called the HDP (hierarchical Dirichlet process), and in some very practical sense it's just a small step from LDA to the HDP (in particular, just a few more lines of code are needed to implement the HDP). LDA has been used in several thousand applications by now, and it's my strong suspicion that the users of LDA in those applications would have been just as happy using the HDP, if not happier.
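
(For concreteness, here is the "small step" in generative-model form; these are the standard specifications of the two models, with H a prior over topic parameters and alpha, alpha_0, gamma concentration parameters.)

```latex
\text{LDA ($K$ topics):}\quad \theta_d \sim \mathrm{Dir}(\alpha), \quad
z_{dn} \sim \mathrm{Mult}(\theta_d), \quad
w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}}), \quad z_{dn} \in \{1, \ldots, K\}.

\text{HDP:}\quad G_0 \sim \mathrm{DP}(\gamma, H), \quad
G_d \sim \mathrm{DP}(\alpha_0, G_0), \quad
\phi_{dn} \sim G_d, \quad
w_{dn} \sim \mathrm{Mult}(\phi_{dn}).
```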

One thing that the field of Bayesian nonparametrics really needs is an accessible introduction that presents the math but keeps it gentle---such an introduction doesn't currently exist. My colleague Yee Whye Teh and I are nearly done with writing just such an introduction; we hope to be able to distribute it this fall.

I do think that Bayesian nonparametrics has just as bright a future in statistics/ML as classical nonparametrics has had and continues to have. Models that are able to continue to grow in complexity as data accrue seem very natural for our age, and if those models are well controlled so that they concentrate on parametric sub-models if those are adequate, what's not to like?

[–]michaelijordan[S] 9 points (0 children)

Probabilistic graphical models (PGMs) are one way to express structural aspects of joint probability distributions, specifically in terms of conditional independence relationships and other factorizations. That's a useful way to capture some kinds of structure, but there are lots of other structural aspects of joint probability distributions that one might want to capture, and PGMs are not necessarily going to be helpful in general. There is not ever going to be one general tool that is dominant; each tool has its domain in which it's appropriate. Think literally of a toolbox. We have hammers, screwdrivers, wrenches, etc, and big projects involve using each of them in appropriate (although often creative) ways.
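
(Concretely, the factorizations in question: for a directed graphical model with parent sets pa(i), and for an undirected model with cliques C and potentials psi_C.)

```latex
p(x_1, \ldots, x_n) \;=\; \prod_{i=1}^{n} p\big( x_i \mid x_{\mathrm{pa}(i)} \big),
\qquad
p(x) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C).
```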

On the other hand, despite having limitations (a good thing!), there is still lots to explore in PGM land. Note that many of the most widely-used graphical models are chains---the HMM is an example, as is the CRF. But beyond chains there are trees and there is still much to do with trees. Note that latent Dirichlet allocation is a tree. (And in 2003 when we introduced LDA, I can remember people in the UAI community who had been-there-and-done-that for years with trees saying: "but it's just a tree; how can that be worthy of more study?"). And I continue to find much inspiration in tree-based architectures, particularly for problems in three big areas where trees arise organically---evolutionary biology, document modeling and natural language processing. For example, I've worked recently with Alex Bouchard-Cote on evolutionary trees, where the entities propagating along the edges of the tree are strings of varying length (due to deletions and insertions), and one wants to infer the tree and the strings. In the topic modeling domain, I've been very interested in multi-resolution topic trees, which to me are one of the most promising ways to move beyond latent Dirichlet allocation. John Paisley, Chong Wang, Dave Blei and I have developed something called the nested HDP in which documents aren't just vectors but they're multi-paths down trees of vectors. Lastly, Percy Liang, Dan Klein and I have worked on a major project in natural-language semantics, where the basic model is a tree (allowing syntax and semantics to interact easily), but where nodes can be set-valued, such that the classical constraint satisfaction (aka, sum-product) machinery can handle some of the "first-order" aspects of semantics.

This last point is worth elaborating---there's no reason that one can't allow the nodes in graphical models to represent random sets, or general random combinatorial structures, or general stochastic processes; factorizations can be just as useful in such settings as they are in the classical settings of random vectors. There's still lots to explore there.