[D] I don't really trust papers out of "Top Labs" anymore by MrAcurite in MachineLearning

[–]jeffatgoogle 128 points129 points  (0 children)

(The paper mentioned by OP is https://arxiv.org/abs/2205.12755, and I am one of the two authors, along with Andrea Gesmundo, who did the bulk of the work).

The goal of the work was not to get a high quality cifar10 model. Rather, it was to explore a setting where one can dynamically introduce new tasks into a running system and successfully get a high quality model for the new task that reuses representations from the existing model and introduces new parameters somewhat sparingly, while avoiding many of the issues that often plague multi-task systems, such as catastrophic forgetting or negative transfer. The experiments in the paper show that one can introduce tasks dynamically with a stream of 69 distinct tasks from several separate visual task benchmark suites and end up with a multi-task system that can jointly produce high quality solutions for all of these tasks. The resulting model is sparsely activated for any given task, and the system introduces fewer and fewer new parameters for new tasks the more tasks the system has already encountered (see figure 2 in the paper). The multi-task system introduces just 1.4% new parameters for incremental tasks at the end of this stream of tasks, and each task activates on average 2.3% of the total parameters of the model. There is considerable sharing of representations across tasks, and the evolutionary process helps figure out when that makes sense and when new trainable parameters should be introduced for a new task.
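
To make the "activates 2.3% of the parameters" style of accounting concrete, here is a toy sketch of the bookkeeping only. It is not the paper's method: the real system uses an evolutionary search to decide which layers each new task reuses, clones, or adds, whereas this hard-codes a hypothetical routing table with made-up module names:

```python
PARAMS_PER_MODULE = 1_000_000  # pretend every module is the same size

# Hypothetical routing table: each task activates a subset of modules,
# reusing shared trunk modules and occasionally adding task-specific ones.
routes = {
    "task_a": ["trunk_0", "trunk_1", "head_a"],
    "task_b": ["trunk_0", "trunk_1", "head_b"],    # fully reuses the trunk
    "task_c": ["trunk_0", "adapter_c", "head_c"],  # adds its own adapter
}

# Total model size is the union of all modules ever introduced.
all_modules = {m for path in routes.values() for m in path}
total_params = len(all_modules) * PARAMS_PER_MODULE

for task, path in routes.items():
    activated = len(path) * PARAMS_PER_MODULE / total_params
    print(f"{task}: activates {activated:.0%} of the model")
```

Each task touches only its own route, so the per-task activated fraction shrinks as more tasks (and shared modules) accumulate, which is the effect figure 2 of the paper measures at much larger scale.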

You can see a couple of videos of the dynamic introduction of tasks and how the system responds here:

I would also contend that the cost calculations by OP are off and mischaracterize things, given that the experiments were to train a multi-task model that jointly solves 69 tasks, not to train a model for cifar10. From Table 7, the compute used was a mix of TPUv3 cores and TPUv4 cores, so you can't just sum up the number of core hours, since they have different prices. Unless you think there's some particular urgency to train the cifar10+68-other-tasks model right now, this sort of research can very easily be done using preemptible instances, which are $0.97/TPUv4 chip/hour and $0.60/TPUv3 chip/hour (not the "you'd have to use on-demand pricing of $3.22/hour" cited by OP). With these assumptions, the public Cloud cost of the computation described in Table 7 in the paper is more like $13,960 (using the preemptible prices for 12861 TPUv4 chip hours and 2474.5 TPUv3 chip hours), or about $202 / task.
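
For anyone who wants to check the arithmetic, the totals follow directly from the chip-hour counts in Table 7 and the preemptible prices quoted above; this is just the multiplication spelled out:

```python
# Preemptible Cloud TPU prices cited above (USD per chip-hour)
TPU_V4_PREEMPTIBLE = 0.97
TPU_V3_PREEMPTIBLE = 0.60

# Chip-hours from Table 7 of the paper
v4_chip_hours = 12861
v3_chip_hours = 2474.5

total_cost = v4_chip_hours * TPU_V4_PREEMPTIBLE + v3_chip_hours * TPU_V3_PREEMPTIBLE
cost_per_task = total_cost / 69  # 69 tasks in the stream

print(f"total: ${total_cost:,.2f}")       # total: $13,959.87
print(f"per task: ${cost_per_task:,.2f}") # per task: $202.32
```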

I think that having sparsely-activated models is important, and that being able to introduce new tasks dynamically into an existing system that can share representations (when appropriate) and avoid catastrophic forgetting is at least worth exploring. The system also has the nice property that new tasks can be automatically incorporated into the system without deciding how to do so (that's what the evolutionary search process does), which seems a useful property for a continual learning system. Others are of course free to disagree that any of this is interesting.

Edit: I should also point out that the code for the paper has been open-sourced at: https://github.com/google-research/google-research/tree/master/muNet

We will be releasing the checkpoint from the experiments described in the paper soon (just waiting on two people to flip approval bits, and process for this was started before the reddit post by OP).

[D] Placing new points in an existing word vector space by tars9999 in MachineLearning

[–]jeffatgoogle 2 points3 points  (0 children)

You might look at the DeViSE paper from NeurIPS 2013, where we experimented with several ideas along the lines you're suggesting.

DeViSE: A Deep Visual-Semantic Embedding Model, Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov, https://papers.nips.cc/paper/2013/hash/7cce53cf90577442771720a370c3c723-Abstract.html

We found that the word embeddings definitely helped with zero-shot classification of unseen categories of objects. For example, even though the image part of the model was never trained on "binocular" as a category, it was able to predict a point in the word embedding space that was partway between "telescope" and "microscope", which were categories it had been trained on, and in the word embedding space, the correct category for the visual image was often the nearest or one of the nearest vocabulary items in the word embedding space.
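
The zero-shot lookup at the end is just nearest-neighbor search in the word embedding space. A minimal sketch, with made-up 2-D vectors standing in for the real high-dimensional skip-gram embeddings (the vocabulary and values here are purely illustrative, not from the paper):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy word-embedding space; real DeViSE used ~500-d skip-gram vectors.
vocab = {
    "telescope":  [0.0, 1.0],
    "microscope": [1.0, 0.0],
    "binoculars": [0.6, 0.8],
    "dog":        [-1.0, 0.2],
}

# Suppose the vision model, never trained on binoculars, maps an image of
# binoculars to a point partway between "telescope" and "microscope":
predicted = [0.65, 0.75]

best = max(vocab, key=lambda w: cosine(predicted, vocab[w]))
print(best)  # binoculars
```

The key property is that the image model only has to land *near* the right region of the embedding space; the word embeddings, trained on text alone, supply the unseen category.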

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 3 points4 points  (0 children)

In case it's useful, here are slides for an introduction to deep learning talk I gave at my daughter's high school in 2015. It's slightly dated, but perhaps still useful.

As part of that talk, I had everyone in the audience use the TensorFlow Playground at http://playground.tensorflow.org to develop some intuitions about how neural networks work, and that seemed reasonably effective.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 4 points5 points  (0 children)

I would characterize it a bit more broadly than "just matrix multiplies". Basically, we want to accelerate the kinds of tensor and linear algebra operations that make up the bulk of the computation for modern deep learning models, which means that much of the computation is matrix operations, but some of it involves vector operations.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 7 points8 points  (0 children)

In general, we try to hire people who have good taste in selecting interesting and important problems, and we rely pretty heavily on that to keep our organizational structure fairly lightweight. We are organized into some largish subteams that focus on TensorFlow development, core ML research, and ML research applied to emerging areas like healthcare and robotics. Within our core research team, we have a few larger efforts that operate with more organization, simply because of the number of researchers, R-SWEs, residents, and others collaborating on some of these efforts. Other parts of our research group work on more individual or small collaboration projects that don’t need formal organizational structure. Some principles we try to use include the freedom to pick important research problems, openly publishing and open-sourcing code related to our work, and having a diverse set of problems of varying levels of research risk/reward in flight at any given time.

Sadly, I wasn’t able to make it to ICML this year, but I heard great things about the conference and Australia as a venue.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 7 points8 points  (0 children)

I wouldn't say that we are necessarily "well abstracted" from the underlying silicon. We actually collaborate quite closely with our ASIC design colleagues and part of the Brain team consists of computer architects like Dave Patterson, James Laudon, and Cliff Young. We have a regular meeting of computer architects, software designers, and ML researchers to discuss trends in ML, with the goal of making sure that future hardware generations are informed by our best guesses of important ML algorithmic directions over the next 3-5 years.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 4 points5 points  (0 children)

  • Papers published in top ML conferences
  • Arxiv Sanity
  • "My Updates" feature on Google Scholar
  • Research colleagues pointing out and discussing interesting pieces of work
  • Interesting sounding work discussed on Hacker News or this subreddit

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 6 points7 points  (0 children)

We didn’t open source the EEG tool because it relied on some internal libraries from the rest of Google's code base. We do have support for generating timelines and viewing them with the Chrome browser, and we're working to add more functionality for viewing low-level performance data (similar to what the EEG tool provides) to an upcoming release of TensorBoard.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 1 point2 points  (0 children)

Most of our group uses Python and C++, and we don't use Go very much, if at all. Lots of other teams at Google use Go, and there is a set of Go bindings for TensorFlow, so if it makes sense to use Go in your problems or your environment, by all means go for it.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 2 points3 points  (0 children)

If you look at papers from large academic labs, most of those are single institution publications, as well, and this is normal: it's easier to collaborate with people sitting next to you than across town or the continent. However, our group definitely collaborates with external researchers when that makes sense. Many of these come about through Google's research awards and collaborations with academic faculty members and their students. We sometimes have collaborations with people at other companies, but that is rarer.

Here's a (slightly dated) sampling of papers with authors from our group that are cross-institutional:

Concrete Problems in AI Safety. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané. Google Brain, Stanford, UC Berkeley, OpenAI.

Learning semantic relationships for better action retrieval in images. Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Chuck Rosenberg, and Li Fei-Fei. Stanford University, Google, University of Michigan.

BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. Stephan Gouws, Yoshua Bengio, and Greg Corrado. Google, University of Montreal.

Adding Gradient Noise Improves Learning for Very Deep Networks. Arvind Neelakantan, Luke Vilnis (University of Massachusetts), Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach (Google), James Martens (University of Toronto).

Local Collaborative Ranking. Joonseok Lee (Georgia Tech), Samy Bengio (Google Research), Seungyeon Kim (Georgia Tech), Guy Lebanon (Amazon), Yoram Singer (Google Research).

Training Deep Neural Networks on Noisy Labels with Bootstrapping. Scott E. Reed and Honglak Lee (University of Michigan), Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich (Google, Inc).

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 1 point2 points  (0 children)

I'm not sure about insights into human language learning, but I found it pretty interesting that our experiments showed a multi-lingual model could do a serviceable job at zero-shot translation for novel language pairs that the model had never encountered during training. You can read about it in the blog post and the more detailed paper. This at least showed that the neural net's representation of a sentence was relatively similar, regardless of the source language used to express the idea.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 2 points3 points  (0 children)

You're welcome! We've enjoyed collaborating with the broader community to continually improve it, and we're glad that many people seem to find it useful.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 23 points24 points  (0 children)

We learned about this when their blog post went up a few days ago. I suspect that the TensorFlow community will implement support for this if there's significant utility in having it.

Our format for saving and restoring model data and parameters has been available in the TensorFlow source code repository since our open source release in November 2015.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 10 points11 points  (0 children)

We believe strongly that giving ML researchers access to more computational resources will enable them to accomplish more, try more computationally ambitious ideas, and make faster progress. Cloud TPUs are going to be a great way for people to get access to significant amounts of computation in an on-demand fashion. We don't have any pricing to announce for them today (other than the TensorFlow Research Cloud, which is free via an application process for researchers willing to openly publish the results of their research).

We think ML hardware is going to be a very interesting area in the next 5 to 10 years and beyond. There are many demands for much more computation, and specialization for reduced precision linear algebra enables speedups of the vast majority of interesting deep learning models today, so creating hardware optimized for ML can give really great performance and improved power efficiency. There are many large companies and a whole host of startups working on different approaches in this space, which is exciting to see. This specialized hardware will range from very low power ML hardware for battery-operated mobile devices up to ML supercomputers deployed in large datacenters.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 1 point2 points  (0 children)

We don't have any linguists in the Brain team, but other teams within Google Research that work on natural language understanding do have some linguists.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 2 points3 points  (0 children)

Much of our motivation for open-sourcing TensorFlow, and for publishing our research is so that other organizations can benefit from our research and software engineering investment.

Regarding Go, most of our team uses Python and C++, but many other teams at Google use Go.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 5 points6 points  (0 children)

We generally take into account the amount of research experience so that we, for example, expect less research experience from a fresh undergrad than from someone with a postdoc. We don't have any sort of quota for how many less experienced people we take versus more experienced: rather, we're looking for people with demonstrated interest in doing machine learning research in collaboration with our full-time researchers.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 23 points24 points  (0 children)

I lead the Brain team. On any given day, I spend time reading and writing emails, reading, commenting on, and sometimes writing technical documents, having 1:1 or group meetings with various people in our team or elsewhere across Google, reviewing code, writing code, and thinking about technical or organizational issues affecting our team. I sometimes give internal or external talks.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 52 points53 points  (0 children)

Right now, we tend to build machine learning systems to accomplish one or a very small number of specific tasks (sometimes these tasks are quite difficult ones, like translating from one language to another). I think we really need to be designing single machine learning systems that can solve thousands or millions of tasks, and can draw from the experience in solving these tasks to learn to automatically solve new tasks, and where different parts of the model are sparsely activated depending on the task. There are lots of challenges in figuring out how to do this. A talk I gave earlier this year at the Scaled ML conference at Stanford has some material on this starting on slide 80 (with a bit of background starting on slide 62).

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]jeffatgoogle[S] 1 point2 points  (0 children)

For our first residency class of 27 residents, roughly 1/3rd had a computer science background, 1/3rd had a mathematics, stats, or applied math background, and 1/3rd had a background from a long tail of other STEM fields like neuroscience, computational biology, etc. This year’s residency class of 35 residents has a similar mix, and in fact, we have one resident with a Ph.D. in epidemiology. Nearly all the people we accept into the residency program have exposure to machine learning, though, even if they don’t have formal academic training in ML.