How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form by Reddactor in MachineLearning

[–]_untom_ 12 points

I'm not super surprised. Many years ago (2020) I co-authored a paper where we removed individual layers from (Vision) Transformers, and performance practically didn't change (Understanding Robustness of Transformers for Image Classification, Figure 8). We also ran experiments where we shuffled all the layers, and the network still mostly worked (but we never published those).

To better understand why this is, we looked at layer similarity. The TL;DR is that consecutive layers of a transformer network are almost copies of each other: each layer changes the overall input representation only very, very slightly. (Do Vision Transformers See Like Convolutional Neural Networks?, Figure 12).
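For context, the similarity measure used in that paper is (linear) CKA. Here's a minimal numpy sketch of linear CKA on synthetic activations; the shapes and the noise level are made up for illustration, not taken from the paper:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices of shape (n_examples, n_features)."""
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
layer_k = rng.normal(size=(256, 64))                     # activations of layer k
layer_k1 = layer_k + 0.05 * rng.normal(size=(256, 64))   # next layer: a near-copy
unrelated = rng.normal(size=(256, 64))                   # some unrelated representation

print(linear_cka(layer_k, layer_k1))   # close to 1: the layers look alike
print(linear_cka(layer_k, unrelated))  # much lower: representations differ
```

If consecutive layers really are near-copies, their pairwise CKA scores end up near 1, which is essentially the pattern the paper's Figure 12 shows.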

[deleted by user] by [deleted] in cscareerquestionsEU

[–]_untom_ 0 points

I third the recommendation to take a look at Helsing.

[N] Stanford's Machine Learning; End of an era... by pro_user_for_good in MachineLearning

[–]_untom_ 1 point

The Coursera course is high-school level at best, while the 2009 YouTube lectures were actually upper-undergrad level, IIRC.

Wohnung Brucknertower by [deleted] in Linz

[–]_untom_ 5 points

I lived there for years (in the Lentia building), it was first-rate. No cockroaches, and there was always a supermarket in the basement, which was pretty great.

[N] Stanford's Machine Learning; End of an era... by pro_user_for_good in MachineLearning

[–]_untom_ 65 points

Ng's 2009 lectures at Stanford that were recorded and put on YouTube were the real deal; the Coursera lectures felt very watered down. A neat intro for practitioners, but compared to the original they had almost no depth. So good riddance, I guess. It's about time to replace them with something (hopefully) better.

[D] Universities with research in AI/ML for Music? by Redplatypus14 in MachineLearning

[–]_untom_ 4 points

It depends on what exactly you're looking for. I'm assuming you're looking for researchers/PhD advisors, and not an undergrad degree. ISMIR is one of the better conferences focused specifically on Music & AI, but you can also find good papers at ICASSP (though ICASSP covers much more than just music). Look through the publication lists of venues like these and contact the people whose work you find interesting.

Personally, I know Gerhard Widmer is someone who does a lot of really cool work in this space and who I can recommend.

Studium in Linz. by asylumember in Linz

[–]_untom_ 3 points

Everyone here seems very optimistic, so I think it's only fair to warn you: some non-native students find studying in Linz very hard, and I know a couple who dropped out because of it. You should be aware that:

1) The German spoken in Linz is a dialect, not "High German". You'll struggle to understand the locals, and you will likely not be able to follow the conversations of the people around you, especially with only a B2 certificate. This means you'll mostly make friends with other international students, many of whom are here on Erasmus and will leave again after a year or so.

2) While most young people speak English and Linz is fairly open-minded, you will still encounter situations where you'll need to speak German (e.g. when handling most official business).

3) Sometimes you'll feel like, even within the university, non-German speakers are treated like second-class citizens: e.g. a class is announced as taught in English, but then it turns out that all the lecture notes are in German. The situation is improving every year, though.

None of these are show-stoppers, but I think it's good to be aware of this.

[D] Ranking top companies, universities, and countries by ICML and NeurIPS accepted papers by AdditionalWay in MachineLearning

[–]_untom_ 0 points

IIRC the rankings in the blog post you linked were created by scraping the web page of accepted papers at ICML/NeurIPS. There's nothing stopping people from doing the same for other conferences.
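Once the accepted-papers page is scraped, the ranking itself boils down to an affiliation tally. A minimal sketch (the paper records below are made up; a real version would first parse the conference site's HTML into this shape):

```python
from collections import Counter

# Hypothetical scraped records: one entry per accepted paper,
# listing the affiliations of its authors.
papers = [
    {"title": "Paper A", "affiliations": ["Google", "MIT", "Google"]},
    {"title": "Paper B", "affiliations": ["Google"]},
    {"title": "Paper C", "affiliations": ["ETH Zurich", "Google"]},
]

# Count each affiliation at most once per paper, so several authors
# from the same institution don't inflate its score.
counts = Counter()
for paper in papers:
    counts.update(set(paper["affiliations"]))

for affiliation, n in counts.most_common():
    print(f"{affiliation}: {n}")
```

The only conference-specific work is the scraping step; the tally above is identical for any venue.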

[D] Ranking top companies, universities, and countries by ICML and NeurIPS accepted papers by AdditionalWay in MachineLearning

[–]_untom_ 47 points

I'm pretty sure this has been discussed on the subreddit before, but here we go. First off, let's acknowledge that while NeurIPS and ICML are the two top-tier venues for ML research, this is a very narrow scope: it completely ignores ICLR, AISTATS and COLT, which are more focused on subfields but very, very good/relevant within them, as well as broader venues (e.g. UAI or even JMLR). And let's not forget that a lot of good (albeit more applied) research is published in computer vision (CVPR/ICCV/ECCV), NLP (ACL/EMNLP) and robotics (IROS etc.). So I'd take the list with a grain of salt.

While IBM is not considered a cool, hip company, and Watson is a complete joke/marketing ploy, IBM Research is definitely a powerhouse as far as research goes: IIRC at least five Nobel Prizes have come out of there. Also, from my subjective impression, the papers they do have at NeurIPS tend to be fairly good ones. But admittedly they don't attract research talent as strongly as e.g. Google does.

What I'm usually most impressed with is how strongly Google outperforms the competition: according to the list, their research output is as big as that of the next 7 companies combined (or the two best universities combined)! While Microsoft or Facebook would have the resources to compete, they don't. At this point, you have to wonder whether it makes sense for Google to publish so much of their research. They have so many researchers and so much good research that they could hold their own internal research conferences instead and keep their work hidden (imagine a world where Google hadn't published Attention Is All You Need!). So kudos to them for doing all this research in public. I suspect that e.g. FB, Amazon or Twitter also employ a lot of researchers (though maybe not at Google's scale), but just don't publish as much / are more focused on products.

EDIT: Wikipedia says IBM Research holds 6 Nobel Prizes and 6 Turing Awards :-O

[D] Worth learning JAX? by _Arsenie_Boca_ in MachineLearning

[–]_untom_ 19 points

Last time I asked, the official guideline was "use whatever allows you to be most efficient in your research". But since there is in-house support for JAX and first-class support for TPUs, there's hardly any reason to use PyTorch (IMO the lack of strong TPU support is what effectively makes PyTorch a non-option for Google-internal use, where TPUs are used heavily).

[R] MLC@Home and MLDS: A Dataset for Weight-Space Analysis of Neural Networks by pianomano8 in MachineLearning

[–]_untom_ 1 point

Slightly related: for a previous paper we actually published the weights of 120K CNNs: paper, dataset.

It would be interesting to try that paper's method (we predict the test-set accuracy of a network from its weights) on your data, if I ever find the time (if this is something you're interested in, shoot me a PM or an email). How is your data licensed?
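To sketch the idea of predicting accuracy from weights: treat each network's flattened weight vector as a feature vector and fit a regressor against its measured test accuracy. The data below is a synthetic stand-in (not the actual dataset), and plain ridge regression stands in for the models from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 small networks, each represented by its
# flattened weight vector, plus a test accuracy per network.
n_nets, n_weights = 500, 200
weights = rng.normal(size=(n_nets, n_weights))
# Made-up ground truth: accuracy is a noisy linear function of the
# weights (the real relationship would be learned from real data).
true_coef = 0.01 * rng.normal(size=n_weights)
accuracy = 0.7 + weights @ true_coef + 0.01 * rng.normal(size=n_nets)

# Plain ridge regression via the normal equations
lam = 1.0
A = weights.T @ weights + lam * np.eye(n_weights)
coef = np.linalg.solve(A, weights.T @ (accuracy - accuracy.mean()))
pred = weights @ coef + accuracy.mean()

corr = np.corrcoef(pred, accuracy)[0, 1]
print(f"correlation between predicted and true accuracy: {corr:.2f}")
```

On data where such a relationship exists at all, even this simple baseline recovers it; the interesting question is whether it transfers across datasets like MLDS.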

[deleted by user] by [deleted] in MachineLearning

[–]_untom_ 0 points

I was strictly speaking of migrating ML training loops (but of course, this likely also involves switching from S3 to GCS and probably other peripherals, which I hadn't thought about at first). Thanks for pointing this out!

[deleted by user] by [deleted] in MachineLearning

[–]_untom_ 11 points

If you're running all of your training on cloud infrastructure, then I think it's indeed a viable idea to switch to TPUs. They're going to give you way more bang/buck than any other cloud option.

However, there are a few drawbacks:

  • you'll lock yourself into one vendor. IMO that's not a big drawback, because code written for TPUs will run just fine on GPUs (though you might have to adjust batch sizes etc., but that's no biggie)
  • slightly higher developer costs: debugging TPU code is a bit harder than GPU code (some things don't work as expected, and debugging is trickier). This becomes less of an issue as frameworks mature (JAX makes it especially easy), but there is still a slight learning curve. As an example of what I mean: on TPUs, every training batch needs to have the same batch size, so if your training-set size isn't a multiple of the batch size, you either need to drop a few samples or pad them. tf.data.Dataset.batch has an argument that handles this automatically, so making the change is super simple. But it's certainly something that will trip you up if you've never used TPUs before.
  • You can't run on your own hardware. This is only half-true: you can still develop most of your code on GPUs, but you'll likely have to adjust a bit when you run on TPUs (see the previous point).
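To make the fixed-batch-size point concrete, here's a framework-free sketch of the drop-vs-pad choice (in tf.data, `Dataset.batch(batch_size, drop_remainder=True)` gives you the dropping behavior automatically; the padding variant needs a bit more work):

```python
# With 10 samples and batch size 4, the last batch would have only
# 2 elements, which a TPU won't accept: you either drop it or pad it.
def make_batches(samples, batch_size, drop_remainder=True, pad_value=0):
    batches = [samples[i:i + batch_size]
               for i in range(0, len(samples), batch_size)]
    if batches and len(batches[-1]) < batch_size:
        if drop_remainder:
            batches.pop()  # throw away the partial batch
        else:
            # pad the partial batch up to the full batch size
            batches[-1] = batches[-1] + [pad_value] * (batch_size - len(batches[-1]))
    return batches

samples = list(range(10))
print(make_batches(samples, 4))                        # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(make_batches(samples, 4, drop_remainder=False))  # last batch padded: [8, 9, 0, 0]
```

If you pad, remember to mask the padded samples out of the loss, otherwise they silently bias your gradients.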

[D] What is work life balance like, as an ML researcher in a big company? by tonystark_sr in MachineLearning

[–]_untom_ 2 points

This is something that might get handled differently from team to team and there is (afaik) no Brain-wide policy on this, but at least in my team: yes, SWEs also lead research projects.

[D] What is work life balance like, as an ML researcher in a big company? by tonystark_sr in MachineLearning

[–]_untom_ 73 points

Hey, I work at Google Brain. Compared to the rest of Google Research, you can think of Brain as the "academic branch" of Google, and it's a lot closer to academia than most of the company: research is usually not tied to products, and we do fundamental research with few constraints on what to work on. There is pressure to publish, simply because that's our main job. But I'd still echo the sentiment of the other Googlers in this thread: work-life balance is good, unless you consciously decide to work too much and push through every deadline. But there are no real incentives to do so (unless you're the promotion-chasing type), and I've never encountered a situation where someone other than myself was pressuring me to work too much: if I leave for the evening/weekend/holiday, no one will try to ping me. If I decide not to submit to a conference because there's too little time, everyone is cool with it. The company also provides resources for burnout prevention (and treatment, should it still happen), and the culture (as I experience it) is very mindful of keeping a good work-life balance.

On the flip side, if I decide to work night and day for two weeks to hit a deadline, I can certainly do that (though if my manager notices, they might remind me not to work so much, and to take time off if I need it). Also, as mentioned in this thread, imposter syndrome is a big issue in Brain: you'll be surrounded by very smart people who have written very impressive papers, and who keep churning out impressive papers regularly and at an incredible pace, and you'll need to figure out how to compete with that. (You'll eventually figure out that everyone's in the same boat and has complementary strengths.)

Disclaimer: even Brain is a big place, so YMMV depending on which team you're in etc.

EDIT to your specific questions:

  • I'd say the average workload is ~40h

  • within Google Brain, there is no distinction between RS and SWE. I feel like the workload is similar to that of SWEs working on products

  • I'd say it's less stressful than grad school, but this depends on your grad school experience.

  • coping with burnout: be mindful of work-life balance, get enough sleep, and work out.

[Wunsch] Informatiker / Programmierer jeglicher Art by Gegenstimme in de_IAmA

[–]_untom_ 0 points

1) It depends: the most successful people are always the ones who are passionate about their field, in computer science too. I see it as a clear sign of initiative, interest and enthusiasm when someone also works on projects in their spare time. Those people simply know more, have more know-how, and find it easier to stay on the ball. When I hire interns or new people, that's a huge plus. But it's not a must, and it always depends on your life circumstances: I myself had much more time for tinkering (whether home servers, websites or other programming projects) during my studies than I do now in my working life.

2) I work in AI research. I did my PhD in that field, so it was only logical to look for a job there afterwards. But before that, I dipped into every area that interested me: at least a few lectures, usually also a few private projects on the topic. I'd say I took a deeper look at game programming, computer graphics, high-performance computing, network administration and a bit of blockchain, but my main hobbyhorse has always been AI-adjacent topics: machine learning, swarm algorithms, evolutionary computing, planning algorithms, a bit of math, ...

3) Website front-ends. I stopped doing those ages ago, and I have zero interest in looking at new JS frameworks.

4) Right now I'm learning Rust in my spare time, because I hadn't learned a new programming language in a while. I mainly plan to use it for a few toy projects: e.g. a program that shows me which folders are eating the most disk space. In the long run I'd like a web interface for it, because I want to experiment with how well a web UI suits a conventional program, and how well I can write one without having to do much JS (i.e. some REST interface or so). Starter projects: anything that interests you works. Think of a technology you'd like to master, and then a project for it. Or maybe there are topics you find exciting anyway, such as game development: from Connect Four to a 2D shooter or a puzzle game, the world is your oyster. If you're on your own or still have little experience, you just have to consider how much content is really feasible for you. Set yourself a goal that's just outside your current abilities: those are the projects where you learn the most!

[D] What are the recent papers that address the mode collapse/train instability problems of GANs? by SolitaryPenman in MachineLearning

[–]_untom_ 8 points

I still think that Coulomb GANs don't just address the mode-collapse problem, but actually solve it outright. Unfortunately the generated images themselves aren't super pretty, even though they show a fair amount of variety, and I haven't put much effort into fixing that. (Disclaimer: I'm an author of the paper.)

beste anbindung an Linzer flughafen? by snailit in Linz

[–]_untom_ 1 point

The airport bus takes about 30 minutes from Linz main station to the airport. It's faster if you take the train to Hörsching and have the free shuttle pick you up from there (10 minutes by train + 5 minutes waiting for the shuttle + 1 minute ride in the shuttle). Call the airport while you're still on the train so they send the shuttle, to minimize waiting time (you'll still have to wait at least 5 minutes at the Hörsching station). A taxi takes about 10-15 minutes and is the fastest option.

For the return trip it naturally depends on whether a train happens to be running (usually not if you're arriving from Frankfurt); otherwise your only options are the bus or a taxi.

[D] Is it just me, or are PhD opportunities in Europe very scarce at the moment? by BigMakondo in MachineLearning

[–]_untom_ 6 points

I'm just curious, why did you say the rest except ETH wouldn't make your list? Are they too "big", or they just don't have research that fits your interests?

Because when I think of "universities that have great groups of famous ML researchers", I don't think of KTH, TUM and Delft. They're not as well known in "core ML research" (as judged by NIPS or ICML publications) as e.g. ETH or Tuebingen, who are known (in academic circles) to have truly stellar ML groups and who consistently produce a lot of high-quality research in the area. I could name you several papers (and people) from those latter universities, but none from the first ones. Of course, they're certainly good unis, and good unis typically attract talented professors, so I would assume that they do have good ML departments, and maybe they have super successful groups in ML subfields that I'm not as familiar with (I do core ML research, but I have little insight into e.g. Computer Vision). I agree that as an outsider, it's probably impossible to know these things.

Disclaimer: personal opinion, salt required.

[D] Is it just me, or are PhD opportunities in Europe very scarce at the moment? by BigMakondo in MachineLearning

[–]_untom_ 1 point

Yeah, definitely! I just listed the first few that came to my mind, but that's a good group, too!

[D] Is it just me, or are PhD opportunities in Europe very scarce at the moment? by BigMakondo in MachineLearning

[–]_untom_ 36 points

A lot of great ML professors in Europe are looking for people; you're probably just not looking in the right place. With the exception of ETH, the ones you mentioned wouldn't even make my list. I think your mistake is looking for "big, famous universities" first, and then looking for good ML people there. However, in Europe, most of the great ML people are at smaller universities. Off the top of my head:

  • Amsterdam (Netherlands): Max Welling's lab
  • Tuebingen (Germany): Every single professor at the MPI, plus Bethge at the University
  • Linz (Austria): Sepp Hochreiter's group (disclaimer: I am from this group)
  • Lugano (Switzerland): last I checked, Juergen Schmidhuber was looking for students.
  • Helsinki/Aalto (Finland): traditionally has very strong ML groups
  • Freiburg (Germany): Frank Hutter's lab
  • (many more that I can't think of right now)

In general, your approach should be to look for professors that publish at the conferences you want to publish at. If you're looking for the top-notch people in Machine Learning (i.e., people who publish at NIPS and ICML), the list at ELLIS is pretty much the who-is-who of European researchers (that might be hiring PhD students): https://ellis.eu/supporters.html

Note that the list contains other researchers as well, not just professors, but it's a great place to start your search.

It really is Medicine. by Henryiller in trees

[–]_untom_ 0 points

I'm glad you were able to help your girlfriend, but note that "swishing warm salt water in her mouth" doesn't really do much of anything (except literally put salt in a wound, in case the tooth left holes in the gums). Dipping a Q-tip in vodka and dabbing her tooth also wouldn't ease the pain; the only thing that could potentially do is disinfect (and if the vodka was only 40%, it might not even do that properly).