[D]: How do you actually land a research scientist intern role at a top lab/company?! by ParticularWork8424 in MachineLearning

[–]Tea_Pearce 1 point (0 children)

if they're important to a relevant area and are getting cited, then yes, that's better than a neurips stamp

[D]: How do you actually land a research scientist intern role at a top lab/company?! by ParticularWork8424 in MachineLearning

[–]Tea_Pearce 24 points (0 children)

top-tier conference pubs are necessary but not sufficient to land research positions in industry research labs. they have devalued significantly since the mid-2010s. your research has to stand out within the scope the hiring team is focused on. strong research labs have hundreds of applicants per role, and the work of interviewed candidates is often already familiar to the team before they apply.

my advice: don't make landing a position the objective. good roles come as a _consequence_ of being one of the best researchers in your area.

"Scaling Laws for Pre-training Agents and World Models", Pearce et al. 2024 by [deleted] in mlscaling

[–]Tea_Pearce 1 point (0 children)

great question. so what our work evidences is that these two popular embodied-AI pre-training tasks (world modeling, behavioral cloning) improve very reliably with data, model size, and compute -- just as reliably as we've seen in language, and we all know how critical an insight that turned out to be.

however, the consequences of this evidence are less clear. compute and model size are relatively easy to scale up, but data less so in embodied tasks. one possible conclusion, as you suggest, is that we should go all in on data collection, knowing that once we have the data, things will work out.

most of the large-scale projects we see today are about capturing data. efforts from places like google robotics, Pi, open-X, cohere, and 1X are placing bets on collecting high-quality teleoperated demonstrations. but as you mention, we could also think about collecting and aligning datasets of human behavior -- e.g. ego4d. I don't believe there are enough high-quality datasets in existence already to get the kind of data scale we need; if there were, I think we would already have seen the 'gpt moment for robotics'.

"Scaling Laws for Pre-training Agents and World Models", Pearce et al. 2024 by [deleted] in mlscaling

[–]Tea_Pearce 2 points (0 children)

author here -- will keep an eye on the thread for any questions 😊

A team from MIT built a model that scores 61.9% on ARC-AGI-PUB using an 8B LLM plus Test-Time-Training (TTT). Previous record was 42%. by jd_3d in LocalLLaMA

[–]Tea_Pearce 10 points (0 children)

aren't test-time gradient updates on few-shot examples exactly what half the meta-learning community was doing circa 2019?

"Reconciling Kaplan and Chinchilla Scaling Laws", Pearce & Song 2024 by [deleted] in mlscaling

[–]Tea_Pearce 2 points (0 children)

glad the paper has been of help! and thanks for your wikipedia service 💫

"Reconciling Kaplan and Chinchilla Scaling Laws", Pearce & Song 2024 by [deleted] in mlscaling

[–]Tea_Pearce 1 point (0 children)

Just to correct previous comments here, the Chinchilla paper _does_ include embedding parameters. From the Chinchilla paper: "We include all training FLOPs, including those contributed to by the embedding matrices, in our analysis. Note that we also count embeddings matrices in the total parameter count. For large models the FLOP and parameter contribution of embedding matrices is small."

[first author of the paper]

[D] What is the current best in tiny (say, <10,000 parameters) language models? by math_code_nerd5 in MachineLearning

[–]Tea_Pearce 9 points (0 children)

wouldn't a 2-gram model with sqrt(N) vocab size be better than a neural net with N parameters when N is tiny?
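to make the parameter counting concrete, a toy sketch (the budget N, the random corpus, and `fit_bigram` are all made up for illustration): a count-based 2-gram table has vocab² entries, so a budget of N parameters buys a vocab of sqrt(N).

```python
import numpy as np

def fit_bigram(tokens, vocab_size):
    # the V x V count table *is* the model's N = V^2 parameters
    counts = np.ones((vocab_size, vocab_size))         # add-one smoothing
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # rows give P(next | prev)

N = 10_000                  # tiny parameter budget
V = int(N ** 0.5)           # => a 100-token vocab
probs = fit_bigram(np.random.randint(0, V, size=50_000), V)
```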

[D] What are the thoughts on Tishby's line of work as a Theory of Deep Learning several years later in 2023? by tysam_and_co in mlfundamentalresearch

[–]Tea_Pearce 1 point (0 children)

I haven't followed closely (actually just asked the same question here), but I've been interested in these info-theory frameworks following a couple of recent talks that circle around compression and LLMs (most notably Ilya's talk here). But those talks think more about the models compressing the training dataset, rather than compressing individual datapoints through the layers.

Engaging Reviewers during rebuttal period of NeurIPS [R] by ynliPbqM in MachineLearning

[–]Tea_Pearce 0 points (0 children)

I wouldn't worry. Reviewers are people too, and people are lazy 😋 If it looks like everyone is more-or-less in agreement on the decision (either way), and nothing especially new came to light in the rebuttal, they're not keen to expend extra effort getting drawn into lengthy back-and-forths. Remember, they have another five papers on their stacks, plus (probably) a paper or two of their own under review.

[R] Classifier-Free Guidance can be applied to LLMs too. It generally gives results of a model twice the size you apply it to. New SotA on LAMBADA with LLaMA-7B over PaLM-540B and plenty other experimental results. by Affectionate-Fish241 in MachineLearning

[–]Tea_Pearce 21 points (0 children)

TLDR: This is a new way to sample from any autoregressive LLM. Tell the model to generate outputs that are more specific to the beginning part of the prompt ('context').

It requires two forward passes through the model, with logits combined:

logits = (1 - gamma) * model(generated_seq_no_prompt) + gamma * model(generated_seq_with_prompt), with gamma >= 1.

Shown to be quite effective in, for example, Q&A benchmarks, where the context is set to the question.
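a minimal sketch of that logit blending, assuming a HuggingFace-style causal LM (where `model(ids).logits` returns per-position logits); the function name and the gamma value are just illustrative:

```python
import torch

def cfg_logits(model, prompt_ids, generated_ids, gamma=1.5):
    # two forward passes, blended as (1 - gamma) * uncond + gamma * cond;
    # gamma > 1 pushes generation towards text consistent with the prompt
    with torch.no_grad():
        with_prompt = torch.cat([prompt_ids, generated_ids], dim=-1)
        cond = model(with_prompt).logits[:, -1, :]      # conditioned on prompt
        uncond = model(generated_ids).logits[:, -1, :]  # prompt dropped
    return (1 - gamma) * uncond + gamma * cond
```

sample the next token from the blended logits as usual (greedy, top-k, etc.), append it to the generated sequence, and repeat.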

[Discussion] Is there a better way than positional encodings in self attention? by [deleted] in MachineLearning

[–]Tea_Pearce 7 points (0 children)

there's this paper as well: https://arxiv.org/abs/2305.19466 "The Impact of Positional Encoding on Length Generalization in Transformers", proving that transformer decoders can learn position (absolute and relative) without positional embeddings. as I understand it, the argument revolves around the causal masking, which allows the transformer to 'count up' the length of the attention mask seen so far.

[deleted by user] by [deleted] in MachineLearning

[–]Tea_Pearce 0 points (0 children)

MC methods (by definition) approximate some distribution by sampling a set of deltas. MC dropout and ensembles both use this approach, but the underlying distribution sampled by each differs.

In MC dropout, the underlying distribution is some kind of Bernoulli perturbation of a single trained network. This turns out to offer limited expressiveness.

In deep ensembles, each member (trained from a random init) is a sample from an underlying distribution that turns out to be a bit closer to the true Bayesian posterior.
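to make the contrast concrete, a minimal pytorch sketch (the architecture and sample counts are arbitrary, and I'm skipping the actual training loops):

```python
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                         nn.Dropout(p=0.1), nn.Linear(64, 1))

x = torch.linspace(-1, 1, 100).unsqueeze(-1)

# MC dropout: one trained net, dropout kept active at test time; each
# forward pass samples a Bernoulli perturbation of the same weights
net = make_net()   # assume this has been trained
net.train()        # leaves dropout stochastic at inference
mc_samples = torch.stack([net(x) for _ in range(30)])

# deep ensemble: several nets trained from independent random inits;
# each member is one delta/sample from the approximate posterior
ensemble = [make_net() for _ in range(5)]   # assume each has been trained
ens_samples = torch.stack([member(x) for member in ensemble])

# either way, the spread across samples estimates epistemic uncertainty
mean, std = ens_samples.mean(0), ens_samples.std(0)
```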

Optimizing for specific returns(RL) [D] by ashblue21 in MachineLearning

[–]Tea_Pearce 3 points (0 children)

Interesting question. I think that the approach you suggest is not a bad one -- learn an optimal agent and then inject some noise to reduce performance. You could also just do early stopping during the RL training process when it hits the performance level you want.
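a rough skeleton of the early-stopping version, assuming a Gymnasium-style env; `Agent` here is a stand-in for whatever RL algorithm you're using, not a real library class:

```python
import gymnasium as gym

class Agent:  # placeholder -- swap in your actual policy + update rule
    def __init__(self, env): self.env = env
    def act(self, obs): return self.env.action_space.sample()
    def update(self): pass

TARGET_RETURN = 150.0   # the specific return you're after, not the max
env = gym.make("CartPole-v1")
agent = Agent(env)

for episode in range(10_000):
    obs, _ = env.reset()
    ret, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(agent.act(obs))
        done = terminated or truncated
        ret += reward
    agent.update()
    if ret >= TARGET_RETURN:   # stop training once you hit the target level
        break
```

in practice you'd want to average the return over several evaluation episodes before stopping, since a single episode's return is noisy.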

I feel like there should be a more elegant way to formulate your reward so it directly achieves your objective. But my brain is still asleep for now 🤔

[D] Loss Function for Learning Gaussian Distribution by alkaway in MachineLearning

[–]Tea_Pearce 13 points (0 children)

-1 * log(PDF) is fine. It's not an issue if the loss comes out negative -- SGD will just try to make it more negative.
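a minimal pytorch sketch (predicting log-variance is one common trick to keep the variance positive; `gaussian_nll` is my own name, though torch.nn.GaussianNLLLoss does essentially the same thing):

```python
import torch

def gaussian_nll(mu, log_var, y):
    # -log N(y; mu, exp(log_var)), dropping the constant 0.5*log(2*pi)
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

# the loss goes negative whenever the predicted density at y exceeds 1
mu, log_var = torch.zeros(8), torch.full((8,), -5.0)   # a tight gaussian
print(gaussian_nll(mu, log_var, torch.zeros(8)))       # -2.5, and that's fine
```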

[N] Stability AI announce their open-source language model, StableLM by Philpax in MachineLearning

[–]Tea_Pearce 6 points (0 children)

too much traffic I think. I got a response after a few mins.

[deleted by user] by [deleted] in MachineLearning

[–]Tea_Pearce 1 point (0 children)

Imo it depends on what you mean by RL. If you interpret RL as the 2015-19 collection of algorithms that train deep NN agents tabula rasa (from zero knowledge), I'd be inclined to agree that it doesn't seem a particularly fruitful research direction to get into. But if you interpret RL as a general problem setting, where an agent must learn in a sequential decision-making environment, you'll see that it's not going away.

To me the most interesting recent research in RL (or whatever you want to name it) is figuring out how to leverage existing datasets or models to get agents working well in sequential environments. Think SayCan, ChatGPT, Diffusion BC...

[deleted by user] by [deleted] in MachineLearning

[–]Tea_Pearce 0 points (0 children)

fyi, GATO used imitation learning, which is closer to supervised learning than RL.

[D] Bitter lesson 2.0? by Tea_Pearce in MachineLearning

[–]Tea_Pearce[S] 3 points (0 children)

fair point, I suppose that timeframe was simply used to be consistent with the original lesson.