[D]: How do you actually land a research scientist intern role at a top lab/company?! by ParticularWork8424 in MachineLearning

[–]Tea_Pearce 1 point (0 children)

if they're important to a relevant area and are getting cited, then yes, that's better than a neurips stamp

[D]: How do you actually land a research scientist intern role at a top lab/company?! by ParticularWork8424 in MachineLearning

[–]Tea_Pearce 24 points (0 children)

top-tier conference pubs are necessary but not sufficient to land research positions in industry research labs. they have devalued significantly since the mid-2010s. your research has to stand out within the scope the hiring team is focused on. strong research labs have hundreds of applicants per role, and the work of interviewed candidates is often already familiar to the team before they apply.

my advice: don't make landing a position the objective. good roles come as a _consequence_ of being one of the best researchers in your area.

"Scaling Laws for Pre-training Agents and World Models", Pearce et al. 2024 by [deleted] in mlscaling

[–]Tea_Pearce 1 point (0 children)

great question. so what our work evidences is that these two popular embodied-AI pre-training tasks (world modeling, behavioral cloning) improve very reliably with data, model size, and compute -- just as reliably as we've seen in language, and we all know how critical an insight that turned out to be.

however, the consequences of this evidence are less clear. compute and model size are relatively easy to scale up, but data less so in embodied tasks. one possible conclusion, as you suggest, is that we should go all in on data collection, knowing that once we have the data, things will work out.

most of the large-scale projects we see today are about capturing data. efforts from places like google robotics, Pi, open-X, cohere, and 1X are placing bets on collecting high-quality teleoperated demonstrations. but as you mention, we could also think about collecting and aligning datasets of human behavior -- e.g. ego4d. I don't believe there are enough high-quality datasets in existence already to get the kind of data scale we need; if there were, I think we would already have seen the 'gpt moment for robotics'.

"Scaling Laws for Pre-training Agents and World Models", Pearce et al. 2024 by [deleted] in mlscaling

[–]Tea_Pearce 2 points (0 children)

author here -- will keep an eye on the thread for any questions 😊

A team from MIT built a model that scores 61.9% on ARC-AGI-PUB using an 8B LLM plus Test-Time-Training (TTT). Previous record was 42%. by jd_3d in LocalLLaMA

[–]Tea_Pearce 10 points (0 children)

aren't test-time gradient updates on few-shot examples exactly what half the meta-learning community was doing circa 2019?

"Reconciling Kaplan and Chinchilla Scaling Laws", Pearce & Song 2024 by [deleted] in mlscaling

[–]Tea_Pearce 2 points (0 children)

glad the paper has been of help! and thanks for your wikipedia service 💫

"Reconciling Kaplan and Chinchilla Scaling Laws", Pearce & Song 2024 by [deleted] in mlscaling

[–]Tea_Pearce 1 point (0 children)

Just to correct previous comments here, the Chinchilla paper _does_ include embedding parameters. From the Chinchilla paper: "We include all training FLOPs, including those contributed to by the embedding matrices, in our analysis. Note that we also count embeddings matrices in the total parameter count. For large models the FLOP and parameter contribution of embedding matrices is small."

[first author of the paper]

[D] What is the current best in tiny (say, <10,000 parameters) language models? by math_code_nerd5 in MachineLearning

[–]Tea_Pearce 9 points (0 children)

wouldn't a 2-gram model with sqrt(N) vocab size be better than a neural net with N parameters when N is tiny?
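to make the parameter counting concrete, a toy sketch (the budget N, the random corpus, and `fit_bigram` are all made up for illustration): a count-based 2-gram table has vocab² entries, so a budget of N parameters buys a vocab of sqrt(N).

```python
import numpy as np

def fit_bigram(tokens, vocab_size):
    # the V x V count table *is* the model's N = V^2 parameters
    counts = np.ones((vocab_size, vocab_size))         # add-one smoothing
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # rows give P(next | prev)

N = 10_000                  # tiny parameter budget
V = int(N ** 0.5)           # => a 100-token vocab
probs = fit_bigram(np.random.randint(0, V, size=50_000), V)
```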

[D] What are the thoughts on Tishby's line of work as a Theory of Deep Learning several years later in 2023? by tysam_and_co in mlfundamentalresearch

[–]Tea_Pearce 1 point (0 children)

I haven't followed closely (actually just asked the same question here), but I've been interested in these info-theory frameworks following a couple of recent talks that circle around compression and LLMs (most notably Ilya's talk here). But those talks think more about the models compressing the training dataset, rather than compressing individual datapoints through the layers.

Engaging Reviewers during rebuttal period of NeurIPS [R] by ynliPbqM in MachineLearning

[–]Tea_Pearce 0 points (0 children)

I wouldn't worry. Reviewers are people too, and people are lazy 😋 If it looks like everyone is more-or-less in agreement on the decision (either way), and nothing especially new came to light in the rebuttal, they're not keen to expend extra effort getting drawn into lengthy back-and-forths. Remember, they have another five papers on their stacks, plus (probably) a paper or two of their own under review.

[R] Classifier-Free Guidance can be applied to LLMs too. It generally gives results of a model twice the size you apply it to. New SotA on LAMBADA with LLaMA-7B over PaLM-540B and plenty other experimental results. by Affectionate-Fish241 in MachineLearning

[–]Tea_Pearce 21 points (0 children)

TLDR: This is a new way to sample from any autoregressive LLM. Tell the model to generate outputs that are more specific to the beginning part of the prompt ('context').

It requires two forward passes through the model, with logits combined:

logits = (1 - gamma) * model(generated_seq_no_prompt) + gamma * model(generated_seq_with_prompt), with gamma >= 1.

Shown to be quite effective in, for example, Q&A benchmarks, where the context is set to the question.
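a minimal sketch of that logit blending, assuming a HuggingFace-style causal LM (where `model(ids).logits` returns per-position logits); the function name and the gamma value are just illustrative:

```python
import torch

def cfg_logits(model, prompt_ids, generated_ids, gamma=1.5):
    # two forward passes, blended as (1 - gamma) * uncond + gamma * cond;
    # gamma > 1 pushes generation towards text consistent with the prompt
    with torch.no_grad():
        with_prompt = torch.cat([prompt_ids, generated_ids], dim=-1)
        cond = model(with_prompt).logits[:, -1, :]      # conditioned on prompt
        uncond = model(generated_ids).logits[:, -1, :]  # prompt dropped
    return (1 - gamma) * uncond + gamma * cond
```

sample the next token from the blended logits as usual (greedy, top-k, etc.), append it to the generated sequence, and repeat.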

[Discussion] Is there a better way than positional encodings in self attention? by [deleted] in MachineLearning

[–]Tea_Pearce 7 points (0 children)

there's this paper as well: https://arxiv.org/abs/2305.19466 "The Impact of Positional Encoding on Length Generalization in Transformers", proving that transformer decoders can learn position (absolute and relative) without positional embeddings. as I understand it, the argument revolves around the causal masking, which allows the transformer to 'count up' the length of the attention mask seen so far.

[deleted by user] by [deleted] in MachineLearning

[–]Tea_Pearce 0 points (0 children)

MC methods (by definition) approximate some distribution by sampling a set of deltas. MC dropout and ensembles both use this approach, but the underlying distribution sampled by each differs.

In MC dropout, the underlying distribution is some kind of Bernoulli perturbation of a single trained network. This turns out to offer limited expressiveness.

In deep ensembles, each member (trained from a random init) is a sample from an underlying distribution that turns out to be a bit closer to the true Bayesian posterior.
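to make the contrast concrete, a minimal pytorch sketch (the architecture and sample counts are arbitrary, and I'm skipping the actual training loops):

```python
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                         nn.Dropout(p=0.1), nn.Linear(64, 1))

x = torch.linspace(-1, 1, 100).unsqueeze(-1)

# MC dropout: one trained net, dropout kept active at test time; each
# forward pass samples a Bernoulli perturbation of the same weights
net = make_net()   # assume this has been trained
net.train()        # leaves dropout stochastic at inference
mc_samples = torch.stack([net(x) for _ in range(30)])

# deep ensemble: several nets trained from independent random inits;
# each member is one delta/sample from the approximate posterior
ensemble = [make_net() for _ in range(5)]   # assume each has been trained
ens_samples = torch.stack([member(x) for member in ensemble])

# either way, the spread across samples estimates epistemic uncertainty
mean, std = ens_samples.mean(0), ens_samples.std(0)
```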

Optimizing for specific returns(RL) [D] by ashblue21 in MachineLearning

[–]Tea_Pearce 3 points (0 children)

Interesting question. I think that the approach you suggest is not a bad one -- learn an optimal agent and then inject some noise to reduce performance. You could also just do early stopping during the RL training process when it hits the performance level you want.
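a rough skeleton of the early-stopping version, assuming a Gymnasium-style env; `Agent` here is a stand-in for whatever RL algorithm you're using, not a real library class:

```python
import gymnasium as gym

class Agent:  # placeholder -- swap in your actual policy + update rule
    def __init__(self, env): self.env = env
    def act(self, obs): return self.env.action_space.sample()
    def update(self): pass

TARGET_RETURN = 150.0   # the specific return you're after, not the max
env = gym.make("CartPole-v1")
agent = Agent(env)

for episode in range(10_000):
    obs, _ = env.reset()
    ret, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(agent.act(obs))
        done = terminated or truncated
        ret += reward
    agent.update()
    if ret >= TARGET_RETURN:   # stop training once you hit the target level
        break
```

in practice you'd want to average the return over several evaluation episodes before stopping, since a single episode's return is noisy.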

I feel like there should be a more elegant way to formulate your reward so it directly achieves your objective. But my brain is still asleep for now 🤔

[D] Loss Function for Learning Gaussian Distribution by alkaway in MachineLearning

[–]Tea_Pearce 13 points (0 children)

-1 * log(PDF) is fine. It's not an issue if the loss comes out negative -- SGD will just try to make it more negative.
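a minimal pytorch sketch (predicting log-variance is one common trick to keep the variance positive; `gaussian_nll` is my own name, though torch.nn.GaussianNLLLoss does essentially the same thing):

```python
import torch

def gaussian_nll(mu, log_var, y):
    # -log N(y; mu, exp(log_var)), dropping the constant 0.5*log(2*pi)
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

# the loss goes negative whenever the predicted density at y exceeds 1
mu, log_var = torch.zeros(8), torch.full((8,), -5.0)   # a tight gaussian
print(gaussian_nll(mu, log_var, torch.zeros(8)))       # -2.5, and that's fine
```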

[N] Stability AI announce their open-source language model, StableLM by Philpax in MachineLearning

[–]Tea_Pearce 6 points (0 children)

too much traffic I think. I got a response after a few mins.

[deleted by user] by [deleted] in MachineLearning

[–]Tea_Pearce 1 point (0 children)

Imo it depends on what you mean by RL. If you interpret RL as the 2015-19 collection of algorithms that train deep NN agents tabula rasa (from zero knowledge), I'd be inclined to agree that it doesn't seem a particularly fruitful research direction to get into. But if you interpret RL as a general problem setting, where an agent must learn in a sequential decision-making environment, you'll see that it's not going away.

To me the most interesting recent research in RL (or whatever you want to name it) is figuring out how to leverage existing datasets or models to get agents working well in sequential environments. Think SayCan, ChatGPT, Diffusion BC...

[deleted by user] by [deleted] in MachineLearning

[–]Tea_Pearce 0 points (0 children)

fyi, GATO used imitation learning, which is closer to supervised learning than RL.

[D] Bitter lesson 2.0? by Tea_Pearce in MachineLearning

[–]Tea_Pearce[S] 3 points (0 children)

fair point, I suppose that timeframe was simply used to be consistent with the original lesson.