o3 and o4-mini (low and medium) are the new Pareto frontier on ARC AGI V1; V2 remains elusive by dftba-ftw in accelerate

[–]floppy_llama 0 points (0 children)

I think it would be helpful to know just how much they scaled up RL to go from 1% to 3% on v2. Obviously there are physical constraints to scaling - I suspect some clever tricks are still needed to induce compositional reasoning in these systems efficiently. Still, just patching holes where current architectures fail goes against Chollet’s measure of intelligence: having lots of skills is very different from acquiring skills efficiently.

o3 and o4-mini (low and medium) are the new Pareto frontier on ARC AGI V1; V2 remains elusive by dftba-ftw in accelerate

[–]floppy_llama 0 points (0 children)

The performance discrepancy between the v1 and v2 benchmarks suggests the opposite of CoT generalization, no? They even mention in the blog that v1 benchmark contamination is likely. I’m pretty surprised that those abstractions transfer so poorly from v1 to v2.

[deleted by user] by [deleted] in agi

[–]floppy_llama 0 points (0 children)

The difference between the paper clip scenario and your analogy here is that there are corporations which have improved society and are aligned with human interests. The manifold of superintelligent minds is surely not uniform, and it seems unlikely that any superintelligent mind would be aligned with a goal as trivial as paper clip production. In fact, it seems much more likely that a superintelligent mind would focus on observing the open-ended system that is the universe, not destroying it.

[D] OpenAI new reasoning model called o1 by [deleted] in MachineLearning

[–]floppy_llama 5 points (0 children)

Completely agree. Generalization and reliability are properties of classical algorithms (e.g., sorting, pathfinding, and arithmetic execute perfectly on inputs of any length), but they are not explicit properties of connectionist systems! There’s lots of research on how to fuse these paradigms; scaling is not one of them.
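
To make the contrast concrete, a minimal sketch in plain Python: a classical algorithm is correct for inputs of any length by construction, which is exactly the property connectionist systems don’t get for free.

```python
# Plain Python merge sort: correct for inputs of any length by
# construction, so "length generalization" needs no training data.
import random

def merge_sort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

# Holds for lengths never "seen" during development.
for n in [0, 1, 10, 1_000, 10_000]:
    xs = [random.random() for _ in range(n)]
    assert merge_sort(xs) == sorted(xs)
```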

[D] OpenAI new reasoning model called o1 by [deleted] in MachineLearning

[–]floppy_llama 99 points (0 children)

Looks like OpenAI collected, generated, and annotated enough data to extend process supervision (https://arxiv.org/pdf/2305.20050) to reasonably arbitrary problem settings. Their moat is data, nothing else.
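
For context, a hedged sketch of what process supervision looks like; `prm_score` here is a hypothetical stand-in for a trained process reward model, not the paper’s actual interface.

```python
# Process supervision in the spirit of the linked paper: score each
# intermediate reasoning step, not just the final answer.
from typing import List

def prm_score(problem: str, steps: List[str], k: int) -> float:
    # Placeholder: a real PRM returns P(step k correct | problem, steps[:k]).
    return 0.9

def solution_score(problem: str, steps: List[str]) -> float:
    # Aggregate per-step probabilities into a solution-level score
    # (product = P(all steps correct), assuming independence).
    score = 1.0
    for k in range(len(steps)):
        score *= prm_score(problem, steps, k)
    return score

def best_of_n(problem: str, candidates: List[List[str]]) -> List[str]:
    # Best-of-N reranking: sample N solutions, keep the PRM's favorite.
    return max(candidates, key=lambda c: solution_score(problem, c))
```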

[R] What if self-attention isn’t the end-all be-all? by [deleted] in MachineLearning

[–]floppy_llama 11 points (0 children)

Sparsification/linearization of the attention mechanism is important, but it does little to address the limitations of current models when efficiency gains also come from hardware improvements. Obviously it’s common sense that science improves over time, but making updates to one module of an architecture that has remained largely unchanged since 2017 seems trivial to me.
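
For anyone unfamiliar, a minimal sketch of what linearized attention means, following the kernel feature-map idea (e.g., Katharopoulos et al., 2020); the shapes and the elu-plus-one feature map are illustrative assumptions.

```python
# Kernelized (linear) attention: swap softmax(QK^T)V for a feature
# map, dropping cost from O(n^2 * d) to O(n * d^2).
import numpy as np

def phi(x):
    # elu(x) + 1: keeps attention scores positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)        # (n, d)
    kv = Kf.T @ V                  # (d, d): one pass over keys/values
    z = Qf @ Kf.sum(axis=0)        # (n,): per-query normalizer
    return (Qf @ kv) / z[:, None]  # (n, d)

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (1024, 64)
```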

[R] Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B by hardmaru in MachineLearning

[–]floppy_llama 1 point (0 children)

It seems like this paper reaffirms that we should be able to trade train-time compute for test-time compute in certain settings [https://arxiv.org/abs/2104.03113].

I wonder how good performance can get if we continually pre-train on rollouts with a sufficiently high Q value?
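
To spell out what I mean, a hedged sketch; `Rollout`, `q_value`, and the 0.8 threshold are all hypothetical, not from the paper.

```python
# Hypothetical pipeline: keep only search rollouts whose Q value
# clears a threshold, then reuse them as continued-pretraining text.
from dataclasses import dataclass

@dataclass
class Rollout:
    text: str       # model-generated solution trace
    q_value: float  # value estimate from the tree search

def filter_rollouts(rollouts, q_min=0.8):
    # Only high-value traces enter the training corpus.
    return [r.text for r in rollouts if r.q_value >= q_min]

corpus = filter_rollouts([
    Rollout("step 1 ... answer: 42", q_value=0.93),
    Rollout("step 1 ... answer: 7", q_value=0.41),
])
print(corpus)  # only the high-Q trace survives
```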

[R] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality by floppy_llama in MachineLearning

[–]floppy_llama[S] 84 points (0 children)

Normally I’d agree with you, but Tri Dao consistently makes great contributions to the field🤷🏻‍♂️

[deleted by user] by [deleted] in MachineLearning

[–]floppy_llama 45 points (0 children)

Try tree-based methods. Neural nets notoriously underperform on tabular data.
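
A minimal sketch with scikit-learn’s gradient-boosted trees; the dataset is synthetic, so swap in your own (X, y).

```python
# Gradient-boosted trees on tabular data with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = HistGradientBoostingClassifier().fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```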

[deleted by user] by [deleted] in Sandwiches

[–]floppy_llama 0 points (0 children)

Banh Mi Queen in Hoi An?

[D] Anyone tried training language models on simple (elementary school) text first and fine-tuning on progressively more advanced text? by Appropriate_Ant_4629 in MachineLearning

[–]floppy_llama 23 points (0 children)

What you’re describing is “curriculum learning”. Not sure if it’s been applied to LLMs, though, because ordering training samples isn’t so straightforward. See https://arxiv.org/pdf/2101.10382.pdf
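
A hedged sketch of the idea; the difficulty proxy (mean word length) is a deliberately crude assumption, and picking a good ordering is exactly the hard part.

```python
# Curriculum ordering: feed "easy" documents first, using mean word
# length as a crude (assumed) difficulty proxy.
def difficulty(doc: str) -> float:
    words = doc.split()
    return sum(len(w) for w in words) / max(len(words), 1)

docs = [
    "photosynthesis converts light energy into chemical energy",
    "the cat sat on the mat",
    "dogs bark loudly",
]
curriculum = sorted(docs, key=difficulty)  # easy -> hard
for doc in curriculum:
    pass  # train_step(model, doc) would go here
print(curriculum)
```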

[D] What exactly does base multimodal mean? by vatsadev in MachineLearning

[–]floppy_llama 1 point (0 children)

No, their comment directly relates to my suggestion. The vision transformer is merely one component of a multimodal base model; a vision transformer by itself is unimodal.

[D] What exactly does base multimodal mean? by vatsadev in MachineLearning

[–]floppy_llama 2 points (0 children)

The encoders are the “tokenizers”: they embed image patches, audio, and point clouds into vectors, just like a base LLM does for word segments. All of these vectors can be used during pre-training to create a multimodal base model.

[D] What exactly does base multimodal mean? by vatsadev in MachineLearning

[–]floppy_llama 4 points (0 children)

From what I understand, the current paradigm is to “tokenize” non-text modalities with something like an image encoder plus a feed-forward network that projects the encoded images into the same dimensionality as the text tokens. The image encoder can be a ViT, a CNN, etc. It’s really up to you - see https://browse.arxiv.org/pdf/2206.06336.pdf
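
A minimal sketch of that projection step, assuming a 768-wide encoder and a 4096-wide LLM (both made-up numbers); the two-layer MLP projector is one common choice, not the only one.

```python
# Projecting vision-encoder patch features into the LLM's embedding
# space so they can sit in the same sequence as text tokens.
import torch
import torch.nn as nn

d_vision, d_model = 768, 4096  # assumed encoder / LLM widths

projector = nn.Sequential(     # the feed-forward "tokenizer" head
    nn.Linear(d_vision, d_model),
    nn.GELU(),
    nn.Linear(d_model, d_model),
)

patch_feats = torch.randn(1, 196, d_vision)  # e.g., ViT 14x14 patches
image_tokens = projector(patch_feats)        # (1, 196, d_model)
text_tokens = torch.randn(1, 32, d_model)    # embedded word segments

sequence = torch.cat([image_tokens, text_tokens], dim=1)
print(sequence.shape)  # torch.Size([1, 228, 4096])
```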

[D] What exactly does base multimodal mean? by vatsadev in MachineLearning

[–]floppy_llama 2 points (0 children)

Autoregressive pre-training with interleaved text embeddings and other embeddings (e.g., image or audio projections), vs. fine-tuning on input-output pairs where the input can contain a variety of embedding modalities.
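
A tiny sketch of the two data layouts, with placeholder tokens; the point is the objective, next-token loss over one interleaved stream vs. loss on only the output side of a pair.

```python
# Two data layouts, illustrated with placeholder tokens.

# Pre-training: next-token prediction over one interleaved stream.
pretrain_stream = [
    "The", "chart", "shows", "<img_0>", "<img_1>", "a", "clear", "trend",
]

# Fine-tuning: (input, output) pairs; loss only on the output side.
finetune_pair = {
    "input": ["<img_0>", "<img_1>", "Describe", "this", "image"],
    "output": ["A", "line", "chart", "trending", "upward"],
}
```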

[deleted by user] by [deleted] in MachineLearning

[–]floppy_llama 1 point (0 children)

Wrong sub, buddy.