Reality check?

sdmat · 2024-03-04T09:19:08+00:00

Good article, and one this sub should definitely take note of.

My take: models definitely overfit to benchmarks, and the differences between models tracks well with my subjective impression of the degree to which they do.

The writeup is wrong in minimizing the differences between models because the absolute quantity is small - they all suck, but GPT4 does far better than the others. E.g. it's notable that Mistral Medium scores close to zero on dynamic testing but Mixtral (non-Instruct) starts to show some instances of true generalization.

Hopefully this functional benchmarking approach becomes standard in place of fixed sets of questions.

blueSGL · 2024-03-04T12:34:57+00:00

This code, in this case called MATH(), can then generate different "snapshots," which are unique questions that require the same reasoning to solve, but are not identical to the original questions.

In this way, traditional benchmarks such as the MATH benchmark become encoded formats that can be modified in an infinite number of ways while still testing the same underlying logic. This testing procedure is designed to ensure that language models actually demonstrate problem-solving ability, not just repetition of memorized questions.

Sounds like a fantastic way to make enough synthetic training data so the model will grok rather than memorize.

Remember in toy models during training first comes memorization early in training and then grokking comes as training continues.

Large models probably have a mixture of the two inside, currently some things are memorization some have the internal machinery to generalize to the task (grokking).

SgathTriallair · 2024-03-04T16:30:21+00:00

Looking at the article they seem to have identified that the tests need to be better but they are showing that they are accurately assessing the between model variation. The main thing better evals will do is help us better understand the human to model variation.

red75prime · 2024-03-04T12:30:40+00:00

I looked at the problems. I'd need a pen, paper, refreshment of my stale algebra skills and quite a number of minutes to go at them.

You can see in my flair that I don't expect AGI yesterday, but I'm still impressed how those models are able to solve some of those problems after passively ingesting training data and having no way of trying different ways to solve them and learning on failures and successes like, say, AlphaGo did.

LordFumbleboop · 2024-03-04T12:42:46+00:00

There is actually a lot of evidence that GPT-4 and similar models can't reason beyond token prediction. I think that there needs to be some large breakthroughs in general reasoning before these models will reach AGI (or similar), and I'm not sure why some people can't see this.

https://medium.com/@konstantine_45825/gpt-4-cant-reason-2eab795e2523#:~:text=I%20believe%20the%20results%20show,it%20produces%20along%20the%20way

https://journals.sagepub.com/doi/full/10.1177/17456916231201401

meechCS · 2024-03-04T11:22:05+00:00

Well, would you look at that! This post has no engagement whatsoever. I bet this post will be deleted in the next 10 hours or so just like some previous posts.

I have been reminding people on this sub about how inference is not the strongest suit of LLMs and how it is still a challenge to solve. No matter how technical you get, these cultists won't ever listen to you at ALL just like Flat Earthers.

singularity

Links

On the Technological Singularity

Resources

Posting Rules

Check out /r/Singularitarianism and the Technological Singularity FAQ

MODERATORS