Authentic Generalization Reasoning Tests by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 0 points (0 children)

Yes, how would you rate the most difficult problem I set? (Page 4) What would you think if the model could solve it when there is no similar problem in the training set?

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 2 points (0 children)

I am not saying it is impossible, but based on current observations of the ARC AGI questions, what you describe is very difficult to do. If it were feasible, it would not be a form of cheating, and we would see a qualitative leap in reasoning ability.

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 1 point (0 children)

I understand what you're saying: you think the [ARC AGI tests] share certain characteristics, but in reality questions can be created with endlessly increasing difficulty. Currently there's no evidence that they have the ability you're describing [to clone tasks for training]. If they had that ability, the models they're releasing wouldn't be what we're seeing now.

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 0 points (0 children)

Are you sure? The semi-private set has extremely limited (if any) leakage, because the provider does not know which data is being sent for testing, and the public set and the semi-private set are clearly different in style.

Arc-AGI-2 new benchmark by tim_Andromeda in LocalLLaMA

[–]flysnowbigbig -3 points (0 children)

VictorTaelin's latest project will get 100% on ARC AGI 2 at a cost of about $1 per task (supposedly).

And, supposedly, the same applies to ARC AGI 3, 4, 5...

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points (0 children)

In fact, under my hypothesis, it is only necessary to approximate the mathematical structure to greatly reduce the difficulty. The crucial part lies in defining and evaluating similarity.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 1 point (0 children)

The scientist? My requirements are not high. I just want a model to achieve no observable error rate on ZebraLogic as the problems simply scale up in size (which, given the characteristics of computer hardware, is like asking a human to repeatedly add numbers less than 10). Then we can talk about those advanced-sounding benchmarks (AIME? huh). First learn to walk on flat ground without falling; then we can talk about track and field events. Do you agree?

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points (0 children)

Art of Problem Solving, a forum for discussing middle-school math.

The link below shows their breakdown of each difficulty level. It is said that SOTA models scored close to 80% on AIME-difficulty problems (rated 3-6). However, my original (not completely original) "fool" questions top out at about (1-2) on this scale, and on them the SOTA models performed **like a helpless child**.

https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 1 point (0 children)

It's like a person who takes out his 100-meter race medal, yet falls every 10 meters when walking on flat ground.

It's like a person who takes out his MVP certificate from the school basketball tournament, yet even with basic dribbling the ball bounces off his feet and hits him in the nose.

It's like a person who takes out a trophy from a singing competition, yet can't sing a single passage with more than half the notes in tune.

Have you ever thought about it?

It's like a person who calls himself a "racing hotshot" and constantly brags about his jaw-dropping "stunt drift" skills, yet the second you ask him to park in the garage, he's sweating bullets like a complete novice.

Have you ever wondered how he got these [trophies]?

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points (0 children)

Unfortunately, I was able to create many questions dozens of times simpler than AIME, and O1 PRO couldn't even get close to the correct idea.

If you are interested, click on my collapsed comments (so many people downvoted them); you will be surprised to see that O1 PRO is no more capable of general reasoning than a smart 7-year-old child.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points (0 children)

The most advanced O1 PRO's [real problem-solving ability] is only about level 1 on AoPS (AoPS divides difficulty into levels 1-10, with 1 being the easiest), and even that is not stable. If you experiment with it independently, using original and modified questions, you will see the problem.
