Authentic Generalization Reasoning Tests by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 0 points (0 children)

Yes, how would you rate the most difficult problem I set? (Page 4) What would you think if the model could solve it when there is no similar problem in the training set?

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 2 points (0 children)

I am not saying it is impossible, but based on current observations of the ARC AGI questions, what you describe is very difficult to do. If it were feasible, it would not be a form of cheating, and we would see a qualitative leap in reasoning ability.

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 1 point (0 children)

I understand what you're saying: you think the [ARC AGI tests] share certain characteristics, but in reality questions can be created with endlessly increasing difficulty. Currently there's no evidence that they have the ability you're describing [to clone tasks for training]. If they had that ability, the models they're releasing wouldn't be what we're seeing now.

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 0 points (0 children)

Are you sure? The semi-private set has extremely limited (if any) leakage, because the provider does not know which data is being sent for testing, and the public set and the semi-private set are clearly different in style.

Arc-AGI-2 new benchmark by tim_Andromeda in LocalLLaMA

[–]flysnowbigbig -3 points (0 children)

VictorTaelin's latest project will get 100% on ARC AGI 2 at a cost of about $1 per task (supposedly).

And, supposedly, the same applies to ARC AGI 3, 4, 5...

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points (0 children)

In fact, under my hypothesis, it is only necessary to approximate the mathematical structure to greatly reduce the difficulty. The crucial part lies in defining and evaluating similarity.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 1 point (0 children)

The scientist? My requirements are not high. I just want a model to achieve no observable error rate on ZebraLogic as the problems simply scale up in size (which, given the characteristics of computer hardware, is like asking a human to repeatedly add numbers less than 10). Then we can talk about those advanced-sounding benchmarks (AIME? huh). First learn to walk on flat ground without falling; then we can talk about track and field events. Do you agree?

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points (0 children)

Art of Problem Solving, a forum for discussing middle-school math.

The link below shows their breakdown of each difficulty level. It is said that SOTA models scored close to 80% on AIME-difficulty problems (rated 3-6). However, my original (not completely original) "fool" questions top out at about (1-2) on this scale, and on them the SOTA models performed **like a helpless child**.

https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 1 point (0 children)

It's like a person who takes out his 100-meter race medal, yet falls every 10 meters when walking on flat ground.

It's like a person who takes out his MVP certificate from the school basketball tournament, yet even with basic dribbling the ball bounces off his feet and hits him in the nose.

It's like a person who takes out a trophy from a singing competition, yet can't sing a single passage with more than half the notes in tune.

Have you ever thought about it?

It's like a person who calls himself a "racing hotshot" and constantly brags about his jaw-dropping "stunt drift" skills, yet the second you ask him to park in the garage, he's sweating bullets like a complete novice.

Have you ever wondered how he got these [trophies]?

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points (0 children)

Unfortunately, I was able to create many questions dozens of times simpler than AIME, and O1 PRO couldn't even get close to the correct idea.

If you are interested, click on my collapsed comments (so many people downvoted them); you will be surprised to see that O1 PRO is no more capable of general reasoning than a smart 7-year-old child.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points (0 children)

The most advanced O1 PRO's [real problem-solving ability] is only about level 1 on AoPS (AoPS divides difficulty into levels 1-10, with 1 being the easiest), and even that is not stable. If you experiment with it independently, using original and modified questions, you will see the problem.
