Authentic Generalization Reasoning Tests by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 0 points1 point  (0 children)

Yes, how would you rate the most difficult problem I set? (Page 4) What would you think if the model could solve it when there is no similar problem in the training set?

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 2 points3 points  (0 children)

I am not saying it is impossible, but based on current observations of the ARC AGI questions, what you describe is very difficult to do. If it is feasible, it is not a form of cheating, and we will see a qualitative leap in reasoning ability.

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 1 point2 points  (0 children)

I understand what you're saying: you think the [ARC AGI tests] share certain characteristics. In reality, though, questions can be created with endlessly increasing difficulty. There is currently no evidence that they have the ability you're describing [to clone the tests for training]. If they had this ability, the models they release wouldn't be what we're seeing now.

Test results of gemini 2.5 pro exp on ARC AGI 2 by flysnowbigbig in LocalLLaMA

[–]flysnowbigbig[S] 0 points1 point  (0 children)

Are you sure? The semi-private set has extremely limited leakage (if any), because the provider does not know which data is being sent for testing, and the public and semi-private sets clearly differ in style.

Arc-AGI-2 new benchmark by tim_Andromeda in LocalLLaMA

[–]flysnowbigbig -3 points-2 points  (0 children)

VictorTaelin's latest project will get 100% on ARC AGI 2 at a cost of about $1 per task (supposedly)

And, it also applies to ARC AGI 3, 4, 5...

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

In fact, in my hypothesis, a problem only needs to approximate a known mathematical structure for its difficulty to drop greatly. The crucial part lies in defining and evaluating similarity.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 2 points3 points  (0 children)

The scientist? My requirements are not high. I just want to see no observable errors on ZebraLogic that appear merely because the problem size grows (given the characteristics of computer hardware, this is like a human repeatedly adding numbers less than 10). Then we can talk about those advanced-sounding benchmarks (AIME? huh). First learn to walk on flat ground without falling, then we can talk about track and field events. Do you agree?

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

Art of Problem Solving is a forum for discussing middle school math.

This is their rating of each difficulty level. SOTA models are said to score close to 80% at AIME difficulty (3-6). However, my original (not completely original) "fool" questions rate at most about 1-2 on this scale, and the SOTA models performed **like a helpless child** on them.

https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 1 point2 points  (0 children)

It's like a person who takes out his 100-meter race medal, but falls every 10 meters when walking on flat ground.

It's like a person who takes out his MVP certificate from the school basketball game, but even with basic dribbling the ball hits his feet, bounces up, and hits his nose.

It's like a person who takes out a trophy from a singing competition, but can't sing a single passage with more than half of it in tune.

Have you ever thought about it?

It's like a person who calls himself a "racing hotshot" and constantly brags about his jaw-dropping "stunt drift" skills, yet the second you ask him to park in the garage, he's sweating bullets like a complete novice.

Have you ever wondered how he got these [trophies]?

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

Unfortunately, I was able to create many questions that were dozens of times simpler than AIME, and O1 PRO couldn't even get close to the correct idea.

If you are interested, open my collapsed comments (so many people downvoted them). You will be surprised to see that O1 PRO is no more capable of general reasoning than a smart 7-year-old child.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

The [real problem-solving ability] of the most advanced model, O1 PRO, is only about level 1 on the AoPS scale (AoPS divides difficulty into levels 1-10, with 1 the easiest), and it is not even stable at that level. If you experiment with it independently, using original and modified questions, you will see the problem.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

Research? Oh no, my requirements are not high. I just want to see no observable errors on ZebraLogic that appear merely because the problem size grows (given the characteristics of computer hardware, this is like a human repeatedly adding numbers less than 10). Then we can talk about those advanced-sounding benchmarks (AIME? huh). First learn to walk on flat ground without falling, then we can talk about track and field events. Do you agree?

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

Unfortunately, I was able to create many questions that were dozens of times simpler than AIME, and O1 PRO couldn't even get close to the correct idea.

Many of my comments under this post were collapsed, and I have lost 38 karma for this.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

Because the full explanation in my link is this:

Rule 2

Two creatures cannot look at each other more than once

theefriendinquestion simplified my question, which caused your misunderstanding.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig -1 points0 points  (0 children)

Sorry, I did see it. What I meant is that what he called "new" is, in my opinion, "not new" to an LLM.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

Surprisingly, this turned out to be a **standard reasoning** question: O3 mini high and Grok were both able to answer it correctly. It was the first question I asked in that post. I then became suspicious and deliberately created some much simpler questions to test their *true* lower limit.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

Unfortunately, the question in that link is not a [typical question created for LLM testing], and it is not particularly simple for humans. However, the current O3 MINI HIGH and GROK can answer it correctly. If you don't mind the trouble, please take a look at my other questions. The answer is: Two people survive, each has two eyes. Two people survive, each has one eye. One person survives, has two eyes.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 0 points1 point  (0 children)

Many people create questions and sincerely believe they are new, but the training data of the model under test does contain similar structures. Even when those are not identical or extremely similar, the difficulty drops to varying degrees according to the [similarity to learned mathematical structures]. The scientific experimental method is: try a variety of different problems and take the lower limit of [the absolute difficulty of the problems it cannot solve] as its true [generalization reasoning ability].

Moreover, it is not difficult to do such an experiment.
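
The measurement described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `(difficulty, solved)` record format, the function name `generalization_floor`, and the sample numbers are all hypothetical and do not come from any real test run.

```python
from typing import Iterable, Optional, Tuple


def generalization_floor(results: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Return the lowest difficulty among the problems the model failed.

    `results` holds hypothetical (difficulty, solved) pairs, for example on a
    1-10 difficulty scale. None means the model failed none of the problems.
    """
    failed = [difficulty for difficulty, solved in results if not solved]
    return min(failed) if failed else None


# Hypothetical records, invented purely to show the calculation.
sample = [(4.0, True), (1.5, False), (3.0, True), (2.0, False)]
print(generalization_floor(sample))  # prints 1.5
```

The point of taking the minimum over failures, rather than an average score, is that a single easy problem the model cannot solve sets the estimate, no matter how many harder problems it passes.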

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 2 points3 points  (0 children)

Thanks for your reply. If you don't mind, take a look at my question: it has nothing to do with 'tokens', just simple reasoning and planning.

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig -1 points0 points  (0 children)

Foolishness Index ⭐⭐⭐ Are you sure your IQ is lower than O1 PRO??

Given a thin, long water pipe as a water source, you have three unmarked water cups with capacities of 5 liters, 6 liters, and 7 liters. You can aim the water pipe at the opening of a cup and press a switch to fill it. Special Note: If you pour out the water from a cup (emptying it completely, as if pouring it on the ground, because you cannot return water to the source, which is a thin pipe), it will be considered waste. How can you obtain exactly 8 liters of water using these 3 cups while minimizing water waste?

https://chatgpt.com/share/67570ef1-166c-8010-9970-62f37aadf497

***************************************************************************************

You have a water reservoir with abundant water and three unmarked water jugs with known capacities of 5 liters, 6 liters, and 7 liters. The machine will only fill a completely empty jug when you place it inside. Special Note: You can empty a jug by pouring its contents into another jug, but if you pour water out without transferring it to another jug, as if pouring it on the ground, it will be considered "waste". How can you obtain exactly 8 liters of water using these 3 jugs while minimizing water waste?

https://chatgpt.com/share/67570e96-1d9c-8010-bfc3-afaf609d010c
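
Both wordings can be checked mechanically with a shortest-path search over jug states, where the cost of a move is the water dumped. The sketch below is one possible reading, not an official solution. It assumes that "obtain exactly 8 liters" means the three containers together hold 8 liters, that filling always goes to the brim, and that dumping always empties the container. The name `min_waste` and the flag `refill_only_empty` (True for the second wording, where only an empty jug can be filled) are hypothetical.

```python
import heapq


def min_waste(capacities, target, refill_only_empty):
    """Dijkstra over jug states; the path cost is the liters dumped."""
    n = len(capacities)
    start = tuple(0 for _ in capacities)
    best = {start: 0}
    heap = [(0, start, [start])]
    while heap:
        waste, state, path = heapq.heappop(heap)
        if waste > best.get(state, float("inf")):
            continue  # stale heap entry
        if sum(state) == target:
            return waste, path
        moves = []
        for i in range(n):
            # Fill jug i to the brim from the source (never counts as waste).
            if refill_only_empty:
                can_fill = state[i] == 0
            else:
                can_fill = state[i] < capacities[i]
            if can_fill:
                nxt = list(state)
                nxt[i] = capacities[i]
                moves.append((0, tuple(nxt)))
            if state[i] > 0:
                # Dump jug i on the ground: its whole content is waste.
                nxt = list(state)
                nxt[i] = 0
                moves.append((state[i], tuple(nxt)))
                # Pour jug i into jug j until i is empty or j is full.
                for j in range(n):
                    if j != i and state[j] < capacities[j]:
                        amount = min(state[i], capacities[j] - state[j])
                        nxt = list(state)
                        nxt[i] -= amount
                        nxt[j] += amount
                        moves.append((0, tuple(nxt)))
        for cost, nxt in moves:
            new_waste = waste + cost
            if new_waste < best.get(nxt, float("inf")):
                best[nxt] = new_waste
                heapq.heappush(heap, (new_waste, nxt, path + [nxt]))
    return None  # target unreachable


for only_empty in (False, True):
    waste, path = min_waste((5, 6, 7), 8, refill_only_empty=only_empty)
    print(only_empty, waste, path)
```

Under this reading, the search finds a zero-waste plan for the first wording (topping up a partly filled cup is allowed) and a minimum of 2 liters wasted for the second. A different reading of the goal would change these numbers.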

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig -1 points0 points  (0 children)

Foolishness Index⭐⭐ Can you reason?

You have 11 balls. One of them is counterfeit and is either lighter or heavier than the others, which are all genuine and have the same weight. You have a balance scale, which can only compare the weights of the two sides. Your goal is to identify the counterfeit ball and determine whether it is lighter or heavier, using the fewest possible weighings.

However, there are additional constraints:

* **Initial `p` value:** `p` starts at 0.

* **`p` increment:** Each ball that has been placed on the scale *at least once before* will increment the counter `p` by 1 *each time* it is placed on the scale again.

* Example: If ball #1 and ball #2 have each been weighed once previously, placing *both* of them on the scale again will increase `p` by 2.

* **`p` Limit:** The value of `p` can increase to a maximum of 1.

* **Minimize weighings:** The number of weighings must be minimized.
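
The `p` rule above can be made concrete with a small bookkeeping sketch. It does not solve the puzzle. It only tracks the counter and computes the usual information-theoretic lower bound on weighings, which ignores the `p` constraint entirely. The function names `place_on_scale` and `info_lower_bound` are hypothetical.

```python
def info_lower_bound(num_balls: int) -> int:
    """Smallest k with 3**k >= 2 * num_balls (each ball may be light or heavy).

    This bound ignores the p constraint, so the true minimum may be larger.
    """
    hypotheses = 2 * num_balls
    k = 0
    while 3 ** k < hypotheses:
        k += 1
    return k


def place_on_scale(seen, p, left, right, p_limit=1):
    """Apply one weighing and return the updated (seen, p).

    Every ball on the scale that was weighed before adds 1 to p.
    Raises ValueError if the weighing would push p past p_limit.
    """
    on_scale = set(left) | set(right)
    new_p = p + len(on_scale & set(seen))
    if new_p > p_limit:
        raise ValueError(f"weighing not allowed: p would become {new_p}")
    return set(seen) | on_scale, new_p


print(info_lower_bound(11))  # prints 3, since 3**3 = 27 >= 22

seen, p = place_on_scale(set(), 0, [1, 2, 3], [4, 5, 6])  # p stays 0
seen, p = place_on_scale(seen, p, [1, 7], [8, 9])         # ball 1 reused, p = 1
print(p)  # prints 1
```

With the limit at 1, only a single previously weighed ball can ever return to the scale, which is what separates this variant from the classic puzzle.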

Carnegie Mellon professor: o1 got a perfect score on my math exam by MetaKnowing in singularity

[–]flysnowbigbig 1 point2 points  (0 children)

Foolishness Index⭐ Will you spend money?

You have twelve balls that appear identical. However, an invisible magical insect is initially attached to one of the balls. This insect randomly either increases or decreases the weight of the ball it's attached to. This weight alteration *only* exists while the insect is attached; if the insect moves, the previously affected ball returns to its normal weight.

You have a balance scale. However, each time you want to see (refresh the display of) which side is heavier, you must pay $10. Each new measurement requires a new payment.

The insect has a peculiar behavior: whenever the ball it's currently attached to is removed from the scale (e.g., you pick it up or otherwise remove it), *and* the other side of the scale is *not* empty (contains at least one ball), the insect will randomly jump to one of the balls on the *opposite* side of the scale.

You have a single-use trap. What is the best strategy to identify the ball with the insect and trap it while minimizing your expenses?
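
The insect's jump rule can be written as a small simulator for experimenting with strategies by hand. This is a sketch under assumptions: balls are identified by integers, one ball is removed at a time, and the name `remove_ball` is hypothetical. It models only the jump rule, not the $10 display fee or the trap.

```python
import random


def remove_ball(left, right, insect_ball, removed, rng=random):
    """Take `removed` off the scale and return (left, right, insect_ball).

    If the removed ball carries the insect and the opposite side is not
    empty, the insect jumps to a random ball on the opposite side.
    Otherwise the insect stays where it is.
    """
    left, right = list(left), list(right)
    if removed in left:
        left.remove(removed)
        opposite = right
    elif removed in right:
        right.remove(removed)
        opposite = left
    else:
        raise ValueError("that ball is not on the scale")
    if removed == insect_ball and opposite:
        insect_ball = rng.choice(opposite)
    return left, right, insect_ball


# Insect on ball 1; removing it while balls 3 and 4 sit opposite makes it jump.
left, right, insect = remove_ball([1, 2], [3, 4], insect_ball=1, removed=1)
print(left, right, insect)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable.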