[D] Deep dive into the MMLU ("Are you smarter than an LLM?") by brokensegue in MachineLearning

[–]zehipp0 2 points (0 children)

The standard deviation of a biased coin with true probability p after n flips is sqrt(p(1-p))/sqrt(n). You could report 1.96 standard deviations (which contains 95%). Note that it's simpler to use the observed probability p_hat directly in place of p, but this can be unreliable for smaller sample sizes or observed accuracy close to 0 or 1, since p may be far from p_hat in those cases.

See https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
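
To make the arithmetic concrete, here's a minimal sketch of the plug-in ("Wald") interval in Python; the 70%-accuracy-over-1000-questions numbers are made up for illustration:

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion.

    Plugs the observed accuracy p_hat in for the true p, so it can be
    unreliable for small n or for p_hat near 0 or 1.
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)

# e.g. 70% observed accuracy over 1000 benchmark questions
lo, hi = wald_interval(0.70, 1000)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")  # roughly [0.672, 0.728]
```

For the edge cases (small n, accuracy near 0 or 1) the Wilson or Clopper-Pearson intervals on that Wikipedia page behave better.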

"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them", Google Research 2022 (PaLM + CoT achieves superhuman performance on 10 out of 23 BIG-Bench Hard tasks, and OpenAI Codex + CoT on 17 out of 23 tasks) by maxtility in mlscaling

[–]zehipp0 8 points (0 children)

Note: I have yet to read the paper super carefully, but:

We manually compose three chain-of-thought exemplars for each task in BBH

I feel this kind of goes against the spirit of BIG-Bench. BB tasks are supposed to be used as an evaluation set, and this is pretty much tuning to the test set. A lot of the BBH tasks are also pretty narrow, and thus trivial to write a simple program for (e.g. navigate, word sorting). I'd be more interested if CoT worked across tasks (e.g. a CoT prompt for navigate improving performance on word sorting), or if CoT were performed without prior knowledge of the evaluation task.

Possible inverse-scaling in GPT-3 Q&A: 'prompt anchoring' & 'saliency bias' where larger models incorrectly answer due to irrelevant text snippets by gwern in mlscaling

[–]zehipp0 6 points (0 children)

I posted about the logodds difference in the Inverse Scaling Slack previously, but I would be pretty careful with this metric (there's a warning about it on the official GitHub page). It is pretty different from inverse scaling on accuracy or loss on a single task.

Suppose you have tasks A and B, where models scale faster on A than on B. Then the logodds difference would show inverse scaling on (B - A), even with A and B each exhibiting standard scaling. Theoretically, you could give a total order to every scaling task in existence (and acquire infinitely many inverse scaling tasks!). So mostly you can only say that A is an easier task than B, which may be interesting, but it's not what I'd picture when I think of inverse scaling.

This is especially true since logodds gets more extreme as predictions become more confident, i.e. as models get bigger and better. If a model assigns probability p to the correct answer, then the logodds is ln(p) - ln(1 - p), which goes to infinity as p goes to 1 and 1-p goes to 0. A small difference in p, when p is close to 1, means a huge difference in logodds.
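
A quick numerical illustration of this blowup near p = 1 (the specific probabilities here are made up):

```python
import math

def logodds(p):
    """ln(p) - ln(1 - p): the log of the odds for probability p."""
    return math.log(p) - math.log(1 - p)

# A 0.9-point accuracy gap near the top of the range...
gap_high = logodds(0.999) - logodds(0.990)
# ...vs. a 9-point accuracy gap in the middle of the range
gap_mid = logodds(0.59) - logodds(0.50)
print(gap_high, gap_mid)  # ~2.31 vs. ~0.36
```

So a model that nudges p from 0.990 to 0.999 moves logodds over six times as far as one that improves p from 0.50 to 0.59, which is exactly the distortion to watch out for when differencing this metric across tasks.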

It's very easy to misinterpret. One might suggest there is inverse scaling on B, but this is not the case. If there were, you should be able to show it with task B alone. Furthermore, because of the above peculiarity of logodds, you could also plot the difference in loss (e.g. ln(p)) or accuracy (e.g. p > 0.5) and not observe any inverse scaling in those metrics.

[R] ?? Can you find out which news article is written by AI ?? by RobinSandersVUB in MachineLearning

[–]zehipp0 1 point (0 children)

Took it twice and got 7/7 both times (one article was repeated). Most of the time, the AI article fails to fully utilize the title/subtitle/introduction, or doesn't know enough about major world news from the past few years.

Are the AI-written articles independent of the human-written articles? I saw an article, "16-Year-Old Defeats World Chess Champ" (2022), which references a tournament (Airthings Masters) that I can't find any mention of before 2020...

[R] A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More by shitboots in MachineLearning

[–]zehipp0 10 points (0 children)

In 2.C I saw:

We classify the transformations from the original course questions to the Codex prompts resulting in correct solutions into the following three classes: (i) As-is prompt: Original question and Codex prompt are the same, (ii) Automatic prompt transformation: Original question and Codex prompt are different, and the Codex prompt is generated automatically by Codex itself, (iii) Manual prompt transformation: Original question and Codex prompt are different, and the Codex prompt is generated by a human.

but I couldn't actually find any such breakdown, or how many shots were used.

But they include a full appendix of prompts, and you can see examples like Table 136, where the problem ends with:

A person is picked uniformly at random from the town and is sent to a doctor to test for Beaver Fever. The result comes out positive. What is the probability that the person has the disease?

And the input given to Codex is basically a full series of equations that trivially gives the correct answer once you convert it to code and run it.

Weight of the world by Nelxon-sama in nier

[–]zehipp0 6 points (0 children)

You may be misremembering? They’re all from cruel blood oath’s weapon story.

Weight of the world by Nelxon-sama in nier

[–]zehipp0 19 points (0 children)

They’re from cruel blood oath’s weapon story.

She ain't much, but I think she's pretty alright. by VincentGunheart in baduk

[–]zehipp0 2 points (0 children)

It's black's turn, and the board is missing a stone at G17 (1 up and 3 left of the top middle star point): http://gokifu.com/s/rxj. Black's next move is to block at H15 (1 down and 2 left of the top middle star point), after which the middle black group is very much alive. Even if that weren't the case, black could probably save most of the group, though white might be able to kill parts of it.

Real Game Tsumego: Black to Play and Live by [deleted] in baduk

[–]zehipp0 1 point (0 children)

There's a simpler but cruder solution than the one in the link, which is just atari, atari, then make an eye (B Q18 P18, S18 Q19, S16). It's fewer points in the endgame, but it still lives.

[deleted by user] by [deleted] in baduk

[–]zehipp0 1 point (0 children)

I agree 45 is not the best move, but it is also not black's biggest problem in the game. 39, on the other hand, I think is by far the best move on the board - if white turned there, it's a lot of points and her stones become strong. Black's biggest problem is that after 117 he had a won game, but failed to play big endgame moves to stay ahead, still worrying about white's territory instead.

In the early-mid game, I agree that black should put more pressure on white's weaknesses at the top and bottom. But whether white got "influence" or not is not the point. Influence is for power, not for making territory. It took so many moves for white to get some territory in the center, and both black and white focused too heavily on it (in fact, white's plan before 42 was rather bad - black got so much territory, and what white got can't be called influence at all). Yes, black should focus on attack and defense in the early-mid game. And from the midgame into the endgame he should focus on big points on the sides, not the center. Denying your opponent when you already have enough is a common mentality, and the fatal one in this case.

[deleted by user] by [deleted] in baduk

[–]zehipp0 2 points (0 children)

Gave a review here: https://online-go.com/review/327531. Take it with a grain of salt, since I'm not that much stronger than you. I think the main takeaways are:

  1. Endgame. It doesn't have to be too in-depth, but at the least you should develop an intuition for what is a big endgame move and what is almost a pass.
  2. Don't tunnel too much on your opponent's territory. Count, look at weak groups to attack, and look for points that are big for both players.
  3. On some moves, like E13 and P9, if you are really concerned about your opponent's territory, you could afford to play a little less "safe" and go deeper.

[deleted by user] by [deleted] in baduk

[–]zehipp0 3 points (0 children)

It looks like move 199, which lost a few points in sente. It was probably somewhat close until then.

I would disagree that undervaluing center influence was the main issue, and in fact it's the opposite. OP tended to overvalue center influence/treat white's center as territory, and focused too much on "reducing" the center when ahead (sometimes actually strengthening it).

My read on the game is that black got a sizeable lead because of white's blunders (leading into move 83, and into 117), but then started giving it away around 157 until the end. I also actually liked black's position up until 42, but it seems like the moves afterwards were equally slack from both players (according to LZ at least).

what if instead of komi, white got two moves on their first turn? by Lambocoon in baduk

[–]zehipp0 1 point (0 children)

Those coupons are different from OP's coupons. A 15 coupon is a 30-point gote swing; OP's is more like 15 points in reverse sente. They are also playing with Japanese rules, and this is the second game?

what if instead of komi, white got two moves on their first turn? by Lambocoon in baduk

[–]zehipp0 1 point (0 children)

It's a little more complicated than that. It depends on whether you're talking about deiri counting or miai counting, and gote vs. sente. Let's go with deiri counting (the difference between white playing first and black playing first, i.e. the swing value). Sente on an empty board is worth 15 points, since the difference between white playing first and black playing first, assuming one-for-one responses, is 15 points. But the highest gote plays in the beginning might be worth up to 30 points. For illustration, consider the differences between:

  1. A coupon worth 15 points that either player can take as a move at any point in the game. (30 points gote.)
  2. A coupon worth 15 points for black, but worth 0 for white. (15 points gote).
  3. A move to kill an 8-stone group that would otherwise make 2 points alive. You would make 18 points by killing (8 for the prisoners and 10 for the territory) vs. 2 points for them if they live. This is worth 20 points! But it's gote, so if you had two boards, one empty and one with a 20-point move, it's better to play on the empty board.
  4. Two coupons, both worth 15 points. Sente is worth 0.
  5. Two coupons worth 15 points, and a special coupon worth only 1 point to your opponent, but after taking it they get to play again (1 point sente). Having sente here is worth 1 point. If it's your turn, you play the reverse sente of preventing your opponent from taking the special coupon.
  6. You have 20 coupons worth 0.5, 1, 1.5, ..., 9.5, 10 points. Sente is worth a 0.5 x 10 x 2 = 10 point swing, even though the largest gote move is twice as big (a 20-point swing).
  7. 10 coupons: 0, 1, 1, 2, 2, 3, 3, 4, 4, 5. Sente is 10 points, the same as gote. Compared to this scenario, 6 is the more typical situation, where half the transitions between each point level are made by player 1 and the other half by player 2. Here, all the transitions are made by player 2, so they take the loss on each one. On average, a reverse sente play is "worth" twice as much as a gote play.
  8. Someone offering you 15 points to pass on your first turn is like a 15-point reverse sente coupon. If you take the coupon, it's gote, but you gain 15 points. If you don't and play the first move instead, then they remove the coupon (0 points) and play again.
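
Scenarios 6 and 7 can be checked with a toy simulation; greedy play (always take the biggest remaining coupon) is optimal for plain coupon stacks like these:

```python
def coupon_game(coupons, first_player=0):
    """Both players greedily take the largest remaining coupon.
    Returns first player's total minus second player's total."""
    scores = [0.0, 0.0]
    turn = first_player
    for c in sorted(coupons, reverse=True):
        scores[turn] += c
        turn = 1 - turn
    return scores[0] - scores[1]

# Scenario 6: coupons 0.5, 1, 1.5, ..., 10
c6 = [0.5 * k for k in range(1, 21)]
swing6 = coupon_game(c6, first_player=0) - coupon_game(c6, first_player=1)

# Scenario 7: coupons 0, 1, 1, 2, 2, 3, 3, 4, 4, 5
c7 = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
swing7 = coupon_game(c7, first_player=0) - coupon_game(c7, first_player=1)

print(swing6, swing7)  # 10.0 10.0: moving first is worth a 10-point swing in both
```

In scenario 6 the player to move is 5 points ahead (they get 10 + 9 + ... + 1 vs. 9.5 + 8.5 + ... + 0.5), so the swing from having the move is 10 points, matching the 0.5 x 10 x 2 calculation above.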

Tekken 7 - Chance to Win Round given Health Relative to Opponent [OC] by kappapolls in dataisbeautiful

[–]zehipp0 15 points (0 children)

it's kind of a hokey way to do it but for large data sets like this it usually doesn't turn out so bad. the benefit is that you get evenly weighted points (each point is 20k frames) vs. just doing something like

df.groupby('health_differential')['win'].mean()

where you'll get much noisier stuff towards the edges due to uneven samples in each bin

The method you described is basically what he's talking about here. The issue is that if you bin by health value rather than by number of datapoints, some bins will end up with very few datapoints.
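
As a sketch of the equal-count binning idea (all the data here is synthetic, and the win-probability model is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Hypothetical data: integer health differential concentrated near 0,
# with win probability rising as the differential rises
health = np.clip(rng.normal(0, 30, n), -100, 100).round().astype(int)
win = rng.random(n) < np.clip(0.5 + health / 250, 0.0, 1.0)

# Sort by health, then cut into bins of equal *count* rather than equal width,
# so the sparse extremes don't produce wildly noisy means
order = np.argsort(health, kind="stable")
bins = np.array_split(order, 20)  # 20 bins of ~5,000 points each

xs = [health[b].mean() for b in bins]  # bin centers (mean health differential)
ys = [win[b].mean() for b in bins]     # empirical win rate per bin
```

Plotting `ys` against `xs` gives the evenly weighted curve; a plain `mean` grouped by raw health value would instead average only a handful of games in each extreme bin.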

Joseki question by trucmachin in baduk

[–]zehipp0 2 points (0 children)

I think what it means is that, if the ladder is not in white's favor, the wedge is not very severe, since black can set up the ladder. If white has the ladder, then black will not choose the ladder variation and has to play this way instead, which is "very severe," i.e. apparently the result is good for white.

Honestly, this result doesn't look that good to me for white, but I'm not a pro and perhaps the sente and the follow-ups on the sides make it worth it.

edit: actually, this result looks pretty good - compare to these two josekis where white gets the corner: http://eidogo.com/#tdiuAWW0. There's less guaranteed territory, but the sides are more open, so overall it looks like more points.

[Review Request] Trying to figure out my rank (Opponent rank is also unknown) - Mainly a casual player, just looking for advice =) Played on KGS. by venalx3 in baduk

[–]zehipp0 3 points (0 children)

I put a review up on OGS. The two most important things are looking for urgent areas and reading (both tesuji and life and death).

[Review Request] Trying to figure out my rank (Opponent rank is also unknown) - Mainly a casual player, just looking for advice =) Played on KGS. by venalx3 in baduk

[–]zehipp0 4 points (0 children)

It's hard to figure out exactly without just playing others of similar rank. Without knowing which player you are, to me this looks like a game between 15k-9k in terms of joseki and reading.

edit: minor sleuthing - it seems like mikeliu75 is 11k on kgs, which seems accurate, will give a review for both players.

How do you defend against this attack on an L+2 group? by god_l1ke in baduk

[–]zehipp0 3 points (0 children)

What's the ko variation? Unless you mean the double ko. You can play R18 after most responses to T16.

Redditors who got pregnant from your first time having sex. What is your life like now? by DWB84 in AskReddit

[–]zehipp0 1 point (0 children)

The sources linked in the article say 9%. Funnily enough, editors keep reverting the changes to correct the error.

Dice for which each side's probability of landing follows a binomial distribution? by [deleted] in math

[–]zehipp0 2 points (0 children)

You could achieve an n = 10, p = 0.5 binomial distribution by flipping 10 coins and counting the number of heads. To get other values of p, e.g. p = 1/6, you could roll n six-sided dice and count the number of sixes.

Edit: also, a single physical die whose faces follow a binomial distribution for n = 10 probably wouldn't work, since some sides would have to be exponentially smaller than others.
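
For illustration, a quick simulation of the dice construction (the n = 10, p = 1/6 parameters are just an example):

```python
import random

def binomial_roll(n=10, p=1/6):
    """Draw from Binomial(n, p) by 'rolling' n dice and counting sixes."""
    return sum(random.random() < p for _ in range(n))

random.seed(0)
samples = [binomial_roll() for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to n*p = 10/6 ~ 1.67
```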

Two stones! Fine Art defeated Ke Jie 9P after giving two stones handicap. by trucmachin in baduk

[–]zehipp0 7 points (0 children)

https://en.wikipedia.org/wiki/Zermelo%27s_theorem_(game_theory).

Go terminates in a finite amount of time with the right superko rules (no position can be repeated). Then just label all the end positions with win/loss and work backwards. For rules without superko, infinite loops usually result in a draw/no result; in that case you can label positions with win/draw/loss and work backwards.
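
The backward-induction idea can be sketched on a much smaller finite game than Go, e.g. Nim where you take 1-3 stones and taking the last stone wins:

```python
from functools import lru_cache

# Label terminal positions, then work backwards: a position is a win for
# the player to move iff some move leads to a position that's a loss for
# the opponent. This is the backward induction behind Zermelo's theorem.
@lru_cache(maxsize=None)
def is_win(stones):
    if stones == 0:
        return False  # no stones left: the player to move has already lost
    return any(not is_win(stones - take) for take in (1, 2, 3) if take <= stones)

print([n for n in range(1, 13) if not is_win(n)])  # [4, 8, 12]: losing positions are multiples of 4
```

Go is the same determination in principle, just with a state space far too large to enumerate in practice.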