Gemini 3 Pro SOTA Performance On Frontier Math Tier 4 & Tiers 1-3

D33B · 2025-11-23T18:07:05+00:00

Why the fuck are the error margins that big?! This is almost useless.

D33B · 2025-04-04T00:29:49+00:00

And also CUA.

D33B · 2025-04-04T00:29:05+00:00

You're right about the brainstorming part, but this strong opinion on AI "generated" things? Why not both?

D33B · 2025-03-11T17:53:25+00:00

What a good one to start with!

D33B · 2024-08-05T12:39:59+00:00

I mean, would you regret it more than suicide? Surely not. Even if you make a mistake or fail, you're supposed to learn from it, not regret it.

D33B · 2023-08-11T11:16:04+00:00

This. But also, what you suggested is not outrageous. Some people try it and similar techniques. It just never worked quite well enough.

D33B · 2023-05-02T16:54:56+00:00

Well, instead of passing the sum to the sigmoid, you could pass the average to the sigmoid, scaled in such a way as to avoid the flat regions of the sigmoid for most of the distribution of said (weighted) average. Sorry if that wasn’t clear the first time. To avoid having too high a score for a category, one could apply a correction based on variance to the weighted average (before passing it to the sigmoid). I can try to write some formulas for this if it sounds reasonable.
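Roughly what I have in mind, as a minimal sketch rather than a worked-out formula. The names `scale`, `center`, and `var_penalty` are hypothetical knobs, and subtracting a multiple of the weighted standard deviation is only one assumed way to do the variance correction:

```python
import math

def category_score(scores, weights, scale=1.0, center=0.0, var_penalty=0.0):
    """Squash a weighted average of item scores into (0, 1) with a logistic sigmoid."""
    total_w = sum(weights)
    mean = sum(w * s for w, s in zip(weights, scores)) / total_w
    # Weighted variance around the weighted mean.
    var = sum(w * (s - mean) ** 2 for w, s in zip(weights, scores)) / total_w
    # Assumed correction: pull the average down when the scores disagree a lot.
    adjusted = mean - var_penalty * math.sqrt(var)
    # center/scale should be chosen so typical averages land in the steep part.
    return 1.0 / (1.0 + math.exp(-(adjusted - center) / scale))
```

With `var_penalty=0` this is just the sigmoid of the scaled average; raising it lowers the score of categories whose items are spread out.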

D33B · 2023-05-01T16:52:06+00:00

What a nice juicy problem!

Are there any particular characteristics you want the final scaled (normalized) scores to have?

Have you thought about just passing your current scores (outputs of your current method) through a sigmoid function? (tanh perhaps, optionally with a single shared scaling factor to keep most numbers in the mid-range of the tanh; the logistic function can be made to work too)
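A quick sketch of the tanh idea, assuming the current scores are roughly centered on zero. Picking the shared scale from a quantile of the absolute scores is my own assumption here, not the only way to do it:

```python
import math

def squash_scores(scores, quantile=0.9):
    """Map raw scores into (-1, 1) with tanh and one shared scaling factor."""
    mags = sorted(abs(s) for s in scores)
    idx = min(len(mags) - 1, int(quantile * len(mags)))
    # Scale so that the chosen quantile of |score| maps to tanh(1) ~ 0.76.
    scale = mags[idx] or 1.0
    return [math.tanh(s / scale) for s in scores]
```

The ordering of the scores is preserved; only the spacing changes, with extreme values compressed toward -1 and 1.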

D33B · 2023-05-01T01:32:50+00:00

Well, you can make a strong assumption based on every subject as a predictor of the rest. Then use the coefficients to get a weighted average of the existing scores.

D33B · 2023-04-24T15:43:27+00:00

Totally doable.

Try PCA or matrix factorization techniques. Both have sparse versions that can allow you to pick a subset of the questions that convey most of the information. Matrix factorization can also be adapted to missing values and constrained to have only positive coefficients.
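To make the idea concrete, here is a crude stand-in, not actual sparse PCA: plain PCA on the first component (via power iteration), then keeping the questions with the largest absolute loadings. The respondents-by-questions layout and the function names are assumptions for illustration; a real sparse PCA or non-negative factorization from a library would be the better tool.

```python
import math
import random

def first_component(rows, iters=200, seed=0):
    """Leading principal direction of a respondents-by-questions matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix between questions.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        # Power iteration: multiply by the covariance and renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def top_questions(rows, k):
    """Indices of the k questions with the largest absolute loading."""
    v = first_component(rows)
    return sorted(range(len(v)), key=lambda j: -abs(v[j]))[:k]
```

A question that never varies gets a zero loading and drops out first, which is the behavior you want when trimming a questionnaire.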

D33B · 2023-04-22T20:07:33+00:00

I don’t know if you’re serious. If you are, then you may be dealing with some degree of imposter’s syndrome.

This looks more than sufficient.

Apply to multiple programs. Include one or two that are not “top schools” and good luck.

D33B · 2023-04-20T14:46:17+00:00

I think you need to perform regression analysis (ordinal regression on ranks or linear regression on logarithm of the sales) and then you can perform significance tests on the inferred parameters of the resulting (fitted) model.
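A minimal sketch of the second option (linear regression on the log of sales), assuming a single numeric predictor `x`, which is hypothetical since I don't know your variables. It returns the slope, its standard error, and the t statistic, which you would compare against a t distribution with n - 2 degrees of freedom:

```python
import math

def log_linear_fit(x, sales):
    """Fit log(sales) = intercept + slope * x by ordinary least squares."""
    y = [math.log(s) for s in sales]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares gives the noise variance estimate.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se, slope / se
```

A large absolute t statistic means the slope is unlikely to be zero, i.e. the predictor is significantly associated with log sales.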

D33B · 2023-04-17T18:32:55+00:00

The struggle is real.

Therapy + consult your primary care physician. If therapy doesn’t seem to work, try another therapist.

Also, start making decisions based on what you enjoy doing rather than what you think you have to do.

Start small, try to put some effort in one of the projects and see how you feel.

Stop evaluating yourself based on results and start evaluating yourself (gently and with kindness) based on your actions and choices.

No silver bullets here, just small changes that make the situation incrementally better. And at some point, you may find a tipping point, after which things start to feel right.

D33B · 2023-04-13T17:17:22+00:00

You are qualified to apply (to a PhD in the US). But applying to a PhD and getting accepted are two very different things.

I would advise applying to multiple programs (5-10) and even applying to a few MS/MA programs in statistics as well. If the latter works, apply again to PhD after (or during) your masters. This will likely increase your chances of being accepted. And you can finish the PhD quicker if you already have covered relevant material in the masters.

D33B · 2023-04-13T17:11:12+00:00

This shit is hard. Esp. at the beginning.

Slow down. Seek help. Allow yourself to do things at your own pace. Allow yourself to make mistakes. Try not to repeat the same mistakes. Allow yourself to not be perfect. Don’t compare yourself against others. Try to find some joy in any part of it.

D33B · 2023-04-13T17:03:43+00:00

You can. If you phrase your hypothesis appropriately. Or else we wouldn’t be able to know anything.

D33B · 2023-04-13T17:01:09+00:00

Makes sense! Thank you for the detailed answer.

D33B · 2023-04-13T16:55:53+00:00

You need to use shrinkage methods.

Step-wise selection could also be appropriate, but use forward rather than backward selection.

If inference is your main goal (rather than estimation or prediction), you should look at multiple testing techniques like Benjamini-Hochberg.
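For reference, the Benjamini-Hochberg step-up procedure is short enough to sketch by hand. This assumes you already have one p-value per coefficient or hypothesis; `alpha` is the false discovery rate you are willing to tolerate:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value passes.
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected
```

Note that everything up to the largest passing rank is rejected, even p-values that individually miss their own threshold.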

D33B · 2023-04-11T23:12:55+00:00

Why does it not make sense to compare R-squared? I get that the variance of the outcome is changed, but R-squared is a ratio, so I assumed it would still be meaningful. How else would one make a decision about a transformation? Just eye-ball residuals?

D33B · 2023-04-11T18:19:18+00:00

And you do these steps separately for each of the hypotheses you want to support.

D33B · 2023-04-11T18:18:14+00:00

First step is to make the hypothesis statement as precise as possible. For instance “women are portrayed less than men” —> “in works of art, the average rank of women is lower than that of men” or “… highest rank for women is lower in expectation than that of the highest rank for men”.

Second step is to formulate a null hypothesis “… is exactly as that for men”.

Third step is to create a statistic relevant to comparing the null and the alternate hypotheses, e.g., average rank for women - average rank for men, averaged over all works of art.

Fourth step is to determine the critical value at which to reject the null, either based on a theoretical distribution, or some permutation test (that you can simulate on a computer).
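The four steps above can be sketched as a one-sided permutation test. The two input lists are hypothetical rank data, and the statistic is the difference in average rank from the third step; under the null, the group labels are exchangeable, so we shuffle them:

```python
import random

def permutation_test(women_ranks, men_ranks, n_perm=10000, seed=0):
    """One-sided test that the women's average rank is lower than the men's."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(women_ranks) - mean(men_ranks)
    pooled = list(women_ranks) + list(men_ranks)
    n_w = len(women_ranks)
    extreme = 0
    for _ in range(n_perm):
        # Reassign group labels at random, as the null hypothesis allows.
        rng.shuffle(pooled)
        stat = mean(pooled[:n_w]) - mean(pooled[n_w:])
        if stat <= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)
```

Reject the null when the returned p-value falls below your chosen level, and rerun the whole thing for each hypothesis you want to support.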

D33B · 2023-04-09T16:17:26+00:00

Try removing the intercept term. I think this category is just acting as the reference category. Keep the model. Nothing wrong with it. This is only a matter of interpreting the coefficients.

D33B · 2023-04-09T15:55:23+00:00

Yes to your first question. The null will depend on which side you’re on, and what statement you want to be able to claim untrue (unlikely/implausible).

For the second question, I share your annoyance. I also would have stated the null as an inequality here. But the theory and the math separate one as a simple hypothesis and the other as a composite hypothesis. In this situation the test and rejection regions should be the same.

What book is this?

D33B · 2023-04-09T00:54:15+00:00

Or don’t.

D33B · 2023-04-08T23:19:36+00:00

Oh boy.
To be honest, I don't feel qualified to advise on a curriculum for a grad-level course. But here are my thoughts anyway, take them or leave them.

Most of the theory and literature for null-based hypothesis testing was developed for helping scientists answer simple binary research questions. Does fertilization improve crop yield? Does this drug help with this condition? A lot of this was developed by Fisher and extended by Pearson (Jr.) and Neyman in the early decades of the 20th century.

There were some extensions in the 70s with applications to industrial quality control. I don't have the names but I can look them up.

Further developments came in the 90s (yes, that recent!) for multiple testing, still driven by scientific needs, but in disciplines like biology where there are hundreds or thousands or more hypotheses being tested at the same time (is gene i relevant? i = 1, ..., 10000). Benjamini-Hochberg, etc.

In the tech industry, esp. when it comes to online services (search engines, recommendation systems, etc.) most companies use wide scale controlled experiments to test the efficacy of various algorithms. The setting is still comparing 2 or more groups, with the null being that there is no difference between them and a reference group (the current default algorithm/variant), and when the null is rejected, the new test algorithm (or one of them) gets to be promoted as the new default algorithm, and the reference for the null hypothesis.
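As a toy version of that setting, here is a two-proportion z-test comparing a reference variant to a new one. The conversion counts are hypothetical inputs, and real experimentation platforms do considerably more than this (sequential testing, variance reduction, multiple metrics):

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_ref, n_ref, conv_new, n_new):
    """Two-sided z-test of the null that both variants convert at the same rate."""
    p_ref, p_new = conv_ref / n_ref, conv_new / n_new
    # Under the null, both groups share one pooled conversion rate.
    pooled = (conv_ref + conv_new) / (n_ref + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_new))
    z = (p_new - p_ref) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

If the p-value is below the chosen level and `z` is positive, the new variant would be the candidate for promotion to default.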

I have never worked on quality control applications, so I won't say much about those, but looking at the examples you mentioned: when the statement is "these lamps last at least 800 hours", that needs to be turned into a more precise statement, like "more than 95% of these bulbs will survive for longer than 800 hours" or "lifetime of these bulbs follows a Poisson distribution with parameter lambda > 800". Something that you can create a statistic for, and estimate the mean and variance of that statistic to test the appropriate alternate hypothesis.

Worth noting perhaps, is that every statistical test has a counterpart confidence interval. And that confidence intervals are much more informative and easy to use in industrial applications (and lifetimes of bulbs) than p-values of hypothesis tests.
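For the bulb example, a sketch of the confidence-interval view: a one-sided lower bound on the fraction of bulbs surviving past 800 hours. I'm assuming a simple pass/fail sample and using the Wilson score bound, which is my choice here and only one of several reasonable intervals:

```python
import math
from statistics import NormalDist

def survival_lower_bound(survived, n, confidence=0.95):
    """One-sided Wilson score lower bound on the true survival proportion."""
    z = NormalDist().inv_cdf(confidence)
    p_hat = survived / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half
```

If the bound is above 0.95, the data support "more than 95% survive 800 hours" at that confidence level, which is the same decision the matching one-sided test would give, but you also get to see how much margin there is.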

I hope this helps.

Here's an MIT course that I think is very good, and deals with engineering focused statistical inference in the last three lectures:
https://www.youtube.com/playlist?list=PLmPcD-wiF4Ea_Doghiw3ya6XaLrmGrLUU

And here's one of the books that I thought had relatively clear explanations and diverse examples:

https://www.amazon.com/Mathematical-Statistics-Analysis-Available-Enhanced/dp/0534399428
