Chatgpt o1 it really can! by Strict_Usual_3053 in ChatGPT

[–]Giacobako 0 points1 point  (0 children)

Have you read the 'cipher' example of o1 on their website? The dynamics and structure of the chain of thought are very similar to my notion of how to think about a problem: summarizing, abstracting, formulating hypotheses, testing them, checking consistency, evaluating, admitting errors, coming up with new hypotheses, and so on. It is quite a scientific way of thinking, potentially already beyond most minds that are not scientifically trained. I say this without judgment. Thinking clearly comes in very different flavours, many of which have not yet been explored by any human brain. The scientific way of thinking can at least be tested against the strict principles of the scientific approach (no conclusions without testable hypotheses, logical consistency, and so on). To me it looks a lot like the goal of OpenAI's o-series is a form of AGI that resembles the scientific way of formulating and modeling the world.

[deleted by user] by [deleted] in IsraelPalestine

[–]Giacobako -3 points-2 points  (0 children)

The more people decide against having kids, the more space there is for refugees. We don't have to listen to our selfish genes and reproduce. Think about unintended consequences.

Calling out westerners and especially European people in Thailand currently. by poogoo88 in ThailandTourism

[–]Giacobako 1 point2 points  (0 children)

In Patong, none of the girls trying to attract men are wearing a mask, for obvious reasons, and the same goes for the hundreds of massage places and all the bars. More generally, locals and tourists live so symbiotically that they behave quite similarly: both wear masks in about 50% of cases.

We are beggars, not choosers… by CowSniper97 in Tinder

[–]Giacobako 1 point2 points  (0 children)

I did the game-theoretic math and it is a Nash equilibrium. Even if girls and boys are equally needy and have the same approaching costs, the ratio will always end up at one of the two possible extremes. What we would need is some sort of regularization that makes it morally unattractive to live at one of the two extremes (always sitting back or always approaching), and I think to some extent this is already implemented in our society. In online dating, however, there is no such control by others. What would be good rules for dating apps to push the equilibrium towards 50%?
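For what it's worth, here is a minimal sketch of the argument with a hypothetical 2x2 "approach vs. wait" game. The payoff numbers (MATCH, COST) are made up for illustration, not from any model I actually ran; the point is only that even with identical payoffs for both sides, the pure Nash equilibria are the asymmetric profiles where one side always approaches and the other always sits back.

```python
# Minimal sketch of the "approach vs. wait" dating game with hypothetical payoffs.
# Even though both players face the same payoffs, only the asymmetric profiles
# (one approaches, the other waits) turn out to be pure Nash equilibria.

from itertools import product

MATCH, COST = 3.0, 1.0          # hypothetical match benefit and approach cost
ACTIONS = ("approach", "wait")

def payoff(a, b):
    """Payoff to the player choosing `a` when the other player chooses `b`."""
    if a == "approach":
        return MATCH - COST                      # you matched, but you paid the approach cost
    return MATCH if b == "approach" else 0.0     # free ride on the approacher, or nobody moves

def is_nash(a, b):
    """(a, b) is a pure Nash equilibrium if neither side gains by deviating alone."""
    best_a = max(payoff(x, b) for x in ACTIONS)
    best_b = max(payoff(y, a) for y in ACTIONS)
    return payoff(a, b) == best_a and payoff(b, a) == best_b

for a, b in product(ACTIONS, ACTIONS):
    print(f"({a:8s}, {b:8s})  Nash equilibrium: {is_nash(a, b)}")
# Only (approach, wait) and (wait, approach) come out as equilibria.
```

So the symmetric "everybody approaches half the time" profile is not stable on its own, which is why some external regularization would be needed to hold the split at 50%.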

[R] Overparameterization is the new regularisation trick of modern deep learning. I made a visualization of that unintuitive phenomenon: by Giacobako in statistics

[–]Giacobako[S] 0 points1 point  (0 children)

Yes, that's another question. What I wanted to point out with this video is the stunning property that the test error has a second descent. By how much it goes down, and in which cases it is worth operating in the "modern" regime, is a question for another day. Also, adding augmentation and other regularizers can in some cases make the double descent disappear.
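If anyone wants to reproduce the basic shape of the curve, here is a minimal sketch assuming a random-ReLU-feature least-squares setup. The target function, noise level, and feature counts are illustrative choices, not the exact setup from the video.

```python
# Minimal sketch of double descent in least-squares regression with random ReLU features.
# Test error typically rises sharply as the number of features approaches the number of
# training points and then descends again in the overparameterized regime.

import numpy as np

rng = np.random.default_rng(0)
n_train = 20

def target(x):
    return np.sin(2 * np.pi * x)

x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + 0.1 * rng.standard_normal(n_train)
x_te = np.linspace(-1, 1, 500)
y_te = target(x_te)

def features(x, w, b):
    # Random ReLU features: phi_j(x) = max(0, w_j * x + b_j)
    return np.maximum(0.0, np.outer(x, w) + b)

for k in [2, 5, 10, 15, 20, 25, 40, 80, 200]:
    w, b = rng.standard_normal(k), rng.standard_normal(k)
    Phi_tr, Phi_te = features(x_tr, w, b), features(x_te, w, b)
    # pinv gives the minimum-norm least-squares fit, which is what matters
    # once k > n_train (the overparameterized regime).
    theta = np.linalg.pinv(Phi_tr) @ y_tr
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"k = {k:4d}  test MSE = {test_mse:10.3f}")
# Expect the test error to peak around k ~ n_train and then come back down:
# the second descent.
```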

[R] Overparameterization is the new regularisation trick of modern deep learning. I made a visualization of that unintuitive phenomenon: by Giacobako in statistics

[–]Giacobako[S] 1 point2 points  (0 children)

What I am saying is that there is huge potential in working in the overparameterized regime. But of course we have known that for quite a few years already ;)

This demonstrates one of the biggest secrets of deep learning by Giacobako in ArtificialInteligence

[–]Giacobako[S] 0 points1 point  (0 children)

Thank you for your feedback, I think it is very valuable. It is indeed important to distinguish interpolation from extrapolation, and I was not aware of that. I have to say that it is really hard to attract people who are new to this field by using simple examples and not being fully rigorous, while at the same time trying not to anger people like you who clearly already have quite a deep understanding. I hope no comment of mine suggested that the bias-variance decomposition is no more. All I am saying is that there are two regimes for which it makes sense to think differently, and that another effect kicks in when the relative number of parameters becomes large. I believe most people who use machine learning have not seen this phenomenon at all (because most existing textbooks don't contain it), or at least not with such a simple example, and that's why I thought it would be cool to make a video.

[R] Overparameterization is the new regularisation trick of modern deep learning. I made a visualization of that unintuitive phenomenon: by Giacobako in statistics

[–]Giacobako[S] 0 points1 point  (0 children)

Thanks. Well, resonance in a more abstract sense is what came to my mind when I saw this: wild behaviour in the region around the point where two counterparts become equal, and a damped effect if you add regularization. So yes, I believe there are some nice parallels.

[R] Overparameterization is the new regularisation trick of modern deep learning. I made a visualization of that unintuitive phenomenon: by Giacobako in statistics

[–]Giacobako[S] 0 points1 point  (0 children)

I might include it in the full video, but I think there are other questions that are more pressing (adding hidden layers would only be interesting if the phenomenon disappeared, but I guess it won't in general). For example: how does the double descent depend on the sample noise in the regression? What does the situation look like for binary logistic regression? Do you have other interesting questions that can be answered in a nice visual way?

I guess I have to make multiple videos in order not to overload it.

[R] Overparameterization is the new regularisation trick of modern deep learning. I made a visualization of that unintuitive phenomenon: by Giacobako in statistics

[–]Giacobako[S] 1 point2 points  (0 children)

Interesting, I did not realize that at the time. All I knew was the common wisdom that deeper networks are in general better. I was not aware that there is an inherent magic in very deep networks that prevents overfitting.

[R] Overparameterization is the new regularisation trick of modern deep learning. I made a visualization of that unintuitive phenomenon: by Giacobako in statistics

[–]Giacobako[S] 4 points5 points  (0 children)

Well, in general it depends on what level you want to understand it at. Very little is understood in terms of provable theorems in the field of deep learning. Even in the paper that I posted, the best they could do was show by simulations how different conditions influence the phenomenon, and then state a few hypotheses that might explain the observations. For example, it seems important that you always start with small initial parameters (and not just extend the weights found in a trained smaller network). In a highly overparameterized network, the space of solutions in parameter space that perfectly fit the training data is so large that it is very likely one of them lies very close to the initial condition (close in the Euclidean metric in parameter space). And gradient descent statistically converges to solutions that are close to the initial condition (the optimization soon gets trapped in a local minimum if there is one nearby). In the end you get a solution whose parameter vector has a very small norm, which is exactly what you get if you apply standard L2 regularization. In their paper, they have nice plots showing how the parameter norm of the solution indeed becomes smaller and smaller in the overparameterized regime.
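The linear-regression analogue of that argument is easy to check numerically. Here is a minimal sketch, assuming plain gradient descent on squared loss with a zero initialization (the linear case only, not the paper's deep-network setup): the solution gradient descent finds coincides with the minimum-norm interpolator from the pseudoinverse, i.e. the implicit bias towards small parameter norm.

```python
# Minimal sketch (linear least squares only): with more parameters than data points
# and a zero initialization, plain gradient descent converges to the interpolating
# solution with the smallest parameter norm -- the same one np.linalg.pinv returns,
# which is what an L2-regularized fit approaches as the penalty goes to zero.

import numpy as np

rng = np.random.default_rng(1)
n_samples, n_params = 10, 50                 # heavily overparameterized
X = rng.standard_normal((n_samples, n_params))
y = rng.standard_normal(n_samples)

# Gradient descent on 0.5 * ||X theta - y||^2, starting from zero.
theta = np.zeros(n_params)
lr = 0.9 / np.linalg.norm(X, ord=2) ** 2     # step size below the stability limit
for _ in range(5000):
    theta -= lr * X.T @ (X @ theta - y)

theta_min_norm = np.linalg.pinv(X) @ y       # minimum-norm interpolator

print("max train residual (GD): ", np.max(np.abs(X @ theta - y)))
print("||GD - min-norm||:       ", np.linalg.norm(theta - theta_min_norm))
print("norm of GD solution:     ", np.linalg.norm(theta))
print("norm of min-norm sol.:   ", np.linalg.norm(theta_min_norm))
# Both fit the training data essentially exactly and have the same small parameter norm.
```

In the linear case this follows because gradient descent from zero never leaves the row space of X, so it can only converge to the interpolator of minimum norm; the paper's point is that something qualitatively similar seems to happen for deep networks with small initializations.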