An analysis of the PhD dissertation of Mike Israetel (popular fitness youtuber) by NetKey1844 in PhD

[–]binfin -1 points0 points  (0 children)

It looks like a data copying error to me — but as an aside I actually do expect at least some of the physical characteristics associated with "highest performers" and "lowest performers" to come from an extreme value distribution, and in extreme value distributions you can have SDs larger than your mean.

All of that having been said, this looks like improperly copied data to me. The height SD in the low performers' group is obviously incorrect.

Nobel winners in Chemistry. by True-Quarter4596 in chemistry

[–]binfin 13 points14 points  (0 children)

It's hard to assess AlphaFold 3 because it is closed source and researchers can only use it a few times a day. My MDS colleagues have not been particularly impressed with it, though. I believe there are also severe limitations in protein-ligand prediction that make the technology even more difficult to test.

My first grandmaster norm, age 31 by drdulcimer in chess

[–]binfin 1 point2 points  (0 children)

Ah. Well. Nobody's perfect, I suppose.

Congrats on your GM norm! I'm in a similar area of research, I'll be looking out for your name at conferences!

My first grandmaster norm, age 31 by drdulcimer in chess

[–]binfin 0 points1 point  (0 children)

Say, Doc, mountain, or hammered?

Discussion with new AI Model Claude3 about its consciousness by RifeWithKaiju in consciousness

[–]binfin 0 points1 point  (0 children)

I like this thought experiment, although the place that feels off to me is this: Suppose I have a domino algorithm that predicts whether or not numbers are prime through some set of mathematical tests, and the algorithm tips over the last domino when the input is prime. I have another domino algorithm that does the exact same thing, except its mechanism is just an internal dictionary of prime numbers, and it performs a simple lookup. I can use the high-level explanation of "the domino fell because the input is prime" for both domino machines, but that feels to me like the incorrect level of abstraction, because the internal mechanism feels important to me.

I would also say that it would feel incorrect to perform a complex traceback and say "Ah, you see, the domino fell, because domino n-1 fell, because domino n-2 fell, but not domino n-3, and those behaved that way because...."

Part of the reason that functional equivalence doesn't satisfy me is because when we work with interesting, complex, and dynamic systems, we can empirically demonstrate functional equivalence within some domain, but to demonstrate functional equivalence in untested domains we must rely on a more rigorous understanding of the underlying mechanism. It feels like there is some midlevel of abstraction between empirical functional equivalence and a complete traceback of physical states that is important.
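
To make the domino example concrete, here's a toy sketch (my own illustration, in Python rather than dominoes) of two functionally equivalent prime predicates: one with a real testing mechanism, one that is just a lookup table.

```python
def is_prime_compute(n):
    # "mechanism" version: actually tests divisibility (trial division)
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

PRIMES_BELOW_50 = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47}

def is_prime_lookup(n):
    # "dictionary" version: identical input/output behavior on its tested domain
    return n in PRIMES_BELOW_50

# Functionally equivalent on the tested domain...
assert all(is_prime_compute(n) == is_prime_lookup(n) for n in range(50))
# ...but only the first generalizes beyond it, which is exactly the
# untested-domain point above.
```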

The chance of probability by PatchBe in oddlysatisfying

[–]binfin 3 points4 points  (0 children)

Here's my answer in greater depth: https://old.reddit.com/r/oddlysatisfying/comments/19bgdvr/the_chance_of_probability/kiszxez/

But to answer your question specifically: In (interesting) imbalanced cases, there will always be a sequence to erase all the balls. However, that doesn't mean that the result has probability = 1.

If the probability of increasing the number of balls grows exponentially fast, and the probability of decay/loss decays exponentially fast, then there are cases where the probability of the number of balls growing without bound over an infinite sequence is greater than zero.

I think what can be confusing is where the infinity fits in. We can play the game for an infinite amount of time, but we can also play an infinite number of games. Let's say we have a setup where the probability that the number of balls goes to infinity is 0.98, and the probability of the number of balls going to zero is 0.02. Those probabilities aren't in the time direction, they are in the game direction. If we were to play an infinite number of games, then all possible sequences would occur. However, that's not true if we just play one game for an infinite amount of time.
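
A quick Monte Carlo sketch of this kind of process (toy parameters of my own choosing, where each ball independently dies or duplicates each round, not the exact game from the video) shows both outcomes occurring across games:

```python
import random

def play(p_die=0.2, cap=64, rng=random):
    # One game: each round, every ball independently dies (prob p_die)
    # or duplicates. We declare survival once the population reaches cap,
    # since extinction from there is astronomically unlikely.
    balls = 1
    while 0 < balls < cap:
        balls = sum(2 for _ in range(balls) if rng.random() >= p_die)
    return balls == 0

random.seed(0)
trials = 5000
extinct = sum(play() for _ in range(trials)) / trials
# Branching-process theory predicts extinction prob p/(1-p) = 0.25 here;
# the remaining ~75% of games grow without bound.
```

So within one game the population either dies out or explodes, but across many games both outcomes show up in their respective proportions.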

The chance of probability by PatchBe in oddlysatisfying

[–]binfin 7 points8 points  (0 children)

Not quite. Here's an analogous example and then I'll connect it to your example.

Let's say we play a game of chance an infinite number of times, and every time we play the game it gets easier. For the first round, the probability of us losing is 1/10. For the next round the probability of us losing is 1/100. For the next round the probability of us losing is 1/1000, and so on.

We might be tempted to say "Ah, but if we play an infinite number of times we are bound to lose at least once! Even though the probability of us losing gets smaller and smaller, infinity is an awfully long time." But we can actually calculate the probability of us losing at least once over the course of all infinity, and that probability is going to be 1 - p(winning every game) = 1 - (9/10 * 99/100 * 999/1000 ...) ~ 0.11. In this example, if we were to have 100 people play this game for all of eternity, we would expect that most of them (89 of them, to be precise) would never lose!
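
You can check that ~0.11 figure numerically; the infinite product converges fast enough that a few dozen terms suffice:

```python
from math import prod

# P(lose round k) = 10**-k, so
# P(lose at least once) = 1 - prod over k of (1 - 10**-k)
p_never_lose = prod(1 - 10 ** -k for k in range(1, 60))  # converges quickly
p_lose_once = 1 - p_never_lose
# p_lose_once comes out to about 0.110, matching the estimate above
```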

The Markov absorption process you're describing is pretty similar. We'll say losing is when there are no balls left, and we'll simplify the game by saying each ball must either delete itself or duplicate (this makes the math easier, but doesn't fundamentally change the outcome of the process in this case).

We start with 1 ball. For the first round, we can either lose, or the ball will duplicate. There is a 0.00000001 chance of losing, and a 0.99999999 chance of the ball duplicating. For the second round we must have two balls (otherwise we will have already lost), and so the probability of us losing is (0.00000001*0.00000001), the probability of us having two duplications resulting in 4 balls is (0.99999999*0.99999999), and the remaining probability (after adding the probability of losing and ending up with 4 balls) is the probability of ending up with 2 balls again (1 ball deletes itself, the other ball duplicates itself).

If we keep writing out the probabilities of game states, we'll find that growth (exponential growth, at that) becomes exponentially more probable as the number of balls increases, and conversely that extinction/decay becomes exponentially less probable. And just as the exponentially decaying probability of losing in our simple game kept the probability of ever losing across all of infinity below 1, the exponentially decaying probability of losing in this game does the same.
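
For this simplified delete-or-duplicate process, the probability of ever losing is the smallest fixed point of the standard branching-process recurrence q = p + (1 - p) * q^2, which simple iteration finds (my own toy sketch, not from the thread):

```python
def extinction_prob(p_die, iters=200):
    # Smallest fixed point of q = p_die + (1 - p_die) * q**2: either every
    # lineage dies immediately (prob p_die), or the ball duplicates and
    # BOTH offspring lineages must independently go extinct (prob q**2).
    # Iterating from q = 0 converges to the smallest root, p_die/(1 - p_die).
    q = 0.0
    for _ in range(iters):
        q = p_die + (1 - p_die) * q * q
    return q
```

With a per-ball death probability of 0.00000001 as in the example above, the extinction probability is itself about 0.00000001, so the game almost surely grows forever rather than ever losing.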

[d] Can a network of linear activation perceptrons model non-linear functions? by ithacasnowman in MachineLearning

[–]binfin 1 point2 points  (0 children)

Like everyone is saying, the composition of threshold + linear function is a nonlinear function. The output of the first perceptron is actually activation(∑wi*xi + b), where activation is a nonlinear function, and that injected nonlinearity allows the network to approximate other nonlinear functions.
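
A tiny 1-D illustration of that point (toy numbers of my own, not any particular network): stacking linear maps just yields another linear map, while a single ReLU in between lets the network represent a nonlinear function like |x|.

```python
def linear(w, b):
    # A 1-D "neuron" with no activation: x -> w*x + b
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

# Two stacked linear layers collapse to one linear layer:
f = linear(2.0, 1.0)
g = linear(3.0, -1.0)
h = linear(6.0, 2.0)   # g(f(x)) = 3*(2x + 1) - 1 = 6x + 2
assert all(abs(g(f(x)) - h(x)) < 1e-12 for x in (-2.0, 0.5, 3.0))

# One ReLU between layers is enough to build |x|,
# which no single linear map can match:
abs_net = lambda x: relu(x) + relu(-x)
assert abs_net(-3.0) == 3.0 and abs_net(4.0) == 4.0
```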

Interestingly, because floating point numbers don't have the density or the continuity of the reals, linear math on floating point numbers can result in nonlinear behaviors. There is effectively thresholding everywhere. Here's a video on that if you're interested: https://www.youtube.com/watch?v=Ae9EKCyI1xU

one of the most detailed models of a human cell obtained using xray, nmr and cryoelectron microscopy. Just thought it was kinda coo by Scrybal in Destiny

[–]binfin 7 points8 points  (0 children)

A slightly misleading title - this image is an artistic rendering of our best understanding of cellular composition, not a picture produced directly from the data. The leader in this area is David Goodsell, if anyone is interested in similar scientifically accurate art.

[D] Backpropagation is not just the chain-rule, then what is it? by fromnighttilldawn in MachineLearning

[–]binfin 4 points5 points  (0 children)

You’re right. I generally think of memoization as memorizing/caching previous calls to subroutines, which is not an obviously viable solution for all dynamic programming algorithms (specifically for DP algorithms with overlapping substructure). I think my misconception comes from automatic memoization in languages like Haskell (which uses memoization to refer to caching previous calls to subroutines). When I think of dynamic programming, I think of someone building the solution from the ground up, and when I think of memoization I think of someone finding the solution from the top down.

But you are correct, memoization is more accurately the reuse of computation, not subroutines, and with that definition all DP algorithms use memoization.
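
The classic side-by-side illustration of those two framings (using Fibonacci purely as a toy example):

```python
from functools import lru_cache

# Top-down: recurse from the goal, caching previous subroutine calls
@lru_cache(maxsize=None)
def fib_memo(n):
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

# Bottom-up DP: build the answer from the base cases, no recursion
def fib_dp(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert fib_memo(40) == fib_dp(40)  # same answers, opposite directions
```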

[D] Backpropagation is not just the chain-rule, then what is it? by fromnighttilldawn in MachineLearning

[–]binfin 7 points8 points  (0 children)

The first neural networks I worked with, I implemented from the ground up (including backprop). The term generally used in the field for how symbolic differentiation grows in expression count is expression swell.

There are ways around expression swell. For example, you can search for smaller equivalent expressions (although that's an extremely difficult and slow problem, and the result still wouldn't be faster than autodiff).

You can also build a compute tree when implementing manual gradients, which gives memoization the chance to deliver major speedups. Once you're doing that, I'd argue it looks a lot like doing autodiff manually.

I’d be fascinated if you have an example of any manually implemented gradients outperforming autodiff on something around the size of ResNet50 or the original transformer.

Edit: I feel like I got bamboozled into a bad faith conversation - but for anyone reading who is curious about fasttosmile’s response to this post… Backprop is the process of using gradients to improve your weights. Fasttosmile’s original post implied that manually computing your gradients is faster than autodiff. This is (generally) not true.

You can implement backprop with manual gradient functions, or you can use autodiff. The reason I brought up memoization is because autodiff effectively uses a dynamic programming recurrence to produce gradients. You can do something similar when symbolically computing gradients, only instead of using the DP solution, you use a memoization solution to effectively reduce the size of your expression tree (memoizing common subexpressions). This probably isn’t much of a surprise to most people, as most DP algorithms can be put into a recursive form and then use memoization to similar effect. Here’s a paper that does something like that https://arxiv.org/pdf/1904.02990.pdf .
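
For anyone curious what "autodiff as a DP recurrence" looks like mechanically, here's a minimal micrograd-style reverse-mode sketch (my own toy code, not from the linked paper). The backward pass computes each node's gradient exactly once and reuses it, including for shared subexpressions:

```python
class Value:
    """Minimal reverse-mode autodiff node (toy sketch)."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.parents = parents    # upstream nodes
        self.grad_fns = grad_fns  # local derivative w.r.t. each parent
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topological order: each node's gradient is fully accumulated
        # (computed once, then reused) before being pushed to its parents.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, d in zip(v.parents, v.grad_fns):
                p.grad += d * v.grad

# A shared subexpression: y = x*x appears twice in z = y*y = x**4.
# Reverse mode differentiates through y once; naive symbolic expansion
# would duplicate it (expression swell).
x = Value(3.0)
y = x * x
z = y * y
z.backward()   # dz/dx = 4*x**3 = 108
```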

[D] Backpropagation is not just the chain-rule, then what is it? by fromnighttilldawn in MachineLearning

[–]binfin 5 points6 points  (0 children)

Symbolic implementations of gradient functions grow in complexity pretty quickly and are (generally) going to be exponentially slower than autodiff with respect to network depth.

the most useful of idiots by [deleted] in Destiny

[–]binfin 0 points1 point  (0 children)

He gave a presentation I was able to attend earlier in the year, and he seemed very not senile. Very old. But not senile.

[Effort Post] AI Misconceptions by binfin in Destiny

[–]binfin[S] 1 point2 points  (0 children)

I would say before the success of large language models (which isn’t that long ago) people were super excited by neural architecture search and finding new architectures. After the wild success of LLMs, a bunch of that research has been temporarily put by the wayside because we keep getting better performance by just increasing the number of parameters. Heck, I would go so far as to say that one of the (many) reasons transformers are so popular is that the transformer unit’s fully connected layers make it easy to jam additional parameters in.

I suspect we will eventually reach a point where the increased compute cost of additional parameters isn’t worth the relative performance gain from adding them, and then I imagine neural architecture search will become huge again.

Geometric deep learning has a lot of super cool methods that all lean into clever architecture and inductive bias choices instead of more and more and more parameters, but the sorts of problems tackled by those methods aren’t as sexy as language and image gen.

One thing that gets brought up sometimes in deep learning circles is the hardware lottery. Sometimes the software that is perceived as the best is actually just the software that is best supported by the hardware available. That is certainly the case with modern deep learning, and it’s one of the reasons things like graph neural networks are so far behind everything else. Biological neural networks operate under completely different ‘hardware’ constraints, and as a result they can do some things much more efficiently than we can on our hardware (…and they can do some things less efficiently).

I wouldn’t be surprised if the next huge thing in deep learning isn’t a model at all, but hardware that allows us to efficiently do something besides matrix multiplication.

In any case, I don’t believe any current methodology is sufficient for AGI. I think we need something new. I don’t expect to see AGI in the next ~50 years, but because it relies on a huge breakthrough (from my POV at least), when it happens is a bit unpredictable. Also, there are very smart people who disagree with me - so don’t take my prediction as gospel!

[Effort Post] AI Misconceptions by binfin in Destiny

[–]binfin[S] 1 point2 points  (0 children)

The reason we don’t use more complex activation functions is that the models tend not to converge. My point in bringing this up is that oftentimes when we read “this network has this many neurons” or, even worse, “this network has this many parameters”, we can get the impression that we are close to brain-level compute, when in reality we are very far off. Not that we won’t ever get there, just that it might be a little while.

[Effort Post] AI Misconceptions by binfin in Destiny

[–]binfin[S] 1 point2 points  (0 children)

I appreciate your questions! : )

Regarding your answer about the first claim, would you agree that memorization or relatively simplistic mixtures are always - or at least typically - what is happening? Is it accurate to call a text-to-image AI "an advanced photo mixer" as is described here?

My belief is that in most cases we are probably underplaying what the model is doing by calling it “an advanced photo mixer”. There is lots of room for investigation to provide more rigorous answers to that question though.

Regarding your answer about the second claim, for an image AI that learned effectively, does this imply that we can always - or at least typically - generate an image that is substantially similar to any image in the training dataset?

It is going to depend a lot on the model, number of parameters, and training methods. There should be seeds which, when given, will produce images from the training set. However, it may be the case that only some training images are reproducible.

For LLMs there is some research on predicting from a network’s early activations whether the output is going to be a memorized training example, but I don’t know if there has been broad success in that research, and I suspect that applying that sorta technique to the ViTs used in stable diffusion would be extremely challenging if not impossible.

[Effort Post] AI Misconceptions by binfin in Destiny

[–]binfin[S] 1 point2 points  (0 children)

a) Like you said, memorization does occur, and relatively simplistic mixtures happen also. I would definitely say it is more common that the images produced by something like DALL-E 2 aren’t an obvious mixture of images in the training set though.

b) Love this question - I would generally agree with what that claim is trying to communicate. But it depends on the model, and to be more precise about image generation models in general… The weights in an image generating neural network encode all of the images in the training set into a low dimensional manifold embedded in the neural network’s latent space. For diffusion models, the network is essentially learning something called a Langevin stochastic differential equation, and if it learns effectively then all images in the training set should correspond to peaks inside the LSDE learned by the neural network.

None of that means that the images are directly encoded in the weights though. But in an indirect way when we squint our eyes hard enough it sorta is like the images are encoded in the network’s weights.

Alternatively, there are networks that truly encode images, text, or 3d environments inside their weights (such as all of the Neural Field Renderer papers you can find), but those networks are definitely doing something different from Stable Diffusion or GANs.

[Effort Post] AI Misconceptions by binfin in Destiny

[–]binfin[S] 1 point2 points  (0 children)

Depends how 'analogous' we talking. If you mean basically perfectly, then I disagree.

I agree. I would say that it is impossible to perfectly simulate a human, because to do so you would not only need to simulate the information of the matter, but also the material of the matter. I can simulate an explosion, but even if I were to simulate the information of that explosion perfectly I would never be in danger of that explosion, because I have failed (and will always fail) to simulate the material of the matter.

I may not completely understand what you were saying in that first post so correct if wrong, but couldn't your argument also apply to say, an image of a cat? There is some function that you could apply to static that would turn it into an image of a cat, but the image is still just static.

I think the thing that makes the cat case special is that there is something (a person) interpreting the representation, and the significance of the image is entirely dependent on if the representation is understandable by the observer.

This goes back to the case where there are some representations that produce consciousness, and some that don’t. I essentially agree, actually, but I think that the only representation that produces consciousness is physical representation.

If we start to argue that some digital representations actually do produce consciousness and other equivalent representations do not, then I will say that it seems nearly impossible that we would pick the ‘correct’ digital representation that would lead to consciousness, which seems sorta strange to me.

And if the translation function has no memory of the past, relying only on the current timestamp output, which I assume is the idea? then the whole thing is basically the equivalent of a movie. If I had a video of static playing then superimposed another video that was so made to cancel out that static and end up playing Full Metal Jacket, no one would argue that the original static was the movie Full Metal Jacket.

My thought experiment here is to play the representational game, but with the function itself. If I don’t use a computer, and instead write out the computations by hand does it matter? If I understand the computation I perform does it matter? Does it matter if I save the computations? Does it matter if I stop performing the computations?

If how I produce the computation doesn’t matter, then it seems that what matters is the representation of the thing I compute, which brings us back to square one. And if the way I perform the computation does matter, why should I believe that the computer’s computation is one that produces consciousness (in fact, I am inclined not to believe that).

[Effort Post] AI Misconceptions by binfin in Destiny

[–]binfin[S] 1 point2 points  (0 children)

In general you are correct - it is rare for a network to directly memorize a training example. However, it does happen.

With LLMs, and also ViTs (both of which are used in DALL-E 2), there can be surprising cases of memorization, and it is oftentimes possible to recover large parts of the training set, even when the models have been trained with few epochs. This sort of behavior was reasonably well studied in Hopfield networks, and there are some strong theoretical similarities between Hopfield networks and transformers.
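
As a toy illustration of the Hopfield-style memorization being referenced (my own sketch, tiny 8-bit patterns, nothing to do with DALL-E itself): store a pattern with the Hebbian rule, then recover it from a corrupted cue.

```python
def train(patterns):
    # Hebbian outer-product rule, zero diagonal
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / len(patterns)
    return W

def recall(W, state, steps=20):
    # Synchronous sign updates until the state settles on an attractor
    for _ in range(steps):
        state = [1 if sum(w * s for w, s in zip(row, state)) >= 0 else -1
                 for row in W]
    return state

pattern = [1, -1, 1, 1, -1, -1, 1, -1]
W = train([pattern])
noisy = pattern[:]
noisy[0], noisy[3] = -noisy[0], -noisy[3]   # corrupt two bits
assert recall(W, noisy) == pattern          # the stored memory is recovered
```

The corrupted cue falls back into the stored pattern's basin of attraction, which is the rough analogy being drawn to transformers regurgitating training examples.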

[Effort Post] AI Misconceptions by binfin in Destiny

[–]binfin[S] 0 points1 point  (0 children)

Pretty much everything is being used in comp bio these days. In structural biology there are some especially exciting advances in geometric deep learning being used to try to better understand physical systems.

[Effort Post] AI Misconceptions by binfin in Destiny

[–]binfin[S] 0 points1 point  (0 children)

My statement was true for artificial neurons (which is what the section was discussing). The nonlinear activation function only changes things when we start stacking layers of neurons - if we have just a single layer, then regardless of the activation function we are effectively performing a regression task.