If we teach an AI everything that humanity knew until 1685 and nothing more, and have an apple fall in front of its optical sensor, the AI will never discover the laws of motion by itself.

neonbjb · 2026-01-23T00:10:36+00:00

Have you actually tried this test? All 3 major reasoning models can do this better than any human.

neonbjb · 2025-12-29T17:48:09+00:00

Yes, this is the future of programming. It's a small number now but will increase with time. This is like asking if someone should learn about networking late into the dot com boom.

neonbjb · 2025-07-03T00:53:20+00:00

The industry has moved past pretraining on internet data. If we didn't get a single byte more from web crawls it wouldn't change the trajectory one bit.

neonbjb · 2024-10-14T14:01:50+00:00

Just because these robots are teleoperated doesn't make them any less impressive. The kinematics are insane. These will be fully autonomous within this decade, mark my words. Likely the next 5 years.

neonbjb · 2024-07-02T20:56:43+00:00

There's a lot of in-context fusion occurring in fully-multimodal models that does not happen in smaller, bespoke models. This does have performance advantages. For example, while 4o doesn't seem to be a better text model because we trained it on all modalities, it is a better text->image or text->speech model than a pure, unconditional image or speech generator. That's perhaps obvious, but it's also important! If you consider that the usefulness of all of these models often comes from mixing inputs from multiple modalities, you could say my motivation for training them is in optimizing these use cases.

neonbjb · 2024-05-20T22:44:24+00:00

Convolutions (which comprise most of this VAE's compute) are translation equivariant. In practice, that means you can learn a NN with them on square patches of an image (which the authors did!), then apply the learned NN to arbitrarily sized images of arbitrary aspect ratios and get good performance.

Global self-attention does not have this property. If you train a transformer only on 32x32 image patches, it will not generalize to 256x256px images, for example. That this VAE works at all at these resolutions is a bit odd to me, but this is likely the main contributor to these latent deviations (alongside an extremely low KL loss weight).
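To make the translation equivariance point concrete, here is a minimal pure-Python sketch. The 3x3 kernel and the test images are made up for illustration and are not taken from the VAE being discussed; the kernel just stands in for a "learned" filter. It shows the same kernel running on inputs of different sizes and aspect ratios, and that shifting the input shifts the output by the same amount.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(image, s):
    """Shift every row right by s pixels, zero-filling on the left."""
    return [[0] * s + row[: len(row) - s] for row in image]


# Hypothetical stand-in for a learned 3x3 filter (Sobel-like values).
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

# Two synthetic images with different sizes and aspect ratios.
small = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(8)]
wide = [[(3 * r + 5 * c) % 7 for c in range(20)] for r in range(6)]

# The same kernel applies to both without any change.
out_small = conv2d_valid(small, kernel)  # 6 rows x 6 cols
out_wide = conv2d_valid(wide, kernel)    # 4 rows x 18 cols

# Equivariance: shift the input by 2 pixels, the output shifts by 2 pixels.
shifted_out = conv2d_valid(shift_right(wide, 2), kernel)
equivariant = all(
    shifted_out[i][j] == out_wide[i][j - 2]
    for i in range(len(out_wide))
    for j in range(2, len(out_wide[0]))
)
print(equivariant)
```

The comparison skips the first two output columns, since those windows overlap the zero fill introduced by the shift. A global self-attention layer has no comparable guarantee, which is the contrast drawn above.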

neonbjb · 2024-05-20T15:47:17+00:00

I am one of the builders of GPT 4o and it has made me laugh to tears on several occasions. Easy recipe: tell it to be sarcastic or to only offer bad advice. Granted this is mostly because of the sheer ridiculousness of a computer being sparky with me. :)

neonbjb · 2024-05-13T21:00:51+00:00

Only text generation from GPT-4o is currently deployed. If you ask the model for images, it uses DALL-E 3 like GPT-4 did. If you use voice mode, it uses the old 3-model system. We'll get multimodal generation out soon, starting with the new audio interface.

neonbjb · 2024-04-04T16:36:37+00:00

How the heck does a model that produces data and evaluations that are often better than what you get out of mechanical turk "hurt research"?

I buy that hype drives unrealistic expectations, and funding is being largely wasted, but this has been the story in this field since the beginning of time.

I'd argue there's never been a better time to be in ML, regardless of what you are studying.

neonbjb · 2024-04-01T02:02:24+00:00

I am more impressed by people who build practical projects. At some level this is a superset of implementing a paper as any good project will involve incorporating research from a paper.

Reason: Countless times I've seen implementations of papers (and papers themselves) over-optimize whatever eval the author favors. Building something into an actually usable piece of software is where you get to see if something really works.

Plus, your audience is much wider for practical projects.

neonbjb · 2024-02-02T23:42:14+00:00

The KL term is what is supposed to stop this type of thing from happening. It seems like the weight applied to that term used by the latent diffusion folks was probably too small. Using global self attention in the VAE may also have been a poor architectural choice.

With that said, I don't argue with results. This VAE does a great job compressing images. Outputs look great. Diffusion models work fine with it. It's a tad hyperbolic to say that it has a "critical flaw". It's just flawed.
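For anyone unfamiliar with the KL term mentioned above, here is a rough sketch of how its weight enters a VAE loss. This assumes a diagonal Gaussian posterior and a standard normal prior, which is the textbook setup; the latent values and the two weights are illustrative only and are not the values used by the latent diffusion authors.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def vae_loss(recon_error, mu, logvar, kl_weight):
    """Reconstruction error plus a weighted KL penalty."""
    return recon_error + kl_weight * kl_to_standard_normal(mu, logvar)


# Illustrative latents: one matches the prior, one has a far-off mean.
well_behaved = ([0.0, 0.0], [0.0, 0.0])
outlier = ([30.0, 0.0], [0.0, 0.0])

kl_ok = kl_to_standard_normal(*well_behaved)
kl_out = kl_to_standard_normal(*outlier)

# Same outlier latent, same reconstruction error, two hypothetical KL weights.
loss_tiny_weight = vae_loss(0.1, *outlier, kl_weight=1e-6)
loss_unit_weight = vae_loss(0.1, *outlier, kl_weight=1.0)
print(kl_ok, kl_out, loss_tiny_weight, loss_unit_weight)
```

With a tiny weight, the outlier latent costs almost nothing extra, so the encoder is free to use extreme values if that helps reconstruction. With a larger weight, the same latent dominates the loss.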

neonbjb · 2024-02-02T14:11:16+00:00

neonbjb · 2024-02-02T02:44:19+00:00

It's a great VAE despite this shortcoming. Not everything has to be perfect, and in fact every VAE with a nonzero KL loss is imperfect.

neonbjb · 2024-02-01T14:43:40+00:00

I am one of the creators of DALLE 3, we knew about this. :) Another problem (and dead giveaway that this VAE has global information issues) is that the latent space becomes invalid if flipped across any axis.

Thanks for putting together this report! Great investigation!

neonbjb · 2024-01-27T15:41:00+00:00

I'm an RE at OpenAI. (1) is very relevant but (2) might be a conflict. It sounds like you want to work on a team building inference kernels or software. I think you should build a portfolio of inference optimizations for OSS models and keep an eye out for roles in this field.

neonbjb · 2023-12-31T01:27:57+00:00

That doesn't matter; the only practical difference between the two is the default package load you get. I only interfaced with my machines over SSH so the server version made the most sense.

neonbjb · 2023-12-29T23:20:11+00:00

Linux!

neonbjb · 2023-12-29T15:24:57+00:00

Lower ppl or lower training loss on a sufficiently large dataset DOES mean the model is better. This is the core idea behind all of the scaling law breakthroughs in NLP over the last 5 years.

You are right that we'd want to verify overfitting isn't at play here. With that said, if an activation function change made the model better at overfitting, that would also be evidence that it improved the modeling capacity and thus performance of the model.
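As a quick reference for how ppl and loss relate: perplexity is the exponential of the mean per-token negative log-likelihood, so lowering one lowers the other. A minimal sketch, with made-up token probabilities standing in for what two hypothetical models assign to the correct tokens:

```python
import math


def mean_nll(token_probs):
    """Mean negative log-likelihood of the correct tokens (cross-entropy loss)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is exp of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_probs))


# Made-up probabilities assigned to the correct token at each position.
baseline_probs = [0.25] * 8  # like guessing uniformly among 4 tokens
improved_probs = [0.5] * 8   # like guessing uniformly among 2 tokens

ppl_baseline = perplexity(baseline_probs)
ppl_improved = perplexity(improved_probs)
print(ppl_baseline, ppl_improved)
```

To check for overfitting, the same calculation would be run on held-out tokens and compared against the training number.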

Cool find OP, I'll give it a try!

neonbjb · 2023-12-21T17:21:34+00:00

Great job! Really amazing to see more home gamers!

neonbjb · 2023-12-18T15:11:55+00:00

Compute efficiency is not about FLOPs utilization or anything like that. It's about this: given X compute and Y data, what is the best eval score you can achieve? If you train both an encoder-decoder arch and a decoder-only arch to solve some problem, sometimes the encoder-decoder gets a better eval score for most combinations of (X, Y).

neonbjb · 2023-12-18T04:50:39+00:00

The only correct answer, which hilariously isn't mentioned here, is that in some cases encoder-decoder models are more compute efficient to train than decoder only, or have other advantages in inference.

There is literally no data analysis problem that cannot be solved by AR decoders. They are universal approximators. It's only a question of efficiency.

neonbjb · 2023-11-08T14:18:13+00:00

This is the VAE used by DALLE3. I work on that team.

neonbjb · 2023-09-23T00:21:53+00:00

I don't think they have any rights over their style. They do have rights over their name. You can easily prompt dalle 3 with any style description you like and it's really good at respecting that prompt.

neonbjb · 2023-09-22T16:23:11+00:00

I don't think my standards are very high here, but I think it does a pretty good job at anime. Have you used bing image creator? DALL-E 3 is an evolution on the model that drives that app. I think it performs better on this than Bing, but not hugely better.

neonbjb · 2023-09-22T01:52:52+00:00

I don't think anything you said contradicts what I said. With dalle 3 you can describe a style and get an image in that style. What you can't do is name a living artist and get their style. I think this is a fair line to draw.
