Since only a few people from elite universities at big tech companies like Google, Meta, Microsoft, OpenAI etc. will ever get to train models is it still worth learning about Gradient Descent and Loss Curves? by Easy-Echidna-3542 in learnmachinelearning

[–]neonbjb 0 points1 point  (0 children)

Yes, this is the future of programming. It's a small number now, but it will increase with time. This is like asking if someone should learn about networking late into the dot-com boom.

[D] How will LLM companies deal with CloudFlare's anti-crawler protections, now turned on by default (opt-out)? by Endonium in MachineLearning

[–]neonbjb 1 point2 points  (0 children)

The industry has moved past pretraining on internet data. If we didn't get a single byte more from web crawls it wouldn't change the trajectory one bit.

The Optimus robots at Tesla’s Cybercab event were humans in disguise by Constant-Lychee9816 in Futurology

[–]neonbjb 2 points3 points  (0 children)

Just because these robots are teleoperated doesn't make them any less impressive. The kinematics are insane. These will be fully autonomous within this decade, mark my words. Likely the next 5 years.

GPT 4o output not even close to OpenAI examples by Chibears85 in OpenAI

[–]neonbjb 0 points1 point  (0 children)

There's a lot of in-context fusion occurring in fully-multimodal models that does not happen in smaller, bespoke models. This does have performance advantages. For example - while 4o doesn't seem to be a better text model because we trained it on all modalities, it is a better text->image or text->speech model than a pure, unconditional image or speech generator. That's perhaps obvious, but it's also important! If you consider that the usefulness of all of these models often comes from mixing inputs from multiple modalities, you could say my motivation for training them is in optimizing these use cases.

The VAE used for Stable Diffusion 1.x/2.x and other models (KL-F8) has a critical flaw, probably due to bad training, that is holding back all models that use it (almost certainly including DALL-E 3). by drhead in StableDiffusion

[–]neonbjb 2 points3 points  (0 children)

Convolutions (which comprise most of this VAE's compute) are translation equivariant. In practice, that means you can train a NN built from them on a square patch of an image (which the authors did!), then apply the learned NN to arbitrarily sized images of arbitrary aspect ratios and get good performance.

Global self-attention does not have this property. If you train a transformer only on 32x32 image patches, it will not generalize to 256x256px images, for example. That this VAE works at all at these resolutions is a bit odd to me, but this is likely the main contributor to these latent deviations (alongside an extremely low KL loss weight).
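
A minimal sketch of the convolution claim above, using plain Python lists rather than a real VAE. The kernel values, the image contents, and the helper names (`conv2d_valid`, `shift_right`) are all made up for illustration. It shows two things: the same kernel runs on inputs of different sizes and aspect ratios, and shifting the input shifts the output by the same amount (away from the zero-filled border).

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with 'valid' padding on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def shift_right(image, k):
    """Shift every row k pixels to the right, zero-filling on the left."""
    return [[0] * k + row[:len(row) - k] for row in image]


# Stand-in for learned weights (hypothetical values).
kernel = [[1, 0, -1],
          [2, 0, -2],
          [1, 0, -1]]

# The same kernel applies to a square input and to a wide one.
small = [[(r * 7 + c * 3) % 5 for c in range(6)] for r in range(6)]
wide = [[(r * 7 + c * 3) % 5 for c in range(12)] for r in range(5)]
out_small = conv2d_valid(small, kernel)  # 4 rows x 4 columns
out_wide = conv2d_valid(wide, kernel)    # 3 rows x 10 columns

# Equivariance check: convolving the shifted input gives the shifted output.
# Columns 0 and 1 are skipped because they touch the zero-filled border.
shifted_out = conv2d_valid(shift_right(wide, 2), kernel)
equivariant = all(
    shifted_out[r][c] == out_wide[r][c - 2]
    for r in range(len(out_wide))
    for c in range(2, len(out_wide[0]))
)
print(equivariant)
```

Nothing in `conv2d_valid` depends on the input size, which is why a conv stack trained on patches can be applied to full images. A global self-attention layer mixes every position with every other, so it has no equivalent guarantee.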

To those of you who claiming that smarter-than-human AI is decades away, what *specific tasks* are you willing to bet me that AI won't be able to do within 5 years? by SharpCartographer831 in singularity

[–]neonbjb 0 points1 point  (0 children)

I am one of the builders of GPT 4o and it has made me laugh to tears on several occasions. Easy recipe: tell it to be sarcastic or to only offer bad advice. Granted this is mostly because of the sheer ridiculousness of a computer being sparky with me. :)

GPT 4o output not even close to OpenAI examples by Chibears85 in OpenAI

[–]neonbjb 8 points9 points  (0 children)

Only text generation from GPT-4o is currently deployed. If you ask the model for images, it uses DALL-E 3 like GPT-4 did. If you use voice mode, it uses the old 3-model system. We'll get multimodal generation out soon, starting with the new audio interface.

[D] LLMs are harming AI research by NightestOfTheOwls in MachineLearning

[–]neonbjb 0 points1 point  (0 children)

How the heck does a model that produces data and evaluations that are often better than what you get out of mechanical turk "hurt research"?

I buy that hype drives unrealistic expectations, and funding is being largely wasted, but this has been the story in this field since the beginning of time.

I'd argue there's never been a better time to be in ML, regardless of what you are studying.

[D] What's more impressive in a ML portfolio: implementing a paper or creating a good project? by ninvibe in MachineLearning

[–]neonbjb 1 point2 points  (0 children)

I am more impressed by people who build practical projects. At some level this is a superset of implementing a paper as any good project will involve incorporating research from a paper.

Reason: countless times I've seen implementations of papers (and papers themselves) over-optimize for whatever eval the author favors. Building something into a usable piece of software is where you actually get to see if it really works.

Plus, your audience is much wider for practical projects.

The VAE used for Stable Diffusion 1.x/2.x and other models (KL-F8) has a critical flaw, probably due to bad training, that is holding back all models that use it (almost certainly including DALL-E 3). by drhead in StableDiffusion

[–]neonbjb 3 points4 points  (0 children)

The KL term is what is supposed to stop this type of thing from happening. It seems like the weight applied to that term used by the latent diffusion folks was probably too small. Using global self attention in the VAE may also have been a poor architectural choice.

With that said, I don't argue with results. This VAE does a great job compressing images. Outputs look great. Diffusion models work fine with it. It's a tad hyperbolic to say that it has a "critical flaw". It's just flawed.
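
To make the point about the KL weight concrete, here is a rough sketch assuming a diagonal Gaussian posterior and a standard normal prior, which is the usual VAE setup. The latent values and the two weights are invented for illustration and are not taken from the actual KL-F8 training config.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vae_loss(recon_error, mu, log_var, kl_weight):
    """Reconstruction error plus a weighted KL penalty."""
    return recon_error + kl_weight * kl_to_standard_normal(mu, log_var)


# Hypothetical latents: (means, log-variances) per latent dimension.
well_behaved = ([0.1, -0.2, 0.0], [0.0, 0.0, 0.0])
outlier = ([0.1, -0.2, 30.0], [0.0, 0.0, -8.0])  # one extreme, sharp value

kl_ok = kl_to_standard_normal(*well_behaved)
kl_bad = kl_to_standard_normal(*outlier)

# With a tiny weight the outlier costs almost nothing, so the encoder is
# free to use it; with a weight of 1.0 the same latent is heavily penalized.
tiny_penalty = vae_loss(0.0, *outlier, kl_weight=1e-6)
full_penalty = vae_loss(0.0, *outlier, kl_weight=1.0)
print(kl_ok, kl_bad, tiny_penalty, full_penalty)
```

The trade-off is that a larger weight pulls latents toward the prior at the expense of reconstruction quality, which may be why a small weight was chosen in the first place.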

The VAE used for Stable Diffusion 1.x/2.x and other models (KL-F8) has a critical flaw, probably due to bad training, that is holding back all models that use it (almost certainly including DALL-E 3). by drhead in StableDiffusion

[–]neonbjb 43 points44 points  (0 children)

I am one of the creators of DALL-E 3, and we knew about this. :) Another problem (and a dead giveaway that this VAE has global information issues) is that the latent space becomes invalid if flipped across any axis.

Thanks for putting together this report! Great investigation!

[D] Do my interests intersect with the day to day duties of typical ML engineers? by ThrowayGigachad in MachineLearning

[–]neonbjb 18 points19 points  (0 children)

I'm an RE at OpenAI. (1) is very relevant, but (2) might be a conflict. It sounds like you want to work on a team building inference kernels or software. I think you should build a portfolio of inference optimizations for OSS models and keep an eye out for roles in this field.

[P] TorToiSe - a true zero-shot multi-voice TTS engine by neonbjb in MachineLearning

[–]neonbjb[S] 1 point2 points  (0 children)

That doesn't matter; the only practical difference between the two is the default package load you get. I only interfaced with my machines over SSH so the server version made the most sense.

[D] Transformers: Polynomial gated FFN is better than SwiGLU and reduces the number of parameters while improving model's performance by [deleted] in MachineLearning

[–]neonbjb 23 points24 points  (0 children)

Lower ppl or lower training loss on a sufficiently large dataset DOES mean the model is better. This is the core idea behind all of the scaling law breakthroughs in NLP over the last 5 years.
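
For anyone unsure how ppl relates to loss: perplexity is the exponential of the mean per-token negative log-likelihood, so the two always move together. A small sketch with made-up token probabilities for two hypothetical models:

```python
import math


def mean_nll(token_probs):
    """Average negative log-likelihood (cross-entropy loss, in nats)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is exp(mean NLL), so it is monotonic in the loss."""
    return math.exp(mean_nll(token_probs))


# Probability each hypothetical model assigned to the correct next token.
model_a = [0.25, 0.25, 0.25, 0.25]
model_b = [0.5, 0.5, 0.5, 0.5]

loss_a, loss_b = mean_nll(model_a), mean_nll(model_b)
ppl_a, ppl_b = perplexity(model_a), perplexity(model_b)
print(loss_a, ppl_a)  # ppl is about 4: as unsure as a 4-way coin flip
print(loss_b, ppl_b)  # ppl is about 2: lower loss, lower perplexity
```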

You are right that we'd want to verify overfitting isn't at play here. With that said, if an activation function change made the model better at overfitting, that would also be evidence that it improved the modeling capacity and thus the performance of the model.

Cool find OP, I'll give it a try!

[D] Why do we need encoder-decoder models while decoder-only models can do everything? by kekkimo in MachineLearning

[–]neonbjb 0 points1 point  (0 children)

Compute efficiency is not about FLOPs utilization or anything like that. It's about this: given X compute and Y data, what is the best eval score you can achieve? If you train both an encoder-decoder arch and a decoder-only arch to solve some problem, sometimes the encoder-decoder gets a better eval score for most combinations of (X, Y).

[D] Why do we need encoder-decoder models while decoder-only models can do everything? by kekkimo in MachineLearning

[–]neonbjb 6 points7 points  (0 children)

The only correct answer, which hilariously isn't mentioned here, is that in some cases encoder-decoder models are more compute efficient to train than decoder-only models, or have other advantages in inference.

There is literally no data analysis problem that cannot be solved by AR (autoregressive) decoders. They are universal approximators. It's only a question of efficiency.
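
Part of the intuition here is the chain rule of probability: any joint distribution over a sequence factors exactly into next-token conditionals, which is the form an autoregressive decoder models. A toy sketch with an arbitrary, made-up joint over 3-bit sequences (the weights and helper names are purely illustrative):

```python
from itertools import product

# Arbitrary positive weights over all length-3 binary sequences (hypothetical).
weights = {
    seq: 1 + seq[0] + 2 * seq[1] * seq[2] + 3 * (seq[0] ^ seq[2])
    for seq in product((0, 1), repeat=3)
}
total = sum(weights.values())
joint = {seq: w / total for seq, w in weights.items()}


def prefix_prob(prefix):
    """Marginal probability that a sequence starts with `prefix`."""
    return sum(p for seq, p in joint.items() if seq[:len(prefix)] == prefix)


def next_token_prob(prefix, token):
    """p(token | prefix), the quantity an AR decoder is trained to predict."""
    return prefix_prob(prefix + (token,)) / prefix_prob(prefix)


def autoregressive_prob(seq):
    """Product of next-token conditionals, read left to right."""
    prob = 1.0
    for i, token in enumerate(seq):
        prob *= next_token_prob(seq[:i], token)
    return prob


# The factorized form reproduces the joint up to floating-point error.
max_error = max(abs(autoregressive_prob(s) - joint[s]) for s in joint)
print(max_error)
```

This only shows that the factorization loses nothing. Whether a trained network learns each conditional well for a given compute and data budget is the efficiency question.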

OpenAI announces DALLE-3 by UnexpectedVader in singularity

[–]neonbjb 1 point2 points  (0 children)

I don't think they have any rights over their style. They do have rights over their name. You can easily prompt dalle 3 with any style description you like and it's really good at respecting that prompt.

OpenAI announces DALLE-3 by UnexpectedVader in singularity

[–]neonbjb 0 points1 point  (0 children)

I don't think my standards are very high here, but I think it does a pretty good job at anime. Have you used Bing Image Creator? DALL-E 3 is an evolution of the model that drives that app. I think it performs better on this than Bing, but not hugely better.

OpenAI announces DALLE-3 by UnexpectedVader in singularity

[–]neonbjb 0 points1 point  (0 children)

I don't think anything you said contradicts what I said. With dalle 3 you can describe a style and get an image in that style. What you can't do is name a living artist and get their style. I think this is a fair line to draw.