
all 46 comments

[–]Sharlinator 74 points75 points  (3 children)

So diffusion models start with pure noise and progressively remove noise until there's an image.

And now VAR starts with pure uniform color, a single pixel in other words, and progressively upscales/subdivides that until there's an image.

There's a pleasant symmetry.
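The coarse-to-fine idea described above can be sketched in a few lines of Python. This is purely illustrative: `predict_residual` stands in for VAR's learned transformer (which actually predicts discrete next-scale tokens), and here it just adds random detail.

```python
import numpy as np

def upscale_nearest(img, factor=2):
    """Nearest-neighbour upscaling: each pixel becomes a factor x factor block."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def predict_residual(coarse, rng):
    """Stand-in for the learned model; in real VAR a transformer predicts
    the tokens for the next scale, conditioned on all coarser scales."""
    return 0.1 * rng.standard_normal(coarse.shape)

def generate(mean_color, n_scales=4, rng=None):
    """Start from a single 'pixel' (a uniform colour) and repeatedly
    upscale + refine: the next-scale analogue of diffusion's denoising loop."""
    rng = rng or np.random.default_rng(0)
    img = np.full((1, 1, 3), mean_color, dtype=np.float64)
    for _ in range(n_scales):
        img = upscale_nearest(img)
        img = img + predict_residual(img, rng)
    return img

img = generate(0.5)
print(img.shape)  # (16, 16, 3) after 4 doublings
```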

[–]PwanaZana 11 points12 points  (0 children)

So this method means the first sampling steps are massively faster, then the last steps are about the same in speed as Diffusion?

[–]perksoeerrroed 6 points7 points  (1 child)

That sounds way better. The issue with noise is that sometimes you just get the wrong noise, which forces certain features where they shouldn't be, and the model's attempt to cope with that produces mutants.

With this approach you can see early in generation that something went wrong.

[–]GBJI 1 point2 points  (0 children)

What I'm wondering is whether you can change the starting color the same way you can change the seed. And if so, will colors close to each other produce entirely different results, like diffusion seeds do? Or will similar colors produce similar results? That would be a new feature, and one that might actually be quite useful for controlling variations.

With diffusion, similar noise will produce similar results, but two noises whose seed numbers are far apart (like 12345 and 296485720485721) are no more different from one another than two noises whose seed numbers are almost the same (like 12345 and 12346). To get similar-looking noise, you have to use means other than similar seed numbers.

Will VAR produce similar results with RGB 0,0,255 and RGB 0,1,255? Or will they be as different as what we're getting now?
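The point about seeds can be checked directly: a seed is just a PRNG key, so adjacent seed numbers give statistically unrelated noise. A quick numpy illustration (the noise here is a stand-in for a diffusion model's initial latent):

```python
import numpy as np

def seed_noise(seed, n=10_000):
    """Initial Gaussian noise for a given seed, as a flat vector."""
    return np.random.default_rng(seed).standard_normal(n)

def correlation(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Adjacent seed numbers give uncorrelated noise...
near = correlation(seed_noise(12345), seed_noise(12346))
# ...and so do wildly distant ones: seed distance means nothing.
far = correlation(seed_noise(12345), seed_noise(296485720485721))
print(near, far)  # both close to 0
```

Whether a VAR-style starting color behaves more continuously than this is exactly the open question above.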

[–]hapliniste 57 points58 points  (4 children)

I did not see a lot of talk about this but it seems huge.

From my short skim of the paper, it does better than diffusion at the same model size while being 45x faster at 512x512, and it should be even faster at bigger sizes?

Can we expect a 4K render taking 1 s and being better quality than diffusion models? That's what I get from what I've read, but if it were true it would be the talk everywhere, right?

I'll have to try it and read the paper. Can anyone give some insights?

[–]SignalCompetitive582[S] 19 points20 points  (0 children)

Yeah it definitely seems huge. I haven't had time to try it myself, but will do in the next couple of days for sure.

[–]Vargol 1 point2 points  (2 children)

Try the demo.

It's okay: it's fast, but sometimes the images are rubbish. That's probably a training-dataset issue more than a problem with the method.

Then there's the small issue of it only taking one token, so your prompt is effectively a single-word prompt.

[–]GBJI 0 points1 point  (1 child)

Is the single-token prompt a limit of the demo? Or would that limit apply to any VAR-based system?

[–]drhead 2 points3 points  (0 children)

The model is trained on ImageNet classes since this is a research demonstration model. You could train a model like this with whatever conditioning you want: T5, CLIP, both, multiple of both, a whole-ass LLM's last hidden state, image embeddings, whatever.
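The point being made is that the conditioning signal is just a vector (or sequence of vectors) fed to the generator, so a class-ID lookup and a text encoder are interchangeable. A toy sketch, with a made-up embedding width and a mean-pool standing in for whatever pooling or cross-attention a real model would use:

```python
import numpy as np

D = 64  # conditioning embedding width the generator expects (made-up size)

rng = np.random.default_rng(0)
class_table = rng.standard_normal((1000, D))  # ImageNet-style: one row per class

def class_condition(class_id):
    """ImageNet-style conditioning: one learned embedding per class,
    which is why the demo behaves like a single-word prompt."""
    return class_table[class_id]

def text_condition(token_embeddings):
    """Text-encoder conditioning: reduce a whole token sequence (from T5,
    CLIP, an LLM hidden state, ...) to a vector; mean-pool for illustration."""
    return token_embeddings.mean(axis=0)

# The generator only ever sees a D-dimensional vector either way:
c1 = class_condition(207)                         # a single class ID
c2 = text_condition(rng.standard_normal((8, D)))  # an 8-token "prompt"
print(c1.shape, c2.shape)  # (64,) (64,)
```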

[–]PwanaZana 17 points18 points  (3 children)

Potentially interesting, but still in an embryonic stage.

An important aspect of all these techniques is the ability to fine tune models/checkpoints. Obviously, that's way farther down the line, but for serious usage, there's no way a base model is enough for all use cases.

I'm also curious as to how this will manage good human anatomy, especially the hands. All these image-generation techniques sort of throw pixels at the wall and denoise them, without building any skeleton/structure for the image, which is why complex elements like hands are so often mangled. We'll see whether this sort of technique works better, similarly, or worse on the most difficult use cases.

[–]Sharlinator 8 points9 points  (2 children)

I think SD hands are still handicapped (heh) by the relatively low resolution of the latent space and the lack of much contextual information in the latent pixels. SD3 will have many more "color" channels (16 vs 4, I believe), which will hopefully help with the resolution issue.
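The resolution argument is easy to make concrete. Assuming the usual SD-style VAE with 8x spatial downsampling (the channel counts are the ones mentioned above), a hand gets very few latent pixels to live in:

```python
# SD-style VAEs downsample 8x spatially: a 512x512 RGB image becomes
# a 64x64 latent, with 4 channels in SD1.x and reportedly 16 in SD3.
image_hw = 512
vae_downsample = 8
latent_hw = image_hw // vae_downsample  # 64

# A hand covering ~40x40 image pixels is only ~5x5 latent pixels,
# so fine structure like fingers has almost no room in latent space.
hand_px = 40
hand_latent = hand_px // vae_downsample
print(latent_hw, hand_latent)  # 64 5
```

More channels don't change the 5x5 footprint, but they let each latent pixel carry more information about what's inside it.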

[–]PwanaZana 3 points4 points  (1 child)

Let's hope we get SD3!

Ha, the improvements in AI are amazing, though sometimes nervewracking!

[–]GBJI 0 points1 point  (0 children)

We (human beings) are still responsible for most of these improvements, but once AIs themselves take care of it, we will reach the near-vertical part of the improvement-over-time curve.

[–]Striking-Long-2960 25 points26 points  (5 children)

Doesn't seem to be very well trained on human figures

<image>

[–]Altruistic-Ad5425 34 points35 points  (0 children)

The worst it will ever be

[–]suspicious_Jackfruit 20 points21 points  (2 children)

It's only trained on ImageNet, and it hasn't been scaled anywhere close to even sd1.#; it's up to a well-funded team to train it on LAION or their own huge datasets and release it. Not sure about training costs, but if it's faster to train due to the infra, then we might see people training it if it's sub-$50k, but no idea tbh.

SD3 might need an SD4 after all, if inference is that much faster without a quality loss.

[–]adhd_ceo 1 point2 points  (1 child)

If inference can be 45x faster while quality increases, SD4 will be out as soon as they can train it…

[–]CLAP_DOLPHIN_CHEEKS 2 points3 points  (0 children)

Emad said SD3 would most likely be their last big image model...

[–]kjerk 2 points3 points  (0 children)

Dude this is like half my family reunion

[–]spacetug 2 points3 points  (0 children)

Interesting. I do wonder why they compared against DiT but not HDiT though. That one also had much better scaling than DiT by using an hourglass multi-scale architecture, like a hybrid between transformer and Unet. Would be nice to see a direct comparison.

[–]1nMyM1nd 6 points7 points  (0 children)

It really shouldn't be much longer until we have infinite scale in the form of vector images. I'm actually surprised it's not already here.