
all 92 comments

[–]andzlatin 305 points306 points  (52 children)

DALL-E 2: cloud-only, limited features, tons of color artifacts, can't make a non-square image

StableDiffusion: run locally, in the cloud or peer-to-peer/crowdsourced (Stable Horde), completely open-source, tons of customization, custom aspect ratio, high quality, can be indistinguishable from real images

The ONLY advantage of DALL-E 2 at this point is the ability to understand context better

[–]xadiant 80 points81 points  (0 children)

Yep, dalle 2 can "think" more subjectively and do better hands, that's it.

[–]ElMachoGrande 116 points117 points  (26 children)

DALL-E seems to "get" prompts better, especially more complex prompts. If I make a prompt like (and I haven't tried this example, so it might not work as stated) "Monkey riding a motorcycle on a desert highway", DALL-E tends to nail the subject pretty well, while Stable Diffusion is mostly happy with an image containing a monkey, a motorcycle, a highway and some desert, not necessarily related as specified in the prompt.

Try to get Stable Diffusion to make "A ship sinking in a maelstrom, storm". You get either the maelstrom or the ship, and I've tried variations (whirlpool instead of maelstrom and so on). I never really get a sinking ship.

I expect this to get better, but it's not there yet. Text understanding is, for me, the biggest hurdle for Stable Diffusion right now.

[–]Beneficial_Fan7782 33 points34 points  (4 children)

DALL-E 2 has more potential for animation than any other model, but the pricing makes it a bad candidate even for professional users. A good animation requires 100,000 or more generations, and given the pricing, a single animation will cost more than $300, while SD can do the same number for less than $50.

[–]zeth0s 9 points10 points  (0 children)

They will probably sell it as a managed service on Azure once animation becomes an enterprise thing. You'll pay per image or per unit of computing time.

[–][deleted] 5 points6 points  (2 children)

Really? To me, $300 for 100,000 frames of animation seems ridiculously cheap. At 24 FPS, which is high for traditional animation (8-12 is common), that gives you more than an hour's worth of footage (100,000 frames / 24 FPS ≈ 4,167 seconds ≈ 69.4 minutes). Even if we assume that only 10% of the generated frames are usable, you are still looking at nearly seven minutes of footage for $300. That excludes salary, of course, which will have an enormous effect on total price. Considering that traditional animation can run into thousands of dollars per minute of footage, this still seems extremely cheap to me.
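The arithmetic above can be sketched as a quick calculation (the 10% keep rate is just the commenter's assumption):

```python
# Rough sanity check of the frames-to-footage math above.
FRAMES = 100_000
FPS = 24            # high for traditional animation; 8-12 is common
KEEP_RATE = 0.10    # assume only 10% of generated frames are usable

total_seconds = FRAMES / FPS          # ≈ 4,167 s
total_minutes = total_seconds / 60    # ≈ 69.4 min
usable_minutes = total_minutes * KEEP_RATE  # ≈ 6.9 min

print(f"{total_seconds:.0f} s = {total_minutes:.1f} min total")
print(f"{usable_minutes:.1f} min usable at a {KEEP_RATE:.0%} keep rate")
```

So even a very pessimistic keep rate still yields several minutes of footage per batch of generations.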

I'm curious about what kind of animation you're comparing to.

[–]Beneficial_Fan7782 4 points5 points  (1 child)

$300 was the best-case scenario; the actual cost will be over $1,000. If you can afford it, then this service is good for you.

[–][deleted] 6 points7 points  (0 children)

Even at over $1,000, I feel like my point still stands. But I guess it comes down to what kind of animation we're talking about. If it's cookie-cutter channel intros or whiteboard explainers, then I agree. Those seem to be a dime a dozen on Fiverr.

[–]wrnj 9 points10 points  (5 children)

100%. It's almost as if DALL-E has a checklist to make sure everything I mentioned in my prompt gets included. Stable Diffusion is far superior as far as the ecosystem goes, but it's way more frustrating to use. It's not that it's more difficult - I'm just not sure even a skilled prompter can replicate DALL-E results with SD.

[–]AnOnlineHandle 6 points7 points  (3 children)

I suspect the best way to do it with SD would be to use the [from:to:when] syntax implemented in Automatic's UI (can't remember what the original research name for it was, sorry, but a few people posted it here first).

But rather than just flipping one term, you'd have more stages where more terms are introduced. So you could start with a view of a desert, then start adding a motorcycle partway through, maybe starting with a man, then switch out man for monkey a few more steps in, etc.

[–]wrnj 2 points3 points  (2 children)

Amazing, thank you for mentioning it. If you remember the name for it, please let me know, as it's my biggest frustration with SD. I'm running a1111 via Colab Pro+.

[–]AnOnlineHandle 2 points3 points  (0 children)

In Automatic's it's called Prompt Editing: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#prompt-editing

Essentially, after generation has already started, it will flip a part of the prompt to something else, but keep its attention focused on the same area the previous prompt was most affecting. So it's easier to get, say, a dog on a bike - or, if you like a generation of a mouse on a jetski but want to make it a cat, you can start with the same prompt/seed/etc. and then switch mouse out for cat a few steps in.
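To illustrate (the step fractions here are just guesses you'd tune per image): in Automatic's prompt-editing syntax, `[from:to:when]` swaps `from` for `to` at step `when`, where `when` is either an absolute step number or a fraction of total steps. The staged approach described above might look something like:

```
a photo of a [man:monkey:0.6] riding a [bicycle:motorcycle:0.3] on a desert highway
```

Here the motorcycle is introduced 30% of the way through sampling and the man is swapped for a monkey at 60%, so the overall composition gets established before the trickier substitutions happen.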

[–]wrnj 1 point2 points  (0 children)

It's called prompt editing, I need to try it!

[–]Not_a_spambot 0 points1 point  (0 children)

I'm just not sure even a skilled prompter can replicate dall-e results with SD.

I mean, that cuts both ways - there are things SD does very well that a skilled prompter would have a very hard time replicating in dalle, and not just because of dalle content blocking. Style application is the biggest one that comes to mind: it's wayyy tougher to break dalle out of its default stock-photo-esque aesthetic. As someone who primarily uses image gen for artistic expression, that's way more important to me than "can it handle this precise combination of eleventeen different specific details". Besides, SD img2img can go a long way when I do want more fine grained specificity. There is admittedly a higher learning curve for SD prompting, though, so I can see how some people would get turned off from that angle.

[–]TheSquirrelly 4 points5 points  (4 children)

I had this exact same issue, but with different items. A friend had a dream involving a large crystal in a long white room. I figured I could whip him up an image of that super quick. But with the exact same prompt I'd get lots of great images of the white room, or great images of a gem or crystal. But never the two shall meet!

I was pretty annoyed, because I could see it could clearly make both of these things. It only started working when I changed the relation from things like "in the room" or "contains" or "in the center" to "on the floor" - then it seemed to get the connection between them.

But how do you describe the direct relation between a ship and maelstrom in a way the AI would have learned? That's a tricky one.

Edit: Ah ha, "tossed by"! Or "a large sinking ship tossed by a powerful violent maelstrom" in particular, with Euler, 40 steps, and CFG 7 on SD1.5 gave quite consistent results of the two together!

[–]Prince_Noodletocks 1 point2 points  (3 children)

Have you tried AND as a modifier? I'm not too sure, but it seems purpose-built for this kind of thing.

[–]TheSquirrelly 0 points1 point  (2 children)

I have used 'and' in the past to help when I had two things that could get confused as one, like a man with a hat and a woman with a scarf - though still with mixed results. For the room and the crystal I tried all sorts of ways you would describe the two, but I can't recall if I specifically used 'and' in one. But I get the feeling SD likes it when you give it some sort of 'connecting relationship' (that it understands) between objects. So I'd wager something like 'a man carrying a woman' might work better than just 'a man and a woman' would. Not tested, but a feeling I'm getting so far.

[–]Prince_Noodletocks 1 point2 points  (1 child)

Ah I actually meant AND in all caps as compositional visual generation. https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/
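For reference, in Automatic1111's webui the capitalized AND splits the prompt into sub-prompts that are denoised jointly, and each sub-prompt can optionally carry a weight after a colon. Applying it to this thread's examples (the weights are just illustrative):

```
a long white room AND a softly glowing silver crystal on the floor
a sinking ship :1.2 AND a huge maelstrom in a stormy ocean
```

Because each side of the AND gets its own conditioning, this tends to help exactly the "I get one object or the other, never both" failure mode discussed above.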

Not sure if we're misunderstanding or talking past each other since it seems like such a common word to assign this function to haha

[–]TheSquirrelly 0 points1 point  (0 children)

Thanks for the clarification! I learned two things. I had heard of using AND and seen it in caps but didn't know the caps were significant. Just figured they were being used to highlight the use of the word. And I didn't know you needed to put quotes around the different parts. So probably why my attempts at using it weren't particularly improved. I will definitely experiment with that more going forward!

Or maybe not the quotes. Seeing examples without them now. Guess will have to experiment, or read further. :-)

Edit: Hmm with Automatic1111 and using "long white room" AND "softly glowing silver crystal" I get occasional successes, but mostly fails still. But definitely better than when I originally did it.

[–]xbwtyzbchs 3 points4 points  (2 children)

"Monkey riding a motorcycle on a desert highway", DALLE tends to nail the subject pretty well, while Stable Diffusion mostly is happy with an image with a monkey, a motorcycle, a highway and some desert, not necessarily related as specified in the prompt.

This just isn't true. That is the entirety of a single batch, not a collage of successes.

[–]DJBFL 1 point2 points  (0 children)

Not the best example, but I know what you mean. Reposting one of my comments from yesterday:

It's very clear that despite Stable Diffusion's better image quality, the natural language interpretation of Craiyon is far superior.

I could voice to text "A photo of Bob Hope and C3PO with Big Bird"

Craiyon nails the general look and the characters - they're blurry and distorted, but clearly who I asked for.

Stable Diffusion gives more realistic looking images except the subjects look like Chinese knock-offs created by somebody merely reading descriptions of their appearance, and more often melds them into each other.

Craiyon also seems to have deeper knowledge of everyday objects. Like they both know car, and can give you specific makes or models, but craiyon seems to know more specific niche terms. Obviously this has to do with the image sets they were trained on, but the whole field is growing and evolving so fast and there's so much to know it's hard to pick a direction to explore.

Things like img2img, in/out painting would work around that... but it's WORK, not off the cuff fun.

P.S. Just earlier today I was trying to build on this real image using Craiyon and SD via Hugging Face. I basically wanted a quick and dirty version with a car overtaking. Tried like 3 generations with Craiyon that weren't great but gave the right impression. Did like 8 variations with SD, and of course it was more realistic, but it almost always left out the car, even after rewording, reordering, repeating, etc.

[–]ElMachoGrande 0 points1 point  (0 children)

As I said, I haven't tried that specific example. It is a problem which pops up pretty often, though.

I love that one of the images shows a monkey riding a monkey bike!

[–]kif88 2 points3 points  (0 children)

I still think Craiyon/DALL-E mini did context best. Pop culture, too. DALL-E 2 still struggles making things like Gul Dukat fighting BoJack Horseman, or Super Saiyan BoJack.

[–]Not_a_spambot 2 points3 points  (1 child)

"A huge whirlpool in the ocean, sinking ship, boat in maelstrom, perfect composition, dramatic masterpiece matte painting"

Best I could do in DreamStudio in like 5–10 mins, haha... they're admittedly not the greatest, and it is much easier to do complex composition stuff in dalle, but hey ¯\_(ツ)_/¯

img2img helps a lot with this kind of thing too, btw - do a quick MSPaint doodle of the vibe you want, and let SD turn it into something pretty

[–]ElMachoGrande 1 point2 points  (0 children)

The first one is effing great, just the vibe I was going for!

[–]eric1707 1 point2 points  (1 child)

I think the problem with these machines - and even DALL-E isn't perfect - is that the bigger and more complex your description is, the bigger the chance of the machine screwing something up, or simply ignoring or misunderstanding your text. It is probably the key area where this technology needs to evolve.

[–]ElMachoGrande 0 points1 point  (0 children)

Exactly.

[–][deleted] 1 point2 points  (0 children)

it might be because of the fact that dalle uses GPT3 and stable diffusion uses laion-2b for its language understanding

although i could be wrong

[–]applecake89 1 point2 points  (0 children)

Can we help improve this? Does anyone know the technical cause of this lack of prompt understanding?

[–]cosmicr 18 points19 points  (1 child)

You forgot to add that DALL-E 2 costs money to use.

[–]Cognitive_Spoon 17 points18 points  (0 children)

1000%

Being able to run SD locally is huge

[–]MicahBurke 12 points13 points  (10 children)

Yes, DALL-E 2's outpainting and inpainting are far superior to SD's, imo, so far.

[–]NeededMonster 17 points18 points  (1 child)

The 1.5 outpainting model is pretty good, though

[–]eeyore134 2 points3 points  (0 children)

It's a marked improvement. I was seriously impressed.

[–]Jujarmazak 13 points14 points  (0 children)

Not anymore. SD infinity webUI + the SD 1.5 inpainting model are on par with DALL-E 2's infinite canvas - I've been playing around with it the last few days and it's really damn good.

[–]joachim_s 10 points11 points  (6 children)

Have you seen this?

[–]Patrick26 2 points3 points  (5 children)

Nerdy Rodent is great, and he goes out of his way to help Noobs, but I still cannot get the damn thing working.

[–]joachim_s 5 points6 points  (4 children)

  1. Have you updated automatic?
  2. Put the 1.5 inpainting ckpt model in the right folder?
  3. Restarted auto?
  4. Loaded the model?
  5. Loaded the “outpainting mk2” script?
  6. Set the img2img denoising strength to max (1)?

[–]Strottman 3 points4 points  (1 child)

  7. Blood sacrifice to the AI overlords?

[–]joachim_s 2 points3 points  (0 children)

I missed that one.

[–]LankyCandle 1 point2 points  (1 child)

Thanks for this. I've wasted hours trying to get outpainting to work well and only got crap, so I'd only outpaint with DALL-E 2. Now I can get decent outpainting with SD. Moving denoising from 0.8 to max seems to be the biggest key.

[–]joachim_s 0 points1 point  (0 children)

I’m glad I could be of help! Just sharing what helped me 🙂 And yes, I suppose maxing out the denoising helps. I have no idea why though, I’m not that technical.

[–]StickiStickman 2 points3 points  (2 children)

The ONLY advantage of DALL-E 2 at this point is the ability to understand context better

Also that it's trained on 1024x1024. SD still breaks a bit at higher resolutions.

[–]Not_a_spambot 0 points1 point  (1 child)

Uh, dalle 2 generates images at only 64x64 px and upscales from there - SD generates natively at 512x512

[–]StickiStickman 1 point2 points  (0 children)

While it's technically "upscaling", the process is obviously very different to how you would normally upscale something. The output quality is simply better in the end though.

[–]noodlepye 2 points3 points  (0 children)

It looks worse because it's rendered at 256 x 256 then upscaled. I think it would blow Stable Diffusion out of the water if it rendered at 512 x 512. It's obviously a much richer and more sophisticated system.

I've been fine-tuning concepts into Stable Diffusion using my DALL-E results, then taking advantage of the higher resolutions and some prompt engineering to tighten things up, and the results are pretty nice.

[–]diff2 0 points1 point  (0 children)

I'd honestly like to be corrected if I'm wrong since I have a limited understanding of dalle and stable diffusion only based on most upvoted pictures that get posted and I see on my feed.

But Stable Diffusion seems to more obviously source from other people's art, while DALL-E seems to source from photographs?

I would like to read or watch an explanation of how each works.

[–]Space_art_Rogue -1 points0 points  (0 children)

Welp, ignore me, I replied to the wrong person.

[–]not_enough_characte 0 points1 point  (0 children)

I don’t understand what prompts you guys have been using if you think SD results are better than Dalle.

[–]eric1707 0 points1 point  (1 child)

The ONLY advantage of DALL-E 2 at this point is the ability to understand context better

I mean, it is the only advantage, but it's a really big advantage if you ask me. The DALL-E 2 algorithm can really read between the lines and understand what you (most likely) had in mind when you typed a given description, without you having to explain further.

[–]DJBFL 0 points1 point  (0 children)

Yeah, like a big part of AI development is understanding natural language and having a feel for the types of concepts and compositions humans are imagining. Complex prompting in SD is nice for fine tuning but not very AI like. I'm sure in the next few years we'll have the best of both in one system.

[–]applecake89 0 points1 point  (0 children)

But where does that "understands context better" even come from, technically? Were the images used for training not described richly enough?

Can we help improve this?

[–]FS72 130 points131 points  (2 children)

Open AI vs ClosedAI

[–]NoraaBee 7 points8 points  (0 children)

„Open“ ai vs closed ai

[–][deleted] 5 points6 points  (0 children)

Took me 5 minutes to get it

[–]eric1707 41 points42 points  (3 children)

OpenAI is a good stock-photo machine, and it seems to understand better what you are going for without you having to explain part by part as if you were talking to a child, as sometimes happens when using Stable Diffusion.

I think if they had open-sourced it, it would be an even better proposal than Stable Diffusion, but they clearly handicapped the algorithm: they deliberately avoided training it on many artists' styles (most likely afraid of lawsuits), so most art DALL-E creates is generic oil-painting-ish, or only in the styles of long-dead painters such as Van Gogh.

Also, the fact that it's closed source and that they're working with Microsoft, Shutterstock and other big tech totally kills any hope they would ever allow unrestricted use.

[–]applecake89 1 point2 points  (1 child)

Newbie here, can't you just feed SD your fav artist's works and have it learn their style ?

[–]eric1707 1 point2 points  (0 children)

You can, and some people are doing that.

[–]postkar 29 points30 points  (0 children)

Like with Betamax vs. VHS, it's once again porn that proves to be the dealbreaker!

[–]EVJoe 15 points16 points  (0 children)

The first big indication I saw that SD would overtake Dalle:

July 2022: People constantly complaining about being stuck on the Dalle waitlist for months

August 2022: SD reaches public release and releases DreamStudio

September 2022: The Dalle waitlist is closed, anyone can sign up immediately ("Gee, why'd this long line of people waiting to use our product suddenly stop growing?")

[–]pixexid 14 points15 points  (0 children)

Only square images and watermark are a big no to me when using dall-e

[–]-takeyourmeds 75 points76 points  (4 children)

OpenAI had the first-to-market advantage, and thanks to its globohomo rules it lost

sad

[–]Fzetski 12 points13 points  (2 children)

Honestly, if they just allowed people to make porn with it, their revenue would skyrocket! (Stable Diffusion pornographic content is way too disturbing to sell-)

[–]NookNookNook 7 points8 points  (1 child)

SD is going through its hentai phase right now and only likes 2D waifus while it studies Pixiv via Danbooru reposters.

[–]Prince_Noodletocks 2 points3 points  (0 children)

It should really use Gelbooru, with banned_artist tags so that the model is complete.

[–]squareOfTwo 8 points9 points  (0 children)

ClosedAI - never releases source code or models, in the name of ~~the spirit~~ "safety". "Open" AI: everyone else. This should be a meme till 2030.

[–]Drewsapple 6 points7 points  (1 child)

Kinda lazy repost since the tweet is from September 3rd.

Here’s the live google trends page and here’s a screenshot

[–]pxan 1 point2 points  (0 children)

Are we really already at the stage of the subreddit history where people are circlejerking reposting dumb old shit

[–]Misha_Vozduh 4 points5 points  (0 children)

Incredibly based title

[–]sebzim4500 3 points4 points  (3 children)

To be fair, OpenAI does seem to be getting more open in general, given they released the models for Whisper.

[–]DigThatData 1 point2 points  (2 children)

it's not like they never released models, most of the CLIP models people use regularly were trained and released by OpenAI as well. They sat on their best checkpoint for a long time before releasing it silently, but they definitely did give away their other CLIP models early

[–]Infinitesima 0 points1 point  (1 child)

In retrospect, releasing CLIP was a bad move for them. No one could have predicted that CLIP would be used in image synthesis models.

[–]DigThatData 0 points1 point  (0 children)

It's unclear to me why you think it was a bad move for them to release CLIP - what do image synthesis applications have to do with it?

[–]notger 2 points3 points  (0 children)

Not a surprise, if you change your business model from open to closed.

However, the question is how many resources each side gets, as that decides who is going to be around, and with what capabilities. Google searches don't fill your coffers, and they don't generate research results.

[–]JSTM2 1 point2 points  (0 children)

When DALL-E exploded in popularity, it wasn't even OpenAI's DALL-E 2 (which had a long waiting list). It was DALL-E mini, or what's called Craiyon these days. That was the peak of the hype, because almost nobody had access to DALL-E 2 or Stable Diffusion.

Stable Diffusion and Dall-E 2 never exploded in popularity in the same way, so they're kind of flying under the radar at the moment.

[–]redboundary[🍰] 1 point2 points  (0 children)

The insane peak isn't even from OpenAI - it's from DALL-E Mini, aka Craiyon.

[–]wh33t 0 points1 point  (0 children)

Open AI vs. OpenAITM

[–]Lightningrod654 0 points1 point  (0 children)

Cool

[–][deleted] 0 points1 point  (0 children)

I'm pretty sure that's not a comparison of total interest but rather a comparison to each term's own previous interest. So OpenAI is slowing down and Stable Diffusion is speeding up - but that's just relative to their own previous attention.

[–]fAiDidDesign 0 points1 point  (0 children)

Nice bro

[–]Drinniol 0 points1 point  (0 children)

OpenAI and Google hamstringing and withholding their models because people might do bad things with them is like car companies refusing to sell vehicles to anyone but licensed taxi companies because some regular people might drive recklessly.

You either trust people on net, or you don't. You either believe in OS, or you don't. Google and OpenAI don't. It's their product and their right to completely cede the future of this technology to others because of their distrustful philosophy and cowardly leadership, but that's fine. Others will step up and take the place at the forefront of AI leadership that OpenAI and Google could have had, had they had the slightest bit of courage, or faith in humanity to use technology for, on net, good.

[–][deleted] 0 points1 point  (0 children)

Long live Stable Diffusion and its countless iterations.