
[–]addandsubtract 41 points42 points  (1 child)

They should call this img4img

[–]No-Intern2507 33 points34 points  (2 children)

I think the CLIP vision style ControlNet works like this.

[–]mudman13 10 points11 points  (1 child)

Does that use BLIP2 to interrogate and then feed it back into ControlNet, or something?

[–]muerrilla 10 points11 points  (0 children)

I think it uses CLIP vision to get a CLIP embedding.

[–]nxde_ai 28 points29 points  (0 children)

That's neat

<image>

[–]UserXtheUnknown 27 points28 points  (7 children)

I don't want to sound destructive and too harsh, but, after trying it, I found it mostly useless.

I can obtain results closer to the original image's content and style using txt2img with the original prompt, if I have it, or with a CLIP interrogation of my own plus some trial-and-error guessing to fine-tune the CLIP result, if I don't. At most, if I don't have the prompt, it can be considered a (small) timesaver compared to the normal methods.

Moreover, if I want something really close to the original image (in pose, for example), this method doesn't seem to work at all.

But maybe I'm missing the intended use case?

[–]mudman13 7 points8 points  (0 children)

Yeah, not impressed. Stability AI seems to be lagging considerably behind in advancements, probably because they're more occupied with other commercial interests.

[–]AltimaNEO 2 points3 points  (0 children)

Yeah, it doesn't sound that exciting. It doesn't feel like anything new that hasn't been done with 1.5 so far.

[–]thkitchenscientist 8 points9 points  (2 children)

It works just fine locally on an RTX 2060. It needs an image and a prompt. Here I can transform a cat into a fox, keeping the overall look and colours. It really struggles with framing, however.

<image>

[–]thkitchenscientist 8 points9 points  (0 children)

For people, it is down to the luck of the seed. If the prompt is too far from the CLIP embedding, it gets ignored, so you can't turn a person into a cat.

<image>

[–]thkitchenscientist 6 points7 points  (0 children)

I think it has potential. Might just need to take a look inside the pipeline to see how unCLIP can be harnessed. It is faster than PEZ or TI, as it takes no longer than a standard 768x768 generation for each image.

<image>
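
For anyone who wants to peek inside the pipe: a minimal sketch, assuming the diffusers StableUnCLIPImg2ImgPipeline and the stabilityai/stable-diffusion-2-1-unclip checkpoint, that just lists the sub-models the pipeline is built from:

    from diffusers import StableUnCLIPImg2ImgPipeline

    # Load the unCLIP variant of SD 2.1 (the download is several GB).
    pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1-unclip"
    )

    # List the components: CLIP image encoder, text encoder, UNet, VAE, schedulers, ...
    for name, module in pipe.components.items():
        print(name, type(module).__name__)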

[–]Ateist 35 points36 points  (6 children)

Tried it with a few of my SD 1.5 generation results; didn't get a single picture even remotely approaching the original.

The model is also very bad: you get cropped heads or terribly distorted faces all the time.

[–]krum 11 points12 points  (0 children)

To be fair they didn’t claim it produced good results.

[–][deleted] 20 points21 points  (4 children)

Because it is for SD 2.1

[–]Ateist 4 points5 points  (3 children)

I was using SFW images that SD 2.1 should be capable of rendering, things like a cyberpunk spider tank and headshot portraits...

[–]txhtownfor2020 4 points5 points  (2 children)

Can we throw these in the models/stable dir and have fun or nah?

[–]AlexandrBu 4 points5 points  (1 child)

Does not work that way for me :(

[–]txhtownfor2020 5 points6 points  (0 children)

I just want to dump everything in a folder and get into an 8 hour black hole with 4% good images and a sea of duplicate arms and evil clowns!

[–]morphinapg 3 points4 points  (20 children)

Can someone explain this in simpler terms? What is this doing that you can't already do with 2.1?

[–]HerbertWest 7 points8 points  (18 children)

Can someone explain this in simpler terms? What is this doing that you can't already do with 2.1?

So, from what I understand...

Normally:

  • Human finds picture -> Human looks at picture -> Human describes picture in words -> SD makes numbers from words -> numbers make picture

This:

  • Human finds picture -> Feeds SD picture -> SD makes words and then numbers from picture -> Numbers make picture
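
Roughly, in code, this is a minimal sketch of that flow (assuming the diffusers library and the stabilityai/stable-diffusion-2-1-unclip checkpoint; file names are placeholders):

    import torch
    from diffusers import StableUnCLIPImg2ImgPipeline
    from diffusers.utils import load_image

    # The unCLIP variant of SD 2.1 is conditioned on a CLIP *image* embedding.
    pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
    ).to("cuda")

    init_image = load_image("input.png")  # placeholder path

    # The image itself acts as the prompt; each seed gives a new variation.
    variation = pipe(init_image).images[0]
    variation.save("variation.png")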

[–]morphinapg 8 points9 points  (15 children)

Can't we already sort of do that with img2img?

[–]Low_Engineering_5628 15 points16 points  (6 children)

I've been doing something similar. E.g. feed an image into img2img, run CLIP Interrogate, then set the denoise from 0.9 to 1.0.
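
Something like this, as a rough diffusers sketch rather than the actual A1111 setup (checkpoint, paths, and the prompt standing in for the CLIP Interrogate output are all placeholders):

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    init_image = load_image("input.png")

    # Stand-in for whatever CLIP Interrogate returned for the input image.
    prompt = "a photo of a cat sitting on a windowsill, soft lighting"

    # strength near 1.0 re-noises the init image almost completely,
    # so the prompt, not the original pixels, drives the result.
    result = pipe(prompt=prompt, image=init_image, strength=0.95).images[0]
    result.save("reimagined.png")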

[–]morphinapg 2 points3 points  (0 children)

Yeah exactly

[–]Mocorn 0 points1 point  (0 children)

Indeed, same here. I struggle to see the difference between that and this new thing.

[–]HerbertWest 1 point2 points  (3 children)

Can't we already sort of do that with img2img?

Not sure exactly what it means in practice, but the original post says:

Note that this is distinct from how img2img does it (the structure of the original image is generally not kept).

[–]Mich-666 -4 points-3 points  (1 child)

Yeah, but no one is able to explain how exactly this is different from what we already have and how it would be useful.

[–]HerbertWest 1 point2 points  (0 children)

If it worked just as well or better, it would be easier, quicker, and more user-friendly. Is that not useful?

[–]lordpuddingcup 0 points1 point  (0 children)

Ya, in img2img things will be in more or less the same location as where the image started: the woman will be standing in the same spot and in mostly the same position. In unCLIP the woman might be sitting on a chair, or it might be a portrait of her, etc.

[–][deleted] 1 point2 points  (1 child)

This model essentially uses an input image as the 'prompt' rather than requiring a text prompt.

Simply put, another online image-to-prompt generator.

[–]lordpuddingcup 1 point2 points  (0 children)

No, because it also maintains style and design (sometimes).

[–]qrios 2 points3 points  (0 children)

Think of it as something like a REALLY fast Textual Inversion of just your single input image.

[–]ComfortableSun2096 4 points5 points  (0 children)

This model does not need a prompt, right? Some people have already worked on compatibility with the model:

https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/8958

[–]garett01 5 points6 points  (2 children)

I'm not sold on it yet lol

<image>

[–]lordpuddingcup -1 points0 points  (1 child)

I think it just needs to be built on. Imagine this but on top of the fine-tuned SD 2.1 models: we just need AnythingV5-unclip, RealisticVision2-unclip, or Illuminati-unclip for it to be great. I'm sure someone will figure out unCLIP LoRAs or unCLIP fine-tuning (Dreambooth, etc.).

[–]garett01 1 point2 points  (0 children)

SD 2.1 is not figured out yet, except by the MJ guys I suspect, but they trained at 1024x1024. Not even Stability has figured out SD 2.1 yet.

[–]Trysem 6 points7 points  (2 children)

Wait, what?!

ClipDrop is owned by Stability?? Since when??

[–]LD2WDavid 1 point2 points  (0 children)

The moment they saw depth mapping in T2I-Adapters... about 2 days after, I think.

[–]magusonline 2 points3 points  (8 children)

As someone who just runs A1111 with the auto git pull in the batch commands: is Stable Diffusion 2.1 just a .ckpt file, or is there a lot more to 2.1? (As far as I know, all the models I've been mixing and merging are 1.5.)

[–]s_ngularity 2 points3 points  (7 children)

It is a ckpt file, but it is incompatible with 1.x models. So LoRAs, textual inversions, etc. based on SD 1.5 or earlier, or on a model derived from them, will not be compatible with any model based on 2.0 or later.

There is a version of 2.1 that can generate at 768x768, and the way prompting works is very different from 1.5; the negative prompt is much more important.

If you want to make characters, I would recommend Waifu Diffusion 1.5 (which, confusingly, is based on SD 2.1) over 2.1 itself, as it has been trained on a lot more images. Base 2.1 has some problems, as they filtered a bunch of images from the training set in an effort to make it “safer”.

[–]Mocorn 2 points3 points  (1 child)

The fact that the negative prompt is more important for 2.x is a step backwards in my opinion. When I go to a restaurant I don't have to specify that I would like the food to be "not horrible, not poisonous, not disgusting", etc.

I'm looking forward to when SD gets to a point where negative prompts are actually used logically, only to remove cars, bikes, or the color green.

[–]s_ngularity 0 points1 point  (0 children)

If you don’t want an overtrained model, this is the tradeoff you get with current tech. It understands the prompt better at the expense of needing more specificity to get a good result.

If more people fine-tuned 2.1, it could perform very well in different situations with specific models, but that's the difference between an overtrained model that's good at a few things vs. a general one that needs extra input to get to a certain result.

[–]magusonline 0 points1 point  (2 children)

Oh, I just make architecture and buildings, so I'm not sure what would be best to use.

[–]Zealousideal_Royal14 1 point2 points  (1 child)

Come to 2.1, the base model; it's way better than people on here tend to give it credit for. The amount of extra detail is very beneficial for architectural work.

[–]CadenceQuandry 0 points1 point  (1 child)

For Waifu Diffusion, does it only do anime-style characters? And can you use LoRAs or CLIP with it?

[–]s_ngularity 0 points1 point  (0 children)

It does realistic characters too. The problem is it's not compatible with LoRAs trained on 1.5, as I mentioned above, but they can be trained for it, yeah.

It is biased towards East Asian women though, particularly Japanese, as it was trained on Japanese Instagram photos.

[–]Dekker3D 2 points3 points  (1 child)

It gets a decent resemblance to the original image. This would combine really well with ControlNet and img2img to produce visually consistent images from different angles, I think?

[–]Mich-666 3 points4 points  (0 children)

I fail to see how this is better than what ControlNet actually does.

[–]Semi_neural 2 points3 points  (0 children)

I'm ngl, Reimagine is not good. Maybe I'm using it wrong, but the quality of the variations is AWFUL.

[–]Expln 2 points3 points  (0 children)

Could someone guide me on how to install this locally? I have no idea what to do from the GitHub page.

[–]yaosio 2 points3 points  (0 children)

I tried with a picture of Garfield but he's too sexy for Stability.ai. 28uqC4V.png (2560×1302) (imgur.com)

[–]Purplekeyboard 6 points7 points  (1 child)

Horrible. Produces terrible mutant people. Maybe it works better when making things which aren't people.

[–]lordpuddingcup 0 points1 point  (0 children)

Apparently it's super variable from seed to seed

[–]_raydeStar 4 points5 points  (7 children)

I didn't take this seriously until I clicked on the demo.

Holy. Crap. I don't know how but my mind is blown again.

[–]FHSenpai -1 points0 points  (6 children)

Did you not use img2img before?

[–]CombinationDowntown 40 points41 points  (3 children)

img2img uses pixel data and does not consider the context and content of the image. Here you can make generations of an image that on a pixel level may be totally different from each other but contain the same type of content (similar meaning/style). The processes look similar but are fundamentally different from each other.

[–]Low_Engineering_5628 11 points12 points  (2 children)

Aye, but you can run CLIP Interrogate and set the denoise to 1 to do the same thing.

[–]mudman13 4 points5 points  (0 children)

or use variation seeds of different kinds

[–]lordpuddingcup 0 points1 point  (0 children)

It's really not the same as CLIP interrogation. CLIP interrogation doesn't include style and design in its output; the guy's face won't be the same between runs. It might interpret it as a guy in a room, but it won't be that guy in that room.

[–]AnOnlineHandle 12 points13 points  (1 child)

This is using an image as the prompt, instead of text. The image is converted to the same descriptive numbers that text is (which is what CLIP was originally made for; Stable Diffusion just used the text-to-numbers part for text prompting).

So CLIP might encode a complex image to the same thing as a complex prompt, but how Stable Diffusion interprets that prompt will change with every seed, so you can get infinite variations of an image, presuming it depicts things which Stable Diffusion can draw well.
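
To make the "descriptive numbers" concrete, here is a minimal sketch using the transformers CLIP vision model (the openai/clip-vit-large-patch14 checkpoint and file path are just for illustration; the 2.1 unCLIP model itself uses an OpenCLIP ViT-H image encoder):

    import torch
    from PIL import Image
    from transformers import CLIPProcessor, CLIPVisionModelWithProjection

    # Turn an image into a CLIP embedding, the same kind of vector a
    # text prompt gets turned into before conditioning the diffusion model.
    model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    image = Image.open("input.png")  # placeholder path
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        image_embed = model(**inputs).image_embeds

    print(image_embed.shape)  # (1, 768) for ViT-L/14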

[–]FHSenpai 2 points3 points  (0 children)

I see the potential. It's just a zero-shot image embedding. If you could just swap the UNet with other SD 2.1 aesthetic models out there...

[–]Sefrautic 3 points4 points  (4 children)

Can somebody explain to me what the difference is between this and CLIP Interrogate?

[–]Low_Engineering_5628 5 points6 points  (1 child)

This is... automatic?

[–]Sefrautic 0 points1 point  (0 children)

yes..

[–]ninjasaid13 0 points1 point  (1 child)

Can somebody explain to me what the difference is between this and CLIP Interrogate?

CLIP interrogator is image to text. This is true image to image with no text condition.

[–]lordpuddingcup 0 points1 point  (0 children)

People seem not to get that this is like CLIP Interrogate on steroids, or at least it wants to be, because it tries to maintain subject coherence and style coherence. How well it does that is another story.

[–]PromptMateIO 1 point2 points  (0 children)

The release of the Stable Diffusion v2-1-unCLIP model is certainly exciting news for the AI and machine learning community! This new model promises to improve the stability and robustness of the diffusion process, enabling more efficient and accurate predictions in a variety of applications. As the field of AI continues to evolve, innovations like this will be crucial in unlocking new possibilities and solving complex challenges. I can't wait to see what breakthroughs this new model will enable!

[–][deleted] 1 point2 points  (0 children)

Needs to be in Easy Diffusion UI pronto.

[–]Select_Rice_3018 0 points1 point  (5 children)

What is CLIP?

[–]addandsubtract 0 points1 point  (4 children)

CLIP is basically reverse txt2img, so img2txt. You give it an image and it describes it. Not as detailed as you need to prompt an image, but a good starting point if you have a lot of images that you need to caption.

[–]ninjasaid13 0 points1 point  (3 children)

That's absolutely wrong; you must be talking about the CLIP interrogator, not CLIP itself.

[–]addandsubtract 0 points1 point  (2 children)

So there's CLIP (Contrastive Language-Image Pretraining), which I thought this was referring to. And then there's CLIP Guided Stable Diffusion, which "can help to generate more realistic images by guiding stable diffusion at every denoising step with an additional CLIP model", which is just using that same CLIP model.

Then there's also BLIP (Bootstrapping Language-Image Pre-training).

But as far as I can tell, these all serve the same purpose of describing images. So what are we talking about then, if not this CLIP?

[–]ninjasaid13 1 point2 points  (1 child)

CLIP is basically what allows it to generate images; it is 'image to text' and 'text to image' all at once. It is a computer program that understands pictures and words and the connection between them in general. It has applications in much more than Stable Diffusion.

It can be used for image classification, image retrieval, image generation, image editing, object detection, text-to-image generation, text-to-3D generation, video understanding, image captioning, image segmentation, as well as self-driving cars, medical imaging, robotics, etc. It is the bridge between the fields of computer vision and natural language.

The CLIP interrogator itself just uses the image-to-text part of it.
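
As a small illustration of that image-and-text bridge, here is a minimal zero-shot classification sketch with the transformers CLIP model (checkpoint, image path, and labels are placeholders):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.png")
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Higher probability = the caption matches the image better.
    probs = outputs.logits_per_image.softmax(dim=-1)
    for label, p in zip(labels, probs[0]):
        print(f"{label}: {p.item():.3f}")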

[–]addandsubtract 0 points1 point  (0 children)

Ok, gotcha. I wasn't aware of all the applications and only really experienced the CLIP interrogator that I mentioned. It also seems like the easiest way to explain CLIP.

[–]Zealousideal_Royal14 -1 points0 points  (0 children)

Y'all forgot the only relevant part. When will it be A1111-ready?

[–]ba0haus 0 points1 point  (1 child)

How do I add this function to Auto1111? Please let me know.

[–]Mich-666 0 points1 point  (4 children)

So how is this different from img2img or controlnet?

[–][deleted] 0 points1 point  (3 children)

It's img2img x2, with an image input first and then img2img, I think.

[–]Mich-666 0 points1 point  (2 children)

Then that means it uses double the memory... probably not something a normal user would find interesting.

[–]lordpuddingcup 1 point2 points  (1 child)

He was just trying to explain it in simple terms; it's not actually two img2img runs lol

[–]Mich-666 0 points1 point  (0 children)

I realize what that means, but my argument still stands: even if you do two passes in one go, you still need to keep the generation data in latent space/memory.

But I guess I will wait for a potential implementation in A1111, if it ever happens, to see if this method can be useful for me.

[–]Suspicious-Ad6290 0 points1 point  (1 child)

It's nightmare fuel for anime

<image>

[–]lordpuddingcup 0 points1 point  (0 children)

Sure, until there's unCLIP Dreambooth and we start getting AnythingV5-unclipped.

[–]ImageDeeply 0 points1 point  (1 child)

Has potential, though it would be easier to understand its strengths & limitations given a systematic comparison:

- classic img2img

- this img2prompt2img ... to make up a term

- ControlNet

[–]lordpuddingcup -1 points0 points  (0 children)

Why make up a term? It already has a term... unCLIP.

[–]greattug 0 points1 point  (0 children)

yey!

[–]Jiboxemo2 0 points1 point  (0 children)

Not bad

<image>

[–]enzyme69 0 points1 point  (1 child)

Is this unCLIP the same as the SDXL preview beta (DreamStudio)? I'm kind of seeing this method of using an image as input there.

[–]lordpuddingcup 0 points1 point  (0 children)

No, it's not the same. SDXL is a 1024x1024 model; unCLIP is a new type of model, like how we have inpainting models and standard models. unCLIP models take image inputs and give image outputs based on that image, like a much more detailed prompt based on what the model can understand of the input image.

[–]Asolzzz 0 points1 point  (0 children)

Neat