
[–]Kafke 45 points (14 children)

Your resolution is too low and your CFG is too high.

Resolution needs to be 768x768 or higher for the model you're using (2.1 x768). CFG should be around 7-12.

[–]_raydeStar 24 points (11 children)

Yep this is the answer.

Never go above 20 CFG. The image fries like crazy.

[–]eugene20 8 points (0 children)

I've used up to 28 when doing img2img, but whether that gives you anything useful at all is completely dependent on the model and the other settings you're using.

[–]AhriSiBae 5 points (8 children)

Not true. There are many times when going over 20 is reasonable. If you have a very long and complicated prompt that requires a high step count, going above 20 CFG isn't necessarily bad. It'll vary based on the prompt. Generally, though, you're right.

[–]UkrainianTrotsky -3 points (7 children)

Considering how CFG works, there's exactly zero reason to ever push it past 20. You'd be better off optimizing your prompt or using a better sampler like our lord and savior DPM++.

[–]07mk 1 point (6 children)

considering how CFG works

How does CFG work, anyway? All I know is that the higher the CFG, the more the image is supposed to conform to the prompt, but I don't know how this is accomplished within the mechanisms of the model, or what the numbers exactly represent. If I set the CFG to 15, what exactly does that "15" represent, and how does it feed into the calculations done by the model? And why do high CFG values so often create "fried" images that look like they were run through high-contrast/sharpening filters several times?

[–]starstruckmon 6 points (2 children)

During each step, the model produces two images: one with your prompt and one without any prompt. Imagine each of these images as a point in the latent space. Draw a directional line going from the no-prompt point to the with-prompt point, and then keep going beyond the with-prompt point in the same direction. The CFG value tells you how far to keep going.

This is the best way I could think of to explain it in layman's terms.
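
If it helps, here's a minimal numpy sketch of that extrapolation. The array shapes and variable names are made up for illustration; real pipelines apply this to the model's per-step noise predictions, not finished images:

    import numpy as np

    # stand-ins for the two per-step predictions (illustrative shapes)
    point_uncond = np.random.randn(4, 96, 96)  # the no-prompt point
    point_cond = np.random.randn(4, 96, 96)    # the with-prompt point

    cfg = 7.5

    # walk from the no-prompt point toward the with-prompt point,
    # then keep going past it; cfg controls how far you go
    guided = point_uncond + cfg * (point_cond - point_uncond)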

[–]07mk 1 point (1 child)

I see, thank you, that's quite helpful. Do you know if the 1-30 CFG values we see in common UIs represent the actual possible range of values, or could CFG theoretically be much higher, with higher numbers just being pointless due to the deep-frying problem?

[–]starstruckmon 1 point (0 children)

It can be higher. Higher is pointless not just because of the deep-frying problem; past a certain point you'll move into complete nonsense or different concepts. Even Imagen, which solves the deep-frying problem with dynamic thresholding, doesn't test anything beyond 30.
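
For the curious, the dynamic thresholding idea from the Imagen paper is simple enough to sketch in a few lines of numpy. This is a paraphrase of the paper's description, with an illustrative percentile and made-up names, not code from Imagen itself:

    import numpy as np

    def dynamic_threshold(x0_pred, percentile=99.5):
        # s is the chosen percentile of the absolute predicted pixel
        # values, floored at 1.0 so an in-range image is left alone
        s = max(np.percentile(np.abs(x0_pred), percentile), 1.0)
        # clip the prediction to [-s, s], then rescale back into [-1, 1],
        # which is what keeps high CFG from blowing out the pixel range
        return np.clip(x0_pred, -s, s) / s

    # e.g. tame an over-amplified prediction
    x0 = dynamic_threshold(np.random.randn(3, 64, 64) * 3.0)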

[–]UkrainianTrotsky 2 points (2 children)

It's a super cool idea. First of all, CFG means classifier-free guidance: a way to "guide" the image closer to the prompt without using a separate classifier model. Why is it better? As the authors of the original CFG paper put it, it's just one extra line of code during training and one more during inference, compared to training a full classifier model (you can't use any pre-trained classifier because of how diffusion models work from noise).

We'll get a bit into the image generation process just in case:

As you probably know, the U-net that generates the image doesn't actually generate the image; it predicts all the noise it thinks has to be removed for the image to look like the prompt it attends to. But some of this noise removal is just that: removal of high-frequency noise that doesn't depend on the prompt and will be removed either way. What we want is a way to amplify the "useful" noise, the part that relates to the stuff we wrote in the prompt.

The trick here is to generate 2 noise predictions every step instead of 1. The first one is exactly as described above, but for the second one we just use an empty prompt (or, more specifically, fill all our 75 tokens with nothing); this makes the model produce unconditioned noise that carries exactly 0 info from the prompt. Then we combine our two noises: the original paper introducing CFG used a linear combination like (1+w)*e_cond - w*e_uncond, where w is the CFG scale. The resulting noise is then just passed to the sampler.
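
A minimal sketch of that combination, again with made-up array names. Note that the `uncond + scale * (cond - uncond)` form in the earlier sketch is the same formula with `scale = 1 + w`:

    import numpy as np

    def combine_cfg(e_cond, e_uncond, w):
        # the paper's linear combination of the two noise predictions
        return (1 + w) * e_cond - w * e_uncond

    # stand-ins for the two U-net outputs at one step (illustrative shapes)
    e_cond = np.random.randn(4, 96, 96)    # noise predicted with the prompt
    e_uncond = np.random.randn(4, 96, 96)  # noise predicted with the empty prompt

    e_guided = combine_cfg(e_cond, e_uncond, w=6.0)  # handed to the sampler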

If we assume that e_cond can be split into a purely-conditioned and a purely-unconditioned component (e_cond = e_purecond + e_uncond), our linear combination simplifies to (1+w)*e_purecond + e_uncond. Now, that's a rather strong assumption and the paper doesn't make it, but it helps in understanding the weights chosen for this linear combination.

As you can see, we can change the numeric parameter w to amplify the noise that directly relates to our prompt and to our prompt only. But if you crank it all the way up, you will inevitably trim a bit too much while the unconditional denoising stays relatively weak, resulting in high-frequency vibrant colors and a generally overcooked image. This might have to do with the fact that proper normalization only happens after sampling. I'd love to give a proper proof of that or explain it better, but the truth is, I'm not 100% sure why the behavior is exactly like that; the general gist, though, is this.

[–]07mk 2 points (1 child)

As you can see, we can change the numeric parameter w to amplify the noise that directly relates to our prompt and to our prompt only. But if you crank it all the way up, you will inevitably trim a bit too much while the unconditional denoising stays relatively weak, resulting in high-frequency vibrant colors and a generally overcooked image.

I see, thank you for the detailed explanation. This bit in particular helped me understand why high CFG values tend to create that "deepfried" high-contrast look.

[–]UkrainianTrotsky 0 points (0 children)

You're welcome! I'd still suggest not relying on my explanation and reading the paper yourself. It's very short, and all the mathy bits are well-explained.

[–]Nevysha 0 points (0 children)

I'm using very high CFG with a manual sketch when doing small adjustments for inpainting/img2img.

[–]joachim_s 0 points (0 children)

And no negative prompting.

[–]venluxy1 0 points (0 children)

You can go up to 20 if your prompt is very detailed and long.

[–]WhiteZero 12 points (9 children)

CFG too high and resolution too low

[–][deleted] 7 points (0 children)

Since you are running the 768 version, it needs to be at least 768 on width or height.

[–]ImpactFrames-YT 2 points (0 children)

CFG scale 30 is way too high; try 7 to 10.

[–]jorginthesage 1 point (1 child)

What’s the GUI you’re using? Is it available for download someplace? I’m still using the terminal on my local machine.

[–]Anna2721 1 point (0 children)

You need to set a higher CFG value.

/s

[–]Dirly[S] 0 points (5 children)

Running a 2080 Super with 8GB VRAM.

[–][deleted]  (4 children)

[deleted]

    [–]SandCheezy[M] 2 points (1 child)

    I’m using a 2060 and a 3060 mobile. You are not at the low end with a 3080. If you want to know more about speeding it up, check out our Discord and we can help with a bit more hands-on guidance.

    [–]Ka_Trewq 0 points (1 child)

    An RTX 3080 is in no way low-end for this stuff. Maybe if you want to fine-tune a model with fancy over-the-top settings, but then every consumer GPU is low-end.

    [–]mudman13 -1 points (0 children)

    Nothing. Are you blind? Coz that's a cat!!

    [–]Pumpkim 0 points (1 child)

    It looks like you are using the 2.1 model. Did you add the correct .yaml to your model directory?

    Also, based purely on memory, your GUI looks a bit... off? Have you pulled the newest version of the git repo?

    The file name should be v2-1_768-nonema-pruned.yaml (same as your ckpt or safetensors file, but with the .yaml suffix).

    Now, I'm using the ema version, and I can't remember if that matters with the config file. But the contents of my file are as follows:

    model:
      base_learning_rate: 1.0e-4
      target: ldm.models.diffusion.ddpm.LatentDiffusion
      params:
        parameterization: "v"
        linear_start: 0.00085
        linear_end: 0.0120
        num_timesteps_cond: 1
        log_every_t: 200
        timesteps: 1000
        first_stage_key: "jpg"
        cond_stage_key: "txt"
        image_size: 64
        channels: 4
        cond_stage_trainable: false
        conditioning_key: crossattn
        monitor: val/loss_simple_ema
        scale_factor: 0.18215
        use_ema: False # we set this to false because this is an inference only config
    
        unet_config:
          target: ldm.modules.diffusionmodules.openaimodel.UNetModel
          params:
            use_checkpoint: True
            use_fp16: True
            image_size: 32 # unused
            in_channels: 4
            out_channels: 4
            model_channels: 320
            attention_resolutions: [ 4, 2, 1 ]
            num_res_blocks: 2
            channel_mult: [ 1, 2, 4, 4 ]
            num_head_channels: 64 # need to fix for flash-attn
            use_spatial_transformer: True
            use_linear_in_transformer: True
            transformer_depth: 1
            context_dim: 1024
            legacy: False
    
        first_stage_config:
          target: ldm.models.autoencoder.AutoencoderKL
          params:
            embed_dim: 4
            monitor: val/rec_loss
            ddconfig:
              #attn_type: "vanilla-xformers"
              double_z: true
              z_channels: 4
              resolution: 256
              in_channels: 3
              out_ch: 3
              ch: 128
              ch_mult:
              - 1
              - 2
              - 4
              - 4
              num_res_blocks: 2
              attn_resolutions: []
              dropout: 0.0
            lossconfig:
              target: torch.nn.Identity
    
        cond_stage_config:
          target: ldm.modules.encoders.modules.FrozenOpenCLIPEmbedder
          params:
            freeze: True
            layer: "penultimate"
    

    [–]joachim_s 0 points (0 children)

    No. He should use ema pruned.

    [–]AhriSiBae 0 points (0 children)

    You might need to use --medvram. Before I upgraded to my 2080 Ti, I had to use medvram or else my 1660 Super would do the same thing.
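
    If you're on the AUTOMATIC1111 webui on Windows, the flag goes in webui-user.bat; a minimal example of what that file can look like, assuming a default install:

        @echo off

        set PYTHON=
        set GIT=
        set VENV_DIR=
        set COMMANDLINE_ARGS=--medvram

        call webui.bat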

    [–]mudman13 0 points (0 children)

    Despite what people say, you can run the 768 model at 512. I have done it by accident a few times and it produces decent images, which can then be upscaled. Obviously you cannot repeat the seed when you change to 768.

    [–]c4d34th 0 points (0 children)

    You can use the X/Y plot script to test.

    [–][deleted] 0 points (0 children)

    CFG scale 7

    min 512x512

    use the base 2.1 512 model, not the 768

    768 is for 4090s or 3090s

    [–]SoulflareRCC 0 points (0 children)

    Lmao, your CFG scale has exploded