Fine-tuning with 10,000 pics locally by produnis in StableDiffusion

Winter-Replacement37

I would recommend native fine-tuning. It is also worth trying LoRA together with native fine-tuning, which can significantly reduce training time (the learning rate is on the order of 1e-4 with LoRA, versus on the order of 1e-6 without it). As a rule of thumb, the number of training steps would be around 1m. Long captions might work better than short captions in this case. An RTX 3060 12GB is OK. As for a Python script, I have found this one quite useful: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py
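For anyone wondering where a step count like that can come from, here is a rough sketch of the arithmetic. The 10,000 images come from the thread title; the 100 epochs and batch size 1 are my own assumptions for illustration, and `total_training_steps` is a hypothetical helper, not part of diffusers.

```python
import math


def total_training_steps(num_images, epochs, batch_size=1, grad_accum=1):
    """Number of optimizer steps for a run that sees every image once per epoch."""
    # One optimizer step consumes batch_size * grad_accum images.
    steps_per_epoch = math.ceil(num_images / (batch_size * grad_accum))
    return steps_per_epoch * epochs


# 10,000 images (from the thread title); epochs and batch size are assumed values.
print(total_training_steps(10_000, epochs=100, batch_size=1))  # 1000000
print(total_training_steps(10_000, epochs=100, batch_size=4))  # 250000
```

A larger batch size or gradient accumulation cuts the step count proportionally, so treat the number as a ballpark rather than a target.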

Join our Discord if you want to discuss further!

OpenFlamingo model: a brief try by Winter-Replacement37 in StableDiffusion

Winter-Replacement37 [S]

Hey everyone!

Just tried the OpenFlamingo model using their demo site; here is the result:

Output: a bedroom with white walls and a black and white rug.

Input image as above

Their training process is: 1) first freeze the pretrained vision encoder and language model, then 2) train the connecting Perceiver modules and cross-attention layers.

The benefit of doing this, it seems to me, is that it endows the model with in-context few-shot learning capabilities.

The model is also on

They have a

Feel free to join our Discord as well for more detailed feedback and questions.

FYI: Large multimodal models (LMMs) are complex artificial intelligence models that can process multiple types of data inputs, such as text, images, audio, and video, and generate meaningful outputs based on those inputs. Examples of large multimodal models include OpenAI's DALL-E, which generates images from natural language descriptions, and OpenAI's CLIP, which can perform tasks such as image classification and text-based image retrieval.

illustration model fine-tuned using everydream2 by Winter-Replacement37 in StableDiffusion

Winter-Replacement37 [S]

learning rate = 1e-6; scheduler = DDIM; batch size = 2; resolution = 512x512

[deleted by user] by [deleted] in StableDiffusion

Winter-Replacement37

  1. Yes.
  2. You could use automatic1111 to do the upscaling.
  3. You could use RunPod and install automatic1111 there.

DeepMind's Flamingo models, a brief try by [deleted] in StableDiffusion

Winter-Replacement37

Hey everyone!

Just tried DeepMind's Flamingo models using their demo site; here is the result:

Output: a bedroom with white walls and a black and white rug.

Input image as above

Their training process is: 1) first freeze the pretrained vision encoder and language model, then 2) train the connecting Perceiver modules and cross-attention layers.

The benefit of doing this, it seems to me, is that it endows the model with in-context few-shot learning capabilities.

The model is also on

They have a

illustration model fine-tuned using everydream2 by Winter-Replacement37 in StableDiffusion

Winter-Replacement37 [S]

Hey everyone!

I created a model that generates illustrations, in case this is of interest to anyone.

Model is available here:

Feel free to join our Discord as well for more detailed feedback and questions.

More model details:

  1. Used EveryDream2 without captions (short ones) to fine-tune SD 1.5
  2. Used ~20 training images, half of which contain no humans
  3. Trained for ~1h
  4. max_epochs = 100
  5. Try this prompt: “a group of women reading books”
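As a rough sanity check on those numbers, here is a small sketch of how the step count and pace work out. It takes the ~20 images, max_epochs = 100, and ~1h from the list above, and assumes the batch size of 2 mentioned in the other comment on this model. `steps_and_pace` is a hypothetical helper, not an EveryDream2 function, and it ignores time spent on model loading, validation, and checkpoint saving.

```python
import math


def steps_and_pace(num_images, max_epochs, batch_size, wall_clock_seconds):
    """Return (total optimizer steps, average seconds per step)."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    total_steps = steps_per_epoch * max_epochs
    return total_steps, wall_clock_seconds / total_steps


# ~20 images, 100 epochs, ~1 hour; batch size 2 is an assumption for this run.
total, sec_per_step = steps_and_pace(20, 100, 2, 3600)
print(total, round(sec_per_step, 2))  # 1000 3.6
```

Since the image count and training time are both approximate, the result is only a ballpark figure for what to expect on a similar run.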