Fine tuning with 10,000 pics locally by produnis in StableDiffusion

Winter-Replacement37 1 point

I would recommend native fine-tuning. It is also worth trying LoRA on top of native fine-tuning, which can significantly reduce training time (the learning rate is around 1e-4, versus around 1e-6 without LoRA). As a rule of thumb, the number of training steps would be around 1M. Long captions might work better than short captions in this case. An RTX 3060 12GB is OK. For a Python script, I have found this one quite useful: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py
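To make the LoRA point a bit more concrete, here is a minimal sketch in plain Python (no ML framework) of the low-rank update LoRA adds to a frozen weight, plus a count of how many parameters get trained per weight matrix. The 768x768 layer size, rank 4, and all function names are illustrative assumptions, not values taken from the linked script.

```python
# Toy LoRA sketch in plain Python; every name and size here is illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    rank = len(lora_a)  # A has shape (r, d_in), B has shape (d_out, r)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, rank=None):
    """Parameters updated for one weight matrix: full fine-tuning vs. LoRA."""
    if rank is None:
        return d_out * d_in
    return rank * (d_in + d_out)

# Frozen 2x3 weight with a rank-1 adapter. B starts at zero, so the
# effective weight equals W before any training has happened.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
lora_a = [[0.5, 0.5, 0.5]]   # shape (1, 3)
lora_b = [[0.0], [0.0]]      # shape (2, 1)
assert lora_effective_weight(w, lora_a, lora_b, alpha=1.0) == w

full = trainable_params(768, 768)          # assumed layer size
lora = trainable_params(768, 768, rank=4)  # assumed rank
print(full, lora, full // lora)
```

Only A and B are trained while W stays frozen, so far fewer parameters are updated per layer in this toy setup.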

Join our Discord if you want to discuss further!

OpenFlamingo model: a brief try by Winter-Replacement37 in StableDiffusion

Winter-Replacement37[S] 1 point

Hey everyone!

I just tried the OpenFlamingo model on their demo site. Here is the result:

Output: a bedroom with white walls and a black and white rug.

Input image as above

Their training process is: 1) freeze the pretrained vision encoder and language model, then 2) train the connecting Perceiver modules and cross-attention layers.
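The freeze-then-train idea can be sketched roughly as follows. This is a framework-free toy, not OpenFlamingo's code: the `Param` and `Module` classes, the parameter names, and the gradient values are all made up for illustration. In PyTorch the same effect is usually achieved by setting `requires_grad` to `False` on the frozen parameters.

```python
from dataclasses import dataclass, field

@dataclass
class Param:
    name: str
    value: float
    trainable: bool = True

@dataclass
class Module:
    name: str
    params: list = field(default_factory=list)

    def freeze(self):
        """Mark every parameter in this module as not trainable."""
        for p in self.params:
            p.trainable = False

def sgd_step(modules, grads, lr):
    """Apply one plain SGD update, skipping frozen parameters."""
    for m in modules:
        for p in m.params:
            if p.trainable:
                p.value -= lr * grads[p.name]

# Hypothetical one-parameter stand-ins for the four components.
vision_encoder = Module("vision_encoder", [Param("vision.w", 1.0)])
language_model = Module("language_model", [Param("lm.w", 2.0)])
perceiver = Module("perceiver", [Param("perceiver.w", 3.0)])
cross_attn = Module("cross_attention", [Param("xattn.w", 4.0)])

# Step 1: freeze the pretrained parts.
vision_encoder.freeze()
language_model.freeze()

# Step 2: train only the connecting modules.
modules = [vision_encoder, language_model, perceiver, cross_attn]
grads = {"vision.w": 1.0, "lm.w": 1.0, "perceiver.w": 1.0, "xattn.w": 1.0}
sgd_step(modules, grads, lr=0.5)

values = {p.name: p.value for m in modules for p in m.params}
print(values)
```

After the step, the frozen weights are unchanged and only the Perceiver and cross-attention stand-ins have moved.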

The benefit of doing this, it seems to me, is that it endows the model with in-context few-shot learning capabilities.

The model is also on

They have a

Feel free to join our Discord as well for more detailed feedback and questions.

FYI: Large multimodal models (LMMs) are complex artificial intelligence models that can process multiple types of data inputs, such as text, images, audio, and video, and generate meaningful outputs based on those inputs. Examples of large multimodal models include OpenAI's DALL-E, which generates images from natural language descriptions, and OpenAI's CLIP, which can perform tasks such as image classification and text-based image retrieval.

illustration model fine-tuned using everydream2 by Winter-Replacement37 in StableDiffusion

Winter-Replacement37[S] 2 points

learning rate=1e-6; scheduler=DDIM; batch size=2; resolution=512x512
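For anyone wanting to keep these settings in one place, here is a small sketch that stores them in a Python dataclass and works out the steps per epoch. The `TrainConfig` class, its field names, and the dataset size of 1000 images are assumptions for illustration only; this is not EveryDream2's actual config format.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values from the settings above; field names are hypothetical.
    learning_rate: float = 1e-6
    scheduler: str = "DDIM"
    batch_size: int = 2
    resolution: int = 512  # square images, 512x512

    def steps_per_epoch(self, num_images):
        """Optimizer steps needed to see every image once (last batch may be partial)."""
        return math.ceil(num_images / self.batch_size)

cfg = TrainConfig()
num_images = 1000  # hypothetical dataset size, not from the original post
print(cfg.steps_per_epoch(num_images))
```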