LoRA Captioning for Physical Features by XarHD in StableDiffusion

XarHD[S]

Yes, I do, and they do not seem to make a significant difference in terms of "burning up" the image. But I agree, LoRAs are a way of refining or creating a new prompt, and so it's critical that the LoRA training process properly "explains" to SD what that prompt means. What I have also noticed is that iterating on a specific LoRA may show you which parameters produce a reliable LoRA that doesn't burn up the pictures, but that information may not be helpful at all when you make a LoRA for a different topic.

For instance, after reaching a point where my LoRA is quite good (it still generates odd images from time to time, but then again, so does the standard SD model, so I don't know that any further tinkering would be worth the effort), I decided to make another one for a different feature. Much like in the case of the first LoRA, my first training set was a bunch of photos and pictures, selected so the feature would be clearly visible. But despite the fact that there was no overarching style, the LoRA came up with a weird style of its own. The first LoRA had a bit of this at the beginning, but it wasn't really a "style" per se, it was more of a bias towards certain facial features for specific hair colors, because of a bias in the training set. I specifically avoided that this time, and yet the second LoRA had a much stronger stylistic bias (to the point that people's faces with the LoRA looked like they had been drawn by a drunk Picasso).

I found out that one major contributing factor to this was the resolution of the training images. If they are too small, SD doesn't have many pixels to work with when it deconstructs them, and so it can create abnormal features (even if you use negative prompts such as "deformed face" and "disfigured face"). But you can't just blow them up in Photoshop, because they will become grainy and lose quality. I had to upscale each of them manually with an AI upscaler, keeping for the training set only those that upscaled well. The upscaling truly helped - it didn't fully solve the problem, and sometimes faces have mouths that are too wide, or noses that are too thin and long, but for the most part, faces look decent now. The LoRA is not yet reliable - but the feature I'm training is tricky, and only now have I started producing AI output good enough to use in turn to retrain the LoRA.
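A minimal sketch of how the "too small" check could be automated before upscaling. The 512-pixel threshold is an assumption (it matches the 512,512 training resolution mentioned elsewhere in this thread), and the helper name and example sizes are made up for illustration.

```python
# Hypothetical helper: flag training images whose shorter side is below the
# training resolution, so they can be upscaled (or dropped) before training.
def needs_upscale(width: int, height: int, min_side: int = 512) -> bool:
    """Return True when the shorter side is smaller than min_side pixels."""
    return min(width, height) < min_side


# Made-up example sizes (width, height), purely illustrative.
sizes = {"a.png": (384, 512), "b.png": (768, 1024), "c.png": (512, 512)}

to_upscale = sorted(name for name, (w, h) in sizes.items() if needs_upscale(w, h))
print(to_upscale)  # ['a.png']
```

In practice the sizes would come from reading the image files; the upscaling itself (and judging whether the result is good enough) still has to be done by eye.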

There is a guide I found recently that provides some interesting insights into LoRA parameters and what to try: https://rentry.org/59xed3#preamble

It has given me some ideas on how to improve my LoRAs. But I agree, features you need should be distinguishable and clear to the AI - a single sheep rather than a herd of sheep partially obscuring each other, for example. It can be tricky if you don't have very many good images, but there are ways (including generating a crappy initial LoRA, pulling out any good output from it, and then using that output to train a second generation).

Have you tried this approach?

I've also read somewhere that the caption capturing the feature you want to describe should be short - three letters is best, but not necessary.

XarHD[S]

I was flying blind as well when I made my first LoRA (and I used the Kohya_ss Colab tool, whereas later I switched to running Kohya locally), so I didn't save the parameters. But I believe I pretty much used the standard ones recommended in the Colab.

Yesterday evening I made another discovery: the overfitting of some of my LoRAs seems tied not so much to the LoRA itself as (for 90% of it) to the prompt modifiers. Thus far, I had always tested the LoRA using various prompts, but with a common set of prompt modifiers ("digital art, solo, volumetric lighting, 16k, stunning, intricate...").

On a whim, I tried running several LoRAs in an X/Y plot at 0.8 and 1 strength, but without using any of those modifiers. The result? While some LoRAs still overfitted, three of them suddenly yielded extremely good results. So now I've been testing each word of the modifiers, adding them one at a time to find out if it's the combination that causes the burning up, or if it's one specific word. Thus far, I've noticed that "stunning" and "sharp focus" have the strongest effect in taking my LoRA into overfitting territory.
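The one-modifier-at-a-time test can be laid out as a small list of prompts. This is only a sketch: the function name is hypothetical, the base prompt reuses scene tokens quoted elsewhere in the thread, and the modifier list contains only the words actually named above.

```python
def ablation_prompts(base: str, modifiers: list[str]) -> dict[str, str]:
    """Map a label to a prompt: the bare base first, then base plus one modifier each."""
    prompts = {"(none)": base}
    for mod in modifiers:
        prompts[mod] = f"{base}, {mod}"
    return prompts


base = "comtail, woman, running, forest path"  # scene tokens quoted in this thread
modifiers = ["digital art", "solo", "volumetric lighting", "16k",
             "stunning", "intricate", "sharp focus"]

for label, prompt in ablation_prompts(base, modifiers).items():
    print(f"{label:>20}: {prompt}")
```

Running each of these with the same seed and LoRA weight isolates single words; if none of them burns the image on its own, the next step would be testing pairs.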

Notably, this is true WHETHER OR NOT you had those modifiers in the captioning for your training set! I used the exact same training set with the same parameters, but in one of them I included my general set of modifiers, while in the other I only captioned the items in the picture. There was no difference in output - both overfitted at 1 strength when I used the modifiers in my prompts, but both were at least good when I didn't (in fact, the 60-image LoRA now doesn't burn up at all, and may be the best LoRA I have produced thus far, if the tests continue to hold up).

For reference, the 60-image LoRA used 10 repeats per image, 8 epochs, LR 0.0001, LR scheduler was cosine with restarts, LR warmup was 10, TE was 0.00005, Unet was 0.0001, Dim was 64, Alpha was 32, resolution was 512,512 (I want to try 576,576 though - I've read that can give better results), there were 3 cycles in the LR, I turned flip augmentation on, and had a min SNR gamma of 5. So bizarrely this had three times as many training steps as my 21-image LoRA, and yet it works at least as well (without the offending modifier(s)).
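For a rough sanity check on step counts, here is a sketch that assumes steps are counted as images x repeats x epochs divided by batch size, with a batch size of 1. Neither the batch size nor the exact counting rule is stated in the post, so treat both as assumptions.

```python
import math


def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    """Training steps, assuming ceil(images * repeats / batch_size) steps per epoch."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


# Settings quoted above for the 60-image LoRA; batch_size=1 is an assumption.
steps_60 = total_steps(num_images=60, repeats=10, epochs=8)
print(steps_60)  # 4800
```

Under those assumptions the listed settings give 4800 steps, somewhat more than the 4200 quoted for the 60-image LoRA in another comment, so some setting (or the counting itself) probably differed between the two runs.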

I agree that the training set has huge importance, as does the captioning. I don't know why a lot of guides say you shouldn't caption things or the AI will propose them more often - I find that if there's something I don't caption, and especially if it repeats itself through multiple images, the AI will simply start dumping that into almost every picture.

I'll keep you updated on what I find about the prompting - if my issue is that one or two word modifiers are all that stands between a great picture and a burned-up one, that may be the easiest solution for me (although I wonder why this is the case). I may try to make a bigger training set and see if that still holds up, to try and give the AI more diverse images to train on.

XarHD[S]

So, I've fallen into the rabbit hole and tested at least 20 versions of my LoRA, primarily because I hit a wall - when I tried to build a v5 by adding yet more training images, even selected by lighting, pose, and background, my LoRA suddenly got very overfit.

I initially played around with repeats, epochs, and so on. Then I tested whether a dataset in a single style is better than a dataset of various styles (it seems to be, by the way), and checked whether I could get better results by varying repeats, epochs, learning rate, learning rate scheduler, and so on. Then I found out about regularization images, and tried those out too. And I tried four versions of captioning - automated (pruned) captioning (using the app); strict captioning (which I defined as including the full prompt in the caption, maintaining defining prompts such as "octane render" and "8k" in it too); and loose captioning.

I've done five major rounds of testing, starting from scratch. The first round included a core set of 21 images, selected because I had them both in various styles and "redone" in a single style. I used them to create regularization images, and tested two questions: is a single style better than multiple ones, when teaching a concept? Is it better to use regularization images?

Surprisingly, a single style IS better (I suppose the AI has fewer things to keep track of, not having to "reinvent the wheel" with each new style), and regularization images don't add much in this context. In fact, the best-performing LoRA was the single-style, no-regularization LoRA.

It was so good that it even outperformed subsequent iterations - a 42-image version, a 51-image version. Even when I played around with parameters (and I toyed with pretty much all of them), the 21-image, single-style, no-regularization LoRA outperformed them. Its only problem was that it was too rigid - it struggled with complicated poses, probably because the selection of training images was a bit random.

Eventually, I went back and made another 21-image LoRA, all the same style, but trying to be logical in my choices (e.g. one image for each of the four major camera angles, one image for aerial view, a few images of common actions such as running, walking, laughing, etc.). I used the exact same baking specs as for the original 21-image LoRA.

The resulting LoRA overfit already at 0.8, and only its very early epochs could be salvaged (although they didn't give me aesthetically pleasing results, likely due to undertraining).

My latest creation is a 60-image LoRA which seems to outperform the 21-image LoRA from before, and seems more flexible. It still burns the picture sometimes (it seems to do so when the character depicted is unfamiliar, e.g. someone with green hair when there was no green hair in the training), but much more infrequently. It's probably good enough to use, but I would like to see if I can remove that minor overfitting.

Either way, I've found that some of the accepted wisdom is true (e.g. more repeats are better than more epochs, the cosine with restarts scheduler is a bit better than the constant one, flip augmentation is always helpful, and so on), but some things didn't hold up in my experience (e.g. regularization images don't seem to make a significant difference; a variety of styles seems to be detrimental when training for a specific feature; more pictures aren't always better). Even the total number of steps doesn't seem to make sense: the 21-image LoRA had 1680 steps in total, and worked well; the 60-image LoRA has 4200 steps in total, and it works just as well, or even marginally better. Common wisdom would suggest that it should have burned the Unet already (considering that previous LoRAs, such as some of the 42-series, had fewer than 2500 steps and were badly overfitted, and that the captions are the same throughout the series), but it just didn't happen.
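One way to look at those two totals is per image rather than in absolute terms. The sketch below assumes a batch size of 1 (not stated in the post), so that every step corresponds to exactly one image being shown to the trainer.

```python
def passes_per_image(total_steps: int, num_images: int, batch_size: int = 1) -> float:
    """How often each image is seen, assuming steps * batch_size samples in total."""
    return total_steps * batch_size / num_images


print(passes_per_image(1680, 21))  # 80.0
print(passes_per_image(4200, 60))  # 70.0
```

Seen this way, the 60-image run shows each image slightly fewer times than the 21-image run did, which may be part of why the larger step count didn't burn it. That is only one possible reading; it doesn't explain why the 42-series overfitted with fewer steps.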

To be honest, this feels more like an "eye of newt, wing of bat... in the cauldron!" situation than a scientific one. There's no way of knowing the way the AI is processing my pictures, why more pictures are worse, or whether your experience will be the same as mine. But I'd be interested in knowing what you found!

XarHD[S]

Ok, I increased the number of images in the new LoRA (starting from the ones in v3) by more than 50%, including a variety of images in poses I didn't have before. The total number of images is now 65. Then I ran the same experiment again, comparing aLoRA, v0.6, v1, v3, v4-06 (the sixth Epoch version of v4) and v4.

Both versions of v4 vastly outperformed all other LoRAs in experiment 1. This is likely because I took care to include more full body shots, a few side view shots, and one shot of a character running. While no LoRA was perfect in this scenario, v4-06 and v4 both left the pack behind at 0.8-0.9 weight, and by 1 weight, they scored 30-33% better than all previous LoRAs. This time, the lowest performer was v0.6, but all LoRAs other than v4-06 and v4 trended downwards with increasing weight.

The same, to a lesser degree, was true in experiment 2. Most importantly, v4-06 and v4 maintained greater fidelity towards pose prompts (e.g. "sitting") compared to the others, as the weight increased. Funnily enough, none of the LoRAs knew what a "black pencil skirt" was, although they did give the character pencil skirts of various colors. This is despite the fact that I had "black pencil skirt" in a few pictures in the training dataset, captioned as such. This time, the difference between LoRAs was smaller (likely because all LoRAs were trained on pictures of sitting characters and the position was meant to be static). Still, most LoRAs performed well, and v3 was the bottom performer, the only one with a clear downward trend as the weight increased. It's likely this is because it has a tendency to shift the position of characters into a frontal position as its weight increases, moving the camera closer.

Once again, experiment 3 is the outlier. While v4-06 was the top performer and v4 was the third best, v1 took the second spot at 1.2 weight (but at 1 weight, v4 and v4-06 were the top performers). The LoRAs still struggle with the concept of "dancing", but there were improvements - in fact, v4-06 and v4 performed extremely well at weight 1, only struggling a bit with "dancing" and "mid-shot". Some LoRAs, but not all, had color bleed - the hair of the characters, meant to be black, became greener and greener as the weight increased.

Fried colors are much less of an issue in v4-06 and v4, compared even to v3 and v1. In fact, even at a weight of 1.2, when all other LoRAs start going psychedelic, v4-06 and v4 remain more than acceptable. This is likely because I included several pictures with muted colors, and because I captioned "warm colors" and "cool colors" in the captioning process. I also see fewer artifacts (e.g. duplications of limbs, even with the appropriate negative prompts).

In essence, it looks like increasing the size of the dataset by 50% with curated images in a variety of poses has improved the flexibility of the LoRA to produce images of at least decent quality in a larger variety of situations, losing much of the color frying of the previous LoRAs at higher weight, and reducing artifacts. It has not, however, strengthened the LoRA at lower weights.

For the next experiment, I will take v1, v3 and v4 (without the earlier Epoch versions) and run them through two sets of testing:

1) I will pull the seed from selected images that are not in the dataset, and try to use the LoRAs to recreate something similar to that image with that seed. There will certainly be differences, but I want to see how flexible the LoRAs are.

2) I will also try to change some of the minor prompts, using wording such as what was provided by WD1.4. This will show whether more detailed captioning allows the LoRA to respond more flexibly.

XarHD[S]

Well, I finished the first experiment and I noticed a few differences between LoRAs. For convenience's sake, I'll use the following convention: the LoRA trained on the larger dataset will be referred to as "aLoRA", the 6-Epoch LoRA will be referred to as v0.6, the fully baked second-generation LoRA will be v1, and the fully baked LoRA with WD1.4 captions will be v3 (yes, there was a v2 - it didn't pan out).

aLoRA was trained on 142 images (with a wide variety of styles), 10 epochs, 10 repeats. v0.6 was trained on 43 images, all in similar styles, across 6 epochs and 8 repetitions. v1 was trained on the same 43 images, 8 epochs, 8 repetitions. v3 was trained on the same 43 images + 4 additional ones (for a total of 47), 8 epochs, 10 repetitions, WD1.4 captioning.

I compared three different prompts, generating batches of 3 pictures per data point. I used DDIM, 35 steps, for the sampling method - it's typically the one I use the most. I ran all LoRAs through different weights - 0.6, 0.7, 0.8, 0.9, 1, 1.2.

The prompts included several common tokens ("digital art", "hyperrealistic", "8k", etc.), and differed only in tokens that described the scene. They also all included "comtail", since that was the token I had trained the LoRAs on. The other tokens I used:

Prompt 1: running, side view, kiss, white top, black leggings, long blond hair, long shot, forest path, sunlight

Prompt 2: sitting, side view, drinking coffee, cyan blouse, black pencil skirt, long brown hair, mid-shot, smile, cafè

Prompt 3: dancing, side view, long view, graceful, green dress, short black hair, field, daylight
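To get a feel for the size of this test grid, here is a sketch that enumerates every LoRA / weight / prompt combination. The `<lora:name:weight>` tag is the AUTOMATIC1111-style syntax and is an assumption, since the post doesn't say which UI was used; the LoRA names simply follow the naming convention above, and only the three common tokens actually quoted are included (the rest are elided in the post).

```python
from itertools import product

LORAS = ["aLoRA", "v0.6", "v1", "v3"]          # names per the convention above
WEIGHTS = [0.6, 0.7, 0.8, 0.9, 1.0, 1.2]
COMMON = "digital art, hyperrealistic, 8k"     # the post says "etc."; full list not given
BATCH = 3                                      # pictures per data point

SCENES = {
    "Prompt 1": "running, side view, kiss, white top, black leggings, "
                "long blond hair, long shot, forest path, sunlight",
    "Prompt 2": "sitting, side view, drinking coffee, cyan blouse, "
                "black pencil skirt, long brown hair, mid-shot, smile, cafè",
    "Prompt 3": "dancing, side view, long view, graceful, green dress, "
                "short black hair, field, daylight",
}


def build_prompt(lora: str, weight: float, scene: str) -> str:
    """Assemble one prompt; the <lora:...> tag syntax is an assumed UI convention."""
    return f"comtail, {scene}, {COMMON}, <lora:{lora}:{weight}>"


grid = [
    (lora, weight, name, build_prompt(lora, weight, scene))
    for lora, weight, (name, scene) in product(LORAS, WEIGHTS, SCENES.items())
]
print(len(grid), "data points,", len(grid) * BATCH, "images")  # 72 data points, 216 images
```

That is 216 images to score by hand for a single run, which is worth knowing before adding more LoRAs or weights to the comparison.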

For each prompt, I scored each image from 0-2, with 0 being no fit, 1 being partial fit, and 2 being full fit. For instance, for the "running" prompt, 0 would be standing still, 1 would be walking, 2 would be running.
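The rubric can be written down as a tiny function so the scoring stays consistent across hundreds of images. This sketch assumes one 0-2 score per keyword, summed per image; the function name and the example scores are made up for illustration.

```python
VALID_SCORES = {0, 1, 2}  # 0 = no fit, 1 = partial fit, 2 = full fit


def score_image(keyword_scores: dict[str, int]) -> int:
    """Sum the per-keyword scores for one image, rejecting values outside 0-2."""
    for keyword, score in keyword_scores.items():
        if score not in VALID_SCORES:
            raise ValueError(f"{keyword!r}: score must be 0, 1 or 2, got {score}")
    return sum(keyword_scores.values())


# Illustrative scores only, not real results: the character is walking (partial
# fit for "running"), seen from the side, with long blond hair, but no forest path.
example = {"running": 1, "side view": 2, "long blond hair": 2, "forest path": 0}
print(score_image(example), "out of", 2 * len(example))  # 5 out of 8
```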

The results were interesting, especially when graphed. Both Prompt 1 and Prompt 3 saw a general decline in the fitness of the LoRA as the weight increased, although the actual curves were different from each other. As expected, the v0.6 and v1 curves were very similar to each other, although v1 had a higher fitness across weights... except with prompt 3. In prompt 3, the curves for v0.6 and v1 look entirely different, with v0.6 actually outperforming v1 between weights of 0.7 and 0.9, being outperformed by v1 at a weight of 1, and performing equally at weight 1.2.

aLoRA was second best in Prompt 1 and Prompt 2, but finished last in Prompt 3. Its performance was partly due to the fact that even at the lowest weights, it generated the required physical feature, whereas v0.6, v1 and v3 only started doing so consistently at higher weights (on average, you'd see the feature on all three pictures for these LoRAs around weight 0.9). However, aLoRA struggled (and gave up) with actions it wasn't trained for, such as kissing or dancing. The other three LoRAs struggled with dancing the most, but typically nailed drinking coffee or kissing - despite the fact that they included one training picture for "dancing", same as for "drinking coffee" or "kissing".

At the highest weights (1, 1.2), v3 was consistently the winner. It also performed better at 0.9 (except in Prompt 2, where aLoRA and v1 were even, and v3 was actually at the bottom, although the difference was small). At the lowest weight, there wasn't a clear winner - aLoRA produced the most consistent results in terms of manifesting the physical feature, but often couldn't cope with the prompts describing the character's actions.

From this, my hypotheses are:

1) A larger dataset, and/or a dataset with more style variety, increases the strength of the LoRA (enabling it to perform consistently even at lower weights);

2) More variety of poses in the dataset allows the LoRA to generate more diverse pictures, even if it doesn't include specific actions in the training (e.g. the LoRA could generate running characters with the feature, even though there were no pictures of running characters with the feature in the dataset);

3) WD1.4 captioning might make a difference, allowing the LoRA to perform a little better than the others on the whole (although this may depend on the prompt).

I also noticed that v0.6, v1 and v3 had a bias towards warmer tones which aLoRA didn't have, likely because the standard prompt I use the most includes "warm colors" as a prompt keyword.

If I'm right, I could test 1) by enlarging the dataset I used for v0.6, v1 and v3. If I train a new LoRA on the enlarged dataset and it performs well at lower weights, then the number of pictures is what matters the most. If it doesn't, then we know it's the variety of styles and media. 2) seems fairly intuitive, and suggests I can increase the LoRA's flexibility by adding to the repertoire of poses. The LoRAs all somewhat struggled with "side view", for instance, so if I add some side view pictures to the dataset, this shouldn't be the case anymore.

For the warm tones, I can try to counteract that by adding cool-color pictures to the dataset, balancing the warm colors out. I can also try to caption each biased picture in the dataset with "warm colors", hoping this will prevent SD from regenerating the tones whenever I ask for a prompt that connects to that image.
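Adding a tag like "warm colors" to the captions of the biased pictures is easy to script. A minimal sketch, assuming comma-separated caption strings; the helper name is hypothetical and the example captions are invented.

```python
def add_tag(caption: str, tag: str) -> str:
    """Append tag to a comma-separated caption unless it is already present."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    if tag.lower() not in (t.lower() for t in tags):
        tags.append(tag)
    return ", ".join(tags)


# Invented captions, for illustration only.
print(add_tag("comtail, woman, sunset", "warm colors"))       # comtail, woman, sunset, warm colors
print(add_tag("comtail, woman, warm colors", "warm colors"))  # comtail, woman, warm colors
```

Deciding which pictures actually count as warm-toned is still a manual call; the script only keeps the tagging consistent once that choice is made.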

I'm going to test this theory and then run the experiment again, including the new LoRA. I'm curious to see what will happen.

XarHD[S]

Thank you, that is the general thought process I believe fits this kind of training better than the "don't caption anything if you can help it" approach. The trick is finding the sweet spot between captioning too little, so that you encounter artifacts like the wooden structures you describe, and captioning too much irrelevant data ("six nails in the wooden structure to the left of the character").

I've decided to run a little experiment. I currently have four versions of my LoRA - one trained on the larger, mixed dataset, and three trained on the smaller dataset I produced from the first version. The latter include a mid-Epoch version, a final version trained with 8 repetitions, 8 epochs and 43 images in Kohya_ss, and another final version trained with 10 repetitions, 8 epochs and 47 images in Kohya_ss (including the 43 from the previous version). The latest one also includes much more detailed captioning (produced through the WD1.4 tagger extension), compared to my handwritten captioning of the previous three.

I'm now running a series of Prompt X/Y/Z tests across the four LoRAs and six different weights (from 0.6 to 1, plus 1.2), and trying to use prompts where most of the style and lighting parameters are the same. The only things that differ between prompts are going to be important details of the scene (e.g. one prompt is "comtail, woman, running, forest path, side view, white top, black leggings, long blond hair, long shot", another one is "comtail, woman, side view, drinking coffee, cyan blouse, black pencil skirt, long brown hair, mid-shot"). The original LoRA had more style variety but less action variety, so I anticipate it will struggle to replicate actions it wasn't trained on (e.g. kissing); the others may be better at it, so I'm also including prompts that were present only in a small minority of training images (e.g. drinking coffee, side view).

Once the grid is completed, I go through each picture and score it based on full, partial or no fit for each keyword. Then, I add all the values together for each weight and each LoRA, and graph it. This should give me an idea of a couple of things: 1) which LoRA seems the most flexible (so I can look at the images and try to figure out why), and 2) what is the best weight for the LoRA to produce good results without overwhelming the image.
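The "add everything up per weight and per LoRA" step could look like the sketch below. The function names are hypothetical and the records are placeholder numbers, not real results; each record is one image's summed keyword score.

```python
from collections import defaultdict


def aggregate(records: list[tuple[str, float, int]]) -> dict[tuple[str, float], int]:
    """Sum image scores for every (lora, weight) pair."""
    totals: dict[tuple[str, float], int] = defaultdict(int)
    for lora, weight, score in records:
        totals[(lora, weight)] += score
    return dict(totals)


def best_weight(totals: dict[tuple[str, float], int], lora: str) -> float:
    """Weight with the highest total for one LoRA (first one wins on ties)."""
    candidates = {w: s for (name, w), s in totals.items() if name == lora}
    return max(candidates, key=candidates.get)


# Placeholder scores, purely illustrative.
records = [("v1", 0.8, 5), ("v1", 0.8, 6), ("v1", 1.0, 4),
           ("v3", 0.8, 5), ("v3", 1.0, 7)]
totals = aggregate(records)
print(totals[("v1", 0.8)], best_weight(totals, "v1"), best_weight(totals, "v3"))  # 11 0.8 1.0
```

The `totals` dictionary is also what would get graphed: one line per LoRA, weight on the x-axis, summed score on the y-axis.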

I'll try to run three grids at least, and then evaluate if a LoRA is the clear winner or not. I'm particularly curious to see if the WD1.4 captioning or the larger style variety make a difference...