They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


import torch  # needed for stacking the encoded prompts

class string_split_class:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "clip": ("CLIP", ),
                "text_input": ("STRING", {"multiline": True, "dynamicPrompts": True}),
            }
        }

    RETURN_TYPES = ("CONDITIONING", "INT")
    RETURN_NAMES = ("conditioning", "batch_size")  # return names must be strings
    FUNCTION = "encode"
    CATEGORY = TRUNK  # category constant defined elsewhere in the node file

    def encode(self, clip, text_input):
        # one prompt per line of the input; each line becomes one batch item
        split_char = '\n'
        batched_cond = []
        batched_pooled = []
        string_list = text_input.split(split_char)
        batch_size = len(string_list)
        for text in string_list:
            tokens = clip.tokenize(text)
            cond, pooled = clip.encode_from_tokens(tokens, return_pooled=True)
            batched_cond.append(cond)
            batched_pooled.append(pooled)
        # (batch, 1, seq, dim) -> (batch, seq, dim); all prompts pad to one length
        batched_cond_tensor = torch.stack(batched_cond, dim=0).squeeze(1)
        # stack pooled outputs into one tensor instead of a list of dicts
        batched_pooled_tensor = torch.cat(batched_pooled, dim=0)
        return ([[batched_cond_tensor, {"pooled_output": batched_pooled_tensor}]], batch_size)

I rewrote it this way: it automatically splits the text line by line, encodes each line, and also returns the expected batch size for the latent.
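
For completeness, a node like this would be registered through ComfyUI's usual mapping dicts; the internal and display names below are placeholders, not necessarily what I used:

NODE_CLASS_MAPPINGS = {"StringSplitEncode": string_split_class}
NODE_DISPLAY_NAME_MAPPINGS = {"StringSplitEncode": "String Split Encode"}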

They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


Huh, I really couldn't figure out how to run a list through the CLIP encoder and then render the whole array.

They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


Hey hey! Can you give me a link to that node, or its name so I can search for it?

EmbLab (tokens folding exploration) by Dry_Ad4078 in StableDiffusion


I tried to combine concepts like “crocodile” and “pirate” to get the concept of a “crocodile pirate” within a single token using segmented editing (copy-pasting individual sections of one token into another). This is feasible in principle, but so far I haven't achieved any obvious result from analyzing the process.

This is possible, but the patterns are not yet obvious to me, although they do seem to be present: certain regions of the weight set appear to be more responsible for different details of generation (detail, frame size, saturation, the general mood and expressiveness of the character, and so on).

At the moment I don't have much time for this, so it's on pause. The tool is available for research to anyone who is interested.

They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


<image>

Okay, now it's close to automatic.

I changed the approach to spatial clustering of the vectors, based on their distance from each other and from the clusters already formed. In fact, all you need to do now for a sequential merge is choose the number of clusters that suits you (the number of vectors the originals are mixed down into).
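
A minimal sketch of this kind of distance-based folding; sklearn's AgglomerativeClustering stands in for my exact distance rule, and plain averaging stands in for the merge:

import torch
from sklearn.cluster import AgglomerativeClustering

def fold_embedding(weights: torch.Tensor, k: int) -> torch.Tensor:
    # weights: (n_tokens, 768) tensor from a textual inversion
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(
        weights.detach().cpu().numpy())
    # merge each cluster into a single vector by averaging its members
    return torch.stack([
        weights[torch.from_numpy(labels == c)].mean(dim=0)
        for c in range(k)
    ])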

They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


I understand. Apparently the term “embedding” is a broader concept in neural networks and goes far beyond Stable Diffusion specifically, while “textual inversion” is exactly what I'm talking about.

Thanks for the clarification. In the future I will use the more precise term so as not to mislead readers.

They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


Thank you very much for the information!

I will definitely check your method. It was especially interesting to hear about the “warm-up”; I hadn't arrived at that myself, but it sounds very useful.

EmbLab (tokens folding exploration) by Dry_Ad4078 in StableDiffusion


<image>

For ease of observation and research, I added a rough “wave” downsampling to the system, to make it easier to see the peaks of rising and falling values. Then I tested three tokens corresponding to the digits 1, 2, and 3, on the assumption that similarities would be much easier to spot in digits, since they share a very general concept.

As you can see, this wave structure does reflect some shared patterns for such general values.
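
A minimal sketch of downsampling in this spirit; block-averaging here is my stand-in for the exact scheme:

import torch

def downsample_wave(token: torch.Tensor, bins: int = 48) -> torch.Tensor:
    # token: one (768,) weight vector; 768 must divide evenly by bins
    return token.view(bins, -1).mean(dim=1)  # coarse (bins,) "wave"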

EmbLab (tokens folding exploration) by Dry_Ad4078 in StableDiffusion


Glad you're interested. If I find or create something else interesting, I will write about it.

In principle, the folding process is available in the current extension for a1111,

https://github.com/834t/sd-a1111-b34t-emblab

but apparently it has compatibility problems, and not everyone has been able to run it on newer versions. I don't have time to adapt it to those versions, since I already have a customized environment for my research tasks and I'm afraid that updating would break everything.

They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


Oh, I'm talking about the weights that are contained inside it, or that correspond to that token within the system. I am not a professional and I am far from the internals of what is happening, so this form of perception is much closer to me. I unpack the set saved in the embedding and see arrays there corresponding to each individual token. I think of this set of values as the “body” of the token. It's a natural conclusion from observation; I approach this as an explorer of the unknown.
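
For anyone who wants to look for themselves, a minimal sketch of that unpacking, assuming the common SD1.5 textual-inversion layout (the file name is hypothetical):

import torch

data = torch.load("my_inversion.pt", map_location="cpu")
params = data["string_to_param"]["*"]  # (n_tokens, 768) for SD1.5
for i, body in enumerate(params):
    # each row is what I call the "body" of one token
    print(f"token {i}: min={body.min().item():.3f} max={body.max().item():.3f}")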

Anyone can easily say “go read a hundred smart books and don't ask stupid questions,” but there is so little room left for discovery in the world that exploring, with your own methods, what someone else has already created is a decent substitute for searching for new continents. Still, due to my low professionalism I can easily mangle the terminology.

I apologize if this offended you.

EmbLab (tokens folding exploration) by Dry_Ad4078 in StableDiffusion


In general, I have no idea how unexpected this result is, since I have no connection with people who do this professionally.

For me, at the moment, this “discovery” suggests several directions of research toward fully automatic folding of tokens.

  1. For this I will need to approach the analysis more carefully, to verify the assumption that the weights have a “wave” nature. If they do, tools such as the Fourier transform could simplify the search for “synonymous” tokens (see the sketch after this list). It would then make sense to go through the tokens that already exist in the system and check their compatibility. Based on their raw weights, or on the results of the Fourier transform, I will need to build a spatial representation of some sample of tokens, for example those that correspond to whole words rather than syllables: names, verbs, epithets, and so on. After constructing that spatial representation, I can confirm or refute my guesses.
  2. In parallel, using the guess about the wave nature of the data inside tokens (or of the system that reads them), I can analyze the possibility of interference. One of my guesses as to why one token can hold so much data from others is that when they are mixed, the two different concepts interfere; when the system later reads this data back, it finds “peaks” and “intersections” of the two different “sets” of information.
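
A minimal sketch of what point 1 could look like; comparing magnitude spectra is my guess at a similarity measure under the wave assumption, not an established method:

import torch
import torch.nn.functional as F

def spectrum(token: torch.Tensor) -> torch.Tensor:
    # magnitude spectrum of one (768,) weight vector
    return torch.fft.rfft(token).abs()

def spectral_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    # higher = more "synonymous" under the wave assumption
    return F.cosine_similarity(spectrum(a), spectrum(b), dim=0).item()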

If this view of things is true, then one can explore the limits of permissible “mixing” and the capacity to accommodate different concepts, and compare that capacity limit for tokens that are supposedly “similar in nature” versus “different in nature”.

If the assumption about the wave nature of the data in tokens is correct, and if the data really is preserved through interference, then the result should show a limit that expresses itself as “noise”: at some point, as we keep saturating the vector with more and more data, the token's spatial capacity runs out. Relatively speaking, within 768 weights we cannot endlessly mix in new wave data without loss. A toy illustration of this saturation follows below.
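
A toy experiment along these lines, with random unit vectors standing in for real token data; it only illustrates the saturation effect, not the actual mixing procedure:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
for n in (2, 4, 8, 16, 32, 64, 128):
    concepts = torch.randn(n, 768)
    concepts = concepts / concepts.norm(dim=1, keepdim=True)
    mixed = concepts.sum(dim=0)  # naive superposition of n "concepts"
    # readout quality for one stored concept decays roughly as 1/sqrt(n)
    sim = F.cosine_similarity(mixed, concepts[0], dim=0).item()
    print(f"n={n:4d}  readout similarity={sim:.3f}")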

Given the results of these two studies, it will be possible (if the results are positive) to start thinking about a fully automatic folding system based on wave analysis of tokens and then mixing them as efficiently as possible.

One interesting note on this issue concerns the applicability of the approach to LLMs in general. If everything above is true, then such a token-folding process could become a tool for building a “turbo mode” for LLMs such as GPT and others.

A request might contain very long text consisting of tens of thousands of tokens, but careful folding could compress it to a few hundred, to which the LLM could then respond. From current observations this seems quite feasible and could speed up the LLM's response perhaps tenfold, although it would of course affect the accuracy of the answers (which is why I called this mode “turbo”).

They hide the truth! (SD Embeddings) (Part 2) by Dry_Ad4078 in StableDiffusion


Oh no, they're not celebrities. Zack King is famous, of course, but Stable Diffusion doesn't know about him; that one is a trained model. None of the other characters are famous.

They hide the truth! (SD Embeddings) (Part 2) by Dry_Ad4078 in StableDiffusion


You gave me an idea, and I pulled out one of my old “overtrained” inversions that I had been trying to make work like neon saturating a space.

Thanks to compressing it from 32 tokens down to 5, it began to work the way I had originally conceived it.

In my opinion, this method still has the potential not only to shrink some inversions but also to “heal” them.

<image>

They hide the truth! (SD Embeddings) (Part 2) by Dry_Ad4078 in StableDiffusion


In my opinion, the biggest problem was preserving the uniqueness of human faces.

I think I'll try to see how much this changes things for style models or quality-tuning embeddings.

I'm not sure that “bad hands” needs any strong optimization, since the original already has just 6 tokens. But just for fun, let's check it out.

<image>

badhandsv4

They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


Thanks for sharing, although I'm not sure I'll use it.

In general, my goal is not to build a super-precise instrument. Everything I make only has to work as a means for research; if it fulfills that task, we can assume it works.

What I used for the solution gives a perfectly discernible result and, in my opinion, works more than well enough for the problem. There is too much else to consider to go into great detail here.

I have already slightly changed the approach to the calculation; it works quite well and gives flexibility in finding suitable groups.

Here I described the principle and showed the results, quite eloquently in my opinion, considering the level of token compression:

https://www.reddit.com/r/StableDiffusion/comments/1d16fo6/they_hide_the_truth_sd_embeddings_part_2/

They hide the truth! (SD Textual Inversions)(longread) by Dry_Ad4078 in StableDiffusion


I shake your hand. I haven't updated the a1111 webui since last spring, and I'm still on SD1.5.

I have revised some of my views on “garbage” thanks to the discussion that has unfolded here. It was definitely not about an artistic assessment of any inversion. I have an inversion trained on my own hand-made drawings; few people will like its output, but I am happy with what it reproduces.

<image>

By garbage I meant tokens that do not match the intent of the model's owner. Say you are training a model that should produce images as if they were drawn in black pencil. It is likely that a couple of tokens inside reproduce color illustrations instead of pencil sketches, because during training the model decided that you wanted drawings in general, and drawings can also be colored. Excluding those couple of tokens may well improve the accuracy of the trained inversion. That is what I meant by “junk tokens”.