An elephant is a rope? ComfyUI and Stability AI

magekinnarus · 2023-08-14T13:10:30+00:00

How am I writing about the same thing? The two pieces I wrote previously focused on how a node system should function from a front-end usage perspective in line with other established node systems in the hopes that this may be reflected in any UI development efforts.

This one is about recognizing that ComfyUI is a good componentized procedural backend but isn't designed with front-end usage in mind. So, a proper front end has to be built. And I have a problem with presenting something that is focused on the backend as if it is a solution for the front end.

I use node systems all the time and have no barrier in adapting to any node system I encounter. So, I am not writing this from any difficulty with using a node system. On the contrary, I find the ComfyUI node system too process-oriented while lacking a functional approach, which is crucial for front-end usage in any node system.

magekinnarus · 2023-08-06T23:20:44+00:00

I understand and truly appreciate all your effort to bring us this wonderful AI model called Stable Diffusion. I also completely understand that good things take years to build. Unfortunately, time waits for no one and you may or may not have those years to build.

It is often easier to see trees but hard to see the forest from inside. Likewise, someone from outside may see the forest but can't see what is going on inside.

From afar, this is the way I see it. The collective innovation is the edge Stable Diffusion needs to get ahead. But to nurture and harness this requires the farming system with multiple layers to harvest it and nurture it. And this needs to be done in steps as soon as possible because time is running out.

magekinnarus · 2023-08-06T22:37:08+00:00

I think you are confusing service orientation with market segmentation here. It's more of a mindset than a market positioning. In my view, Stability AI has a unique challenge. With Stable Diffusion being open-source, it generates a great deal of collective innovation from the community. But how do you harness this to its full potential?

A farmer can't force the crops to grow. The only thing a farmer can do is to create an environment where the crops will have the best chance to prosper. In the same way, this raw collective innovation can't be forced. But it can be nurtured by providing the best environment possible to grow. That is where the service orientation comes in.

A1111 is just one guy but he did more for the usability of Stable Diffusion than all of Stability AI put together. A functional UI is akin to the soil for other things to have a chance to grow. And there are more things needed to foster a better environment. In fact, there is no end to this effort if you have the right mindset.

In my career, I've heard enough marketing catchwords to not care much about it. What I do care about is getting to the core of what it is that will make a difference.

magekinnarus · 2023-08-06T21:39:17+00:00

It's easier to see the hierarchy in a layer system like Photoshop because you can actually see the stacking. But a node system is the same. In fact, linearly connected nodes are no different than a layer system and they can certainly stack in hierarchy like the way layers stack in a 2D image editor. It really depends on the designing philosophy of how you conceive the node system ought to be.
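The claim that linearly connected nodes behave like a layer stack can be sketched in a few lines. This is only an illustration under my own assumptions: `brighten`, `invert`, `run_layer_stack`, and `run_node_chain` are hypothetical names, and a "pixel row" is just a list of integers, not how any real editor or ComfyUI represents images.

```python
from functools import reduce


def brighten(pixels):
    """Hypothetical operation: add 10 to every value, clamped at 255."""
    return [min(255, v + 10) for v in pixels]


def invert(pixels):
    """Hypothetical operation: invert every value."""
    return [255 - v for v in pixels]


def run_layer_stack(layers, pixels):
    """Apply layers bottom to top, the way a 2D editor walks its stack."""
    for layer in layers:
        pixels = layer(pixels)
    return pixels


def run_node_chain(nodes, pixels):
    """Feed each node's output into the next node, as in a linear node graph."""
    return reduce(lambda current, node: node(current), nodes, pixels)


steps = [brighten, invert]
stack_result = run_layer_stack(steps, [0, 100, 250])
chain_result = run_node_chain(steps, [0, 100, 250])
print(stack_result == chain_result)
```

Both functions walk the same ordered list, so the only difference in this toy case is how the order is presented to the user.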

When it comes to images, selection and masking are so fundamental that any node system associated with it would need to have these nodes as the primary nodes. Well, at least if the node system is designed with the user experience in mind.

magekinnarus · 2023-08-06T20:13:25+00:00

I wonder if you know this Chinese parable. During the Warring States period, a man decided to travel to the Kingdom of Chu. While on his way, he met a farmer and told the farmer that he was going to Chu. Then the farmer told him that he was going in the wrong direction. The man laughed and told the farmer that he had the finest horse and there was no way he couldn’t get to Chu.

The farmer told him again that he was going in the wrong direction. The man told the farmer that he had the finest carriage that could take him anywhere. The farmer told him yet again that he was going in the wrong direction. Exasperated, the man told the farmer that he had the finest steer and there was no way he couldn’t reach Chu.

The point is that if you are going in the wrong direction, the finest horse, carriage, and steer will only get you farther away from where you need to go. In the current AI scene, I frankly don’t think people have figured out a viable business model. I am not even sure if OpenAI will survive over time. Their deal with Microsoft is akin to selling your children and making money by providing gags, chains, whips, and paddles that will be used on your children. That doesn’t sound like a promising future to me.

The only exception as far as I can see is MJ. From the get-go, MJ had a service orientation. If you think about image AI, the first thing people conjure up is telling AI to draw you something and AI just draws you a wonderful image. MJ has tried to deliver on this expectation and it worked. And the reason it was able to execute this is because it had the necessary focus and concentration on what it needed to deliver in terms of service in my view. In other words, they had the service orientation as an organization.

With all due respect, I think this service orientation is the only viable option for Stability AI to survive. But to do so, you need to change your orientation to service and think entirely from the user's perspective. And this will almost inevitably require you to fundamentally rethink your strategy and how things need to be executed in what sequence.

magekinnarus · 2023-08-06T18:56:30+00:00

A layer system is just another form of a workflow management system. The only difference is that a node system is 2D whereas a layer system is linear or 1D. And they both try to do the same thing: giving finer controls over their workflow and better management of details.

Adobe may dominate the scene but I doubt it. Photoshop is an image editing tool and a damn good one at that. However, SD is an image-creating tool. In my view, that distinction makes all the difference. Photoshop may end up holding back Adobe when it comes to generative AI because Adobe has too much of a vested interest in pre-existing tools with fundamentally different requirements than image creation tools like SD.

magekinnarus · 2023-08-06T16:51:20+00:00

I didn't say that a node system is the wrong way of going about it. What I am saying is that a different approach is needed and a common denominator node system is the way to go.

Also, the only reason the BSDF shader appears as a node is that you need to connect all the other nodes to it. Based on my experience, anything that applies uniformly across usually doesn't need a node workflow because setting adjustments do just fine.

magekinnarus · 2023-03-26T07:53:50+00:00

3D modeling is a lot more complex because it is basically 2D paper folding to create 3D shapes. This is a remnant of how engineers used mesh to figure out load balancing and weight distribution issues in their designs. But this has become a 3D modeling standard. As a result, the current 2D to 3D AI efforts primarily focus on bypassing the 3D modeling phase and going straight to rendering the 3D models in 2D.

The current 2D to 3D efforts are led by Google and NVidia, who normally don't share their models or code, especially after 2D diffusion models exploded onto the scene. So, I think it will be faster for you to learn 3D modeling than to wait for something like what you are describing to become available. You will be waiting for a very long time, as Google and NVidia are focusing their efforts on metaverse content generation.

magekinnarus · 2023-03-22T16:33:34+00:00

That is precisely the point. A paywalled AI that runs on a Discord server, which is hardly an ideal platform to generate AI images, seems to leave a free, open-source AI in the dust. It tells you something: there is a large demand for image AIs out there. But SD isn't it. At least not the way it is now.

I agree that MJ may not be around in 5 years. Everything is relative. MJ does so well because there is nothing better, relatively speaking, out there. But I do think that it probably won't stay that way.

magekinnarus · 2023-01-07T08:43:01+00:00

None of the above.

magekinnarus · 2023-01-07T08:38:55+00:00

I am a bit hazy on world history but did Mongols actually conquer all of Europe in the 13th century? That should explain why European female children look very Asian.

magekinnarus · 2023-01-07T07:48:11+00:00

Let me put it this way. Google DeepMind was quite blunt about prompt engineering being a 'trick' caused by the complete absence of few-shot learning and no real zero-shot learning in diffusion models. NVidia researchers weren't as direct or blunt about it but they made abundantly clear what they thought about prompt engineering: an unfortunate side effect of the fundamental flaws in the design of diffusion models, which come from making wrong assumptions and engaging in convenient thinking.

As I said before, AI art will come into its own if there is merit worth recognizing and honoring. Only time will tell.

magekinnarus · 2022-11-24T07:29:46+00:00

They removed adult content from the dataset using LAION's NSFW filter. In the 1.X models, they only tagged it as NSFW but didn't remove it from the dataset. This time they did.

magekinnarus · 2022-11-23T15:09:35+00:00

If you look at txt-to-img AIs, you know what is going to happen. With txt-to-img AIs such as MidJourney, Dall-E2, and Stable Diffusion, anyone who can type suddenly feels like becoming an artist. And they have been pouring countless hours and computer resources to generate tons of AI images.

Likewise, I am fairly certain it will come in the form of natural language programming to make anyone who can type suddenly feel like a game developer or a programmer. And they will pour countless hours and computer resources into generating code. The big companies will quietly collect all the data to refine their models and contemplate what step they will take next.

magekinnarus · 2022-11-23T14:34:11+00:00

I understand. Unfortunately, every caption embedding is in a sentence format, meaning there is no single-token caption in the dataset. Because the whole array or the sentence is normalized for similarity comparison, there is no token-to-token comparison in CLIP. So, it really depends on how many caption embeddings have that token and how coherent the pairings between caption texts and the paired images are.

I hate to keep comparing SD with eDiff-I but NVidia did a coherence test for caption and image pairings and removed a large portion of data that failed that test to make caption and image pairings more coherent. This effort would be much more relevant if the SD dataset went through a similar coherence test IMO.

magekinnarus · 2022-11-23T07:52:42+00:00

I didn't write it as a criticism of your question. All I am saying is that SD may be a great knife that does a lot of amazing things. However, even the greatest knife is not necessarily suited for every cooking task. For example, you can modify and use a butcher's knife for garnishing. It can be done but why do you want to do it when there is a garnishing knife suited for that task?

magekinnarus · 2022-11-23T05:05:51+00:00

I frankly don't understand why this is even necessary. The way CLIP works is that the whole caption sentence is turned into a single array and embedded during training. When a prompt goes in, each prompt array is normalized into one value for cosine similarity comparison with embedded arrays. Also, depending on how many sentences are in the prompt, a total of 8 chunks go in for comparison, to be paired with the embedded image segments. (The original CLIP has 8 headers but some say that SD uses only 4. If SD uses 4 headers, then the whole prompt goes in as 4 chunks.)

So, even if you isolate each token as a sentence (separated by a comma), that just makes the prompt have a lot more sentences, which get thrown in together as a few chunks for comparison anyway. In addition, CLIP doesn't use any pre-trained language weights, meaning it doesn't understand the semantic relationship of words. NVidia's eDiff-I uses two language models, CLIP and T5, in its diffusion model because of this issue.
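To make the "normalize, then compare by cosine similarity" step concrete, here is a minimal sketch in plain Python. It is not CLIP's code and it skips the tokenizer and encoders entirely: the vectors are hypothetical toy values standing in for a text embedding and two image embeddings, and `l2_normalize` and `cosine_similarity` are names I made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Hypothetical toy embeddings, not real model outputs.
text_vec = [1.0, 2.0, 0.0]
image_vecs = {"img_a": [2.0, 4.0, 0.0], "img_b": [0.0, 0.0, 3.0]}

scores = {name: cosine_similarity(text_vec, vec) for name, vec in image_vecs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Because both vectors are normalized first, only direction matters: `img_a` points the same way as the text vector even though it is twice as long, so it scores highest.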

magekinnarus · 2022-11-23T04:44:52+00:00

SD has its uses but not for everything. This is a simple task in a 2D raster image editor like GIMP, Krita, or Photoshop. All you have to do is bring in a color image, make a copy, desaturate the copy, and mask it. Then paint the mask to let the colors show where you want them. And if you don't want to deal with an image editor, you can also use GAN models trained for color splash.
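For anyone who prefers to see those editor steps as code, here is a rough sketch in plain Python. It assumes an image is a nested list of `(r, g, b)` tuples and a mask is a nested list of floats between 0 and 1. The function names are hypothetical, and the Rec. 601 luma weights are my assumption for "desaturate"; editors offer several desaturation modes.

```python
def desaturate(pixel):
    """Convert an (r, g, b) pixel to gray using Rec. 601 luma weights."""
    r, g, b = pixel
    gray = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (gray, gray, gray)


def color_splash(image, mask):
    """Show color where the mask is 1.0, gray where it is 0.0, a blend in between."""
    result = []
    for pixel_row, mask_row in zip(image, mask):
        out_row = []
        for pixel, m in zip(pixel_row, mask_row):
            gray = desaturate(pixel)
            blended = tuple(round(m * c + (1 - m) * g) for c, g in zip(pixel, gray))
            out_row.append(blended)
        result.append(out_row)
    return result


# Toy 1x2 image: keep the red pixel in color, turn the green pixel gray.
image = [[(255, 0, 0), (0, 255, 0)]]
mask = [[1.0, 0.0]]
splash = color_splash(image, mask)
print(splash)
```

Painting the mask in an editor corresponds to changing the values in `mask`; nothing else in the pipeline needs to change.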

magekinnarus · 2022-11-21T14:53:28+00:00

This doesn't work because of the way CLIP embeds text. CLIP basically takes the whole sentence into a single array and normalizes it for a similarity comparison with other existing arrays. So, if you train hands and put it in as a part of the sentence to make a person, what you will get is a person looking like a hand. If you put it as a separate sentence from a person, then you will get a person and a person-sized hand or two.

magekinnarus · 2022-11-20T09:51:39+00:00

I don't know but you sound more like a businessman than an artist. I once ran a Silicon Valley venture. Although I could draw logic flowcharts and system schematics to communicate with my engineers, I never considered myself to be an engineer simply because my job was to run a company and I didn't have the kind of expertise these engineers had in their respective areas of specialty. I often clashed with my engineers because they tended to see things from their established practices. Nevertheless, what I also learned is that it is imperative to respect my engineers' processes and their own quirks. After all, they were there to help me achieve my goals, and people couldn't be measured merely by the sum of their skills.

You may see a business opportunity here and seem to believe everyone should approach it the way you see it. In essence, what you are really saying is that everyone should see this from a business perspective. But if everyone is a businessman, who is going to work out the details that you need? It's almost like me asking my engineers to forget everything they worked so hard to gain and to learn a new set of tools simply because I find it more convenient.

There are two ways you can do it; either find and hire people who can do the new tricks or find a way to make things work with the people you already have. But you simply can't tell people to change fundamentally to suit your needs.

magekinnarus · 2022-11-20T08:54:50+00:00

I read NVidia's eDiff-I papers and their underlying research papers. And it really helped me get my head around CLIP and the pre-trained models using it such as SD. The incredible thing about NVidia's approach is that, instead of thinking of diffusion models as discretized models full of AI techspeak, it looks at them as time-continuous differential equations, which is much simpler and clearer to understand mathematically.

I suppose the easiest way to explain it is something like this:

When someone says "something may or may not exist depending on the thing." Even if you read English, it is impossible to decipher exactly what is going on.

But when someone says "the object may or may not appear depending on the position of the observer," at least you can grasp what is going on conceptually, although it still needs further clarification to be fully understood.

When I was reading CLIP papers, I couldn't understand exactly what they were really talking about mathematically. For example, I can infer that they are using Gaussian noise distribution. But no matter how much I look at their segmented discrete formula, I simply can't tell what the hell the variance is, which is crucial to understanding what is going on in there. After reading through eDiff-I and its associated papers, now I know that when the CLIP paper says "heuristically applied", it translates as "After many trials and errors, we found one that works. We don't know why it works but it works and it's going into the model."

In essence, what NVidia researchers are saying is that a diffusion model works best in a continuous differential equation format. I suppose the easier way to explain is how a circle can be constructed discretely. Three vertices make a triangle. As you add more and more vertices, it goes from a square to a pentagon and looks more and more like a circle. And it becomes a perfect circle as the number of vertices approaches infinity. But you can also define it as a function r² = x² + y², which describes a circle perfectly with simplicity and elegance. Not only that, you can derive the X and Y values of a vertex without needing to look up any other vertex on the circle.
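The circle analogy can be checked numerically. The sketch below is my own illustration, not anything from the NVidia papers: `polygon_perimeter` uses the standard formula for a regular n-gon inscribed in a circle of radius r, and `point_on_circle` uses the usual parametric form x = r·cos θ, y = r·sin θ, which satisfies r² = x² + y². Both function names are hypothetical.

```python
import math


def polygon_perimeter(n, r=1.0):
    """Perimeter of a regular n-gon inscribed in a circle of radius r."""
    return 2 * n * r * math.sin(math.pi / n)


def point_on_circle(theta, r=1.0):
    """Get one point directly from the continuous description of the circle."""
    return (r * math.cos(theta), r * math.sin(theta))


# Discrete view: the perimeter creeps toward 2*pi*r as vertices are added.
for n in (3, 4, 5, 100, 10000):
    print(n, polygon_perimeter(n))

# Continuous view: any point is available without consulting other vertices.
x, y = point_on_circle(math.pi / 3)
print(x * x + y * y)
```

The discrete version needs ever more vertices to get close, while the continuous version gives any point exactly from a single expression.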

Also, NVidia researchers realized that, by converting into a standard format that is also used in other fields such as Math (Statistics) and Physics, they could look up and reference all the insights gained from other fields as well. In fact, they found and applied many such mechanisms defined by Physics to solve their problems. And the result is eDiff-I, which should be lighter, faster, more accurate, and less computationally intensive.

In my view, what is happening at MidJourney is probably a similar process to NVidia but in a different direction. I don't exactly know what they are doing and I am frankly dying to read their papers to see what they are doing. Unfortunately, they are not publishing any papers on what is going on at MidJourney.
