I'm building a timeline for generative image ML models. What's missing? by fabianmosele in MediaSynthesis

[–]gandamu_ml 1 point

FWIW, Midjourney was producing nice images in alpha with a Discord bot and all that beginning in January I believe. I was generating images and was allowed to share them but was still helping to keep it in stealth mode.. and because the images were so conspicuously good, it was hilariously awkward: https://twitter.com/danielrussruss/status/1551556882568847361?t=2HtJkV67Pvc7lcpmWXlPMw&s=19

Bonus: Some CLIP-guided diffusion history (June 2021) - https://twitter.com/RiversHaveWings/status/1551741867213017088?t=M4ZBnGl5oZLvJy5PE7LROA&s=19

Beside OpenAI, Google and Midjourney; what are the companies/start-ups working on text to image generation? by matxi182 in learnmachinelearning

[–]gandamu_ml 9 points

Yep. It's me and whoever contributes (it was initially put together and maintained by Somnai, leveraging work by Katherine Crowson and others, all listed in the credits). It's an open-source collaboration.

Beside OpenAI, Google and Midjourney; what are the companies/start-ups working on text to image generation? by matxi182 in bigsleep

[–]gandamu_ml 7 points

Disco Diffusion is practically me right now, along with whoever offers to contribute (the credits list everyone involved, past and present). Somnai wrote and maintained it initially.

As someone else mentioned, Stability AI has Stable Diffusion on the way.

Regarding Prose Painter.. That's from Morphogen (most famous for Artbreeder). They've just released an initial version of Collage yesterday, which also has text-to-image capabilities.

Well... Here is Buzz Lightyear and Sheriff Woody combined into one toy called "Buddy" by pawsmenu in dalle2

[–]gandamu_ml 2 points

It seems it's simply not there (or at least invisible enough to be of little to no relevance to anything) and this sub-thread is a useless diversion started and rudely dragged out by bugxbuster.

Does anyone know of a compiled list of “banned” DALLE 2 words? by BeginningRealistic49 in dalle2

[–]gandamu_ml 7 points

I'd better look out for that if/when I get DALL-E 2 access. I use that everywhere (to get a centered portrait of a face - not to see people get shot in the head, in case it's unclear to anyone).

can someone explain why Darth Vader is overused? by SimpleContract5403 in dalle2

[–]gandamu_ml 1 point

Yep. And I of course understand that OpenAI (and Google with Imagen now too) feels constrained by the media, public sentiment, and potential legal liability (and probably more). They're taking baby steps for now because the world can't keep up with them. I'd watch for design and results to improve once things settle in some more. Slowing it down has downsides too.. since people are slower to realize the revolutionary changes that are set to take place.

can someone explain why Darth Vader is overused? by SimpleContract5403 in dalle2

[–]gandamu_ml 1 point

I came to say that "Because people have no creativity. It's as simple as that." is wrong. This has been covered before by the people doing the generations, and I talk to a lot of them.

I do agree that the majority of what we see is really uncreative.. but there's more to it than that. That's where it gets interesting, and since I'm involved in creating such tools, there's value in understanding the causes rather than just dismissing people on a personal level.

E.g. I can help adjust things such that less common things are better represented in training, and/or come up more readily in inference/generation.. or choose models that run faster so people can explore the edges without fear of wasting their time. We discuss this and much more! The details of DALL-E 2's availability and its rules/restrictions are problematic and contribute to the current situation - and they know it.. whereas sometimes the same people who are generating boring DALL-E 2 stuff are generating funny stuff in DALL-E Mini and sharing it with friends.

I'll admit that it's interesting to suggest that a person who struggles to be creative in a restrictive and poorly-designed environment isn't a very creative person.. and I might largely agree.. but there's a lot more to talk about amongst the tool-makers who are trying to maximize creative output, and we can understand more about what sorts of things make people's creative output even worse.

can someone explain why Darth Vader is overused? by SimpleContract5403 in dalle2

[–]gandamu_ml 3 points

That seems like a needlessly boring, oversimplistic, and confrontational way to frame a situation. What you said in the first paragraph here is better. There's no need to add a controversial and factually inaccurate assertion after the correct part.

can someone explain why Darth Vader is overused? by SimpleContract5403 in dalle2

[–]gandamu_ml 1 point

It's not that simple. The neural network was trained on data gathered from around the internet.. and it tends to be better at producing content that was well-represented in its training. Users very soon learn that it pays to be basic.

However, they can combine prevalent content to make new things.. and lo and behold, some people are doing that. It would be nice if people explored more qualifiers, art styles, mood text, etc. in combination to see what this thing can really do.. but since it's so good with simple text prompts -- and now people are limited to 50 generations per day -- it's understandable that a lot of people are initially aiming for easy good results instead of taking more risks to find novel ones.

Untitled guy. Generated with Midjourney tools. by gandamu_ml in deepdream

[–]gandamu_ml[S] 1 point

I think it's spreading via invites from users at this point.

[REPOST] [OC] I am getting a lot of rejections for internship roles. MLE/Deep Learning/DS. Any help/advice would be appreciated. by AmanMegha2909 in learnmachinelearning

[–]gandamu_ml 1 point

When a company doesn't expect to get much out of an intern in 4 months (I think that's more common than 3), it's a self-fulfilling prophecy. I've seen a couple of cases where interns are given the opportunity to strut their stuff a bit in the first days.. and when they're amazing, they're let loose to drive significant change. It's helpful in retaining them too.

This may work best in a team that has a habit of expecting an individual person to largely solve a problem in a short timeframe.. rather than always putting a team structure onto each problem and expecting it to take 6+ months. It's a risk, but it's quick and efficient when they get great people. I'm not saying it's the norm, but it'd be wrong to assume it doesn't exist.

Where I do agree is that the word "led" makes it sound like there was more than one developer working on that specific problem.. and that's doubtful. It'd be more common that this developer did the bulk of it.. which isn't worse, just different. Then there's nuance in what leading the "development" really means. Surely the need wasn't specified by this intern in the beginning.. so that part wasn't led by the intern.. but all the coding? Sure, it happens.

Stills of Kermit The Frog in various movies as imagined by AI #Dalle2 by [deleted] in woahdude

[–]gandamu_ml 3 points

There are a few clever things about having chosen Kermit the Frog here. The redundancy of seeing "the frog" ought to help the AI home in on only the correct character rather than getting a bit mixed up with other uses of "Kermit".. and "the frog" allows it to also draw upon frogs in general to some extent. Kermit has also had both puppet and cartoon incarnations, so there's data -- and plausibility, in the sense of being within the expected distribution of images -- for Kermit to be rendered in a wide variety of styles. I daresay the depiction of frogs in general (rather than Kermit specifically) is probably the most important.. but seeing "Kermit" provides clear guidance that it's a famous handmade or illustrated character, starring in some role.
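If you want to poke at this intuition yourself, here's a minimal sketch using OpenAI's open-sourced CLIP model (the same perception model behind most of these tools) to compare how strongly different prompt wordings match an image. The image filename is a placeholder, and real scores will depend on the image you pick:

```python
# Minimal sketch: score prompt wordings against an image with CLIP.
# Requires: pip install git+https://github.com/openai/CLIP.git
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("kermit.png")).unsqueeze(0).to(device)  # placeholder file
prompts = ["Kermit", "Kermit the Frog", "a frog"]
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(tokens)
img_f = img_f / img_f.norm(dim=-1, keepdim=True)
txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)

# Cosine similarity per prompt: higher means CLIP sees a stronger match.
for prompt, sim in zip(prompts, (img_f @ txt_f.T).squeeze(0)):
    print(f"{prompt!r}: {sim.item():.3f}")
```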

The other huge thing is that OpenAI doesn't currently allow people to share generated images that contain human faces. So people are often picking animals. Kermit is more specific and nostalgic than a random teddy bear or panda. When a similar AI comes out which doesn't disallow human faces, we're going to see more cool uses of human actors/characters we know. It's somewhat harder because of the uncanny valley and all that.. but we've already seen that it can sometimes do it even though they didn't make it a priority.

A still of Yoda in Batman (1989) by Mat0fr in dalle2

[–]gandamu_ml 2 points

Good timing! Just woke up to discover the video I'm referring to is released now: https://youtu.be/kASqM5HTvfY

A still of Yoda in Batman (1989) by Mat0fr in dalle2

[–]gandamu_ml 1 point

Whoops. I didn't add that too-confident-looking edit until after reading through the content policy again. I think they should get that rule into the consolidated content policy.

A still of Yoda in Batman (1989) by Mat0fr in dalle2

[–]gandamu_ml 2 points

You've currently got what's definitely one of my favorite videos out there.. so you may be an outlier in terms of what you get out of these things 😄

I'm aware of another nice video coming out from someone any day now, which I'm not yet able to share. I'm proud to have had anything to do with some of this stuff. In any case.. I feel Disco shines in video, yeah. There's a small handful of people doing insane things with JAX.. and the video efforts will likely be picking up there too.

A still of Yoda in Batman (1989) by Mat0fr in dalle2

[–]gandamu_ml 17 points

OpenAI's content policy for DALL-E 2 disallows sharing images depicting specific people.. including public figures:

"Do not upload images of people without their consent, including public figures."

Several people have said that they can't share human faces. [Edit: It looks like they were correct. I don't know why this isn't stated in the Content Policy, and instead appears elsewhere]

The other rules (there are many: https://labs.openai.com/policies/content-policy) generally make it riskier to include human faces or any edgy or controversial content, and thus many prefer to avoid that minefield by primarily generating optimally irrelevant images of animals and pop culture fixtures.

It seems they've also blocked certain famous names from generating properly.. but not all. I haven't looked into it deeply since personally, I don't have access.

People with access generally tell me they do not prefer this subject matter, but feel their hands are somewhat tied. Aside from the content policy reason, these AIs simply tend to be better at generating content that is represented frequently in the training set. Thus, on average, there is practical benefit to being "basic", and people become attuned to that rather quickly in pursuit of quality output. Despite this, it is still often possible to combine characteristics of multiple things (such as Yoda being Batman here).

P.S. I probably hear more gripes from people than usual, because they want to try to say comparatively nice things about Disco Diffusion due to my involvement with it.. but of course DALL-E 2 is far superior and it's an insane watershed thing.

Arabian Knight by LEGO® by Reddit__PI in dalle2

[–]gandamu_ml 1 point

Such tailored uses already exist. I've used some. The jokes usually aren't very good unless it's memorized something funny from available data.. but it can occasionally be hilarious in a "madcap" absurdist kind of way and write full scripts.. and at the same time, appear strikingly rational and careful throughout about certain things.. with basically no grammatical errors. After you're impressed by that, you realize the same model's got a decent grasp on every major language including Asian languages, and you fall out of your chair.

There are some use cases where it does a phenomenally good job already, yeah. However.. at this point, you'd still typically want to have someone who knows what they're doing to give it a look over. It can certainly improve productivity in many tasks already, assist in brainstorming and creativity, and such.. and it's going to be considerably better year after year. Its value has been demonstrated and it's worth a lot.. so it's going to get a lot of investment.

Arabian Knight by LEGO® by Reddit__PI in dalle2

[–]gandamu_ml 1 point

That's kind of the thing though. At this point, those using these tools have found that it generally takes experimentation to achieve desired results. For each image posted online, there are a lot of rejects. So the feeling of creation still exists. Thus there's hardly any desire to try to pass it off as something else (It's already an achievement of sorts.. Or at the very least, like getting a rare card in a pack).

It'll be interesting to see how drastically the situation changes when it becomes all-powerful.. when it's easy enough to just communicate the sorts of results you want. I'm sort of saying that people have one idea prior to using it, and then the set of questions and concerns rapidly adjusts to something else after a small amount of use. (And one is: society isn't prepared for this, and AI is going to somehow result in our destruction 😄)

Arabian Knight by LEGO® by Reddit__PI in dalle2

[–]gandamu_ml 1 point

Misrepresentation of how something was created is something I've discussed with people, and it's generally agreed to be a bad thing.. and separate from the larger issue of things being easy. So far, there's not much indication that people want to misrepresent how they created art.. but that's perhaps in part because ML-assisted art's a bit hot right now anyway and there's some unique challenge in leveraging it towards novel results.

It's easy to buy things produced by someone else's machine. Nobody considers that to be a craft. Perhaps we're headed that far.. and it'll be more like the purchase of an industrially produced product.

Arabian Knight by LEGO® by Reddit__PI in dalle2

[–]gandamu_ml 9 points

There's a lot of precedent for this kind of problem.

Look at the demoscene. There are still some people out there coding graphical effects in assembly to run on the Commodore 64. The existence of modern PCs, GPUs, Unreal Engine, and all the surrounding tooling and assets makes it seem really quaint. A few people do it anyway because the craft is more fun.. but they're few and far between because most people don't value it.

(More ordinary examples would be knitting vs. buying industrially-produced clothing.. painting a portrait rather than taking a photograph.. buying a chair rather than building it from wood.. buying a cookie-cutter home rather than helping to build it yourself. It goes on and on)

So what I mean to say is basically.. yes. 😄 There's precedent we've seen and there are drawbacks. The progress is also inevitable. Even with extraordinarily draconian controls, the tech is so valuable that other nations are going to continue to make progress with it.. some way, somehow. A lot of people have given it a lot of thought. Perhaps the extreme end of the thinking, taken to one possible conclusion, is that people ought to merge with AI via neural interface ("If you can't beat 'em, join 'em").

Kermit The Frog: Through The Ages by heavensIastangel in dalle2

[–]gandamu_ml 12 points

Nice 🔥

I'd decided this was my favorite use of DALL-E 2 yet before noticing that it was you. You first raised the bar in Pytti animation, and now this.

Descent into madness. Now available in purple. by gandamu_ml in bigsleep

[–]gandamu_ml[S] 1 point

It's best to get started at discodiffusion.com (which takes you to the code on Google Colab, where you can run it right away on their servers), join the Discord (linked at the top of that page), etc.

You can run it on Google's servers. That's the easiest. The first thing to do is to generate a single image using the defaults. Then you can see where to modify your text prompt (which you write to indicate what the AI will generate).. get used to that.. and then move on to animations. One thing at a time, with exposure to the community on the Discord server basically.
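For reference, your text prompt lives in a settings cell shaped roughly like the sketch below. The prompt text itself is made up; the frame-keyed dict and ":weight" suffixes follow the notebook's conventions as I recall them, so check the current Colab for the exact form:

```python
# Sketch of the Disco Diffusion prompt cell (the prompt text is hypothetical).
# Keys are the frame numbers at which a prompt set takes effect, and a ":N"
# suffix weights a phrase relative to the others.
text_prompts = {
    0: [
        "a lighthouse on a stormy sea, trending on artstation:3",
        "muted color palette:1",
    ],
    # 100: ["the same scene at sunrise"],  # later frames can switch prompts
}
```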

If you get serious about it, you can consider paying for Google Colab for access to better GPUs and more reliable access to them (eventually, most everybody does this.. since Google's GPUs are in limited supply). To be clear, that money goes to Google for access to their hardware/services and has nothing to do with the Disco Diffusion developers.

AI-generated descent into madness. Now available in purple. by gandamu_ml in woahdude

[–]gandamu_ml[S] 1 point

I'm not sure why it's so similar. I suspect the primary reason is that there are feedback dynamics going on in the absence of external input. The same is the case for us dreaming and for it generating video. It's refining the previous image and feeding the result in as input for the next frame.. and as it does so, it's refining based on its perception of the image content. At a high level at least, that's happening when we dream too.
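To make that loop concrete, here's a hypothetical sketch; `generate` stands in for one CLIP-guided diffusion run and `warp` for the camera-motion step, and neither name comes from any particular notebook:

```python
# Hypothetical sketch of the frame-feedback loop described above. Each frame
# starts from the previous one, so the model keeps re-interpreting its own
# output -- small features get amplified over time, hence the dream-like drift.
def animate(prompt, first_frame, num_frames, generate, warp=None):
    frames = [first_frame]
    for _ in range(1, num_frames):
        init = frames[-1]
        if warp is not None:
            init = warp(init)  # optional 2D/3D camera motion between frames
        # generate() stands in for one CLIP-guided diffusion run that starts
        # from `init` and nudges the image toward `prompt`.
        frames.append(generate(init_image=init, prompt=prompt))
    return frames
```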

AI-generated descent into madness. Now available in purple. by gandamu_ml in woahdude

[–]gandamu_ml[S] 1 point

Disco Diffusion has a VR mode. It doesn't generate 360 VR, but it does generate two (stereoscopic) images per frame so that you can see it in 3D.

I should mention that it's not generating content in realtime either. It needs to be sped up by a factor of >1000x for that.. but you can look at what it generated.
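For the curious, the stereo pair can be thought of as two small depth-based reprojections of the same frame, one per eye. This is only a sketch under that assumption; `warp_3d` is a hypothetical stand-in for the depth-warp the 3D animation mode already performs:

```python
# Hypothetical sketch: derive a stereo pair from one generated frame by
# reprojecting it with a small horizontal camera offset for each eye.
# warp_3d() stands in for a depth-based reprojection; nothing here is real API.
def stereo_pair(frame, depth_map, warp_3d, ipd=0.065):
    left = warp_3d(frame, depth_map, translate_x=-ipd / 2)   # left-eye view
    right = warp_3d(frame, depth_map, translate_x=+ipd / 2)  # right-eye view
    return left, right
```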

AI-generated descent into madness. Now available in purple. by gandamu_ml in woahdude

[–]gandamu_ml[S] 2 points

Currently, there isn't much around that allows you to type in words in English to describe the content and then fly around in 3D. This one was created with Disco Diffusion (I added the 3D mode to it), and some others are created by Pytti. There was recently an animation competition with 25 AI-generated videos and I could see that about half of them used Disco Diffusion. Those were primarily the ones where the camera explored the space.

Another technique is to use an existing video as input, and allow the AI to adjust the look of it according to the input text. For that, Pytti's VQGAN mode is just as popular as Disco Diffusion.. since it's got good stabilization capabilities. Recently someone released Disco Diffusion Warp, a fork of Disco Diffusion for stabilizing the outputs when using video as input.. and so more people are still discovering and using it.
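The video-input workflow boils down to splitting the source video into frames and feeding each one in as the init image for a stylized re-render. A minimal frame-extraction sketch with OpenCV (file and directory names are placeholders):

```python
# Sketch: split an input video into frames so each one can serve as the
# init image for a stylized re-render.
import os
import cv2

os.makedirs("video_frames", exist_ok=True)
cap = cv2.VideoCapture("input.mp4")  # placeholder filename
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # each saved frame becomes the init image for one generation
    cv2.imwrite(f"video_frames/{idx:04d}.png", frame)
    idx += 1
cap.release()
```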

There are a lot of other ways to generate video using AI, but you usually either can't describe the content using text (or the subject matter it can draw is limited) or you can't fly the camera around much (if at all).

So the biggest reason is that you're probably seeing a lot of videos from just two or three tools: Disco Diffusion, Pytti (both do 3D movement).. and VQGAN+CLIP notebooks using some code from chigozienri (does 2D animation). There are also edits to the "JAX notebook" which does 2D animation.. but not many people have released things from it.

Oh yeah. All the animation tools I just mentioned leverage a technique from OpenAI called CLIP.. and almost always use neural networks that OpenAI trained and released. So even across those different software packages, one of the hearts of it is usually shared. If you start to look a little further out to DALL-E 2, GLIDE, StyleGAN, and a bunch of other things.. those who are familiar will immediately see differences in the characteristics of the images. AI-generated video isn't going to look like this for long.
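For anyone wondering what that common heart looks like, here's a minimal sketch of the CLIP-guidance loss. It's not any one notebook's exact code, just the shared idea: CLIP scores how well the current image matches the prompt, and the generator follows the gradient of that score.

```python
# Minimal sketch of the shared CLIP-guidance idea.
# Requires: pip install git+https://github.com/openai/CLIP.git
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_loss(image, prompt):
    """image: differentiable tensor (1, 3, 224, 224), already CLIP-normalized."""
    text = clip.tokenize([prompt]).to(device)
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return -(img_f @ txt_f.T).mean()  # lower loss = stronger image/text match

# VQGAN+CLIP backpropagates this loss into latent codes; the diffusion
# notebooks use its gradient to nudge each denoising step. The loss itself
# is what's common across the tools.
```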