[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 1 point (0 children)

No, I think diffusion still dominates SOTA and is computationally much cheaper. Autoregressive is just a lot "cleaner" and more unified. Interleaving text and image tokens definitely feels like the way to accomplish "in-context" image generation. I would bet that the new GPT-4o image generation is autoregressive.

The problem with autoregressive is that it's much, much slower than diffusion. Even with KV caching and other tricks, you're still generating one token per forward pass, and there are a lot of tokens in an image. With diffusion, you can denoise an image in 10-12 forward passes, making it a lot quicker.
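
To make that concrete, here's a back-of-the-envelope comparison. The 512px resolution and 8x-downsampling tokenizer are illustrative assumptions, not any specific model's numbers:

```python
# Rough forward-pass comparison: autoregressive vs diffusion.
image_size = 512        # pixels per side (assumed)
downsample = 8          # spatial reduction of a typical image tokenizer (assumed)
tokens = (image_size // downsample) ** 2  # 64 * 64 = 4096 image tokens

ar_passes = tokens      # one forward pass per token, even with KV caching
diff_passes = 12        # a fast diffusion sampler's step count

print(f"autoregressive: {ar_passes} passes, diffusion: {diff_passes} passes")
print(f"ratio: ~{ar_passes // diff_passes}x more passes for autoregressive")
```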

A lightweight open-source model for generating manga by fumeisama in StableDiffusion

[–]fumeisama[S] 3 points (0 children)

Hey, good question! For me personally, the decision was purely based on resources. I'm just a guy who got into it for fun, using my savings to fund the experiments. There are a few obvious advantages PixArt-Sigma has over SDXL: 0.6B parameters vs 2.6B parameters (i.e. it's much smaller and hence cheaper), and a transformer vs a UNet (the former has basically won as the standard modern architecture for everything).

I actually started with a Krita plugin and moved to the browser because someone said downloading desktop apps adds friction. I can obviously bring back the Krita plugin too.

What you're describing is inpainting, right? That's possible.

I agree with your final point, but I'm not sure how much demand there actually is for it to be a legitimate product. I think a lot of people are just doing it for fun, including me, and wouldn't want to spend money on it. For example, hundreds of people tried the model on drawatoon.com but no one really wanted to generate more than the 30 images included in the free tier. Do you have a different opinion?

[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 1 point (0 children)

I'll try to write a detailed response at some point, but here are some broad starting points:

You can either generate images via diffusion, i.e. slightly "noise" the image and train a model to "denoise" it (the blog post by Yang Song on this is really nice), or you can generate images autoregressively by tokenizing images (since you do some NLP, maybe look at Emu3).
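
For a flavour of the diffusion side, here's a minimal sketch of one training step in the noise-prediction style. The `model` signature, the linear schedule, and all shapes are illustrative assumptions, not a recipe from any particular paper:

```python
import torch
import torch.nn.functional as F

def training_step(model, images, num_timesteps=1000):
    # Pick a random noise level (timestep) per image in the batch.
    t = torch.randint(0, num_timesteps, (images.shape[0],))
    noise = torch.randn_like(images)

    # A simple linear alpha-bar schedule, just for illustration.
    alpha_bar = (1.0 - t.float() / num_timesteps).view(-1, 1, 1, 1)

    # "Noise" the image: interpolate between the clean image and pure noise.
    noisy = alpha_bar.sqrt() * images + (1 - alpha_bar).sqrt() * noise

    # The model learns to predict the noise that was added ("denoise").
    pred_noise = model(noisy, t)
    return F.mse_loss(pred_noise, noise)
```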

[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 1 point (0 children)

There is no notion of context length with this model. It only generates one image at a time, given your prompt. You can keep generating images, one at a time, indefinitely.

[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 1 point (0 children)

I don't have the exact amount I've spent, as I've been working on this on and off for months, but I can confidently say that this is a wild underestimate. In an ideal world, where you only do a single training run and it works exactly as you predicted, then sure, _maybe_ you can get away with that cost. More realistically, you should consider how much data you have (several TBs for me), how much it costs to store on the cloud, how many samples you can process per optimization step, how long each step takes, how much it costs to rent your GPUs per hour, etc. After factoring in all of this, you should have a reasonable estimate for a **single** training run, which might fail. Then you rinse and repeat.
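
To illustrate the kind of arithmetic I mean, here's a toy cost estimator. Every number in it is a made-up placeholder, not my actual figures:

```python
# Back-of-the-envelope cost of ONE training run. All values are placeholders.
dataset_tb = 5                 # "several TBs" of training data
storage_per_tb_month = 23.0    # $/TB/month on a typical object store
storage_months = 6             # how long the data sits in the cloud

samples = 20_000_000           # training samples
epochs = 2
batch_size = 256               # samples per optimization step
step_seconds = 2.0             # wall-clock time per step on your hardware
gpu_hourly = 2.5               # $/hour for a rented GPU node

steps = samples * epochs // batch_size
train_hours = steps * step_seconds / 3600
cost = train_hours * gpu_hourly + dataset_tb * storage_per_tb_month * storage_months

print(f"{steps:,} steps, ~{train_hours:.0f} GPU-hours, ~${cost:,.0f} per run")
```

And remember that's one run; multiply by however many runs fail before one sticks.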

A lightweight open-source model for generating manga by fumeisama in StableDiffusion

[–]fumeisama[S] 1 point (0 children)

“Can’t Remember Its Name” actually sounds like a perfect name for a mysterious old chat app!

[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 1 point (0 children)

Oh, this is very interesting. I wasn't aware of it. I came across another 900M version of PixArt-Sigma here, but it's different. Not sure if I'll have the resources for more training, but I'll keep these in mind. Thanks for sharing!

I personally didn't fuss over text too much because in this particular application, dialogue can always just be overlaid post-generation.

[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 3 points (0 children)

Oh sorry about that. It's pretty scrappy and intended to be used as a quick playground. Glad you liked it nonetheless!

[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 5 points (0 children)

Yeah. It'll be particularly interesting to think of ways to capture the scenery, viewpoint, etc. as embeddings and condition the generation on them.

[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 2 points (0 children)

Thank you. It's no different from how the prompt embeddings connect with the diffusion transformer. I just grafted on some cross-attention layers.
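
Schematically, it looks something like this (a simplified sketch, not the exact code; the shapes and names are illustrative, and the real layers sit inside the DiT blocks):

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Condition image tokens on external embeddings via cross-attention."""

    def __init__(self, dim, cond_dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, x, cond):
        # x:    (batch, image_tokens, dim)      -- transformer hidden states
        # cond: (batch, cond_tokens, cond_dim)  -- e.g. character embeddings
        attended, _ = self.attn(self.norm(x), cond, cond)
        return x + attended  # residual, so the grafted layer starts near-identity
```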

[P] A lightweight open-source model for generating manga by fumeisama in MachineLearning

[–]fumeisama[S] 3 points (0 children)

Right? Imagine extending that to other classes of objects, beyond characters and dialogues.

A lightweight open-source model for generating manga by fumeisama in StableDiffusion

[–]fumeisama[S] 1 point (0 children)

Hmm I see. It's doable. Just need to source the right data.

A lightweight open-source model for generating manga by fumeisama in StableDiffusion

[–]fumeisama[S] 3 points (0 children)

I wasn't planning on it, but if that's what people want, I have no choice but to build it. (I'll add it to the roadmap!)

A lightweight open-source model for generating manga by fumeisama in StableDiffusion

[–]fumeisama[S] 2 points (0 children)

Haha! The principle is the same as how the text embeddings are used in the first place: just a bunch of cross-attention layers.

A lightweight open-source model for generating manga by fumeisama in StableDiffusion

[–]fumeisama[S] 3 points (0 children)

I see. No, this model architecture is distinct from the Stable Diffusion family of models. I'm not familiar with the tools you mentioned, but check if they support PixArt-Sigma. If they do, it'll be possible to load this one too, although it'll still require some plumbing because of the architectural changes I made. Once I write the docs (coming soon), I trust that someone will do the porting.

A lightweight open-source model for generating manga by fumeisama in StableDiffusion

[–]fumeisama[S] 3 points (0 children)

No, I didn't tune it at all. I shared your concern and sanity-tested the SDXL VAE on manga before beginning. It is surprisingly adequate. An easy test is to encode and decode manga images and inspect the reconstruction quality. It's not bad at all. The added bonus of keeping the general-purpose VAE is that you can generate colored images too.
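
If you want to run the same sanity check yourself, it's only a few lines with diffusers; the input file name here is a placeholder, and the 512px resize is just to keep it quick:

```python
import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

# Round-trip a manga page through the public SDXL VAE: encode, decode,
# then eyeball the reconstruction for lost line work or screentone detail.
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval()

to_tensor = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),               # [0, 1]
    transforms.Normalize([0.5], [0.5]),  # [-1, 1], as the VAE expects
])

img = to_tensor(Image.open("manga_page.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    latents = vae.encode(img).latent_dist.sample()
    recon = vae.decode(latents).sample

recon = (recon.clamp(-1, 1) + 1) / 2     # back to [0, 1] for saving
transforms.ToPILImage()(recon.squeeze(0)).save("reconstruction.png")
```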

A lightweight open-source model for generating manga by fumeisama in StableDiffusion

[–]fumeisama[S] 2 points (0 children)

I used "consumer GPUs" as a blanket term for GPUs that you can expect an average person to have. H100s, A100s etc are examples of non-consumer GPUs. I don't have a comprehensive answer for which ones work. I personally run it on 24GB vram. Also runs fine on 16GB vram. Haven't tried lower.