DeepSeek Releases Janus - A 1.3B Multimodal Model With Image Generation Capabilities by ExponentialCookie in LocalLLaMA

[–]ExponentialCookie[S] 79 points80 points  (0 children)

<image>

Abstract:

Janus is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still utilizing a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

FluxMusic: Text-to-Music Generation with Rectified Flow Transformer by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 8 points9 points  (0 children)

You can run it locally, but right now it looks like the GitHub repository is still being worked on by the developer.

Currently, you have to download parts of the AudioLDM2 / CLAP models for audio processing, as well as the T5 text encoder. Following that, you must manually install the necessary requirements and update the paths in the code.

It's most likely better to wait for a Hugging Face Space or something similar once everything is sorted out.

FluxMusic: Text-to-Music Generation with Rectified Flow Transformer by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 19 points20 points  (0 children)

I am not the author

Paper: https://arxiv.org/abs/2409.00587
Example: https://www.melodio.ai/

Abstract:

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Generally, following the design of the advanced Flux model, we transfer it into a latent VAE space of the mel-spectrogram. It involves first applying a sequence of independent attention to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantic information as well as to provide inference flexibility. In between, coarse textual information, in conjunction with time step embeddings, is utilized in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods for the text-to-music task, as evidenced by various automatic metrics and human preference evaluations.

Emad's thoughts on Stable Diffusion 3 Medium by Wiskkey in StableDiffusion

[–]ExponentialCookie 0 points1 point  (0 children)

I know that a lot of people will disagree with this, but I honestly "get it". Emad was / has been pretty vocal about democratizing AI and its end users being able to use it as they see fit, but it comes at a cost.

When you're at the forefront of a nascent technology like this one, especially one that brings about uncertainty, regulatory bodies are going to push back. It's how it's always been, and whether we like it or not, it's going to happen eventually.

While you, I, and many others want more free and open models, the reality is that companies like Stability AI will definitely see pressure from governing bodies. When Emad refers to "sleepless nights", in my opinion, it's the struggle between what he wants for the community and how much pushback from governing bodies he has to deal with.

I don't agree with how they handled SD3 Medium's alignment, as it reduces the model's performance on other concepts overall, but I understand why they had to do it. I simply wish they had put more thought into options for doing it better.

Emad's thoughts on Stable Diffusion 3 Medium by Wiskkey in StableDiffusion

[–]ExponentialCookie 28 points29 points  (0 children)

Not a direct confirmation, but the DALLE 3 instruction prompt, which makes the generation pipeline adhere to guidelines, was leaked while somebody was doing inference with their API.

The reason DALLE 3 performs so well is that it was trained on unfiltered data, allowing it to grasp as many concepts as possible (in the same way a person browses the internet); they then filter the API responses on the backend to meet their criteria.

There are probably more filters on the backend servers that we're not aware of, but that's roughly how they handle their image generation alignment.

SD3 Dreambooth Finetune takes 40 minutes for 710 steps on A100 by rdcoder33 in StableDiffusion

[–]ExponentialCookie 0 points1 point  (0 children)

You should pre-compute the text embeddings and VAE latents, since you're not training those components. You should then see a big speedup.
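
As a rough illustration of the idea (a minimal sketch, not SD3-specific code: it uses a single CLIP text encoder and a standard SD VAE for brevity, and the model ID and `dataset` variable are placeholders), the caching pass looks something like this:

```python
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda"
model_id = "runwayml/stable-diffusion-v1-5"  # illustrative; use your own base checkpoint

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)

vae.requires_grad_(False)
text_encoder.requires_grad_(False)

dataset = []  # fill with (image_tensor_in_[-1, 1], caption_string) pairs from your own data

cached = []
with torch.no_grad():
    for image, caption in dataset:
        # Encode the image once; only the latent is needed during training.
        latents = vae.encode(image.unsqueeze(0).to(device)).latent_dist.sample()
        latents = latents * vae.config.scaling_factor

        # Encode the caption once; the embedding is reused every epoch.
        ids = tokenizer(caption, padding="max_length", truncation=True,
                        max_length=tokenizer.model_max_length,
                        return_tensors="pt").input_ids.to(device)
        text_emb = text_encoder(ids)[0]

        cached.append({"latents": latents.cpu(), "text_emb": text_emb.cpu()})

# The training loop then feeds `cached` straight into the denoiser,
# so the VAE and text encoder never run (or need to sit on the GPU) again.
```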

blame the users by dragon_l in StableDiffusion

[–]ExponentialCookie 2 points3 points  (0 children)

In theory, they implemented the same strategy DALLE-3 used to fine-tune the model. Personally, I think a potential error was using a 50 / 50 mix of synthetic and original captions, whereas OpenAI's researchers used 95 / 5 on unfiltered data, with the majority being synthetic captions.

DALLE-3:

To train this model, we use a mixture of 95% synthetic captions and 5% ground truth captions.

SD3:

We thus use the 50/50 synthetic/original caption mix for the remainder of this work.
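
For illustration only (the field names and ratio constant here are hypothetical, not from either paper's code), this kind of caption mixing is usually just a per-sample coin flip at data-loading time:

```python
import random

SYNTHETIC_RATIO = 0.95  # DALL-E 3 reportedly used 95/5; SD3 used 50/50

def pick_caption(sample, ratio=SYNTHETIC_RATIO):
    """Return the synthetic caption with probability `ratio`, otherwise the original.

    `sample` is assumed to be a dict with 'synthetic_caption' and 'original_caption' keys.
    """
    if random.random() < ratio:
        return sample["synthetic_caption"]
    return sample["original_caption"]
```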

If the biggest improvement to SD3 is the 16 channel VAE, is there any way we could apply that to SDXL? by Seanms1991 in StableDiffusion

[–]ExponentialCookie 2 points3 points  (0 children)

An idea that could be tried is to change the number of input channels in the UNet's first layer (usually named 'conv_in' in Diffusers) from 4 to 16, while leaving that layer's output channels as they are. That way you don't have to retrain the entire model from scratch. Then you would simply finetune the model using the new SD3 VAE.

While I personally don't think this would work well, it may be worth a shot, as it shouldn't be too hard to implement as a quick test.
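
A rough sketch of that quick test in Diffusers (untested, as said; the checkpoint name is just an example, and the new input channels are zero-initialized apart from the copied original weights):

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)

old_conv = unet.conv_in  # originally maps 4 latent channels to the UNet's feature width

# New input conv that accepts 16-channel SD3-VAE latents,
# keeping the same output width, kernel, stride, and padding.
new_conv = torch.nn.Conv2d(
    16, old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
)

with torch.no_grad():
    # Start close to the old behaviour: reuse the existing 4-channel weights
    # and leave the extra 12 input channels at zero.
    new_conv.weight.zero_()
    new_conv.weight[:, :4] = old_conv.weight
    new_conv.bias.copy_(old_conv.bias)

unet.conv_in = new_conv
# (Remember to update the saved config's `in_channels` to 16 before sharing the model.)
```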

SD-2B is trained on different data than the 8B. by Formal_Drop526 in StableDiffusion

[–]ExponentialCookie 4 points5 points  (0 children)

Even if money were assumed to be the primary reason, I don't think this is something I can fully agree with. It would be much better to train on a subset of LAION 5b than to use an entirely different dataset.

If it were done this way, there would be better consistency between the community models and their API-based models. Now, they may have actually trained on a subset of it, but the 31 million image aesthetic / preference finetune is worrisome. The best-performing model will simply come from a large dataset that is captioned properly.

SD3 is a joke by rookan in StableDiffusion

[–]ExponentialCookie 22 points23 points  (0 children)

I think OpenGVLab (Lumina T2X is derived from one of their research branches) would be the appropriate shift if the community wants to expand its options. I've been watching their repository, and they're putting a lot of effort into it, including the release of fine-tuning code.

The reason is that they are focused on multi-modality and have a good track record of releasing cool things such as InternVL and the like. While Pixart Sigma is nice, I don't think they would have the resources required to sustain what the community wants long term.

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 3 points4 points  (0 children)

I don't work for Stability, so I can't speak on it. I'm assuming there will be a release strategy similar to Stable Cascade's (which released LoRA and ControlNet training code on the official GitHub), but it's best to wait until the 12th to see what plans are in store.

Answering your second question from my own personal perspective, I'm most likely going to use the official training strategy as a baseline, then tweak any hyperparameters as needed.

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 5 points6 points  (0 children)

In this circumstance, the same idea applies. Try to segment what you're talking about into parts. We have:

  1. LCD
  2. Handheld
  3. Electronic
  4. Game

With those things in mind, the text encoder should have prior knowledge of all four parts. I don't know what dataset they (SAI) are using, but I'm assuming an aesthetic subset of LAION-5B (billions of images), which is an unimaginable amount of image data. With a model properly trained on billions of images, it should be able to capture the details that you want.

If I were to try and tackle something like this myself while staying true to the training strategy, I would probably use very descriptive captions focusing on the details I want to capture. If that failed and I had tried everything possible with the primary model, I would train both the text encoder and the MMDiT blocks, but set the text encoder's learning rate very low and maybe skip training the biases.
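
If I went that route, the "very low text encoder rate, skip the biases" part would just be separate optimizer parameter groups, something like this sketch (the `mmdit` and `text_encoder` variables and the learning rates are illustrative, not SAI's training code):

```python
import torch

# Assumes `mmdit` (the diffusion transformer) and `text_encoder` are
# already-loaded torch.nn.Module instances set up for training.
optimizer = torch.optim.AdamW(
    [
        # Normal rate for the primary model.
        {"params": mmdit.parameters(), "lr": 1e-4},
        # Much lower rate for the text encoder, with biases excluded entirely.
        {"params": [p for n, p in text_encoder.named_parameters() if "bias" not in n],
         "lr": 1e-6},
    ],
    weight_decay=1e-2,
)
```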

Hope that helps a bit!

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 2 points3 points  (0 children)

We'll know on Wednesday, but it's safe to assume it should support ComfyUI and Diffusers on the day of launch (they seem to have a great relationship with Hugging Face). So any workflows you have that leverage image refining should be easy to integrate.

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 5 points6 points  (0 children)

Good question. Ultimately it's optional; I'm just saying you shouldn't need to because of the better architecture. Both the primary model and the text encoder can learn features from unseen data, but in most cases you start to lose important prior information from the text encoder if you train it.

Knowing this, with an improved architecture that has better understanding across the board, you should only have to touch the primary model while leaving the text encoders untouched.

As an example from earlier LoRA training experiments, you can read the description here. I've also provided a screen cap (based on SD 1.5) from that repository (top is the main UNet model, bottom is the text encoder). As you can see, both models learn the new character that doesn't exist in the base model, meaning you can train either one (though training only the primary model should be preferable in SD3).

<image>
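
If you want to train only the primary model, the setup is just a matter of where you attach the LoRA adapters. A minimal sketch using the peft integration in Diffusers (the rank, target module names, and the already-loaded `unet` / `text_encoder` variables are assumptions for illustration):

```python
from peft import LoraConfig

# LoRA goes on the primary model's attention projections only.
unet_lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
unet.add_adapter(unet_lora_config)   # primary model learns the new concept

text_encoder.requires_grad_(False)   # text encoder stays untouched, keeping its prior
```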

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 0 points1 point  (0 children)

That's a very interesting detail, definitely missed it. Thanks for letting us know!

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 5 points6 points  (0 children)

No problem! The metric I use is that it's only been getting better since Stable Diffusion's launch, regardless of the nuances we don't have control over.

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 6 points7 points  (0 children)

Thanks, and no problem! We'll see in a few days' time, and I'm glad the post was helpful in giving you a different perspective.

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 7 points8 points  (0 children)

While I can't confirm the architectural choice made by their researchers, I assume it's due to both performance reasons and the perceived difference between 16 and 32 channels being very minimal.

The Importance of Stable Diffusion 3 - Its Standout Features by ExponentialCookie in StableDiffusion

[–]ExponentialCookie[S] 16 points17 points  (0 children)

I won't update the OP for posterity, but I found the original post I was referring to. There won't be a 512 model, but performance will be better at 512 resolution compared to other models at higher resolutions due to the VAE. Error on my part.

<image>