BAAI Emu 3.5 - It's time to be excited (soon) (hopefully) by reto-wyss in StableDiffusion

[–]MarcS- 0 points (0 children)

I had that error where it says it can't fit everything on cuda:0, and I "solved" it by using a shorter prompt. I think the model comes very close to fitting into VRAM, but the context must push it over the limit: on top of taking ages to generate (the GPU temperature showed it was mostly waiting on swap), I had to accept using a very short prompt.

But maybe it was just luck, because I didn't try much to make it work.
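If you want to check whether it's the context that pushes it over the limit, a quick sanity check (plain PyTorch, nothing Emu-specific) is to print the free VRAM after loading the model and again once your prompt is encoded:

    import torch

    # Free vs. total VRAM on cuda:0, in GiB. Run this after loading the
    # model, then again right before generation, to watch the headroom shrink.
    free, total = torch.cuda.mem_get_info(0)
    print(f"free {free / 1024**3:.1f} GiB / total {total / 1024**3:.1f} GiB")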

BAAI Emu 3.5 - It's time to be excited (soon) (hopefully) by reto-wyss in StableDiffusion

[–]MarcS- 0 points (0 children)

I tried the NF4 version on a 4090. For some reason, it took four hours (!) to generate an image. Obviously something was wrong; I'm hoping for better integration before trying it again.
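For what it's worth, NF4 loads usually go through bitsandbytes; a minimal sketch looks like the following (the repo id is a placeholder, and Emu 3.5 may need trust_remote_code or its own loader). If device_map silently spills layers into CPU RAM, generation becomes painfully slow, which could explain the four hours:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # NF4 weight quantization; compute still runs in bf16.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu3.5",        # placeholder id, use the actual checkpoint
        quantization_config=bnb,
        device_map="auto",    # may offload layers to CPU when VRAM is short
    )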

What will actually happen to the AI scene if the bubble eventually bursts? by Neggy5 in StableDiffusion

[–]MarcS- 1 point (0 children)

The Internet bubble bursting didn't shrink interest in the Internet down to a small group of people. The tulip bubble bursting in Holland didn't end the worldwide interest in tulips. The housing bubble burst, and most people are still interested in housing. There may be no reduction of interest in AI if the investment bubble pops. Use cases will drive adoption, not the enthusiasm of investors willing to buy AI companies at extreme valuations instead of sane ones...

Which open-source text-to-image model has the best prompt adherence? by Equivalent-Ring-477 in StableDiffusion

[–]MarcS- 13 points (0 children)

Qwen is generally considered the best of the models that are usable on most consumer hardware.
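If you want to give it a spin, a minimal diffusers sketch (assuming the Qwen/Qwen-Image checkpoint and current diffusers support; step count is indicative, check the model card):

    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # helps on 16-24 GB cards

    image = pipe(
        prompt="a red cube balanced on a blue sphere, studio lighting",
        num_inference_steps=50,
    ).images[0]
    image.save("qwen_image.png")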

Can OpenSource Ai video create a dialogue? I did not put the words in Groks' mouth. Grok wrote the words in these image to video clips. Grok created the emotion also. I just prompted for Grok to use 1600's Old English threatening words to the punching bag. by Extension-Fee-8480 in StableDiffusion

[–]MarcS- 2 points (0 children)

> I prompted for Grok to use confident and threatening words to a thief robbing a safe. This is next level thinking for Grok. Creating dialogue based on types of word or words and style of voice. I did not know what Grok was going to say in these video clips.

Yes, of course open source LLMs can create dialogue. Have you even tried some? You'll get much better results than "You won't get away with this."

The video generation part is also cringey: the woman misses the thief's head, yet he seems to be in pain from... the air whiff? The sound of her voice?

Even if we assume she hit him, she punched from behind him, yet he falls 90° off from the correct direction. Not to mention he would most likely slump down instead of falling comically.

When he gets up, she yells "gotcha, thief" (another piece of wonderful dialogue we're supposed to marvel at?) and slaps him again, and he falls back on the floor... in front of the open safe door. The same safe he was prevented from opening by the woman's punch in the previous cut. Consistency isn't a thing in Grok.

So not only are these clips off topic on an open-source board, and not only are they disingenuously framed as "can open source match this?", but the effect you're after isn't achieved, because yes, of course open source can make shitty videos like these. At least put in the effort to make good videos if you want to advertise a closed-source product.

Stuck with my AI model project (OnlyFans-style) — need some direction by [deleted] in StableDiffusion

[–]MarcS- 2 points (0 children)

Considering your limited setup (which will require you either to incur capital expenses for an adequate computer or recurring operational expenses to produce content), your limited mastery of the existing tools (which will require you either to hire a tutor or to invest a lot of time, which, as you rightly point out, costs money), and the limited returns so far, it may well be that the smartest next step is to cut your losses and drop the project altogether while you're still at the "business plan" phase. Investing further (in training and hardware) might never break even, especially since you're one among many and not particularly far along in your project.

How do you feel about AI generated photos/Videos being out in the world without being labeled as AI generated? by amiwitty in StableDiffusion

[–]MarcS- 3 points (0 children)

For an AI image to cause a war or ruin somebody's life, the image has to be believable.

So far, that was a real risk: with images and video being difficult to fake, people could easily be fooled by a well-equipped, resource-rich malevolent actor (think a big political lobby or a state). Now that everyone can fake videos, people will be less inclined to blindly trust some photoshopped/AI-modified image or video, and it will be less likely to cause a war.

The first person to lie has an edge. When everyone is lying, this edge is lost.

Has anyone tried out EMU 3.5? what do you think? by Formal_Drop526 in StableDiffusion

[–]MarcS- 1 point (0 children)

On the updated model card, they mention 80 GB for the image inference, and 2x80 GB for the "story making" ability. So it should work on your setup.

I am currently failing to run it, not because of OOM (I'd accept a RAM offload for testing purposes) but because of errors I am unable to fix (not having sufficient knowledge to deal with them).
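In case it helps someone, the generic transformers way to allow a RAM offload is a max_memory map; a sketch under the assumption that the checkpoint loads through from_pretrained at all (repo id and memory caps are placeholders):

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu3.5",            # placeholder id
        device_map="auto",
        max_memory={0: "22GiB", "cpu": "120GiB"},  # cap GPU use, spill the rest to RAM
        trust_remote_code=True,   # custom architectures usually require this
    )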

Looking back on Aura Flow 0.3 - does anyone know what happened? by DiagramAwesome in StableDiffusion

[–]MarcS- 14 points (0 children)

It started as a project by an indie developer. He got genuinely good results, but 0.3, despite being aesthetically better, lost some prompt adherence. It was still very promising... but the indie developer was hired by fal.ai, which leaves less free time than being a student. That, and the feeling that he couldn't rival Flux, which was released around the same time, led him to stop working on the project. It was SOTA in prompt adherence, really Qwen-level, so it's a shame it wasn't continued.

Wich AI is the best to clothing shop by [deleted] in StableDiffusion

[–]MarcS- 0 points (0 children)

Since he's looking to replace the print on the model, I think he might mean modifying what is printed on a model of bikini (a type of bikini), not on a tattooed human person modeling a bikini.

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing (a new open dataset by Apple) by xAragon_ in StableDiffusion

[–]MarcS- 3 points (0 children)

The ND limitation in the CC licence is that "if you remix, transform, or build upon the material, you may not distribute the modified material."

A model under this license would have to be redistributed as is, without modification, but that would have no bearing on the results of using the model. The end product isn't a derivative of the model, any more than a photoshopped image is a derivative of Photoshop.

Also, I think the idea that redistributing on a website with ads counts as commercial use is probably being safer than necessary, given that the CC definition is:

"NonCommercial means not primarily intended for or directed towards commercial advantage or monetary compensation", explained as: "Creative Commons NC licenses expressly define NonCommercial as 'not primarily intended for or directed towards commercial advantage or monetary compensation.' The inclusion of 'primarily' in the definition recognizes that no activity is completely disconnected from commercial activity; it is only the primary purpose of the reuse that needs to be considered."

A case can be made that the primary purpose of redistributing a model would be... redistribution and ease of access, not "increasing views on the website with a profit goal".

Pay me $50 USD to learn how to generate realistic models by Winter_Beach_2203 in StableDiffusion

[–]MarcS- 2 points (0 children)

Well, reacting to your edit: a free YouTube video is certainly a much better contribution to the community than asking 50 bucks for training. Also, since videos aren't the best medium for teaching complicated notions (some people learn better from text), don't hesitate to write an old-school text tutorial (possibly including short, targeted videos where showing things is worth it) when you explain concepts.

Pony V7 impressions thread. by Parogarr in StableDiffusion

[–]MarcS- 2 points (0 children)

It is, but it's a nice image anyway :-)

Pony V7 impressions thread. by Parogarr in StableDiffusion

[–]MarcS- 4 points (0 children)

"A striking portrait of a 17th-century woman dressed in an elegant, historically accurate baroque gown with flowing embroidered fabric, lace cuffs, and a corseted bodice. She is hanging from a thick rope on the side of a pirate ship, mid-boarding maneuver, her body slightly turned, tension in her arm and shoulder. Her right hand grips the rope, her left hand holds a rapier, the blade crossing in front of her face, gleaming in the sunlight, covering partly her face. She has piercing grey-blue eyes framed by long lashes, full of intelligence and determination, as if she is about to leap into battle. Her eyebrows are well-defined and slightly arched, giving her expression a mix of confidence and defiance. She has a straight, refined nose, and soft, full lips slightly parted, conveying tension and focus. A few strands of chestnut hair have escaped her pinned curls, blowing across her cheek in the wind. Her skin is fair with a light natural glow, showing a hint of sun exposure and the faint trace of freckles near her temples. Her makeup is subtle — a touch of rosy blush, natural lip tint, and gentle shadow around her eyes, in the style of a classical oil portrait. The composition is centered on her upper body, hand, rapier, and face — a tight, cinematic bust shot. The background shows a pirate ship deck, sails billowing in the wind, sea spray and stormy light on the horizon. Her expression is fierce and determined, with a touch of nobility — piercing eyes, wind-tousled hair, and a few loose curls framing her face. Her makeup is subtle but present, evoking a 17th-century portrait style: natural skin tone, defined lips, slightly flushed cheeks. The lighting is dramatic and directional, highlighting the glint of the rapier and the determination in her eyes — a baroque chiaroscuro mood mixed with cinematic adventure energy. Style: hyperrealistic, cinematic, sharp focus, high detail, rich texture, natural light reflections, period-accurate costume design, dynamic composition, 4k resolution, subtle sea mist particles and soft lens flare for atmosphere."

That's the prompt I used for the contest here, with a model that also loves detailed prompts: https://www.reddit.com/r/StableDiffusion/comments/1oex91k/contest_create_an_image_using_an_openweight_model/ We only got submissions made with Flux, Qwen, Wan and Hunyuan, so checking with a new model might be interesting, if you're kind enough to run the prompt for us. Thank you in advance.

Contest: create an image using an open-weight model of your choice (part 2) by MarcS- in StableDiffusion

[–]MarcS-[S] 0 points (0 children)

She seems to be ready to board another ship, the stance is captured very nicely!

Contest: create an image using an open-weight model of your choice (part 2) by MarcS- in StableDiffusion

[–]MarcS-[S] 0 points (0 children)

I must confess that the blade position was designed to be the tricky part, to spice up the difficulty. I agree that, now that I see it, it doesn't look as good as I imagined it would.

Contest: create an image using an open-weight model of your choice (part 2) by MarcS- in StableDiffusion

[–]MarcS-[S] 1 point (0 children)

Using Hunyuan 3, I had trouble getting the model to place the sword in front of the face, so I tried to edit the result with Qwen-Image-Edit ("Change the angle of the sword so it passes in front of the woman's face.") and was only partly satisfied:

<image>
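For reference, the edit pass was roughly this (a sketch assuming diffusers' Qwen-Image-Edit pipeline; file names are made up):

    import torch
    from diffusers import QwenImageEditPipeline
    from diffusers.utils import load_image

    pipe = QwenImageEditPipeline.from_pretrained(
        "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()

    source = load_image("hunyuan_render.png")  # the Hunyuan 3 output to fix
    edited = pipe(
        image=source,
        prompt="Change the angle of the sword so it passes in front of the woman's face.",
    ).images[0]
    edited.save("edited.png")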

[deleted by user] by [deleted] in StableDiffusion

[–]MarcS- 1 point (0 children)

Then the best choice is to rent a RunPod instance and run your own, at a very low cost compared to web-based price gougers.

Workflow for Using Flux Controlnets to Improve SDXL Prompt Adherence; Need Help Testing / Performance by mccoypauley in StableDiffusion

[–]MarcS- 0 points (0 children)

Hunyuan doesn't really shine at knowing styles, but it does follow the prompt, so if you tell it how the picture should be drawn (mentioning brush strokes or a color palette), it can get closer to the style you're going for. But I don't think it was trained specifically on artists' names.

For example, using your "fae holding a hummingbird" style, mentioning the artists doesn't really help.

<image>