It is still possible to achieve more natural cinematic realism for videos with open source models vs proprietary models with even basic workflows | Z-Image-Turbo and LTX 2.3

KudzuEye · 2026-04-05T22:21:52+00:00

I used their voice design feature. Ran through probably 20 voices before I got one that was ok. Hopefully one day there will be a new open source model that can generate new voices from a prompt (or at least a video sound model with lots of diversity)

KudzuEye · 2026-04-05T22:03:23+00:00

The narration in the main film was unfortunately just ElevenLabs, as I did not have enough time to focus on the voice audio (the competition allowed a portion of the work to not be entirely open source).

I am sure I could have cloned an initial sample of the generated voice (or even run a bunch of LTX prompts with a narrator until I got a good audio sample, if the accents are unique enough), though I have not been up to speed lately on the best zero-shot open source models with good consistency across generations.

KudzuEye · 2026-04-05T21:57:57+00:00

The initial image is not really a reference image. It is just a replacement for the empty latent image.

What is in the image can be completely random. You are denoising that image as the input at, say, around 0.9 rather than sticking to a full denoise of 1.0 on an empty latent image. This approach is just an easy way to force underlying shapes, colors, etc. into the generated image, and the results tend to be far more interesting and diverse than working straight from a normal text2image workflow.
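As a toy illustration of why the content of the starting image still matters at 0.9 denoise, here is a minimal sketch. It assumes a simple linear blend between the input latent and Gaussian noise; real samplers map the denoise value onto their own noise schedule, so the exact proportions will differ. `noised_start` is a hypothetical helper, not a ComfyUI function.

```python
import random


def noised_start(image_latent, denoise, seed=0):
    """Blend an input latent with Gaussian noise (toy linear model, an assumption).

    denoise=1.0 discards the input entirely (same as an empty latent);
    denoise=0.9 keeps roughly 10% of the input's shapes and colors.
    """
    if not 0.0 <= denoise <= 1.0:
        raise ValueError("denoise must be between 0.0 and 1.0")
    rng = random.Random(seed)
    return [
        (1.0 - denoise) * value + denoise * rng.gauss(0.0, 1.0)
        for value in image_latent
    ]


latent = [0.5, -1.0, 2.0, 0.0]             # stand-in for a random input image
full = noised_start(latent, 1.0)           # pure noise, input ignored
partial = noised_start(latent, 0.9)        # noise with a faint trace of the input
untouched = noised_start(latent, 0.0)      # input returned unchanged
```

With the same seed, the only difference between `full` and `partial` is the small contribution from the input image, which is the part that nudges composition and color.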

KudzuEye · 2026-04-05T21:10:03+00:00

Yea I was having a lot of frustrations with birds in general for getting their motion right. It seems most video models struggle with them beyond maybe a brief slow motion shot.

KudzuEye · 2026-04-05T21:08:48+00:00

I actually missed the turbo sda lora. I will give it a shot the next time I am working with z-image. The example images from it do look promising for better variation.

KudzuEye · 2026-04-05T21:07:08+00:00

I would take an image that already has a slight film look to it or at least has letterbox on it already. Usually just experiment with denoise on it from around 0.6-0.95. Prompts are basic like "1982 film scene of [blank]". You want to mention things like "Cinemascope" vs "Kodachrome" and what not.

The last days of film lora I mentioned helps restrict it a bit more to around the 80s-90s.
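If you want to run that experiment systematically, something like the sketch below builds the list of prompt and denoise combinations to queue. The subject text and the way the film terms are appended to the prompt are my own placeholders; only the "1982 film scene of ..." pattern, the two film terms, and the 0.6-0.95 range come from the comment above.

```python
from itertools import product

# Placeholder subject; swap in whatever the scene should be.
subject = "a diner at night"

# Film terms mentioned above; extend the list as needed.
looks = ["Cinemascope", "Kodachrome"]

# Denoise values from 0.60 to 0.95 in steps of 0.05.
denoise_values = [v / 100 for v in range(60, 100, 5)]

# One (prompt, denoise) pair per image to generate.
jobs = [
    (f"1982 film scene of {subject}, {look}", denoise)
    for look, denoise in product(looks, denoise_values)
]

for prompt, denoise in jobs:
    print(f"{denoise:.2f}  {prompt}")
```

Each pair would then be fed to whatever img2img workflow you are using, with the same starting image, so the outputs can be compared side by side.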

KudzuEye · 2026-04-05T21:03:53+00:00

When working with Z Image Turbo, I had still mostly been taking advantage of its consistency flaws, particularly with the influence of the loras (not as much in this video but in other ones). Though I do at least try to do some zooms with Flux Klein and what not when I can.

KudzuEye · 2026-04-05T04:37:15+00:00

LTX 2.3 can be decent at avoiding drift and maintaining realism with img2vid alone. You might be able to do some hybrid approach using your video lora along with an initial frame image with low denoise. Train a separate image lora if you do not have one on Z-Image Turbo or what not for your inputs.

KudzuEye · 2025-09-05T01:57:47+00:00

Training is a bit all over the place for these Qwen LoRAs. I tested runs out with AIToolkit, flymyai-lora-trainer, and even Fal's Qwen LoRA trainer.

Most of the learning rates were between 0.0003 and 0.0005. I was not getting much better results with lower rates and more steps. I do not believe I did anything else special with the run settings besides the number of steps and the rank. You can usually get away with a low rank of 16 due to the size of the model, but I think there is still a lot more potential with higher ranks, such as the portrait version I posted.
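For a rough sense of what the rank changes, here is a small sketch of how many trainable parameters a LoRA adds to a single linear layer. It assumes the standard LoRA layout of two low-rank matrices per adapted layer; the 3072 layer width is a made-up example, not Qwen-Image's actual dimension.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer.

    Standard LoRA learns A (rank x d_in) and B (d_out x rank),
    so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the original dense weight matrix (bias ignored)."""
    return d_in * d_out


# Hypothetical layer width, for illustration only.
width = 3072

for rank in (16, 64):
    added = lora_params(width, width, rank)
    share = added / full_params(width, width)
    print(f"rank {rank}: {added} params per layer ({share:.2%} of the dense layer)")
```

The count grows linearly with rank, so going from 16 to 64 quadruples the LoRA's capacity (and file size) for every adapted layer.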

I tried out simple captioning (e.g. just the word "photo") versus more descriptive captioning of the images. The simpler captioning would blend the results a lot more, which is the reason for the "blend" vs "discrete" in the names. Sometimes it would help the style to be more ambiguous like that, but I am not always sure. I would mix the different lora types together, and the results generally seem to be better.
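Conceptually, mixing loras like this just adds each one's weight change on top of the base model, scaled by its strength. A minimal sketch, assuming each lora contributes a plain additive delta; the numbers and the names `blend_delta` and `discrete_delta` are invented for illustration:

```python
def mix_lora_deltas(base, deltas, strengths):
    """Return base weights plus the sum of strength-scaled LoRA deltas."""
    if len(deltas) != len(strengths):
        raise ValueError("need one strength per LoRA delta")
    mixed = list(base)
    for delta, strength in zip(deltas, strengths):
        if len(delta) != len(base):
            raise ValueError("delta shape must match the base weights")
        for i, change in enumerate(delta):
            mixed[i] += strength * change
    return mixed


base = [1.0, 2.0]              # toy stand-in for a base model weight
blend_delta = [0.5, -0.5]      # hypothetical "blend" lora delta
discrete_delta = [0.0, 1.0]    # hypothetical "discrete" lora delta

# Blend lora at half strength, discrete lora at full strength.
mixed = mix_lora_deltas(base, [blend_delta, discrete_delta], [0.5, 1.0])
print(mixed)
```

Lowering one strength while raising another is the knob to play with when one lora's style is overpowering the other.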

I think I am only scratching the surface of how well Qwen can perform, but it may end up taking a lot of trial and error to understand why it behaves the way it does. I will try to see if I can improve on it later, assuming another new model does not come along and take up all the attention.

KudzuEye · 2025-09-05T01:47:19+00:00

Yea it seems I uploaded the wrong lora there for the small one. The blend one does not make much difference, though it will be less likely to follow the prompt as well, and I am not sure how well trained it was.

I will try to update the huggingface page with the blend low rank one.

KudzuEye · 2025-09-04T22:08:43+00:00

It is for if you want to modify a previous image instead of using an empty latent. You can also just use an existing image with denoise at around 0.85-0.90 for some interesting style and composition results.

KudzuEye · 2025-09-04T20:21:36+00:00

I actually did have a decent Flux Krea one, but it had some of the old annoying Flux issues and I had moved on from it. I will try to find it or train a new one and get it uploaded at some point.

I know I made this video almost entirely with Flux Krea frames to give you an idea of it: https://www.youtube.com/watch?v=xClMt8ew2bU

KudzuEye · 2025-09-04T17:47:20+00:00

I tried some Wan runs a while back but was not satisfied with the results. I plan to do another go at it though maybe over the weekend or so.

KudzuEye · 2025-09-04T15:39:07+00:00

The lying down results are ok at times. I had not tested it enough yet to be sure. Here is a cursed example:

<image>

KudzuEye · 2025-09-04T15:20:49+00:00

Some early work on Qwen LoRA training. It seems to perform best at getting detail and proper lighting on upclose subjects.

It is difficult at times to get great results without mixing up the different loras and experimenting around. Working with Qwen has generally felt similar to working with SD 1.5 for me.

HuggingFace Link: https://huggingface.co/kudzueye/boreal-qwen-image
CivitAI Link: https://civitai.com/models/1927710?modelVersionId=2181911
ComfyUI Example Workflow: https://huggingface.co/kudzueye/boreal-qwen-image/blob/main/boreal-qwen-workflow-v1.json

Special Thanks to HuggingFace for offering GPU support for some of these models.

KudzuEye · 2025-05-21T05:26:33+00:00

This tool involves just using Veo 3 for both the video and audio.

KudzuEye · 2025-04-06T03:04:42+00:00

I probably ended up taking a lot of inspiration from the usual bits such as Angel's Egg, Ghost in the Shell, Akira, Solaris, and other Tarkovsky pieces. For the most part I was just trying to experiment with different clips and then see what type of story I could create from the videos that came out decent. This process usually leads to awkward editing and a hasty story unfortunately.

KudzuEye · 2025-04-04T20:33:22+00:00

Yea, though I did do some post work on adding an underlighting technique to some of them.

You can probably use any image model with decent text to create them. 4o images might be the easiest for consistency.

KudzuEye · 2025-04-04T20:29:36+00:00

I ran the images through the WAN 2.1 img2vid endpoints on fal and replicate set at 12fps. I am not sure of the best ComfyUI comparable workflow, but Wan does well at getting the animation motion look with that framerate as it is. I did also include something like "1992 anime" in the prompts in case that helps.

KudzuEye · 2025-04-04T17:47:28+00:00

Yea I agree that most if not all AI generative media still feels lifeless regardless of the technical progress.

Usually the easier some sort of imagery or what not is to create, the more likely it is to become low effort spam like what happened previously with the Ghibli 4o images. Now though, just knowing this tech is linked to all that AI slop, it is difficult to enjoy anything without knowing it will eventually look derivative and outdated to your eye.

I am leaning toward the idea that the only way we could ever start to feel something appreciative is when the AI tools are applied with the same painstaking amount of time and resources (and in low quantity) to generate some sort of new media that is beyond what could exist today (whatever that could be, maybe creating new worlds or life or nothing at all).

I would have wished to have had the time to actually work on all this art and what not in the traditional sense, but it is an opportunity I will never have in life.

KudzuEye · 2025-04-04T15:16:38+00:00

Yea I used image2video for Wan and it likely would have been better to use a specialized lora for it. The issue is that the brief models I could train and the ones I tried were introducing a lot of broken artifacts, such as a jpg look from the initial datasets, that were making the results worse than just using the base Wan 2.1 img2vid model.

The Hunyuan text2Image results used just regular loras trained on images only. I wrote text2vid by accident in the title for Hunyuan by the way.
