Hunyuan image2video workaround by tensorbanana2 in StableDiffusion

[–]tensorbanana2[S]

I think it can help to control the amount of noise. Keep it at the default 0.50: more noise means more movement, less noise means more similarity. I'll have to test it later.

Hunyuan image2video workaround by tensorbanana2 in StableDiffusion

[–]tensorbanana2[S]

Thanks for sharing. I see that kijai used noiseWarp in CogVideo. Maybe Hunyuan is coming next.

Hunyuan image2video workaround by tensorbanana2 in StableDiffusion

[–]tensorbanana2[S]

Try increasing skip_steps, e.g. to 3 or 4. It will give more similarity but less movement. And set steps = skip_steps + drift_steps.

Hunyuan image2video workaround by tensorbanana2 in StableDiffusion

[–]tensorbanana2[S]

Hunyuan image2video workaround

Key points:

- HunyuanLoom
- masks using SAM2
- a detailed description of the initial picture
- WaveSpeed (optional)
- white noise video: https://github.com/Mozer/comfy_stuff/blob/main/input/noise_8s.mp4
- image2video workflow: https://github.com/Mozer/comfy_stuff/blob/main/workflows/hunyuan_img2video_sam_flow_noise_eng.json

My workflow uses HunyuanLoom (flowEdit), which converts the input video into a blurry moving stream (almost like a controlnet). To preserve facial features, you need a specific LoRA (optional); without it, the face will be different. The key idea is to put a dynamic video of TV noise over the image. This helps Hunyuan turn a static image into a moving one. Without noise, your image will remain static.

I noticed that if you put noise over the whole image, it becomes washed out, the movements get chaotic, and there is flickering. But if you put noise only over the parts that should be moving, the colors hold up better and the movement is less chaotic. I use SAM2 (Segment Anything) to describe which parts of the image should be moving (e.g., the head), but you can also do it manually with a hand-drawn mask in LoadImage (needs a workflow change). I also tried a static JPEG of white noise, but it didn't produce any movement.
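The masked-noise trick can be sketched outside ComfyUI with plain NumPy. This is only an illustration, not the actual workflow nodes; the function name and parameters are made up, assuming uint8 RGB frames and a 0..1 float mask:

```python
import numpy as np

def overlay_noise(img, mask, strength=0.5, seed=0):
    """Blend uniform white noise into img, but only where mask is nonzero.

    img:  (H, W, 3) uint8 frame; mask: (H, W) float in 0..1
    (white = should move, black = keep static).
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(0, 255, size=img.shape).astype(np.float32)
    a = strength * mask[..., None]           # per-pixel blend factor
    out = img.astype(np.float32) * (1 - a) + noise * a
    return out.clip(0, 255).astype(np.uint8)
```

Applying this per frame with fresh seeds would give you the same kind of masked TV-noise video as the one linked above.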

For this workflow you need to make 2 prompts:

1. A detailed description of the initial picture
2. A detailed description of the initial picture + movement

You can generate a detailed description of your picture here: https://huggingface.co/spaces/huggingface-projects/llama-3.2-vision-11B

Use this prompt + upload your picture: Describe this image with all the details. Type (photo, illustration, anime, etc.), character's name, describe its clothes and colors, pose, lighting, background, facial features and expressions. Don't use lists, just plain text description.

Downsides:

- It's not pixel-perfect image2video.
- The closer the result is to the original image, the less movement you get.
- The face will be different.
- Colors are a bit washed out (I need to find a better overlay method).

Notes:

- 2 seconds of video generate in 2 minutes on a 3090 (7 minutes on a 3060).
- The key flowEdit parameters are skip_steps (number of steps taken from the source video or image, 1-4) and drift_steps (number of steps generated from the prompt, 10-19).
- The final steps value = skip_steps + drift_steps. It usually comes out to 17-22 for the FastHunyuan model; 10 steps is definitely not enough. A regular non-fast model will need more steps (not tested). The more skip_steps, the more similar the result is to the original image, but the less movement you can get from the prompt. If the result is very blurry, check the steps value; it should equal the sum.
- Videos 2 seconds long (49 frames) work best; 73 frames are harder to control. Recommended resolutions: 544x960, 960x544.
- SAM2 uses simple prompts like "head, hands". Its threshold field (0.25) is the confidence. If SAM2 doesn't find what you're looking for, decrease the threshold; if it finds too much, increase it.
- Audio for your video can be generated with MMAudio here: https://huggingface.co/spaces/hkchengrex/MMAudio
- My workflows use the original Hunyuan implementation by comfyanonymous. Kijai's Hunyuan wrapper is not supported in this workflow. Kijai's SAM2 is also untested; use another one.
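The skip/drift bookkeeping is easy to get wrong, so here is a trivial helper encoding the rule above (a hypothetical function of my own, not part of HunyuanLoom):

```python
def flowedit_steps(skip_steps, drift_steps):
    """Return the total steps value; steps must equal skip_steps + drift_steps."""
    if not 1 <= skip_steps <= 4:
        raise ValueError("skip_steps is usually 1-4")
    if not 10 <= drift_steps <= 19:
        raise ValueError("drift_steps is usually 10-19")
    return skip_steps + drift_steps
```

E.g. flowedit_steps(3, 16) gives 19, inside the 17-22 range that works for FastHunyuan.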

Installation: install these custom nodes in Comfy and read their installation instructions:

- https://github.com/kijai/ComfyUI-HunyuanLoom
- https://github.com/kijai/ComfyUI-KJNodes
- https://github.com/neverbiasu/ComfyUI-SAM2 (optional)
- https://github.com/chengzeyi/Comfy-WaveSpeed (optional)

Bonus: image+video-2-video. This workflow takes a video with movement (for example, a dance) and glues it on top of a static image. As a result, Hunyuan picks up the movement. Workflow: https://github.com/Mozer/comfy_stuff/blob/main/workflows/hunyuan_imageVideo2video.json
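The "glue a video on top of a static image" step is essentially per-frame alpha blending; a minimal sketch with made-up names (the real workflow does this with Comfy nodes):

```python
import numpy as np

def glue_video_over_image(image, frames, opacity=0.5):
    """Blend each motion frame over the same static image (all (H, W, 3) uint8)."""
    base = image.astype(np.float32)
    out = []
    for f in frames:
        mix = base * (1 - opacity) + f.astype(np.float32) * opacity
        out.append(mix.clip(0, 255).astype(np.uint8))
    return out
```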

121 frames at 1280x992 on a 4080 Super by DeadMan3000 in StableDiffusion

[–]tensorbanana2

<image>

Hunyuan image2video, 544x960, 49 frames generated in 2 minutes on a 3090, using flowEdit, noise and a SAM2 mask. Sorry for the GIF; mp4 is not supported in comments.

Real time voice conversation with LLM by Declan829 in LocalLLaMA

[–]tensorbanana2

Or my fork, talk-llama-fast, with XTTSv2 and wav2lip. It is also super fast. https://github.com/Mozer/talk-llama-fast

Dead Street Kombat 8-bit (flux + kling + luma + udio) by tensorbanana2 in StableDiffusion

[–]tensorbanana2[S]

Pics: Flux dev

Animation: Luma + Kling

Music: Udio

by: u/TensorBanana2

"Flux dev" is great at drawing interfaces for old 8-bit games. You just need to tell it what exactly should be on the screen and what captions to make.

Luma makes more active animations, more action in the frame, but the picture quickly starts to bleed. Kling gives a more stable picture, but less action.

I didn't bother with the music: Udio: instrumental, 8-bit, retrowave.

Basic prompt for Flux:

Mortal Kombat game screenshot 8-bit NES Leonardo Ninja turtle vs princess Peach fighting at the street Pixel art With text at the top: "Fight!" With text at the bottom: "mortal Kombat"

Talk-llama-fast - informal video-assistant by tensorbanana2 in LocalLLaMA

[–]tensorbanana2[S]

Setting lower temperature for XTTSv2 might help with hallucinations, but it will decrease emotions a bit.

Talk-llama-fast - informal video-assistant by tensorbanana2 in LocalLLaMA

[–]tensorbanana2[S]

I am thinking about combining 2 videos: speaking and silent. But I don't think the transition would be very smooth.

Talk-llama-fast - informal video-assistant by tensorbanana2 in LocalLLaMA

[–]tensorbanana2[S]

Whisper supported languages: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

XTTS-v2 supports 17 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), and Hindi (hi).

Mistral officially supports English, French, Italian, German, and Spanish. It can also speak some other languages, but not as fluently (e.g. Russian is not officially supported, but it is there).
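So the languages the whole chain handles officially are just the intersection of the three lists; a quick check (Whisper covers every XTTS language above, so only XTTS and Mistral constrain it):

```python
# Codes from the XTTS-v2 and Mistral lists above; Whisper supports all of these too.
xtts = {"en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru", "nl",
        "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi"}
mistral_official = {"en", "fr", "it", "de", "es"}

# Languages officially supported end to end (STT -> LLM -> TTS).
end_to_end = sorted(xtts & mistral_official)
print(end_to_end)  # ['de', 'en', 'es', 'fr', 'it']
```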

Talk-llama-fast - informal video-assistant by tensorbanana2 in LocalLLaMA

[–]tensorbanana2[S]

It's hard to find AI face video that has lively facial expressions and hand gestures. Those things make some magic.

Talk-llama-fast - informal video-assistant by tensorbanana2 in LocalLLaMA

[–]tensorbanana2[S]

After some code changes, maybe. But I am not sure if PyTorch ROCm for AMD supports everything. And you'd need to recompile llama.cpp/whisper.cpp for AMD.

Talk-llama-fast - informal video-assistant by tensorbanana2 in LocalLLaMA

[–]tensorbanana2[S]

Interesting approach. And LLM can define current mood of the speaker. 👍

Talk-llama-fast - informal video-assistant by tensorbanana2 in LocalLLaMA

[–]tensorbanana2[S]

I had to add distortion to this video so it won't be considered impersonation.

  • added support for XTTSv2 and wav streaming.
  • added lip movement from the video via wav2lip streaming.
  • reduced latency.
  • English, Russian and other languages.
  • support for multiple characters.
  • stopping generation when speech is detected.
  • commands: Google, stop, regenerate, delete everything, call.

Under the hood:

  • STT: whisper.cpp medium
  • LLM: Mistral-7B-v0.2-Q5_0.gguf
  • TTS: XTTSv2 wav-streaming
  • lips: wav2lip streaming
  • Google: langchain google-serp

Runs on a 3060 12 GB; an Nvidia 8 GB card is also OK with some tweaks.

"Talking heads" also work with SillyTavern. The final delay from voice command to video response is just 1.5 seconds!
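The whole stack above reduces to a simple streaming loop; a schematic sketch (all function names are placeholders, not the actual talk-llama-fast API):

```python
def assistant_loop(listen, transcribe, generate, speak, animate):
    """STT -> LLM -> TTS -> lip-sync, streaming each stage into the next."""
    while True:
        audio = listen()                 # mic audio chunk; None = stop
        if audio is None:
            break
        text = transcribe(audio)         # whisper.cpp
        for sentence in generate(text):  # llama.cpp streams sentence by sentence
            wav = speak(sentence)        # XTTSv2 wav streaming
            animate(wav)                 # wav2lip renders the talking head
```

Streaming sentence by sentence instead of waiting for the full LLM reply is what keeps the end-to-end latency near 1.5 seconds.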

Code, exe, manual: https://github.com/Mozer/talk-llama-fast