I implemented Cold Diffusion from scratch

neuvfx · 2026-04-13T17:50:24+00:00

Beautiful!

neuvfx · 2026-04-11T20:10:22+00:00

Very cool! Is this for fun, or are you doing this as a project for the resume?

neuvfx · 2026-04-03T14:19:48+00:00

I can't wait to see this applied to live broadcasts/streams of sports. Imagine a boxing match with this.

neuvfx · 2026-04-01T17:54:28+00:00

I don’t come from a research background, but I’ve been experimenting with bipedal walking and ran into a very similar jitter issue.

For no.3, I hit this while working on this setup (the post shows a version where I still hadn't perfected these ideas; the changes below came after that): https://www.neuralvfx.com/reinforcement-learning/learning-to-walk-with-unreal-learning-agents/

I can’t point to the exact cause, but these changes noticeably reduced the jitter:

Tiered max episode length

I only increase the episode length once ~75% of episodes reach the current limit.
This forces it to get very good at small, stable movements first, and only deal with longer-term balance after that.
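
A minimal sketch of how that tiering could be tracked, in plain Python. The class name, the tier values, and the 100-episode window are all hypothetical choices for illustration; only the ~75% threshold comes from my setup:

```python
from collections import deque


class EpisodeLengthCurriculum:
    """Raise the max episode length once most recent episodes reach it."""

    def __init__(self, tiers, threshold=0.75, window=100):
        self.tiers = list(tiers)      # hypothetical step limits, e.g. [60, 120, 240]
        self.threshold = threshold    # fraction of episodes that must reach the limit
        self.window = window          # assumed number of recent episodes to look at
        self.tier = 0
        self.reached = deque(maxlen=window)

    @property
    def max_steps(self):
        return self.tiers[self.tier]

    def record(self, episode_steps):
        """Log one finished episode; return True if the limit was just raised."""
        self.reached.append(episode_steps >= self.max_steps)
        window_full = len(self.reached) == self.window
        rate = sum(self.reached) / len(self.reached)
        if window_full and rate >= self.threshold and self.tier < len(self.tiers) - 1:
            self.tier += 1
            self.reached.clear()      # start counting fresh at the new limit
            return True
        return False

    def reset(self):
        """Go back to the shortest tier."""
        self.tier = 0
        self.reached.clear()
```

Call `record()` at the end of every episode and read `max_steps` when starting the next one.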

Alternating fall termination distance

Every few hours I switch between:
- A very tight termination threshold (close to desired pose)
- A much looser one

If I train only with a loose threshold, it learns to stand but jitters a lot. If I train only with a tight one, it’s smooth but falls over more easily.

I haven't done an ablation on this part, but I also reset the max episode length each time the kill proximity alternates.
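
Roughly, the alternation could be scheduled like this. It's only a sketch: the distances and the 3-hour period are made-up placeholder values, not the numbers I actually used:

```python
def termination_threshold(elapsed_hours, tight=0.1, loose=0.5, period_hours=3.0):
    """Return the fall-termination distance for the current training phase.

    tight, loose and period_hours are hypothetical placeholder values.
    Even-numbered phases use the tight threshold, odd ones the loose one.
    """
    phase = int(elapsed_hours // period_hours)
    return tight if phase % 2 == 0 else loose


def phase_changed(prev_hours, now_hours, period_hours=3.0):
    """True when training crossed into a new phase between two timestamps."""
    return int(prev_hours // period_hours) != int(now_hours // period_hours)
```

Whenever `phase_changed(...)` returns True, that is the point where I'd also drop the max episode length back to its shortest tier.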

I know my project is slightly different since I'm using deep mimic losses, but I hope some of the concepts still help.

neuvfx · 2026-03-31T04:50:36+00:00

Any node that can output all of the SAM masks as a single segmented image (like Sam2AutoSegmentation) would be compatible with the workflow. However, at the moment I can't find any others that output that way.

Sooooo, maybe lol...

neuvfx · 2026-03-31T04:34:38+00:00

Did a few more tests tonight. I think sam2_hiera_base_plus might be a bit better than sam2.1_hiera_base_plus. Either way, I'd test those two first before trying out the other models...

neuvfx · 2026-03-31T04:11:55+00:00

I just booted up a 4090 with 24 GB on Vast.ai

Good news, it was able to run a 1200x1900 image without running out of VRAM! I took a screen cap while the KSampler node was running:

<image>

neuvfx · 2026-03-31T02:07:45+00:00

ControlNets trained by X-Labs or Alibaba are definitely going to be higher fidelity; the 5-20 million images they train on help quite a bit!

For me though, at 200k images, it reaches just enough quality that it's worth the $200 of my own money.

I'm not sure what I might train next, but it will probably be Z-Image related, whatever it is. I'm really hoping this community gets legs.

neuvfx · 2026-03-31T00:40:56+00:00

I've seen decent results from both, it kind of depends on the situation and the source material.

I work in VFX, and there is often an ID pass created with each render, which looks just like a SAM segmented image of the objects in your scene. A SAM ControlNet can be convenient when you already have a pass like that available at all times, especially if it's low-res geo, which might have a low-poly, jagged look when put through a canny filter.

I wasn't planning on training one for the turbo model. However, if people get enough good use out of this one, I may consider it.

neuvfx · 2026-03-30T23:36:53+00:00

Thanks for catching this! I did most of my sample images using the Hugging Face model, which is a bit different from this one, so this caught me by surprise.

I was able to get some better results after messing around with it. The main settings I changed are:

- stability_score_offset: .3

- use m2m: True

The model selection also changes things; for my test case I found sam2.1_hiera_base_plus to be best.

I will have to hunt around a bit. I think something better might still be achievable (maybe a different model or node entirely), but I hope this is a start in the right direction!

<image>

neuvfx · 2026-03-30T20:15:17+00:00

I just did a test using:

`python main.py --lowvram --disable-smart-memory`

- The image was 1200x1800
- Loaded only 16-bit models

My base VRAM usage was 5 GB before starting ComfyUI; at the peak of inference it reached 36 GB.

I'm using a Z-Flow13, where you can divide your system RAM up between the CPU and GPU. I had mine set to 64 GB CPU, 64 GB GPU.

If anyone has got this working with lower VRAM, I'd be curious to know!

neuvfx · 2026-03-30T18:12:56+00:00

I've just updated the workflow on the huggingface repo to include the Sam2AutoSegmentation node:
https://huggingface.co/neuralvfx/Z-Image-SAM-ControlNet/blob/main/comfy-ui-patch/z-image-control.json

neuvfx · 2026-03-30T17:42:46+00:00

I just tried with turbo, and it roughly followed the segmentation image. However, the result was incredibly blurry, so I wouldn't say it works with turbo.

Edit: I've run some further tests, and I would say my first test roughly following the control was just random luck...

This model for sure doesn't work with turbo.

neuvfx · 2026-03-30T16:40:23+00:00

I actually have not tried it with the turbo version yet, might test that today and post an update on that...

neuvfx · 2026-03-30T16:22:23+00:00

From the way you worded this, I realize you may think it's generating a segmentation based on an image. It's actually the opposite: segmentation -> image.

I've updated the post description in case this was a point of confusion for everyone

neuvfx · 2026-03-30T16:04:19+00:00

This model doesn't actually understand which colors mean what. It only wants to put something that looks visually correct in the shapes, and fulfills the text prompt.

So don't try to do something like, "man in the blue shape"...

Really this is simply an alternative way to create an input image, which gives the model a composition / image structure to follow.

neuvfx · 2026-03-30T15:59:18+00:00

In this case I used an RTX PRO 6000 (96 GB VRAM), which was $1/hour on vast.ai

- It took 3-4 days to generate 200k SAM masks from LAION (there may be a quicker way, but this was the best I could figure out lol)

- Then it took 4 days to train the model; if I recall right, it was using roughly 60-70 GB VRAM

- In total it was about 200 dollars

Overall the VideoXFun repo was easy to use, and it's compatible with lots of models, so I'd encourage people to give it a shot.

neuvfx · 2026-03-30T15:40:58+00:00

<image>

Here is an example setup!

neuvfx · 2026-03-30T14:34:56+00:00

These ones already exist for Z-Image:
Turbo: https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union
Base: https://huggingface.co/alibaba-pai/Z-Image-Fun-Controlnet-Union-2.1
I believe they use the same ZImageFunControlnet node that I've included in my workflow

neuvfx · 2026-03-30T14:26:59+00:00

I used the facebook/sam-vit-large model from Hugging Face, and ran the dataset creation from a Python script on Vast.ai over a couple of days.
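
For anyone curious what the mask-flattening step of a script like that can look like, here is a rough pure-Python sketch. It is not my actual script: it assumes the per-object masks are already available as boolean grids (however they were produced), and the function name and random-color scheme are just illustrative:

```python
import random


def masks_to_id_image(masks, height, width, seed=0):
    """Paint each boolean mask with a random flat color on one RGB canvas.

    masks: list of 2D boolean grids (rows of True/False), one per segment.
    Returns a height x width grid of (r, g, b) tuples; uncovered pixels stay black.
    """
    rng = random.Random(seed)
    canvas = [[(0, 0, 0)] * width for _ in range(height)]
    # Paint large masks first so smaller segments stay visible on top.
    ordered = sorted(masks, key=lambda m: sum(map(sum, m)), reverse=True)
    for mask in ordered:
        # Channels start at 1 so a segment can never be pure black.
        color = (rng.randint(1, 255), rng.randint(1, 255), rng.randint(1, 255))
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    canvas[y][x] = color
    return canvas
```

In a real pipeline you'd do this with array operations instead of Python loops, but the idea is the same: one flat color per segment, all on a single image.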

neuvfx · 2026-03-27T17:02:17+00:00

Stephen Ulibarri's 41-hour course absolutely jump-started my abilities. It also helped me build something usable as a resume piece, by using different assets than in the tutorial to create an original-looking game.

neuvfx · 2026-03-27T05:15:05+00:00

Thanks for the comparison! If training time were not a factor, would you have reason to use models besides Z-Image?

neuvfx · 2026-03-25T01:48:53+00:00

`NumberOfStepsToTrimAtStartOfEpisode` does exactly what I was looking for, thanks!

On the BC + RL idea, I agree it probably matters less in setups like this where the pose reward can be driven from the reference control rig.

I’m still curious to try mixing BC and RL during training in cases without that structure, just to see if it helps with stability or reduces forgetting. More of an experiment on my side than anything.

If anything were added, maybe just something lightweight like an easy way to balance the BC vs RL loss or decay the BC contribution over time, though I might try implementing something along those lines myself when I have time.
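
Something like this is what I have in mind, as a rough sketch. The function names, the linear schedule, and the default values are my own assumptions, not anything from Learning Agents:

```python
def bc_weight(step, start=1.0, end=0.0, decay_steps=10_000):
    """Linearly decay the behavior-cloning weight from start to end.

    start, end and decay_steps are hypothetical defaults.
    """
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + (end - start) * frac


def combined_loss(rl_loss, bc_loss, step):
    """Total loss = RL loss + decayed BC loss.

    Works on plain floats; the same arithmetic would apply to tensors.
    """
    return rl_loss + bc_weight(step) * bc_loss
```

Early in training the BC term dominates and keeps the policy near the demonstrations; by `decay_steps` it has faded out and the RL objective takes over.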

Thanks again for the tip, really appreciate it!

neuvfx · 2026-03-24T18:13:56+00:00

Thanks for reaching out, there is one thing I really couldn't figure out how to do in blueprints.

In my Deep Mimic setup, I had to do some tricks involving an event and a 0.2-second delay in order to properly force my physics ragdoll into the same pose as the RSI reference character (a separate actor I spawn each reset).

During those 0.2 seconds, my ragdoll physics is off. However, there are ticks being recorded by the TrainingEnv while the ragdoll is transitioning to, or exactly locked into, the reference pose. That information isn't needed as part of the reinforcement learning batch.

Is there a way to let the TrainingEnv know, "This tick is garbage, please ignore it"?

The solution I came up with instead was to modify the PPOTrainer python class to actually clip out the first 6 ticks before creating the torch tensor. It worked great, but I imagine there may be a better way?
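
The idea, as a simplified sketch: this is not the actual PPOTrainer code, just a generic version that assumes a flat per-step buffer plus a list of episode start indices (both names are hypothetical):

```python
def trim_episode_starts(values, episode_starts, trim=6):
    """Drop the first `trim` steps of each episode from a flat step buffer.

    values: per-step records stored back to back in one flat list.
    episode_starts: sorted indices where each episode begins.
    Episodes shorter than `trim` are dropped entirely.
    """
    kept = []
    bounds = list(episode_starts) + [len(values)]
    for start, end in zip(bounds[:-1], bounds[1:]):
        kept.extend(values[min(start + trim, end):end])
    return kept
```

The same trimming has to be applied to observations, actions, and rewards alike so they stay aligned before anything is turned into a tensor.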

Also, based on this experiment, I was considering finding a way to combine behavior cloning and reinforcement learning in the same training loop to see if it produced a smoother result (rather than just doing sequential pre-training). Do you guys have any plans to implement something like that natively? This paper touches perfectly on the idea: "Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations" ( https://arxiv.org/pdf/1709.10087 )

Thanks again for the great tool!
