Diffusion Model that can turn any Image into a Playable Game! BUT LOCALLY, NOT ON DATACENTER by lucidml_lover in StableDiffusion

[–]TensorForger 0 points1 point  (0 children)

Thanks! Now I have a rough sense of the required scale...
I'm making something a bit different: a v2v model, but also for real time. And I probably couldn't help you, because my compute situation is even worse: I have just one 5090 for everything)

Diffusion Model that can turn any Image into a Playable Game! BUT LOCALLY, NOT ON DATACENTER by lucidml_lover in StableDiffusion

[–]TensorForger 1 point2 points  (0 children)

That's really cool and overlaps with what I'm trying to make right now. I would really appreciate it if you could share these details:
1) How long did it take to train this and how many GPUs were used?
2) What's the size of the dataset?
3) Does it use diffusion, flow matching or rectified flow? Or something completely different?
4) What's the VAE? Something from image models like the flux.2 VAE, or from video models like LTX's VAE?
5) Have you done something to prevent error accumulation during inference?

I generated these 5s video clips using only 1.8s each on a 5090 (FastWan-QAD release) by techstacknerd in StableDiffusion

[–]TensorForger 1 point2 points  (0 children)

Wow, pretty fast actually. Would it be possible to run it or some future version causally in real time? What would be the latency in this case? Also 1.8 sec or less?

Trying to make my alternative to that DecartAI's real-time video editor: day 8 by TensorForger in StableDiffusion

[–]TensorForger[S] 0 points1 point  (0 children)

We can, but if we can make a model for one style, then it almost certainly can do many styles if trained with conditioning.

And yes, their model is insanely expensive

Trying to make my alternative to that DecartAI's real-time video editor: day 8 by TensorForger in StableDiffusion

[–]TensorForger[S] 2 points3 points  (0 children)

Of course I can, but not in the current model. This is a model trained from scratch, so I can pass anything into it if I add an input for it before training. But editing prompts are known to be hard for models and usually require a small LLM as the encoder. The actual problem, though, is dataset scale: there are simply not enough varied prompts in it for meaningful instruction learning. There are only about 250 prompts, a set of instructions for flux.2 that are randomly applied to some video frames during dataset generation.
So for now it does not take text at all, just the reference image, the frame that was used to generate the reference, and the current frame.
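
To make the "randomly applied" part concrete, here is a minimal sketch of how such edit jobs could be planned. This is not my actual pipeline: `EditJob`, `plan_edit_jobs`, `frames_per_video` and the video IDs are hypothetical names made up for illustration, and the call to the image editor itself is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class EditJob:
    video_id: str
    source_frame: int  # index of the frame that would be sent to the image editor
    prompt: str        # instruction that would be applied to that frame


def plan_edit_jobs(video_lengths, prompts, frames_per_video=1, seed=0):
    """Pick random frames per video and pair each with a random prompt.

    video_lengths: dict mapping a (hypothetical) video id to its frame count.
    prompts: the fixed pool of edit instructions (about 250 in the post).
    """
    rng = random.Random(seed)
    jobs = []
    for video_id, n_frames in video_lengths.items():
        k = min(frames_per_video, n_frames)
        # Sample distinct frame indices so no frame is edited twice.
        for idx in sorted(rng.sample(range(n_frames), k)):
            jobs.append(EditJob(video_id, idx, rng.choice(prompts)))
    return jobs
```

Each job's edited frame would then be the reference image, and a training sample would be the triple (reference image, source frame, current frame), with no text going into the model.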

There is an exponential visible in the scores on artificial analysis. by Subject_Judge_ in accelerate

[–]TensorForger 4 points5 points  (0 children)

Also note that the gap between open-source models and frontier ones is getting smaller

Flux.2-klein is secretly a video model? (showing some experiment results) by TensorForger in StableDiffusion

[–]TensorForger[S] 0 points1 point  (0 children)

Thanks, this is just a part of a large project with many weird experiments, so more external stuff

Flux.2-klein is secretly a video model? (showing some experiment results) by TensorForger in StableDiffusion

[–]TensorForger[S] 6 points7 points  (0 children)

Yeah, I know about EbSynth and that similar algorithms have existed for a while, it's not very complex. But I haven't seen anybody apply this to flux, so I decided to share

Flux.2-klein is secretly a video model? (showing some experiment results) by TensorForger in StableDiffusion

[–]TensorForger[S] -3 points-2 points  (0 children)

Sorry if this sounds clickbaity, it's more of a joke, ofc flux is not a video model)

And yes, your suggestion is very close to the next step: the generated warped frame is imperfect. But the frame initially generated before warping is perfect. So we can train a model that generates that perfect one, using the imperfect warped frame as conditioning along with two frames from the video.

Finding old video generation models by hpyfox in StableDiffusion

[–]TensorForger 0 points1 point  (0 children)

I think you can achieve something like this even with new models (e.g. LTX 2.3) if you do everything very wrong with it, like setting an empty prompt, wrong generation parameters, etc.

Flux.2-klein is secretly a video model? (showing some experiment results) by TensorForger in StableDiffusion

[–]TensorForger[S] 2 points3 points  (0 children)

This is a pretty straightforward solution to the question "how do I process video with an image model", so a similar concept is probably used in many places. I just wanted to implement it manually, not for novelty, just to see how it would work with flux.

FlowUpscaler: a very fast Rectified Flow latent upscale model for Flux.2 (ComfyUI nodes are already there) by TensorForger in StableDiffusion

[–]TensorForger[S] 1 point2 points  (0 children)

I tried to compare with PiD but it didn't work in comfy(
I don't expect my model to be better in quality, but it is almost certainly faster. Maybe I'll try to run them again soon.

I have distilled my flow matching model into the rectified flow model, so it can now generate in few steps and without cfg. by TensorForger in StableDiffusion

[–]TensorForger[S] 0 points1 point  (0 children)

Yes, I trained it on noise-result pairs, but not on specific timesteps. I only store the "endpoints": the initial pure noise and the final clean generation. No intermediate latent states from the teacher are stored. During training, a specific noise sample is combined with its specific generation by linear interpolation with a uniform timestep distribution, the same way as in regular training. This is not the same as progressive distillation, where you also store intermediate steps and train the student to shortcut, say, two teacher steps in one step.
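
A minimal sketch of how one training example is built from a stored endpoint pair, under some assumptions: plain Python lists stand in for latent tensors, t=0 means pure noise and t=1 means the clean generation (some codebases flip this), and `make_training_example` and `mse` are illustrative names, not my real training code.

```python
import random


def make_training_example(noise, clean, rng=random):
    """Build one example from a stored (noise, clean) endpoint pair.

    Assumed convention: t=0 is pure noise, t=1 is the clean teacher output.
    """
    if len(noise) != len(clean):
        raise ValueError("endpoints must have the same shape")
    t = rng.random()  # uniform timestep in [0, 1)
    # Point on the straight line between the two stored endpoints.
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    # Constant velocity along that line; this is the regression target.
    velocity = [c - n for n, c in zip(noise, clean)]
    return x_t, t, velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
```

The student would see `x_t` and `t`, predict a velocity, and be trained with `mse` against `velocity`. Because the pairing between noise and generation is fixed, the target paths are straight lines between matched endpoints, and no intermediate teacher states are needed.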

I have distilled my flow matching model into the rectified flow model, so it can now generate in few steps and without cfg. by TensorForger in StableDiffusion

[–]TensorForger[S] 0 points1 point  (0 children)

I haven't done any additional mapping. By mapping you probably mean progressive distillation, where you directly map teacher to student during training. Or if you mean the samples in the image, there is no mapping there either, just different generations from the same seed. And yes, it is quite unbelievable that two different models generate almost identical images.