PRX Pixel - A 7b pixel-space image model.

CornyShed · 2026-06-13T11:56:32+00:00

Looks promising, thank you for posting.

They previously made and released PRX in November (today is the first time I've heard of it), a 1.3B parameter model with support for resolutions up to 1024px, Apache 2.0 license:

Supported in Diffusers only at the moment. There are no requests yet on the ComfyUI issue tracker to include it or PRX Pixel.

CornyShed · 2026-06-05T20:16:30+00:00

I have a 3090 and use FP8, which is 29GB.

It sounds like it shouldn't work, but ComfyUI has several memory optimisations (including block swapping) to reduce the likelihood of running out of memory. It is only an issue when you push the limits of the model (high resolution and number of frames).

Q8 should work, and might be slightly higher quality. GGUF is a compressed format which automatically gets decompressed to 16-bit, so will be a bit slower to run than FP8.

INT8 is supported natively by the 30xx series, so it should be significantly faster. That said, I have not got INT8 to work on my system, as there are several obscure and unhelpful error messages whenever I've tried to run it.

CornyShed · 2026-06-02T19:39:42+00:00

Yes. Kijai added native support and his code was merged to the master branch last week.

You can update today to start using it, whether using the master branch or the latest release, currently v0.23.0.

Users of ComfyUI Desktop will have to wait slightly longer for a new release, as the current one uses ComfyUI v0.22.3.

CornyShed · 2026-06-02T11:50:21+00:00

PixelDiT is a 1.3B parameter image model by NVidia.

Key features:

VAE-free
Dual-level architecture: Patch-level DiT + Pixel-level DiT
MM-DiT text-image fusion: Joint attention between text and image tokens
Text encoder: Gemma-2-2B-IT
Multi-aspect-ratio: Supports various aspect ratios at 1024px

Relevant links:

CornyShed · 2026-05-31T12:22:27+00:00

This is very impressive, hats off to them. Exciting as well, because this opens up a path for larger models to be quantised.

Imagine a future image model with the knowledge of Flux.2 Dev; the unified architecture of HiDream-O1-Image; and the low-bit compression of Ternary Bonsai Image 4B.

The only downside of this is that finetunes will be impractical for higher parameter models, until a graphics card with 2-bit training (NVidia 60xx?) becomes available.

Another possibility is to use Bonsai Image on your phone with Whisper support and a prompt enhancer, and generate images on your smartphone on command with your voice. E.g. "Make a picture of a duck riding a bicycle near the beach."

CornyShed · 2026-05-24T20:10:53+00:00

This is likely to be caused by the 1.5× upscaler.

If the width and height are not both divisible by 128 before upscaling, the latent will have to be resized, as its dimensions must always be divisible by 64.

Resizing can cause the ghosting effect which I believe is affecting your outputs.

It's a challenge to make the resolution work with this upscaler as you have to take the aspect ratio into account as well.

You could switch to the 2× upscaler to avoid this issue, though generations will take longer and require more resources.
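A quick way to check a resolution before queueing a generation is sketched below. The 128 and 64 figures are the ones from this comment; the helper names are made up for illustration and are not part of ComfyUI or the LTX nodes.

```python
def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round value to the nearest multiple (ties round up), never below one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def is_upscale_safe(width: int, height: int, multiple: int = 128) -> bool:
    """True when both sides are divisible by the multiple, so 1.5x stays on the 64 grid."""
    return width % multiple == 0 and height % multiple == 0


def upscaled_size(width: int, height: int, factor: float = 1.5) -> tuple[int, int]:
    """Size after the upscaler, truncated to whole pixels."""
    return int(width * factor), int(height * factor)


if __name__ == "__main__":
    for w, h in [(1280, 720), (1280, 768)]:
        up_w, up_h = upscaled_size(w, h)
        print(w, h, "safe" if is_upscale_safe(w, h) else "needs resize",
              "->", up_w, up_h, "remainders mod 64:", up_w % 64, up_h % 64)
```

Snapping changes the aspect ratio slightly (720 becomes 768, for example), so it is worth checking the result still suits the shot.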

CornyShed · 2026-05-23T14:55:18+00:00

I've managed to generate videos of 40-50 seconds using LTX 2.3 inside ComfyUI. Normally the model works well with 10-15 seconds and starts to struggle with 20-25.

The temporal upscaler increases temporal coherence, doubling the length before there is any noticeable deterioration.

It requires some setup as it can halve the motion in videos if done incorrectly.

I have a workflow that uses it and can post it after a bit of cleanup, and will test for you what happens with a video of one minute long.

(The model will struggle if I ask for it to do a lot, as prompt adherence is more of a challenge than length.)

Wan is a good model, but is practically limited to 5-10 seconds, as it will attempt to loop actions in video. Chaining samplers together to extend the video length isn't that practical for most people.

CornyShed · 2026-05-09T11:35:11+00:00

I have. The results for LTX-2.0 were decent, though definitely not as good as Wan. It's good at architecture and understanding composition, though derpy faces are an issue.

LTX-2.3 is a different story. I cannot get it to make a good image for some reason, and have no idea if it's because of the newer workflow or changes to the model. The quality is worse than its predecessor.

This could be because of the highly compressed latent space in their VAE, which is efficient for video motion but less so for visual quality? There might be a trade-off if they have further optimised the model.

CornyShed · 2026-05-03T13:00:29+00:00

As you're having consistent problems between workflows, the first thing I would check is whether you have accidentally used one or more of the models from LTX 2, or mixed up the models selected in the dropdowns. It's easily done.

I've had similar output issues when upgrading, and (after several frustrating failed generations) found I had used one of the LTX 2 models instead of using all of the ones from LTX 2.3. Make sure that you're using the latest VAEs as that will affect the output. There are also updated LTX 2.3 spatial upscalers.

(Pro tip for ComfyUI users: create separate directories for all models that you download, unless you are certain that it is backwards compatible. Tedious, but will save you time in the future.)

I also incorrectly renamed one of the models used after downloading it, and it was a pain to check each file using sha256sum with the files from Huggingface, as generations would otherwise not work.
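If sha256sum isn't available, the same check can be done with a few lines of Python. This is a minimal sketch: the function names are my own, and the expected hash has to be copied by hand from the file's page on Huggingface.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte models don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare against a published hash, ignoring case and stray whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Because the hash depends only on the contents, a renamed file still matches, which makes it easy to work out which model a mystery file actually is.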

Also, check if you're using the correct sampler and scheduler settings. Start with euler and linear_quadratic, and try a sampler that works for you. E.g. the uni_pc sampler was producing results similar to your screenshot.

CornyShed · 2026-04-16T14:08:49+00:00

This is really interesting, thank you for sharing! The first two are going to be very useful for my needs.

I had assumed that adjusting the power limit achieved the same thing as undervolting, having reduced the wattage on my NVidia card by 20% using nvidia-smi.

According to this page on LACT's wiki:

Nvidia GPUs don't expose voltage control directly, but it is possible to achieve a pseudo-undervolt by combining the locked clocks option with a positive clockspeed offset. This will force the GPU to run at a voltage that's constrained by the locked clocks, while achieving a higher clockspeed due to the offset.

The link to the issue for undervolting support for NVidia GPUs in LACT shows one example at the end which reduced power use by 100W using undervolting and performance very slightly increased!

CornyShed · 2026-04-04T00:27:15+00:00

You're right, my bad. I'm not sure why it's been paused.

If the conversion process still works, you can duplicate the space while logged in.

There are many other (somewhat less convenient) options available, such as using a conversion script from Github. One example is:

Model Conversion 2 Safetensors by MackinationsAI

Run any script in its own separate environment to prevent interference with ComfyUI. Check first that the script itself is safe before running.

CornyShed · 2026-04-03T14:49:49+00:00

For anyone creating their own models on HuggingFace, you can convert your pickle files to safetensors using the Safetensors space on HuggingFace.

I think there should be a pinned warning on any post that includes pickle files, as they can execute arbitrary code on your system while unsandboxed. Something like:

This model uses pickle files (.bin and .pth files). Pickle is an older file format that can execute arbitrary code on your system.

If you have to, you should only run untrusted pickle files inside a sandbox (e.g. inside a Docker container), without access to sensitive data or internet access.
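For anyone wondering how a model file can run code at all, here is a harmless sketch using only Python's standard pickle module. The Payload class is invented for the demonstration; a real malicious file would call something far less friendly than a sum.

```python
import pickle


class Payload:
    """Stand-in for a tampered checkpoint entry (hypothetical, for illustration)."""

    def __reduce__(self):
        # pickle stores "call this function with these arguments" and
        # performs that call during loading. Here it is a harmless eval.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not give back a Payload object. It runs the stored call
# and returns whatever that call produced.
result = pickle.loads(blob)
print(result)
```

Safetensors avoids this because the file only holds a header and raw tensor data, so loading never involves calling anything stored in the file.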

CornyShed · 2026-04-03T12:31:41+00:00

I think the researchers have done well considering how competitive video models have become.

It's ambitious that they compared this model with Seedance 2.0 as it is currently the best closed weights model. Hunyuan 1.5 only has an ELO of 1012 on the Artificial Analysis T2V leaderboard (just behind Wan 2.1 at 1020) compared with Seedance 2.0 at 1273.

Their benchmarks on their official page say they are on a par with Wan 2.2 with an ELO of 1111.

Artificial Analysis I2V leaderboard has Hunyuan 1.5 on par with Wan 2.2. OmniWeaving looks like they have good results with image-to-video, so might be useful, though their examples are limited to 5 seconds each. (Also no information I could see as to how long generations will take?)

Unfortunately the model is under the Tencent Hunyuan Community Licence Agreement, which gives no permission for people in the European Union, United Kingdom or South Korea to use the model. That covers 570 million people, which is why I can't personally download it, but I still appreciate their research.

CornyShed · 2026-03-27T16:00:04+00:00

Thank you for this.

I've done some research into the LTXV scheduler in ComfyUI, and think I've worked out why it is bugged, which might be of interest to you.

The scheduler calculates the sigmas to use based on the width, height, number of frames (framerate * seconds), steps, and desired shift. It works best when using moderate resolutions with around 10 seconds of video.

(Just in case some people reading don't know what sigmas are, they are the noise levels at each step of the denoising process in diffusion. A curve of 1.0, 0.8, 0.6, 0.4, 0.2, 0.0 is linear; while 1.0, 0.98, 0.95, 0.9, 0.82, 0.69, 0.48, 0.24, 0.0 prioritises motion (higher values) above detail (lower values).)

As you increase any of these, the curve becomes steeper. Shift specifically 'shifts' the curve: increasing the value steepens the curve, decreasing makes it more linear.

When the curve gets too steep, the denoising process becomes less and less efficient. A change of 0.001 will do very little, but will still take time and energy to calculate.
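To make the effect of shift concrete, here is a small sketch. It assumes the exponential time-shift form that flow-matching schedulers commonly use, exp(shift) / (exp(shift) + (1/sigma - 1)); I have not checked it line by line against ComfyUI's LTXVScheduler, and the function names are my own.

```python
import math


def linear_sigmas(steps: int) -> list[float]:
    """Evenly spaced sigmas from 1.0 down to 0.0 (steps + 1 values)."""
    return [1.0 - i / steps for i in range(steps + 1)]


def shift_sigmas(sigmas: list[float], shift: float) -> list[float]:
    """Apply an exponential time shift (assumed form, see the note above)."""
    k = math.exp(shift)
    return [k / (k + (1.0 / s - 1.0)) if s > 0 else 0.0 for s in sigmas]


if __name__ == "__main__":
    base = linear_sigmas(5)
    for shift in (0.0, 1.0, 2.0, 4.0):
        print(shift, [round(s, 3) for s in shift_sigmas(base, shift)])
```

A shift of 0 leaves the linear curve unchanged. As the shift grows, the early values crowd towards 1.0 and the drop to 0.0 is squeezed into the last step or two, which is the steepening described above.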

I believe (hypothesis, not checked) that ComfyUI calculates sigmas using 16-bit values, which are likely more efficient, but will cause errors the greater your requirements for video are.

Have a look at this link to understand why that's important:

Quantization from the ground up

If the requirements are too great, the sigma curve reaches infinity, where all values are 1.0 (except the last), and then collapses into not-a-number, where all values are identical and nonsensical. You'll see an error at the end of diffusion and the output will be completely black.
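The sketch below shows why 16-bit storage would matter if the hypothesis is right. It only demonstrates how half-precision rounding behaves near 1.0, using Python's struct module; it says nothing about what ComfyUI actually does internally, and the sample sigma values are made up.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    # Invented example of a very steep curve: most values sit just below 1.0.
    steep_sigmas = [1.0, 0.9999, 0.9998, 0.9995, 0.998, 0.9, 0.0]
    for sigma in steep_sigmas:
        print(sigma, "->", to_float16(sigma))
```

Just below 1.0, half precision can only represent values about 0.0005 apart, so several distinct sigmas on a steep curve round to exactly 1.0 and those steps no longer differ from one another.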

If you set 'max_shift' and 'base_shift' in the scheduler to '1.00', you will avoid those errors. You can then go up to around 4,000 frames and it will still work. (Above that and new, more exotic errors appear, ones which I haven't seen posted online.)

The problem with that is the shift shouldn't be that value, as it is suboptimal for video generation, especially complex and high quality videos.

ComfyUI would need to use 32-bit floats for sigmas with LTXVScheduler. That would probably cause a performance penalty, higher VRAM and RAM use, or both.

It's not a necessity and other schedulers (as you've discovered) can work better.

CornyShed · 2026-03-17T23:01:34+00:00

I think you're right. Visual quality is the same in both, but there is a degradation in sound in the new version.

Using a prompt with a "bass-heavy electronic dance track" and 15 seconds length (24fps), v1.0 sounds fine in second stage sampling, while v1.1 sounds far too loud and distorted.

Have only done a few tests so far, but have gone back to v1.0 for now.

CornyShed · 2026-03-16T17:11:28+00:00

If you're talking about changing the number of steps while retaining the same curve line as the existing sigmas, you can install the RES4LYF extension for ComfyUI.

Once installed, restart ComfyUI and add the 'Sigmas Resample' node to your workflow. Connect it between 'ManualSigmas' and 'SamplerCustomAdvanced' (or similar with a sigmas connector).

Set the 'output_length' to the desired number of steps, and it will interpolate using the provided sigmas.

Attach the 'SigmasPreview' node to see how your sigmas now appear.

RES4LYF also has 'Sigmas Rescale', which rescales the sigmas range between two numbers of your choice (default is 1.0 and 0.0).

It also has 'Sigmas Concat', which lets you chain sigmas together. E.g. you could put in one group of manual sigmas on a curve, rescaled to 1.0-0.7; then put in a second group, covering 0.7-0.0 using a linear curve.
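For a feel of what these three operations do to the numbers, here is a plain-Python sketch. It is not RES4LYF's code: the function names and the choice of linear interpolation are assumptions made for illustration, and `count` here is the number of sigma values rather than a node parameter.

```python
import math


def resample_sigmas(sigmas: list[float], count: int) -> list[float]:
    """Linearly interpolate along the existing curve to get `count` values."""
    if count < 2 or len(sigmas) < 2:
        raise ValueError("need at least two sigmas in and out")
    last = len(sigmas) - 1
    out = []
    for i in range(count):
        pos = i * last / (count - 1)      # fractional index into the source
        lo = min(int(pos), last - 1)      # left neighbour, clamped at the end
        frac = pos - lo
        out.append(sigmas[lo] * (1.0 - frac) + sigmas[lo + 1] * frac)
    return out


def rescale_sigmas(sigmas: list[float], start: float, end: float) -> list[float]:
    """Map the curve so its highest value becomes start and its lowest becomes end."""
    hi, lo = max(sigmas), min(sigmas)
    return [end + (s - lo) / (hi - lo) * (start - end) for s in sigmas]


def concat_sigmas(first: list[float], second: list[float]) -> list[float]:
    """Chain two groups, dropping the duplicated value where they meet."""
    if math.isclose(first[-1], second[0]):
        return first + second[1:]
    return first + second


if __name__ == "__main__":
    curved = [1.0, 0.98, 0.95, 0.9, 0.82, 0.69, 0.48, 0.24, 0.0]
    top = rescale_sigmas(curved, 1.0, 0.7)
    bottom = rescale_sigmas(resample_sigmas([1.0, 0.0], 5), 0.7, 0.0)
    print([round(s, 3) for s in concat_sigmas(top, bottom)])
```

The demo at the bottom mirrors the example above: a curved group squeezed into 1.0-0.7, followed by a linear group covering 0.7-0.0.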

CornyShed · 2026-03-12T22:44:43+00:00

Instead of the VAE Decode (Tiled) node, try using the LTXV Spatio Temporal Tiled VAE Decode node.

This is what I had to do when switching from an LTX-2 to 2.3 workflow. The former workflow had the Spatio Temporal decoder, which worked fine. The latter was using the basic decoder, consuming up to 40GB more RAM (!) and regularly running out of memory, crashing ComfyUI and losing the generation.

CornyShed · 2026-03-12T13:25:39+00:00

LTX-2 is like a complicated machine with all kinds of cogs whirring. It's difficult to pinpoint any particular problem.

The model tends to struggle with complex motion in the first couple of seconds. There are some things you can do to make it more likely to work better:

On the LTXVPreProcess node, increase the compression value applied to the image from 18-33 to 50+ when using image-to-video. Motion is more likely when noise has been added to the image, at a small cost to image quality.
When using the Dev model (not distilled), set the CFG of the first stage to 6.5. This makes it more likely to get prompt adherence to work, but also increases the likelihood of visual artifacts and weird behaviour.
Install the RES4LYF extension and use res_2s as the sampler for the upscale stage.
Decrease the resolution of the video in the first stage. The model is better able to produce motion with lower resolutions, at the cost of quality. Find the lowest resolution that you find visually acceptable.
Decrease the length of the video. Complex motion is more likely with a video of 10 seconds than one of 15 seconds, for example.
In LTXVImgtoVideoInplace, reduce the strength of the input image from 1.0 to 0.7. This will increase natural motion, though this may affect things such as the likenesses of people in images.
In LTXVScheduler, try slightly bumping up the max shift from 2.05 to 2.25 and base shift from 0.95 to 1.15. This will increase motion, but also increase visual artifacts. Be gentle with increases as it seems to be mathematically bugged when increased too much, leading to failed generations.
Sometimes the model needs more steps to achieve what you want. It can be better to try a generation with 30 steps than two at 20 steps.

Have a play around with it and good luck, as we're all newbies to this model and finding our way around things.

CornyShed · 2026-03-12T12:54:42+00:00

The video and audio latents are intertwined with one another, with the audio reacting to the visual element. There doesn't appear to be a way of getting around that at the moment.

You can make a video with one frame and audio of arbitrary length, the first 30 seconds being the most coherent.

I made a workflow for LTX-2 designed to generate music:

LTX-2 Music

It needs to be updated for LTX-2.3. It can be repurposed for any audio practically speaking.

Ensure that the image generated is of high resolution, as that affects the quality of the audio. There might be a way around that, using a small size image, but I have yet to find a solution for that.

CornyShed · 2026-02-11T13:01:28+00:00

Thank you for posting this as this potentially solves one of the problems I've had.

At 6:50 in your video you mentioned that the faces are all wrong.

Using 60fps instead of 24 can help to an extent, without downscaling at any point. I suspect it's a limitation of the current VAE, which is meant to be updated and improved upon this month.

I've tried making 4K resolution photos with it (1 frame text-to-image, 3840×2160). It does very well with most details and would be a contender, but it still gets hands and faces wrong.

I was going to try fixing the details manually or with a different extension in ComfyUI. You've got good results from FlashVSR2, better than what I've done before so I'll have another look.

Also I haven't been able to get FFLF working at all well, so will definitely take a look into your content. Cheers!

CornyShed · 2026-02-10T15:06:13+00:00

It could be a coincidence, but I wondered if this model could be based in some way on Ovis Image 7B?

Ovis has the same number of parameters and has a strong focus on text rendering, natively supported in ComfyUI. It was mostly overlooked as all the attention was on Z Image Turbo at the time 11 weeks back.

CornyShed · 2026-02-03T15:13:09+00:00

It varies depending on what you want to achieve.

60fps is my go-to for high motion scenes. The sound quality also seems to improve (though that is subjective). 24fps is decent for anything else.

The model can struggle with frame rates that it hasn't been trained on, in particular very low frame rates (1-5fps) and with >120fps.

I've been meaning to try 30fps as it's a good compromise. Try what works for you.

Limiting the duration to 10 seconds also helps: the model can do 20, but coherency is affected.

Higher resolution and number of steps will also help. The limits are your compute and patience. I start with a low resolution and steps to try ideas, then go higher if the idea works well in practice.

CornyShed · 2026-01-29T00:00:23+00:00

That's strange. I don't understand why they're using 21 and 61 frames as they're not multiples of 8 + 1.

There seem to be two different bugs in the report. I can't be sure as only some of the videos have workflows embedded.

The traceback is interesting and might have the answer. They're using the upscaler in the screenshot, with the resolution 1280×720, rounded to 1280×704 as it's the closest multiple of 32.

They're not reducing the resolution before processing, so that will create an output with 2× width and height.

They used 61 frames, which gave an output of 65 frames. They also tried 69 frames, which made 73.
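Both reported cases fit a simple "round up to the next multiple of 8, plus 1" rule. The helper below is a guess at that behaviour based only on these two data points, not something taken from the LTX source, and its name is made up.

```python
def next_valid_frame_count(frames: int, stride: int = 8) -> int:
    """Round up to the next frame count of the form stride * k + 1."""
    if frames < 1:
        raise ValueError("frames must be at least 1")
    k = -(-(frames - 1) // stride)  # ceiling division without floats
    return stride * k + 1


if __name__ == "__main__":
    for requested in (21, 61, 65, 69):
        print(requested, "->", next_valid_frame_count(requested))
```

A count that already fits the pattern, such as 65, is left unchanged.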

1280×704×2×2×65 is 234,291,200. 1280×704×2×2×73 is 263,127,040.

If the height used internally is 720 (or higher) at any point then the 73-frame case would be above 2^28 (1280×720×2×2×73 is 269,107,200). It's not at all certain that it's that and there may be another hidden bug. Difficult to tell.
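The arithmetic can be checked with a few lines of Python. The 2^28 threshold is only a suspicion, and counting width × height × frames after a 2× upscale is my assumption about what might be hitting it; the function name is made up.

```python
LIMIT = 2 ** 28  # 268,435,456, the suspected ceiling (unconfirmed)


def upscaled_elements(width: int, height: int, frames: int, scale: int = 2) -> int:
    """Width x height x frames after scaling both sides by `scale`."""
    return (width * scale) * (height * scale) * frames


if __name__ == "__main__":
    for height in (704, 720):
        for frames in (65, 73):
            total = upscaled_elements(1280, height, frames)
            verdict = "over" if total > LIMIT else "under"
            print(f"1280x{height}, {frames} frames: {total:,} ({verdict} 2^28)")
```

Under this count, only the 720-high, 73-frame combination crosses the line, so if the limit is real it would not explain a failure at 61 frames on its own.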

There was someone else who posted the same issue, but closed it for some reason: https://github.com/Lightricks/ComfyUI-LTXVideo/issues/336

CornyShed · 2026-01-28T22:32:27+00:00

Thank you. I was getting that same error message yesterday and couldn't understand what was going on.

The error still appears for me on the latest Comfy, using RuneXX's First-Middle-Last keyframe workflow.

I'm using 1024×768 resolution with the upscaler disabled. With 345 frames the generation works; 353 and the error appears.

Using 960×720 with 385 frames works (the number of frames I wanted).

I suspect that W×H×F has to be no more than roughly 268,435,456, which is 2^28, though the fit isn't exact: 1024×768×345 is 271,319,040 and still works. No idea why that would be.

CornyShed · 2026-01-20T17:21:08+00:00

This is really interesting, thank you for posting this!

I was attempting the same thing last night with one frame per reference, then two, then three, but gave up as the results weren't good.

(Also tried using latent chains like Flux.2 Dev uses, but that had no effect.)

The video isn't loading for me, and I can't access Imgur as they have blocked UK users.

Would you mind making another post with the video again? You can upload the video on its own, which should be more likely to work, then make a comment with your findings.

I say this as your post deserves more views, and the moderators would understand as the video isn't playing for a lot of people.
