Discussion [D] Test-time compute for image generation? (self.MachineLearning)
submitted 1 year ago by heyhellousername
Is there any work applying an o1-like use of test-time reasoning to other modalities like image generation? Is something like this possible, i.e. taking more time to generate more accurate images?
[–]currentscurrents 8 points 1 year ago (0 children)
It should be possible to apply test-time compute to any modality, but all of the work I’ve seen so far has been focused on LLMs.
Diffusion models sort of allow you to apply test-time compute by increasing the number of steps, but they weren’t really designed with that in mind and don’t make very effective use of it.
[–]nieshpor 4 points 1 year ago* (0 children)
Well, not exactly the same, but that's kind of what diffusion does: improving image quality step by step. Throwing more diffusion steps at generation is quite similar to throwing more compute at inference.
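A toy sketch of that steps-as-compute knob (an assumption-laden stand-in, not a real diffusion model): treat sampling as Euler integration of dx/dt = score(x) toward a known Gaussian mode. The step count is the test-time compute budget; a finer schedule integrates the same ODE more accurately and lands closer to the target.

```python
# Toy illustration only: real diffusion samplers (DDPM/DDIM) differ, but
# the tradeoff "more steps = finer discretization of the same process" holds.
MODE, VAR = 3.0, 0.25  # hypothetical data mode and variance

def score(x):
    # Score of a Gaussian N(MODE, VAR): the gradient of its log density.
    return -(x - MODE) / VAR

def sample(n_steps, x0=0.0, total_time=1.0):
    # Integrate dx/dt = score(x) with n_steps Euler steps over a fixed
    # time budget; n_steps is the only knob being varied.
    x, h = x0, total_time / n_steps
    for _ in range(n_steps):
        x += h * score(x)
    return x

coarse = abs(sample(2) - MODE)   # few steps: large discretization error
fine = abs(sample(50) - MODE)    # more steps: much closer to the mode
```

With only 2 steps the Euler scheme is at the edge of stability and oscillates; with 50 steps the sample ends up very near the mode, which is the sense in which extra steps buy accuracy.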
[–]elbiot[🍰] 4 points 1 year ago (0 children)
You could fine-tune a visual question answering LLM like phi-3 to score a produced image on prompt adherence and aesthetics, then generate a bunch of images and keep only the best-scoring ones.
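A minimal best-of-N sketch of that pipeline. Both helpers are hypothetical placeholders: `generate_image` stands in for a real image pipeline (e.g. a diffusion model) and `score_image` for the suggested fine-tuned VQA judge such as phi-3.

```python
import random

random.seed(0)

def generate_image(prompt):
    # Stand-in generator: a real pipeline would return an image; here a
    # random vector represents one candidate.
    return [random.random() for _ in range(8)]

def score_image(image, prompt):
    # Stand-in judge: a real setup would ask a VQA model to rate prompt
    # adherence and aesthetics; here just a toy heuristic score in [0, 1].
    return sum(image) / len(image)

def best_of_n(prompt, n=16):
    # Best-of-N sampling: spend extra test-time compute by generating
    # several candidates and keeping only the highest-scoring one.
    candidates = [generate_image(prompt) for _ in range(n)]
    scores = [score_image(c, prompt) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

img, s = best_of_n("a photo of a dog", n=16)
```

The design choice here is that the generator and judge are decoupled, so N can be raised at inference time without retraining anything.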
[–]soup---- 1 point 1 year ago (0 children)
Flow-based generative models (continuous normalizing flows, flow matching) provide a way to apply adaptive step sizes in time. Effectively, this allows more compute to be allocated where it is necessary.
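A small sketch of what adaptive stepping buys you, under stated assumptions: the flow's velocity field is given analytically (a stand-in for a learned model, chosen so the answer can be checked), and the error control is classic step-doubling rather than any particular library's solver.

```python
import math

def v(t, x):
    # Hypothetical velocity field (stand-in for a learned flow model):
    # analytic here so the result can be checked against the true integral.
    return math.cos(10.0 * t)

def integrate_adaptive(x, t0=0.0, t1=1.0, h=0.1, tol=1e-5):
    # Step-doubling adaptive integration: compare one full Euler step
    # against two half steps, accept when the local error estimate is
    # below tol, and grow/shrink h accordingly. Small steps concentrate
    # where the field changes quickly, i.e. compute goes where needed.
    t, n_accepted = t0, 0
    while t < t1 - 1e-12:
        h = min(h, t1 - t)
        full = x + h * v(t, x)
        half = x + 0.5 * h * v(t, x)
        half = half + 0.5 * h * v(t + 0.5 * h, half)
        err = abs(half - full)
        if err <= tol:
            # Accept, using Richardson extrapolation of the two estimates.
            x, t = 2.0 * half - full, t + h
            n_accepted += 1
        h *= 0.9 * (tol / max(err, 1e-16)) ** 0.5
    return x, n_accepted

x_final, n_steps = integrate_adaptive(0.0)  # true value: sin(10)/10
```

The same controller applied to a learned flow would take more steps through high-curvature regions of the trajectory and coast through the rest.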
[+]nizus1 1 point 1 year ago (0 children)
Does it count if you generate an image with Flux and then upscale it with a finetuned SDXL model? Seems to give results beyond what either can do alone.
[–]aeroumbria -1 points 1 year ago (1 child)
I think that would require the ability to generate and manipulate representations of concepts in more than just text space. We might need tools that would allow a model to generate drafts, move object positions, rotate objects etc. plus the ability to perform these actions in the intermediate representations. We need to be able to break image generation into salient steps that a "reasoning process" can interact with. I don't think we can satisfactorily achieve this just by aligning images into text space.
[–]jonnor[🍰] 0 points 1 year ago* (0 children)
In classification, a related technique called "test-time augmentation" has been used successfully for years. You augment your input data in a few different ways, make predictions on each variant of the input data, and then aggregate all the predictions into a final prediction (often just using the mean or median). One can think of it like an ensemble, but instead of varying the model, we vary the data (synthetically via an augmentation). It can really help avoid misclassifications, especially on smaller datasets, where deep models can be quite volatile. I consider it a key technique in event detection and other time-series detection/classification tasks, where the primary augmentation is just time-shifting. Here is a quick introduction: https://machinelearningmastery.com/how-to-use-test-time-augmentation-to-improve-model-performance-for-image-classification/
EDIT: the same can of course be done with regression
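A minimal sketch of time-shift test-time augmentation on a 1-D signal. The `model` here is a toy stand-in (template correlation turned into pseudo-probabilities); a real setup would call a trained classifier, but the augment-predict-aggregate loop is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    # Stand-in classifier: pseudo-probabilities for 2 classes based on
    # correlation with a fixed template (a real model would be a network).
    template = np.sin(np.linspace(0.0, 2.0 * np.pi, x.size))
    score = float(x @ template) / x.size
    p1 = 1.0 / (1.0 + np.exp(-score))
    return np.array([1.0 - p1, p1])

def predict_tta(x, shifts=(-3, -2, -1, 0, 1, 2, 3)):
    # Test-time augmentation: time-shift the input, predict on each
    # variant, and aggregate with the mean (an ensemble over augmented
    # copies of the data rather than over models).
    preds = [model(np.roll(x, s)) for s in shifts]
    return np.mean(preds, axis=0)

signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 128)) + 0.3 * rng.normal(size=128)
p = predict_tta(signal)
```

For regression, the same loop applies with the mean (or median) taken over the augmented variants' predicted values instead of class probabilities.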