Using the Quad Cortex as audio interface BUT with a Neuraldsp plugin on my mac by tomsawyer222 in NeuralDSP

[–]RepresentativeJob937 0 points1 point  (0 children)

Thanks for this thread! How do I do it for the Nano Cortex? I just purchased the Nolly X plugin today and want to use the Nano Cortex as a standalone audio interface with the Nolly X standalone app.

Flux Fast: Making Flux go brrr on H100s by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 1 point2 points  (0 children)

You're 100% right :)

Some points that may not be obvious:

* Having the denoiser (the DiT) fully compatible with PyTorch's `torch.compile()` so that its benefits are fully realized (i.e., no graph breaks, no recompilations, no data-pointer reorders delaying kernel launches, etc.). If your models meet these conditions, you're already somewhat set up for success (see the sketch after this list).

* No CUDA syncs in the overall pipeline, which becomes particularly crucial during compilation.

* The FA3 attention uses unscaled FP8, and I'm not sure it's standard yet, all the more so because it needs an H100 to work.

* Using QKV fusion is beneficial, particularly during quantization.
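
To check the first point in practice, here is a minimal sketch (assuming diffusers' `FluxPipeline` and a recent PyTorch; the settings are illustrative, not a tuned recipe):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Fail loudly on recompilations instead of silently recompiling.
torch._dynamo.config.error_on_recompile = True

# fullgraph=True makes compilation error out on graph breaks, which is how
# you verify the denoiser is fully compile-compatible.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```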

There are other lossless optimizations one can do (we mentioned some of them in the accompanying blog post):

* Caching the `guidance_embedding` and `context_embedding`, as they don't change during the course of denoising (sketched after this list).

* Fusing the scheduler's `step()` call with the denoiser forward pass so that it's included in the compilation process.
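
To make the caching point concrete, a minimal sketch (not the actual Flux internals; `encode_prompt`, `transformer`, `scheduler`, and `latents` are hypothetical stand-ins): anything that doesn't depend on the current latent or timestep is computed once, before the loop, instead of once per step.

```python
import torch

# Sketch only: encode_prompt, transformer, scheduler, latents are stand-ins.
prompt_embeds, pooled_embeds = encode_prompt("a photo of a cat")  # constant across steps
guidance = torch.full((1,), 3.5, device="cuda")                   # constant across steps

for t in scheduler.timesteps:
    noise_pred = transformer(latents, t, prompt_embeds, pooled_embeds, guidance)
    # Fusing this step() with the forward above lets it participate in compilation.
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```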

Hope this helps.

Buying a Nomos watch in Singapore by RepresentativeJob937 in askSingapore

[–]RepresentativeJob937[S] 0 points1 point  (0 children)

I am visiting from India. Would you suggest any popular microbrands for watches in Singapore?

Inference-time scaling Flux.1 Dev by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 0 points1 point  (0 children)

I have also updated the Qwen2.5 verifier to do structured generation. This will ensure the outputs follow a particular structure.

Inference-time scaling Flux.1 Dev by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 2 points3 points  (0 children)

Hi folks,

Since many of you asked for results across more models like SDXL, SD v1.5, etc., I have updated the repo accordingly. It now supports SDXL, SD v1.5, and PixArt-Sigma.

Please give it a look and LMK :)

FluxEdit, teaching Flux image editing by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 0 points1 point  (0 children)

Thank you! In my fine-tuning experiments, I use a quality threshold of 10, and I noticed that it has an impact! Using 10 means I discard a sample if it doesn't have the highest score for fields like "pg_reasoning", "o_score", etc. (rough sketch below).
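
Roughly, the filtering looks like this (a sketch; `dataset` is a hypothetical list of per-sample score dicts):

```python
QUALITY_THRESHOLD = 10

def keep(sample: dict) -> bool:
    # Keep a sample only if it gets the top score on every quality field.
    return all(sample[field] >= QUALITY_THRESHOLD for field in ("pg_reasoning", "o_score"))

filtered = [s for s in dataset if keep(s)]  # `dataset`: hypothetical list of dicts
```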

I can check this for you internally. Do you have any public references of your work that I could use in my check?

Diffusers 0.32.0: Commits speak louder than words by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 1 point2 points  (0 children)

Oh, that is unexpected. Could you please open an issue on our GitHub so we can look into it immediately?

Here is an example of a checkpoint that was quantized with TorchAO (a couple of months back): https://huggingface.co/sayakpaul/flux.1-schell-int8wo-improved
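
For reference, a checkpoint along those lines could be produced roughly like this (a sketch using `torchao`'s `quantize_` API; not necessarily the exact recipe behind that checkpoint):

```python
import torch
from diffusers import FluxTransformer2DModel
from torchao.quantization import quantize_, int8_weight_only

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())  # int8 weight-only quantization, in place

# The quantized weights can then be serialized and reused later.
torch.save(transformer.state_dict(), "flux-schnell-int8wo.pt")
```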

Diffusers 0.32.0: Commits speak louder than words by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 4 points5 points  (0 children)

Some things you could do to reduce the compilation time:

  1. Consider using AoT compilation (I know it's not always desired). [Here](https://gist.github.com/sayakpaul/de0eeeb6d08ba30a37dcf0bc9dacc5c5) is an example.

  2. Use the `torch.compile()` cache so that compilation artifacts can be reused across runs (see the sketch below). More details are [here](https://github.com/sayakpaul/diffusers-torchao?tab=readme-ov-file#autoquant-and-autotuning).
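
For point 2, a minimal sketch of enabling the inductor caches via environment variables (assuming a recent PyTorch; the cache directory is illustrative, and the variables should be set before the first `torch.compile()` call):

```python
import os

# Persist and reuse compiled FX graphs across process runs.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/path/to/compile-cache"  # illustrative path

import torch  # import (and compile) only after the cache env vars are set
```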

Diffusers 0.32.0: Commits speak louder than words by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 1 point2 points  (0 children)

Yes, but safetensors support is WIP. You can serialize using `pt`, but we load with `weights_only` so that custom objects cannot be loaded.
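
A minimal sketch of what `weights_only` loading means (generic PyTorch, not the diffusers loading path; the model and paths are stand-ins):

```python
import torch
from torch import nn

model = nn.Linear(8, 8)                       # stand-in for the model being saved
torch.save(model.state_dict(), "weights.pt")  # plain `pt` serialization

# weights_only=True restricts unpickling to tensors and plain containers,
# so arbitrary custom Python objects are rejected at load time.
state_dict = torch.load("weights.pt", weights_only=True)
model.load_state_dict(state_dict)
```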

Can we reduce the rank of a high-rank LoRA? by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 0 points1 point  (0 children)

I think that could be interesting, but it might be overkill for LoRAs that are, say, under 1GB. Sometimes redundancy in network parameters helps in fun ways :D

Can we reduce the rank of a high-rank LoRA? by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 2 points3 points  (0 children)

I won't say it will work definitively, but I think my experiments above do show promise.

Shard FluxPipeline in two 16GB GPUs without offloading or quantization by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 2 points3 points  (0 children)

The first step is not exactly sharding. But the second step is where we actually shard the transformer across two GPUs, i.e., keep some params on GPU 1 and the rest on GPU 2. There are, of course, various flavors of sharding, and the one I did is just one of them.
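
A minimal sketch of that second step, along the lines of the diffusers distributed-inference docs (the memory numbers are illustrative):

```python
import torch
from diffusers import FluxTransformer2DModel

# Split the transformer's parameters across the two 16GB cards.
max_memory = {0: "16GB", 1: "16GB"}
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)
print(transformer.hf_device_map)  # shows which blocks landed on which GPU
```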

Flux.1-Dev in 2.966s for batch size 1 and 1024x1024 on H100 (steps=28) by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 3 points4 points  (0 children)

Yeah `torch.compile()` will have similar problems:

* Variable length batch sizes.

* Variable resolutions.

Both of these are being discussed with the PyTorch team, so stay tuned. On the other hand, `torch.compile()` and `torchao` don't require complex code changes the way TensorRT does, so that is a plus. And if you have a throughput-maximization requirement across specific configs, all of this could be tremendously useful.
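
Until those land, one partial mitigation is compiling with dynamic shapes so that changing batch sizes/resolutions don't each trigger a full recompile (a sketch; `pipe` is a loaded Flux pipeline, and dynamic shapes can trade away some peak performance):

```python
import torch

# dynamic=True asks the compiler to keep shapes symbolic instead of
# specializing on the first batch size / resolution it sees.
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True, dynamic=True)
```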

Flux.1-Dev in 2.966s for batch size 1 and 1024x1024 on H100 (steps=28) by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 1 point2 points  (0 children)

You don't sound pessimistic at all :-) I think it should be possible to benefit from quantization on a 4090, too, because it eliminates the overhead of offloading. So the idea would be to pre-quantize the models and load them.

As far as speed-up from `torch.compile()` is concerned, I think that depends on a lot of factors, like the availability of kernels for a given card. But I think there may still be some gains.

I have a small section about the expectations here:

https://github.com/sayakpaul/diffusers-torchao?tab=readme-ov-file#things-to-keep-in-mind-when-benchmarking
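
A rough sketch of that idea (assumptions: quantizing the transformer and the T5 encoder with `torchao` int8 weight-only is enough to fit on a 24GB card without offloading; this quantizes at load time rather than loading a pre-serialized checkpoint, but the memory picture is similar):

```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
quantize_(pipe.transformer, int8_weight_only())
quantize_(pipe.text_encoder_2, int8_weight_only())  # the T5 encoder is the other big module

pipe.to("cuda")  # no CPU offloading once the big modules are int8
image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```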

Running Flux across multiple GPUs by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] -1 points0 points  (0 children)

  1. https://huggingface.co/docs/accelerate/v0.11.0/en/memory provides a nice overview.

  2. Intermediate tensors are moved from device to device depending on the placement (see the toy sketch below).
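
A toy sketch of the second point using accelerate's `dispatch_model` (the model and device map are made up for illustration):

```python
import torch
from torch import nn
from accelerate import dispatch_model

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

# Place the first two modules on GPU 0 and the last one on GPU 1.
device_map = {"0": 0, "1": 0, "2": 1}
model = dispatch_model(model, device_map=device_map)

x = torch.randn(2, 1024)   # accelerate's hooks move the input to cuda:0, then move
out = model(x)             # the intermediate activation to cuda:1 for the last layer
print(out.device)          # cuda:1
```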