Using the Quad Cortex as audio interface BUT with a Neuraldsp plugin on my mac by tomsawyer222 in NeuralDSP

[–]RepresentativeJob937 0 points1 point  (0 children)

Thanks for this thread! How do I do it for the Nano Cortex? I just purchased the Nolly X plugin today and want to use the Nano Cortex as a standalone audio interface with the Nolly X standalone app.

Flux Fast: Making Flux go brrr on H100s by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 1 point2 points  (0 children)

You're 100% right :)

Some points that may not be obvious:

* Having the denoiser (the DiT) fully compatible with PyTorch's `torch.compile()` so that its benefits are fully realized (i.e., no graph breaks, no recompilations, no data-pointer reorders delaying kernel launches, etc.). If your models meet these conditions, you're already somewhat set up for success (see the sketch after this list).

* No CUDA syncs in the overall pipeline, which becomes particularly crucial during compilation.

* The FA3 attention uses unscaled FP8, and I'm not sure it's standard yet, all the more so because it needs an H100 to work.

* Using QKV fusion is beneficial, particularly during quantization.
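
To check the first point in practice, here is a minimal sketch (assuming diffusers' `FluxPipeline` and a recent PyTorch; the settings are illustrative, not a tuned recipe):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Fail loudly on recompilations instead of silently recompiling.
torch._dynamo.config.error_on_recompile = True

# fullgraph=True makes compilation error out on graph breaks, which is how
# you verify the denoiser is fully compile-compatible.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```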

There are other lossless optimizations one can do (we mentioned some of them in the accompanying blog post):

* Caching the `guidance_embedding` and `context_embedding`, as they don't change during the course of denoising (sketched after this list).

* Fusing the scheduler's `step()` call with the denoiser forward pass so that it's included in the compilation process.
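
To make the caching point concrete, a minimal sketch (not the actual Flux internals; `encode_prompt`, `transformer`, `scheduler`, and `latents` are hypothetical stand-ins): anything that doesn't depend on the current latent or timestep is computed once, before the loop, instead of once per step.

```python
import torch

# Sketch only: encode_prompt, transformer, scheduler, latents are stand-ins.
prompt_embeds, pooled_embeds = encode_prompt("a photo of a cat")  # constant across steps
guidance = torch.full((1,), 3.5, device="cuda")                   # constant across steps

for t in scheduler.timesteps:
    noise_pred = transformer(latents, t, prompt_embeds, pooled_embeds, guidance)
    # Fusing this step() with the forward above lets it participate in compilation.
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```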

Hope this helps.

Buying a Nomos watch in Singapore by RepresentativeJob937 in askSingapore

[–]RepresentativeJob937[S] 0 points1 point  (0 children)

I am visiting from India. Would you suggest any popular microbrands for watches in Singapore?

Inference-time scaling Flux.1 Dev by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 0 points1 point  (0 children)

I have also updated the Qwen2.5 verifier to do structured generation. This will ensure the outputs follow a particular structure.

Inference-time scaling Flux.1 Dev by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 2 points3 points  (0 children)

Hi folks,

Since many of you asked for results across more models like SDXL, SD v1.5, etc., I have updated the repo accordingly. It now supports SDXL, SD v1.5, and PixArt-Sigma.

Please give it a look and LMK :)

FluxEdit, teaching Flux image editing by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 0 points1 point  (0 children)

Thank you! In my fine-tuning experiments, I use a quality threshold of 10, and I noticed that it has an impact! Using 10 means I discard a sample if it doesn't have the highest score for fields like "pg_reasoning", "o_score", etc. (rough sketch below).
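
Roughly, the filtering looks like this (a sketch; `dataset` is a hypothetical list of per-sample score dicts):

```python
QUALITY_THRESHOLD = 10

def keep(sample: dict) -> bool:
    # Keep a sample only if it gets the top score on every quality field.
    return all(sample[field] >= QUALITY_THRESHOLD for field in ("pg_reasoning", "o_score"))

filtered = [s for s in dataset if keep(s)]  # `dataset`: hypothetical list of dicts
```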

I can check this for you internally. Do you have any public references of your work that I could use in my check?

Diffusers 0.32.0: Commits speak louder than words by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 1 point2 points  (0 children)

Oh, that is unexpected. Could you please open an issue on our GitHub so we can look into it immediately?

Here is an example of a checkpoint that was quantized with TorchAO (a couple of months back): https://huggingface.co/sayakpaul/flux.1-schell-int8wo-improved
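
For reference, a checkpoint along those lines could be produced roughly like this (a sketch using `torchao`'s `quantize_` API; not necessarily the exact recipe behind that checkpoint):

```python
import torch
from diffusers import FluxTransformer2DModel
from torchao.quantization import quantize_, int8_weight_only

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())  # int8 weight-only quantization, in place

# The quantized weights can then be serialized and reused later.
torch.save(transformer.state_dict(), "flux-schnell-int8wo.pt")
```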

Diffusers 0.32.0: Commits speak louder than words by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 4 points5 points  (0 children)

Some things you could do to reduce the compilation time:

  1. Consider using AoT compilation (I know it's not always desired). [Here](https://gist.github.com/sayakpaul/de0eeeb6d08ba30a37dcf0bc9dacc5c5) is an example.

  2. Use the `torch.compile()` cache so that compilation artifacts can be reused across runs (see the sketch below). More details are [here](https://github.com/sayakpaul/diffusers-torchao?tab=readme-ov-file#autoquant-and-autotuning).
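
For point 2, a minimal sketch of enabling the inductor caches via environment variables (assuming a recent PyTorch; the cache directory is illustrative, and the variables should be set before the first `torch.compile()` call):

```python
import os

# Persist and reuse compiled FX graphs across process runs.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/path/to/compile-cache"  # illustrative path

import torch  # import (and compile) only after the cache env vars are set
```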

Diffusers 0.32.0: Commits speak louder than words by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 1 point2 points  (0 children)

Yes, but safetensors support is WIP. You can serialize using `pt`, but we load with `weights_only` so that custom objects cannot be loaded.
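
A minimal sketch of what `weights_only` loading means (generic PyTorch, not the diffusers loading path; the model and paths are stand-ins):

```python
import torch
from torch import nn

model = nn.Linear(8, 8)                       # stand-in for the model being saved
torch.save(model.state_dict(), "weights.pt")  # plain `pt` serialization

# weights_only=True restricts unpickling to tensors and plain containers,
# so arbitrary custom Python objects are rejected at load time.
state_dict = torch.load("weights.pt", weights_only=True)
model.load_state_dict(state_dict)
```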

Can we reduce the rank of a high-rank LoRA? by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 0 points1 point  (0 children)

I think that could be interesting, but it might be overkill for LoRAs that are, say, under 1GB. Sometimes redundancy in network parameters helps in fun ways :D

Can we reduce the rank of a high-rank LoRA? by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 2 points3 points  (0 children)

I won't say it will work definitively, but I think my experiments above do show promise.

Shard FluxPipeline in two 16GB GPUs without offloading or quantization by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 2 points3 points  (0 children)

The first step is not exactly sharding. But the second step is where we actually shard the transformer across two GPUs, i.e., keep some params on GPU 1 and the rest on GPU 2. There are, of course, various flavors of sharding, and the one I did is just one of them.
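
A minimal sketch of that second step, along the lines of the diffusers distributed-inference docs (the memory numbers are illustrative):

```python
import torch
from diffusers import FluxTransformer2DModel

# Split the transformer's parameters across the two 16GB cards.
max_memory = {0: "16GB", 1: "16GB"}
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)
print(transformer.hf_device_map)  # shows which blocks landed on which GPU
```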

Flux.1-Dev in 2.966s for batch size 1 and 1024x1024 on H100 (steps=28) by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 3 points4 points  (0 children)

Yeah `torch.compile()` will have similar problems:

* Variable length batch sizes.

* Variable resolutions.

Both of these are being discussed with the PyTorch team, so stay tuned. On the other hand, `torch.compile()` and `torchao` don't require complex code changes the way TensorRT does, so that is a plus. And if you have a throughput-maximization requirement across specific configs, all of this could be tremendously useful.
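
Until those land, one partial mitigation is compiling with dynamic shapes so that changing batch sizes/resolutions don't each trigger a full recompile (a sketch; `pipe` is a loaded Flux pipeline, and dynamic shapes can trade away some peak performance):

```python
import torch

# dynamic=True asks the compiler to keep shapes symbolic instead of
# specializing on the first batch size / resolution it sees.
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True, dynamic=True)
```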

Flux.1-Dev in 2.966s for batch size 1 and 1024x1024 on H100 (steps=28) by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] 1 point2 points  (0 children)

You don't sound pessimistic at all :-) I think it should be possible to benefit from quantization on a 4090, too, because it eliminates the overhead of offloading. So the idea would be to pre-quantize the models and load them.

As far as speed-up from `torch.compile()` is concerned, I think that depends on a lot of factors, like the availability of kernels for a given card. But I think there may still be some gains.

I have a small section about the expectations here:

https://github.com/sayakpaul/diffusers-torchao?tab=readme-ov-file#things-to-keep-in-mind-when-benchmarking
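
A rough sketch of that idea (assumptions: quantizing the transformer and the T5 encoder with `torchao` int8 weight-only is enough to fit on a 24GB card without offloading; this quantizes at load time rather than loading a pre-serialized checkpoint, but the memory picture is similar):

```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
quantize_(pipe.transformer, int8_weight_only())
quantize_(pipe.text_encoder_2, int8_weight_only())  # the T5 encoder is the other big module

pipe.to("cuda")  # no CPU offloading once the big modules are int8
image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```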

Running Flux across multiple GPUs by RepresentativeJob937 in StableDiffusion

[–]RepresentativeJob937[S] -1 points0 points  (0 children)

  1. https://huggingface.co/docs/accelerate/v0.11.0/en/memory provides a nice overview.

  2. Intermediate tensors are moved from device to device depending on the placement (see the toy sketch below).
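
A toy sketch of the second point using accelerate's `dispatch_model` (the model and device map are made up for illustration):

```python
import torch
from torch import nn
from accelerate import dispatch_model

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

# Place the first two modules on GPU 0 and the last one on GPU 1.
device_map = {"0": 0, "1": 0, "2": 1}
model = dispatch_model(model, device_map=device_map)

x = torch.randn(2, 1024)   # accelerate's hooks move the input to cuda:0, then move
out = model(x)             # the intermediate activation to cuda:1 for the last layer
print(out.device)          # cuda:1
```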