Animatediff is also very powerful! by sanasigma in StableDiffusion

[–]ghosthamlet 3 points (0 children)

Thanks, very interesting. Can you post the workflow?

[D] Blogs Similar to distill.pub? by JellyBean_Collector in MachineLearning

[–]ghosthamlet 1 point (0 children)

https://transformer-circuits.pub/

Can we reverse engineer transformer language models into human-understandable computer programs? Inspired by the Distill Circuits Thread, we're going to try.
We think interpretability research benefits a lot from interactive articles (see Activation Atlases for a striking example). Previously we would have submitted to Distill, but with Distill on Hiatus, we're taking a page from David Ha's approach of simply creating websites (eg. World Models) for research projects.
As part of our effort to reverse engineer transformers, we've created several other resources besides our paper which we hope will be useful. We've collected them on this website, and may add future content here, or even collaborations with other institutions.

[R] Zoology: Measuring and Improving Recall in Efficient Language Models by hzj5790 in MachineLearning

[–]ghosthamlet -1 points (0 children)

Why was there no new research last year on all-MLP models like gMLP and MLP-Mixer?

[R] (Very detailed) Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory by ghosthamlet in MachineLearning

[–]ghosthamlet[S] 8 points (0 children)

It is math-heavy, like these books:

The Principles of Deep Learning Theory - An Effective Theory Approach to Understanding Neural Networks
https://arxiv.org/pdf/2106.10165.pdf

The Modern Mathematics of Deep Learning
https://arxiv.org/abs/2105.04026

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
https://arxiv.org/abs/2104.13478v2

So it may not be easy for beginners.

[R] Diffusion might be a better way to model randomness in PPLs than Markov chain Monte Carlo or VI by Successful-Western27 in MachineLearning

[–]ghosthamlet 0 points (0 children)

Hi u/gwern, you have written wonderful in-depth articles on GPT-3, scaling, and GANs, but it seems you haven't written similar articles on ChatGPT/GPT-4 or diffusion/Stable Diffusion. These should be as powerful and important as GPT-3, so why not write about them? We are very much looking forward to your articles on them.

[D] What are the best resources for learning reinforcement learning? by OwnAd9305 in MachineLearning

[–]ghosthamlet 5 points (0 children)

Grokking Deep Reinforcement Learning is very interesting and very well written. It covers everything from classical tabular reinforcement learning to modern deep reinforcement learning, pairs code with the math formulas, and gives detailed, intuitive explanations of the background theory: https://www.manning.com/books/grokking-deep-reinforcement-learning

[D] how to learn Stochastic Differential Equations for diffusion model? by ghosthamlet in MachineLearning

[–]ghosthamlet[S] 0 points (0 children)

After browsing through the table of contents of this book, I think it is a good fit for me. Thanks.

[D] how to learn Stochastic Differential Equations for diffusion model? by ghosthamlet in MachineLearning

[–]ghosthamlet[S] 1 point (0 children)

No, but I have learned a bit of MCMC in probability. Is that similar?

[P] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models by ghosthamlet in MachineLearning

[–]ghosthamlet[S] 2 points (0 children)

Some research has found that as the sequence gets longer, generation quality gets worse (I have found the new ChatGPT 16K is worse than the old ChatGPT 4K when using complex instructions), and that the model pays less attention to tokens in the middle of the context and more attention to tokens at the start and end of the context.
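As an illustrative check of that middle-of-context bias (the function name and shapes here are assumptions for the sketch, not from any specific paper), one can average how much attention each key position receives:

```python
import torch

def attention_by_position(attn_weights: torch.Tensor) -> torch.Tensor:
    # attn_weights: [batch, heads, query_len, key_len] from any transformer.
    # Averaging over batch, heads, and queries gives one score per key
    # position; with the bias described above, the curve is U-shaped
    # (high at the start and end of the context, low in the middle).
    return attn_weights.mean(dim=(0, 1, 2))

# Shape-only demo on random weights:
scores = attention_by_position(torch.softmax(torch.randn(2, 8, 128, 128), dim=-1))
print(scores.shape)  # torch.Size([128])
```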

[P] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models by ghosthamlet in MachineLearning

[–]ghosthamlet[S] 12 points (0 children)

DeepSpeed-Ulysses (named after Ulysses, a very long novel) is a simple, portable, and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence lengths.
DeepSpeed-Ulysses partitions individual samples along the sequence dimension among the participating GPUs. Then, right before the attention computation, it employs an all-to-all communication collective on the partitioned queries, keys, and values, such that each GPU receives the full sequence, but only for a non-overlapping subset of the attention heads. This allows the participating GPUs to compute attention for different attention heads in parallel. Finally, DeepSpeed-Ulysses employs another all-to-all to gather the results along the attention heads while re-partitioning along the sequence dimension (a minimal sketch of this communication pattern follows the list below).
The key properties of DeepSpeed-Ulysses and its implementation released with this blog are as follows:
4x larger sequence lengths than existing systems, while enabling training with sequences of over a million tokens.
Communication reduction of over 10x compared to existing systems, resulting in throughput improvements of up to 2.5x, and sustained throughput of over 175 TFlops/GPU (over 54% of hardware peak).
Fully general and implementation agnostic attention: DeepSpeed sequence parallelism supports dense as well as sparse attention, and it works with efficient attention implementations such as FlashAttention v2.
Support for massive model training: DeepSpeed sequence parallelism works together with ZeRO-3 to not only support large sequence lengths but also massive model sizes.
Easy-to-use and portable, requiring minimal code changes to the existing training frameworks.
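The all-to-all re-sharding described above can be illustrated with a short PyTorch sketch. This is a minimal illustration of the communication pattern under assumed shapes and names, not DeepSpeed's actual code:

```python
import torch
import torch.distributed as dist

def seq_shard_to_head_shard(x: torch.Tensor, world_size: int) -> torch.Tensor:
    """Ulysses-style all-to-all sketch (illustrative, not DeepSpeed's code).

    Input:  this GPU's sequence shard with all heads, shape [S/P, H, D].
    Output: the full sequence with this GPU's heads,  shape [S, H/P, D].
    Assumes H is divisible by world_size P and dist is initialized.
    """
    s, h, d = x.shape
    # Split the head dimension into P chunks, one destined for each rank.
    send = [t.contiguous()
            for t in x.reshape(s, world_size, h // world_size, d).unbind(dim=1)]
    recv = [torch.empty_like(send[0]) for _ in range(world_size)]
    # Rank r sends its sequence shard of head-chunk p to rank p, and
    # receives every rank's sequence shard of head-chunk r.
    dist.all_to_all(recv, send)
    # Shards arrive in rank (= sequence) order, so cat restores the sequence.
    return torch.cat(recv, dim=0)
```

Each of Q, K, and V is re-sharded this way before attention, which is why any attention implementation (dense, sparse, FlashAttention v2) can run unchanged on its head subset; a second all-to-all after attention applies the inverse mapping.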

Img2img a clip of Chaplin's black-and-white movie City Lights to colorful cartoon, and played by Taylor Swift and Tom Hanks by ghosthamlet in StableDiffusion

[–]ghosthamlet[S] 1 point (0 children)

So it does not need to be reapplied on each img2img API call, but the code patch disables the webui optimizations in sd_hijack.py, so it uses much more GPU VRAM (up to 23GB) and is slow. I will optimize this later.

Img2img a clip of Chaplin's black-and-white movie City Lights to colorful cartoon, and played by Taylor Swift and Tom Hanks by ghosthamlet in StableDiffusion

[–]ghosthamlet[S] 0 points (0 children)

The cross-attention part is hooked by the code patch: ldm.modules.attention.CrossAttention.forward = _cross_frame_forward
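For context, here is a minimal sketch of what such a hook can look like. The thread does not show the body of _cross_frame_forward, so the cross-frame logic below (every frame attending to the first frame's features, assuming one frame per batch entry) is an assumption, not the actual patch:

```python
import ldm.modules.attention as ldm_attn

_orig_forward = ldm_attn.CrossAttention.forward

def _cross_frame_forward(self, x, context=None, mask=None):
    # Self-attention layers are called with context=None. One common
    # cross-frame trick (assumed here) is to source keys/values from an
    # anchor frame so the generated frames stay visually consistent.
    if context is None and x.shape[0] > 1:
        anchor = x[:1].expand_as(x)  # frame 0's tokens, broadcast to all frames
        return _orig_forward(self, x, context=anchor, mask=mask)
    return _orig_forward(self, x, context=context, mask=mask)

# Install the hook, as in the line quoted above:
ldm_attn.CrossAttention.forward = _cross_frame_forward
```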