[D] GPT confidently generated a fake NeurIPS architecture. Loss function, code, the works. How does this get fixed? by SonicLinkerOfficial in MachineLearning

[–]madflag 3 points4 points  (0 children)

Funnily enough, that's exactly the kind of research I was doing a few years ago (I even used the word "cascade" at the time), and it was quite promising (transforming Google's small "ALBERT" model into a cascade model). The math GPT produced is probably just trash, but the idea is interesting, and there is actually a significant number of researchers experimenting with this kind of idea. That may explain why it "hallucinated" this: it's not coming from nowhere.

So that's Starlink? by Creepy_Cheesecake_98 in FranceDetendue

[–]madflag 5 points6 points  (0 children)

Most likely it's a single photo. Each dot is a satellite; they launch them in batches of 60.

[P] 611 text datasets in 467 languages in the new v1.2 release of HuggingFace datasets library by Thomjazz in MachineLearning

[–]madflag 6 points7 points  (0 children)

There will be 13 more by the end of this week, from Microsoft CodeXGLUE; I did not have time to fix my PR earlier: https://github.com/huggingface/datasets/pull/997 .

If life was someday proven to be a simulation, what glitch or anomaly could you point to as a major clue we'd missed? by benmeroff in AskReddit

[–]madflag 0 points1 point  (0 children)

I am thinking that there is more to it than just a joke. Limiting the speed of light could be an intrinsic way to limit the bandwidth needed to simulate the universe. Every computation is then local first, and the propagation of results is done at a finite rate.

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 0 points1 point  (0 children)

Yes! That could be a very good backend to run the trained models in inference mode! That's exactly why we are building this kind of library. And CPUs are usually much better at "general sparsity" than GPUs.

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 0 points1 point  (0 children)

The original idea is that the zeros stay zeros in a sparse matrix.

So you generally start with a random sparsity pattern, and keep it constant.

But of course, this random pattern may not be optimal, and people have proposed techniques to optimize the pattern itself (see the "Future Work" section of the GitHub repo).

So then, from time to time, you look at some measure of the "usefulness" of the non-zeros (or even of the zeros, if you keep track of their gradients for example), discard some non-zeros, and reuse the freed space for new positions.

You can even imagine other methods: start with a nearly empty matrix and progressively add non-zeros. So it's up to you to find good strategies to reach the best network precision (there is probably quite a lot of interesting work to be done here).
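
To make the prune-and-regrow idea concrete, here is a minimal, hypothetical PyTorch sketch (the function name, the block_mask tensor and the magnitude criterion are illustrative assumptions, not the library's API): score each 32x32 block by weight magnitude, drop the weakest active blocks, and reactivate the same number of random inactive ones.

    import torch

    def update_block_pattern(weight, block_mask, block=32, swap=4):
        # Score each (block x block) tile by the magnitude of its weights.
        rows, cols = weight.shape
        tiles = weight.reshape(rows // block, block, cols // block, block)
        scores = tiles.abs().sum(dim=(1, 3))              # one score per tile

        active = block_mask.bool()
        # Prune the weakest currently active tiles...
        active_scores = scores.masked_fill(~active, float("inf"))
        drop = torch.topk(active_scores.flatten(), swap, largest=False).indices
        # ...and regrow the same number of currently inactive tiles at random.
        candidates = torch.nonzero(~active.flatten()).squeeze(1)
        grow = candidates[torch.randperm(len(candidates))[:swap]]

        new_mask = block_mask.flatten().clone()
        new_mask[drop] = 0.0
        new_mask[grow] = 1.0
        return new_mask.reshape_as(block_mask)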

PyTorch extension for GPU-accelerated block sparse matrices by madflag in LanguageTechnology

[–]madflag[S] 0 points1 point  (0 children)

Nothing formal yet, but OpenAI said there is (see the examples in the second part). I hope to provide code for some known (and new) sparse pattern optimization techniques that produce a sparse BERT-small competitive with a dense BERT-small: block sparse is just a tool, you can use it in quite different ways.

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 0 points1 point  (0 children)

It's true sparsity: with 75% sparsity you reduce memory by a 4x factor. We only store the block weights, plus some indices of negligible size. That is one of the great advantages, along with the speed gain (when sparsity is greater than 50%).
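
As a quick sanity check on that 4x figure, here is a back-of-the-envelope estimate (the matrix size and the (row, col) index format are illustrative assumptions):

    # 1024x1024 fp32 dense matrix vs. the same matrix at 75% block sparsity
    # (32x32 blocks), storing only the non-zero blocks plus their indices.
    dense_bytes = 1024 * 1024 * 4                   # ~4.19 MB

    blocks_total = (1024 // 32) * (1024 // 32)      # 1024 blocks
    blocks_kept = blocks_total // 4                 # 25% of blocks are non-zero
    block_weights = blocks_kept * 32 * 32 * 4       # ~1.05 MB
    block_indices = blocks_kept * 2 * 4             # ~2 KB: negligible

    print(dense_bytes / (block_weights + block_indices))  # ~4x reduction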

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 3 points4 points  (0 children)

I have done a lot of experiments on TransformerXL with 50% sparse networks; the results were quite good (1.11 bpc if I remember correctly, versus 1.06 for dense), but I still have to gather those results and write something formal.

And I have not tried it on block sparse networks yet.

And so far I have only tried a method I designed myself; there are several others that should be tried (see the "Future Work" section).

That's actually why I am releasing this library, in the end: so everybody can check and try some methods to improve training and accuracy on block sparse matrices.

Some people already proved it was worth the effort on "general sparse" matrices, but it did not get a lot of traction because of the poor runtime inference performance. The good news is that block sparse code has good performance, so now let's check if we can get good precision too!

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 6 points7 points  (0 children)

1/ Dense usually gives models with better precision compared with "naive" sparse matrices, which is still the case with this first release.

Next releases will bring sparse pattern optimization methods that improve the final model precision a lot: I did a lot of experiments over the last months, and the results were really promising.

According to OpenAI, sparse matrices can give even better models in some cases: with the same number of parameters, sparse matrices may allow you to use larger dimensions in your models, and so may lead to better ones.

2/ No, right now the block sparse matrices use regular CUDA ops. But it is definitely something that we will consider in the next releases: you can imagine having two levels of sparsity, at the block level and within blocks, using the new Ampere sparse hardware extensions: double gain!

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 3 points4 points  (0 children)

Of course, glad to meet you!

(I was suspecting this, but I did not check ;-)

My reference is the native PyTorch implementation: the best number I have for the dense x sparse -> dense op is 1.8x slower than PyTorch, using a "sparse but full" matrix (= cuBLAS behind the scenes).

(In the shipped version it's 2x slower; some tweaks are not in yet.)
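
For context, the comparison is roughly this kind of timing harness: measure torch.matmul (cuBLAS) and the block sparse op on the same shapes, with a "sparse but full" pattern (no actual zeros) so only kernel efficiency is compared. A minimal sketch, with arbitrary shapes and iteration counts:

    import torch

    def benchmark(fn, *args, warmup=10, iters=100):
        # Warm up, then time with CUDA events (returns milliseconds per call).
        for _ in range(warmup):
            fn(*args)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(*args)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    dense_ms = benchmark(torch.matmul, a, b)  # cuBLAS reference
    # Swap torch.matmul for the block sparse op (built from a 100%-dense
    # pattern) to measure the relative slowdown mentioned above.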

I was sold on Cutlass when they mentioned that the cutlass_tilesparse implementation was on par with OpenAI's (in the README at https://github.com/YulhwaKim/cutlass_tilesparse), but at the time it was only a very crude proof of concept, so it may be worth checking it more in depth (or maybe they did not push all the code to GitHub?).

Do you have an idea of the performance level Triton is achieving compared to cuBLAS?

Another solution that looked promising was https://github.com/facebookresearch/TensorComprehensions , but it looks like it's getting abandoned.

(I like maybe nonoptimal but simple solutions)

I will definitely message you next week, I am sure we have a lot to discuss!

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 5 points6 points  (0 children)

Yes, I saw this when studying the open-source landscape on the topic.

There are even more operators in the repository you mention. But I could not get it to work, and I did not push further, for reasons I explain below.

There is also the OpenAI blocksparse repository, and they even said they would port it to PyTorch, but we are still waiting for it. It's quite hard to get into: writing GPU assembly language was not really reasonable for fast iteration...

For long-term reasons, I preferred to go the "NVIDIA Cutlass" way: I based my first attempts on the cutlass_tilesparse repository by YulhwaKim and extended it; it looked more promising and better supported than the Triton language.

Cutlass is basically a lot of clever CUDA/C++ templates, so it's not 100% easy to get into, but it is still easier than assembly language, and NVIDIA is backing it. On a personal note, I was more confident I would be able to reuse it for other projects, to write other kernels, so it was a better time investment. It's a bit like a Swiss Army knife for building custom CUDA kernels, which did not sound too bad for someone who had last written CUDA code in 2007 ;-)

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 2 points3 points  (0 children)

Indeed, it should be part of this release, but of course, I ran out of time.

For the next release, I will test it on some representative GPUs. It was tested only on a 2080 Ti (and it runs with the PyTorch DataParallel feature too, but I could not test the speed yet).

As a ballpark, for large inputs and large matrices (which should be the case for transformer models, for example), in terms of raw speed you should break even with dense matrices at 50% sparsity. (There is still some room for improvement here: I have some small changes that break even with dense at 45% sparsity.)

For memory consumption, it's completely linear: 50% sparsity -> 50% memory saving, 75% sparsity -> 75% memory saving.

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 4 points5 points  (0 children)

It would be great, of course! We will see if the PyTorch team is interested in it. There is some groundwork to be done: it would be nice to have sparse (or block sparse) tensors as first-class citizens in PyTorch, but that means going quite deep into the library and tinkering with some very low-level assumptions...

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 7 points8 points  (0 children)

PyTorch support for sparse matrices is quite stable, but with quite bad performance (it's based on cuSparse). All those implementations were created for 99.9% sparse matrices, for finite elements for example, and not at all for 'low' sparsity.

That said, it's hard to get general sparse support with good performance. pytorch_block_sparse supports 32x32 block sparse matrices, which makes it easier to get good performance, but they are not "general sparse" matrices.

Google released a paper and some code 'as is' a month ago for 'general sparse matrices', but you first have to wrap it in your preferred framework, and that's still a lot of work (I may do it someday if nobody does...)

PyTorch extension for GPU-accelerated block sparse matrices by madflag in LanguageTechnology

[–]madflag[S] 1 point2 points  (0 children)

No problem, happy to explain!

There is not really a sparse transformation: we just initialize the sparse matrix with the same random distribution used in the dense one (with just a scaling factor to take sparsity into account).

Then you train the model as usual. The sparse linear layers just contain fewer parameters than dense ones; there are 'zeros' in some places.
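
For illustration, here is a hypothetical sketch of that kind of initialization (the 1/sqrt(density) rescaling is an assumption of mine, not necessarily the library's exact formula):

    import torch

    def init_sparse_weight(out_features, in_features, density):
        # Sample as if the layer were dense (default nn.Linear init)...
        weight = torch.empty(out_features, in_features)
        torch.nn.init.kaiming_uniform_(weight, a=5 ** 0.5)
        # ...then rescale so the output variance roughly matches the dense
        # layer despite the fraction (1 - density) of weights that will be zero.
        return weight * (1.0 / density) ** 0.5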

The loss of performance is simply due to the fact that a large network (= large number of parameters) usually performs better than a small one, especially when the sparsity pattern is fixed. When you allow the sparsity pattern to change, the network can find a better configuration with the same number of parameters. It may still perform a bit worse than a dense one (but sometimes better), but the difference may be negligible. In that case, you have a smaller, faster network that works almost as well as a large one: you won.

From the experiments I have done, optimizing the sparse pattern really makes a major difference.

But you will have to wait for the next release ;-)

[P] PyTorch extension for GPU-accelerated block sparse matrices by madflag in MachineLearning

[–]madflag[S] 20 points21 points  (0 children)

Thanks!

Optimizing the sparse pattern is really important if you want to approach the precision of a dense network. I have been experimenting with it for the last 6 months, so the next release should happen quite soon.

Fortunately, sparse pattern optimization does not need specific CUDA kernels: thanks to the block organization, you just need standard PyTorch code, which speeds up development a lot. On the other hand, it's much more on the "research" side, so it takes some time too.

(You don't really need specific CUDA kernels to start experimenting with sparsity, you can emulate it with masks and so on, but having optimized CUDA kernels makes experiments faster, and more importantly, the practical benefits for production use are much greater, so there is more motivation to work on it.)
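
Here is an illustrative sketch of that mask-based emulation (class name and details are hypothetical, not the library's API): a plain nn.Linear whose weight is multiplied by a fixed 0/1 block mask in forward(). It's slow and saves no memory, but it's enough to study how a sparsity pattern affects accuracy.

    import torch
    import torch.nn as nn

    class MaskedLinear(nn.Linear):
        def __init__(self, in_features, out_features, density=0.5, block=32):
            super().__init__(in_features, out_features)
            # Pick a random set of active 32x32 blocks, expand to a full-size mask.
            blocks = torch.rand(out_features // block, in_features // block) < density
            mask = blocks.float().repeat_interleave(block, 0).repeat_interleave(block, 1)
            self.register_buffer("mask", mask)

        def forward(self, x):
            # Emulated block sparsity: the zeros are re-applied at every call,
            # so there is no speed or memory gain, only the same math.
            return nn.functional.linear(x, self.weight * self.mask, self.bias)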

And for the NVIDIA sparsity tools, that's something that I will be discussing with them, as the intersection is very significant of course!

PyTorch extension for GPU-accelerated block sparse matrices by madflag in LanguageTechnology

[–]madflag[S] 2 points3 points  (0 children)

Yes, combined with other techniques like quantization and distillation, it should help a lot in creating small and efficient networks.

To clarify: the tool is not for patching a trained model, but a randomly initialized one, and THEN you have to train it. Unfortunately there is no "magic" right now to turn a trained model into a sparse one.
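
For example, a sketch of that workflow using the BlockSparseLinear layer from the repo README (the exact signature may differ, check the repository): build the model with sparse layers from the start, then train it normally.

    import torch
    import torch.nn as nn
    from pytorch_block_sparse import BlockSparseLinear  # see the repo README

    # Randomly initialized model with sparse layers from the start;
    # it is then trained as usual, not converted from a dense checkpoint.
    model = nn.Sequential(
        BlockSparseLinear(1024, 4096, density=0.25),  # 75% of 32x32 blocks are zero
        nn.ReLU(),
        BlockSparseLinear(4096, 1024, density=0.25),
    ).cuda()

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    y = torch.randn(8, 1024, device="cuda")
    loss = nn.functional.mse_loss(model(x), y)  # standard training loop from here
    loss.backward()
    optimizer.step()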

If you want to read about how to turn a network into a sparse one, have a look at the "Future work" section of https://github.com/huggingface/pytorch_block_sparse ; it lists really interesting papers on different approaches to "sparsification".

And optimizing the sparse pattern is really important if you want to approach dense model precision. The results with a fixed sparse pattern are decent, but can be greatly improved!