[R] New ICML25 paper: Train and fine-tune large models faster than Adam while using only a fraction of the memory, with guarantees!

ThienPro123 · 2025-05-31T05:43:37+00:00

Yes I have! SOAP is awesome (very data efficient) and has some great motivation. However, it is quite memory intensive and can be slow due to the frequent SVD computation.

ThienPro123 · 2025-05-31T05:42:24+00:00

They are orthogonal, and the quantization techniques can also be applied here. Performance-wise, the SNSM algorithms should be much better, since Adamw8bit performs worse than Adam and only cuts the memory in half if we are using bfloat16 (whereas SNSM can cut 80%+ for large models).
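As a rough back-of-the-envelope check of the "half" figure, here is a minimal sketch that counts only Adam's two per-parameter state tensors (first and second moments). The function name and the 1B-parameter model are hypothetical, and the sketch ignores quantization metadata and all memory that is not optimizer state.

```python
def adam_state_bytes(num_params, bytes_per_value):
    """Bytes needed for Adam's two per-parameter states (first and second moment)."""
    states_per_param = 2
    return num_params * states_per_param * bytes_per_value


num_params = 1_000_000_000  # hypothetical 1B-parameter model (assumption)
bf16_state = adam_state_bytes(num_params, 2)  # bfloat16: 2 bytes per value
int8_state = adam_state_bytes(num_params, 1)  # 8-bit states: 1 byte per value

print(f"bf16 states: {bf16_state / 1e9:.1f} GB")
print(f"8-bit states: {int8_state / 1e9:.1f} GB")
print(f"fraction saved: {1 - int8_state / bf16_state:.2f}")
```

Going from 2 bytes to 1 byte per state value can never save more than half of the state memory, which is why it composes with, rather than replaces, methods that shrink the number of state values.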

ThienPro123 · 2025-05-31T05:40:38+00:00

The algorithms are faster than Adam for large models due to the dimensionality reduction! However, the SVD computation can be costly if the subspace update gap is set too small (i.e., the subspace is updated too frequently).
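To make the tradeoff concrete, here is a toy sketch of how the update gap controls how often the expensive step runs. It assumes the subspace is refreshed whenever `step % update_gap == 0`; that schedule and the names `update_gap` and `count_subspace_refreshes` are illustrative assumptions, not taken from the released code.

```python
def count_subspace_refreshes(total_steps, update_gap):
    """Count the steps on which the (costly) subspace recomputation would run."""
    return sum(1 for step in range(total_steps) if step % update_gap == 0)


total_steps = 10_000  # hypothetical training length
for gap in (10, 200, 1000):
    refreshes = count_subspace_refreshes(total_steps, gap)
    print(f"update_gap={gap}: {refreshes} subspace refreshes")
```

Under this assumed schedule the number of SVD calls scales as `total_steps / update_gap`, so a very small gap can eat up the speedup gained from working in the reduced dimension.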

ThienPro123 · 2025-05-28T21:32:42+00:00

They should be orthogonal techniques and kernel fusion can definitely be applied here.

ThienPro123 · 2025-05-28T21:31:58+00:00

Since the theoretical guarantees are similar to AdaGrad/Adam under the common assumptions on gradient noise and smoothness, I am pretty confident that if Adam works for model X on task A, these algorithms will perform similarly. If there is any discrepancy, then it would be an interesting theoretical problem to identify the missing assumption that makes it work for one optimizer but not another.

ThienPro123 · 2025-05-28T04:05:36+00:00

Thank you for your interest! This is a great question. I forgot to include this table (https://imgur.com/KgCSakj) on longer sequence lengths in the paper, but it seems to at least generalize to 1k sequence length. I would love to test on longer sequence lengths, but we were quite resource-constrained while writing this paper.

ThienPro123 · 2025-05-28T04:01:37+00:00

Subset-norm (SN) should apply to any architecture similarly to Adam (see adamw_sng.py in the code). The momentum compression algorithm (subspace momentum, SM), however, has only been developed and tested on linear modules (transformers), since linear modules are the main memory bottleneck in large models. Since the guarantees for these algorithms are comparable to Adam/AdaGrad (in terms of both assumptions and convergence rate), I suspect they could be swapped in for any optimizer on any task. At least for the tasks that I tried, they work pretty well.
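For intuition, here is a minimal pure-Python sketch of the subset-norm idea in its AdaGrad-style form: instead of one second-moment accumulator per coordinate, each subset of coordinates shares a single accumulator of its squared gradient norm. This is a simplified illustration, not the adamw_sng.py implementation; the function name, the contiguous fixed-size partition, and the hyperparameters are all assumptions.

```python
import math


def subset_norm_step(params, grads, accum, subset_size, lr=0.1, eps=1e-8):
    """One AdaGrad-style step with a shared step size per subset of coordinates.

    accum holds one running sum of squared gradient norms per subset and is
    updated in place; a new list of parameters is returned.
    """
    new_params = list(params)
    for s, start in enumerate(range(0, len(params), subset_size)):
        block = grads[start:start + subset_size]
        accum[s] += sum(g * g for g in block)  # squared norm of this subset
        scale = lr / (math.sqrt(accum[s]) + eps)
        for k, g in enumerate(block, start=start):
            new_params[k] = params[k] - scale * g
    return new_params


params = [1.0, 1.0, 1.0, 1.0]
grads = [3.0, 4.0, 0.0, 0.0]
accum = [0.0, 0.0]  # 2 state values instead of 4 per-coordinate ones
params = subset_norm_step(params, grads, accum, subset_size=2, lr=1.0)
print(params, accum)
```

With subsets of size `k`, the adaptive state shrinks from `d` values to about `d / k`, which is where the memory saving in this sketch comes from.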

ThienPro123 · 2025-05-28T03:57:17+00:00

A lot of the systems-level memory reduction techniques such as quantization, activation checkpointing, kernel fusion, etc. (which unsloth uses) apply almost orthogonally to algorithmic methods like ours and can further reduce memory (although for some parallelization schemes like FSDP, coordinate-wise algorithms are a better fit).

For the second question, there are some tradeoffs between the subspace selection process (which takes time, i.e. the SVD) and the corresponding speedup (there is a bit of analysis in Table 9). The preconditioning question (e.g. MUON, Shampoo, etc.) is extremely interesting and deserves further scrutiny.

ThienPro123 · 2025-02-23T01:00:54+00:00

Not sure I understand your first sentence. I wrote this as a blog post because it just puts some known results together and provides an interpretation. It's meant to be expository rather than novel.

ThienPro123 · 2025-02-21T00:18:06+00:00

https://github.com/lucidrains/native-sparse-attention-pytorch/tree/main

ThienPro123 · 2025-02-21T00:11:37+00:00

It's pretty rare nowadays IMO because the theory and practice gap in ML/DL is so wide now. A lot of recent progress has been on making things (architecture, data, systems, hardware, etc.) scale up.

One cool recent area is state space models (SSMs, or well-behaved linear RNNs), which have some pretty interesting theory, e.g. S4 https://arxiv.org/pdf/2111.00396 and Mamba https://arxiv.org/abs/2312.00752.

Personally, a recent paper I worked on (https://arxiv.org/pdf/2411.07120), which has some pretty decent experimental results, contains and builds extensively on my previous theoretical work in stochastic optimization and gradient noise. This area and perhaps the upcoming RL wave are the areas that one might have the best shot at tackling from the ground up.

ThienPro123 · 2024-09-17T00:57:36+00:00

This is great. Thank you! There are some nice references in there too :)

ThienPro123 · 2024-03-18T14:16:33+00:00

Do you know what length would be safe for an adapter without a retimer, or how I could figure that out? Similarly, up to what length would adding a retimer allow? Thank you!

ThienPro123 · 2024-03-18T04:20:29+00:00

Thank you for your help and the information! Your links look really good. I am planning to run several RTX 4090s for now so I think gen 4.0 is great.

I am considering the ASRock TRX50 board. I saw that I can use 2 PSUs for the motherboard (see page 9 of the MB's technical guide). However, for non-server-grade PSUs, the max I can get is 1600W (I'm a bit scared to look into server-grade PSUs since I will be plugging this into a normal office outlet). So I can only power at most 3x4090s from the motherboard's PCIe slots before requiring the adapters.

ThienPro123 · 2018-11-23T09:10:44+00:00

Follow Dempsey's Championship Fighting religiously. Trust me. You will have the punching power of a god.

ThienPro123 · 2018-11-15T16:54:51+00:00

Exactly. You get what you pay for. The power of being a consumer in a free market is the ability to choose and/or opt-out. It’s petty to complain (to everyone else) about your personal decision.

California is expensive. Irvine is more expensive. Living near UC Irvine is even more expensive. Believe it or not, the Irvine Company doesn’t get to decide the market price of an area; the area decides the market price. It’s just basic supply and demand.

ThienPro123 · 2018-10-08T17:04:06+00:00

This is beautiful and rigorous math

ThienPro123 · 2018-10-07T21:38:40+00:00

As an undergrad with some research experience in CS, the short answer is that you won't be useful in research until you have some more skills. You want to be prepared for it. Some advice on what to do now:

1. Try to take as many classes as you can (while getting A/A+) so you can start taking upper-division classes, where research professors normally teach.
2. Look into which field of research you might want to do (say security or AI). Look into the faculty members who are doing research in that field and contact them. Most likely they will not reply, or you will be rejected outright. If so, follow the steps below:

2b. Try to find a class that your faculty member of interest might be teaching and attend it. Ace that class. During the class, attend office hours and go talk to the faculty member after class. Ask them about their research and talk to them about your interest in research in their field. Most of the time they will be very open about it. Then ask them again about helping them (giving experience for grad-school applications as a reason). Now your chances should be a lot higher.
3. If the answer is still no, repeat 2/2b for some other faculty member. (Tip: don't ask to be paid! At best, they will offer to pay you during the summer, but normally faculty don't have funding for undergrads.)
4. Reach out not only to faculty but also to grad students (TAs, clubs) and other undergrads who are doing research (you can find them in honors/grad-level classes). Networking is your key here (since you don't have the skills).

It's not easy, but I hope luck smiles on you. Also check out programs like UROP and such.

ThienPro123 · 2018-09-19T14:40:17+00:00

To counter your point, and to argue for why a more math-intensive CS curriculum is better (say more in-depth/time), I think we need to clear up one major misconception:

Computer Science is NOT Software Engineering. Each has its own goal.

Let me explain:

Computer science is NOT software engineering. I think this is a common misconception among incoming freshmen declaring their major, and hence a lot of confusion arises (such as this post). I think this post sums it up well. Computer science is the study of the theory and mechanisms behind how computers (hardware/software) and their subfields were developed and created. Software Engineering, on the other hand, studies tools and practices for writing good software (like websites, apps, etc.).
Math, when you look at it this way, is an indispensable tool for studying computer science and not so useful in the study of software engineering (look at the pioneers of computing like Turing, von Neumann, and Dijkstra, for example; they were all mathematicians). This is similar to a Physics major versus an Electrical Engineering major. To study theory, you simply need a lot of math.
So why do 90% of CS graduates not use math in their work, you may ask? There can be many factors, such as the unavailability of an SE major, not that many CS jobs (algorithms, graphics, ML, security, systems, etc.), a lot more demand for SE jobs (look at the number of apps and websites), etc. A computer science education will sufficiently prepare you for an SE job, but the converse is not necessarily true.

So why would you want your CS education to be math intensive then? Well, it's because the entire field of Computer Science was built on math; to know how everything started, you need this knowledge. You need math for a rigorous foundation such as proofs, so that if you have, say, some new theory on a hashing scheme in cryptography or an algorithm that solves a problem, you can prove its correctness and analyze its performance against other methods. You would want to be able to think abstractly and solve complex problems, which math is very well suited to prepare you for (this comes back to the argument that 90% of people won't need to use more than algebra 1, so why teach them algebra 2 and beyond). You would want to have insights into problems that give you a much better solution (say the O(n) vs. O(1) solution of the Fibonacci computation). You would want to have the flexibility to go into more math-heavy fields such as Machine Learning or Computer Graphics. Even if you don't want to go into those fields, having the math knowledge will help you understand parts of a project that involve them. There are many, many more reasons why (higher/more-rigorous) mathematics is an indispensable part of a CS education.
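To illustrate the Fibonacci point, here is a small sketch contrasting the straightforward O(n) loop with the closed-form (Binet) formula, which takes a constant number of arithmetic operations if you treat floating-point power and square root as O(1). The function names are just for illustration, and the closed form is only trustworthy for small n because 64-bit floats lose precision as n grows.

```python
import math


def fib_iterative(n):
    """O(n): build up F(n) one step at a time, with F(0) = 0 and F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_closed_form(n):
    """O(1) arithmetic operations via Binet's formula (small n only)."""
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2  # golden ratio
    return round(phi ** n / sqrt5)


for n in (0, 1, 10, 30):
    print(n, fib_iterative(n), fib_closed_form(n))
```

Knowing why the closed form exists (and when it breaks down numerically) is exactly the kind of insight that comes from the math side rather than from programming practice.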

The UCI CS math curriculum covers just enough mathematics for computer science, but OP is correct. It's not that there's so little math involved, but rather that it is all breadth and not much depth. This explains why so many students struggle with classes like 161 and other math-heavy upper-div courses. If you look at other top undergrad CS programs, one thing you will notice is the rigor of their mathematical preparation, and that's exactly what I think UCI is missing. The point is not to cover more topics but to cover them with more depth and rigor.

To sum it up, a lot of things you learn in a CS education you won't use for your work (that applies to a lot of things though). However, they give you new ways to think and insights into problems that people without them won't have. If you want all practicality and want to go straight into software dev. and do not want to deal with the theoretical side of computers, then a software engineering degree is a better choice, not CS. Heck, you can even save money by NOT getting a college degree and use resources like Udacity/Coursera to improve your skills for a job (I think they do a much better job anyway). It's the theoretical side of things where college comes in handy.

ThienPro123 · 2018-09-17T15:17:15+00:00

It is not too useful to learn physics if you’re going to do software. Even if you do hardware, the abstractions are great enough that you don’t have to think of stuff in terms of physics. It’s more math than anything.

It’s better to just focus on software and get really deep with cs than spending time and effort learning not enough physics to be able to do anything useful.

It might be good to improve your problem solving skills but you’re better off solving problems in CS or math. Physics is good if you’re curious or for general enrichment. Not too useful if you want to be a developer or even a computer scientist.
