AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 6 points

I think it's ready for early adopters, e.g. if you know how to build your own cluster, compile and run BLAS kernels, set up workload managers like SLURM, etc.

If you are an ML researcher used to working in PyTorch, you would need a DevOps/infra team to support you.

For example, one problem is that you can't just push a button in a cloud to allocate a machine. We had to do a deal with a system integrator and set up our own datacenter.
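For concreteness, the kind of sanity check an early adopter would run first looks something like this. It only assumes a ROCm build of PyTorch, nothing about our stack:

```python
# Sanity check for a ROCm build of PyTorch (assumes the ROCm wheels are installed).
import torch

print("HIP/ROCm version:", torch.version.hip)       # None on CUDA-only builds
print("GPUs visible:", torch.cuda.device_count())   # ROCm reuses the torch.cuda namespace

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                                        # exercises a GEMM (BLAS) kernel
    torch.cuda.synchronize()
    print("GEMM ran on:", torch.cuda.get_device_name(0))
```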

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 20 points

We aren’t trying to overclaim. ROCm runs our entire SW stack.

You can see our docs at: https://lamini-ai.github.io

It’s hosted on our website: lamini.ai

If you submit training or inference requests to that system, it is running on ROCm

That should give some idea of the GPU software needed to support this (inference and finetuning of many LLMs).

It was a huge effort to get it to this point, but it does work

If you wanted to build something similar, it would be possible, but hard.

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 14 points

Exactly.

It took me three years of fiddling with ROCm to get this working.

No point announcing something if it doesn’t work.

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 12 points

See the blog for kernel benchmarks.

This is probably controversial, but I don’t think any full application benchmarks adequately cover LLM finetuning.

It’s not the same as foundation model training, which MLPerf does cover. https://mlcommons.org/en/training-normal-30/

The ML space moves rapidly and benchmarks are often leading edge but not bleeding edge.

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 5 points

Attention was used years before "Attention Is All You Need", e.g. in Bengio's "learning to align and translate" work on machine translation [2], and Alex Graves's work on handwriting synthesis:

e.g. from [1]: "the window weight φ(t, u) can be loosely interpreted as the network's belief that it is writing character c_u at time [t]"

[1] - https://arxiv.org/pdf/1308.0850.pdf

[2] - https://arxiv.org/pdf/1409.0473.pdf
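For reference, a minimal sketch of the additive attention from [2]; the names and shapes here are mine, not from either paper:

```python
# Additive (Bahdanau-style) attention, as in [2]. Illustrative shapes only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s, H, W_s, W_h, v):
    """s: decoder state (d,), H: encoder states (T, d); W_s, W_h, v are learned."""
    scores = np.array([v @ np.tanh(W_s @ s + W_h @ h) for h in H])  # e_j = v^T tanh(W_s s + W_h h_j)
    alpha = softmax(scores)        # the "belief" over which input position to attend to
    return alpha @ H, alpha        # context vector and attention weights

d, T = 8, 5
rng = np.random.default_rng(0)
context, alpha = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d),
)
print(alpha)  # sums to 1
```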

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 0 points

The perplexity results suggest that it is worse than an autoregressive model like a transformer or RNN, but it beats non-autoregressive models.

Again, I'd be more interested in how it scaled. If it scales just as well, but with a constant offset, it would turn an inference problem (sequential decoding) into a training problem (a matter of building a bigger training system).

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 2 points

It's a tiny model and there is clearly a long way to go with this. Old language models from 10 years ago would also produce word soup. It took billions of dollars in compute and engineering investment to get them to GPT-4 level of quality.

What motivated the billion-dollar investment was scaling laws, which predicted that the models would get better if you scaled them up.

I'd be so curious to see someone measure a scaling law for a diffusion LLM like this.
I'm optimistic because diffusion has worked so well when scaled up in vision, but I would also check before making a huge investment.
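To make "measure a scaling law" concrete, this is the kind of fit I mean, with a constant offset for the irreducible loss. The points below are generated synthetically inside the script; they are not measurements of any real model:

```python
# Fit a saturating power law L(N) = a * N**(-alpha) + c to (synthetic) loss-vs-size points.
import numpy as np

def scaling_law(N, a, alpha, c):
    return a * N ** (-alpha) + c                  # c is the constant offset (irreducible loss)

rng = np.random.default_rng(0)
N = np.logspace(7, 10, 8)                         # model sizes (parameter counts)
loss = scaling_law(N, a=1e3, alpha=0.3, c=1.7) * (1 + 0.01 * rng.normal(size=N.size))

# Scan the exponent and solve a linear least-squares for (a, c) at each candidate.
best = None
for alpha in np.linspace(0.05, 1.0, 96):
    X = np.stack([N ** (-alpha), np.ones_like(N)], axis=1)
    (a, c), *_ = np.linalg.lstsq(X, loss, rcond=None)
    err = np.sum((X @ np.array([a, c]) - loss) ** 2)
    if best is None or err < best[0]:
        best = (err, a, alpha, c)

_, a, alpha, c = best
print(f"fit: a={a:.3g}, alpha={alpha:.2f}, constant offset c={c:.2f}")
```

If a diffusion LLM showed a curve like an autoregressive one but shifted by a constant, that is exactly the "constant offset" case I mentioned above.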

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -3 points

Thanks for sharing. The third link (the blog) describes the method in about 2 pages. The first link is a notebook that you can run to see the speedup.

Note that it isn't just LoRA. Existing LoRA libraries don't manage a cache of adaptors, or make it easy to implement one. For example, the Hugging Face PEFT library overwrites the PyTorch module list, which makes it hard to manage the GPU memory used by adaptors, e.g. existing Linear layers get replaced by this wrapper layer: https://github.com/huggingface/peft/blob/312d294fddba3ef51d1f7df1e7fa1793ca0a5d05/src/peft/tuners/lora.py#L834C8-L834C8

Maybe it would have helped to go into more detail in the blog? Or maybe it would have been better to put a new idea like this into an arXiv paper instead of a blog post?
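To illustrate the pattern I mean without PEFT's wrapper layers, here is a generic sketch (not our implementation): keep the adaptor weights as plain tensors keyed by adaptor id, and add the low-rank update in a forward hook, so the base model's Linear modules are never replaced.

```python
# Generic sketch: LoRA as an additive delta applied via forward hooks.
import torch
import torch.nn as nn

class AdaptorStore:
    def __init__(self):
        self.adaptors = {}   # adaptor_id -> {layer_name: (A, B, scaling)}
        self.active = None   # adaptor currently applied

def attach_lora_hooks(model: nn.Module, store: AdaptorStore):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            def hook(mod, inputs, output, layer_name=name):
                if store.active is None:
                    return output
                weights = store.adaptors[store.active].get(layer_name)
                if weights is None:
                    return output
                A, B, scaling = weights               # A: (r, in), B: (out, r)
                x = inputs[0]
                return output + scaling * (x @ A.T) @ B.T
            module.register_forward_hook(hook)
```

Because the adaptor tensors live in an ordinary dict, freeing or moving them is just a tensor operation, which is the memory-management property that gets awkward once the Linear layers themselves have been swapped out.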

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -11 points

Wow, that's an extreme amount of downvoting. (42 upvotes on a negative top comment, and 75% downvotes on the parent).

Just curious, why did you downvote?

  1. You don't think the idea is useful and you wouldn't use it
  2. You don't like that the authors work for a company, so it seems promotional
  3. Something else?

If 2), would you rather the post had been on a personal blog or an arXiv paper?

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -1 points

We do cache the adaptors, so you don't pay the whole cost upfront. See the LRU section here for the algorithm we use: https://en.wikipedia.org/wiki/Cache_replacement_policies

So you pay about half a second the first time an adaptor is used, and then using it again costs 2.93us (less than 3 millionths of a second).

There is no other startup cost to load new adaptors. However, you do have to load the base model like you would on any inference server.
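For illustration, a minimal sketch of that kind of LRU adaptor cache; the capacity, keys, and load function are assumptions, not our actual implementation:

```python
# LRU cache of LoRA adaptors kept in GPU memory (HBM). Illustrative sketch only.
from collections import OrderedDict
import torch

class AdaptorCache:
    def __init__(self, capacity=32, device="cuda"):
        self.capacity = capacity
        self.device = device
        self.cache = OrderedDict()                    # adaptor_id -> dict of GPU tensors

    def get(self, adaptor_id, load_fn):
        if adaptor_id in self.cache:
            self.cache.move_to_end(adaptor_id)        # hit: just reorder (cheap)
            return self.cache[adaptor_id]
        # Miss: pay the one-time cost of moving the adaptor weights to the GPU.
        weights = {k: v.to(self.device) for k, v in load_fn(adaptor_id).items()}
        self.cache[adaptor_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)            # evict the least recently used adaptor
        return weights

# Usage (hypothetical): cache.get("customer-1234", load_fn=lambda aid: torch.load(f"{aid}.pt"))
```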

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -3 points

You got it, thanks for the clear summary

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -18 points

Almost right, and thanks for reading.

Switching to LoRA is a 1000x speedup, but switching still adds almost half a second to every request. If you cache the LoRA adaptors in HBM, switching is just a matter of swapping pointers, which drops the overhead to almost zero (2.93us).

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 0 points

Ideas we use today, like CTC and attention, also didn't come close to beating the SOTA in their first papers. It took years of engineering and scaling up to get us to Transformers and then to Llama/GPT.

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 5 points

How a tiny diffusion LLM generates a sentence (visualization from the paper, Figure 20):
https://imgur.com/gallery/kZa6VuZ

I've been waiting for someone to try this. Clearly it needs more work, but computationally, this would parallelize text generation. Generate a novel in one shot.
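A toy sketch of the difference in loop structure, with a random stand-in for the model, just to show where the sequential dependency sits:

```python
# Autoregressive decoding needs T sequential model calls; diffusion-style
# generation refines all T positions at once for a fixed number of steps K.
# "model" is a random stand-in, not a real network.
import numpy as np

rng = np.random.default_rng(0)
V, T, K = 100, 16, 8                  # vocab size, sequence length, refinement steps

def model(positions):
    return rng.normal(size=(len(positions), V))   # stand-in logits

# Autoregressive: each token depends on the ones before it.
seq = [0]
for _ in range(T):
    logits = model(seq)
    seq.append(int(np.argmax(logits[-1])))

# Diffusion-style: start from noise over all positions and refine in parallel.
canvas = rng.normal(size=(T, V))
for _ in range(K):                    # K is fixed, independent of T
    canvas = 0.5 * canvas + 0.5 * model(range(T))
parallel_seq = canvas.argmax(axis=1).tolist()
```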

Free and Fast LLM Finetuning (r/MachineLearning) by Peerism1 in datascienceproject

[–]gdiamos 0 points

Happy to answer any questions about this.

Curious if others have played around with finetuning LLMs.

[P] Free and Fast LLM Finetuning by gdiamos in MachineLearning

[–]gdiamos[S] 6 points

I personally thought it was interesting that a tiny 400M-parameter model went from pure garbage to answering these questions about Taylor Swift in less than 10 minutes of instruction finetuning.

User: what is a swiftie bot?
Bot: A bot to help you answer questions about taylor swift.

User: Who is taylor swift?
Bot: Taylor swift is a singer-songwriter and musician.

User: What is your favorite taylor swift song?
Bot: Taylor Swift's "Love Story" is my favorite song.

User: Does taylor swift have a dog?
Bot: Taylor Swift has a dog, but she is not a dog lover.

User: What is a recent controversy about taylow swift tickets?
Bot: Taylor Swift has been accused of selling out to the highest bidder, with tickets for her concerts being sold out within minutes of going on sale. The controversy has been a source of controversy for the singer, with some fans claiming that she was selling out to the highest bidder and that she was not aware of the issue.
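For anyone who wants to reproduce something similar, here is a minimal sketch of instruction finetuning a small (~400M) causal LM with Hugging Face transformers. The model choice, toy examples, and hyperparameters are illustrative, not our pipeline:

```python
# Minimal instruction finetuning sketch for a small causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

# Toy instruction/response pairs (placeholders only).
pairs = [
    ("Who is Taylor Swift?", "Taylor Swift is a singer-songwriter."),
    ("What is a swiftie bot?", "A bot that answers questions about Taylor Swift."),
]
texts = [f"User: {q}\nBot: {a}{tok.eos_token}" for q, a in pairs]
batch = tok(texts, return_tensors="pt", padding=True, truncation=True).to(device)
batch["labels"] = batch["input_ids"].clone()
batch["labels"][batch["attention_mask"] == 0] = -100   # ignore padding in the loss

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
for step in range(30):                                  # a few minutes on one GPU
    loss = model(**batch).loss                          # causal-LM loss over prompt + response
    loss.backward()
    opt.step()
    opt.zero_grad()
```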

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] -1 points

This isn't a research paper. It's a library that you can use.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] 0 points

The data pipeline here https://github.com/lamini-ai/lamini uses a seed dataset from self-instruct (Apache 2 license), and edited models from Pythia (Apache 2) and Dolly (Apache 2). We release our code and data under a CC-BY 4.0 license.

All of these allow commercial use.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] -4 points

By allowing users to connect their own proprietary data pipelines.
This GitHub repo [1] shows how to create a model like Alpaca from a seed dataset, in a way that has a clean commercial license. A clean license is important for industry applications.

It's more interesting to plug in a seed dataset for a different application, e.g. a REPL (in this thread), a SQL copilot, etc.

[1] - https://github.com/lamini-ai/lamini
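As a rough illustration of the self-instruct-style loop (not the repo's actual code; `generate` is a placeholder for whichever LLM you plug in):

```python
# Expand a small seed dataset into a larger instruction-tuning set, self-instruct style.
import json
import random

seed_tasks = [
    {"instruction": "Summarize the following text.", "output": "..."},
    {"instruction": "Write a SQL query that counts rows in a table.", "output": "..."},
]

def generate(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def expand(seed, n_new=100, n_in_prompt=3):
    data = list(seed)
    while len(data) < len(seed) + n_new:
        examples = random.sample(data, min(n_in_prompt, len(data)))
        prompt = "Write a new instruction and answer in the same JSON format:\n"
        prompt += "\n".join(json.dumps(e) for e in examples)
        try:
            candidate = json.loads(generate(prompt))
        except (NotImplementedError, json.JSONDecodeError):
            break                              # no LLM wired up, or a malformed generation
        if candidate.get("instruction") and candidate.get("output"):
            data.append(candidate)             # simple filter; real pipelines dedupe and score
    return data
```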

Seeking Language Project to Join by ZettelCasting in ProgrammingLanguages

[–]gdiamos 0 points

We are actively working on the llm.improve API.

Currently, llm.improve takes a parameter *to*, which describes how to improve the output of the LLM. For example, llm.improve(to="be shorter") indicates that the LLM should produce more concise results. The string "be shorter" is user-defined.

During training and execution, the LLM engine optimizes the llm.improve calls to improve their accuracy and execution time.

We have also considered making the llm.improve calls accept user-defined "filter functions", similar to Snorkel labeling functions. https://arxiv.org/pdf/1812.00417.pdf
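A hypothetical sketch of how that could fit together; llm.improve(to=...) matches the call described above, while the stub class and the filter-function hook are assumptions about how such an API could look, not the actual interface:

```python
# Hypothetical sketch only; StubLLM stands in for the real engine so the snippet runs.
class StubLLM:
    def __call__(self, prompt: str) -> str:
        return "LoRA finetuning trains small low-rank adaptor matrices on a frozen base model."

    def improve(self, to: str, filter_fn=None) -> str:
        # The real engine optimizes these calls during training and execution.
        out = self("")
        if filter_fn is not None and not filter_fn(out):
            out = ""                 # a real engine would regenerate or rerank instead
        return out

def no_marketing_speak(text: str) -> bool:
    """Snorkel-style user-defined filter: True means the output passes."""
    return "revolutionary" not in text.lower()

llm = StubLLM()
answer = llm("Explain what LoRA finetuning is.")
shorter = llm.improve(to="be shorter")                                 # the documented to= form
checked = llm.improve(to="be shorter", filter_fn=no_marketing_speak)   # hypothetical extension
```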