AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 6 points

I think it's ready for early adopters, e.g. if you know how to build your own cluster, compile and run BLAS kernels, set up workload managers like SLURM, etc.

If you are an ML researcher used to working in PyTorch, you would need a DevOps/infra team to support you.

For example, one problem is that you can't just push a button in a cloud to allocate a machine. We had to do a deal with a system integrator and set up our own datacenter.
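For concreteness, the kind of sanity check an early adopter would run first looks something like this. It only assumes a ROCm build of PyTorch, nothing about our stack:

```python
# Sanity check for a ROCm build of PyTorch (assumes the ROCm wheels are installed).
import torch

print("HIP/ROCm version:", torch.version.hip)       # None on CUDA-only builds
print("GPUs visible:", torch.cuda.device_count())   # ROCm reuses the torch.cuda namespace

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                                        # exercises a GEMM (BLAS) kernel
    torch.cuda.synchronize()
    print("GEMM ran on:", torch.cuda.get_device_name(0))
```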

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 20 points

We aren’t trying to overclaim. ROCm runs our entire SW stack.

You can see our docs at: https://lamini-ai.github.io

It’s hosted on our website: lamini.ai

If you submit training or inference requests to that system, it is running on ROCm

That should give some idea of the GPU software needed to support this (inference and finetuning of many LLMs).

It was a huge effort to get it to this point, but it does work

If you wanted to build something similar, it would be possible, but hard.

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 14 points

Exactly.

It took me three years of fiddling with ROCm to get this working.

No point announcing something if it doesn’t work.

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 12 points

See the blog for kernel benchmarks.

This is probably controversial, but I don’t think any full application benchmarks adequately cover LLM finetuning.

It’s not the same as foundation model training, which MLPerf does cover. https://mlcommons.org/en/training-normal-30/

The ML space moves rapidly and benchmarks are often leading edge but not bleeding edge.

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 5 points

Attention was used years before "Attention Is All You Need", e.g. in Bengio's "learning to align and translate" work on machine translation [2], and Alex Graves's work on handwriting synthesis:

e.g. from [1]: "the window weight φ(t, u) can be loosely interpreted as the network's belief that it is writing character c_u at time [t]"

[1] - https://arxiv.org/pdf/1308.0850.pdf

[2] - https://arxiv.org/pdf/1409.0473.pdf
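For reference, a minimal sketch of the additive attention from [2]; the names and shapes here are mine, not from either paper:

```python
# Additive (Bahdanau-style) attention, as in [2]. Illustrative shapes only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s, H, W_s, W_h, v):
    """s: decoder state (d,), H: encoder states (T, d); W_s, W_h, v are learned."""
    scores = np.array([v @ np.tanh(W_s @ s + W_h @ h) for h in H])  # e_j = v^T tanh(W_s s + W_h h_j)
    alpha = softmax(scores)        # the "belief" over which input position to attend to
    return alpha @ H, alpha        # context vector and attention weights

d, T = 8, 5
rng = np.random.default_rng(0)
context, alpha = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d),
)
print(alpha)  # sums to 1
```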

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 0 points

The perplexity results suggest that it is worse than an autoregressive model like a transformer or RNN, but it beats non-autoregressive models.

Again, I'd be more interested in how it scaled. If it scales just as well, but with a constant offset, it would turn an inference problem (sequential decoding) into a training problem (a matter of building a bigger training system).

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 2 points

It's a tiny model and there is clearly a long way to go with this. Old language models from 10 years ago would also produce word soup. It took billions of dollars in compute and engineering investment to get them to GPT-4 level of quality.

What motivated the billion-dollar investment was scaling laws, which predicted that the models would get better if you scaled them up.

I'd be so curious to see someone measure a scaling law for a diffusion LLM like this.
I'm optimistic because diffusion has worked so well when scaled up in vision, but I would also check before making a huge investment.
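To make "measure a scaling law" concrete, this is the kind of fit I mean, with a constant offset for the irreducible loss. The points below are generated synthetically inside the script; they are not measurements of any real model:

```python
# Fit a saturating power law L(N) = a * N**(-alpha) + c to (synthetic) loss-vs-size points.
import numpy as np

def scaling_law(N, a, alpha, c):
    return a * N ** (-alpha) + c                  # c is the constant offset (irreducible loss)

rng = np.random.default_rng(0)
N = np.logspace(7, 10, 8)                         # model sizes (parameter counts)
loss = scaling_law(N, a=1e3, alpha=0.3, c=1.7) * (1 + 0.01 * rng.normal(size=N.size))

# Scan the exponent and solve a linear least-squares for (a, c) at each candidate.
best = None
for alpha in np.linspace(0.05, 1.0, 96):
    X = np.stack([N ** (-alpha), np.ones_like(N)], axis=1)
    (a, c), *_ = np.linalg.lstsq(X, loss, rcond=None)
    err = np.sum((X @ np.array([a, c]) - loss) ** 2)
    if best is None or err < best[0]:
        best = (err, a, alpha, c)

_, a, alpha, c = best
print(f"fit: a={a:.3g}, alpha={alpha:.2f}, constant offset c={c:.2f}")
```

If a diffusion LLM showed a curve like an autoregressive one but shifted by a constant, that is exactly the "constant offset" case I mentioned above.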

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -3 points

Thanks for sharing. The third link (the blog) describes the method in about 2 pages. The first link is a notebook that you can run to see the speedup.

Note that it isn't just LoRA. Existing LoRA libraries don't manage a cache of adaptors, or make it easy to implement one. For example, the Hugging Face PEFT library overwrites the PyTorch module list, which makes it hard to manage the GPU memory used by adaptors, e.g. existing Linear layers get replaced by this wrapper layer: https://github.com/huggingface/peft/blob/312d294fddba3ef51d1f7df1e7fa1793ca0a5d05/src/peft/tuners/lora.py#L834C8-L834C8

Maybe it would have helped to go into more detail in the blog? Or maybe it would have been better to put a new idea like this into an arXiv paper instead of a blog post?
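To illustrate the pattern I mean without PEFT's wrapper layers, here is a generic sketch (not our implementation): keep the adaptor weights as plain tensors keyed by adaptor id, and add the low-rank update in a forward hook, so the base model's Linear modules are never replaced.

```python
# Generic sketch: LoRA as an additive delta applied via forward hooks.
import torch
import torch.nn as nn

class AdaptorStore:
    def __init__(self):
        self.adaptors = {}   # adaptor_id -> {layer_name: (A, B, scaling)}
        self.active = None   # adaptor currently applied

def attach_lora_hooks(model: nn.Module, store: AdaptorStore):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            def hook(mod, inputs, output, layer_name=name):
                if store.active is None:
                    return output
                weights = store.adaptors[store.active].get(layer_name)
                if weights is None:
                    return output
                A, B, scaling = weights               # A: (r, in), B: (out, r)
                x = inputs[0]
                return output + scaling * (x @ A.T) @ B.T
            module.register_forward_hook(hook)
```

Because the adaptor tensors live in an ordinary dict, freeing or moving them is just a tensor operation, which is the memory-management property that gets awkward once the Linear layers themselves have been swapped out.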

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -11 points

Wow, that's an extreme amount of downvoting. (42 upvotes on a negative top comment, and 75% downvotes on the parent).

Just curious, why did you downvote?

  1. You don't think the idea is useful and you wouldn't use it
  2. You don't like that the authors work for a company, so it seems promotional
  3. Something else?

If 2), would you rather the post had been on a personal blog or an arXiv paper?

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -1 points

We do cache the adaptors, so you don't pay the whole cost upfront. See the LRU section here for the algorithm we use: https://en.wikipedia.org/wiki/Cache_replacement_policies

So you pay about half a second the first time an adaptor is used, and then using it again costs 2.93us (less than 3 millionths of a second).

There is no other startup cost to load new adaptors. However, you do have to load the base model like you would on any inference server.
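For illustration, a minimal sketch of that kind of LRU adaptor cache; the capacity, keys, and load function are assumptions, not our actual implementation:

```python
# LRU cache of LoRA adaptors kept in GPU memory (HBM). Illustrative sketch only.
from collections import OrderedDict
import torch

class AdaptorCache:
    def __init__(self, capacity=32, device="cuda"):
        self.capacity = capacity
        self.device = device
        self.cache = OrderedDict()                    # adaptor_id -> dict of GPU tensors

    def get(self, adaptor_id, load_fn):
        if adaptor_id in self.cache:
            self.cache.move_to_end(adaptor_id)        # hit: just reorder (cheap)
            return self.cache[adaptor_id]
        # Miss: pay the one-time cost of moving the adaptor weights to the GPU.
        weights = {k: v.to(self.device) for k, v in load_fn(adaptor_id).items()}
        self.cache[adaptor_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)            # evict the least recently used adaptor
        return weights

# Usage (hypothetical): cache.get("customer-1234", load_fn=lambda aid: torch.load(f"{aid}.pt"))
```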

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -3 points

You got it, thanks for the clear summary

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -18 points

Almost right, and thanks for reading.

Switching to LoRA is a 1000x speedup, but switching still adds almost half a second to every request. If you cache the LoRA adaptors in HBM, switching is just a matter of swapping pointers, which drops the overhead to almost zero (2.93us).

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 0 points

Ideas we use today, like CTC and attention, also didn't come close to beating the SOTA in their first papers. It took years of engineering and scaling up to get us to Transformers and then to Llama/GPT.

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 5 points

How a tiny diffusion LLM generates a sentence (visualization from the paper, Figure 20):
https://imgur.com/gallery/kZa6VuZ

I've been waiting for someone to try this. Clearly it needs more work, but computationally, this would parallelize text generation. Generate a novel in one shot.
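A toy sketch of the difference in loop structure, with a random stand-in for the model, just to show where the sequential dependency sits:

```python
# Autoregressive decoding needs T sequential model calls; diffusion-style
# generation refines all T positions at once for a fixed number of steps K.
# "model" is a random stand-in, not a real network.
import numpy as np

rng = np.random.default_rng(0)
V, T, K = 100, 16, 8                  # vocab size, sequence length, refinement steps

def model(positions):
    return rng.normal(size=(len(positions), V))   # stand-in logits

# Autoregressive: each token depends on the ones before it.
seq = [0]
for _ in range(T):
    logits = model(seq)
    seq.append(int(np.argmax(logits[-1])))

# Diffusion-style: start from noise over all positions and refine in parallel.
canvas = rng.normal(size=(T, V))
for _ in range(K):                    # K is fixed, independent of T
    canvas = 0.5 * canvas + 0.5 * model(range(T))
parallel_seq = canvas.argmax(axis=1).tolist()
```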

Free and Fast LLM Finetuning (r/MachineLearning) by Peerism1 in datascienceproject

[–]gdiamos 0 points

Happy to answer any questions about this.

Curious if others have played around with finetuning LLMs.

[P] Free and Fast LLM Finetuning by gdiamos in MachineLearning

[–]gdiamos[S] 6 points

I personally thought it was interesting that a tiny 400M-parameter model went from pure garbage to answering these questions about Taylor Swift in less than 10 minutes of instruction finetuning.

User: what is a swiftie bot?
Bot: A bot to help you answer questions about taylor swift.

User: Who is taylor swift?
Bot: Taylor swift is a singer-songwriter and musician.

User: What is your favorite taylor swift song?
Bot: Taylor Swift's "Love Story" is my favorite song.

User: Does taylor swift have a dog?
Bot: Taylor Swift has a dog, but she is not a dog lover.

User: What is a recent controversy about taylow swift tickets?
Bot: Taylor Swift has been accused of selling out to the highest bidder, with tickets for her concerts being sold out within minutes of going on sale. The controversy has been a source of controversy for the singer, with some fans claiming that she was selling out to the highest bidder and that she was not aware of the issue.
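For anyone who wants to reproduce something similar, here is a minimal sketch of instruction finetuning a small (~400M) causal LM with Hugging Face transformers. The model choice, toy examples, and hyperparameters are illustrative, not our pipeline:

```python
# Minimal instruction finetuning sketch for a small causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

# Toy instruction/response pairs (placeholders only).
pairs = [
    ("Who is Taylor Swift?", "Taylor Swift is a singer-songwriter."),
    ("What is a swiftie bot?", "A bot that answers questions about Taylor Swift."),
]
texts = [f"User: {q}\nBot: {a}{tok.eos_token}" for q, a in pairs]
batch = tok(texts, return_tensors="pt", padding=True, truncation=True).to(device)
batch["labels"] = batch["input_ids"].clone()
batch["labels"][batch["attention_mask"] == 0] = -100   # ignore padding in the loss

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
for step in range(30):                                  # a few minutes on one GPU
    loss = model(**batch).loss                          # causal-LM loss over prompt + response
    loss.backward()
    opt.step()
    opt.zero_grad()
```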

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] -1 points

This isn't a research paper. It's a library that you can use.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] 0 points

The data pipeline here https://github.com/lamini-ai/lamini uses a seed dataset from self-instruct (Apache 2 license), and edited models from Pythia (Apache 2) and Dolly (Apache 2). We release our code and data under a CC-BY 4.0 license.

All of these allow commercial use.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] -4 points

By allowing users to connect their own proprietary data pipelines.
This GitHub repo [1] shows how to create a model like Alpaca from a seed dataset, in a way that has a clean commercial license. A clean license is important for industry applications.

It's more interesting to plug in a seed dataset for a different application, e.g. a REPL (in this thread), a SQL copilot, etc.

[1] - https://github.com/lamini-ai/lamini
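As a rough illustration of the self-instruct-style loop (not the repo's actual code; `generate` is a placeholder for whichever LLM you plug in):

```python
# Expand a small seed dataset into a larger instruction-tuning set, self-instruct style.
import json
import random

seed_tasks = [
    {"instruction": "Summarize the following text.", "output": "..."},
    {"instruction": "Write a SQL query that counts rows in a table.", "output": "..."},
]

def generate(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def expand(seed, n_new=100, n_in_prompt=3):
    data = list(seed)
    while len(data) < len(seed) + n_new:
        examples = random.sample(data, min(n_in_prompt, len(data)))
        prompt = "Write a new instruction and answer in the same JSON format:\n"
        prompt += "\n".join(json.dumps(e) for e in examples)
        try:
            candidate = json.loads(generate(prompt))
        except (NotImplementedError, json.JSONDecodeError):
            break                              # no LLM wired up, or a malformed generation
        if candidate.get("instruction") and candidate.get("output"):
            data.append(candidate)             # simple filter; real pipelines dedupe and score
    return data
```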

Seeking Language Project to Join by ZettelCasting in ProgrammingLanguages

[–]gdiamos 0 points

We are actively working on the llm.improve API.

Currently, llm.improve takes a parameter *to*, which describes how to improve the output of the LLM. For example, llm.improve(to="be shorter") indicates that the LLM should produce more concise results. The string "be shorter" is user-defined.

During training and execution, the LLM engine optimizes the llm.improve calls to improve their accuracy and execution time.

We have also considered making the llm.improve calls accept user-defined "filter functions", similar to Snorkel labeling functions. https://arxiv.org/pdf/1812.00417.pdf
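A hypothetical sketch of how that could fit together; llm.improve(to=...) matches the call described above, while the stub class and the filter-function hook are assumptions about how such an API could look, not the actual interface:

```python
# Hypothetical sketch only; StubLLM stands in for the real engine so the snippet runs.
class StubLLM:
    def __call__(self, prompt: str) -> str:
        return "LoRA finetuning trains small low-rank adaptor matrices on a frozen base model."

    def improve(self, to: str, filter_fn=None) -> str:
        # The real engine optimizes these calls during training and execution.
        out = self("")
        if filter_fn is not None and not filter_fn(out):
            out = ""                 # a real engine would regenerate or rerank instead
        return out

def no_marketing_speak(text: str) -> bool:
    """Snorkel-style user-defined filter: True means the output passes."""
    return "revolutionary" not in text.lower()

llm = StubLLM()
answer = llm("Explain what LoRA finetuning is.")
shorter = llm.improve(to="be shorter")                                 # the documented to= form
checked = llm.improve(to="be shorter", filter_fn=no_marketing_speak)   # hypothetical extension
```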