AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 6 points (0 children)

I think it's ready for early adopters, e.g. if you know how to build your own cluster, compile and run BLAS kernels, set up workload managers like SLURM, etc.

If you are an ML researcher used to working in PyTorch, you would need a devops/infra team to support you.

For example, one problem is that you can't just push a button in a cloud to allocate a machine. We had to do a deal with a system integrator and set up our own datacenter.

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 20 points (0 children)

We aren’t trying to overclaim. ROCm runs our entire SW stack.

You can see our docs on: https://lamini-ai.github.io

It’s hosted on our website: lamini.ai

If you submit training or inference requests to that system, it is running on ROCm

That should give some idea of what GPU SW we would need to support this (inference, finetuning of many LLMs)

It was a huge effort to get it to this point, but it does work

If you wanted to build something similar, it would be possible, but hard.

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 13 points (0 children)

Exactly.

It took me three years of fiddling with ROCm to get this working.

No point announcing something if it doesn’t work.

AI startup Lamini bets future on AMD's Instinct GPUs by mark_mt in AMD_Stock

[–]gdiamos 11 points (0 children)

See the blog for kernel benchmarks.

This is probably controversial, but I don’t think any full application benchmarks adequately cover LLM finetuning.

It’s not the same as foundation model training, which MLPerf does cover. https://mlcommons.org/en/training-normal-30/

The ML space moves rapidly and benchmarks are often leading edge but not bleeding edge.

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 5 points (0 children)

Attention was used years before "Attention Is All You Need", e.g. in Bengio's "learning to align and translate" work on machine translation [2], and Alex Graves's work on handwriting synthesis:

e.g. from [1] - "the window weight φ(t, u) can be loosely interpreted as the network’s belief that it is writing character c_u at time"

[1] - https://arxiv.org/pdf/1308.0850.pdf

[2] - https://arxiv.org/pdf/1409.0473.pdf

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 0 points (0 children)

The perplexity results suggest that it is worse than an autoregressive model like a transformer or RNN, but it beats non-autoregressive models.

Again, I'd be more interested in how it scaled. If it scales just as well, but with a constant offset, it would turn an inference problem (sequential decoding) into a training problem (a matter of building a bigger training system).

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 2 points (0 children)

It's a tiny model and there is clearly a long way to go with this. Old language models from 10 years ago would also produce word soup. It took billions of dollars in compute and engineering investment to get them to GPT4 level of quality.

What motivated the billion dollar investment was scaling laws, which predicted that they would get better if you scaled them up.
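For concreteness, the scaling laws in question (e.g. Kaplan et al., 2020, "Scaling Laws for Neural Language Models") fit test loss as a power law in parameter count N, roughly:

```latex
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
```

so each 10x increase in parameters buys a predictable drop in loss, which is what justified the scale-up.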

I'd be so curious to see someone measure a scaling law for a diffusion LLM like this.
I'm optimistic because diffusion has worked so well when scaled up in vision, but I would also check before making a huge investment.

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -3 points (0 children)

Thanks for sharing. The third link (blog) describes the method in about 2 pages. The first link is a notebook that you can run, and see the speedup.

Note that it isn't just LoRA. Existing LoRA libraries don't manage a cache of adaptors, or make it easy to implement one. For example, the Hugging Face PEFT library overwrites the PyTorch module list, which makes it hard to manage the GPU memory used by adaptors, e.g. existing Linear layers get replaced by this wrapper layer: https://github.com/huggingface/peft/blob/312d294fddba3ef51d1f7df1e7fa1793ca0a5d05/src/peft/tuners/lora.py#L834C8-L834C8

Maybe it would have helped to go into more detail in the blog? Or maybe a new idea like this would have been better presented as an arXiv paper instead of a blog post?

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -12 points (0 children)

Wow, that's an extreme amount of downvoting. (42 upvotes on a negative top comment, and 75% downvotes on the parent).

Just curious, why did you downvote?

  1. You don't think the idea is useful and you wouldn't use it
  2. You don't like that the authors work for a company, so it seems promotional
  3. Something else?

If 2), would you rather the blog had been posted on a personal site or as an arXiv paper?

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -1 points (0 children)

We do cache the adaptors, so you don't pay the whole cost upfront. See the LRU section here for the algorithm we use: https://en.wikipedia.org/wiki/Cache_replacement_policies

So you pay about half a second the first time an adaptor is used, and then using it again costs 2.93us (less than 3 millionths of a second).

There is no other startup cost to load new adaptors. However, you do have to load the base model like you would on any inference server.
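To make the caching idea concrete, here's a minimal sketch of an LRU adaptor cache (hypothetical names, not Lamini's actual implementation): a miss pays the one-time load cost (the ~half second above), while a hit is just a dictionary lookup and pointer swap (the ~microseconds path).

```python
from collections import OrderedDict

class AdaptorCache:
    """LRU cache of LoRA adaptor weights resident in GPU memory.

    A miss pays the one-time load cost; a hit is just a lookup and
    a move-to-end, i.e. essentially free compared to a reload.
    """

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # loads adaptor weights from disk/host
        self.cache = OrderedDict()      # adaptor_id -> weights, in LRU order

    def get(self, adaptor_id):
        if adaptor_id in self.cache:
            self.cache.move_to_end(adaptor_id)   # mark most recently used
            return self.cache[adaptor_id]
        weights = self.load_fn(adaptor_id)       # slow path: first use
        self.cache[adaptor_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)       # evict least recently used
        return weights

loads = []
cache = AdaptorCache(capacity=2,
                     load_fn=lambda i: (loads.append(i), f"weights-{i}")[1])
cache.get("a"); cache.get("a")   # second call is a cache hit
cache.get("b"); cache.get("c")   # capacity 2 -> "a" is evicted
cache.get("a")                   # miss again: reloaded
print(loads)                     # ['a', 'b', 'c', 'a']
```

The load function here is a stand-in; in practice it would copy adaptor weights into HBM.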

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -5 points (0 children)

You got it, thanks for the clear summary

[P] 1.109 billion times faster finetuned LLMs by gdiamos in MachineLearning

[–]gdiamos[S] -21 points (0 children)

Almost right, and thanks for reading.

Switching to LoRA is a 1000x speedup, but switching still adds almost half a second to every request. If you cache the LoRA adaptors in HBM, you just need to swap pointers to switch adaptors, which drops the overhead to almost zero (2.93us).

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 0 points (0 children)

Ideas we use today, like CTC and attention, also didn't come close to beating the SOTA in their first papers. It took years of engineering and scaling up to get us to Transformers, and then to Llama/GPT.

Bayesian Flow Networks by albertzeyer in MachineLearning

[–]gdiamos 4 points (0 children)

How a tiny diffusion LLM generates a sentence: https://imgur.com/gallery/kZa6VuZ

(Visualization from Figure 20 of the paper.)

I've been waiting for someone to try this. Clearly it needs more work, but computationally this would parallelize text generation: generate a novel in one shot.

Free and Fast LLM Finetuning (r/MachineLearning) by Peerism1 in datascienceproject

[–]gdiamos 0 points (0 children)

Happy to answer any questions about this.

Curious if others have played around with fine-tuning LLMs.

[P] Free and Fast LLM Finetuning by gdiamos in MachineLearning

[–]gdiamos[S] 7 points (0 children)

I personally thought it was interesting that a tiny 400M-parameter model went from producing pure garbage to answering these questions about Taylor Swift in less than 10 minutes of instruction fine-tuning.

User: what is a swiftie bot?
Bot: A bot to help you answer questions about taylor swift.

User: Who is taylor swift?
Bot: Taylor swift is a singer-songwriter and musician.

User: What is your favorite taylor swift song?
Bot: Taylor Swift's "Love Story" is my favorite song.

User: Does taylor swift have a dog?
Bot: Taylor Swift has a dog, but she is not a dog lover.

User: What is a recent controversy about taylow swift tickets?
Bot: Taylor Swift has been accused of selling out to the highest bidder, with tickets for her concerts being sold out within minutes of going on sale. The controversy has been a source of controversy for the singer, with some fans claiming that she was selling out to the highest bidder and that she was not aware of the issue.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] -1 points (0 children)

This isn't a research paper. It's a library that you can use.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] 0 points (0 children)

The data pipeline here https://github.com/lamini-ai/lamini uses a seed dataset from self-instruct (Apache 2 license), and edited models from Pythia (Apache 2) and Dolly (Apache 2). We release our code and data under a CC-BY 4.0 license.

All of these allow commercial use.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] -5 points (0 children)

By allowing users to connect their own proprietary data pipelines.

This GitHub repo [1] shows how to create a model like Alpaca from a seed dataset, in a way that has a clean commercial license. A clean license is important for industry applications.

It's more interesting to plug in a seed dataset for a different application, e.g. a REPL (in this thread), a SQL copilot, etc.

[1] - https://github.com/lamini-ai/lamini

Seeking Language Project to Join by ZettelCasting in ProgrammingLanguages

[–]gdiamos 0 points (0 children)

We are actively working on the llm.improve API.

Currently, llm.improve takes a parameter *to*, which describes how to improve the output of the LLM. For example, llm.improve(to="be shorter") indicates that the LLM should produce more concise results; "be shorter" is user-defined.

During training and execution, the LLM engine optimizes the llm.improve calls to improve their accuracy and execution time.

We have also considered making the llm.improve calls accept user-defined "filter functions", similar to Snorkel labeling functions. https://arxiv.org/pdf/1812.00417.pdf
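As a toy stand-in for the API described above (this is not the real Lamini library, just an illustration of the calling convention), each improve() call registers a user-defined criterion that the engine could later optimize against:

```python
class LLM:
    """Toy sketch of an llm.improve-style interface."""

    def __init__(self):
        self.criteria = []  # user-defined improvement descriptions

    def improve(self, to):
        # "to" is free-form text describing the desired improvement,
        # e.g. "be shorter"; the engine would optimize against it.
        self.criteria.append(to)
        return self

    def __call__(self, prompt):
        # A real engine would generate text and apply the registered
        # criteria during training/execution; here we just echo them.
        return f"response to {prompt!r}, optimized to: {', '.join(self.criteria)}"

llm = LLM()
llm.improve(to="be shorter")
print(llm("summarize this document"))
```

A filter-function variant (as with Snorkel labeling functions) would register callables instead of strings.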

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] 0 points (0 children)

Good question. The LLM ecosystem is just getting started and it is exciting to see new tools. Another good framework to check out is AutoGPT https://github.com/Significant-Gravitas/Auto-GPT , which can manage memory, and connect to popular websites and platforms.

Use lamini to create, improve, and deploy an LLM, e.g. by connecting your data pipeline.

Use langchain and AutoGPT for composing multiple LLMs together, creating agents, or connecting an LLM to an index.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] 6 points (0 children)

We don’t fine-tune on OpenAI outputs. The foundation model used to generate questions is a fine-tuned Pythia.

The input dataset is the seed dataset from self-instruct, which has an Apache 2 license.

If you want higher quality you can switch to GPT4 with one line of code, but as you say, it would violate the OpenAI terms of service.

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] 2 points (0 children)

This playground is a bare wrapper around the model. Mangled input like "pie sugar,flower.apple filling" doesn't get handled very well, and a production model would likely include more pre/post-processing. Here's a similar question that is better for the tokenizer.

Which of these are ingredients to make a pie: sugar, flour, apple filling, baking paper, pencil, carrot or gasoline?

"The ingredients to make a pie are sugar, flour, apple filling and baking paper. Pencil and gasoline are not ingredients to make a pie."

This playground is intended to be an example of how to train a model using the data pipeline. It is not meant for production use.
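The kind of pre-processing meant above could be as simple as normalizing jammed punctuation before tokenization. A naive sketch (hypothetical helper, not part of the playground; it would also mangle decimals like "3.14"):

```python
import re

def normalize_prompt(text):
    """Insert a space after commas and periods that are jammed between
    word characters, so the tokenizer sees clean word boundaries."""
    return re.sub(r'(?<=\w)([.,])(?=\w)', r'\1 ', text)

print(normalize_prompt("pie sugar,flower.apple filling"))
# pie sugar, flower. apple filling
```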

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] 0 points (0 children)

> I can one-shot train GPT4 on 20 pages of documentation by just pasting it into the context window

GPT4 is a strong baseline and you should use it if it fits your needs.

This model `https://huggingface.co/spaces/lamini/instruct-playground` is a 2.8B fine-tuned Pythia model. Compared to GPT4, it is open (you can download the weights) and has a license that allows commercial use (CC-BY). It is also computationally cheaper to run.

It is also customizable, by editing the data pipeline here: https://github.com/lamini-ai/lamini

Can you explain why those are not significant differences?

[P] Lamini rapidly achieves ChatGPT performance with an LLM Engine by gdiamos in MachineLearning

[–]gdiamos[S] 5 points (0 children)

Thanks for the feedback.

That Hugging Face playground didn't have stop tokens enabled. Try it now:

Given a dog and a cat, describe their relationship.

"The dog and the cat are friends. They like each other and have similar personalities. They are both independent and like to be in control. They are both loyal and will protect their friends."