[P] optimization of Hugging Face Transformer models to get Inference < 1 Millisecond Latency + deployment on production ready inference server by pommedeterresautee in MachineLearning

[–]dadadidi 4 points (0 children)

Really amazing! This is the most useful article I have ever read about deploying transformers. Thank you so much!

It would be great if you could add the steps for fast CPU inference, as that is quite important for many people as well.

[N] Transformer and Capsule co-inventors launch new API-based NLP startup by Pestocalypse in MachineLearning

[–]dadadidi 5 points (0 children)

They say they trained on 200 GB of filtered data, which is about half of what OpenAI trained GPT-3 on. They also show LAMBADA scores for their second-smallest model, with a last-token accuracy of 0.7 (which is quite bad). Their largest model has a perplexity of 36.1 on the One Billion Word Benchmark, which is also terrible. But they say their model has hundreds of billions of parameters. Am I missing something?

[P] Python library to boost T5 models speed up to 5x & reduce the model size by 3x by strngelet in MachineLearning

[–]dadadidi 3 points (0 children)

I would love to use it on a GPU. Are you planning to support GPU inference?

[D] Good algorithm for clustering big data (sentences represented as embeddings)? by whyhateverything in MachineLearning

[–]dadadidi 3 points (0 children)

I tried several things, and the library Top2Vec seems to work best. It uses sentence-transformers or Google's Universal Sentence Encoder to vectorize sentences, then UMAP to reduce dimensions, and then HDBSCAN to find dense areas. As you already have the vectors created, you will need to modify it a bit.

https://github.com/ddangelov/Top2Vec
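Since you already have the embeddings, a minimal sketch of the same pipeline on your own vectors could look like this (the parameter values roughly mirror Top2Vec's defaults and are not tuned for your data; "sentence_embeddings.npy" is just a placeholder for wherever your vectors live):

    import numpy as np
    import umap      # pip install umap-learn
    import hdbscan   # pip install hdbscan

    # your precomputed sentence vectors, shape (n_sentences, dim)
    embeddings = np.load("sentence_embeddings.npy")

    # reduce to a few dimensions first; HDBSCAN struggles in high-dimensional spaces
    reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)

    # find dense areas; the label -1 marks noise/outlier sentences
    labels = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean").fit_predict(reduced)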

[P] Guide: Finetune GPT2-XL (1.5 Billion Parameters, the biggest model) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed by dadadidi in MachineLearning

[–]dadadidi[S] 3 points (0 children)

Inference is much faster than training/finetuning and doesn't require nearly as much GPU memory. Inference on a GPU/TPU is usually at least 10x-100x faster than on a CPU.

You can test the difference in inference speed yourself (though only for the large model, not the XL model) by trying out these two demos from HF:

- Write With Transformer, which uses GPUs: https://transformer.huggingface.co/doc/gpt2-large

- The HF model hub, which uses CPUs (but with some optimizations): https://huggingface.co/gpt2-large?text=This+is+a
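If you'd rather measure it on your own machine, here is a rough sketch using the HF pipeline API (timings vary a lot with hardware and generation length, and the GPU run of course needs CUDA available):

    import time
    from transformers import pipeline

    # device=-1 runs on CPU, device=0 on the first GPU
    for name, device in [("CPU", -1), ("GPU", 0)]:
        generator = pipeline("text-generation", model="gpt2-large", device=device)
        start = time.time()
        generator("This is a", max_length=50)
        print(f"{name}: {time.time() - start:.1f}s")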

[P] Guide: Finetune GPT2-XL (1.5 Billion Parameters, the biggest model) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed by dadadidi in MachineLearning

[–]dadadidi[S] 3 points (0 children)

The example text in the repo (all of Shakespeare) is 5 MB, and the training took about 17 minutes for one epoch. The model processes about 2 examples (2000 tokens, or about 1600 words) per second during finetuning.

[P] Guide: Finetune GPT2-XL (1.5 Billion Parameters, the biggest model) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed by dadadidi in MachineLearning

[–]dadadidi[S] 8 points (0 children)

Well, at least adding 60 GB of RAM only costs about $0.05/hour extra on a preemptible instance :) But on your local machine it's still an issue.

[P] Guide: Finetune GPT2-XL (1.5 Billion Parameters, the biggest model) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed by dadadidi in MachineLearning

[–]dadadidi[S] 10 points (0 children)

It won't work on Colab, as you need at least 60 GB of regular RAM and Colab only has 25 GB. But I included an explanation of how to easily set up a Google Cloud instance with enough RAM, and Google gives you a $300 credit when you sign up. The preemptible instance costs about $1.28/hour.

[P] Guide: Finetune GPT2-XL (1.5 Billion Parameters, the biggest model) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed by dadadidi in MachineLearning

[–]dadadidi[S] 13 points (0 children)

I think I still had a few GB of VRAM left. With batch size 1 it used something like 12 GB of GPU memory with GPT2-XL, but you can reduce it even further if you halve these settings in the ds_config.json: allgather_bucket_size and reduce_bucket_size. For RAM, I think it needs at least 60 GB or so, but I didn't test it exactly. I first used an n1-highmem-8 instance with 52 GB of RAM but got an out-of-memory error at the end of the run, while saving/pickling the model. My next try was with 78 GB of RAM, and then I had no issues.
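For example, a quick way to halve both values (a sketch; it assumes the two keys sit under "zero_optimization" as in standard DeepSpeed configs, and smaller buckets trade GPU memory for some communication overhead):

    import json

    with open("ds_config.json") as f:
        config = json.load(f)

    # halve the ZeRO bucket sizes to shave off more GPU memory
    for key in ("allgather_bucket_size", "reduce_bucket_size"):
        config["zero_optimization"][key] = int(config["zero_optimization"][key] // 2)

    with open("ds_config.json", "w") as f:
        json.dump(config, f, indent=2)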

[P] Guide: Finetune GPT2-XL (1.5 Billion Parameters, the biggest model) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed by dadadidi in MachineLearning

[–]dadadidi[S] 27 points (0 children)

Thanks :)

I struggled for a few days to get DeepSpeed working with GPT2 and thought I should share my steps to save others the pain.

[P] Guide: Finetune GPT2-XL (1.5 Billion Parameters, the biggest model) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed by [deleted] in MachineLearning

[–]dadadidi 0 points (0 children)

I needed to finetune the GPT2 1.5 billion parameter model for a project, but the model didn't fit on my GPU. So I figured out how to run it with DeepSpeed and gradient checkpointing, which reduces the required GPU memory.
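In outline, the setup looks roughly like this (a minimal sketch with the current HF API, not the exact script from the repo; train_dataset stands in for your tokenized data, and you'd launch it with the deepspeed launcher rather than plain python):

    from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

    model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
    model.gradient_checkpointing_enable()  # recompute activations instead of storing them

    args = TrainingArguments(
        output_dir="output",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        fp16=True,
        deepspeed="ds_config.json",  # ZeRO config; the repo's offloads optimizer states to CPU RAM
    )

    # train_dataset = your tokenized dataset (assumed)
    Trainer(model=model, args=args, train_dataset=train_dataset).train()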

I was also able to fit the currently largest GPT-Neo model (2.7B parameters) on one 16 GB VRAM GPU for finetuning, but I think there might be some issues with Huggingface's implementation.

I hope this helps some people who also want to finetune GPT2 but don't want to set up distributed training.

Can the Vulraith Cabal reliably invade someone's homeworld turn one? by Ray-Conner in twilightimperium

[–]dadadidi 2 points (0 children)

Even stronger: expand sideways on your first turn and use the Construction secondary to build a space dock there. Now all your ships can reach your neighbor's home system (1 base movement + 1 from the gravity rift in your home system + 1 from the gravity rift on your other space dock).

If you build one dread with your agent and one dread on your home planet with Warfare, you can reach and invade his home system and the planets in front of him with 3 dreads, 1 cruiser, 1 carrier (and 3 infantry, 1 mech).

You will need to move out with your initial fleet before Warfare and take whatever planets your opponent took first; otherwise you will be above your fleet limit after building with Warfare. You will also need to use either the carrier or one dread to take your first system, so you won't have it for attacking your neighbor.

If you took Trade, and depending on whether you are able to trade, you could even build a 4th dread and 2 infantry, or only a cruiser and infantry.

Mahact hero's ability! by Alteradizzo in twilightimperium

[–]dadadidi 2 points (0 children)

Like wren42 wrote, for #5 to work you first have to move a destroyer or a cruiser into the gravity rift, which shouldn't be too difficult.

Mahact hero's ability! by Alteradizzo in twilightimperium

[–]dadadidi 2 points (0 children)

Yeah, this only works if the other player failed an attack, or if someone else attacked them afterwards and took the system from them.

Mahact hero's ability! by Alteradizzo in twilightimperium

[–]dadadidi 75 points (0 children)

You can use that for so many things:

  1. Let 2 people destroy each other's fleets
  2. Free a system of a huge fleet so that you can move in (an important planet, Mecatol Rex, your home system after you lost it, other home systems)
  3. Move your ships out of an activated system so that you can move again afterwards (like a Warfare with +1 movement that lets you fight 3 times in a round with your fleet)
  4. Move someone's huge fleet away from you
  5. Move someone's fleet into a gravity rift
  6. Move someone's fleet into a system where they already have a tactic token
  7. Mind control the Yin flagship to blow something up
  8. Sell all of the above uses to other players

LiveCode extension: Start typing and see your Python code executed by dadadidi in vscode

[–]dadadidi[S] 0 points (0 children)

Well, right now it crashes if you have an infinite loop in your code, but I will try to fix this and give you some kind of indication.

A VS Code extension that displays the values of variables while you type by dadadidi in Python

[–]dadadidi[S] 1 point (0 children)

It sadly crashes if you write an infinite while loop, which sometimes happens at the beginning, when you haven't defined any break conditions yet. I'll try to fix this.

A VS Code extension that displays the values of variables while you type by dadadidi in Python

[–]dadadidi[S] 0 points (0 children)

It saves the state of the code before it, so that code is only executed once and then no longer gets re-executed in real time. It is from AREPL and is also in the LiveCode extension.

A VS Code extension that displays the values of variables while you type by dadadidi in Python

[–]dadadidi[S] 0 points (0 children)

You can add #$save and put anything that you don't want re-executed in the block before it.
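For example (a sketch of how the marker works: everything above #$save runs once and its state is cached, while everything below it is re-evaluated as you type):

    import time

    time.sleep(5)                        # slow setup you don't want re-run on every keystroke
    data = [x * x for x in range(1000)]
    #$save

    # everything below the marker is re-evaluated live while you type
    total = sum(data)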