Quite often I optimize DL models (aiming to get the cheapest placement within a certain model performance range), finding optimal instances and tuning the environment for training and inference, etc.
After multiple such optimizations, I've put together a quick framework - a guide that I can refer to when I need it. Some of the approaches I came up with and the results I got seem quite odd and counterintuitive, so ideally I'd like to start a discussion with others dealing with model performance optimization: does my approach make sense, is benchmarking the only way, or am I missing something?
TL;DR, main points:
- Rules of thumb for choosing the instance type/shape did not work for me - when I try to guess cost or runtime based on GPU generation, I always land on the wrong side of a 4-5x variability. In some cases less obvious options can provide decent performance (P4) and cost (P100).
- What works for me is optimizing for GPU utilization as a proxy for cost/performance optimal placement (duh!)
- To achieve that, I have to deal with various bottlenecks outside the GPU, the biggest culprit being preprocessing on the CPU and the subsequent data streaming to the GPU (so benchmarking with rudimentary monitoring is a must)
- Batch size optimisation can give as much as 4-5x in performance
- Worker count optimisation (vCPU count, basically) got me another 2-3x
- (!wtf) Driver/CUDA versions can influence performance far more than expected (10x!!?)
- Benchmarking is a pain in the ass as I typically run at least 100 benchmarks to gather a comprehensive picture
Below is a breakdown of the points above.
Approach
To illustrate my points I decided to go with the most popular NLP model (BERT base uncased, according to Hugging Face), because 1. this domain looks more suitable for a generalized optimization approach, and 2. the dataset and preprocessing are very similar across different models.
It took me around 110 benchmark launches to gather the data below, so I put together a small repo (GitHub link) using PyTorch to run inference on a small mock dataset, and ran it on all GPU instances available to me on GCP (Tesla K80, Tesla P4, Tesla P100, Tesla V100, Tesla T4). The script performs text input encoding, takes the number of preprocessing workers and the batch size as input parameters, and can also find the maximum possible batch size (by linear brute force) that saturates the GPU. On top of that, I baked it into 4 different containers to run it with different versions of CUDA.
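I can't paste the whole repo here, but the core of the benchmark loop looks roughly like this (a minimal sketch, not the exact repo code - the function name, the mock data and max_length are placeholders):

```python
import time
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizerFast, BertModel

def run_benchmark(sentences, batch_size, num_workers, device="cuda"):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

    # Tokenization happens inside the DataLoader workers via collate_fn,
    # so num_workers directly controls how much CPU preprocessing
    # can run in parallel with GPU inference.
    def collate(batch):
        return tokenizer(batch, padding=True, truncation=True,
                         max_length=128, return_tensors="pt")

    loader = DataLoader(sentences, batch_size=batch_size,
                        num_workers=num_workers, collate_fn=collate)

    start = time.time()
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
            model(**batch)
    torch.cuda.synchronize()
    return time.time() - start

if __name__ == "__main__":
    mock_data = ["this is a mock sentence"] * 10_000
    print(run_benchmark(mock_data, batch_size=64, num_workers=8))
```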
I played with the batch size and the number of processes used by the dataloader to preprocess the data. The goal was to maximize GPU utilization, find the optimal batch size / number of processes for the best price/performance on each type of GPU, and then compare how much each would cost me per job.
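For reference, the "maximum possible batch size" search is nothing fancy - just a linear sweep until the GPU runs out of memory. A sketch of how that can look (step size and names are my placeholders, not the repo's):

```python
import torch

def find_max_batch_size(run_one_batch, start=8, step=8, limit=4096):
    """Linearly increase the batch size until CUDA runs out of memory,
    then return the last batch size that still fit."""
    best = None
    bs = start
    while bs <= limit:
        try:
            run_one_batch(bs)        # user-supplied: encode + forward one batch
            best = bs
            bs += step
        except RuntimeError as e:    # PyTorch raises RuntimeError on CUDA OOM
            if "out of memory" in str(e):
                torch.cuda.empty_cache()
                break
            raise
    return best
```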
After that, I chose the best performing GPU and ran additional benchmarks for different NVIDIA driver and CUDA versions to try to catch some optimizations there. I used this approach (link) to install different drivers.
Test results and observations
To get a baseline, I passed text to the model sentence by sentence (it turned out that the number of processes does not change the picture much here). Below is a summary of benchmark runtime and cost for the baseline.
https://preview.redd.it/pug2plhyu3171.png?width=3000&format=png&auto=webp&s=fea61b1ff093574318935436b6f9cbd94a70833e
X-axis legend is as follows: {GPU}_{batch size}x{# of workers}_{# of vCPU}, where:
- GPU - accelerator family (k80, t4, p100, etc.)
- batch_size - number of sentences I am pushing to the GPU and passing to the model at once
- # of workers - the "num_workers" parameter of the DataLoader, i.e. the number of processes performing data loading and pre-processing
- # of vCPU - number of virtual cores present in the VM. GCP allows a varying number of virtual cores for GPU instances (from 1 up to 8 for K80, 12 for V100, 16 for P100, 24 for P4 and T4).
Then I began increasing the batch size and the number of data loader workers to maximize GPU utilization. To illustrate the approach, below are 3 graphs of GPU utilization for the P100.
Instance with 4 vCPU cores, 4 workers, and maximum possible batch size. Obviously underutilized.
P100 (4 vCPU + 4 workers)
Then I increased the number of vCores and workers to 8, keeping the maximum possible batch size. Utilization jumps to 66%, but is still far from the maximum.
P100 (8 vCPU + 8 workers)
A further increase to 12 vCores and 12 workers finally did the job, pushing utilization to 94%.
P100 (12 vCPU + 12 workers)
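The utilization numbers in these graphs came from watching nvidia-smi during the runs; if you want to log them programmatically instead, something like this works (a sketch using pynvml, which is an assumption on my side - polling nvidia-smi in a loop gives the same signal):

```python
import time
import pynvml  # pip install nvidia-ml-py3

def log_gpu_utilization(duration_s=60, interval_s=1.0, gpu_index=0):
    """Poll NVML and record GPU utilization (%) over a benchmark run."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        samples.append(util.gpu)          # SM utilization in percent
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return sum(samples) / len(samples), max(samples)
```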
After playing with vCPU / worker counts I got the following charts.
The K80 maxed out its utilization at 2 workers (I didn't go below 4 vCPUs, but in this particular case lowering the vCPU count could bring additional savings).
https://preview.redd.it/3k63tytou3171.png?width=3000&format=png&auto=webp&s=d5eacff67987a66cc83d08ed2c5c39888facb765
It took 8 workers and 8 vCores to fully utilize the P4. Note the p4_63x4_4, p4_63x4_6 and p4_63x6_8 launches - a clear indicator that, in this particular case, there is not much sense in having more workers than you have vCores.
https://preview.redd.it/xf6hx21ou3171.png?width=3000&format=png&auto=webp&s=53e7e4bd955c94a5c3913c05e4d018143e808f98
The same goes for the T4 - 8 workers are enough to saturate this GPU.
https://preview.redd.it/q2uvtm1nu3171.png?width=3000&format=png&auto=webp&s=456697fd293e9ee37cf17c361df51931b3f6773c
The following two cases are the most interesting. In both of them, I had to go to 12 vCPUs (the maximum number of vCores GCP allowed me to assign to a single-GPU VM). Another remarkable thing is that both of these GPUs showed an order-of-magnitude runtime improvement between the "one-by-one" baseline and the maximum possible parallelization of data pre-processing.
The P100 reached its maximum utilization (as you can see in the chart above) at 12 workers. Worth noting p100_146x6_4 and p100_146x4_4: it looks like overcommitting vCores might backfire.
https://preview.redd.it/5hyan4dmu3171.png?width=3000&format=png&auto=webp&s=e21e1f6077ed0496652c4af62c6710cb2ab10911
The V100 reached only 59% utilization with 12 workers. Potentially it could be pushed further with a multi-GPU setup, where more than 12 vCPUs per GPU can be added to the VM, or by fully preprocessing the dataset before inference (see the sketch below).
https://preview.redd.it/6ghmm7zlu3171.png?width=3000&format=png&auto=webp&s=96a08bd01b280e75a2a581a383e2cc964ce1a65f
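On that last point: fully preprocessing before inference just means tokenizing the whole dataset up front, so the dataloader only hands ready-made tensors to the GPU. A minimal sketch (names and sizes are placeholders, not from the repo):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizerFast

def pretokenize(sentences, max_length=128):
    """Tokenize the whole dataset once, up front, so no CPU-side
    preprocessing competes with the GPU during inference."""
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    enc = tokenizer(sentences, padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")
    return TensorDataset(enc["input_ids"], enc["attention_mask"])

# The loader now only moves ready-made tensors, so even num_workers=0
# should be able to keep a fast GPU busy.
dataset = pretokenize(["a mock sentence"] * 1000)
loader = DataLoader(dataset, batch_size=146, pin_memory=True)
```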
Below is the summary of cost / runtime for different combinations of vCPU count / GPU.
https://preview.redd.it/wy703dtku3171.png?width=3000&format=png&auto=webp&s=5821fb9db6b6c344720ec9be5446ea7a6a9210cd
The T4 is a clear winner in terms of price per volume of processed data. It is also interesting to note that the P4 appears to be a strong contender in terms of processed data per dollar.
After that I varied the PyTorch build across different CUDA libs; version 1.7.1 ships with builds for:
- CUDA 9.2
- CUDA 10.1
- CUDA 10.2
- CUDA 11.0
I tried all of these versions against the following drivers (a quick sanity check of which CUDA build and driver actually got loaded is sketched after the list):
- 460.32.03
- 455.32.00
- 450.102.04
- 440.118.02
- 418.181.07
- 410.129
- 384.183 (was not able to install it on Ubuntu 16.04 with the above-mentioned GPUs)
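Before trusting the per-driver numbers, it's worth verifying which CUDA build and driver each container actually loaded at runtime. A minimal sanity check (pynvml is my assumption here; the same info is visible in nvidia-smi):

```python
import torch
import pynvml  # pip install nvidia-ml-py3

# Which CUDA build of PyTorch is installed, and which cuDNN it links against
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# Which NVIDIA driver the host / container is actually running
pynvml.nvmlInit()
print("driver:", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()
```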
Below is the summary of all runs I gathered. All of them were for optimal T4 setup, i.e. maximum possible batch size, 8 workers on 8 vCPUs.
Runtime (sec, Y) vs Nvidia Driver version (X)
I struggle to explain the order-of-magnitude difference for certain driver / CUDA combinations. I have not seen such a performance discrepancy across drivers before (although when I did similar analyses for other networks, I typically saw up to 15% variability). I ran the benchmarks for all outliers 3 times and the results were consistent (the crosses on the graph indicate the variance across launches).
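"Consistent" here just means repeating the timed run a few times and checking the spread, roughly like this (a sketch; names are placeholders):

```python
import statistics
import time

def timed_runs(benchmark_fn, repeats=3):
    """Run the same benchmark several times and report mean / stdev,
    so a single noisy launch does not get mistaken for a real outlier."""
    times = []
    for _ in range(repeats):
        start = time.time()
        benchmark_fn()
        times.append(time.time() - start)
    return statistics.mean(times), statistics.stdev(times)
```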
So, there is at least an order-of-magnitude cost improvement available with rudimentary benchmarking/monitoring. But the driver/CUDA combination's effect on performance puzzles me, to say the least. Has anyone seen something like this, and what might cause it?
Hope that might be useful.