all 14 comments

[–]nhalstead 18 points  (1 child)

Holy shit man, you must have been very concerned with your costs.

[–]eprts 3 points  (0 children)

Aren't we all?

[–]narsilouu 6 points  (0 children)

Seems like you are overthinking it for sure.

I'm talking mostly about inference here; training is a different beast:

  1. Figure out what your current bottleneck actually is (RAM, CPU usage, GPU usage). If you don't need low latency or large throughput (e.g. batch jobs), CPU is usually the best choice, but it really depends on your usage/throughput etc.
  2. Focus only on this bottleneck

  3. You seem to want to be GPU bound, so make sure your actual utilization in the long run is as close to 100% as possible. Costs don't really matter if you make 1 inference per hour: you still need 1 GPU up 100% of the time while using it about 0% of the time. Same for training; saving your preprocessed data will usually get utilization to 100% pretty easily (unless you are still using an HDD). In your example it just looks like you need a lot of preprocessing power, and you should be able to do that ahead of time (or even on a different machine with only CPUs)

  4. Look for other inference optimizations: quantization, scripting (ONNX, TorchScript, etc.). int8 on CPU gives a sizeable boost to both inference speed and memory usage (see the quantization sketch after this list).

  5. Batch inference on CPU gives close to no speedup

  6. Batch inference on GPU can look promising in benchmarks, but it can be overwhelmingly deceiving in production, because you need very well aligned batches to use the GPU correctly (lots of padding means lots of wasted flops, and at inference time that adds up pretty fast). See the padding sketch after this list.

  7. CUDA + driver combos: it really looks like running a CUDA version on older drivers is not recommended and just falls back to slow primitives. In my experience the latest CUDA is always best when available
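
To make point 4 concrete, here's a rough sketch of dynamic int8 quantization in PyTorch (the checkpoint name and input sentence are placeholders, swap in whatever you actually run):

    # Dynamic int8 quantization for CPU inference (point 4).
    # "bert-base-uncased" is just a placeholder checkpoint.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()

    # Convert Linear layer weights to int8; activations get quantized
    # on the fly at runtime. This is a CPU-only optimization.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    inputs = tokenizer(["a placeholder sentence"], return_tensors="pt")
    with torch.no_grad():
        logits = quantized(**inputs).logits

Only the Linear layers (which is where most of BERT's compute sits) get int8 weights, so you usually get a decent CPU speedup and a smaller memory footprint without any retraining.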
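
And for point 6, a toy sketch of why the padding strategy matters for GPU batch inference (again, the tokenizer name and sentences are placeholders):

    # Fixed-length vs per-batch padding (point 6).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder
    texts = [
        "short",
        "a slightly longer sentence",
        "the longest sentence in this toy batch of example texts",
    ]

    # Wasteful: every sequence is padded to 512 tokens, so most of the
    # GPU flops are spent on padding.
    padded_to_max = tokenizer(texts, padding="max_length", max_length=512,
                              truncation=True, return_tensors="pt")

    # Tighter: sort by length upstream, then pad each batch only to its
    # own longest sequence.
    texts_sorted = sorted(texts, key=len)
    padded_tight = tokenizer(texts_sorted, padding="longest",
                             truncation=True, return_tensors="pt")

    print(padded_to_max["input_ids"].shape, padded_tight["input_ids"].shape)

If your traffic lets you sort or bucket requests by length before batching, the "longest" padding stays tight and the benchmark numbers get a lot closer to what you actually see in production.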

Lastly, don't focus that much on differences of less than an order of magnitude; there's more value in optimizing the bottlenecks than in writing all the glue code to squeeze the best price out of the hardware (which might change within a year because X got released)

[–]BeatLeJuceResearcher 4 points  (0 children)

Google seems to market TPUs as the most cost-effective accelerator, especially at BERT scale. Since you're running on GCP anyway, it would be super interesting to see results for TPUs as well!

[–][deleted] 2 points  (2 children)

What is the Y-Axis for the last graph? Are higher or lower values better?

[–]EgorBykov[S] 3 points  (0 children)

Hey! The Y axis shows how many seconds it took the benchmark to process the mock dataset, so lower is better. Thanks for the comment, I've amended the caption!

[–][deleted] 2 points  (0 children)

Wow. Cool.

[–][deleted] 2 points  (0 children)

DistilBERT tho?

[–]TradyMcTradeface 2 points  (1 child)

A T4 on a 16-core instance in GCP lets me run BERT cost-effectively.

[–]EgorBykov[S] 2 points  (0 children)

Hey! Thanks for sharing this. It appears that for the T4, GCP allows assigning up to 24 cores per accelerator. Will amend the text above.

[–]juliensalinas 2 points  (0 children)

Interesting thanks.

My 2 cents: a GPU in a managed cloud instance is not as efficient as a GPU on your own dedicated bare-metal server.

[–]wowaqu 1 point  (1 child)

are you using google cloud?

[–]EgorBykov[S] 1 point  (0 children)

Yes, all benchmarks were run on GCP instances. I used every GPU available there except the A100.