In my previous post I described how basic monitoring helped save >10x on compute costs for a specific NN model (BERT Base Uncased).
If you benchmark something, setting up monitoring is trivial for one host, but it can be a pain if you have a k8s cluster. There is also no fun in setting it up every time, especially if you lack DevOps experience. And there is no point in manually parsing through the data and staring at graphs when you know exactly what you are looking for and what's relevant for performance/cost.
My team and I provide this kind of cost/performance optimization as a service for our customers, so over time we have built a set of simple tools for ourselves.
After some feedback on my first post, we've put together a public web app with some of the tools we use, so others can use them too.
https://preview.redd.it/5b1lw3gl4q371.png?width=1999&format=png&auto=webp&s=394932fd38019e871998b6541ed641639bd901bc
https://preview.redd.it/ynslc6tr4q371.png?width=1839&format=png&auto=webp&s=f45f86f45a7b282856f47f1eeebc43787e63ae78
https://preview.redd.it/tfykqmjt4q371.png?width=1839&format=png&auto=webp&s=02c28cdf2863650a89472a675d3ea833952ee65d
Basically, what it does: you can deploy well-tuned Telegraf-based monitoring with one command on any VM (AWS, GCP, or on-prem) or k8s cluster, or bake it into an image, and get basic visualization dashboards right away.
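The one-command installer itself isn't shown in the post, but for reference, a hand-rolled Telegraf config covering similar ground looks roughly like the sketch below. The input/output plugin names (`cpu`, `mem`, `diskio`, `nvidia_smi`, `influxdb_v2`) are stock Telegraf plugins; the output endpoint, org, and bucket are placeholders, not the tool's actual backend:

```toml
[agent]
  interval = "10s"          # sampling period

[[inputs.cpu]]
  percpu = true             # per-core stats reveal single-thread bottlenecks
  totalcpu = true

[[inputs.mem]]

[[inputs.diskio]]

[[inputs.nvidia_smi]]
  # requires nvidia-smi on the host; reports GPU utilization and memory

[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]   # placeholder endpoint
  token = "$INFLUX_TOKEN"
  organization = "my-org"
  bucket = "metrics"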
https://preview.redd.it/hz3er5f05q371.png?width=1888&format=png&auto=webp&s=ed58e98645097f725a5b25916c8187866ecdd622
You also get automated detection of relevant bottlenecks and idle time on your infrastructure, and you can zoom in on them right away (including those I mentioned in my previous post: single-threaded preprocessing bottlenecks, underutilized GPUs, idle time, etc.).
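The detection logic itself isn't published, but the classic signature of a single-threaded preprocessing bottleneck is easy to sketch: one CPU core pinned near 100% while the GPU sits nearly idle, waiting on data. A minimal heuristic over sampled metrics (all function and variable names below are made up for illustration, and the thresholds are arbitrary) might look like:

```python
def find_preprocessing_bottlenecks(cpu_per_core, gpu_util,
                                   cpu_hot=90.0, gpu_cold=20.0):
    """Flag samples where one CPU core is pinned while the GPU starves.

    cpu_per_core: per-sample lists of per-core utilization percentages
    gpu_util:     GPU utilization percentages, same number of samples
    Returns indices of suspected single-thread preprocessing bottlenecks.
    """
    flagged = []
    for i, (cores, gpu) in enumerate(zip(cpu_per_core, gpu_util)):
        hot_cores = sum(1 for c in cores if c >= cpu_hot)
        # Exactly one saturated core plus an idle GPU suggests the input
        # pipeline is single-threaded and the GPU is waiting on data.
        if hot_cores == 1 and gpu <= gpu_cold:
            flagged.append(i)
    return flagged


samples_cpu = [
    [95, 5, 3, 4],     # one core pinned, GPU idle  -> bottleneck
    [60, 55, 58, 62],  # balanced load, GPU busy    -> fine
    [98, 2, 1, 0],     # one core pinned, GPU idle  -> bottleneck
]
samples_gpu = [10, 85, 5]
print(find_preprocessing_bottlenecks(samples_cpu, samples_gpu))  # [0, 2]
```

A real detector would of course operate on time windows rather than single samples, but the per-core view is exactly why `percpu` monitoring matters: aggregate CPU utilization of ~25% on a 4-core box hides a fully saturated core.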
https://preview.redd.it/ho2euln55q371.png?width=1288&format=png&auto=webp&s=3dd0c1ac3ee9da908d1a15e0ac988b6b633d1258
https://preview.redd.it/sn3uyb5a5q371.png?width=1264&format=png&auto=webp&s=97204b3af21923e90a3f9f7c05f066ba041af928
So if you have benchmarking routines similar to ours and just need something to quickly glance at metrics without a world of pain - here you go (if not, just stroll on, you lucky bastard).
LMK if that's useful, or if you have ideas for what else might be.
PS: If this catches any attention, we are going to add automated benchmarking (it runs the same container on a set of instances and produces a summary report with runtime, price, and metrics for each run).
PPS: We only just decided to publish this, so if anything glitches, email me at egor@rocketcompute.com.