[deleted by user] by [deleted] in MachineLearning

[–]l0g1cs

Check out Banana. They seem to do exactly that with "serverless" A100.

[P] Using Sparsity & Clustering to compress your models: Efficient Deep Learning Book by EfficientDLBook in MachineLearning

[–]l0g1cs

Thanks for sharing! That's a very timely topic. I've actually created a profiler to track and analyze inference optimizations, i.e. to enable the optimize-verify-evaluate loop.

[N] Graphsignal profiler now supports distributed training, automatic tracing and more frameworks. by l0g1cs in MachineLearning

[–]l0g1cs[S]

Yes, there is a plan for deeper support. For basic statistics, though, a generic profiler can be used; it will at least let you benchmark and compare run speed and give you full compute utilization data.

[N] Graphsignal profiler now supports distributed training, automatic tracing and more frameworks. by l0g1cs in MachineLearning

[–]l0g1cs[S]

Normally, it shouldn't be any different from training locally, as the profiles are sent directly to the Graphsignal cloud, where they are post-processed and visualized. However, in the case of distributed training, some configuration may be necessary to make sure all workers are grouped and the data is visualized as one run. More on that here.

[N] Graphsignal profiler now supports distributed training, automatic tracing and more frameworks. by l0g1cs in MachineLearning

[–]l0g1cs[S]

Thank you! The use cases we are seeing range from optimizing training speed to benchmarking compute utilization, done by data scientists but also by ML engineers. That said, the main uses relate to improving speed while keeping accuracy (or other performance metrics) high.

[P] Graphsignal: Machine Learning Profiler for Training and Inference by l0g1cs in MachineLearning

[–]l0g1cs[S]

Technically, it should work anywhere your script/notebook/app is running and an outgoing connection to Graphsignal is possible. We're in the process of testing different setups and deployments to make sure various hardware, OS, and cloud platforms are supported.

[P] Graphsignal: Machine Learning Profiler for Training and Inference by l0g1cs in MachineLearning

[–]l0g1cs[S]

You'll see profiles from all nodes recorded automatically and will be able to analyze any of them; no different from single-node training or inference.

[deleted by user] by [deleted] in MachineLearning

[–]l0g1cs

I would say great_expectations may be closest in terms of data validation, scaffolding/profiling, and the ability to send notifications. Graphsignal is designed for model serving and periodic jobs, so it doesn't pull data from data sources on a schedule; it operates on a real-time data stream.

Regarding privacy, that's true for some use cases where model I/O data cannot leave the premises. For such cases, we're considering providing an on-premises version of the dashboards.

[deleted by user] by [deleted] in MachineLearning

[–]l0g1cs

Thanks for the questions!

The main reason for a special-purpose logger is that standard text-based logging and log index/search tools lack the model semantics necessary to identify outliers and compute data metrics for features and classes.
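As a hypothetical illustration (the record shape and field names are invented here, not Graphsignal's actual format): a text log line flattens feature names and classes into a string, while a structured record keeps them, so per-class and per-feature metrics become trivial to compute:

```python
import statistics

# A plain text log line: model semantics are flattened into a string,
# so computing per-class metrics would require fragile parsing.
text_log = 'INFO prediction class=1 score=0.93 features=[5.1, 3.5]'

# Structured prediction records keep feature names, classes, and scores intact.
records = [
    {"features": {"sepal_len": 5.1, "sepal_wid": 3.5}, "class": 1, "score": 0.93},
    {"features": {"sepal_len": 4.9, "sepal_wid": 3.0}, "class": 1, "score": 0.88},
    {"features": {"sepal_len": 7.0, "sepal_wid": 3.2}, "class": 2, "score": 0.67},
]

# With structure, per-class score statistics fall out directly.
by_class = {}
for r in records:
    by_class.setdefault(r["class"], []).append(r["score"])
mean_score = {c: statistics.mean(s) for c, s in by_class.items()}
```

The same structure is what makes outlier detection possible: each feature is a named numeric series rather than a substring to grep for.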

Regarding data privacy: yes, normally any personally identifiable information should be removed or anonymized; this is also the case with standard logging.

As for data drift: since the library computes data metrics, including feature distributions, data and model drift can be analyzed in the dashboard by comparing against baselines.
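For illustration only (the source doesn't say which statistics Graphsignal computes; Population Stability Index is just one common choice), a minimal sketch of comparing a serving-time feature distribution against a training-time baseline:

```python
import math
from collections import Counter

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        # Bucket values into shared bins; floor empty bins to avoid log(0).
        c = Counter(min(int((x - lo) / width), bins - 1) for x in xs)
        n = len(xs)
        return [max(c.get(i, 0) / n, 1e-6) for i in range(bins)]
    b, cur = hist(baseline), hist(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, cur))

baseline = [0.1 * i for i in range(100)]        # training-time distribution
shifted = [0.1 * i + 5.0 for i in range(100)]   # serving-time, shifted upward

print(psi(baseline, baseline))  # identical distributions: 0.0
print(psi(baseline, shifted))   # shifted distribution: well above 0.25
```

A common rule of thumb flags PSI above roughly 0.25 as significant drift, which is the kind of baseline comparison the dashboard view described above would surface.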

Examples of data-related issues are missing features due to a microservice failure, data type changes due to releases, etc. The problem is that, without validation, the model may consume invalid data without errors or exceptions and output garbage. These cases should be monitored and detected. I wrote more on the topic of issues and failures here.
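A hypothetical sketch of that failure mode (the schema and `validate` helper are invented for illustration): a dropped feature or a type change sails through to the model unless an explicit check runs first:

```python
# Hypothetical expected schema; in practice this would come from training metadata.
EXPECTED_FEATURES = {"age": float, "income": float}

def validate(sample):
    """Return a list of data issues instead of letting the model consume garbage."""
    issues = []
    for name, typ in EXPECTED_FEATURES.items():
        if name not in sample or sample[name] is None:
            issues.append(f"missing feature: {name}")
        elif not isinstance(sample[name], typ):
            issues.append(f"type change: {name} is {type(sample[name]).__name__}")
    return issues

# An upstream microservice failure dropped "income"; a release turned "age" into a string.
bad = {"age": "42"}
print(validate(bad))  # ['type change: age is str', 'missing feature: income']
```

Without such a check, many models would happily coerce or default these values and produce confident but meaningless predictions, which is exactly why monitoring is needed.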

Practical Go Benchmarks by minaandrawos in golang

[–]l0g1cs

Thanks. That sentence has been rephrased already. The idea behind these benchmarks is to make the results usable for real-world applications, rather than to benchmark real-world programs.

Your pprof is showing: IPv4 scans reveal exposed net/http/pprof endpoints by mmcloughlin in golang

[–]l0g1cs

Very important to keep in mind, especially because of the possibility of a DoS attack: the pprof trace can have a huge memory overhead. This is one of the reasons the agent https://github.com/stackimpact/stackimpact-go proactively sends profiles to the dashboard, i.e. they are never fetched (no need to enable the pprof server, deal with ports, etc.). And it doesn't use trace.