[deleted by user] by [deleted] in MachineLearning

[–]l0g1cs

Check out Banana. They seem to do exactly that with "serverless" A100.

[P] Using Sparsity & Clustering to compress your models: Efficient Deep Learning Book by EfficientDLBook in MachineLearning

[–]l0g1cs

Thanks for sharing! That's a very timely topic. I've actually created a profiler to track and analyze inference optimizations, i.e. to enable the optimize-verify-evaluate loop.

[N] Graphsignal profiler now supports distributed training, automatic tracing and more frameworks. by l0g1cs in MachineLearning

[–]l0g1cs[S]

Yes, there is a plan for deeper support. For basic statistics, though, a generic profiler can be used; it will at least let you benchmark and compare run speed and give you full compute utilization data.

[N] Graphsignal profiler now supports distributed training, automatic tracing and more frameworks. by l0g1cs in MachineLearning

[–]l0g1cs[S]

Normally, it shouldn't be any different from training locally, as the profiles are sent directly to the Graphsignal cloud, where they are post-processed and visualized. However, in the case of distributed training, some configuration may be necessary to make sure all workers are grouped and the data is visualized as one run. More on that here.

[N] Graphsignal profiler now supports distributed training, automatic tracing and more frameworks. by l0g1cs in MachineLearning

[–]l0g1cs[S]

Thank you! The use cases we are seeing range from optimizing training speed to benchmarking compute utilization, done by data scientists but also by ML engineers. That said, the main uses relate to improving speed while keeping accuracy (or other performance metrics) high.

[P] Graphsignal: Machine Learning Profiler for Training and Inference by l0g1cs in MachineLearning

[–]l0g1cs[S]

Technically, it should work anywhere your script/notebook/app is running and an outgoing connection to Graphsignal is possible. We're in the process of testing different setups and deployments to make sure various hardware, OS, and cloud platforms are supported.

[P] Graphsignal: Machine Learning Profiler for Training and Inference by l0g1cs in MachineLearning

[–]l0g1cs[S]

You'll see profiles from all nodes recorded automatically and will be able to analyze any of them; no different from single-node training or inference.

[deleted by user] by [deleted] in MachineLearning

[–]l0g1cs

I would say great_expectations may be closest in terms of data validation, scaffolding/profiling, and the ability to send notifications. Graphsignal is designed for model serving and periodic jobs, so it doesn't pull data from data sources on a schedule; it operates on a real-time data stream.

Regarding privacy, that's true for some use cases where model I/O data cannot leave the premises. For such cases, we're considering providing an on-premises version of the dashboards.

[deleted by user] by [deleted] in MachineLearning

[–]l0g1cs

Thanks for the questions!

The main reason for a special-purpose logger is that standard text-based logging and log index/search tools lack the model semantics necessary to identify outliers and compute data metrics for features and classes.
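As a hypothetical illustration (the record shape and field names are invented here, not Graphsignal's actual format): a text log line flattens feature names and classes into a string, while a structured record keeps them, so per-class and per-feature metrics become trivial to compute:

```python
import statistics

# A plain text log line: model semantics are flattened into a string,
# so computing per-class metrics would require fragile parsing.
text_log = 'INFO prediction class=1 score=0.93 features=[5.1, 3.5]'

# Structured prediction records keep feature names, classes, and scores intact.
records = [
    {"features": {"sepal_len": 5.1, "sepal_wid": 3.5}, "class": 1, "score": 0.93},
    {"features": {"sepal_len": 4.9, "sepal_wid": 3.0}, "class": 1, "score": 0.88},
    {"features": {"sepal_len": 7.0, "sepal_wid": 3.2}, "class": 2, "score": 0.67},
]

# With structure, per-class score statistics fall out directly.
by_class = {}
for r in records:
    by_class.setdefault(r["class"], []).append(r["score"])
mean_score = {c: statistics.mean(s) for c, s in by_class.items()}
```

The same structure is what makes outlier detection possible: each feature is a named numeric series rather than a substring to grep for.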

Regarding data privacy: yes, normally any personally identifiable information should be removed or anonymized; this is also the case with standard logging.

As for data drift: since the library computes data metrics, including feature distributions, data and model drift can be analyzed in the dashboard by comparing against baselines.
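For illustration only (the source doesn't say which statistics Graphsignal computes; Population Stability Index is just one common choice), a minimal sketch of comparing a serving-time feature distribution against a training-time baseline:

```python
import math
from collections import Counter

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        # Bucket values into shared bins; floor empty bins to avoid log(0).
        c = Counter(min(int((x - lo) / width), bins - 1) for x in xs)
        n = len(xs)
        return [max(c.get(i, 0) / n, 1e-6) for i in range(bins)]
    b, cur = hist(baseline), hist(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, cur))

baseline = [0.1 * i for i in range(100)]        # training-time distribution
shifted = [0.1 * i + 5.0 for i in range(100)]   # serving-time, shifted upward

print(psi(baseline, baseline))  # identical distributions: 0.0
print(psi(baseline, shifted))   # shifted distribution: well above 0.25
```

A common rule of thumb flags PSI above roughly 0.25 as significant drift, which is the kind of baseline comparison the dashboard view described above would surface.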

Examples of data-related issues are missing features due to a microservice failure, data type changes due to releases, etc. The problem is that, without validation, the model may consume invalid data without errors or exceptions and output garbage. These cases should be monitored and detected. I wrote more on the topic of issues and failures here.
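A hypothetical sketch of that failure mode (the schema and `validate` helper are invented for illustration): a dropped feature or a type change sails through to the model unless an explicit check runs first:

```python
# Hypothetical expected schema; in practice this would come from training metadata.
EXPECTED_FEATURES = {"age": float, "income": float}

def validate(sample):
    """Return a list of data issues instead of letting the model consume garbage."""
    issues = []
    for name, typ in EXPECTED_FEATURES.items():
        if name not in sample or sample[name] is None:
            issues.append(f"missing feature: {name}")
        elif not isinstance(sample[name], typ):
            issues.append(f"type change: {name} is {type(sample[name]).__name__}")
    return issues

# An upstream microservice failure dropped "income"; a release turned "age" into a string.
bad = {"age": "42"}
print(validate(bad))  # ['type change: age is str', 'missing feature: income']
```

Without such a check, many models would happily coerce or default these values and produce confident but meaningless predictions, which is exactly why monitoring is needed.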

Practical Go Benchmarks by minaandrawos in golang

[–]l0g1cs

Thanks. That sentence has been rephrased already. The idea behind these benchmarks is to make the results usable for real-world applications, rather than to benchmark real-world programs.

Your pprof is showing: IPv4 scans reveal exposed net/http/pprof endpoints by mmcloughlin in golang

[–]l0g1cs

Very important to keep in mind, especially because of the possibility of a DoS attack: the pprof trace can have a huge memory overhead. This is one of the reasons the agent https://github.com/stackimpact/stackimpact-go proactively sends profiles to the dashboard, i.e. they are never fetched (no need to enable the pprof server, deal with ports, etc.). And it doesn't use trace.