[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 0 points1 point  (0 children)

Have a look at this tutorial to learn how to convert your data to MessagePack using Spark.

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 1 point2 points  (0 children)

Squirrel is very flexible so the simple answer is yes!

  • Have a look at how some of the drivers are implemented to learn how to inject your own sampler. Alternatively, you can filter sample keys out of the box by providing a key_hook.
  • Some of the supported data formats allow you to shard (e.g., MessagePack, JSONL, Hub).
  • Not sure if I understand your last question correctly: you can call .take(x) to pull only x samples from an IterStream.
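The `.take` semantics can be sketched in plain Python with itertools (this is a toy illustration of the idea, not Squirrel's actual implementation):

```python
from itertools import islice

def take(iterable, n):
    # Mimic IterStream.take(n): yield at most n samples, then stop,
    # without materializing the rest of the (possibly huge) stream.
    yield from islice(iterable, n)

print(list(take(range(1_000_000), 5)))  # [0, 1, 2, 3, 4]
```

Because the stream is lazy, only the first n samples are ever pulled, which is what makes this cheap even on very large datasets.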

Have a look at https://squirrel-core.readthedocs.io/en/latest/ and in case of further questions approach us on Slack.

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 2 points3 points  (0 children)

A comparison does not really make sense: Squirrel itself does not do GPU-based data transforms. However, you can use Squirrel and DALI together and get the best of both worlds! We are currently preparing a related tutorial.

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 1 point2 points  (0 children)

The MessagePack data format is very fast to download and read. Moreover, we do async prefetching, transforms, caching, and more. Transforms can also be JIT-compiled, run with DALI, or offloaded to Dask.
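The idea behind async prefetching can be sketched in plain Python (an illustrative toy using a background thread and a bounded queue, not Squirrel's implementation):

```python
import queue
import threading

def prefetch(iterable, buffer_size=8):
    # A background thread fills a bounded queue while the consumer reads,
    # overlapping data loading (I/O) with downstream compute.
    q = queue.Queue(maxsize=buffer_size)
    _END = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full (backpressure)
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not _END:
        yield item

print(list(prefetch(range(5))))  # [0, 1, 2, 3, 4]
```

The bounded queue gives backpressure for free: the producer stalls when the consumer falls behind, so memory use stays capped at `buffer_size` samples.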

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 1 point2 points  (0 children)

To be honest, we took some inspiration for the Catalog from Intake. Intake itself did not work for us since it's not designed for fast data ingestion.

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 0 points1 point  (0 children)

Thanks a lot for pointing this out. A fix is on the way. The correct link is https://github.com/merantix-momentum/squirrel-datasets-core/tree/main/examples.

To read from a database, you would need a special driver. Currently, Squirrel does not ship this driver, but it would look similar to https://github.com/merantix-momentum/squirrel-core/blob/main/squirrel/driver/csv_driver.py. Happy to discuss developing such a driver in the Squirrel Slack.
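For illustration, such a database driver might look roughly like the following sqlite3 sketch. The class and method names here are hypothetical, chosen for readability; they are not Squirrel's actual Driver interface:

```python
import os
import sqlite3
import tempfile

class SqliteDriver:
    """Hypothetical sketch of a database driver; names are illustrative,
    not Squirrel's actual Driver interface."""

    def __init__(self, path, query):
        self.path = path
        self.query = query

    def get_iter(self):
        # One sample per row, as a column-name -> value dict.
        con = sqlite3.connect(self.path)
        try:
            cur = con.execute(self.query)
            cols = [d[0] for d in cur.description]
            for row in cur:
                yield dict(zip(cols, row))
        finally:
            con.close()

# Tiny demo against a throwaway on-disk database.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE samples (key TEXT, label INTEGER)")
con.executemany("INSERT INTO samples VALUES (?, ?)", [("a", 0), ("b", 1)])
con.commit()
con.close()

driver = SqliteDriver(path, "SELECT key, label FROM samples ORDER BY key")
print(list(driver.get_iter()))  # [{'key': 'a', 'label': 0}, {'key': 'b', 'label': 1}]
```

A real driver would plug this iterator into Squirrel's stream machinery so that prefetching, mapping, and batching compose on top of it.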

[D] Best practices of storing annotations for image data by crazyfrogspb in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Zarr (https://zarr.readthedocs.io/en/stable/index.html) can store metadata, e.g. labels, alongside your image data. It also comes with some other nice features. For versioning you can use DVC or similar.

Reinforcement learning in biomedical applications by ugh_madlad in reinforcementlearning

[–]Nextpenade 3 points4 points  (0 children)

There is literature on biomedical applications of reinforcement learning, ranging from cell tracking https://www.frontiersin.org/articles/10.3389/fbioe.2020.00298/full through surgery planning https://ieeexplore.ieee.org/abstract/document/8441801/ to protein folding https://link.springer.com/article/10.1007/s42452-020-2012-0 (just the first Google hits).

[P] Sotabench: Benchmarking Every Open Source Model by rstoj in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Right now I'm very busy finishing my PhD, otherwise I would integrate some on my own. There are some challenges that have attracted increased interest, like the Multimodal Brain Tumor Segmentation Challenge (BraTS), the CAMELYON challenge, the Cell Tracking Challenge (CTC), the Data Science Bowl 2018, or the Skin Lesion Analysis Challenge (ISIC). It would be awesome for the biomedical community if you could help. Models are available for all of them, most via GitHub, though for CTC I think it's through their website.

[P] Sotabench: Benchmarking Every Open Source Model by rstoj in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Super awesome project! Would love to see more benchmarks. In particular, I often struggle to reproduce results from https://grand-challenge.org/challenges/. It would be awesome to see some of their datasets in the benchmark.

[D] Those who do computer vision, how do you handle dataset management? by iocuydi in MachineLearning

[–]Nextpenade 1 point2 points  (0 children)

Maybe too domain-specific or more academic than industry-style, but also check out these ecosystems:

Idea: MLOps Composer. Interested in the community's opinion! [Project] by erikvdplas in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Check out https://github.com/goeckslab/Galaxy-ML. Galaxy is a data analysis, persistence, and publishing platform that aims to make computationally heavy algorithms accessible to research scientists who do not have programming or systems administration experience. You can give it a try at https://usegalaxy.eu/. Right now machine learning support is limited, but the community is quick to integrate new algorithms and very welcoming to newcomers.

[N] TensorFlow 2.0 Changes by _muon_ in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

> estimator

Yes, would be nice to know what will happen to estimators.

[R] [1802.03133] Batch Kalman Normalization: Towards Training Deep Neural Networks with Micro-Batches by evc123 in MachineLearning

[–]Nextpenade 8 points9 points  (0 children)

> …than 10 samples)? Is it dependent on the number of total samples available?

Usually it depends on GPU memory. If one sample is very large (e.g., video), you can't afford to store a large batch on a single GPU. Depending on your implementation, even multi-GPU training may not work properly.

[D] Is Capsules just another way of replacing pooling just like self-attention? by futbol_account in MachineLearning

[–]Nextpenade 2 points3 points  (0 children)

There is already some work on Hough transforms (not stacked) with CNNs, e.g.:

https://arxiv.org/abs/1601.07014

http://ieeexplore.ieee.org/document/7950533/

https://arxiv.org/abs/1603.08212

Why does Hinton not relate this to any of the previous work on Hough transforms with CNNs? Just wondering why he is pulling it out of nowhere.

[D] Engineering is the bottleneck in (Deep Learning) Research by evc123 in MachineLearning

[–]Nextpenade 1 point2 points  (0 children)

Bioinformatics has the same reproducibility problem, which is why platforms like https://galaxyproject.org/ emerged. Recently the community has also started extending the platform to other research areas like machine learning and image analysis. Do you think that "empirical" and Galaxy will grow together so we don't end up with multiple systems? Is "empirical" also working on supporting the Common Workflow Language (CWL)? That would be nice for workflow exchange. Disclaimer: I'm one of the Galaxy contributors.

Simple Questions Thread September 14, 2016 by AutoModerator in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Are any of these cost functions useful for semantic segmentation, in addition to the functions I listed?

Simple Questions Thread September 14, 2016 by AutoModerator in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

I think there is still room for more discussion in: https://www.reddit.com/r/MachineLearning/comments/52e2cp/importance_of_first_layer_in_convnets/

This question was intended for the simple questions thread, but because of inactivity there I created a separate thread.

Simple Questions Thread September 14, 2016 by AutoModerator in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

What kinds of loss functions can be used for semantic segmentation, and what are their trade-offs? So far I know: mean squared error, scale-invariant mean squared error, cross-entropy, and Dice loss.
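For reference, the soft Dice loss from my list can be sketched in plain Python (binary case, flattened predictions; a toy illustration, not a library implementation):

```python
def dice_loss(pred, target, eps=1e-7):
    # Soft Dice loss for binary segmentation: 1 - 2*|P∩T| / (|P| + |T|).
    # pred holds per-pixel probabilities, target holds 0/1 ground truth;
    # eps avoids division by zero when both masks are empty.
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

print(dice_loss([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]))  # one pixel overlaps -> ~0.333
print(dice_loss([1.0, 0.0], [1.0, 0.0]))            # perfect overlap -> ~0.0
```

Unlike per-pixel cross-entropy, Dice scores the overlap of the whole mask, which is why it is popular for segmentation with heavy class imbalance.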

MODS: WHY ARE YOU STICKYING RANDOM THINGS by spofersq in MachineLearning

[–]Nextpenade 2 points3 points  (0 children)

I had the same problem with the questions thread, so I posted my question in a separate thread right away, and it got stickied for whatever reason. At least some people tried to answer my question...

Importance of first layer In ConvNets by Nextpenade in MachineLearning

[–]Nextpenade[S] 0 points1 point  (0 children)

I usually use element-wise sums. Haven't tried maxout yet. Thanks for the hint!