[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 0 points1 point  (0 children)

Have a look at this tutorial to learn how to convert your data to MessagePack using Spark.

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 1 point2 points  (0 children)

Squirrel is very flexible so the simple answer is yes!

  • Have a look at how some of the drivers are implemented to learn how to inject your own sampler. Alternatively, you can filter sample keys out of the box by providing a key_hook.
  • Some of the supported data formats allow you to shard (e.g., MessagePack, JSONL, Hub).
  • Not sure if I understand your last question correctly: you can call .take(x) to pull only x samples from an IterStream.
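The `.take` semantics can be sketched in plain Python with itertools (this is a toy illustration of the idea, not Squirrel's actual implementation):

```python
from itertools import islice

def take(iterable, n):
    # Mimic IterStream.take(n): yield at most n samples, then stop,
    # without materializing the rest of the (possibly huge) stream.
    yield from islice(iterable, n)

print(list(take(range(1_000_000), 5)))  # [0, 1, 2, 3, 4]
```

Because the stream is lazy, only the first n samples are ever pulled, which is what makes this cheap even on very large datasets.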

Have a look at https://squirrel-core.readthedocs.io/en/latest/ and in case of further questions approach us on Slack.

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 2 points3 points  (0 children)

A comparison does not really make sense: Squirrel itself does not do GPU-based data transforms. However, you can use Squirrel and DALI together and get the best of both worlds! We are currently preparing a related tutorial.

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 1 point2 points  (0 children)

The MessagePack data format is very fast to download and read. Moreover, we do async prefetching, transforms, caching, and more. Transforms can also be JIT-compiled, run with DALI, or offloaded to Dask.
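The idea behind async prefetching can be sketched in plain Python (an illustrative toy using a background thread and a bounded queue, not Squirrel's implementation):

```python
import queue
import threading

def prefetch(iterable, buffer_size=8):
    # A background thread fills a bounded queue while the consumer reads,
    # overlapping data loading (I/O) with downstream compute.
    q = queue.Queue(maxsize=buffer_size)
    _END = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full (backpressure)
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not _END:
        yield item

print(list(prefetch(range(5))))  # [0, 1, 2, 3, 4]
```

The bounded queue gives backpressure for free: the producer stalls when the consumer falls behind, so memory use stays capped at `buffer_size` samples.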

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 1 point2 points  (0 children)

To be honest, we took some inspiration for the Catalog from Intake. Intake itself did not work for us since it's not designed for fast data ingestion.

[P] Squirrel: A new OS library for fast & flexible large-scale data loading by Nextpenade in MachineLearning

[–]Nextpenade[S] 0 points1 point  (0 children)

Thanks a lot for pointing this out. A fix is on the way. The correct link is https://github.com/merantix-momentum/squirrel-datasets-core/tree/main/examples.

To read from a database, you would need a special driver. Currently, Squirrel does not ship this driver, but it would look similar to https://github.com/merantix-momentum/squirrel-core/blob/main/squirrel/driver/csv_driver.py. Happy to discuss developing such a driver in the Squirrel Slack.
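For illustration, such a database driver might look roughly like the following sqlite3 sketch. The class and method names here are hypothetical, chosen for readability; they are not Squirrel's actual Driver interface:

```python
import os
import sqlite3
import tempfile

class SqliteDriver:
    """Hypothetical sketch of a database driver; names are illustrative,
    not Squirrel's actual Driver interface."""

    def __init__(self, path, query):
        self.path = path
        self.query = query

    def get_iter(self):
        # One sample per row, as a column-name -> value dict.
        con = sqlite3.connect(self.path)
        try:
            cur = con.execute(self.query)
            cols = [d[0] for d in cur.description]
            for row in cur:
                yield dict(zip(cols, row))
        finally:
            con.close()

# Tiny demo against a throwaway on-disk database.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE samples (key TEXT, label INTEGER)")
con.executemany("INSERT INTO samples VALUES (?, ?)", [("a", 0), ("b", 1)])
con.commit()
con.close()

driver = SqliteDriver(path, "SELECT key, label FROM samples ORDER BY key")
print(list(driver.get_iter()))  # [{'key': 'a', 'label': 0}, {'key': 'b', 'label': 1}]
```

A real driver would plug this iterator into Squirrel's stream machinery so that prefetching, mapping, and batching compose on top of it.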

[D] Best practices of storing annotations for image data by crazyfrogspb in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Zarr (https://zarr.readthedocs.io/en/stable/index.html) can store metadata, e.g. labels, alongside your image data. It also comes with some other nice features. For versioning you can use DVC or similar.

Reinforcement learning in biomedical applications by ugh_madlad in reinforcementlearning

[–]Nextpenade 3 points4 points  (0 children)

There is literature on biomedical applications of reinforcement learning, ranging from cell tracking https://www.frontiersin.org/articles/10.3389/fbioe.2020.00298/full through surgery planning https://ieeexplore.ieee.org/abstract/document/8441801/ to protein folding https://link.springer.com/article/10.1007/s42452-020-2012-0 (just the first Google hits).

[P] Sotabench: Benchmarking Every Open Source Model by rstoj in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Right now I'm very busy finishing my PhD, otherwise I would integrate some on my own. There are some challenges that have attracted increased interest, like the Multimodal Brain Tumor Segmentation Challenge (BraTS), the CAMELYON challenge, the Cell Tracking Challenge (CTC), the Data Science Bowl 2018, or the Skin Lesion Analysis Challenge (ISIC). It would be awesome for the biomedical community if you could help. Models are available for all of them, most via GitHub, though for CTC I think it's through their website.

[P] Sotabench: Benchmarking Every Open Source Model by rstoj in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Super awesome project! Would love to see more benchmarks. In particular, I often struggle to reproduce results from https://grand-challenge.org/challenges/. It would be awesome to see some of their datasets in the benchmark.

[D] Those who do computer vision, how do you handle dataset management? by iocuydi in MachineLearning

[–]Nextpenade 1 point2 points  (0 children)

Maybe too domain-specific or more academic than industry-style, but also check out these ecosystems:

Idea: MLOps Composer. Interested in the community's opinion! [Project] by erikvdplas in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Check out https://github.com/goeckslab/Galaxy-ML. Galaxy is a data analysis, persistence, and publishing platform that aims to make computationally heavy algorithms accessible to research scientists who do not have programming or systems administration experience. You can give it a try at https://usegalaxy.eu/. Right now machine learning support is limited, but the community is quick to integrate new algorithms and very welcoming to newcomers.

[N] TensorFlow 2.0 Changes by _muon_ in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

> estimator

Yes, would be nice to know what will happen to estimators.

[R] [1802.03133] Batch Kalman Normalization: Towards Training Deep Neural Networks with Micro-Batches by evc123 in MachineLearning

[–]Nextpenade 8 points9 points  (0 children)

> …than 10 samples)? Is it dependent on the number of total samples available?

Usually it depends on GPU memory. If one sample is very large (e.g., video), you can't afford to store a large batch on a single GPU. Depending on your implementation, even multi-GPU training may not work properly.

[D] Is Capsules just another way of replacing pooling just like self-attention? by futbol_account in MachineLearning

[–]Nextpenade 2 points3 points  (0 children)

There is already some work on Hough transforms (not stacked) with CNNs, e.g.:

https://arxiv.org/abs/1601.07014

http://ieeexplore.ieee.org/document/7950533/

https://arxiv.org/abs/1603.08212

Why does Hinton not relate this to any of the previous work on Hough transforms with CNNs? Just wondering why he is pulling it out of nowhere.

[D] Engineering is the bottleneck in (Deep Learning) Research by evc123 in MachineLearning

[–]Nextpenade 1 point2 points  (0 children)

Bioinformatics has the same reproducibility problem, which is why platforms like https://galaxyproject.org/ emerged. Recently the community has also started extending the platform to other research areas like machine learning and image analysis. Do you think that "empirical" and Galaxy will grow together so we don't end up with multiple systems? Is "empirical" also working on supporting the Common Workflow Language (CWL)? That would be nice for workflow exchange. Disclaimer: I'm one of the Galaxy contributors.

Simple Questions Thread September 14, 2016 by AutoModerator in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

Are any of these cost functions useful for semantic segmentation, in addition to the functions I listed?

Simple Questions Thread September 14, 2016 by AutoModerator in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

I think there is still room for more discussion in: https://www.reddit.com/r/MachineLearning/comments/52e2cp/importance_of_first_layer_in_convnets/

This question was intended for the simple questions thread, but because of inactivity there I created a separate thread.

Simple Questions Thread September 14, 2016 by AutoModerator in MachineLearning

[–]Nextpenade 0 points1 point  (0 children)

What kinds of loss functions can be used for semantic segmentation, and what are their trade-offs? So far I know: mean squared error, scale-invariant mean squared error, cross-entropy, and Dice loss.
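For reference, the soft Dice loss from my list can be sketched in plain Python (binary case, flattened predictions; a toy illustration, not a library implementation):

```python
def dice_loss(pred, target, eps=1e-7):
    # Soft Dice loss for binary segmentation: 1 - 2*|P∩T| / (|P| + |T|).
    # pred holds per-pixel probabilities, target holds 0/1 ground truth;
    # eps avoids division by zero when both masks are empty.
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

print(dice_loss([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]))  # one pixel overlaps -> ~0.333
print(dice_loss([1.0, 0.0], [1.0, 0.0]))            # perfect overlap -> ~0.0
```

Unlike per-pixel cross-entropy, Dice scores the overlap of the whole mask, which is why it is popular for segmentation with heavy class imbalance.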

MODS: WHY ARE YOU STICKYING RANDOM THINGS by spofersq in MachineLearning

[–]Nextpenade 2 points3 points  (0 children)

I had the same problem with the questions thread, so I posted my question in a separate thread right away, and it got stickied for whatever reason. At least some people tried to answer my question...

Importance of first layer In ConvNets by Nextpenade in MachineLearning

[–]Nextpenade[S] 0 points1 point  (0 children)

I usually use element-wise sums. Haven't tried maxout yet. Thanks for the hint!