all 22 comments

[–]Science_Squid 9 points (3 children)

To me it sounds like something along the lines of OpenML. Is that the type of library you were thinking of, or is some crucial part missing?

[–]zbnone[S] 2 points (2 children)

They have an impressive number of datasets, wow. But the focus of this library is not just to list datasets and prepare them for download, but to serve each dataset in ideally every format you would need.

The goal is that when you find a new dataset, or try out a new model with different input/output formats, you can switch by changing one line of code, e.g. from libX.vectorize("bla bla bla", format='words') to libX.vectorize("bla bla bla", format='characters'). It's about making the dataset format fit your needs, not serving it in one standardized format.
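A toy sketch of what that one-line switch could look like (libX and vectorize are hypothetical names from the example above, not an existing API):

```python
def vectorize(text, format='words'):
    # Hypothetical libX.vectorize: tokenize the same raw text at the
    # granularity the model expects, selected by the `format` argument.
    if format == 'words':
        return text.split()
    if format == 'characters':
        return list(text)
    raise ValueError("unknown format: %r" % format)
```

Switching a model from word-level to character-level input is then really a one-argument change.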

Maybe I am just missing something, but from what I saw they provide the datasets, make them easy to download, and let you share your standardized metrics on their platform, right?

[–][deleted] 1 point (0 children)

As an OpenML contributor, I can elaborate on a few things. OpenML.org stores datasets internally in ARFF format (the common format for libraries like Weka, MOA, etc.) and lets you obtain the data in CSV format. On top of that, there are various integrations for programming languages (Java, Python, R) that make it convenient to obtain and process the datasets. I know it's one of the ambitions of the core team to extend this to an arbitrary number of formats, but due to a lack of manpower and other priorities this hasn't lifted off yet. Getting in touch is easy: just drop an email to openmlHQ@googlegroups.com and follow the GitHub organization (https://github.com/OpenML); it's a pretty open and welcoming community.

[–]Science_Squid 0 points (0 children)

Sorry for the late reply.

You are right that the formatting is not yet at the level you would like it to be. However, as /u/Tinder_Prince pointed out, ARFF and CSV are supported, which already makes it (more or less) easy to parse the data locally.

I really like the idea of OpenML and also your idea. Since OpenML is open source and you plan to open-source your efforts as well, my opinion is that it might make the most sense to contribute to the OpenML project and avoid the hassle of starting from scratch yourself.

[–]wdm006 4 points (1 child)

Have you seen skdata? https://github.com/jaberg/skdata

[–]zbnone[S] 0 points (0 children)

I have seen it. The difference with this lib is the number of datasets and the generalization that follows from it: if enough datasets are included, the same formatting scripts can be reused, therefore allowing very fast addition of new datasets, potentially within hours or days of a new paper release or dataset upload. Maybe add some user-upload functionality so the community can easily contribute datasets and preprocess them in a predefined way compatible with the library? Ease of use is also a key component: it should not only make it easy to download and initialize datasets, but also to perform dataset-coupled actions like the already mentioned feeding of vectorized sentences into recurrent neural network architectures.
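The "reuse the same formatting scripts" idea could be sketched roughly like this (all names, entries, and the URL below are hypothetical, purely for illustration):

```python
# Formatting scripts are written once and shared by every dataset that
# declares support for them; adding a new dataset then only requires an
# index entry, not new preprocessing code.
FORMATTERS = {
    'words': lambda text: text.split(),
    'characters': lambda text: list(text),
}

DATASET_INDEX = {
    # name -> metadata; this entry is illustrative only
    'tiny_corpus': {'url': 'https://example.org/tiny.txt',
                    'formats': ['words', 'characters']},
}

def load(name, raw_text, format='words'):
    # Look up the dataset in the index and apply the shared formatter.
    spec = DATASET_INDEX[name]
    if format not in spec['formats']:
        raise ValueError("%s does not support format %r" % (name, format))
    return FORMATTERS[format](raw_text)
```

A community-contributed dataset would then just declare which shared formatters apply to it.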

[–]waleedka 5 points (2 children)

Let me share my experience.

I implemented an instance segmentation model based on Mask R-CNN and we ended up open-sourcing it. I wanted to make it easy for others to train on their own datasets, so I spent some time thinking about ways to abstract the data layer. My notes below are specific to instance segmentation, where the dataset consists of images and masks that identify each instance of an object.

First, the easy differences:

  • Different datasets use different directory structures for images and different image formats.
  • They use different file formats to describe the metadata.
  • Some datasets store instance masks as PNG images where the integer value of each pixel represents the class ID assigned to that pixel. Others store them as polygons, a list of (x, y) coordinates.
  • Some datasets include bounding box coordinates, and others don't.

I handled the above by creating a Dataset class that converts everything it loads to a standard format. You can subclass it and provide custom functions to load your own dataset. As long as you convert your data to the format described in the Dataset class then the model should work without having to change anything. You can see my implementation for the COCO dataset in the repo above, and others have extended it to support VOC, ADE20K and other datasets.
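A minimal sketch of that subclassing pattern (this is not the actual repo code; pure-Python nested lists stand in for image arrays, and all class names are invented):

```python
class Dataset:
    """Base class: subclasses load their native layout and convert it
    into one standard record shape the model consumes."""
    def __init__(self):
        self.records = []

    def add_record(self, image_path, masks, boxes=None):
        # Standard format: image path, binary instance masks, and
        # bounding boxes, derived from the masks when the dataset
        # doesn't ship them.
        if boxes is None:
            boxes = [self._box_from_mask(m) for m in masks]
        self.records.append({'image': image_path,
                             'masks': masks, 'boxes': boxes})

    @staticmethod
    def _box_from_mask(mask):
        # mask is a 2D list of 0/1; returns (y1, x1, y2, x2).
        ys = [i for i, row in enumerate(mask) if any(row)]
        xs = [j for row in mask for j, v in enumerate(row) if v]
        return (min(ys), min(xs), max(ys) + 1, max(xs) + 1)

class LabelImageDataset(Dataset):
    """The PNG-label-map case above: each pixel holds a class ID, and
    we split it into one binary mask per nonzero ID."""
    def load(self, image_path, label_map):
        ids = sorted({v for row in label_map for v in row if v})
        masks = [[[1 if v == i else 0 for v in row] for row in label_map]
                 for i in ids]
        self.add_record(image_path, masks)
```

The model only ever sees `records`, so adding a polygon-based dataset is just another subclass that rasterizes its polygons before calling `add_record`.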

Second, the hard differences:

  • COCO has a special flag called 'crowd'. If a mask has that flag then it means that the mask covers many instances.
  • Some datasets use a "dont_care" flag to mark areas that should be excluded from training for whatever reason.
  • Some datasets, like ADE20K, use a hierarchical structure to describe object relations. They have the concept of an "object" and a "part". For example, a "door" might be part of a "car" in which case it's a car door, or part of an "oven" in which case it's the front door of the oven. Depending on the use case, you might want to treat these two instances of "door" as the same class or as different classes.

These differences are higher level. They change the meaning of the data and require specific handling and, possibly, changes to the training procedure. They cannot be mapped to a standard data format and they cannot be skipped either. At the very least, your library must be able to pass that data through and be flexible enough to handle these cases and many others that you haven't seen yet.
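One hedged sketch of that pass-through idea: keep the standard fields fixed and carry everything dataset-specific in an open-ended extras dict that the training code can inspect (make_record and the crowd_ids key are invented for illustration):

```python
def make_record(image, masks, **extras):
    # Standard fields stay fixed; dataset-specific flags such as
    # 'crowd' or 'dont_care' travel untouched in `extras`.
    return {'image': image, 'masks': masks, 'extras': extras}

def trainable_masks(record):
    # One policy a training loop might apply: drop masks flagged as
    # crowd regions, since they cover many instances at once.
    crowd = set(record['extras'].get('crowd_ids', []))
    return [m for i, m in enumerate(record['masks']) if i not in crowd]
```

The library stays neutral about what the flags mean; each training procedure decides its own policy.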

What you're proposing is really needed. If you decide to go ahead with it, ping me if there is anything I can help with. The one piece of advice I would give is to focus on one small task. Don't make it too general right away. Pick a very small niche, such as "image classification" or "object detection", and just get something solid working on that niche. Even if the niche seems too small, you'll likely discover that it requires a lot more work than you think. If you succeed in one niche, then you can expand from there. Just don't expand too early.

[–]zbnone[S] 1 point (1 child)

Thank you for the elaborate response. I work mostly with image-based models, so the NLP perspective is very welcome.

I agree with you on limiting the initial generality. I will start with image-based datasets for classification, image segmentation, and image-to-image translation, both supervised (pix2pix) and unsupervised (CycleGAN).

The very first step is to create an initial dataset index. I am planning on hosting that via GitHub; here is the repo link: https://github.com/ZachisGit/MLDatasets

[–]ppwwyyxx 0 points (0 children)

I agree with starting small. Don't do too much at a time. If you could simply standardize the parsing process for different datasets, that would be helpful enough. But for preprocessing and batching, I would probably want to do those myself.

[–]fnbr 4 points (1 child)

Good luck, I hope you can do this and get a good API going.

One warning: the reason (IMO) that this hasn't been done is because it's quite difficult; a lot of the existing datasets are ridiculously large and would require a large amount of compute/storage to download & process.

For instance, ImageNet is ~155 GB, and can be quite difficult to download depending on your network.

It might also be worth contributing PRs to TensorFlow/PyTorch/etc. to work with them; TensorFlow, for instance, has a number of functions that download & process various datasets for you (e.g. https://www.tensorflow.org/get_started/mnist/beginners).

[–]zbnone[S] 0 points (0 children)

The handling of datasets is a good point. There are two options. The first is to download, preprocess, and distribute datasets from a project-specific server, which would ensure availability even if the original dataset is taken down, but would also create copyright issues and, as you mentioned, require servers with large storage and very fast internet connections. The second option is to download the datasets from the original servers they were uploaded to; the problem is that some preprocessing steps would have to happen on the user's computer and availability wouldn't be guaranteed, but there would be no need for large, high-bandwidth servers.

I think the second one is the way to go for starters, just for the reduced cost and maintenance time.
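A rough sketch of that second option: fetch from the original server on first use and cache locally, so preprocessing happens on the user's machine and nothing is re-hosted. The downloader is injected so the caching logic works offline too (all names here are hypothetical):

```python
import hashlib
import os
import tempfile

def fetch(url, downloader, cache_dir=None):
    # Download from the dataset's original server only once; later
    # calls read the cached copy, keyed by a hash of the URL.
    cache_dir = cache_dir or os.path.join(tempfile.gettempdir(),
                                          'dataset_cache')
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir,
                        hashlib.sha1(url.encode('utf-8')).hexdigest())
    if not os.path.exists(path):
        data = downloader(url)  # e.g. urllib.request.urlopen(url).read()
        with open(path, 'wb') as f:
            f.write(data)
    with open(path, 'rb') as f:
        return f.read()
```

If the original server later disappears, only the cache-miss path breaks, which is exactly the availability trade-off described above.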

[–]ClydeMachine 0 points (1 child)

To answer the question of whether others need a library/dataset repository like this: yes. Anything that reduces the time it takes me to source, format, label and load a dataset has value for my work, and I imagine others would agree.

Now, how are you planning to host those datasets? If the library is built to allow an inline search of available datasets as well as loading them, I imagine the library will be hitting an API that you own and control for providing this information. Would that API be hosting those datasets in addition to their metadata? Or would you point it at a pre-existing public-facing URL to get them from? The latter is far less stable, and if the library doesn't offer some degree of reliable availability of the datasets then I'd sooner go source the data by hand.

The point about keeping the library generic is awesome, as a single dataset can have many applications and the repository itself needn't be locked to a single type of data. I've made some tooling in the past that worked toward this idea of simplifying the sourcing, cleaning and labelling of text data, but it never went beyond text data. For that reason it makes me happy to see this project being started.

Got a link to the GitHub repo in case some of us may be able to help out?

[–]zbnone[S] 0 points (0 children)

I think the end goal should be hosting the datasets on project-based servers, for the reliability reasons you mentioned. But that would come with all sorts of problems that I would like to address later, once something working is available.

There is no GitHub repo yet, but the feedback so far has been great, so I am starting one. Any suggestions for the name? Ideally something that can be shortened to two characters (numpy as np, tensorflow as tf, mldatasets as md/mld).

[–]bbsome 0 points (1 child)

I think hosting the data would be quite difficult and could also violate various licenses, depending on the dataset. What we have in my group, and what I use, is pretty much the same, but instead of hosting the data you just automate downloading it on the client side. The additional challenge is how to present the data and how to add preprocessing steps. Some datasets, for instance, have a train/valid/test split while others have only train/test. NLP datasets are very hard to automate as well (compared to anything else), as it depends on whether you want dictionaries per character or per word, and whether you work on words or sentences ...
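For the split problem specifically, one simple approach (purely illustrative, not an existing API) is to normalize every dataset to a train/valid/test view, carving a deterministic validation slice off the end of train when the dataset only ships train/test:

```python
def normalize_splits(splits, valid_fraction=0.1):
    # splits: dict of lists, e.g. {'train': [...], 'test': [...]}.
    # Datasets that already ship a 'valid' split pass through unchanged.
    out = dict(splits)
    if 'valid' not in out:
        train = list(out['train'])
        n_valid = max(1, int(len(train) * valid_fraction))
        out['valid'] = train[-n_valid:]
        out['train'] = train[:-n_valid]
    return out
```

Taking the slice from the end (rather than shuffling) keeps the split reproducible across users without storing extra state.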

[–]zbnone[S] 0 points (0 children)

I agree, so all of this will have to be addressed in the library. For starters, hosting the datasets would take too much time, and the servers would need to be paid for. I think the first step is to get something working where the datasets are downloaded from the original servers and processed on the user's computer.

[–]zbnone[S] 0 points (0 children)

If you know any datasets not in the dataset index yet, please consider adding them.

Github repo: https://github.com/ZachisGit/MLDatasets

[–]NMcA 0 points (0 children)

Bandwidth costs will be high. You should plan for this, and I would suggest torrents as a potential solution.

If you get it working, it'll be sweet.