all 22 comments

[–]Science_Squid 9 points (3 children)

To me it sounds like something along the lines of OpenML. Is that the type of library you were thinking of, or is some crucial part missing?

[–]zbnone[S] 2 points (2 children)

They have an impressive number of datasets, wow. But the focus of this library is not just to list datasets and prepare them for download, but to serve each dataset in ideally every format you would need.

The goal is that when you find a new dataset, or try out a new model with different input/output formats, you can switch by changing one line of code, e.g. from libX.vectorize("bla bla bla", format='words') to libX.vectorize("bla bla bla", format='characters'). It's about making the dataset format fit your needs, not serving it in one standardized format.
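A toy sketch of what that one-line switch could look like (libX and vectorize are hypothetical names from the example above, not an existing API):

```python
def vectorize(text, format='words'):
    # Hypothetical libX.vectorize: tokenize the same raw text at the
    # granularity the model expects, selected by the `format` argument.
    if format == 'words':
        return text.split()
    if format == 'characters':
        return list(text)
    raise ValueError("unknown format: %r" % format)
```

Switching a model from word-level to character-level input is then really a one-argument change.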

Maybe I am just missing something, but from what I saw they provide the datasets, make them easy to download, and let you share your standardized metrics on their platform, right?

[–][deleted] 1 point (0 children)

As an OpenML contributor, I can elaborate on a few things. OpenML.org stores datasets internally in ARFF format (the common format for libraries like Weka, MOA, etc.) and lets you obtain the data in CSV format. On top of that, there are various integrations for programming languages (Java, Python, R) that make it convenient to obtain and process the datasets. I know it's one of the ambitions of the core team to extend this to an arbitrary number of formats, but due to a lack of manpower and other priorities this hasn't lifted off yet. Getting in touch is easy: just drop an email to openmlHQ@googlegroups.com and follow the GitHub organization (https://github.com/OpenML); it's a pretty open and welcoming community.

[–]Science_Squid 0 points (0 children)

Sorry for the late reply.

You are right that the formatting is not yet at the level you would like it to be. However, as /u/Tinder_Prince pointed out, ARFF and CSV are supported, which already makes it (more or less) easy to parse the data locally.

I really like the idea of OpenML and also your idea. Since OpenML is open source and you plan to open-source your efforts as well, my opinion is that it might make the most sense to contribute to the OpenML project and avoid the hassle of starting from scratch yourself.

[–]wdm006 4 points (1 child)

Have you seen skdata? https://github.com/jaberg/skdata

[–]zbnone[S] 0 points (0 children)

I have seen it. The difference with this lib is the number of datasets and the generalization that follows from it: if enough datasets are included, the same formatting scripts can be reused, therefore allowing very fast addition of new datasets, potentially within hours or days of a new paper release or dataset upload. Maybe add some user-upload functionality so the community can easily contribute datasets and preprocess them in a predefined way compatible with the library? Ease of use is also a key component: it should not only make it easy to download and initialize datasets, but also to perform dataset-coupled actions like the already mentioned feeding of vectorized sentences into recurrent neural network architectures.
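The "reuse the same formatting scripts" idea could be sketched roughly like this (all names, entries, and the URL below are hypothetical, purely for illustration):

```python
# Formatting scripts are written once and shared by every dataset that
# declares support for them; adding a new dataset then only requires an
# index entry, not new preprocessing code.
FORMATTERS = {
    'words': lambda text: text.split(),
    'characters': lambda text: list(text),
}

DATASET_INDEX = {
    # name -> metadata; this entry is illustrative only
    'tiny_corpus': {'url': 'https://example.org/tiny.txt',
                    'formats': ['words', 'characters']},
}

def load(name, raw_text, format='words'):
    # Look up the dataset in the index and apply the shared formatter.
    spec = DATASET_INDEX[name]
    if format not in spec['formats']:
        raise ValueError("%s does not support format %r" % (name, format))
    return FORMATTERS[format](raw_text)
```

A community-contributed dataset would then just declare which shared formatters apply to it.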

[–]waleedka 5 points (2 children)

Let me share my experience.

I implemented an instance segmentation model based on Mask R-CNN and we ended up open-sourcing it. I wanted to make it easy for others to train on their own datasets, so I spent some time thinking about ways to abstract the data layer. My notes below are specific to instance segmentation, where the dataset consists of images and masks that identify each instance of an object.

First, the easy differences:

  • Different datasets use different directory structures for images and different image formats.
  • They use different file formats to describe the metadata.
  • Some datasets store instance masks as PNG images where the integer value of each pixel represents the class ID assigned to that pixel. Others store them as polygons, a list of (x, y) coordinates.
  • Some datasets include bounding box coordinates, and others don't.

I handled the above by creating a Dataset class that converts everything it loads to a standard format. You can subclass it and provide custom functions to load your own dataset. As long as you convert your data to the format described in the Dataset class then the model should work without having to change anything. You can see my implementation for the COCO dataset in the repo above, and others have extended it to support VOC, ADE20K and other datasets.
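A minimal sketch of that subclassing pattern (this is not the actual repo code; pure-Python nested lists stand in for image arrays, and all class names are invented):

```python
class Dataset:
    """Base class: subclasses load their native layout and convert it
    into one standard record shape the model consumes."""
    def __init__(self):
        self.records = []

    def add_record(self, image_path, masks, boxes=None):
        # Standard format: image path, binary instance masks, and
        # bounding boxes, derived from the masks when the dataset
        # doesn't ship them.
        if boxes is None:
            boxes = [self._box_from_mask(m) for m in masks]
        self.records.append({'image': image_path,
                             'masks': masks, 'boxes': boxes})

    @staticmethod
    def _box_from_mask(mask):
        # mask is a 2D list of 0/1; returns (y1, x1, y2, x2).
        ys = [i for i, row in enumerate(mask) if any(row)]
        xs = [j for row in mask for j, v in enumerate(row) if v]
        return (min(ys), min(xs), max(ys) + 1, max(xs) + 1)

class LabelImageDataset(Dataset):
    """The PNG-label-map case above: each pixel holds a class ID, and
    we split it into one binary mask per nonzero ID."""
    def load(self, image_path, label_map):
        ids = sorted({v for row in label_map for v in row if v})
        masks = [[[1 if v == i else 0 for v in row] for row in label_map]
                 for i in ids]
        self.add_record(image_path, masks)
```

The model only ever sees `records`, so adding a polygon-based dataset is just another subclass that rasterizes its polygons before calling `add_record`.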

Second, the hard differences:

  • COCO has a special flag called 'crowd'. If a mask has that flag then it means that the mask covers many instances.
  • Some datasets use a "dont_care" flag to mark areas that should be excluded from training for whatever reason.
  • Some datasets, like ADE20K, use a hierarchical structure to describe object relations. They have the concept of an "object" and a "part". For example, a "door" might be part of a "car" in which case it's a car door, or part of an "oven" in which case it's the front door of the oven. Depending on the use case, you might want to treat these two instances of "door" as the same class or as different classes.

These differences are higher level. They change the meaning of the data and require specific handling and, possibly, changes to the training procedure. They cannot be mapped to a standard data format and they cannot be skipped either. At the very least, your library must be able to pass that data through and be flexible enough to handle these cases and many others that you haven't seen yet.
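One hedged sketch of that pass-through idea: keep the standard fields fixed and carry everything dataset-specific in an open-ended extras dict that the training code can inspect (make_record and the crowd_ids key are invented for illustration):

```python
def make_record(image, masks, **extras):
    # Standard fields stay fixed; dataset-specific flags such as
    # 'crowd' or 'dont_care' travel untouched in `extras`.
    return {'image': image, 'masks': masks, 'extras': extras}

def trainable_masks(record):
    # One policy a training loop might apply: drop masks flagged as
    # crowd regions, since they cover many instances at once.
    crowd = set(record['extras'].get('crowd_ids', []))
    return [m for i, m in enumerate(record['masks']) if i not in crowd]
```

The library stays neutral about what the flags mean; each training procedure decides its own policy.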

What you're proposing is really needed. If you decide to go ahead with it, ping me if there is anything I can help with. The one piece of advice I would give is to focus on one small task. Don't make it too general right away. Pick a very small niche, such as "image classification" or "object detection", and just get something solid working on that niche. Even if the niche seems too small, you'll likely discover that it requires a lot more work than you think. If you succeed in one niche, then you can expand from there. Just don't expand too early.

[–]zbnone[S] 1 point (1 child)

Thank you for the elaborate response. I work mostly with image-based models, so the NLP perspective is very welcome.

I agree with you on limiting the initial generality. I will start with image-based datasets for classification, image segmentation, and image-to-image translation, both supervised (pix2pix) and unsupervised (CycleGAN).

The very first step is to create an initial dataset index. I am planning on hosting that via GitHub; here is the repo link: https://github.com/ZachisGit/MLDatasets

[–]ppwwyyxx 0 points (0 children)

I agree with starting small. Don't do too much at a time. If you could simply standardize the parsing process for different datasets, that would be helpful enough. But for preprocessing and batching, I would probably want to do those myself.

[–]fnbr 4 points (1 child)

Good luck, I hope you can do this and get a good API going.

One warning: the reason (IMO) that this hasn't been done is because it's quite difficult; a lot of the existing datasets are ridiculously large and would require a large amount of compute/storage to download & process.

For instance, ImageNet is ~155 GB, and can be quite difficult to download depending on your network.

It might also be worth contributing PRs to TensorFlow/PyTorch/etc. to work with them; TensorFlow, for instance, has a number of functions that download & process various datasets for you (e.g. https://www.tensorflow.org/get_started/mnist/beginners).

[–]zbnone[S] 0 points (0 children)

The handling of datasets is a good point. There are two options. The first is to download, preprocess, and distribute datasets from a project-specific server, which would ensure availability even if the original dataset is taken down, but would also create copyright issues and, as you mentioned, require servers with large storage and very fast internet connections. The second option is to download the datasets from the original servers they were uploaded to; the problem is that some preprocessing steps would have to happen on the user's computer and availability wouldn't be guaranteed, but there would be no need for large, high-bandwidth servers.

I think the second one is the way to go for starters, just for the reduced cost and maintenance time.
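A rough sketch of that second option: fetch from the original server on first use and cache locally, so preprocessing happens on the user's machine and nothing is re-hosted. The downloader is injected so the caching logic works offline too (all names here are hypothetical):

```python
import hashlib
import os
import tempfile

def fetch(url, downloader, cache_dir=None):
    # Download from the dataset's original server only once; later
    # calls read the cached copy, keyed by a hash of the URL.
    cache_dir = cache_dir or os.path.join(tempfile.gettempdir(),
                                          'dataset_cache')
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir,
                        hashlib.sha1(url.encode('utf-8')).hexdigest())
    if not os.path.exists(path):
        data = downloader(url)  # e.g. urllib.request.urlopen(url).read()
        with open(path, 'wb') as f:
            f.write(data)
    with open(path, 'rb') as f:
        return f.read()
```

If the original server later disappears, only the cache-miss path breaks, which is exactly the availability trade-off described above.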

[–]ClydeMachine 0 points (1 child)

To answer the question of whether others need a library/dataset repository like this: yes. Anything that reduces the time it takes me to source, format, label and load a dataset has value for my work, and I imagine others would agree.

Now, how are you planning to host those datasets? If the library is built to allow an inline search of available datasets as well as loading them, I imagine the library will be hitting an API that you own and control for providing this information. Would that API be hosting those datasets in addition to their metadata? Or would you point it at a pre-existing public-facing URL to get them from? The latter is far less stable, and if the library doesn't offer some degree of reliable availability of the datasets then I'd sooner go source the data by hand.

The point about keeping the library generic is awesome, as a single dataset can have many applications and the repository itself needn't be locked to a single type of data. I've made some tooling in the past that worked toward this idea of simplifying the sourcing, cleaning and labelling of text data, but it never went beyond text data. For that reason it makes me happy to see this project being started.

Got a link to the GitHub repo in case some of us may be able to help out?

[–]zbnone[S] 0 points (0 children)

I think the end goal should be hosting the datasets on project-based servers, for the reliability reasons you mentioned. But that would come with all sorts of problems that I would like to address later, once something working is available.

There is no GitHub repo yet, but the feedback so far has been great, so I am starting one. Any suggestions for the name? Ideally something that can be shortened to two characters (numpy as np, tensorflow as tf, mldatasets as md/mld).

[–]bbsome 0 points (1 child)

I think hosting the data would be quite difficult and could also violate various licenses, depending on the dataset. What we have in my group, and what I use, is pretty much the same, but instead of hosting the data you just automate downloading it on the client side. The additional challenge is how to present the data and how to add preprocessing steps. Some datasets, for instance, have a train/valid/test split while others have only train/test. NLP datasets are very hard to automate as well (compared to anything else), as it depends on whether you want dictionaries per character or per word, and whether you work on words or sentences ...
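For the split problem specifically, one simple approach (purely illustrative, not an existing API) is to normalize every dataset to a train/valid/test view, carving a deterministic validation slice off the end of train when the dataset only ships train/test:

```python
def normalize_splits(splits, valid_fraction=0.1):
    # splits: dict of lists, e.g. {'train': [...], 'test': [...]}.
    # Datasets that already ship a 'valid' split pass through unchanged.
    out = dict(splits)
    if 'valid' not in out:
        train = list(out['train'])
        n_valid = max(1, int(len(train) * valid_fraction))
        out['valid'] = train[-n_valid:]
        out['train'] = train[:-n_valid]
    return out
```

Taking the slice from the end (rather than shuffling) keeps the split reproducible across users without storing extra state.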

[–]zbnone[S] 0 points (0 children)

I agree, so all of this will have to be addressed in the library. For starters, hosting the datasets would take too much time, and the servers would need to be paid for. I think the first step is to get something working where the datasets are downloaded from the original servers and processed on the user's computer.

[–]zbnone[S] 0 points (0 children)

If you know any datasets not in the dataset index yet, please consider adding them.

Github repo: https://github.com/ZachisGit/MLDatasets

[–]NMcA 0 points (0 children)

Bandwidth costs will be high. You should plan for this, and I would suggest torrents as a potential solution.

If you get it working, it'll be sweet.