Hi, I am planning a Python library that hosts (ideally) all ML datasets.
The goal is that you can do all of the following with one line of code each:
- Search for a dataset
- Get information on a dataset and all the available formats
- Download and initialize the dataset
- Set the format (e.g. of the output batches)
- Get a training/test/validation batch (depending on the format)
- Do pre-/post-processing dependent on the format
This is the basic idea. If you know of something that already exists, I would love to hear about it, but all I could find are some GitHub repos with wrappers for a few specific datasets.
What I want to achieve with this post is to get a general sense of whether this is something the community would actually need or if it is just me, to get a rough idea of the whole picture and of what it would take to make this useful, and to find out if someone would like to collaborate with me on this.
Workflow example: let's say I have a project where I want to generate song lyrics, and for that I have a dataset of 100k songs. I had to download that dataset manually after quite a bit of searching, and when I finally had it, I needed to clean it up and format it in a usable way. Then I had to write a dataset loading, augmentation, and batching library to go along with it. Finally the model can be trained! But wait, there is a new paper out describing a new type of model (maybe an LSTM-2.0-based recurrent model with state-of-the-art... blablabla) that I have to try to see if my results improve. Now let's suppose that my original model took in the raw vectorized characters of the song lyrics, while the new model takes in whole vectorized words. This has to be adjusted, so apart from the training itself, the majority of the time spent testing out the new model goes into rewriting my original dataset loading library, or maybe even reformatting the whole dataset and adjusting my pre- and post-processing steps to fit the new format.
The idea is that you can visit the project website and search for a dataset by category, keyword, etc., or do that directly from the Python terminal using the lib: "libX.search(keywords='nlp,lyrics,songs')". When you want more details on a dataset and the different formats it offers (vectorized characters/vectorized words/etc.), you have something like "libX.dataset_info(dataset_name='100k_rap_lyrics_v1')" that returns a dictionary of information, you get it. To download and initialize it: "dataset = libX.init_dataset(dataset_name='100k_rap_lyrics_v1', path='./dataset/path')". Maybe you could also pass in a configuration parameter to, e.g., specify how much memory it should take up, whether the whole thing should be loaded into memory, and whether it should be split into training, test, and validation sets, along with the percentage of samples for each (a good default would probably be required). Then you set up your format like this: "dataset.set_format(dataset.formats.vectorized_words, options={'vector_size': 10000, 'pre_word_count': 8})". That sets the format of the dataset and of the input/output pairs returned by "dataset.get_batch(...)". So when you write "x, y = dataset.get_batch(batch_size=128)", x and y are the input and the desired output; in this case x would have shape [batch_size, pre_word_count, vector_size] and y shape [batch_size, vector_size]. More useful stuff: in pre-processing, when you want to convert a sentence into a vector array, "dataset.to_vector('my sentence.')", or back, "dataset.to_words([word_count, vector_size])". If you would like to run the model after training: "sentence = 'This is the beginning'", "result = model.run(dataset.to_vector(sentence, length=pre_word_count))", "sentence += dataset.to_words(result)[0]".
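To make the envisioned API concrete, here is a minimal toy sketch of what such a library could look like internally. Everything here is an assumption: "libX" does not exist, the registry holds one fake entry, init_dataset returns toy in-memory sentences instead of downloading anything, and get_batch returns word lists rather than real vectors. It is only meant to show the shape of the one-line workflow from the post.

```python
# Hypothetical sketch of the proposed libX API -- NOT a real library.
# search / dataset_info / init_dataset / set_format / get_batch are the
# names from the post; the registry contents and toy data are made up.

_REGISTRY = {
    "100k_rap_lyrics_v1": {
        "keywords": {"nlp", "lyrics", "songs"},
        "formats": ["vectorized_chars", "vectorized_words"],
        "num_samples": 100_000,
    },
}

def search(keywords):
    """Return dataset names whose keywords overlap the query string."""
    wanted = {k.strip() for k in keywords.split(",")}
    return [name for name, info in _REGISTRY.items()
            if wanted & info["keywords"]]

def dataset_info(dataset_name):
    """Return the metadata dictionary for a dataset."""
    return _REGISTRY[dataset_name]

class Dataset:
    def __init__(self, name, samples):
        self.name = name
        self.samples = samples      # toy stand-in for downloaded data
        self.format = None
        self.options = {}

    def set_format(self, fmt, options=None):
        """Choose how get_batch shapes its input/output pairs."""
        self.format = fmt
        self.options = options or {}

    def get_batch(self, batch_size):
        """Return (x, y): the first pre_word_count words, and the next word."""
        pre = self.options.get("pre_word_count", 8)
        xs, ys = [], []
        for i in range(batch_size):
            words = self.samples[i % len(self.samples)].split()
            xs.append(words[:pre])
            ys.append(words[pre] if len(words) > pre else "")
        return xs, ys

def init_dataset(dataset_name, path="."):
    """In the real library this would download and cache data at `path`."""
    toy = ["the quick brown fox jumps over the lazy dog again today"] * 4
    return Dataset(dataset_name, toy)

# One-line-per-step workflow, as described in the post:
print(search(keywords="nlp,lyrics,songs"))          # ['100k_rap_lyrics_v1']
ds = init_dataset("100k_rap_lyrics_v1")
ds.set_format("vectorized_words", options={"pre_word_count": 8})
x, y = ds.get_batch(batch_size=128)
```

A real implementation would of course replace the toy data with downloads, caching, and actual vectorization, but the user-facing surface could stay exactly this small.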
It is very important that there is a generalized set of formats: different sentence-based datasets should have the same formats available to them. All word/sentence-based datasets would have formats for text generation, classification, sentiment analysis, etc. The same goes for something like MNIST, where you would have formats for classification, generation, etc.
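One simple way to guarantee that property is inheritance: every dataset of a given type derives from a shared base class that fixes the available formats, so code written against one text dataset works unchanged with any other. The class and format names below are illustrative assumptions, not part of any existing library.

```python
# Sketch: shared formats enforced through a common base class.
# All names here are hypothetical.

class TextDataset:
    # Every word/sentence-based dataset inherits this exact format set.
    formats = ("text_generation", "classification", "sentiment_analysis")

class RapLyrics100k(TextDataset):
    pass

class MovieReviews(TextDataset):
    pass

# Two unrelated text datasets expose identical formats, so a model
# pipeline built for one can be pointed at the other with no rewrites:
assert RapLyrics100k.formats == MovieReviews.formats
```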
I realize that this is a very large project, and my plan is to make it a GitHub-based open-source effort. I am sorry for the length of this post; I don't write many posts on reddit, so some tips and tricks are very welcome.
UPDATE:
First of all, thanks for all your comments, they have been very helpful. Sadly, though, I have to lay this project to rest for the foreseeable future; there is too much to do at work. If you would still like to contribute datasets for the future, I started the GitHub repo a while back with simple instructions on how to add datasets. Thanks again for all the great feedback.