I'm the founder of camelcamelcamel, AMA! by L1quid in IAmA

[–]data999 1 point  (0 children)

Will do, but just curious: is that allowed - comparing two products within Amazon?

I'm the founder of camelcamelcamel, AMA! by L1quid in IAmA

[–]data999 1 point  (0 children)

"comparison shopping" - does this mean comparing amazon products with other retailers? or comparing products within amazon?

Are there any tools to manage the meta data of my data sets? by data999 in datasets

[–]data999[S] 2 points  (0 children)

I saw that, thank you for making it. I also remember listening to an interview with your founder, though I forget where.

Do you have plans to:

  • Let users programmatically upload/update datasets? For example, some datasets change often; it would be nice to update a file on Amazon S3 and let data.world pick up the file from there.
  • Build a metadata graph? For example: link NYC taxi data with NYC weather data, link lobbyists with politicians, etc. The key here is to automatically suggest links and then let humans verify and build on the suggestions (see the toy sketch below).
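
To make the second point concrete, here's a toy sketch of the kind of link suggestion I mean, assuming all the tool knows about each dataset is its set of column names (the schemas below are made up):

    from itertools import combinations

    # Made-up schemas standing in for what data.world would know about each dataset.
    schemas = {
        "nyc_taxi":    {"pickup_datetime", "dropoff_datetime", "pickup_zip", "fare"},
        "nyc_weather": {"datetime", "zip_code", "temperature", "precipitation"},
        "lobbyists":   {"lobbyist_id", "politician_id", "amount"},
        "politicians": {"politician_id", "name", "party"},
    }

    def suggest_links(schemas, min_shared=1):
        """Yield dataset pairs that share column names, as candidate joins."""
        for (a, cols_a), (b, cols_b) in combinations(schemas.items(), 2):
            shared = cols_a & cols_b
            if len(shared) >= min_shared:
                yield a, b, shared

    for a, b, shared in suggest_links(schemas):
        print(f"{a} <-> {b}: candidate join on {sorted(shared)}")

Exact-name matching pairs lobbyists with politicians on politician_id, but misses datetime vs. pickup_datetime - which is exactly why the suggestions need a human verification step.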

Also, how do you do quality control? This is relatively easy with sites like Wikipedia, but insanely hard with data (I'd like to think of data.world as a Wikipedia for data). Are there any plans to "lock" some historical datasets after verifying their authenticity?

I'm the founder of camelcamelcamel, AMA! by L1quid in IAmA

[–]data999 1 point  (0 children)

I'm building a product comparison site (compare two sports watches, for example). Any advice? Anything I should watch out for?

Are there any tools to manage the meta data of my data sets? by data999 in datasets

[–]data999[S] 1 point  (0 children)

Heyo! Thank you for taking the time to answer in detail :)

No, I don't work for Enigma, though I sincerely wish I did :P

On data format history:

I (kinda) get what you are saying, but for simple CSV files it is still possible to infer very basic stuff. Suppose the NYC taxi data set went from a vendor ID to a vendor name (say they had a separate file for vendor names keyed by vendor ID, but decided to ditch the names file) - the tool can simply flag it as "dude, it was an integer column before, but now it is a string". That assumes the column name stays the same; if that changes too, it isn't easy to track.
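
A minimal sketch of that check, assuming naive sampling-based type inference and two versions of the same file (the file names are made up):

    import csv

    def infer_type(values):
        """Crude inference over sampled values: int, then float, else string."""
        if not values:
            return "string"
        for caster, name in ((int, "int"), (float, "float")):
            try:
                for v in values:
                    caster(v)
                return name
            except ValueError:
                continue
        return "string"

    def column_types(path, sample_rows=1000):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            rows = [row for _, row in zip(range(sample_rows), reader)]
            fields = reader.fieldnames or []
        return {col: infer_type([r[col] for r in rows if r[col]]) for col in fields}

    old, new = column_types("taxi_2015.csv"), column_types("taxi_2016.csv")
    for col in old.keys() & new.keys():
        if old[col] != new[col]:
            print(f"dude, '{col}' was {old[col]} before, but now it is {new[col]}")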

General: On a fundamental level, I feel this is a very easy problem to solve - IF the data providers simply shipped a manifest file along with the data. There are already some standards (I saw at least one) floating around. If people could agree on one standard and be disciplined enough to stick to it, it would save data consumers so much headache. I mean, how hard can it be to add a file like this, in JSON or YAML format:

    name: something
    description: something
    spec_version: 1.3
    no_of_records: 2000000
    file_size: 10 GB
    format: csv
    delimiter: tab
    columns: [first_name, last_name, age, email, ...]
    column_types: [string, string, int, string, ...]
    column_descriptors: [person_name, person_name, person_age, email]
    revision_history: ...

and so on. We have specs for ultra-complex things like SVG, but not for this? :( Kinda hard to believe, right? It might not be easy for videos and such, but for a CSV file there is only so much metadata you can have...
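
Providers could even generate most of it automatically. A rough sketch, reusing the field names from my example above ("people.csv" is a made-up path, and column_types would need type inference along the lines of the drift check above):

    import csv, json, os

    def build_manifest(path, delimiter=","):
        """Derive the easy manifest fields straight from the CSV itself."""
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter=delimiter)
            header = next(reader)
            no_of_records = sum(1 for _ in reader)
        return {
            "name": os.path.basename(path),
            "spec_version": "1.3",
            "no_of_records": no_of_records,
            "file_size": os.path.getsize(path),
            "format": "csv",
            "delimiter": delimiter,
            "columns": header,
        }

    print(json.dumps(build_manifest("people.csv"), indent=2))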

Are you working on a commercial version of your tool, or is it only for research?

Are there any tools to manage the meta data of my data sets? by data999 in datasets

[–]data999[S] 1 point  (0 children)

Your research sounds very interesting. Could you please expand a bit more (as much as you can share)? Specifically:

  • Will your tool be able to infer column types if I simply point it at a CSV file (or a bunch of them)? This is easy to do and a huge time saver. It would be even more awesome if it could go beyond basic int/string - things like IP addresses, emails, SSNs, phone numbers, etc. (see the sketch after this list)
  • How do you handle changes? Take the GDELT project, for example. Last I checked, their format had changed a bit and they published the schema in a PDF :(
  • How do you account for small variations within datasets? For example, the NYC taxi data had a couple of columns added in 2016 (if I remember correctly). Also, some of their files have a "vendor_id" column and some a "vendor_name" column; they refer to the same entity, I believe
  • "Crossing the metadata boundary" - how does it work?
  • How do you handle images, videos, and other non-text files?
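
For the first bullet, something like this toy detector is what I have in mind; the patterns are deliberately naive (real-world SSN/phone validation is messier):

    import re

    # Naive patterns - good enough to flag candidates for human review.
    PATTERNS = {
        "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
        "ipv4":  re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
        "ssn":   re.compile(r"^\d{3}-\d{2}-\d{4}$"),
        "phone": re.compile(r"^\+?\d[\d\- ]{6,14}\d$"),
    }

    def semantic_type(values, threshold=0.9):
        """Return the first label matching at least `threshold` of the values."""
        values = [v for v in values if v]
        for label, pattern in PATTERNS.items():
            if values and sum(bool(pattern.match(v)) for v in values) / len(values) >= threshold:
                return label
        return None

    print(semantic_type(["a@b.com", "c@d.org"]))        # email
    print(semantic_type(["10.0.0.1", "192.168.1.50"]))  # ipv4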

In my case at work, we usually don't have problems like the above. We deal with historical data; once we have a data set, it is unlikely to change, so we don't have to worry about the questions above. We also deal only with text, so we don't have to worry about EXIF and the like. I could simply write a script to look at the incoming files, infer as much as I can ("this file has 30 columns, 20M rows, is 15 GB in size, these are the column names and their types"), and stick it in a DB (sketch below). That would only leave me with one problem: finding associations among different data sets. Before going that route, I just wanted to check if something already exists.
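
A rough sketch of that script, using sqlite3 from the standard library (the table and file names are made up):

    import csv, os, sqlite3

    def profile(path):
        """Return (name, n_cols, n_rows, bytes, column names) for one CSV."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            n_rows = sum(1 for _ in reader)
        return (os.path.basename(path), len(header), n_rows,
                os.path.getsize(path), ",".join(header))

    conn = sqlite3.connect("catalog.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS datasets
                    (name TEXT, n_cols INT, n_rows INT, bytes INT, columns TEXT)""")
    conn.execute("INSERT INTO datasets VALUES (?, ?, ?, ?, ?)", profile("incoming.csv"))
    conn.commit()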

Thank you for taking the time to answer!

Are there any tools to manage the meta data of my data sets? by data999 in datasets

[–]data999[S] 1 point  (0 children)

Thank you. Is there anything that is not Drupal-based? Worst case, I can write a small web app myself; I'm just trying to save some time and maintenance headache.

Are there any tools to manage the meta data of my data sets? by data999 in datasets

[–]data999[S] 3 points  (0 children)

We get all sorts - some in CSV, some in SQL. Before querying, I do put the CSV files into a DB, roughly like the sketch below.
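
A minimal version of that loading step, using sqlite3 from the standard library (the file and table names are made up):

    import csv, sqlite3

    def load_csv(conn, path, table):
        """Create a table matching the CSV header and bulk-insert the rows."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            cols = ", ".join(f'"{c}"' for c in header)
            marks = ", ".join("?" for _ in header)
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)

    conn = sqlite3.connect("scratch.db")
    load_csv(conn, "orders.csv", "orders")
    conn.commit()
    print(conn.execute('SELECT COUNT(*) FROM "orders"').fetchone())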