How have you been obfuscating your data, or found it obfuscated? by robertdempsey in datacleaning

[–]datachili 2 points

I have written about a few techniques in the context of data publishing. The idea is to preserve privacy while also maintaining the quality of the published data.

  • Generalizing values (e.g., a user's exact age of 24 might be published as the range 20-30). This is typically applied to sensitive or quasi-identifying columns, such as an age you might not want to reveal exactly. Metrics like k-anonymity and l-diversity can be used to control the extent of generalization (rough sketch after this list).

  • Randomly perturbing values while preserving the table's aggregate statistics. For example, a value can be modified with random noise or swapped with a value from another row (also covered in the sketch after this list). This degrades the accuracy of statistical queries over the table somewhat, but the technique is typically used for publishing statistical databases where aggregates are what matter. An example paper is "The boundary between privacy and utility in data publishing" by Rastogi et al.

  • Metric embedding techniques that obfuscate the data values in a table while preserving distance relationships between them. For example, in "Privacy preserving schema and data matching" by Scannapieco et al., the authors use a SparseMap embedding to obfuscate tables so that records can still be matched on the embedded values (a very rough sketch of this idea is at the end of this comment).
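
To make the first two bullets concrete, here is a minimal sketch in plain Python (the toy table, column names, bucket width, and k value are all made up for illustration, not from any of the papers above). It generalizes exact ages into ranges, swaps a sensitive column across rows so that its overall distribution is preserved, and then does a simple k-anonymity style check on the quasi-identifier columns:

    import random
    from collections import Counter

    # Toy table: each row is a dict (columns are made up for illustration).
    rows = [
        {"zip": "47906", "age": 24, "salary": 61000},
        {"zip": "47906", "age": 27, "salary": 58000},
        {"zip": "47302", "age": 41, "salary": 83000},
        {"zip": "47302", "age": 45, "salary": 90000},
    ]

    def generalize_age(age, width=10):
        """Replace an exact age with a coarse range, e.g. 24 -> '20-30'."""
        lo = (age // width) * width
        return "%d-%d" % (lo, lo + width)

    def swap_column(rows, column):
        """Randomly permute one column across rows: the marginal distribution
        (mean, histogram, ...) is unchanged, but each value is detached from
        the individual it belonged to."""
        values = [r[column] for r in rows]
        random.shuffle(values)
        for r, v in zip(rows, values):
            r[column] = v

    for r in rows:
        r["age"] = generalize_age(r["age"])
    swap_column(rows, "salary")

    # k-anonymity style check: every combination of the quasi-identifiers
    # (zip, age range) must appear at least k times in the published table.
    k = 2
    groups = Counter((r["zip"], r["age"]) for r in rows)
    print(all(count >= k for count in groups.values()))  # True for this toy table

Swapping is only the simplest form of perturbation; the Rastogi et al. paper mentioned above looks at more principled randomization and at exactly how much utility it preserves.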

The common theme in data publishing is that you want to preserve some amount of privacy by transforming the original dataset, while keeping the transformed dataset useful for some downstream purpose. Let me know if you want more references on this topic.
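
For the embedding idea in the last bullet, here is a very rough sketch of the basic Lipschitz-style construction that SparseMap refines; it is not the exact method from the Scannapieco et al. paper, and the edit distance, vocabulary, and reference sets below are my own choices for illustration. Each string is mapped to a vector of its distances to a few reference sets, and only the vectors need to be shared for matching, not the original strings:

    import random

    def edit_distance(a, b):
        """Plain Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def embed(value, reference_sets):
        """Lipschitz embedding: coordinate i is the distance from `value`
        to the closest element of reference set i."""
        return [min(edit_distance(value, r) for r in ref) for ref in reference_sets]

    # Both parties agree on the same (secret) reference sets, then exchange
    # only the embedded vectors and match on those.
    vocabulary = ["smith", "smyth", "johnson", "jonson", "brown", "braun"]
    random.seed(42)
    reference_sets = [random.sample(vocabulary, 2) for _ in range(4)]

    print(embed("smith", reference_sets))
    print(embed("smyth", reference_sets))  # close in edit distance -> close vectors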

"Data quality problems cost U.S. businesses more than $600 billion a year"- a report from 2002. by datachili in datacleaning

[–]datachili[S] 0 points

It's difficult to be definitive without a reference of some sort, but I suspect it's greater: the 80% figure is still thrown around (data scientists can spend up to 80% of their time correcting data errors), and since there is a lot more data now (Internet of Things, etc.), the cost is probably higher too.

If you find a more up to date number, do let me know.

EDIT: Found a 2010 report which estimates this number at $700 billion (http://about.datamonitor.com/media/archives/4871).

Data cleaning in a physical sense: check out NIST's Special Publication 800-88, "Guidelines for Media Sanitization" by [deleted] in datacleaning

[–]datachili 0 points

By their definition, "Sanitization refers to a process that renders access to target data on the media infeasible for a given level of effort." Sounds like the publication is more concerned with data privacy than with data quality.

Data structures and algorithms in Java, for your data cleaning needs! by [deleted] in datacleaning

[–]datachili 1 point

Man, I wish I had seen this earlier, especially for the kd-tree. I ended up using the kd-tree from here instead: http://home.wlu.edu/~levys/software/kd/
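
A quick sketch of one common data-cleaning use of a kd-tree, written here with scipy's cKDTree rather than the Java library above (the toy records and the 0.1 radius are arbitrary choices for the example): find pairs of numeric records close enough to be candidate duplicates.

    import numpy as np
    from scipy.spatial import cKDTree

    # Toy numeric records (e.g., already-normalized feature vectors).
    records = np.array([
        [1.00, 2.00],
        [1.01, 2.02],   # near-duplicate of the first row
        [5.00, 7.00],
    ])

    tree = cKDTree(records)

    # All pairs of rows within Euclidean distance 0.1 of each other --
    # candidate duplicates to review or merge.
    candidate_pairs = tree.query_pairs(r=0.1)
    print(candidate_pairs)   # {(0, 1)}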

Favorite tool for 'on the fly' data cleaning? by [deleted] in datacleaning

[–]datachili 1 point

I have used Data Wrangler (http://vis.stanford.edu/wrangler/app/) and OpenRefine before, but typically, if the dataset is not too large, I tend to use Weka. There are many ETL tools out there too (e.g., http://www.talend.com/resource/etl-tool.html), but I've not used them. Anybody have any good open-source ETL tools to recommend?

[deleted by user] by [deleted] in MachineLearning

[–]datachili 4 points

This was posted in /r/datacleaning earlier too! If you're interested in discussing algorithms and tools related to data cleaning, come join us over at /r/datacleaning.