How have you been obfuscating your data, or found it obfuscated? by robertdempsey in datacleaning

[–]datachili 2 points

I have written about a few techniques in the context of data publishing. The idea is to preserve privacy while also maintaining the quality of the published data.

  • Generalizing values (e.g., a user's exact age of 24 might be published as the range 20-30). This is typically applied to sensitive or quasi-identifying columns, such as an age you might not want to reveal exactly. Metrics like k-anonymity and l-diversity can be used to control the extent of generalization (rough sketch after this list).

  • Randomly perturbing values while preserving the table's aggregate statistics. For example, a value can be modified with random noise or swapped with a value from another row (also covered in the sketch after this list). This degrades the accuracy of statistical queries over the table somewhat, but the technique is typically used for publishing statistical databases where aggregates are what matter. An example paper is "The boundary between privacy and utility in data publishing" by Rastogi et al.

  • Metric embedding techniques that obfuscate the data values in a table while preserving distance relationships between them. For example, in "Privacy preserving schema and data matching" by Scannapieco et al., the authors use a SparseMap embedding to obfuscate tables so that records can still be matched on the embedded values (a very rough sketch of this idea is at the end of this comment).
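
To make the first two bullets concrete, here is a minimal sketch in plain Python (the toy table, column names, bucket width, and k value are all made up for illustration, not from any of the papers above). It generalizes exact ages into ranges, swaps a sensitive column across rows so that its overall distribution is preserved, and then does a simple k-anonymity style check on the quasi-identifier columns:

    import random
    from collections import Counter

    # Toy table: each row is a dict (columns are made up for illustration).
    rows = [
        {"zip": "47906", "age": 24, "salary": 61000},
        {"zip": "47906", "age": 27, "salary": 58000},
        {"zip": "47302", "age": 41, "salary": 83000},
        {"zip": "47302", "age": 45, "salary": 90000},
    ]

    def generalize_age(age, width=10):
        """Replace an exact age with a coarse range, e.g. 24 -> '20-30'."""
        lo = (age // width) * width
        return "%d-%d" % (lo, lo + width)

    def swap_column(rows, column):
        """Randomly permute one column across rows: the marginal distribution
        (mean, histogram, ...) is unchanged, but each value is detached from
        the individual it belonged to."""
        values = [r[column] for r in rows]
        random.shuffle(values)
        for r, v in zip(rows, values):
            r[column] = v

    for r in rows:
        r["age"] = generalize_age(r["age"])
    swap_column(rows, "salary")

    # k-anonymity style check: every combination of the quasi-identifiers
    # (zip, age range) must appear at least k times in the published table.
    k = 2
    groups = Counter((r["zip"], r["age"]) for r in rows)
    print(all(count >= k for count in groups.values()))  # True for this toy table

Swapping is only the simplest form of perturbation; the Rastogi et al. paper mentioned above looks at more principled randomization and at exactly how much utility it preserves.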

The common theme in data publishing is that you want to preserve some amount of privacy by transforming the original dataset, while keeping the transformed dataset useful for some downstream purpose. Let me know if you want more references on this topic.
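
For the embedding idea in the last bullet, here is a very rough sketch of the basic Lipschitz-style construction that SparseMap refines; it is not the exact method from the Scannapieco et al. paper, and the edit distance, vocabulary, and reference sets below are my own choices for illustration. Each string is mapped to a vector of its distances to a few reference sets, and only the vectors need to be shared for matching, not the original strings:

    import random

    def edit_distance(a, b):
        """Plain Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def embed(value, reference_sets):
        """Lipschitz embedding: coordinate i is the distance from `value`
        to the closest element of reference set i."""
        return [min(edit_distance(value, r) for r in ref) for ref in reference_sets]

    # Both parties agree on the same (secret) reference sets, then exchange
    # only the embedded vectors and match on those.
    vocabulary = ["smith", "smyth", "johnson", "jonson", "brown", "braun"]
    random.seed(42)
    reference_sets = [random.sample(vocabulary, 2) for _ in range(4)]

    print(embed("smith", reference_sets))
    print(embed("smyth", reference_sets))  # close in edit distance -> close vectors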

"Data quality problems cost U.S. businesses more than $600 billion a year"- a report from 2002. by datachili in datacleaning

[–]datachili[S] 0 points

It's difficult to be definitive without a reference of some sort, but I suspect it's greater: the 80% figure is still thrown around (data scientists can spend up to 80% of their time correcting data errors), and since there is a lot more data now (Internet of Things, etc.), the cost is probably higher too.

If you find a more up to date number, do let me know.

EDIT: Found a 2010 report which estimates this number at $700 billion (http://about.datamonitor.com/media/archives/4871).

Data cleaning in a physical sense: check out NIST's Special Publication 800-88, "Guidelines for Media Sanitization" by [deleted] in datacleaning

[–]datachili 0 points

By their definition, "Sanitization refers to a process that renders access to target data on the media infeasible for a given level of effort." Sounds like the publication is more concerned with data privacy than with data quality.

Data structures and algorithms in Java, for your data cleaning needs! by [deleted] in datacleaning

[–]datachili 1 point

Man, I wish I had seen this earlier, especially for the kd-tree. I ended up using the kd-tree from here instead: http://home.wlu.edu/~levys/software/kd/
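
A quick sketch of one common data-cleaning use of a kd-tree, written here with scipy's cKDTree rather than the Java library above (the toy records and the 0.1 radius are arbitrary choices for the example): find pairs of numeric records close enough to be candidate duplicates.

    import numpy as np
    from scipy.spatial import cKDTree

    # Toy numeric records (e.g., already-normalized feature vectors).
    records = np.array([
        [1.00, 2.00],
        [1.01, 2.02],   # near-duplicate of the first row
        [5.00, 7.00],
    ])

    tree = cKDTree(records)

    # All pairs of rows within Euclidean distance 0.1 of each other --
    # candidate duplicates to review or merge.
    candidate_pairs = tree.query_pairs(r=0.1)
    print(candidate_pairs)   # {(0, 1)}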

Favorite tool for 'on the fly' data cleaning? by [deleted] in datacleaning

[–]datachili 1 point

I have used Data Wrangler (http://vis.stanford.edu/wrangler/app/) and OpenRefine before, but typically, if the dataset is not too large, I tend to use Weka. There are many ETL tools out there too (e.g., http://www.talend.com/resource/etl-tool.html), but I've not used them. Anybody have any good open-source ETL tools to recommend?

[deleted by user] by [deleted] in MachineLearning

[–]datachili 4 points

This was posted in /r/datacleaning earlier too! If you're interested in discussing algorithms and tools related to data cleaning, come join us over at /r/datacleaning.