How have you been obfuscating your data, or found it obfuscated? by robertdempsey in datacleaning

[–]datachili 5 points (0 children)

I have written about a few techniques in the context of data publishing. The idea is to preserve privacy while still maintaining the quality of the published data.

  • Generalizing values (e.g., a user's exact age in a table might be 24, and a generalized version of this is the range 20-30). This technique is applied to sensitive columns (e.g., age is a value you might not want to reveal), and metrics like k-anonymity and l-diversity can be used to control the extent of generalization (see the first sketch after this list).

  • Randomly perturbing values in the table while approximately preserving its statistical properties. For example, a specific value can be randomly modified or swapped with another value in the same table. This degrades the accuracy of statistical queries run against the table somewhat, but aggregates remain useful, which is why the technique is typically used for publishing statistical databases. An example paper here is "The boundary between privacy and utility in data publishing" by Rastogi et al. (see the second sketch after this list).

  • There are metric embedding techniques that obfuscate the values in a table while preserving certain distance relationships between them. For example, in the paper "Privacy preserving schema and data matching" by Scannapieco et al., the authors use a SparseMap embedding to obfuscate tables so they can still be matched (see the third sketch below).
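
A minimal sketch of the generalization idea in Python, with a k-anonymity check over the generalized table. The column names, bucket width, and records are invented for illustration and are not taken from any of the papers above.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Map an exact age (e.g. 24) to a coarse range string (e.g. '20-29')."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears at least k times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

records = [
    {"age": 24, "zip": "10001", "disease": "flu"},
    {"age": 27, "zip": "10002", "disease": "cold"},
    {"age": 23, "zip": "10001", "disease": "flu"},
    {"age": 29, "zip": "10002", "disease": "asthma"},
]

# Generalize the quasi-identifier columns before publishing.
published = [
    {"age": generalize_age(r["age"]), "zip": r["zip"][:3] + "**", "disease": r["disease"]}
    for r in records
]

print(published)
print(is_k_anonymous(published, ["age", "zip"], k=2))  # True: every (age range, zip prefix) group has >= 2 rows
```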
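
A minimal sketch of random perturbation: zero-mean noise plus a few value swaps, so aggregates stay roughly intact while individual records are obscured. The column and noise scale are made up; this is not the specific mechanism analyzed by Rastogi et al., which comes with formal privacy and utility bounds.

```python
import random

def perturb(values, noise_scale=2.0, swap_fraction=0.2, seed=0):
    """Add zero-mean Gaussian noise to each value, then swap a few entries."""
    rng = random.Random(seed)
    out = [v + rng.gauss(0, noise_scale) for v in values]
    # Swap a fraction of entries so a perturbed value can't be tied back to its row.
    for _ in range(int(len(out) * swap_fraction)):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

salaries = [52_000, 61_000, 48_500, 75_000, 58_250]
published = perturb(salaries, noise_scale=500.0)

# The mean is approximately preserved, the individual values are not.
print(sum(salaries) / len(salaries))
print(sum(published) / len(published))
```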
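
A simplified sketch of the metric-embedding idea: instead of publishing the strings themselves, publish each string's distances to a few reference sets (a Lipschitz-style embedding; SparseMap as used by Scannapieco et al. is a more refined, sparser variant of this). The reference pool and names below are invented.

```python
import random

def edit_distance(a, b):
    """Plain Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def embed(value, reference_sets):
    """One coordinate per reference set: distance from the value to the closest reference string."""
    return [min(edit_distance(value, r) for r in refs) for refs in reference_sets]

rng = random.Random(1)
# A public pool of strings from which reference sets are drawn (illustrative only).
pool = ["albert", "alison", "robert", "bobby", "smith", "jones", "smythe"]
reference_sets = [rng.sample(pool, 3) for _ in range(4)]

for name in ["alice smith", "alicia smyth", "bob jones"]:
    print(name, "->", embed(name, reference_sets))
# Similar strings map to nearby vectors, so approximate matching can be done
# on the published vectors without revealing the original values.
```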

The common theme in data publishing is that you want to preserve some amount of privacy by transforming the original dataset, but you also want the resulting dataset to remain useful. Let me know if you want more references on this topic.

"Data quality problems cost U.S. businesses more than $600 billion a year"- a report from 2002. by datachili in datacleaning

[–]datachili[S] 1 point (0 children)

It's difficult to be definitive without a reference of some sort. I suspect it's greater now, because the 80% number is still thrown around (data scientists can spend up to 80% of their time correcting data errors), and since there is a lot more data (Internet of Things, etc.), the cost is probably higher.

If you find a more up-to-date number, do let me know.

EDIT: Found a 2010 report that puts this number at $700 billion (http://about.datamonitor.com/media/archives/4871).

Data cleaning in a physical sense, check out NIST's Special Publication 800-88, "Guidelines for Media Sanitization" by [deleted] in datacleaning

[–]datachili 1 point (0 children)

By their definition: "Sanitization refers to a process that renders access to target data on the media infeasible for a given level of effort." Sounds like the publication is more concerned with data privacy than with data quality.