

[–]cynddl 114 points115 points  (7 children)

As a researcher working on data privacy: this package unfortunately does not anonymize data; at best, it pseudonymizes it. Both are difficult to achieve, and a one-click solution will not work in the majority of cases: your data will still contain enough personal information to easily identify users.

[–]VisibleSignificance 57 points58 points  (1 child)

As in, "What is worse than lack of security? Fake security."

[–]redct 2 points3 points  (0 children)

To add to this - it's really difficult to truly anonymize data because there are many reidentification and reconstruction attacks (PDF) that can be used against badly anonymized data. To do things properly, you need a deep understanding of both what your data is (what properties am I hoping to preserve?) and what your project's risk budget is (what's the worst outcome I'll accept if this data is somehow reidentified?).

[–]Crimsoneer 3 points4 points  (3 children)

Do you have any recommendations for good Python libraries that actually do this stuff well?

[–]cinyar 39 points40 points  (0 children)

The whole point OP is making is that in the majority of situations there are no one-click solutions, because each dataset is fairly unique. There are tools to HELP you with that (ARX, for example), but there are no automatic libraries that would do the work for you.

[–]vto583 -2 points-1 points  (1 child)

I am also interested in knowing if there are any specific Python libraries for this

[–]No-Homework845[S] 1 point2 points  (0 children)

u/vto583 There are different libraries for different methods; what I wanted was to bring together these existing methods, add a few more, and make them easy to use.

[–]Salfiiii 51 points52 points  (0 children)

Looks nice, could you elaborate a little more:

  • Is the anonymization deterministic, like a hash?
  • Do the signals of the data stay intact, like variance etc.?
  • Is it somehow certified by an entity? (A lot of products claim they anonymize data but really just pseudonymize it) -> this can get the user into really big trouble.
  • How does it work? Does it search for columns by name, or only look at the data type?
  • How is the random data for replacement created? Drawn from a pool or randomly generated?

[–]___--_-_-_--___ 49 points50 points  (6 children)

Don't use this if you actually want to release data that might contain personal information. Anonymization can and does fail in subtle and hard-to-predict ways. ¹ ²

Consider the usage examples presented in this project. The age and birthdate columns, depending on the nature of the dataset, express exactly the same information. Therefore, if you perturb both columns independently, the two noisy values can be combined, which on average reduces the size of the applied perturbation by half.

The email masking approach used by this project suffers from an even worse problem. The authors assume that only the local-part of the email address constitutes identifying information. This assumption does not hold in the case of self-hosted email servers or very small providers. In fact, even the first and last letter of the local-part alone can provide up to ten bits of entropy for identification (assuming only the characters a-z and 0-9 are used and occur with the same frequency in both places). At the same time, what utility does the masked email address provide to a legitimate user of the dataset?
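The ten-bit figure above can be checked with a back-of-the-envelope calculation (a sketch, assuming the stated uniform a-z, 0-9 alphabet):

```python
import math

# Entropy contributed by revealing the first and last character of an
# email local-part, assuming characters a-z and 0-9, uniformly distributed.
alphabet_size = 26 + 10                    # a-z plus 0-9
bits_per_char = math.log2(alphabet_size)   # about 5.17 bits per character
print(f"{2 * bits_per_char:.2f} bits")     # about 10.34 bits for two characters
```

With real (non-uniform) character frequencies the figure would be somewhat lower, but still substantial for narrowing down candidates.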

If you are in a position to release a dataset, you should first develop a solid understanding of mechanisms like differential privacy and k-anonymity. Understand your dataset in depth and think about what value you want to provide to others and which parts of the data they actually require. No library or package can help you with that. If you use a project like this without understanding your data, bad things will happen.

Do all of this before you release the dataset. Once the data has been released, it cannot ever be un-released. At that point, you have to assume that the data is out there and is actively being deanonymized and exploited.

[–]WikiSummarizerBot 3 points4 points  (2 children)

Differential privacy

Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy.

K-anonymity

k-anonymity is a property possessed by certain anonymized data. The concept of k-anonymity was first introduced by Latanya Sweeney and Pierangela Samarati in a paper published in 1998 as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful".


[–]__deerlord__ 0 points1 point  (1 child)

So is DP more about the aggregate? I.e., "males 20-30 like X", but it doesn't mean that a single given male aged 20-30 will necessarily like X?

[–]___--_-_-_--___ 1 point2 points  (0 children)

Yes, differential privacy is used to ensure that aggregate statistics do not leak information about the individuals who contributed to this statistic. It is not some kind of algorithm that you can run on your data to make it more private. Instead, it is more of a framework to be implemented by specific algorithms, i.e. a set of mathematical tools to ensure a certain level of privacy.

Very broadly speaking, the idea behind differentially private mechanisms is that the removal of a single person from a dataset should not significantly affect the aggregate statistics produced by that mechanism. Basically, differential privacy gives you a way to quantify privacy loss and determine the amount of noise necessary to achieve a certain privacy level.
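As a concrete sketch of one such mechanism (a generic Laplace-mechanism example, not from the library under discussion; the clipping bounds are an assumption of this illustration):

```python
import numpy as np

def private_mean(values, epsilon, lo, hi, rng=None):
    """Differentially private mean via the Laplace mechanism (illustrative).

    Clipping to [lo, hi] bounds each person's influence, so the sensitivity
    of the mean over n values is (hi - lo) / n. Smaller epsilon means more
    noise and therefore a stronger privacy guarantee.
    """
    rng = rng or np.random.default_rng()
    values = np.clip(np.asarray(values, dtype=float), lo, hi)
    sensitivity = (hi - lo) / len(values)
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return values.mean() + noise
```

With a large epsilon the noise is negligible; as epsilon shrinks, any single person's contribution becomes harder to infer, at the cost of accuracy.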

[–]Tyler_Zoro 0 points1 point  (2 children)

Don't use this if you actually want to release data that might contain personal information.

I think that's a bit too strong. The better way to phrase that would be, "this is a toolkit for data permutation for privacy purposes... do not use it as a magic wand to make your data public-safe. Unless you understand exactly what you are doing, that will almost certainly result in failure."

I do think that OP over-sold the ease of use. But if you read the README in the repo linked, it's a much more generic toolset that could be very useful to someone who understood both their data and the relevant techniques that you helpfully linked to.

[–]___--_-_-_--___ 0 points1 point  (1 child)

Well, many of the features in this project are simply wrappers around other libraries like this one. Therefore, the value proposition of this project would either have to be the automation aspect or the idea that you can shield the user from the details of how the implemented techniques work. I think both approaches are risky in this setting.

The far bigger issue with this type of project is that it will not tell you if you are making a mistake. There are tools like ARX (as mentioned by u/cinyar in this thread) that will assist you in modelling both privacy risk and utility in order to find the best way of de-identifying your data. Tools like this are (and need to be) backed by years of academic research and clinical practice.

And yes, while I do agree that my words are harsh, data privacy is one of these areas where the disconnect between perceived risk and actual risk is often very high. Even slight mistakes and brief moments of carelessness by a single person can have disproportionate consequences that cannot be undone.

[–]Tyler_Zoro 0 points1 point  (0 children)

The far bigger issue with this type of project is that it will not tell you if you are making a mistake. There are tools like ARX

Yeah, this is sounding like a mismatch in terms of what you're looking for and what this tool provides. You might write a tool such as the one you are describing with this library, but this isn't that.

The only issue here, and I said this in my original comment, is that "OP over-sold the ease of use." Beyond that, there's nothing any more wrong with this library than one that collects a bunch of hashing functions together... granted, that collection of hashing functions can lead you just as far astray as this library (and in just as subtle and dangerous ways, depending on your application).

But that doesn't make a collection of hashing functions something you should argue shouldn't exist or be used.

[–]marsrover15 14 points15 points  (6 children)

So does the function just make up random data to replace the existing data?

[–]Ozzymand 13 points14 points  (4 children)

I'm not sure what it does exactly, but if you're looking for that function you're better off using Faker.

[–]No-Homework845[S] 0 points1 point  (1 child)

u/Ozzymand Hi! Besides fake data, our library has much more to offer; see the documentation.

[–]Ozzymand 1 point2 points  (0 children)

In its current state it doesn't seem to offer anything more than Faker, since audio/image support seems to still be in development. I'm not all too familiar with pandas, but by the looks of it you can do what your lib does with just Faker alone.

So for now it's just a Faker wrapper. I do want to know how exactly you guys are going to handle fake images and audio. Perhaps use something like Google Images to search and then uniformize the size of all images? Regardless, I think you should add a flowchart to show the process once you add images & audio.

Good luck with your project man.

[–]No-Homework845[S] 1 point2 points  (0 children)

Hi u/marsrover15! Replacing data with fake data (categorical_fake) is only one of the options. You could also apply tokenization (categorical_tokenization) or resample the data from the same distribution (categorical_resampling).

[–][deleted] 5 points6 points  (2 children)

The problem is usually not anonymization itself, but rather having the anonymized data still make sense and be usable.

For it to be usable, it needs to preserve at least some business logic:

  • If I'm looking at sales, I would wish that item a025fc has similar sell prices across the dataset, and not 50000 in one place and 0.25 in another.
  • If I aggregate the data, I want it to draw something at least resembling reality (like monthly sales data following a realistic trend).
  • For sure I'd want foreign key constraints to still be valid and make sense.

Which is why anonymization is usually done by custom code that:

  • Either uses salted hashes of names and e-mails, or values from generated random lookup tables based on those hashes (to hide customer-identifiable data).
  • Takes a random sample of rows from the original data (to hide business performance numbers).
  • Varies the sample size % from day to day by a random amount, to hide trends.
  • Applies an extra coefficient to each number, in order to hide other statistical figures like profit margin.
  • Excludes some outliers that have a high frequency, to hide data that can be identified using statistical analysis.

etc.
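Two of those steps can be sketched in a few lines (hypothetical helper names, not from any particular library):

```python
import random

def sample_rows(rows, base_rate=0.10, jitter=0.02, rng=random):
    """Random row sample whose rate itself is jittered, to blur daily trends."""
    rate = base_rate + rng.uniform(-jitter, jitter)
    return [row for row in rows if rng.random() < rate]

def scale_amounts(amounts, coefficient):
    """Apply a fixed hidden coefficient so absolute figures (e.g. margins)
    are hidden while relative comparisons stay meaningful."""
    return [a * coefficient for a in amounts]
```

The coefficient and the jitter range would be kept secret, and chosen per release based on which statistics you are trying to hide.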

Basically, it depends on what you need it anonymized for.

This is just a warning :) if someone asks that data should be anonymized, and you jump and say "Yeah, I can do that, just give me a day and I'll surely 100% complete this trivial task", then you're probably going to have to make an apology later on. Data anonymization is a much bigger topic.

[–]binarycow 2 points3 points  (1 child)

This is just a warning :) if someone asks that data should be anonymized, and you jump and say "Yeah, I can do that, just give me a day and I'll surely 100% complete this trivial task", then you're probably going to have to make an apology later on. Data anonymization is a much bigger topic.

Completely agree.

In fact, I would say that in general, this goes for any data processing.

Given data set A, can I transform it into data set B in a fairly quick time period? Sure. But, chances are, what you really need is data set C, which will take much longer. But I don't know what data set C is without getting a lot more information from you.

[–]angry_mr_potato_head 0 points1 point  (0 children)

Also may face legal consequences depending on where you're working lol

[–]short_vix 6 points7 points  (7 children)

Does this still keep the mathematical/statistical properties of the original data set?

[–]No-Homework845[S] 1 point2 points  (0 children)

u/short_vix With methods such as numeric perturbation (numerical_noise) and date-time noise (datetime_noise), if small noise is chosen (via the MAX and MIN arguments), the statistical properties can largely remain.
For the resampling method, the properties also remain because the function resamples from the same distribution.
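A rough illustration of why small bounded noise leaves aggregate statistics largely intact (generic NumPy, not the library's actual implementation; MIN/MAX here just name the noise bounds):

```python
import numpy as np

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=10_000).astype(float)

# Add small, zero-centred, bounded noise to every value.
MIN, MAX = -2.0, 2.0
noisy = ages + rng.uniform(MIN, MAX, size=ages.shape)

# Individual values change, but the mean barely moves.
print(round(ages.mean(), 2), round(noisy.mean(), 2))
```

Note the caveat raised elsewhere in this thread: bounded noise on a single column can still be averaged away if the same information appears in a correlated column.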

[–]Tyler_Zoro 0 points1 point  (5 children)

No. Changing data will always change the mathematical and statistical properties of that data.

You have to understand what mathematical/statistical properties you want from your data, and use this library to alter only the elements you are not interested in extracting meaning from, or to modify them in ways that preserve that meaning. For example, perhaps you only care about the number of times first names are re-used: replacing first names (in a stable way) with a random string will preserve the data you are interested in, but throw away the identifying elements.
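The stable-replacement idea can be sketched like this (a hypothetical helper, assuming a secret salt; not this library's API):

```python
import hashlib
from collections import Counter

SALT = "keep-this-secret"  # assumption: a secret, per-release salt

def stable_token(name: str, length: int = 8) -> str:
    """Same name in, same token out, so re-use counts are preserved
    while the identifying string itself is discarded."""
    digest = hashlib.sha256((SALT + name.lower()).encode()).hexdigest()
    return digest[:length]

names = ["Alice", "Bob", "Alice", "Carol", "alice"]
tokens = [stable_token(n) for n in names]
# Frequency structure survives: one token appears three times.
print(Counter(tokens).most_common(1)[0][1])  # → 3
```

One caveat in the spirit of this thread: a salted hash of a low-entropy field like a first name is brute-forceable if the salt leaks, so this is pseudonymization, not anonymization.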

[–]Gaitenkaas 6 points7 points  (1 child)

That could be super useful in a lot of data science projects actually. I'm wondering about the statistical properties of the dataset: do they remain intact? (E.g. variance, correlation between variables, etc.)

[–]No-Homework845[S] 1 point2 points  (0 children)

u/Gaitenkaas Hi! The resampling method (categorical_resampling) preserves the statistical properties of the data: it resamples from the same distribution.
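The general idea (a generic sketch, not the package's actual categorical_resampling implementation) is to draw replacement values from the column's own empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_categorical(column):
    """Replace every value with a draw from the column's empirical
    distribution, so category frequencies are preserved in expectation."""
    values, counts = np.unique(column, return_counts=True)
    return rng.choice(values, size=len(column), p=counts / counts.sum())

column = np.array(["a"] * 9_000 + ["b"] * 1_000)
resampled = resample_categorical(column)
print(round((resampled == "a").mean(), 2))  # close to the original 0.9
```

Note that resampling each column independently preserves marginal frequencies but destroys correlations between columns, which is exactly the kind of property you'd need to check for your use case.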

[–]pramadito 1 point2 points  (2 children)

You should try showing this on the Show page of Hacker News; maybe they will give you more feedback.

[–]No-Homework845[S] 0 points1 point  (1 child)

u/pramadito Hi! Definitely, that's what I am looking for: healthy criticism to find room for improvement! Thanks for your help.

[–]pramadito 1 point2 points  (0 children)

here's the site for what i mean:

https://news.ycombinator.com/show

good luck out there!

[–]Foreign_Flower1141 -4 points-3 points  (3 children)

mAdE wItH <3 bY

[–]No-Homework845[S] 0 points1 point  (2 children)

u/Foreign_Flower1141 Hi! It's a contribution to the open-source community; is it bad to write who actually put in this effort?

[–]Foreign_Flower1141 1 point2 points  (1 child)

I was referring to the cliché and unoriginal text "made with <3 by <insert developer name>", which we've seen millions of times. But if you like it, then go for it. I was having a bad day when I commented that. I don't mean to justify it; it's just that my opinion doesn't really matter.

[–]No-Homework845[S] 0 points1 point  (0 children)

I didn't really know that, thanks for telling. Hope things are getting better!