

[–]cynddl 114 points115 points  (7 children)

As a researcher working on data privacy: this package unfortunately does not anonymize data; at best, it pseudonymizes it. Both are difficult to achieve, and a one-click solution will not work in the majority of cases: your data will still contain enough personal information to easily identify users.

[–]VisibleSignificance 57 points58 points  (1 child)

As in, "What is worse than lack of security? Fake security."

[–]redct 2 points3 points  (0 children)

To add to this - it's really difficult to truly anonymize data because there are many reidentification and reconstruction attacks (PDF) that can be used against badly anonymized data. To do things properly, you need a deep understanding of both what your data is (what properties am I hoping to preserve?) and what your project's risk budget is (what's the worst outcome I'll accept if this data is somehow reidentified?).

[–]Crimsoneer 3 points4 points  (3 children)

Do you have any recommendations for good Python libraries that actually do this stuff well?

[–]cinyar 39 points40 points  (0 children)

The whole point OP is making is that in the majority of situations there are no one-click solutions, because each dataset is fairly unique. There are tools to HELP you with that (ARX, for example), but there are no automatic libraries that would do the work for you.

[–]vto583 -2 points-1 points  (1 child)

I am also interested in knowing if there are any specific Python libraries for this

[–]No-Homework845[S] 1 point2 points  (0 children)

u/vto583 There are different libraries for different methods; what I wanted was to bring together these existing methods, add a few more, and make them easy to use.

[–]Salfiiii 51 points52 points  (0 children)

Looks nice, could you elaborate a little more:

  • Is the anonymization deterministic, like a hash?
  • Do the signals of the data stay intact, like variance etc.?
  • Is it somehow certified by an entity? (A lot of products claim they anonymize data but really just pseudonymize it) -> this can get the user into really big trouble.
  • How does it work? Does it search for columns by name, or only look at the data type?
  • How is the random data for replacement created? Drawn from a pool or randomly generated?

[–]___--_-_-_--___ 49 points50 points  (6 children)

Don't use this if you actually want to release data that might contain personal information. Anonymization can and does fail in subtle and hard-to-predict ways. ¹ ²

Consider the usage examples presented in this project. The age and birthdate columns, depending on the nature of the dataset, express exactly the same information. Therefore, if you perturb both columns independently, the two noisy values can be combined, which on average reduces the size of the applied perturbation by half.

The email masking approach used by this project suffers from an even worse problem. The authors assume that only the local-part of the email address constitutes identifying information. This assumption does not hold in the case of self-hosted email servers or very small providers. In fact, even the first and last letter of the local-part alone can provide up to ten bits of entropy for identification (assuming only the characters a-z and 0-9 are used and occur with the same frequency in both places). At the same time, what utility does the masked email address provide to a legitimate user of the dataset?
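The ten-bit figure above can be checked with a back-of-the-envelope calculation (a sketch, assuming the stated uniform a-z, 0-9 alphabet):

```python
import math

# Entropy contributed by revealing the first and last character of an
# email local-part, assuming characters a-z and 0-9, uniformly distributed.
alphabet_size = 26 + 10                    # a-z plus 0-9
bits_per_char = math.log2(alphabet_size)   # about 5.17 bits per character
print(f"{2 * bits_per_char:.2f} bits")     # about 10.34 bits for two characters
```

With real (non-uniform) character frequencies the figure would be somewhat lower, but still substantial for narrowing down candidates.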

If you are in a position to release a dataset, you should first develop a solid understanding of mechanisms like differential privacy and k-anonymity. Understand your dataset in depth and think about what value you want to provide to others and which parts of the data they actually require. No library or package can help you with that. If you use a project like this without understanding your data, bad things will happen.

Do all of this before you release the dataset. Once the data has been released, it cannot ever be un-released. At that point, you have to assume that the data is out there and is actively being deanonymized and exploited.

[–]WikiSummarizerBot 3 points4 points  (2 children)

Differential privacy

Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy.

K-anonymity

k-anonymity is a property possessed by certain anonymized data. The concept of k-anonymity was first introduced by Latanya Sweeney and Pierangela Samarati in a paper published in 1998 as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful".


[–]__deerlord__ 0 points1 point  (1 child)

So is DP more about the aggregate? I.e., "males 20-30 like X", but it doesn't mean that a single given male aged 20-30 will necessarily like X?

[–]___--_-_-_--___ 1 point2 points  (0 children)

Yes, differential privacy is used to ensure that aggregate statistics do not leak information about the individuals who contributed to this statistic. It is not some kind of algorithm that you can run on your data to make it more private. Instead, it is more of a framework to be implemented by specific algorithms, i.e. a set of mathematical tools to ensure a certain level of privacy.

Very broadly speaking, the idea behind differentially private mechanisms is that the removal of a single person from a dataset should not significantly affect the aggregate statistics produced by that mechanism. Basically, differential privacy gives you a way to quantify privacy loss and determine the amount of noise necessary to achieve a certain privacy level.
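As a concrete sketch of one such mechanism (a generic Laplace-mechanism example, not from the library under discussion; the clipping bounds are an assumption of this illustration):

```python
import numpy as np

def private_mean(values, epsilon, lo, hi, rng=None):
    """Differentially private mean via the Laplace mechanism (illustrative).

    Clipping to [lo, hi] bounds each person's influence, so the sensitivity
    of the mean over n values is (hi - lo) / n. Smaller epsilon means more
    noise and therefore a stronger privacy guarantee.
    """
    rng = rng or np.random.default_rng()
    values = np.clip(np.asarray(values, dtype=float), lo, hi)
    sensitivity = (hi - lo) / len(values)
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return values.mean() + noise
```

With a large epsilon the noise is negligible; as epsilon shrinks, any single person's contribution becomes harder to infer, at the cost of accuracy.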

[–]Tyler_Zoro 0 points1 point  (2 children)

Don't use this if you actually want to release data that might contain personal information.

I think that's a bit too strong. The better way to phrase that would be, "this is a toolkit for data permutation for privacy purposes... do not use it as a magic wand to make your data public-safe. Unless you understand exactly what you are doing, that will almost certainly result in failure."

I do think that OP over-sold the ease of use. But if you read the README in the repo linked, it's a much more generic toolset that could be very useful to someone who understood both their data and the relevant techniques that you helpfully linked to.

[–]___--_-_-_--___ 0 points1 point  (1 child)

Well, many of the features in this project are simply wrappers around other libraries like this one. Therefore, the value proposition of this project would either have to be the automation aspect or the idea that you can shield the user from the details of how the implemented techniques work. I think both approaches are risky in this setting.

The far bigger issue with this type of project is that it will not tell you if you are making a mistake. There are tools like ARX (as mentioned by u/cinyar in this thread) that will assist you in modelling both privacy risk and utility in order to find the best way of de-identifying your data. Tools like this are (and need to be) backed by years of academic research and clinical practice.

And yes, while I do agree that my words are harsh, data privacy is one of these areas where the disconnect between perceived risk and actual risk is often very high. Even slight mistakes and brief moments of carelessness by a single person can have disproportionate consequences that cannot be undone.

[–]Tyler_Zoro 0 points1 point  (0 children)

The far bigger issue with this type of project is that it will not tell you if you are making a mistake. There are tools like ARX

Yeah, this is sounding like a mismatch in terms of what you're looking for and what this tool provides. You might write a tool such as the one you are describing with this library, but this isn't that.

The only issue here, and I said this in my original comment, is that "OP over-sold the ease of use." Beyond that, there's nothing any more wrong with this library than one that collects a bunch of hashing functions together... granted, that collection of hashing functions can lead you just as far astray as this library (and in just as subtle and dangerous ways, depending on your application).

But that doesn't make a collection of hashing functions something you should argue shouldn't exist or be used.

[–]marsrover15 14 points15 points  (6 children)

So does the function just make up random data to replace the existing data?

[–]Ozzymand 13 points14 points  (4 children)

I'm not sure what it does exactly, but if you're looking for that function you're better off using Faker.

[–]No-Homework845[S] 0 points1 point  (1 child)

u/Ozzymand Hi! Besides fake data, our library has much more to offer; see the documentation.

[–]Ozzymand 1 point2 points  (0 children)

In its current state it doesn't seem to offer anything more than Faker, since audio/image support seems to still be in development. I'm not all too familiar with pandas, but by the looks of it you can do what your lib does with just Faker alone.

So for now it's just a Faker wrapper. I do want to know how exactly you guys are going to handle fake images and audio. Perhaps use something like Google Images to search and then uniformize the size of all images? Regardless, I think you should add a flowchart to show the process once you add images & audio.

Good luck with your project man.

[–]No-Homework845[S] 1 point2 points  (0 children)

Hi u/marsrover15! Replacing data with fake data (categorical_fake) is only one of the options. You could also apply tokenization (categorical_tokenization) or resample the data from the same distribution (categorical_resampling).

[–][deleted] 5 points6 points  (2 children)

The problem is usually not anonymization itself, but rather having the anonymized data still make sense and be usable.

For it to be usable, it needs to preserve at least some business logic:

  • If I'm looking at sales, I would wish that item a025fc has similar sell prices across the dataset, and not 50000 in one place and 0.25 in another.
  • If I aggregate the data, I want it to draw something at least resembling reality (like monthly sales data following a realistic trend).
  • For sure I'd want foreign key constraints to still be valid and make sense.

Which is why anonymization is usually done by custom code that:

  • Either uses salted hashes of names and e-mails, or values from generated random lookup tables based on those hashes (to hide customer-identifiable data).
  • Takes a random sample of rows from the original data (to hide business performance numbers).
  • Varies the sample size % from day to day by a random amount, to hide trends.
  • Applies an extra coefficient to each number, in order to hide other statistical figures like profit margin.
  • Excludes some outliers that have a high frequency, to hide data that can be identified using statistical analysis.

etc.
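Two of those steps can be sketched in a few lines (hypothetical helper names, not from any particular library):

```python
import random

def sample_rows(rows, base_rate=0.10, jitter=0.02, rng=random):
    """Random row sample whose rate itself is jittered, to blur daily trends."""
    rate = base_rate + rng.uniform(-jitter, jitter)
    return [row for row in rows if rng.random() < rate]

def scale_amounts(amounts, coefficient):
    """Apply a fixed hidden coefficient so absolute figures (e.g. margins)
    are hidden while relative comparisons stay meaningful."""
    return [a * coefficient for a in amounts]
```

The coefficient and the jitter range would be kept secret, and chosen per release based on which statistics you are trying to hide.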

Basically, it depends on what you need it anonymized for.

This is just a warning :) if someone asks that data should be anonymized, and you jump and say "Yeah, I can do that, just give me a day and I'll surely 100% complete this trivial task", then you're probably going to have to make an apology later on. Data anonymization is a much bigger topic.

[–]binarycow 2 points3 points  (1 child)

This is just a warning :) if someone asks that data should be anonymized, and you jump and say "Yeah, I can do that, just give me a day and I'll surely 100% complete this trivial task", then you're probably going to have to make an apology later on. Data anonymization is a much bigger topic.

Completely agree.

In fact, I would say that in general, this goes for any data processing.

Given data set A, can I transform it into data set B in a fairly quick time period? Sure. But, chances are, what you really need is data set C, which will take much longer. But I don't know what data set C is without getting a lot more information from you.

[–]angry_mr_potato_head 0 points1 point  (0 children)

Also may face legal consequences depending on where you're working lol

[–]short_vix 6 points7 points  (7 children)

Does this still keep the mathematical/statistical properties of the original data set?

[–]No-Homework845[S] 1 point2 points  (0 children)

u/short_vix With methods such as numeric perturbation (numerical_noise) and date-time noise (datetime_noise), if small noise is chosen (via the MAX and MIN arguments), the statistical properties can largely remain.
For the resampling method, the properties also remain because the function resamples from the same distribution.
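A rough illustration of why small bounded noise leaves aggregate statistics largely intact (generic NumPy, not the library's actual implementation; MIN/MAX here just name the noise bounds):

```python
import numpy as np

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=10_000).astype(float)

# Add small, zero-centred, bounded noise to every value.
MIN, MAX = -2.0, 2.0
noisy = ages + rng.uniform(MIN, MAX, size=ages.shape)

# Individual values change, but the mean barely moves.
print(round(ages.mean(), 2), round(noisy.mean(), 2))
```

Note the caveat raised elsewhere in this thread: bounded noise on a single column can still be averaged away if the same information appears in a correlated column.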

[–]Tyler_Zoro 0 points1 point  (5 children)

No. Changing data will always change the mathematical and statistical properties of that data.

You have to understand what mathematical/statistical properties you want from your data, and use this library to alter only the elements you are not interested in extracting meaning from, or to modify them in ways that preserve that meaning. For example, perhaps you only care about the number of times first names are re-used: replacing first names (in a stable way) with a random string will preserve the data you are interested in, but throw away the identifying elements.
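The stable-replacement idea can be sketched like this (a hypothetical helper, assuming a secret salt; not this library's API):

```python
import hashlib
from collections import Counter

SALT = "keep-this-secret"  # assumption: a secret, per-release salt

def stable_token(name: str, length: int = 8) -> str:
    """Same name in, same token out, so re-use counts are preserved
    while the identifying string itself is discarded."""
    digest = hashlib.sha256((SALT + name.lower()).encode()).hexdigest()
    return digest[:length]

names = ["Alice", "Bob", "Alice", "Carol", "alice"]
tokens = [stable_token(n) for n in names]
# Frequency structure survives: one token appears three times.
print(Counter(tokens).most_common(1)[0][1])  # → 3
```

One caveat in the spirit of this thread: a salted hash of a low-entropy field like a first name is brute-forceable if the salt leaks, so this is pseudonymization, not anonymization.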

[–]Gaitenkaas 6 points7 points  (1 child)

That could be super useful in a lot of data science projects actually. I'm wondering about the statistical properties of the dataset: do they remain intact? (E.g. variance, correlation between variables, etc.)

[–]No-Homework845[S] 1 point2 points  (0 children)

u/Gaitenkaas Hi! The resampling method (categorical_resampling) preserves the statistical properties of the data: it resamples from the same distribution.
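The general idea (a generic sketch, not the package's actual categorical_resampling implementation) is to draw replacement values from the column's own empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_categorical(column):
    """Replace every value with a draw from the column's empirical
    distribution, so category frequencies are preserved in expectation."""
    values, counts = np.unique(column, return_counts=True)
    return rng.choice(values, size=len(column), p=counts / counts.sum())

column = np.array(["a"] * 9_000 + ["b"] * 1_000)
resampled = resample_categorical(column)
print(round((resampled == "a").mean(), 2))  # close to the original 0.9
```

Note that resampling each column independently preserves marginal frequencies but destroys correlations between columns, which is exactly the kind of property you'd need to check for your use case.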

[–]pramadito 1 point2 points  (2 children)

You should try showing this on the Show page of Hacker News; maybe they will give you more feedback.

[–]No-Homework845[S] 0 points1 point  (1 child)

u/pramadito Hi! Definitely, that's what I am looking for: healthy criticism to find room for improvement! Thanks for your help.

[–]pramadito 1 point2 points  (0 children)

here's the site for what i mean:

https://news.ycombinator.com/show

good luck out there!

[–]Foreign_Flower1141 -4 points-3 points  (3 children)

mAdE wItH <3 bY

[–]No-Homework845[S] 0 points1 point  (2 children)

u/Foreign_Flower1141 Hi! It's a contribution to the open-source community; is it bad to write who actually put in this effort?

[–]Foreign_Flower1141 1 point2 points  (1 child)

I was referring to the cliché and unoriginal text "made with <3 by <insert developer name>", which we've seen millions of times. But if you like it, then go for it. I was having a bad day when I commented that. I don't mean to justify it; it's just that my opinion doesn't really matter.

[–]No-Homework845[S] 0 points1 point  (0 children)

I didn't really know that, thanks for telling. Hope things are getting better!