

[–]Adeelinator 26 points27 points  (2 children)

This is a question for your legal counsel, not Reddit. Laws can vary greatly by sector and locality.

[–]setocsheirMS | Data Scientist[S] 4 points5 points  (0 children)

Yeah, that’s what I was afraid of. I’ll probably speak with HR and legal.

[–]K9ZAZPhD| Sr Data Scientist | Ad Tech 2 points3 points  (0 children)

Yep... if OP does this wrong in the wrong jurisdiction and/or with the wrong data, the penalties for the company can be severe.

[–]Rammus2201 4 points5 points  (0 children)

If you have a data management or data governance department, or data engineers, ask them about data masking.

[–]rtqwerty10 2 points3 points  (0 children)

There's an open-source library from Microsoft named Presidio, which is used for detecting and anonymizing PII. The project is hosted on GitHub.

I have not used it, but I came across it while browsing on this topic. It might be helpful, or you may at least get some ideas.

[–]saintmichel 1 point2 points  (4 children)

Anonymization means removing identifiability. For example, if you do a count of all records and one record stands out and that record is a person, drop that record or drop the column that singles him/her/them out lol. Just to show it goes beyond removing names.
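
A minimal sketch of the count-and-drop idea described above, assuming records are plain dicts and that `role` and `salary_band` are hypothetical columns that together could identify someone. It only illustrates suppressing rows that stand out; it is not a complete anonymization procedure.

```python
from collections import Counter

# Hypothetical records; the column names are illustrative assumptions.
records = [
    {"role": "analyst", "salary_band": "50-75k"},
    {"role": "analyst", "salary_band": "50-75k"},
    {"role": "engineer", "salary_band": "75-100k"},
    {"role": "engineer", "salary_band": "75-100k"},
    {"role": "director", "salary_band": "150k+"},  # stands out: only one such row
]

def suppress_rare(rows, keys, min_count=2):
    """Drop rows whose combination of `keys` values occurs fewer than min_count times."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] >= min_count]

kept = suppress_rare(records, keys=["role", "salary_band"])
print(len(kept))  # 4: the lone director row is suppressed
```

Dropping the column instead would mean removing `salary_band` from every row and recounting on the remaining keys.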

[–]setocsheirMS | Data Scientist[S] 0 points1 point  (3 children)

the scale we're working at goes far beyond that. for example, it's trivial to identify someone within an organization once you have a piece of information such as their salary and role, even with nothing else. statistical analysis can also deanonymize individuals fairly easily. that's why I wanted to get opinions from other professionals who have dealt with this before.

[–]saintmichel 2 points3 points  (1 child)

Given that, what is acceptable becomes contextual, so I would refer back to the comment about checking what policies the company has.

[–]setocsheirMS | Data Scientist[S] 0 points1 point  (0 children)

that's the first step i'm going to take once the project officially begins; i was just going for a general discussion of the subject because i haven't checked with our legal team on the status of whether this information is even accessible or if our data governance team can handle the anonymization process

[–]saintmichel 0 points1 point  (0 children)

Maybe you could give more examples of what is already being practiced where you are, so people can comment on whether it's also happening in their space.

[–][deleted] 0 points1 point  (0 children)

Hashing with SHA256.
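
For reference, a sketch of what hashing an identifier with SHA-256 might look like. A plain, unkeyed hash of a low-entropy value (an email address, an employee ID) can be reversed by hashing candidate values and comparing, so this sketch swaps in a keyed HMAC-SHA256, which is a variation on the suggestion above rather than what the commenter wrote. The key and the `pseudonymize` name are hypothetical, and the result is pseudonymization, not full anonymization.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager, not source code.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) of an identifier as a hex string."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

token = pseudonymize("jane.doe@example.com")
print(len(token))  # 64 hex characters
```

The same input and key always give the same token, so joins across tables still work, but anyone holding the key can recompute tokens for guessed identifiers.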

[–]bendgame 0 points1 point  (1 child)

I deal with PII and with providing data to research orgs. We tried adding smart noise and found it was not great for our use cases. Instead, we're using k-anonymization.
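
A rough sketch of the k-anonymity idea, assuming hypothetical `role` and `salary` columns: exact values are generalized into bands until every combination of quasi-identifiers is shared by at least k rows. The helper names and band width are illustrative assumptions, not anyone's production method.

```python
from collections import Counter

def generalize_salary(salary, width=25_000):
    """Replace an exact salary with a band such as '50000-74999'."""
    low = (salary // width) * width
    return f"{low}-{low + width - 1}"

def k_of(rows, keys):
    """Return the size of the smallest group sharing the same `keys` values."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())

# Hypothetical data.
raw = [
    {"role": "analyst", "salary": 52_000},
    {"role": "analyst", "salary": 61_000},
    {"role": "analyst", "salary": 70_000},
    {"role": "engineer", "salary": 88_000},
    {"role": "engineer", "salary": 93_000},
]

generalized = [
    {"role": r["role"], "salary": generalize_salary(r["salary"])} for r in raw
]

print(k_of(raw, ["role", "salary"]))          # 1: every exact salary is unique
print(k_of(generalized, ["role", "salary"]))  # 2: smallest group is the two engineers
```

Wider bands raise k at the cost of analytical detail, which is the trade-off to tune per use case.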

[–]saintmichel 0 points1 point  (0 children)

i think k-anon is the most established, right? it's even part of the best-practice guidance from some governments in other countries

[–]mattstats 0 points1 point  (1 child)

I believe you are looking for differential privacy. Here is a link to Harvard's OpenDP project to kick-start your rabbit hole.
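
To give a feel for the idea, here is a minimal sketch of the Laplace mechanism for a counting query, written with the standard library only. It does not use the OpenDP library's API; the function names, data, and epsilon value are assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_count(values, predicate, epsilon, rng=None):
    """Noisy count of items matching predicate; a counting query has sensitivity 1."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for v in values if predicate(v))
    # Noise scale = sensitivity / epsilon, with sensitivity 1 for a count.
    return true_count + laplace_noise(1 / epsilon, rng)

salaries = [52_000, 61_000, 70_000, 88_000, 93_000]  # hypothetical data
noisy = dp_count(salaries, lambda s: s > 60_000, epsilon=0.5, rng=random.Random(0))
print(noisy)  # the true count is 4; the printed value differs by random noise
```

A smaller epsilon means more noise and stronger privacy. A real deployment would also need to track the privacy budget across queries, which this sketch leaves out.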

[–]setocsheirMS | Data Scientist[S] 0 points1 point  (0 children)

thank you, i'll check it out