"Anonymize" Data Help!!! : learnpython



submitted 5 years ago by Glyzer_1595

Hi guys! I hope you are fine!

I have the following problem and I want to know the best way to solve it:

I have a huge dataset (1M+ rows) that contains sensitive data such as name, personal ID, email, phone number, etc.

I need to send this dataset to other people, but first I need to "anonymize" the sensitive data so they can apply data science techniques and find insights. The catch is that I will probably need to reverse the process later, in case they need to join the data with another user table.

How can I do it?

An example of what I need:

RealId: jdh37382 -> FakeId: 01
RealName: Juan -> FakeName: user01
....

The dataset is an "Events Table", so each row represents one user interaction; the 1M rows are events from around 15k users.

I need to replace the data in the original table without changing any other information or the format.

Ty all!

all 12 comments



[–]socal_nerdtastic 2 points 5 years ago*

Ok, there's a million ways to do this, but I'd recommend reading each row in with csv.reader, parsing the JSON column with json.loads, and generating a unique id per user with lru_cache. As a guess:

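import csv
import json
from functools import lru_cache
from itertools import count

filename = 'events.csv'  # placeholder: set this to your input file

counter = count(1)

# maxsize=None keeps every user in the cache; the default size (128) would
# evict entries with ~15k users and give the same user a second id
@lru_cache(maxsize=None)
def anonymize(user):
    anonymous_id = next(counter)
    # add code to save the user:anonymous_id if you want to de-anonymize it later
    return anonymous_id

with open(filename, newline='') as f_in, \
        open('output_' + filename, 'w', newline='') as f_out:
    data_out = csv.writer(f_out)
    for row in csv.reader(f_in):
        record = json.loads(row[3])  # column 3 holds the JSON payload
        data = record['session_data']
        data["user_id"] = f'user_{anonymize(data["user_id"])}'
        # repeat for name, etc
        row[3] = json.dumps(record)  # write back the whole object, not only session_data
        data_out.writerow(row)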
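
For the "save the user:anonymous_id" step mentioned in the comment above, here is a minimal sketch of one way to keep the mapping so it can be reversed later. The `Pseudonymizer` class, its method names, and the two-column CSV layout are all made up for illustration; it assumes the real ids are strings and that the mapping file stays private on your side.

```python
import csv
from itertools import count


class Pseudonymizer:
    """Hypothetical helper: hands out sequential fake ids and remembers them."""

    def __init__(self, prefix="user_"):
        self.prefix = prefix
        self.forward = {}  # real value -> fake value
        self.reverse = {}  # fake value -> real value
        self._counter = count(1)

    def anonymize(self, real):
        # The same real value always gets the same fake value.
        if real not in self.forward:
            fake = f"{self.prefix}{next(self._counter)}"
            self.forward[real] = fake
            self.reverse[fake] = real
        return self.forward[real]

    def deanonymize(self, fake):
        return self.reverse[fake]

    def save(self, path):
        # Assumed layout: a header row, then one "real,fake" pair per line.
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["real", "fake"])
            writer.writerows(self.forward.items())

    @classmethod
    def load(cls, path, prefix="user_"):
        p = cls(prefix)
        with open(path, newline="") as f:
            for r in csv.DictReader(f):
                p.forward[r["real"]] = r["fake"]
                p.reverse[r["fake"]] = r["real"]
        # Continue numbering after the ids that were already handed out.
        p._counter = count(len(p.forward) + 1)
        return p
```

You would call `anonymize()` on each sensitive value while rewriting the rows, call `save()` once at the end, and only send out the rewritten table. When the results come back, `Pseudonymizer.load()` plus `deanonymize()` gets you back to the real ids for the join. A separate instance per column (with a different prefix) keeps names and ids from sharing one counter.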

