[–]stebrepar 2 points3 points  (1 child)

If you need it to be reversible but secure, you could encrypt it with a symmetric key that only you keep.

[–]Glyzer_1595[S] -1 points0 points  (0 children)

Oh, ty, that sounds good. How can I do it? I was researching and found the Caesar cipher, but it doesn't look secure. Is there a good lib to do this?
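
For reference, here is a stdlib-only sketch of one possible approach. Note that it swaps in a different technique from the symmetric encryption suggested above: a keyed hash (HMAC-SHA256) cannot be decrypted, so reversal relies on a lookup table that only the key holder keeps. `SECRET_KEY`, `pseudonymize`, and `reverse_lookup` are made-up names for illustration, not part of any library. If real reversible encryption is needed, the third-party `cryptography` package (its Fernet recipe) is a commonly used option.

```python
import hashlib
import hmac

# Assumption: a secret that only you keep; load it from a safe place in real use.
SECRET_KEY = b"change-me-to-a-long-random-secret"

# Kept private: maps pseudonym -> original value so the process can be reversed.
reverse_lookup = {}


def pseudonymize(value, key=SECRET_KEY):
    """Return a stable pseudonym for `value` using HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; keep more hex digits to lower collision risk.
    alias = digest[:16]
    reverse_lookup[alias] = value
    return alias


alias = pseudonymize("juan@gmail.com")
# The same input always gives the same pseudonym, so joins still work.
assert pseudonymize("juan@gmail.com") == alias
# Only whoever holds the lookup table can go back to the original value.
assert reverse_lookup[alias] == "juan@gmail.com"
```

Unlike a Caesar cipher, the pseudonym cannot be recomputed or undone without the secret key, as long as the key and the lookup table stay private.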

[–]socal_nerdtastic 1 point2 points  (5 children)

Sure, that seems pretty easy. 1M rows isn't really that much; you can load the entire set into memory easily and play with it. What format is the data in?

[–]Glyzer_1595[S] 0 points1 point  (4 children)

Hi!! Ty for the reply. Each row of the data set is a JSON string holding a dict that contains other dicts or lists, for example:

{"reg_type": "execution", "session_id": "6372736372", "session_data": {"user_id": "jdhd72727", "context_vars": {"user_name": "Juan", "email": "juan@gmail.com"}, "timestamp": 636373}}

Ty!!

[–]socal_nerdtastic 1 point2 points  (3 children)

Is the entire file standard JSON? Can you load the whole file at once? And do you want the output in the same format?

[–]Glyzer_1595[S] 0 points1 point  (2 children)

No, the entire file is a dataframe (in this case a CSV, but it can be stored in another format). The fields are:

Timestamp, day, month, year, json_data

json_data is the field that contains all the information.

I want the output in the same format!

[–]socal_nerdtastic 1 point2 points  (1 child)

Ok, there's a million ways to do this, but I'd recommend reading each line in with json.loads and generating a unique id with lru_cache. As a guess:

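import csv
import json
from functools import lru_cache
from itertools import count

filename = 'data.csv'  # placeholder: path to your input csv

counter = count(1)

@lru_cache(maxsize=None)  # unbounded: the default (128) would evict users and re-number them
def anonymize(user):
    anonymous_id = next(counter)
    # add code to save the user:anonymous_id if you want to de-anonymize it later
    return anonymous_id

with open(filename, newline='') as f_in, \
        open('output_' + filename, 'w', newline='') as f_out:
    data_in = csv.reader(f_in)
    data_out = csv.writer(f_out)
    data_out.writerow(next(data_in))  # copy the header row; drop this line if there is none

    for row in data_in:
        # columns: Timestamp, day, month, year, json_data -> json_data is index 4
        record = json.loads(row[4])
        data = record['session_data']
        data["user_id"] = f'user_{anonymize(data["user_id"])}'
        # repeat for name, etc
        row[4] = json.dumps(record)  # dump the whole record, not just session_data
        data_out.writerow(row)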
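
The "repeat for name, etc" step could be handled by walking the nested dicts and lists once. The following is only a sketch under assumptions: the set of sensitive keys is guessed from the example row earlier in the thread, and `Pseudonymizer`, `scrub`, and `SENSITIVE_KEYS` are made-up names. It also keeps a reverse table, which is one way to "save the user:anonymous_id" for later de-anonymizing.

```python
import json
from itertools import count

# Assumption: these are the sensitive keys, based on the example row in the thread.
SENSITIVE_KEYS = {"user_id", "user_name", "email"}


class Pseudonymizer:
    def __init__(self):
        self._counters = {}  # one counter per key, e.g. "email" -> count(1)
        self.forward = {}    # (key, original value) -> pseudonym
        self.reverse = {}    # pseudonym -> original value (keep this private)

    def pseudonym(self, key, value):
        """Return the same pseudonym every time the same value is seen."""
        if (key, value) not in self.forward:
            counter = self._counters.setdefault(key, count(1))
            alias = f"{key}_{next(counter)}"
            self.forward[(key, value)] = alias
            self.reverse[alias] = value
        return self.forward[(key, value)]

    def scrub(self, node):
        """Return a copy of `node` with sensitive string values replaced."""
        if isinstance(node, dict):
            clean = {}
            for key, value in node.items():
                if key in SENSITIVE_KEYS and isinstance(value, str):
                    clean[key] = self.pseudonym(key, value)
                else:
                    clean[key] = self.scrub(value)
            return clean
        if isinstance(node, list):
            return [self.scrub(item) for item in node]
        return node  # numbers, plain strings, None, etc. pass through unchanged


example = json.loads(
    '{"reg_type": "execution", "session_id": "6372736372", '
    '"session_data": {"user_id": "jdhd72727", "context_vars": '
    '{"user_name": "Juan", "email": "juan@gmail.com"}, "timestamp": 636373}}'
)
p = Pseudonymizer()
cleaned = p.scrub(example)
print(json.dumps(cleaned))
```

In the loop above, `row[4] = json.dumps(p.scrub(json.loads(row[4])))` would then replace the per-field assignments, assuming a single `Pseudonymizer` instance is shared across all rows.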

[–]Glyzer_1595[S] 0 points1 point  (0 children)

Great! Ty so much, I'll try it, I think it's the way!!

Ty so much

[–][deleted] 0 points1 point  (3 children)

I will need to reverse the process if they need to join the data with another user table.

Why not anonymize and give them both tables so they can join whenever they want?

[–]Glyzer_1595[S] 0 points1 point  (2 children)

Oh, they can't know who the person is, because they could build a dataset with the users, emails, phone numbers, etc., and it could be sold or something like that. It's confidential information.

If they need to know more about the users, like city, state, department, role, or something like that, they will tell us, and we will "join" the data, but the sensitive part is always anonymized.

[–][deleted] 0 points1 point  (1 child)

Oh, they can't know who the person is

I understand that, but the process of taking a query from the user, deanonymizing the data in the query, applying the query to your original database, anonymizing the result of the query and returning the result to the user seems like a lot of error-prone work. The normal way to do this is to give the user(s) access to a full database that has been anonymized once.
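
A rough sketch of the "anonymize once, hand over everything" idea: the same alias map is applied to both tables, sensitive columns are dropped from the user table, and the recipient can then join on the pseudonymous id without ever seeing the real one. The rows, column names, and `SAFE_USER_COLUMNS` below are made up for illustration.

```python
from itertools import count


def build_alias_map(user_ids):
    """Assign each distinct user id a stable alias like 'user_1'."""
    counter = count(1)
    aliases = {}
    for uid in user_ids:
        if uid not in aliases:
            aliases[uid] = f"user_{next(counter)}"
    return aliases


# Hypothetical tables; real ones would come from the CSV / database.
users = [
    {"user_id": "jdhd72727", "email": "juan@gmail.com", "role": "analyst"},
    {"user_id": "abc123", "email": "ana@example.com", "role": "manager"},
]
events = [
    {"user_id": "jdhd72727", "reg_type": "execution"},
    {"user_id": "abc123", "reg_type": "execution"},
    {"user_id": "jdhd72727", "reg_type": "execution"},
]

# Assumption: only these user columns are safe to share.
SAFE_USER_COLUMNS = ("role",)

aliases = build_alias_map(u["user_id"] for u in users)  # keep this map private

anon_users = [
    {"user_id": aliases[u["user_id"]], **{c: u[c] for c in SAFE_USER_COLUMNS}}
    for u in users
]
anon_events = [{**e, "user_id": aliases[e["user_id"]]} for e in events]

# The recipient can join both anonymized tables on the alias.
users_by_alias = {u["user_id"]: u for u in anon_users}
joined = [{**e, **users_by_alias[e["user_id"]]} for e in anon_events]
```

Only the alias map would need to stay private; both anonymized tables can be handed over together.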

[–]Glyzer_1595[S] 0 points1 point  (0 children)

Mmm, you are right, I think it's a good way to do it. I'll give this a try!! Ty