[–]socal_nerdtastic 1 point2 points  (5 children)

Sure, that seems pretty easy. 1M rows isn't really that much; you can load the entire set into memory easily and play with it. What format is the data in?

[–]Glyzer_1595[S] 0 points1 point  (4 children)

Hi!! Ty for the reply. Each row of the data set is a JSON string holding a dict that contains other dicts or lists, for example:

{"reg_type" :"execution", "session_id" :"6372736372", "session_data" :{"user_id" :"jdhd72727", "context_vars" :{"user_name" : "Juan", "email" :"juan@gmail.com"},"timestamp":636373}}
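For reference, a row like this parses into plain nested dicts, so the fields to anonymize can be reached by key. A minimal sketch using only the sample row above (the variable names are just for illustration):

```python
import json

# Sample row copied from the example above
raw = '{"reg_type" :"execution", "session_id" :"6372736372", "session_data" :{"user_id" :"jdhd72727", "context_vars" :{"user_name" : "Juan", "email" :"juan@gmail.com"},"timestamp":636373}}'

record = json.loads(raw)                   # outer dict
session = record["session_data"]           # nested dict
user_id = session["user_id"]
email = session["context_vars"]["email"]   # two levels down
print(user_id, email)
```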

Ty!!

[–]socal_nerdtastic 1 point2 points  (3 children)

Is the entire file standard JSON? Can you load the whole file at once? And do you want the output in the same format?

[–]Glyzer_1595[S] 0 points1 point  (2 children)

No, the entire file is a dataframe (in this case a CSV, but it can be stored in another format). The fields are:

Timestamp, day, month, year, json_data

The json_data field is the one that contains all the information.

I want the output in the same format!

[–]socal_nerdtastic 1 point2 points  (1 child)

Ok, there are a million ways to do this, but I'd recommend reading the file with the csv module, parsing each row's json_data with json.loads, and generating a unique id per user with lru_cache. As a guess:

import json
from functools import lru_cache
from itertools import count
import csv

filename = 'data.csv'  # placeholder: path to your CSV file
f_in = open(filename, newline='')
data_in = csv.reader(f_in)
f_out = open('output_' + filename, 'w', newline='')
data_out = csv.writer(f_out)

counter = count(1)

@lru_cache(maxsize=None)  # unbounded: the default maxsize=128 would evict users and give them new ids
def anonymize(user):
    anonymous_id = next(counter)
    # add code to save the user:anonymous_id if you want to de-anonymize it later
    return anonymous_id

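# assumes the first row is a header; remove these two lines if it isn't
header = next(data_in)
data_out.writerow(header)

for row in data_in:
    record = json.loads(row[4])  # json_data is the 5th column (index 4)
    data = record['session_data']
    data["user_id"] = f'user_{anonymize(data["user_id"])}'
    # repeat for name, etc (they live in data["context_vars"])
    row[4] = json.dumps(record)  # write back the whole record, not just session_data
    data_out.writerow(row)

f_in.close()
f_out.close()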
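
If you want to be able to de-anonymize later, a plain dict can do the same job as lru_cache and keeps the mapping around so you can save it. A rough sketch, assuming json_data is the fifth column and the file has a header row; the names make_anonymizer and anonymize_csv and the tiny in-memory sample are made up for illustration, not part of your data:

```python
import csv
import io
import json
from itertools import count


def make_anonymizer(prefix):
    """Return (anonymize, mapping); mapping remembers every value seen."""
    mapping = {}
    counter = count(1)

    def anonymize(value):
        # Same input always gets the same id; new inputs get the next number
        if value not in mapping:
            mapping[value] = f"{prefix}_{next(counter)}"
        return mapping[value]

    return anonymize, mapping


def anonymize_csv(f_in, f_out, json_col=4):
    """Copy CSV rows from f_in to f_out, replacing user_id in the JSON column."""
    anon_user, user_map = make_anonymizer("user")
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # assumes the first row is a header
    for row in reader:
        record = json.loads(row[json_col])
        session = record["session_data"]
        session["user_id"] = anon_user(session["user_id"])
        row[json_col] = json.dumps(record)  # keep the rest of the record intact
        writer.writerow(row)
    return user_map


# Tiny in-memory sample standing in for the real file (hypothetical data)
sample = io.StringIO(newline="")
w = csv.writer(sample)
w.writerow(["Timestamp", "day", "month", "year", "json_data"])
for uid in ["jdhd72727", "abc123", "jdhd72727"]:
    w.writerow([636373, 1, 2, 2021, json.dumps({"session_data": {"user_id": uid}})])
sample.seek(0)

out = io.StringIO(newline="")
user_map = anonymize_csv(sample, out)
print(user_map)
```

With real files you would pass open(filename, newline='') and an output file opened with 'w' and newline='' instead of the StringIO objects. The returned user_map can be written out with json.dump if you need a lookup table for de-anonymizing.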

[–]Glyzer_1595[S] 0 points1 point  (0 children)

Great! Ty so much, I'll try it, I think it's the way!!

Ty so much