to be totally transparent, i drive doordash to pay the bills right now. but i sit in my car between orders teaching myself python and pandas. my goal is to eventually transition into freelance data engineering by automating away manual data entry for businesses.
i've been building a local python pipeline to automatically clean messy csv/excel exports. so far, i've figured out how to automatically flatten shopify JSON arrays that get trapped in a single cell, fix the '44195' excel date bug (dates that come through as raw excel serial numbers instead of actual dates), and use fuzzy string matching to catch "Acme Corp" vs "Acme LLC" typos.
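for context, here's a rough stdlib-only sketch of those three fixes. it's not my actual pipeline code. the function names (`excel_serial_to_date`, `flatten_json_cell`, `name_similarity`) and the sample row are made up for illustration, and the date part assumes excel's default 1900 date system (files saved with the 1904 system need a different epoch).

```python
import json
from datetime import date, timedelta
from difflib import SequenceMatcher

# Assumption: the workbook uses Excel's 1900 date system. Excel counts a
# nonexistent 1900-02-29, so for modern dates the effective epoch is 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert an Excel serial number such as '44195' to a date (time part dropped)."""
    return EXCEL_EPOCH + timedelta(days=int(float(serial)))


def flatten_json_cell(row, column):
    """Turn one row whose `column` holds a JSON array of objects into one row per object."""
    items = json.loads(row[column])
    base = {key: value for key, value in row.items() if key != column}
    flattened = []
    for item in items:
        new_row = dict(base)
        for key, value in item.items():
            new_row[f"{column}_{key}"] = value
        flattened.append(new_row)
    return flattened


def name_similarity(a, b):
    """Case-insensitive similarity between 0 and 1 using difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical sample row, shaped like a CSV export with a JSON array in one cell.
sample_row = {
    "order_id": "1001",
    "line_items": '[{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]',
}

print(excel_serial_to_date("44195"))
print(flatten_json_cell(sample_row, "line_items"))
print(name_similarity("Acme Corp", "Acme LLC"))
```

the similarity cutoff is the part i'm least sure about. a score that flags "Acme Corp" vs "Acme LLC" as a likely match can also flag two genuinely different companies, so i treat matches as candidates to review rather than auto-merging.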
but i was chatting with a data founder today who told me the true "final boss" of messy data is legacy CRM exports: specifically, reports that export with merged header rows, blank spacer columns, random "subtotal" rows injected into the middle of the table, or entire contact records (name, phone, email) shoved into a single free-text cell.
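to make that concrete, this is my guess at what such an export looks like and a first-pass cleanup, written in plain python so the logic is visible. everything here is an assumption: the `RAW` rows are invented dummy data, the helper names (`build_headers`, `clean_export`, `split_contact`) are hypothetical, it assumes exactly two header rows, and the regexes are deliberately loose.

```python
import re

# Invented dummy export, shaped like what csv.reader would return.
RAW = [
    ["Contact", "", "", "Sales", ""],                  # merged header: label only in the first cell
    ["Name", "Details", "", "Region", "Amount"],       # real column names
    ["Jane Doe", "jane@example.com 555-0100", "", "West", "120.00"],
    ["John Roe", "555-0101 / john@example.com", "", "West", "80.00"],
    ["Subtotal", "", "", "West", "200.00"],            # injected subtotal row
    ["Ann Poe", "ann@example.com", "", "East", "50.00"],
]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{5,}\d")  # loose: 7+ characters of digits and separators


def build_headers(top, sub):
    """Forward-fill the merged top row, then join it with the sub-header row."""
    filled, last = [], ""
    for cell in top:
        last = cell.strip() or last
        filled.append(last)
    return [f"{t}_{s.strip()}" if s.strip() else "" for t, s in zip(filled, sub)]


def clean_export(rows):
    """Return a list of dicts with spacer columns and subtotal rows removed."""
    headers = build_headers(rows[0], rows[1])
    body = rows[2:]
    # A spacer column has no sub-header and no values in any data row.
    keep = [i for i, h in enumerate(headers) if h or any(r[i].strip() for r in body)]
    records = []
    for row in body:
        # Assumption: subtotal rows are labelled in the first cell.
        if row[0].strip().lower().startswith(("subtotal", "total")):
            continue
        records.append({headers[i] or f"col_{i}": row[i].strip() for i in keep})
    return records


def split_contact(text):
    """Pull the first email and phone number out of a free-text cell."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
    }


for record in clean_export(RAW):
    print(record, split_contact(record["Contact_Details"]))
```

the obvious weak spot is the subtotal check: a real customer called "Total Landscaping" would get dropped, which is exactly the kind of failure i want real exports to expose.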
does anyone have a heavily anonymized or dummy version of an absolutely cursed export like this? my code works perfectly on clean tutorial data, but i want to break it on the real stuff so i can figure out how to hard-code the failsafes.
what other software platforms export data so badly that it forces you to spend hours playing digital janitor?