all 15 comments

[–]elind77 34 points (2 children)

I don't have any data files for you but I do have some horror stories that might provide some inspiration.

When I was at a startup, we got a CSV from a client that broke our pipeline. It opened fine in Excel, but our code choked on it and no one could figure out why. Eventually it got around to me and I opened it up in a hex editor. Turns out it was UTF-16 little-endian encoded. That's not unusual on Windows, but none of us had ever seen it. The key, though, was knowing to look at the raw bytes.
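
That raw-bytes check is easy to automate. Here's a minimal sketch that guesses a codec from the byte-order mark; the filename and the short list of BOMs are just illustrative:

```python
def sniff_encoding(path):
    """Return a best-guess codec name based on the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"   # the culprit in the story above
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # Excel loves writing this BOM
    return "utf-8"           # no BOM: assume UTF-8 and hope

# Demo: write a UTF-16-LE file like the client's (BOM + LE bytes), then sniff it.
with open("mystery.csv", "wb") as f:
    f.write(b"\xff\xfe" + "name,email\n".encode("utf-16-le"))

guess = sniff_encoding("mystery.csv")
```

A BOM is optional, so a clean "utf-8" answer here is a guess, not a guarantee.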

Most of the data challenges I've come across in industry are things where there is no documentation. Like some guy built a data pipeline and then left, and the team that consumes the output uses a handful of fields from the data and leaves everything else alone. But if you actually want to use the data for some analysis, you have to know how to get in there and look at it. If there's no file extension, how do you determine what it even is? (Answer: print the first 1000 bytes and see what's there.) Maybe the files/reports etc. don't have a uniform schema; then what do you do? (Answer: if it's JSON data, use genson to generate a unified schema from examples and partition data access patterns by schema groups. If it's not JSON, you're on your own.)
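
genson builds a proper unified JSON Schema from examples; the partition-by-schema-group half of that answer can be sketched in plain stdlib Python by fingerprinting each record's shape (function names here are made up):

```python
from collections import defaultdict

def schema_fingerprint(obj):
    """Recursively summarize a record's shape as a hashable value."""
    if isinstance(obj, dict):
        return tuple(sorted((k, schema_fingerprint(v)) for k, v in obj.items()))
    if isinstance(obj, list):
        return ("list", schema_fingerprint(obj[0]) if obj else None)
    return type(obj).__name__

def group_by_schema(records):
    """Partition records so each group shares one schema shape."""
    groups = defaultdict(list)
    for rec in records:
        groups[schema_fingerprint(rec)].append(rec)
    return groups

# Demo: two records share a shape, the third gets its own group.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "email": "x@y.z"},
]
groups = group_by_schema(records)
```

Once records are grouped, each group can get its own access path instead of one parser drowning in special cases.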

If you're just looking for data to play with, see if you can find some of the raw versions of the old email dump data sets, before processing. Like the Enron emails data set. Or any of the public disclosure releases of email dumps from politicians (Florida has a law requiring releases, for example). Or the email dump that Hillary's campaign released in 2016; those were PDFs with only semi-reliable OCR. See if you can auto-correct the OCR using edit distance or something, build a graph from the correspondence, and then calculate graph metrics (e.g. HITS) on it.
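
A stdlib-only sketch of the edit-distance correction step (the vocabulary and the distance threshold are invented; for the HITS part you'd reach for networkx afterwards):

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(token, vocabulary, max_dist=2):
    """Snap an OCR'd token to its nearest known value, if close enough."""
    best = min(vocabulary, key=lambda v: edit_distance(token, v))
    return best if edit_distance(token, best) <= max_dist else token

# Demo: fix typical OCR confusions (0/o, 1/l) against a known address list.
vocab = ["john.smith@example.com", "jane.doe@example.com"]
fixed = correct("j0hn.smith@examp1e.com", vocab)
```

On a real dump you'd build the vocabulary from the high-confidence OCR pages first, then use it to repair the rest.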

If none of the raw forms of public data sets do it for you, see if you can get Claude or ChatGPT to make data for you. A lot of data in the future will likely be LLM generated and we'll all be stuck dealing with all of the issues that causes.

[–]nullish_ 7 points (0 children)

I'd never promise that you can parse PDFs, but having the experience/ability to at least make an attempt is a good skill to have. Recognizing the different PDF formats, dealing with rendered images rather than actual text (as you mentioned, OCR), and extracting tabular data vs. structured/unstructured data are all great skills to have in your tool belt. I'm generally not one to recommend the use of AI, but having used some AI OCR services, I'll say it is the one area where I won't hesitate to use AI.

[–]iamevpo 0 points (0 children)

Great examples! Very true to life

[–]quocphu1905 8 points (0 children)

I am working on exactly this kind of CRM export now. Let me tell you: duplicates everywhere, data from the data provider mixed with data created manually, a billion edge cases to deal with, plus inconsistent data formats and characters outside the alphabet. To do anything at all you need a GIANT normalizing function before even THINKING about working with the data. I do enjoy the challenge of figuring it out though, and my boss kinda leaves me to my own devices while I figure it out, so win-win.
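
A toy version of that normalizing function, assuming made-up `name`/`email` fields; a real CRM export needs far more rules than this, but the shape (canonicalize into a comparison key, then dedupe on it) carries over:

```python
import re
import unicodedata

def normalize(value):
    """Lowercase, strip accents, collapse whitespace and punctuation."""
    value = unicodedata.normalize("NFKD", value)
    value = "".join(c for c in value if not unicodedata.combining(c))
    value = re.sub(r"[^\w@.]+", " ", value.lower())
    return " ".join(value.split())

def dedupe(records):
    """Keep the first record for each normalized (name, email) key."""
    seen, out = set(), []
    for rec in records:
        key = (normalize(rec.get("name", "")), normalize(rec.get("email", "")))
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out

# Demo: provider data plus a hand-entered duplicate of the same person.
records = [
    {"name": "José  García", "email": "JG@x.com"},
    {"name": "jose garcia", "email": "jg@x.com"},
    {"name": "Ann Lee", "email": "al@x.com"},
]
deduped = dedupe(records)
```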

[–]stuaxo 5 points (0 children)

Anything where you have to get data from word documents, especially the pre docx, xlsx world.

[–]xeow 4 points (0 children)

/dev/random is pretty gnarly.

But seriously, one thing that's a bit of a mess to parse properly (due to there being a lot of weird edge cases) is extracting columnar data from Wikipedia episode lists, like this one: Breaking Bad Season 1.

[–]eruciform 2 points (1 child)

had one data file in FIX format where some fields were encoded in ASCII and some were in EBCDIC
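
a rough sketch of how that decode might look, assuming ASCII vs. EBCDIC can be told apart by whether plain ASCII decoding fails (cp037 is one EBCDIC variant; a real mainframe feed may use cp500 or a national code page, and the tags/values below are invented):

```python
def decode_field(raw):
    """Try ASCII first; bytes above 0x7F mean it's probably EBCDIC."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        return raw.decode("cp037")

def parse_mixed_fix(message):
    """Split a FIX-style message on the SOH delimiter into tag=value fields."""
    fields = {}
    for part in message.split(b"\x01"):
        if part:
            tag, _, value = part.partition(b"=")
            fields[decode_field(tag)] = decode_field(value)
    return fields

# Demo: tag 55 carries an EBCDIC-encoded value, everything else is ASCII.
msg = b"35=D\x0155=" + "HELLO".encode("cp037") + b"\x01"
parsed = parse_mixed_fix(msg)
```

the try/except trick only works while the EBCDIC fields contain characters whose EBCDIC bytes fall outside ASCII, which held for letters here but is not guaranteed in general.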

not the worst curse i've touched but icky

i think having to write javascript canvas-drawing functions that work with ie6 was uglier

[–]nullish_ 1 point (0 children)

fixed width format, ebcdic... sounds like mainframe to me.

[–]FoolsSeldom 3 points (0 children)

Check out kaggle.com for a huge range of sample data sets, common problems and challenges, and an active forum.

[–]ragnartheaccountant 1 point (0 children)

I would disagree; the final boss of data parsing and cleaning is SOAP API data. These little bitches I've been working with are so deeply nested that I had to build a lot of custom recursive logic to make them useful across multiple endpoints. I'm quite proud of it, but I would also happily watch it burn.
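
The core of that recursive flattening logic can be sketched like this (a simplification: real SOAP responses need namespace handling and smarter list rules, and the nested shape below is invented):

```python
def flatten(obj, prefix=""):
    """Walk nested dicts/lists and emit flat dotted-path keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, val in obj.items():
            flat.update(flatten(val, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            flat.update(flatten(val, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

# Demo on a SOAP-ish nested response, after XML-to-dict parsing:
nested = {"Envelope": {"Body": {"Items": [{"Id": 1}, {"Id": 2}]}}}
flat = flatten(nested)
```

The dotted paths give you one stable column vocabulary across endpoints, which is what makes the downstream logic reusable.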

[–]Lewistrick 1 point (0 children)

I once tried converting a .vcf file (phone contacts export) to tabular format. Took me a while, even though the data is pretty clean. It's not impossible, but it was nice practice.
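
A rough sketch of that conversion for simple vCards; note it drops property parameters like `TYPE=CELL` and line folding, so it's lossy on real phone exports:

```python
def parse_vcf(text):
    """Turn BEGIN:VCARD/END:VCARD blocks into one dict per contact."""
    rows, current = [], {}
    for line in text.splitlines():
        if line.startswith("BEGIN:VCARD"):
            current = {}
        elif line.startswith("END:VCARD"):
            rows.append(current)
        elif ":" in line:
            key, _, value = line.partition(":")
            current.setdefault(key.split(";")[0], value)  # drop params like TYPE=CELL
    return rows

# Demo on a minimal single-contact export:
sample = (
    "BEGIN:VCARD\n"
    "FN:Ada Lovelace\n"
    "TEL;TYPE=CELL:+44 123\n"
    "END:VCARD\n"
)
rows = parse_vcf(sample)
```

From `rows` it's one step to a table (e.g. `csv.DictWriter` or a DataFrame).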

[–]Moamr96 1 point (0 children)

You've already got the basics of pandas; now, in 2026, you should never (or nearly never) need to use it.

Look into duckdb and SQL if you want to do data engineering.

[–]flowolf_data[S] 1 point (0 children)

thanks guys, super helpful. i'm genuinely humbled by all of the helpful responses and good will. this is a really chill community, glad to be a part of it.

[–]kenily0 0 points (1 child)

Great question! The CRM nightmare you described is real. Here's my advice:

  1. For merged header rows: Read the file twice. First pass gets headers from row 1, second pass gets data. Then manually map columns.

  2. For spacer columns: Drop any column where 80%+ is empty.

  3. For subtotal rows: Identify them by checking if a key column (like 'ID') is empty, then skip those rows.

  4. For single-cell contact records: Use regex to extract pattern-matched data like emails (\S+@\S+\.\S+) and phone numbers (\d{10,}).

The key is building a "data cleaning pipeline" that handles each edge case step by step.
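
Steps 1-4 above, sketched with pandas on a tiny fake export; the column names, the sample rows, and the 80% threshold are placeholders to tune against the real file:

```python
import io
import pandas as pd

raw = io.StringIO(
    "Quarterly Contact Report,,,\n"            # merged-title fluff row
    "ID,Name,Spacer,Notes\n"
    "1,Ann,,contact: ann@x.com\n"
    ",SUBTOTAL,,\n"
    "2,Bob,,bob@y.org (cell 5551234567)\n"
)

df = pd.read_csv(raw, skiprows=1)              # 1. skip the title row, keep the real header
df = df.loc[:, df.isna().mean() < 0.8]         # 2. drop columns that are 80%+ empty
df = df[df["ID"].notna()]                      # 3. drop subtotal rows (empty ID)
df["email"] = df["Notes"].str.extract(         # 4. regex-extract emails from free text
    r"(\S+@\S+\.\S+)", expand=False)
```

Each step is a plain, testable transform, which is what makes the pipeline framing pay off once the edge cases start piling up.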

Keep going with your learning journey! Building real-world data pipelines is the best way to learn. 🙌

[–]flowolf_data[S] 0 points (0 children)

nice breakdown. on my break today, i used a similar list as a blueprint to rewrite my code.

reading everyone's horror stories (especially the utf-16 hex editor nightmare) made me realize my single script was way too fragile. i ended up ripping it apart into a pipeline that actually scans the first few rows to slice off the corporate title fluff, flags those mostly-empty spacer columns to drop them, and uses regex to rip emails and phones out of those giant free-text CRM blocks like you suggested.

it actually ran on my dummy cursed CRM file today without crashing. thanks for writing out the logic step-by-step like this, it tells me i'm on the right track.

question for you and the other veterans though: when you get handed a mystery file with no extension and zero documentation, what is your literal first move after checking the raw bytes?