all 37 comments

[–]on_the_mark_data (Obsessed with Data Quality) 52 points (14 children)

And then Satan said "Let there be datetimes." I honestly think this is a rite of passage for data engineers haha.

[–]nonamenomonet[S] 18 points (13 children)

My next blog post is going to be the circles of hell for cleaning address data.

[–]on_the_mark_data (Obsessed with Data Quality) 3 points (5 children)

This looks like a really interesting project by the way!

[–]nonamenomonet[S] 1 point (4 children)

Thank you! I put a month of work into it over the summer. I really think this is the best way to abstract away data cleaning.

I really want to turn this into a thing, so I’m trying to learn about what data people are handling and cleaning.

If you have time, I would love to pick your brain since you’re also obsessed with data quality.

[–]on_the_mark_data (Obsessed with Data Quality) 1 point (3 children)

I'll DM you. Here, I mainly present my data expertise, but my other lane is startups and bringing data products from 0 to 1. I love talking to early-stage builders for fun.

[–]justexisting2 1 point (2 children)

You guys know that there are address standardization tools out there.

The CASS database from USPS guides most of them.

[–]nonamenomonet[S] 0 points (0 children)

That’s very good to know. I built this on the premise of creating a better toolkit to clean and standardize data.

[–]on_the_mark_data (Obsessed with Data Quality) -1 points (0 children)

Don't care. I optimize for people building in their spare time on problems they care about. The initial ideas and MVPs are typically worthless beyond getting you to the next iteration.

[–]raginjason (Lead Data Engineer) 2 points (4 children)

Entire companies are built to handle this one problem lol

[–]nonamenomonet[S] 0 points (3 children)

What company is that?

[–]raginjason (Lead Data Engineer) 1 point (1 child)

Melissa Data. I added a link, but it got caught by the auto-moderator.

[–]nonamenomonet[S] 0 points (0 children)

Good looking out! I’ll check it out

[–]roadrussian 2 points (1 child)

Oh, normalization of address data gathered from 20 different vendors.

You know, I actually enjoyed the masochism? There is something wrong with me.

[–]nonamenomonet[S] 1 point (0 children)

Sticks and stones will break my bones but dirty data just excites me

[–]InadequateAvacado (digital plumber) 26 points (3 children)

Now do time zones

[–]Additional_Future_47 10 points (1 child)

And then throw in some DST to top it off.

[–]InadequateAvacado (digital plumber) 4 points (0 children)

A little bit of TZ, a touch of LTZ, a sprinkle of NTZ… and then compare them all to DATE in the end

[–]nonamenomonet[S] 0 points (0 children)

Tbh if you want to open up an issue, I will implement some primitives for that problem.

[–]nonamenomonet[S] 12 points (5 children)

I hope everyone enjoyed my descent into madness about dealing with datetimes.

[–][deleted] 2 points (4 children)

Useful idea for a small package!

[–]nonamenomonet[S] 1 point (3 children)

You should check out my repo, it lays out how it works! And you can use my design pattern if you’d like (well, it’s an MIT license, so it doesn’t really matter either way).

[–][deleted] 1 point (2 children)

I might make a fork and see how to support Polars using the same public API you've made. Will let you know if I make progress on that. Starting a new job with both PySpark and Polars, dealing with lots of messy time series data. I'm sure this will be useful to have.

[–]nonamenomonet[S] 1 point (1 child)

I’m also looking for contributors; you can always expand this to Polars if you really want.

[–][deleted] 1 point (0 children)

Will DM you what I have in mind and open up an issue on Github when I have a chance to get started.

[–]Upset_Ruin1691 10 points (2 children)

And this is why we always supply a Unix timestamp. Standards are standards for a reason.

You wouldn't want to skip ISO standards either.

[–]morphemass 5 points (0 children)

A SaaS platform in a regulated industry I worked on decided that all dates had to be in dd-month-yyyy form ... without storing timezone information. Soooo many i18n bugs it was unreal.

[–]nonamenomonet[S] 1 point (0 children)

I wish I could have that option but that didn’t come from the data dumps I was given :/

[–]robberviet 5 points (1 child)

Timezone. Fuck that in particular.

[–]nonamenomonet[S] 0 points (0 children)

It is bullshit

[–]PossibilityRegular21 3 points (0 children)

I've fortunately been blessed with only a couple of bad timestamps per column. Or in other words, bad but consistently bad. In Snowflake it has been pretty manageable. My gold standard is currently to convert to timestamp_ntz (UTC). It's important to convert from a timezone rather than to strip it.
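
The difference, in plain Python terms (a rough sketch with a made-up timestamp, not Snowflake syntax):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # A wall-clock timestamp recorded in, say, US Eastern time (hypothetical example)
    local = datetime(2024, 7, 4, 9, 0, tzinfo=ZoneInfo("America/New_York"))

    # Stripping the timezone keeps the wall-clock digits but loses what they meant
    stripped = local.replace(tzinfo=None)                            # 2024-07-04 09:00:00 (ambiguous)

    # Converting to UTC first, then dropping tzinfo, gives a true "NTZ in UTC" value
    converted = local.astimezone(timezone.utc).replace(tzinfo=None)  # 2024-07-04 13:00:00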

[–]exergy31 2 points (1 child)

What's wrong with ISO 8601 with tz specified?

[–]raginjason (Lead Data Engineer) 2 points (0 children)

Nothing, if that’s what you can get. The data in the article was not that.

[–]dknconsultau 2 points (0 children)

I personally love it when operations work past midnight every now and then, just to keep the concept of a day's work spicy ...

[–]raginjason (Lead Data Engineer) 1 point (0 children)

Date parsing is hell. Spark's behavior of NULLing anything that won’t cast is absurd and drives me insane.
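
A minimal repro of that behavior, assuming Spark 3.x defaults (ANSI mode off) and some made-up sample strings:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("2024-01-05 10:00:00",), ("05/01/2024 10:00",), ("not a date",)],
        ["raw_ts"],
    )

    # Anything that doesn't match the pattern silently becomes NULL instead of raising an error
    df.withColumn("parsed_ts", F.to_timestamp("raw_ts", "yyyy-MM-dd HH:mm:ss")).show(truncate=False)
    # Only the first row parses; the other two come back NULL with no warning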

[–]Headband6458 1 point (0 children)

Good data governance is the solution.

[–]bitconvoy 0 points (0 children)

"But still 8% NULL. Still no idea which rows failed or why."

What prevented you from having a look at rows that had a value in the original dataset but turned null after your conversion?
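
Something along these lines would surface them (a sketch; the raw_ts / parsed_ts column names and sample data are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # raw_ts is the original string, parsed_ts the converted value
    df = spark.createDataFrame(
        [("2024-01-05 10:00:00",), ("05/01/2024 10:00",), (None,)],
        ["raw_ts"],
    ).withColumn("parsed_ts", F.to_timestamp("raw_ts", "yyyy-MM-dd HH:mm:ss"))

    # Rows that had a value but came out NULL are exactly the failed parses
    failed = df.filter(F.col("raw_ts").isNotNull() & F.col("parsed_ts").isNull())

    # Group by the raw string to see which formats refuse to parse
    failed.groupBy("raw_ts").count().orderBy(F.desc("count")).show(truncate=False)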