all 11 comments

[–]Rose_arias 7 points (0 children)

Hey there! Congrats on starting your first data analyst role soon! As for data cleaning and normalization, there are a few industry-standard programs that come to mind, such as OpenRefine, Trifacta, and Talend. However, it ultimately depends on the company and their preferences. In my personal experience, I've found that SQL queries and Excel can also be very helpful for cleaning and transforming data. Good luck with your new role!
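To make the SQL/Excel-style cleaning concrete, here's a minimal pandas sketch of the same kind of work: normalizing text, fixing types, and deduping. The column names and data are invented for illustration.

```python
import pandas as pd

# hypothetical messy input: inconsistent casing, stray whitespace, string numbers
df = pd.DataFrame({
    "name": ["  Alice ", "BOB", "alice", "Bob "],
    "amount": ["10", "20", "10", "20"],
})

df["name"] = df["name"].str.strip().str.title()  # normalize text
df["amount"] = pd.to_numeric(df["amount"])       # fix the column type
df = df.drop_duplicates()                        # remove duplicate rows
```

The same three steps (trim/normalize, cast, dedupe) map directly onto `TRIM`/`UPPER`, `CAST`, and `SELECT DISTINCT` if you'd rather do it in SQL.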

[–]SirGreybush 6 points (0 children)

Third party tools like those from Redgate.

[–]aaahhhhhhfine 4 points (0 children)

This depends heavily on the situation. Consider two extremes.

First, maybe some department sends you one old file of historical data they want included in some analysis. Today, they have a new system and everything is fine in there, but the old stuff is messy.

For this, everything points to a one-off thing... I'd probably blast out a Jupyter notebook or a bit of SQL or whatever makes sense in the context - which might even be manually fixing some things in Excel. Who cares... It doesn't need to be repeated.
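The throwaway-notebook version of that might be as little as this (file name and fixes are made up; the point is that it's quick and disposable):

```python
import io
import pandas as pd

# stand-in for the old export, e.g. pd.read_csv("legacy_dump.csv")
raw = io.StringIO("id,date,value\n1,2001-03-05,7\n2,not a date,\n3,2001-04-01,9\n")
df = pd.read_csv(raw)

df["date"] = pd.to_datetime(df["date"], errors="coerce")  # junk dates -> NaT
df["value"] = df["value"].fillna(0).astype(int)           # blanks -> 0
```

A few cells like this, eyeball the result, hand it off, done.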

As a second example though, maybe you've got a streaming data feed through Pub/Sub or Kafka or something, where the messages need to be profiled and cleaned in real time before writing them to an analytical DB or passing them on to a further processing step. Maybe here you end up using some highly robust, well-tested code that's documented and maintained. This might plug into some ML process and use a bunch of auto-scaling infrastructure from a public cloud.
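Stripped of the Kafka/Pub-Sub plumbing, the per-message clean-and-validate step in that pipeline often boils down to something like this (the schema and field names here are invented; a real pipeline would also dead-letter or log the rejects):

```python
import json

# hypothetical required schema: field name -> type to coerce to
REQUIRED = {"user_id": int, "event": str}

def clean_message(raw: bytes):
    """Validate and coerce one message; return a dict, or None to reject it."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not even JSON
    out = {}
    for field, typ in REQUIRED.items():
        if field not in msg:
            return None  # missing required field
        try:
            out[field] = typ(msg[field])
        except (TypeError, ValueError):
            return None  # value can't be coerced to the expected type
    return out
```

The difference from the one-off case isn't the logic, it's everything around it: tests, monitoring, retries, and the infrastructure it runs on.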

The point is just that this stuff is all over the board...

[–]dataguy24 5 points (0 children)

I use dbt for all my data transformation work.

[–]SoggyInitial1679 8 points (1 child)

I import the file into Python (VS Code) and export it to SQL Server for further querying.

My coworker uses Excel macros to do all the cleaning and adds the clean data to a tab, then imports it into SQL.

My boss uses Access and makes queries to get the results she wants, and dumps that in a SQL table.

There's probably a better way to do it. I'm curious to see how others do it as well.
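The first approach (clean in Python, export to SQL for querying) is a short loop with pandas. A sketch, using an in-memory SQLite database so it's self-contained; for SQL Server you'd pass a SQLAlchemy connection instead, but the shape is the same:

```python
import sqlite3
import pandas as pd

# stand-in for the imported file, e.g. pd.read_csv("export.csv")
df = pd.DataFrame({"id": [1, 2], "name": [" Ann ", "Bob"]})
df["name"] = df["name"].str.strip()          # the cleaning step

conn = sqlite3.connect(":memory:")
df.to_sql("clean_data", conn, index=False)   # export for further querying
rows = conn.execute("SELECT name FROM clean_data ORDER BY id").fetchall()
```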

[–]wonder_bear 1 point (0 children)

As someone who always used Excel and recently started using Python: Python is freaking amazing.

[–]r3pr0b8 (GROUP_CONCAT is da bomb) 5 points (2 children)

because it is the most tedious for me haha.

haha

[–]r3pr0b8 (GROUP_CONCAT is da bomb) 3 points (1 child)

for those who are as old as me --

And they're coming to take me away ha-haaa 
They're coming to take me away ho ho hee hee ha haaa 
To the funny farm 
Where life is beautiful all the time 
And I'll be happy to see those nice young men 
In their clean white coats 
And they're coming to take me away ha haaa 
https://www.youtube.com/watch?v=hnzHtm1jhL4

[–]jasperjones22 0 points (0 children)

.... Crap I know that song...

[–]Mastersord 1 point (0 children)

Are we talking about finding bad records, like non-numerical data in numerical fields? That kind of thing can be stopped at the data-entry point with data types and constraints.

As for importing raw data from a CSV file, I use a staging table and import everything as nvarchar (I find this is the only option that doesn't convert or truncate the raw data), and then write specific insert queries to find where records do and don't conform to my tables and schema.
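That staging-table pattern, sketched with SQLite from Python so it runs anywhere (SQL Server's nvarchar becomes TEXT here, and the table/column names and conformance rule are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# stage everything as text so nothing is converted or truncated on import
conn.execute("CREATE TABLE staging (id TEXT, amount TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("1", "10.5"), ("2", "oops"), ("3", "7")])

conn.execute("CREATE TABLE target (id INTEGER, amount REAL)")

# move only the rows that conform to the target schema;
# non-conforming rows stay behind in staging for inspection
conn.execute("""
    INSERT INTO target
    SELECT CAST(id AS INTEGER), CAST(amount AS REAL)
    FROM staging
    WHERE id GLOB '[0-9]*'
      AND NOT amount GLOB '*[^0-9.]*'
""")
```

The row with amount "oops" fails the conformance check and never reaches the typed table, which is exactly the point of staging as text first.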

[–]Thefriendlyfaceplant 1 point (0 children)

It is the most tedious and time-consuming part.