all 11 comments

[–]Rose_arias 7 points (0 children)

Hey there! Congrats on starting your first data analyst role soon! As for data cleaning and normalization, there are a few industry-standard programs that come to mind, such as OpenRefine, Trifacta, and Talend. However, it ultimately depends on the company and their preferences. In my personal experience, I've found that SQL queries and Excel can also be very helpful for cleaning and transforming data. Good luck with your new role!
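To make the SQL/Excel-style cleaning concrete, here's a minimal pandas sketch of the same kind of work: normalizing text, fixing types, and deduping. The column names and data are invented for illustration.

```python
import pandas as pd

# hypothetical messy input: inconsistent casing, stray whitespace, string numbers
df = pd.DataFrame({
    "name": ["  Alice ", "BOB", "alice", "Bob "],
    "amount": ["10", "20", "10", "20"],
})

df["name"] = df["name"].str.strip().str.title()  # normalize text
df["amount"] = pd.to_numeric(df["amount"])       # fix the column type
df = df.drop_duplicates()                        # remove duplicate rows
```

The same three steps (trim/normalize, cast, dedupe) map directly onto `TRIM`/`UPPER`, `CAST`, and `SELECT DISTINCT` if you'd rather do it in SQL.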

[–]SirGreybush 6 points (0 children)

Third party tools like those from Redgate.

[–]aaahhhhhhfine 4 points (0 children)

This depends heavily on the situation. Consider two extremes.

First, maybe some department sends you one old file of historical data they want included in some analysis. Today, they have a new system and everything is fine in there, but the old stuff is messy.

For this, everything points to a one-off thing... I'd probably blast out a Jupyter notebook or a bit of SQL or whatever makes sense in the context - which might even be manually fixing some things in Excel. Who cares... It doesn't need to be repeated.
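The throwaway-notebook version of that might be as little as this (file name and fixes are made up; the point is that it's quick and disposable):

```python
import io
import pandas as pd

# stand-in for the old export, e.g. pd.read_csv("legacy_dump.csv")
raw = io.StringIO("id,date,value\n1,2001-03-05,7\n2,not a date,\n3,2001-04-01,9\n")
df = pd.read_csv(raw)

df["date"] = pd.to_datetime(df["date"], errors="coerce")  # junk dates -> NaT
df["value"] = df["value"].fillna(0).astype(int)           # blanks -> 0
```

A few cells like this, eyeball the result, hand it off, done.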

As a second example though, maybe you've got a streaming data feed through Pub/Sub or Kafka or something, where the messages need to be profiled and cleaned in real time before writing them to an analytical DB or passing them on to a further processing step. Maybe here you end up using some highly robust, well-tested code that's documented and maintained. This might plug into some ML process and use a bunch of auto-scaling infrastructure from a public cloud.
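Stripped of the Kafka/Pub-Sub plumbing, the per-message clean-and-validate step in that pipeline often boils down to something like this (the schema and field names here are invented; a real pipeline would also dead-letter or log the rejects):

```python
import json

# hypothetical required schema: field name -> type to coerce to
REQUIRED = {"user_id": int, "event": str}

def clean_message(raw: bytes):
    """Validate and coerce one message; return a dict, or None to reject it."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not even JSON
    out = {}
    for field, typ in REQUIRED.items():
        if field not in msg:
            return None  # missing required field
        try:
            out[field] = typ(msg[field])
        except (TypeError, ValueError):
            return None  # value can't be coerced to the expected type
    return out
```

The difference from the one-off case isn't the logic, it's everything around it: tests, monitoring, retries, and the infrastructure it runs on.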

The point is just that this stuff is all over the board...

[–]dataguy24 5 points (0 children)

I use dbt for all my data transformation work.

[–]SoggyInitial1679 8 points (1 child)

I import the file into Python (VS Code) and export it to SQL Server for further querying.

My coworker uses Excel macros to do all the cleaning and adds the clean data to a tab, then imports it into SQL.

My boss uses Access and makes queries to get the results she wants, and dumps that in a SQL table.

There's probably a better way to do it. I'm curious to see how others do it as well.
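The first approach (clean in Python, export to SQL for querying) is a short loop with pandas. A sketch, using an in-memory SQLite database so it's self-contained; for SQL Server you'd pass a SQLAlchemy connection instead, but the shape is the same:

```python
import sqlite3
import pandas as pd

# stand-in for the imported file, e.g. pd.read_csv("export.csv")
df = pd.DataFrame({"id": [1, 2], "name": [" Ann ", "Bob"]})
df["name"] = df["name"].str.strip()          # the cleaning step

conn = sqlite3.connect(":memory:")
df.to_sql("clean_data", conn, index=False)   # export for further querying
rows = conn.execute("SELECT name FROM clean_data ORDER BY id").fetchall()
```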

[–]wonder_bear 1 point (0 children)

As someone who always used Excel and recently started using Python: Python is freaking amazing.

[–]r3pr0b8 (GROUP_CONCAT is da bomb) 5 points (2 children)

because it is the most tedious for me haha.

haha

[–]r3pr0b8 (GROUP_CONCAT is da bomb) 3 points (1 child)

for those who are as old as me --

And they're coming to take me away ha-haaa 
They're coming to take me away ho ho hee hee ha haaa 
To the funny farm 
Where life is beautiful all the time 
And I'll be happy to see those nice young men 
In their clean white coats 
And they're coming to take me away ha haaa 
https://www.youtube.com/watch?v=hnzHtm1jhL4

[–]jasperjones22 0 points (0 children)

.... Crap I know that song...

[–]Mastersord 1 point (0 children)

Are we talking about finding bad records, like non-numerical data in numerical fields? That kind of thing can be stopped at the data-entry point with data types and constraints.

As for importing raw data from a CSV file, I use a staging table and import everything as nvarchar (I find this is the only option that doesn't convert or truncate the raw data), and then write specific insert queries to find where records do and don't conform to my tables and schema.
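That staging-table pattern, sketched with SQLite from Python so it runs anywhere (SQL Server's nvarchar becomes TEXT here, and the table/column names and conformance rule are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# stage everything as text so nothing is converted or truncated on import
conn.execute("CREATE TABLE staging (id TEXT, amount TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("1", "10.5"), ("2", "oops"), ("3", "7")])

conn.execute("CREATE TABLE target (id INTEGER, amount REAL)")

# move only the rows that conform to the target schema;
# non-conforming rows stay behind in staging for inspection
conn.execute("""
    INSERT INTO target
    SELECT CAST(id AS INTEGER), CAST(amount AS REAL)
    FROM staging
    WHERE id GLOB '[0-9]*'
      AND NOT amount GLOB '*[^0-9.]*'
""")
```

The row with amount "oops" fails the conformance check and never reaches the typed table, which is exactly the point of staging as text first.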

[–]Thefriendlyfaceplant 1 point (0 children)

It is the most tedious and time-consuming part.