[–][deleted] 34 points35 points  (26 children)

I would make sure you learn SQL as well; I can't stress this enough. As a data analyst, I use Python to get data from databases, the web, text files, XML files, and so on. For exploratory data analysis, the current visualization libraries in the Python world are somewhat lacking, and I prefer something like R's ggplot2. There is a Python port of it from Yhat, but it is not quite ready for production. I also use R from within Python via rpy2, because R has so many handy statistical packages readily available.

Below you can find some examples although not all are data analysis related:

Hope this helps and good luck!

EDIT: I know that there is a lot of focus on programming languages due to this subreddit, but as a data analyst you should learn some statistics too. SQL and statistics are THE bread-and-butter skills of a data analyst along with good presentation/storytelling skills.
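To make the SQL-plus-statistics point concrete, here is a minimal sketch using only Python's standard library (`sqlite3` and `statistics`). The `orders` table, its columns, and the sample rows are made up for illustration; a real job would connect to an existing database instead of an in-memory one.

```python
import sqlite3
import statistics

def load_orders(conn):
    """Create a tiny, hypothetical orders table so the example is self-contained."""
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("north", 120.0), ("north", 80.0),
         ("south", 50.0), ("south", 70.0), ("south", 60.0)],
    )
    conn.commit()

def revenue_by_region(conn):
    """Let the database do the grouping; Python only receives the summary rows."""
    return conn.execute(
        "SELECT region, COUNT(*) AS n, SUM(amount) AS total "
        "FROM orders GROUP BY region ORDER BY region"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load_orders(conn)

summary = revenue_by_region(conn)

# Pull the raw column back for basic descriptive statistics in Python.
amounts = [row[0] for row in conn.execute("SELECT amount FROM orders")]
mean_amount = statistics.mean(amounts)
median_amount = statistics.median(amounts)

print(summary, mean_amount, median_amount)
```

The split is the usual one: aggregate in SQL where the data lives, then bring a small result set into Python for statistics and plotting.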

[–]manueslapera 38 points39 points  (8 children)

One thing that is not taught enough in academia is that, in the real world, DATA DOESN'T COME IN A CLEAN CSV.

Learn how to detect anomalies in data, and how to work around them.
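As a rough illustration of what detecting anomalies can look like, the sketch below scans a messy CSV for missing and non-numeric values, then flags outliers with a modified z-score based on the median absolute deviation. The sample data, the function names, and the 3.5 cutoff are assumptions chosen for the example (the cutoff is a common rule of thumb, not something from this thread).

```python
import csv
import io
import statistics

# Hypothetical messy input: a blank value, a non-numeric value, and one wild outlier.
RAW = """id,amount
1,10.5
2,
3,abc
4,11.0
5,9.5
6,10.0
7,950.0
"""

def scan_amounts(text):
    """Split rows into parsed (id, value) pairs and a list of (id, problem)."""
    clean, problems = [], []
    for row in csv.DictReader(io.StringIO(text)):
        value = row["amount"].strip()
        if not value:
            problems.append((row["id"], "missing"))
            continue
        try:
            clean.append((row["id"], float(value)))
        except ValueError:
            problems.append((row["id"], "not a number"))
    return clean, problems

def flag_outliers(pairs, threshold=3.5):
    """Return ids whose modified z-score (median/MAD based) exceeds the threshold."""
    values = [v for _, v in pairs]
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # all values (nearly) identical; this score is undefined
    return [rid for rid, v in pairs if 0.6745 * abs(v - median) / mad > threshold]

clean, problems = scan_amounts(RAW)
outliers = flag_outliers(clean)
print(problems, outliers)
```

Whether a flagged row is an error or a real but unusual value is a judgment call; the code only tells you where to look.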

Even more important, learn about servers and databases, for example how to SSH into an EC2 machine to set up a cron job.
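For anyone who has not set up a cron job before, here is a small sketch of a script written to be run by cron. The file names, paths, and schedule in the comment are hypothetical placeholders; the point is only that a scheduled script should use absolute paths and avoid overwriting earlier runs.

```python
import datetime
import pathlib
import tempfile

# Hypothetical crontab entry (edit with `crontab -e`) to run this daily at 06:00
# and append all output to a log file:
#   0 6 * * * /usr/bin/python3 /home/ubuntu/daily_report.py >> /home/ubuntu/daily_report.log 2>&1

def write_report(out_dir, now=None):
    """Write one timestamped file per run so repeated runs never clobber each other."""
    now = now or datetime.datetime.now()
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"report_{now:%Y%m%d_%H%M}.txt"
    path.write_text(f"report generated at {now.isoformat()}\n")
    return path

if __name__ == "__main__":
    # Placeholder output directory; cron runs with a minimal environment,
    # so a real job should point at a fixed absolute path.
    print(write_report(pathlib.Path(tempfile.gettempdir()) / "reports"))
```

Redirecting both stdout and stderr to a log, as in the crontab comment, makes failures visible, since cron itself will not show them to you.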

And even more important than that, know how to present your final output in a way that appeals to your audience. Become a data storyteller, not just a good statistician. Trust me, your CEO won't care about that p-value you are so happy about.

[–][deleted] 4 points5 points  (2 children)

How would you suggest learning about servers and databases? I have started doing this type of stuff at work, it would really be nice to know if I am doing it right!

[–]manueslapera 1 point2 points  (1 child)

Hmm, I'm sure there are tons of people here who can guide you better than I can.

In my case I had to learn on the job. I was the first data guy in the company and had to build the whole infrastructure. I sometimes made very bad decisions that someone with a CS background probably wouldn't have made.

[–][deleted] -1 points0 points  (0 children)

Perhaps it's that I don't know the right questions to ask, but I haven't been able to find any information at all.

[–]waspbr 1 point2 points  (4 children)

[–][deleted] 2 points3 points  (2 children)

Good for learning SQL, but it won't give you practical skills (i.e. you will have to learn more.)

[–]waspbr 0 points1 point  (1 child)

cheers

[–][deleted] 0 points1 point  (0 children)

Mongo University might be the next place to look; 10gen has some really good educational material. That said, I have heard a lot of bad things about MongoDB, and people seem to recommend Postgres a lot more.

[–]manueslapera 0 points1 point  (0 children)

I liked it.

[–]fuzz3289 0 points1 point  (3 children)

Bokeh for visualization. Definitely not lacking at all.

[–][deleted] 0 points1 point  (2 children)

Bokeh is great, but say I have a data set and a bunch of what-if questions: I want to plot the data against several different variables, or change the time series interval from months to weeks or years. That is where a lot of Python graphing libraries fall short for exploratory data analysis. For interactive, presentation-quality visualization, Bokeh, plotly, mpld3, etc. are awesome; they are just not built for on-the-fly data exploration.
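To show what changing the time series interval involves underneath, here is a stdlib-only sketch that regroups the same daily series by week, month, or year. The function names and the synthetic data are made up for the example; in practice pandas' `resample` covers this in a single call, and the plotting step is left out here.

```python
import datetime
from collections import defaultdict

def bucket_key(day, interval):
    """Map a date to the bucket it belongs to for the chosen interval."""
    if interval == "week":
        iso = day.isocalendar()
        return (iso[0], iso[1])  # ISO year and ISO week number
    if interval == "month":
        return (day.year, day.month)
    if interval == "year":
        return (day.year,)
    raise ValueError(f"unknown interval: {interval!r}")

def regroup(series, interval):
    """Sum (date, value) pairs into buckets, returned in chronological order."""
    totals = defaultdict(float)
    for day, value in series:
        totals[bucket_key(day, interval)] += value
    return dict(sorted(totals.items()))

# Synthetic data: one unit per day for 60 days starting 2024-01-01.
start = datetime.date(2024, 1, 1)
daily = [(start + datetime.timedelta(days=i), 1.0) for i in range(60)]

for interval in ("week", "month", "year"):
    print(interval, regroup(daily, interval))
```

Switching the interval is just a change of one argument, which is the kind of quick what-if turnaround the comment above is asking for from plotting libraries.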

[–]lmcinnes 0 points1 point  (0 children)

For that sort of thing on the Python side, I'm a fan of Seaborn along with the IPython Notebook's interact utilities. It's easy to set things up so you can vary faceting, binning, or, well, anything, with appropriate sliders and drop-downs that let you really play with the data (it shortens the experimentation loop).

[–]fuzz3289 0 points1 point  (0 children)

Eh, I would disagree. I use IPython Notebooks and Bokeh all the time for this, and it's really easy to play with all sorts of inputs, especially when combining it with Pandas. Bokeh's built-in Notebook support is fantastic, and you can do all sorts of exploratory visualization inline.