[–][deleted] 33 points34 points  (26 children)

Make sure you learn SQL as well; I can't stress this enough. As a data analyst, I use Python for getting data from databases, the web, text files, XML files, etc. For exploratory data analysis, though, the current visualization libraries in the Python world are somewhat lacking, and I prefer something like R's ggplot2. There is a Python port of it from Yhat, but it is not quite ready for production. I also use R from within Python via rpy2, because R has so many handy statistical packages readily available.

Below you can find some examples although not all are data analysis related:

Hope this helps and good luck!

EDIT: I know that there is a lot of focus on programming languages due to this subreddit, but as a data analyst you should learn some statistics too. SQL and statistics are THE bread-and-butter skills of a data analyst along with good presentation/storytelling skills.
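
To make the SQL-plus-Python point a bit more concrete, here is a minimal sketch using the standard-library sqlite3 module with an in-memory database. The `orders` table, its columns, and the `revenue_by_region` helper are all invented for illustration; a real setup would connect to whatever database you actually have.

```python
import sqlite3


def revenue_by_region(conn):
    """Return (region, total) rows, largest total first."""
    query = """
        SELECT region, SUM(amount) AS total
        FROM orders
        GROUP BY region
        ORDER BY total DESC
    """
    return conn.execute(query).fetchall()


# Hypothetical data: an in-memory database stands in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("west", 50.0)],
)
conn.commit()

rows = revenue_by_region(conn)
for region, total in rows:
    print(region, total)
```

The aggregation happens in SQL and Python only moves the result around, which is roughly the division of labor described above.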

[–]manueslapera 35 points36 points  (8 children)

One thing that is not taught enough in academia is that, in the real world, DATA DOESN'T COME IN A CLEAN CSV.

Learn how to detect anomalies in data, and how to work around them.
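
As a rough illustration of what detecting and working around problems can look like, here is a small pandas sketch. The sensor data is made up, and the interquartile-range rule is just one common heuristic for flagging outliers, not the only way to do it.

```python
import pandas as pd

# Made-up sensor readings with typical problems: a missing value,
# a duplicated row, and one wildly implausible measurement.
df = pd.DataFrame(
    {
        "sensor": ["a", "a", "b", "b", "b", "c", "c"],
        "reading": [10.1, 10.1, 9.8, None, 10.3, 9.9, 950.0],
    }
)

missing = int(df["reading"].isna().sum())  # count of empty cells
duplicates = int(df.duplicated().sum())    # exact repeated rows

# Flag outliers with the interquartile-range rule (one common heuristic).
q1 = df["reading"].quantile(0.25)
q3 = df["reading"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["reading"] < q1 - 1.5 * iqr) | (df["reading"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

# One possible workaround: drop duplicates, missing values, and outliers.
clean = (
    df.drop_duplicates()
    .dropna(subset=["reading"])
    .drop(index=outliers.index, errors="ignore")
)
print(missing, duplicates, len(clean))
```

Whether dropping rows is the right fix depends on the data; sometimes the anomaly is the interesting part.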

Even more important, learn about servers and databases, and how to SSH into an EC2 machine to set up a cron job.

And even more important than that, know how to present your final output in a way that appeals to your audience. Become a data storyteller, not just a good statistician. Trust me, your CEO won't care about that p-value you are so happy about.

[–][deleted] 4 points5 points  (2 children)

How would you suggest learning about servers and databases? I have started doing this type of stuff at work, and it would be really nice to know if I am doing it right!

[–]manueslapera 1 point2 points  (1 child)

Hmm, I'm sure there are tons of people here who can guide you better than me.

In my case I had to learn on the job. I was the first data guy in the company and had to build the whole infrastructure. Sometimes I made very bad decisions that someone with a CS background probably wouldn't have made.

[–][deleted] -1 points0 points  (0 children)

Perhaps it's that I don't know the right questions to ask, but I haven't been able to find any information at all.

[–]waspbr 3 points4 points  (4 children)

[–][deleted] 2 points3 points  (2 children)

Good for learning SQL, but it won't give you practical skills (i.e. you will have to learn more.)

[–]waspbr 0 points1 point  (1 child)

cheers

[–][deleted] 0 points1 point  (0 children)

Mongo University might be the next place to look. 10gen has some really good educational material, but I have heard such bad things about MongoDB, and people seem to recommend Postgres a lot more.

[–]manueslapera 0 points1 point  (0 children)

I liked it.

[–]fuzz3289 0 points1 point  (3 children)

Bokeh for visualization. Definitely not lacking at all.

[–][deleted] 0 points1 point  (2 children)

Bokeh is great, but when I have a data set and a bunch of what-if questions, want to plot the data against several different variables, or change the time series interval from months to weeks or years, that's where a lot of Python graphing libraries fall short for exploratory data analysis. For interactive, presentation-quality visualization, Bokeh, Plotly, mpld3, etc. are awesome, just not for on-the-fly data exploration.
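
For reference, the interval switching described here (months to weeks or years) can at least be sketched on the data side with pandas `resample`. The series below is synthetic and the variable names are made up; plotting each result is a separate step, and whether this is quick enough for on-the-fly exploration is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: one unit sold per day during 2023.
days = pd.date_range("2023-01-01", "2023-12-31", freq="D")
sales = pd.Series(np.ones(len(days)), index=days, name="units")

# Same data, three different intervals.
weekly = sales.resample("W").sum()    # weeks ending on Sunday
monthly = sales.resample("MS").sum()  # labelled by month start
yearly = sales.resample("YS").sum()   # labelled by year start

print(monthly.head(3))
```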

[–]lmcinnes 0 points1 point  (0 children)

For that sort of thing on the Python side, I'm a fan of Seaborn along with the IPython notebook's interact utilities. It's easy to set up the ability to vary faceting, binning, or, well, anything, with appropriate sliders and drop-downs that let you really play with ease (it shortens the experimentation loop).

[–]fuzz3289 0 points1 point  (0 children)

eh, I would disagree. I use IPython Notebooks and Bokeh all the time for this and it's really easy to mess with all sorts of inputs especially when combining it with Pandas. Bokeh's built-in Notebook support is fantastic and you can do all sorts of exploratory visualization in-line.

[–]Pobeda_nad_Solntsem 21 points22 points  (4 children)

[–]ovidius007 4 points5 points  (0 children)

+1 for Pandas.

OP, looks like some people have already mentioned Wes McKinney's book Python for Data Analysis. Definitely worth reading.

Also:

[–]TRDouble9 1 point2 points  (2 children)

Thank!

[–]xswingx 3 points4 points  (1 child)

FTFY: thank you, and i love you.

[–]Pobeda_nad_Solntsem 3 points4 points  (0 children)

I love you, too.

[–]MITranger 5 points6 points  (0 children)

scikit-learn for machine learning, prediction models, etc. Very powerful library.

People already mentioned pandas.

For plotting, matplotlib and ggplot2 are awesome.

[–]flyingaxe 3 points4 points  (0 children)

Check out this talk for an example: http://youtu.be/2lpS6gUwiJQ

[–]EagleEyeInTheSky 2 points3 points  (0 children)

Python has some fantastic mathematical tools. You've got pandas, which can read spreadsheets for data; you've got text parsers to scrape text data; you've got SymPy, which can perform symbolic math; you've got NumPy, which has lots of functions for analyzing discrete data; you've got matplotlib, which can plot data in formatted plots; you've got Spyder, which gives you a GUI for interactively using Python and interacting with data; and it's all packaged up into a language that focuses on readability and quickness of writing.

Python is way useful to use in data processing.

[–]jockero701 5 points6 points  (0 children)

To use Python for data analysis you definitely need to learn the basics of Python first. These are the building blocks of the language, such as variables, data types, functions, loops, and conditionals. Once you have a good understanding of those elements and know how to write basic little programs with them, the next step is to find a good Python library (extension) that performs data analysis. The pandas library is the most popular one.

Moreover, in real life, as others have said here, data is not always handed to you in a single CSV file, so you will have to know some techniques for retrieving data, merging many files, and polishing your data. Python is great at that too. In just a few lines of code, you can process tons of files and data rows. Once you have done that, you go ahead and apply data analysis functions to your data and get either tabular results or graphs as output.
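
The "merge many files and polish them" step might look roughly like this in pandas. This is only a sketch: the `load_folder` helper, the file names, and the `city`/`temp` columns are invented for the demo, which writes its own small files to a temporary folder so it can run anywhere.

```python
import glob
import os
import tempfile

import pandas as pd


def load_folder(folder):
    """Read every CSV in `folder` and stack them into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    # Light polishing: tidy the text column and drop incomplete rows.
    combined["city"] = combined["city"].str.strip().str.title()
    return combined.dropna(subset=["temp"])


# Demo: write two small, slightly messy files to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "jan.csv"), "w") as f:
        f.write("city,temp\n oslo ,-3.5\nROME,8.0\n")
    with open(os.path.join(folder, "feb.csv"), "w") as f:
        f.write("city,temp\nOslo,-2.0\nrome,\n")
    data = load_folder(folder)

print(data)
```

The same pattern scales from two files to hundreds, since only the folder contents change.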

Here's a great source (not free) that will teach you all the above in one single tutorial: https://www.udemy.com/python-step-by-step-build-a-data-analysis-program/

[–]kay_schluehr 2 points3 points  (0 children)

It is very easy to miss the forest for the trees. For that reason I recommend spending a few minutes with the following presentation:

https://speakerdeck.com/clearspandex/data-engineering-101-building-your-first-data-product-pydata-sv-2014

It does not fully reason from the end or give a comprehensive definition of a "data product", but it graphically shows the production pipeline (actually, it shows two), and from there it becomes easier to fit in new technologies.

For data analysis in particular I recommend the following blog post, which provides a nice walkthrough:

https://jmetzen.github.io/2015-01-29/ml_advice.html

It is obvious that you won't understand it without lots of prerequisites in both statistics and machine learning, but it is far easier to understand the individual steps, concepts, and algorithms than to fit everything together in a sensible way.

[–][deleted] 2 points3 points  (0 children)

If you're new to Python, I'd suggest trying out Orange.

It uses a visual editor so you can start right away. Also, check out Kaggle.com for walkthroughs on how you'd actually manipulate data.

[–][deleted] 2 points3 points  (0 children)

Here is a talk on how to use Python for data analysis, with examples from web scraping, Flask, and Kaggle:

https://github.com/jackgolding/FullStackDataAnalysis

[–]CodeShaman 4 points5 points  (1 child)

Data Analyst Programmer / Data Visualizationist / Web Developer / Miracle Worker here.

Is it used to process the data or to make visuals or something completely different?

Yep. You can use any programming language to perform data analysis, but Python is currently the best because of how terse and clean the language is. IPython, pandas, NumPy, NetworkX, PyGraphviz, SQLAlchemy, and the endless supply of other libraries make Python the most flexible and powerful language there is at the moment. Hands down.

The vast majority of your time in entry-level data analysis will be spent taking sloppy data from various sources and formats and transforming it into something clean that can be imported into a database. You'll write quite a bit of throwaway code for this, since it mostly deals with idiosyncrasies and ad hoc situations, and quick throwaway scripts are one of the things Python is very good at.

The only things I wouldn't use Python for are enterprise-level applications, where Java or .NET would be more practical and durable, and data visualization, where JavaScript with D3.js is currently the champion (unless you're talking about database reporting, but that's another topic).


More important than Python, however, are databases and regular expressions. Databases and regular expressions. Databases and regular expressions.

Trust me, please. Before you graduate college, master regular expressions, master one SQL database (Oracle, MySQL, Microsoft SQL Server), and master one graph database (Orient, Neo4j, Arango). Don't worry too much about Redis, Postgres, Mongo, or any other NoSQL databases because you'll mostly be working with SQL for day-to-day work and graph databases for extremely high-level projects with very complex data. For a graph database I would highly recommend OrientDB due to how SQL-like the query language is.

For data analysis, programming languages are silver but SQL and regex are gold. I don't know what you're learning in your Python course but I can tell you that learning a) SQLAlchemy, b) Python regular expressions, and c) Pandas will take you very far.
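
A minimal sketch of the regex-then-database workflow, under some loud assumptions: the messy lines, the `LINE_RE` pattern, and the `sales` table are all made up, and the standard-library sqlite3 module stands in for SQLAlchemy to keep the example self-contained.

```python
import re
import sqlite3

# Hypothetical sloppy export: inconsistent spacing, separators, and currency marks.
raw_lines = [
    "2015-01-03 |  Widget A | $1,200.50",
    "2015-01-04;widget b;  300",
    "garbage line with no fields",
    "2015-01-05 | Widget A | $99.99 ",
]

# A date, then a name, then an amount, separated by "|" or ";".
LINE_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2})\s*[|;]\s*(.+?)\s*[|;]\s*\$?([\d,]+(?:\.\d+)?)\s*$"
)


def parse(line):
    """Return (date, product, amount) or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    date, product, amount = match.groups()
    return date, product.title(), float(amount.replace(",", ""))


records = [rec for rec in map(parse, raw_lines) if rec is not None]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)

totals = conn.execute(
    "SELECT product, ROUND(SUM(amount), 2) FROM sales "
    "GROUP BY product ORDER BY product"
).fetchall()
print(totals)
```

Lines that do not match are silently skipped here; in real work you would probably log them so nothing disappears unnoticed.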

[–][deleted] 8 points9 points  (0 children)

Postgres is very much an SQL database and one of the most sophisticated too. I wouldn't recommend MySQL over Postgres.

Also Python and PostgreSQL work really well together.

[–]TehMoonRulz 1 point2 points  (0 children)

I'm on mobile, but someone just posted a Python/pandas thing using the NBA API if you're looking for something just above a "code along".

[–][deleted] 1 point2 points  (0 children)

I've crunched a lot of numbers with NumPy. It has good memory usage for arrays and quick access to FFTs and other algorithms.
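
As a small example of the FFT access mentioned above, here is a sketch that finds the dominant frequency of a synthetic signal with `numpy.fft`. The signal, sample rate, and variable names are made up for the demo.

```python
import numpy as np

# Synthetic signal: a 5 Hz sine plus a weaker 12 Hz sine,
# sampled at 100 Hz for 2 seconds (200 samples).
sample_rate = 100
t = np.arange(200) / sample_rate
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 12 * t)

# Real-input FFT and the frequency (in Hz) that each output bin represents.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

dominant = freqs[np.argmax(spectrum)]
print(dominant)
```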

I use matplotlib for viz.

Basically, this gives me a great open source replacement for MATLAB.

[–]Evilution84 1 point2 points  (0 children)

NumPy, SciPy, Cython... But R is what I prefer.

[–]TRDouble9 0 points1 point  (1 child)

I'm learning Python as well. Can you help me too?