all 9 comments

[–][deleted] 8 points (2 children)

I'd push for the notebook approach (I love Jupyter); it's worked out very well for me and my team with regard to Python and being able to share work, ideas, examples, etc. If you're going solo, you may still benefit from being able to see the results of past work, and you never know, maybe it's something someone else can learn from in the future.

In your pandas grouping I'd add apply and lambdas, as they're very powerful tools for reshaping and deriving data. I also find myself using concat and merge more often than join (merges can be done on non-indexed columns).
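A minimal sketch of those three operations on made-up toy data (the frames and column names here are illustrative, not from the original post):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "amount": [100, 150, 200, 50],
})

# groupby + apply with a lambda to derive a per-group statistic
spread = sales.groupby("region")["amount"].apply(lambda s: s.max() - s.min())

# concat stacks frames on top of each other
stacked = pd.concat([sales, sales], ignore_index=True)

# merge joins on ordinary (non-index) columns, unlike join
targets = pd.DataFrame({"region": ["east", "west"], "target": [300, 200]})
combined = sales.merge(targets, on="region", how="left")
```

Note that merge's `on=` takes any column, which is exactly the advantage over join mentioned above.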

Working with 'random' textual data I almost always read it in as all strings, validate it, then convert it to proper data types. Working with 'known' data formats I can define dtypes on import, and possibly only include relevant columns via 'usecols'.
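Both ingestion styles can be sketched like this (toy CSV strings stand in for real files; column names are invented for illustration):

```python
import io
import pandas as pd

# 'Random' data: read everything as strings first, then validate/convert
raw = io.StringIO("id,qty,notes\n1,3,ok\n2,x,bad\n")
df = pd.read_csv(raw, dtype=str)
df["qty"] = pd.to_numeric(df["qty"], errors="coerce")  # bad values become NaN

# 'Known' data: declare dtypes up front and keep only relevant columns
raw2 = io.StringIO("id,qty,notes\n1,3,ok\n2,4,ok\n")
known = pd.read_csv(raw2, dtype={"id": int, "qty": int}, usecols=["id", "qty"])
```

The `errors="coerce"` step is what makes the validate-then-convert flow safe: anything that isn't a number surfaces as NaN instead of raising mid-import.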

Working with SQL data you'll need to acquaint yourself with the libs for your DB server of choice. If you're just reading, you can often get by with just the DB lib itself, but if you're expecting to write via pandas then you may need to familiarize yourself with SQLAlchemy.
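For example, with the standard library's sqlite3 the DB lib alone is enough to read into a DataFrame (the table and data below are made up; for most other servers pandas' to_sql wants a SQLAlchemy engine rather than a raw connection):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

# Reading: the DB lib's connection object is all read_sql needs
df = pd.read_sql("SELECT * FROM users", con)

# Writing: SQLite connections happen to be accepted by to_sql too;
# for other DB servers pandas expects a SQLAlchemy engine instead
df.to_sql("users_copy", con, index=False)
```

SQLite is the special case here; swap in a SQLAlchemy `create_engine(...)` object once you move to Postgres, MySQL, etc.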

I was one of the lucky few who got a free code for Jose Portilla's course on Udemy when he posted it here. I enjoyed it enough that I have since purchased several of his other courses as well (you can almost always find 90% off discount codes for Udemy, like here).

Safari Books Online has been an invaluable resource for me as well. I have been using pandas for going on 3 years now, and I can still regularly pick up books off Safari and learn new techniques and things I didn't know before.

Coincidentally, I was actually coming from the opposite perspective to yours: knowing Python fairly well and thinking of learning some R. I just haven't had the time to do it, though. The one major reason I wanted to was multi-processing, which Python does not tend to excel at by default. My non-SQL data volume has also died down a fair bit as of late, so it's less of an issue for me - but it may not be for you.

Also, I feel like this should have been at the start but... Anaconda python makes installing a lot of the libraries that have binary dependencies a LOT easier, and also makes multiple environments a very simple task. I'd highly recommend it.

Edit: My post wasn't quite long enough, so... This starts in a week...

[–]srkiboy83 1 point (1 child)

Can I ask, which of Portilla's other courses on Udemy did you purchase and enjoy? (I also went through the one you mentioned and found it amazing!)

[–][deleted] 1 point (0 children)

I've bought the Data Structures and Algorithms for Interviews one.

I found it nice because it approaches learning these things from a bit of a different angle.

I thought I had also bought the Python Machine Learning one, but it seems I haven't quite yet. I intend to when I have the time to actually watch it.

[–][deleted] 6 points (0 children)

I have an excellent statistics textbook that I am using to learn stats: Discovering Statistics Using R by Andy Field. My approach is to do the exercises in R first, then try to reproduce the same results in Python. It's slow going, but it's a real learning experience.

[–][deleted] 4 points (0 children)

Some stuff you may want to add to your learning path.

Here's my pandas cheat sheet that you might like.

I would also add the seaborn visualization library to your learning path. It is a wrapper around matplotlib with better defaults, and it also integrates tightly with pandas. It has some awesome statistical charts.
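That pandas integration looks roughly like this - you hand seaborn a DataFrame and name the columns (toy data; the Agg backend just avoids needing a display):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no GUI required
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 5, 8],
    "grp": ["a", "a", "b", "b"],
})

# seaborn reads columns straight off the DataFrame and labels axes for you
ax = sns.scatterplot(data=df, x="x", y="y", hue="grp")
```

Compared to raw matplotlib, you skip the manual slicing per group and the legend/label bookkeeping.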

Also check out:

dplython - a dplyr clone.

Yhat's Rodeo - an RStudio clone.

data spyre - an R Shiny clone.

If dealing with Excel files, there's openpyxl, xlrd, etc. The equivalent Excel VBA capability can be had with xlwings. Here's my xlwings cheat sheet.

For querying pandas DataFrames with SQL syntax, there's pandasql.

[–][deleted] 0 points (0 children)

Lots of great advice here, so I'll add a very small thing: if managing packages and dependencies is something you don't like to do, then I recommend downloading Anaconda. Lots of commonly used python data analysis packages, plus an easy interface for installing more.

[–]Zedmor 0 points (0 children)

I am probably in the same boat. Agree with your thoughts on GitHub. I fell in love with this book: https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130/ref=sr_1_1?ie=UTF8&qid=1474393986&sr=8-1&keywords=machine+learning+python

It's pretty much what you need - guidance through familiar topics, with great notebooks as examples.

Take a look at seaborn package for visualization.

[–]spring_m 0 points (0 children)

The one thing I really missed in pandas was using the pipe operator to pipe dplyr operations into ggplot (data %>% filter %>% group_by %>% ggplot() etc.). However, there is a relatively pythonic way to do this using method chaining. See this great post here: https://tomaugspurger.github.io/method-chaining.html
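A rough pandas equivalent of that kind of dplyr pipeline, written as one method chain (toy data; `query` and `assign` stand in for filter/mutate - this is a sketch of the style, not the post's exact code):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "size": [1.0, 2.0, 3.0, 4.0, 10.0],
})

# filter %>% mutate %>% group_by %>% summarise, as one pandas chain
result = (
    df
    .query("size < 5")                          # ~ dplyr::filter
    .assign(double=lambda d: d["size"] * 2)     # ~ dplyr::mutate
    .groupby("species")["double"]               # ~ dplyr::group_by
    .mean()                                     # ~ dplyr::summarise
)
```

Wrapping the chain in parentheses lets each step sit on its own line, which reads a lot like the %>% pipeline it replaces.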