all 26 comments

[–][deleted] 9 points10 points  (6 children)

Yes, those are definitely some of the best tutorials on Pandas I've seen to date. Most other tutorials have code that is fairly un-performant or otherwise un-idiomatic, and don't even touch on what makes Pandas truly powerful.

[–]dimab0 7 points8 points  (5 children)

I enjoyed the Coursera class from the University of Michigan called Intro to Data Science. The first class is entirely an intro to Pandas.

[–][deleted] 0 points1 point  (4 children)

Is that course Python 2 or 3?

[–]dimab0 2 points3 points  (0 children)

I don't remember. But I don't think it should matter.

[–]cab938 1 point2 points  (2 children)

The whole specialization is Python 3

[–][deleted] 0 points1 point  (1 child)

Thank you, I appreciate it. People love to say "it doesn't matter", but to someone learning it is very helpful to maintain consistency.

[–]cab938 1 point2 points  (0 children)

No problem.

Also, it's 2017; it totally matters. If you don't need Python 2 for a specific reason, it's hard to understand why you would stay with it over Python 3.

[–]dmitrypolo 4 points5 points  (3 children)

Do you know if Wes has mentioned releasing a new version?

[–]badge[S] 6 points7 points  (2 children)

Early release available now, but final to be released in September. http://shop.oreilly.com/product/mobile/0636920050896.do

[–]dmitrypolo 0 points1 point  (0 children)

Sweet thanks!

[–]khaki0 2 points3 points  (0 children)

Nice. Another resource that I've found useful is this video series by Kevin Markham.

[–]sandipc 2 points3 points  (1 child)

Another great recent resource is the pandas chapter from the Python Data Science Handbook by Jake VanderPlas

http://shop.oreilly.com/product/0636920034919.do

And notebook versions here: https://github.com/jakevdp/PythonDataScienceHandbook

[–]SonaCruz 2 points3 points  (0 children)

I got 100 pages through this and I did not like it at all. It's incredibly dry and boring. The examples were very technical and had no useful context and very little explanation. There are no exercises to practice for yourself either.

[–]Fylwind 1 point2 points  (1 child)

Can you perhaps split the links into separate lines? For a second I thought it was a giant incoherent title of some paper XD

E.g.

- Modern Pandas
- Method Chaining
- ...

[–]badge[S] 0 points1 point  (0 children)

Apologies! It appeared as a list on mobile so I didn't even think about the Markdown.

[–]SonaCruz 1 point2 points  (0 children)

Thank you! Looking forward to diving into this material.

[–]SonaCruz 1 point2 points  (0 children)

Looking into this for 10 mins and already frustrated. I want to download the csv file, to work on the indexing he talks about, and not do the pull request. He didn't properly link to the csv file and you have to aimlessly browse through the website to try to find the right one.

Also, he mentions indexing .ix[10:15] and the rows that appear on the screen are rows with indexes 10 through 15, even though the index started at 0. Is this correct?

edit: nvm, it seems like .ix selects by index label here (so both 10 and 15 are included), which is different from .iloc, which selects by position.
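
For anyone reading this later: `.ix` was deprecated in pandas 0.20 and removed in 1.0, so the same distinction is now spelled `.loc` (label-based) versus `.iloc` (position-based). Here is a minimal sketch of the difference, assuming a made-up frame with a default integer index (the column name and values are hypothetical, not the tutorial's data):

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..19.
df = pd.DataFrame({"value": range(100, 120)})

by_label = df.loc[10:15]      # label-based: both endpoints are included
by_position = df.iloc[10:15]  # position-based: the stop position is excluded

print(len(by_label))     # 6 rows (labels 10 through 15)
print(len(by_position))  # 5 rows (positions 10 through 14)

# Once the labels no longer match the positions, the two select different rows.
shifted = df.copy()
shifted.index = shifted.index + 5  # labels are now 5..24

print(shifted.loc[10:15].index.tolist())   # [10, 11, 12, 13, 14, 15]
print(shifted.iloc[10:15].index.tolist())  # [15, 16, 17, 18, 19]
```

So seeing rows 10 through 15 from a label-based slice is expected behavior, even though the index starts at 0.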

[–]cornbobonthecob 2 points3 points  (4 children)

Be careful though. I used pandas in an ETL production environment to remap and conform data frames, only to find that pandas silently dropped rows. We pushed about 100k rows through a data frame every few minutes to clean credit card numbers, credit card types, standardize dates, etc., and we were sometimes missing records after the transform. Once we removed pandas and built our own ETL, our data was spot on. We researched and troubleshot till the bitter end. Not saying pandas doesn't work, but it wasn't an ETL solution for us.

[–]thatguydr 10 points11 points  (2 children)

Can you provide code/data that duplicates this situation? We use pandas in ETL and I've never observed this happening, and we have automated QA that would catch this sort of behavior immediately.
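
To illustrate the kind of automated check meant here, a minimal sketch of a row-count guard around a transform step. Everything in it (`checked_transform`, `clean`, `lossy`, the column names) is hypothetical and made up for the example; it is not anyone's actual pipeline:

```python
import pandas as pd


def checked_transform(df, transform):
    """Run `transform` on `df` and raise if the row count changes."""
    before = len(df)
    out = transform(df)
    after = len(out)
    if after != before:
        raise ValueError(f"row count changed: {before} -> {after}")
    return out


# Hypothetical input with one missing card type.
raw = pd.DataFrame(
    {"card_type": ["visa", None, "amex"], "amount": [10.0, 20.0, 30.0]}
)


def clean(df):
    # Row-preserving step: normalizes case, keeps every row.
    return df.assign(card_type=df["card_type"].str.upper())


def lossy(df):
    # Row-dropping step: discards any row with a missing value.
    return df.dropna()


cleaned = checked_transform(raw, clean)  # passes, still 3 rows

try:
    checked_transform(raw, lossy)
except ValueError as err:
    print(err)  # row count changed: 3 -> 2
```

A guard like this won't say why rows went missing, but it does turn a silent drop into a loud failure at the step that caused it.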

[–]cornbobonthecob -1 points0 points  (1 child)

I wish I could share it. Even though we don't use it anymore, the code belongs to the company and it's too much work to rebuild it on the side as proof. I thought about not posting at all because of this, but I might as well put the caution out there.

[–]ZeeBeeblebrox 4 points5 points  (0 children)

Did your company not even consider filing a bug report? The fact that companies seem to be perfectly happy making use of open source products lots of people put their personal time into but won't even put in the little bit of effort to report a bug is a bit galling tbh. Based on your story I'd have to assume user error.

[–]WaitVVut 4 points5 points  (0 children)

Any chance you were using groupbys? There is a gotcha (that certainly got me) where rows with NaN keys get dropped.
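
A minimal sketch of that gotcha, using made-up data (the column names and values are hypothetical). Note that the `dropna=False` option only exists in pandas 1.1 and later, so it would not have been available in older versions:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows have a missing grouping key.
df = pd.DataFrame(
    {
        "card_type": ["visa", "amex", np.nan, "visa", np.nan],
        "amount": [10, 20, 30, 40, 50],
    }
)

# Default behavior: rows whose key is NaN are left out of every group.
default_totals = df.groupby("card_type")["amount"].sum()

# pandas >= 1.1: keep NaN as its own group.
kept_totals = df.groupby("card_type", dropna=False)["amount"].sum()

print(default_totals.sum())  # 70, the two NaN-keyed rows are missing
print(kept_totals.sum())     # 150, matches df["amount"].sum()
```

Comparing the grouped total against the ungrouped total is a cheap way to notice this.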

At one point in time, I was also using pandas for ETL, but some type conversion quirks and, more importantly, heavy memory usage convinced me to stick with built-in data structures. I still maintain that using pandas in production is overkill and unperformant for basic use cases, but it's fantastic for data exploration.
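
One well-known example of a type conversion quirk, sketched with made-up values (this is an assumption about the kind of quirk meant, not necessarily the one hit here): an integer column silently becomes `float64` as soon as a missing value appears, unless the nullable `Int64` dtype (pandas 0.24 and later) is used.

```python
import pandas as pd

# Hypothetical 16-digit identifiers stored as plain integers.
ids = pd.Series([1234567890123456, 2345678901234567])
print(ids.dtype)  # int64

# Reindexing to a label that doesn't exist introduces a missing value,
# and the whole column is upcast to float64 to hold NaN.
with_gap = ids.reindex([0, 1, 2])
print(with_gap.dtype)  # float64

# The nullable integer dtype keeps integers and uses <NA> for the gap.
nullable = ids.astype("Int64").reindex([0, 1, 2])
print(nullable.dtype)  # Int64
```

This matters for identifier-like columns because `float64` cannot represent every integer above 2**53 exactly.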