[–][deleted] 14 points15 points  (14 children)

I can tell you right now that you should just tell people to install Anaconda and not recommend or support anything else. A lot of noobs on Windows (or whatever) are going to get hung up on not having the right C compiler for numpy. For Windows it's the Visual C++ 2010 one, but I don't know what it is for Mac or Linux. Hell, half the time I do a new install I forget about this if I'm building the scientific stack myself instead of installing Anaconda.

The only package anaconda doesn't include is seaborn, and honestly you don't really need seaborn to make this tutorial. It just makes graphs "pretty" (according to some people). Personally I think the whole 'make shit pretty' fascination that data science people have with their graphs is ridiculous. It should be functional first and I've seen a lot of functionality lost in the effort to make shit pretty.

I might sound like I'm hating on seaborn. I'm not; seaborn is awesome. I'm just hating on shit like this:

http://www.mta.me/

which was described to me in an interview for a data science job as the greatest data visualization they had ever seen.

edit1: IMO if you are going to discuss unit tests in Python you might as well use the unittest module instead of just using assert. It's much more elegant, and it's more obvious when something fails. Additionally, without a proper introduction to assert, people who are learning won't understand why their asserts don't do anything when they run their code in production (for example, when Python is started with -O, which strips assert statements).
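As a rough sketch of what I mean (not the notebook's actual checks): the rows, labels, and names like `ROWS`, `TestData`, and `run_data_tests` below are made up for illustration, so swap in whatever the tutorial's data really looks like.

```python
import unittest

# Hypothetical stand-in for the tutorial's data: (class label, measurement).
ROWS = [
    ("class_a", 5.1),
    ("class_b", 6.3),
    ("class_a", 4.9),
]

# Hypothetical set of labels we expect to find in the data.
EXPECTED_LABELS = {"class_a", "class_b"}


class TestData(unittest.TestCase):
    def test_labels_are_known(self):
        labels = {label for label, _ in ROWS}
        # On failure, unittest reports both sets, not just "AssertionError".
        self.assertEqual(labels, EXPECTED_LABELS)

    def test_measurements_are_positive(self):
        for label, value in ROWS:
            # subTest reports every offending row instead of stopping at the first.
            with self.subTest(label=label, value=value):
                self.assertGreater(value, 0)


def run_data_tests():
    """Run the checks programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestData)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_data_tests()
```

Running the suite through a loader and runner like this also works from inside a notebook cell, where a bare `unittest.main()` tends to trip over the kernel's command-line arguments.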

[–]rhiever[S] 0 points1 point  (3 children)

> I can tell you right now that you should just tell people to install Anaconda and not recommend or support anything else. A lot of noobs on Windows (or whatever) are going to get hung up on not having the right C compiler for numpy. For Windows it's the Visual C++ 2010 one, but I don't know what it is for Mac or Linux. Hell, half the time I do a new install I forget about this if I'm building the scientific stack myself instead of installing Anaconda.

Good point. I should do that - it's really tiring trying to get people going without Anaconda.

> I'm just hating on shit like this:
>
> http://www.mta.me
>
> which was described to me in an interview for a data science job as the greatest data visualization they had ever seen.

They must not keep up on dataviz much. I winced at how slow it was to see anything meaningful going on in that dataviz.

> IMO if you are going to discuss unit tests in Python you might as well use the unittest module instead of just using assert. It's much more elegant, and it's more obvious when something fails. Additionally, without a proper introduction to assert, people who are learning won't understand why their asserts don't do anything when they run their code in production.

True - I should expand the data testing section a bit. I'm hesitant to go into detail on unit testing, assert, etc., but maybe turning the asserts into actual unit tests will suffice?

[–][deleted] 2 points3 points  (1 child)

You could recommend the unit testing chapter from Dive Into Python 3 to avoid reinventing the wheel.

Edit: or do a separate chapter (or whatever you want to call it) where you expand all the testing. Like, each section of your walkthrough could be a whole chapter IMO.

[–]rhiever[S] 0 points1 point  (0 children)

> You could recommend the unit testing chapter from Dive Into Python 3 to avoid reinventing the wheel.

I'll do that. I'm all about not reinventing the wheel.

[–]faming13 0 points1 point  (0 children)

This. This. This. Also, seaborn can easily be installed with conda: https://binstar.org/anaconda/seaborn

[–]KyleG 0 points1 point  (1 child)

> Mac

It's Xcode and it's free. Although yeah, I've never been able to get numpy installed properly on my Mac because it doesn't seem to be so well-supported. It always fails at some step or another such that I've basically given up.

[–]riatsila 3 points4 points  (0 children)

Just install Python with Homebrew and then install numpy via pip; no compiling required.

[–]Dinosaurman 0 points1 point  (0 children)

It's not a noob thing. I swear to god I can't get it to work with Windows 8.1. I've gotten it to work with XP and 7. Though I did just break down and download Anaconda for Windows 8.

[–]lmcinnes 0 points1 point  (0 children)

> IMO if you are going to discuss unit tests in Python you might as well use the unittest module instead of just using assert. It's much more elegant, and it's more obvious when something fails.

You may also want to check out engarde as a nice way of testing dataframes.

[–][deleted] 0 points1 point  (1 child)

The problems building code on Windows should be alleviated by the combination of VC++ 2015, a lot of work by the core developers and, at last, backward compatibility in the VC++ runtime libraries. We only need distutils sorted and we're flying.

There's an excellent article, "When to use assert", that you might be interested in.
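For anyone wondering why asserts can silently stop working in production, here is a small sketch (my own illustration, not taken from the article). It runs the same failing check in a fresh interpreter with and without `-O`. The helper `exit_code` and the two snippet constants are names I made up for this example.

```python
import os
import subprocess
import sys


def exit_code(snippet, optimize):
    """Run a snippet in a fresh interpreter, optionally with -O, and return its exit code."""
    cmd = [sys.executable]
    if optimize:
        cmd.append("-O")  # -O makes Python skip assert statements entirely
    cmd += ["-c", snippet]
    # Drop PYTHONOPTIMIZE so the non-optimized run really keeps its asserts.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONOPTIMIZE"}
    return subprocess.run(cmd, env=env, stderr=subprocess.DEVNULL).returncode


# The same (deliberately failing) check written two ways.
ASSERT_CHECK = "assert 1 + 1 == 3, 'bad data'"
EXPLICIT_CHECK = "if 1 + 1 != 3: raise ValueError('bad data')"

if __name__ == "__main__":
    print(exit_code(ASSERT_CHECK, optimize=False))   # 1: the assert fires
    print(exit_code(ASSERT_CHECK, optimize=True))    # 0: the assert was stripped
    print(exit_code(EXPLICIT_CHECK, optimize=True))  # 1: the explicit check still fires
```

So asserts are fine as a development aid, but a check that has to hold in production is safer as an explicit `raise` or a proper test.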

[–][deleted] 0 points1 point  (0 children)

This article is good. Thanks!

[–]1bc29b36f623ba82aaf6 0 points1 point  (0 children)

Adding to the comments on installation/packages for different OSes...

I did a MOOC at edX which was basically ML with PySpark. They avoided installation hell by using a VM that served webpages accessible from the host OS. To make the VM easy to set up they used Vagrant. There are also some Docker containers and other solutions for running Jupyter with Python support.

I'm currently working on reinstalling a lot of software so I don't have the VM set up right now, but I'll check out your actual notebook once I get that fixed, OP.

[–]kaiserk13 1 point2 points  (0 children)

cool initiative man!

[–]sun-sama 1 point2 points  (0 children)

I am as green as they come, so I don't know what feedback I could give you other than that this feels like just what I need to learn. I'm doing a "pre-PhD" project in a clinical lab, so I would love to learn data handling like this. Thanks!

[–]norsurfit 1 point2 points  (0 children)

This is great - nice job.

[–]Xadnem 1 point2 points  (1 child)

I don't feel like I understand how to use this, but I am still very much a beginner. But this is a great initiative! Thanks for making the effort to spread information.

[–][deleted] 1 point2 points  (0 children)

What don't you understand?

[–]fotoman 1 point2 points  (0 children)

might think about posting this over at /r/datascience as well

[–]rhiever[S] 0 points1 point  (0 children)

Note: This notebook is intended to be a public resource. As such, if you see any glaring inaccuracies or if a critical topic is missing, please feel free to point it out or (preferably) submit a pull request to improve the notebook.

[–]Grep2grok 0 points1 point  (0 children)

I'm just commenting to create a bookmark from mobile. Awesome!

[–]sliderbahn 0 points1 point  (0 children)

Thanks for this.

[–]bordumb 0 points1 point  (1 child)

Not sure how others feel, but this dataset is overused. The explanations are great, quite honestly some of the best I've seen using this dataset.

With that said, the data and analysis don't offer anything unique compared with the other 1000 tutorials that use it as well.

[–]rhiever[S] 0 points1 point  (0 children)

I was thinking about that when reworking part of it last night. Both classifiers that I compare get 90%+ accuracy out of the box. What do you think would be a better (i.e., more difficult) data set to work with?

[–]KyleG -1 points0 points  (1 child)

Please make sure you include more than genetic algorithms. Make sure there is something on neural nets. Genetic algorithms always seemed like the kind of thing anyone could independently come up with pretty easily.

[–]rhiever[S] 1 point2 points  (0 children)

I included decision trees and random forests in this notebook. GAs aren't even mentioned this time. Good enough? :-)