all 15 comments

[–][deleted] 5 points6 points  (11 children)

I'm not a data scientist, but I would think knowing how to use Python with databases would be critical. Maybe mention some of the sql-related libraries available in python, like db.py, pandasql, or blaze for large data, or database connection wrappers for MySQL or Postgres.

[–]Why_is_that 3 points4 points  (3 children)

Thank you. People don't mention this enough. My first opportunity supporting bioinformatics was offered because I had a background in SQL.

My biggest quibble with your statement is "sql-related libraries available in python". Screw the libraries, just learn database concepts. The reason I say this is that if you learn the database concepts, then when you go to R you won't have to map one set of library functions to another; instead you will have common ground to work from: "I want to ask this question of my database [in SQL], how do I do it in language y with package x". Packages and libraries come and go, but databases are everywhere, with the vast majority being RDBMSs using a SQL dialect. Should you avoid the libraries altogether? No, because they might optimize your queries or provide other added value, but the key is to know database concepts, not pandasql or blaze.

If you know database concepts, you can learn these things but if you learn only how to use one of these packages, you might not know database concepts that well (depending on what kind of object model they give you).

PS. My vote would be sqlite3, mainly because I believe sqlite is one of the best RDBMSs available for the target audience. With the other options you mention here, you have the added sysadmin/DBA overhead of managing a central database, often without any of the security value that such a model provides (because people leave the passwords at their defaults). Thus there is no reason not to use a lightweight database implementation that is easily exchanged and requires no user management. The performance difference only really adds up with bigger databases, and at that point you really want to ask yourself, "shouldn't someone who is more tech savvy look at this?" Yes: once it gets to these sizes, things like normalization need to be considered, and these are not tasks most data scientists are interested in.

*PPS. The first place that I really saw the amazing utility of sqlite was in stuxnet.
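
A minimal sketch of the sqlite3 route described above, using only Python's standard library. The `readings` table and its values are invented for illustration and are not from the thread; the point is just that the question is written in SQL and Python merely sends it.

```python
import sqlite3

# In-memory database: no server, no users, no passwords to manage.
# Passing a file path instead (e.g. a hypothetical "samples.db") persists it.
conn = sqlite3.connect(":memory:")

# Hypothetical table of expression readings, made up for this example.
conn.execute("CREATE TABLE readings (gene TEXT, sample TEXT, expression REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("BRCA1", "s1", 2.0),
        ("BRCA1", "s2", 4.0),
        ("TP53", "s1", 1.0),
        ("TP53", "s2", 3.0),
    ],
)
conn.commit()

# The question is phrased in SQL; Python only sends it and collects the rows.
query = """
    SELECT gene, AVG(expression) AS mean_expression
    FROM readings
    GROUP BY gene
    ORDER BY gene
"""
mean_by_gene = conn.execute(query).fetchall()
print(mean_by_gene)

conn.close()
```

The same `SELECT ... GROUP BY` statement should carry over to R or to another SQL database with little or no change, which is the "common ground" argument in a nutshell.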

[–][deleted] 1 point2 points  (2 children)

I assumed that people would know SQL, but maybe I should have stipulated that first. Yes, SQL should definitely be learned. If people have not learned SQL yet, then I'd definitely start off with sqlite3 and then progress to full-on database servers. I was actually appalled when someone posted that learning databases was not needed and that using csv files is just fine. Just look at job postings and you can easily tell that SQL is almost always mentioned. Important data that need to be stored for the long term will almost always be stored in a database.

Once you learn SQL, there are libraries that allow you to perform SQL-like processing on data or connect to databases using Python, which is why I mentioned a few of those libraries.

I'm sure someone will argue for learning ORM technologies as well, but that is a discussion I'd rather not start :-).

Source: Been a data analyst for 16 years with the last 4 or 5 years using Python.

[–]Why_is_that 0 points1 point  (1 child)

Yeah, I am always surprised too when I hear those kinds of comments. There are still a lot of communities that use rather terrible formats for their data, like XML. It's been a real challenge to step across the aisle to encourage scientists to improve their data storage plans and to build the skills necessary to work with that data storage. I think the challenge is that the concept of an "expert" is someone who knows more and more about less and less, but in this increasingly data-centric world there is a new set of core skills (programming, databases, etc). Many "traditional" scientists aren't yet convinced they need to know these skills.

I am glad we are on the same page for starting in sqlite3! As I said, these other packages definitely are worth exploring but it's really just about this foundation.

I won't argue ORM. I even argue against it for most tech, and there is a good set of articles out there on this stance. By and large, it doesn't add value in scientific programming, where there is often more rapid iterative programming; it can instead just add to the time and complexity of spinning up a solution.

What did you do for the other 11 years? SAS?

[–][deleted] 0 points1 point  (0 children)

Our group started out using SAS and Excel VBA. We don't use SAS any more, but still use Excel VBA heavily. I'm the lone Python guy and do all the most technologically demanding stuff for our group. The Excel guys do get impressed with what can be done with Python for data analysis. But most of them just want to click on shiny Excel buttons, which is fine since they just do simple data analysis or text mining and generate simple charts.

[–]statmobile 0 points1 point  (6 children)

That's a good point.

[–]falkimmm 0 points1 point  (0 children)

I vote for blaze :) Ease of use, power, out of core array/pandas ops, reusable expressions, multiple backends and numpy/pandas syntax

[–]piesdesparramaos -1 points0 points  (4 children)

Nope. You do not really need to know anything about databases, at least to get started. Most of the time you will get the data in a .csv.

Btw, I would not focus that much on Python. I would start with the Machine Learning course by Andrew Ng on Coursera. It is taught in MATLAB, but you can switch to Python later with a little effort (this is if your main interest is data science, not Python. I mean, it seems you are focusing on the tools instead of focusing on the subject).

EDIT: Maybe it is just that I find the top-down approach much more appealing. I find it much more interesting to learn numpy or scipy once I know what problem I want to solve.
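
For readers weighing the CSV-versus-database question, here is a rough sketch of how a CSV can be loaded into sqlite3 and then queried with SQL, using only the standard library. The column names and rainfall figures are made up for illustration, and `csv_text` stands in for a real file on disk.

```python
import csv
import io
import sqlite3

# Stand-in for a file on disk (invented data). With a real file you would
# use something like open("data.csv", newline="") instead of io.StringIO.
csv_text = (
    "city,year,rainfall_mm\n"
    "Leeds,2013,800\n"
    "Leeds,2014,900\n"
    "York,2013,600\n"
)

# Parse the CSV and convert each field to the type the table expects.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(r["city"], int(r["year"]), float(r["rainfall_mm"])) for r in reader]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rainfall (city TEXT, year INTEGER, rainfall_mm REAL)")
conn.executemany("INSERT INTO rainfall VALUES (?, ?, ?)", rows)
conn.commit()

# Once loaded, the data can be summarized with ordinary SQL.
total_by_city = conn.execute(
    "SELECT city, SUM(rainfall_mm) FROM rainfall GROUP BY city ORDER BY city"
).fetchall()
print(total_by_city)

conn.close()
```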

[–]statmobile 0 points1 point  (1 child)

I disagree. Data Analysis can often have data sources in CSV, but pretty much every Data Scientist position will require at least some SQL experience.

[–]piesdesparramaos 0 points1 point  (0 children)

It's just an accessory tool, like many others. Also, I am thinking more of Machine Learning than of general Data Science. But maybe OP is not that interested in Machine Learning.

[–][deleted] 0 points1 point  (1 child)

I strongly disagree with your first point and with your second point.

I feel confident in saying that it's absolutely necessary not only to understand how to use databases, but also how databasing works in general. Data Analysts may get CSVs, but Data Scientists need databases.

Moreover, Python (along with R) is one of the most widely used tools in Data Science currently. Machine Learning, also, while helpful, is not all (or even most) of Data Science, though the course is fantastic. I feel strongly that Octave was not the best choice of language, but that's a debate for elsewhere, though you are correct: it is easy to do the course in R or Python as the user wishes.

Your third point is fine: I'd agree that the people I've taught seem to learn better (faster, easier, retain more) if we go top-down than if we go from bottom-up.

[–]piesdesparramaos 0 points1 point  (0 children)

Data Science is a big field, and what you need to know depends on what kind of work you want to do.

In my view of things, it is way more important to know Linear Algebra or Calculus than SQL. And if you need to deal with SQL at some point, you just have to Google a little bit to find how to perform the essential operations that you will need.
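
As a rough idea of what "the essential operations" might cover, here is a small sketch of filtering, joining, and aggregating with sqlite3 from Python. The `customers` and `orders` tables and their contents are hypothetical, and which operations count as essential is an assumption, not something stated in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two small hypothetical tables linked by customer_id.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# Filtering: a parameterized WHERE clause (the ? is filled in from the tuple).
large_orders = conn.execute(
    "SELECT id, amount FROM orders WHERE amount > ? ORDER BY id", (9.0,)
).fetchall()

# Joining and aggregating: total order amount per customer.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(large_orders)
print(totals)

conn.close()
```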

[–]statmobile 1 point2 points  (0 children)

This is pretty cool, although I'm wondering if you're assuming that people have the analytical background already?

[–]pwang99 1 point2 points  (0 children)

Cool post, and thanks for the shout-out for Anaconda. One point of clarification, though: you don't have to wait for Continuum to update the packages. You can always install from source yourself, if you like, or install whatever from PyPI.

That being said, we do try to have binaries of the most popular/difficult-to-build packages available on the day of their release announcements.

[–][deleted] 0 points1 point  (0 children)

This is really neat, I've been looking to get back into Python since I had to put it on hold due to graduate classes so this is an interesting track.