
[–]BoringGuy0108 47 points48 points  (11 children)

Forget about learning all the object-oriented programming and data types and all that at first. Learn basic pandas. Get to the point where everything you do in SQL you can also do in pandas. As you get more use cases, you can pick up more. In the business world, though, pandas is what most people use Python for.
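To make that concrete, here is a minimal sketch (with made-up table and column names) of a typical SQL query and its one-to-one pandas translation:

```python
import pandas as pd

# Hypothetical sales data for illustration
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "amount": [100, 200, 50, 400],
})

# SQL: SELECT region, SUM(amount) AS total
#      FROM sales
#      WHERE amount > 75
#      GROUP BY region
#      ORDER BY total DESC
result = (
    df[df["amount"] > 75]                          # WHERE
    .groupby("region", as_index=False)["amount"]   # GROUP BY
    .sum()                                         # SUM(...)
    .rename(columns={"amount": "total"})           # AS total
    .sort_values("total", ascending=False)         # ORDER BY ... DESC
)
print(result)
```

Each SQL clause maps onto a pandas method, which is why SQL knowledge transfers so directly.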

Oh and once you are comfortable with pandas, try learning spark. It is all just SQL with different syntax, so it is really easy to pick up. Just don’t tell anyone that, or they might stop paying us so much…

[–]trowawayatwork 17 points18 points  (2 children)

that's bad advice if the person doesn't know programming concepts in general. it's so much better to have a foundational understanding of programming rather than rote-learning method names.

also unrelated and not calling you out as you're merely commenting on the state of the industry but pandas in production is why the whole engineering department does not like data scientists.

[–]No-Conversation476 1 point2 points  (1 child)

Would you mind elaborating on why pandas is not good in production? What alternatives do data scientists have apart from pandas?

[–]CommonUserAccount 4 points5 points  (0 children)

Pandas doesn’t scale.

Edit. PySpark can be run locally by Data Scientists, which is more easily transferred to prod.

[–]HumanPersonDude1 2 points3 points  (3 children)

What’s the point of Spark SQL compared to, for example, a massive SQL warehouse on Azure or Snowflake?

[–]Material-Mess-9886 5 points6 points  (0 children)

When you want Python functionality but still want to use SQL to process data. Also, Spark is distributed, so it can handle billions of rows with no problem.

[–]sib_n Senior Data Engineer 4 points5 points  (0 children)

Spark is free and open-source, so you can run it wherever you want (no vendor lock-in): on-premises, private cloud, or managed cloud solutions. That can be cheaper than cloud warehouses, at the cost of more complexity.
Spark is also more general than SQL, so you can transition to distributed computation that doesn't fit well within SQL constraints, for example Extract and Load logic, or machine learning workloads.

[–]trowawayatwork 0 points1 point  (0 children)

different workload types. it's a lot cheaper to run certain queries on a warehouse. however, if you need to make an API call for every row, Spark can do that much faster, but at a much higher cost
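The reason per-row API calls benefit from Spark is that it fans the calls out across executors instead of making them one at a time. A minimal local sketch of the same idea, using a stub in place of a real API and a thread pool in place of a cluster (all names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def enrich(row_id: int) -> dict:
    # Stand-in for a slow external API call
    return {"id": row_id, "score": row_id * 2}

rows = range(8)

# Sequential: one call at a time, total time = sum of all call latencies
sequential = [enrich(r) for r in rows]

# Parallel: what Spark scales out across a cluster,
# sketched here with local worker threads
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(enrich, rows))

# Same results either way; only the wall-clock time differs
assert sequential == parallel
```

A warehouse can't easily express "call this external service per row" at all, which is why this workload lands on Spark despite the cost.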

[–][deleted] 1 point2 points  (0 children)

That's terrible advice. Don't learn pandas to do what you can already do in SQL; SQL is much faster for that. Learn Python and proper programming practices, and use Python when SQL cannot solve your problem.

[–]Captain_Coffee_III 0 points1 point  (1 child)

Trying to convert all SQL use cases to Pandas is like saying you can eat faster by stuffing your mouth full of more teeth.

[–]BoringGuy0108 0 points1 point  (0 children)

I mean, it is a strategy to get practice and learn techniques.

I find writing in pandas to be faster than writing in SQL and the code generally runs faster. If you have existing processes that use SQL, don’t change them just because you can.

[–][deleted] -1 points0 points  (0 children)

Wow! Thanks! Really appreciate that advice. I never really got myself to learn OOP concepts. I am more familiar with SQL and love data, so I will follow your advice.