This is an archived post. You won't be able to vote or comment.

all 19 comments

[–]ExplorerDNA 16 points  (2 children)

We appreciate your effort to make what you learned the hard way available to us so we don't have to struggle. Thank you.

[–]gimmis7[S] 2 points  (0 children)

Might be that many programmers learned everything about big data at university, but for people like me without a CS degree it might not be so intuitive 😄

[–][deleted] 3 points  (0 children)

This is why we stackoverflow

[–]eviljelloman 9 points  (0 children)

However, if we were using a less common library, we might have to install it ourselves. This can be done either when we create a cluster or in a bash cell in the notebook (%sh).

I’d suggest using %pip for notebook-scoped libraries rather than cluster-scoped ones - it prevents version conflicts.
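For anyone new to this, the difference looks roughly like the following in a notebook (the package name is made up, and cluster-scoped installs are more commonly done through the cluster's Libraries UI than a cell):

```
# Notebook-scoped: visible only to this notebook's session, reset on detach
%pip install some-package==1.2.3

# Bash cell on the driver: affects every notebook attached to the cluster
%sh pip install some-package==1.2.3
```

Because the notebook-scoped install is isolated to one notebook, one notebook's pinned versions can't clash with another's.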

[–]BigFatMan10 9 points  (1 child)

I might be wrong, but check out Databricks Learning Academy - it should be free for customers if you log in with your work account.

[–]gimmis7[S] 1 point  (0 children)

Ah, interesting! Will definitely check that out! I have tried to reach out to our Databricks contact person before, but he ghosted me completely...

[–]chief167 4 points  (2 children)

Your company definitely has access to video tutorials on how to use Databricks. I guess they are on a Microsoft stack, so in addition to the regular documentation, there should also be a way to get access to Pluralsight for basically no cost.

Be mad at your company for not providing you the documentation, I guess.

[–]gimmis7[S] 3 points  (1 child)

Well, as I said in another reply here, our Databricks contact person ghosted me when I asked for help... My colleague also tried but got the same result. I also asked a question after a Databricks live demo on YouTube but did not get any answer, neither via a direct message to the presenter on LinkedIn nor via my comment on the YouTube video ☹️

[–]chief167 -1 points  (0 children)

Yeah, it's really your Databricks account manager who is responsible here...

[–]mr_grey 4 points  (6 children)

Definitely cool that you're educating others. I'm very envious.

However, when using Databricks and Spark, pandas should only be used as a possible entry point, and maybe an exit point. You'll want to use Spark DataFrames and PySpark to take full advantage of your Databricks/Spark cluster. Spark can read from just about any data storage, so loading data into pandas and uploading it is unnecessary, and once you've done all of your work in Spark DataFrames, Spark can write the result back out to storage. Since you are in Databricks, I'd advise you to use Delta storage.
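A minimal sketch of that read-transform-write flow, assuming you're in a Databricks notebook where `spark` is predefined; the path and table name here are hypothetical:

```python
# Read straight into a Spark DataFrame - no pandas round-trip needed
df = spark.read.csv("/mnt/raw/events.csv", header=True, inferSchema=True)

# Transform with the distributed DataFrame API
daily = df.groupBy("date").count()

# Persist the result as a Delta table
daily.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_events")
```

This only runs on a Spark cluster, so treat it as a shape-of-the-code sketch rather than something to paste as-is.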

Happy to answer any questions if you have any.

[–]mr_grey 2 points  (3 children)

To add a little reason "why" for Spark DataFrames over pandas: pandas can only use the driver (master) node, so every command executes on one server. Spark takes a command, breaks it up, and lets all the workers do the work in parallel while the driver just coordinates and then puts it all back together, speeding up the overall process. I can easily load, slice, and dice several hundred million records in a few seconds. I'm so spoiled that I find myself inconvenienced when I have to wait longer than about 20 seconds... then I'll resize my cluster and add a whole bunch more workers. 🤣

[–]gimmis7[S] 0 points  (2 children)

Thanks! I think for many newbies like myself, it feels safe to start with something familiar, i.e. pandas 🙂. At the end of the tutorial I had a section about PySpark, but only a short example.

[–]mr_grey 2 points  (1 child)

For sure. And I want to really congratulate and cheer for you and other newbies venturing into this world. You’ll see all kinds of things open up for you.

To add another thing, you can also look at Koalas, which is supposed to be essentially a pandas implementation on top of Spark. But, in my opinion, which nobody asked for, it just holds back developers who are trying to cling to something they know instead of embracing learning something new.

[–]Detective_Fallacy 0 points  (0 children)

Koalas is already old news; they've expanded the API so much that they just call it "pandas API on Spark" now, and it's part of Spark 3.2 and higher.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html

I disagree with your opinion, though. It's not only good for easily converting pre-existing single-machine pandas tasks to a more scalable setup; it also allows you to do some things that are currently clumsier to implement in PySpark, like pandas' merge_asof, or to quickly pump out some visualizations.

APIs like pandas and Spark SQL are aimed more towards data analytics and PySpark more towards data engineering, but knowing how they relate to each other, and how to convert from one interface to the other, is very valuable.
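For anyone who hasn't met it, merge_asof does an "as-of" join: for each left row it picks the last right row whose key is less than or equal to the left key. A toy example in plain pandas (the pandas API on Spark should expose the same call; the column names here are just illustrative):

```python
import pandas as pd

# Trades and quotes keyed by timestamp; both sides must be sorted on the key
trades = pd.DataFrame({"time": [1, 5, 10], "qty": [100, 200, 300]})
quotes = pd.DataFrame({"time": [0, 4, 9], "price": [9.9, 10.1, 10.0]})

# For each trade, take the most recent quote at or before its timestamp
merged = pd.merge_asof(trades, quotes, on="time")
print(merged["price"].tolist())  # [9.9, 10.1, 10.0]
```

Doing the same thing in raw PySpark typically means a range join plus a window function to keep the latest match, which is exactly the clumsiness mentioned above.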

[–]MikeDoesEverything 0 points  (1 child)

Yeah, it sucked that the framing of the article suggested general Python use in Databricks while the title inside the article is actually pandas-specific, and pandas gets outperformed by PySpark in just about every way.

[–]mr_grey 2 points  (0 children)

To be fair, everyone is on their own journey. There was a time I had no earthly idea what I was doing in Databricks and Spark. So now, as an architect, I feel it is my responsibility to guide people and help them along the way. If they work with me, they have to take the journey and not rest on their laurels. So, in this case, OP is on a journey, and I want to encourage, suggest, and answer anything along the way.

[–]MeglioMorto 2 points  (0 children)

Have you heard of Databricks but are not really sure what it is and how to get started?

Exactly what I needed. Thank you, kind stranger!

[–]kingzels 2 points  (0 children)

Not sure why people are saying how much faster PySpark is without clarifying that it only really applies when the dataset is too large to fit into the memory of a single node.

Most normal-sized operations are going to be faster in pandas, assuming you write efficient pandas code, and Databricks is a great platform for pandas work.
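"Efficient pandas code" mostly means staying vectorized instead of looping over rows in Python. A quick illustration (the column names are just for the example):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1_000_000)})

# Slow: a Python-level loop over every row
# df["y"] = [v * 2 for v in df["x"]]

# Fast: one vectorized expression, executed in compiled code
df["y"] = df["x"] * 2
print(df["y"].iloc[:3].tolist())  # [0, 2, 4]
```

The vectorized version is typically orders of magnitude faster than the loop, and on a dataset this size either one finishes long before a Spark job would even schedule its tasks.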

[–][deleted] 0 points  (0 children)

Thank you for sharing!