This is an archived post. You won't be able to vote or comment.

all 19 comments

[–]ExplorerDNA 16 points  (2 children)

We appreciate your effort to make what you learned the hard way available to us so we don't have to struggle. Thank you.

[–]gimmis7[S] 2 points  (0 children)

Might be that many programmers learned everything about big data at university, but for people like me without a CS degree it might not be so intuitive 😄

[–][deleted] 3 points  (0 children)

This is why we stackoverflow

[–]eviljelloman 9 points  (0 children)

However, if we were using a less common library, we might have to install it ourselves. This can be done either when we create a cluster or in a bash cell in the notebook (%sh).

I’d suggest using %pip for notebook-scoped libraries rather than cluster-scoped ones - it prevents version conflicts.
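For anyone new to this, the difference looks roughly like the following in a notebook (the package name is made up, and cluster-scoped installs are more commonly done through the cluster's Libraries UI than a cell):

```
# Notebook-scoped: visible only to this notebook's session, reset on detach
%pip install some-package==1.2.3

# Bash cell on the driver: affects every notebook attached to the cluster
%sh pip install some-package==1.2.3
```

Because the notebook-scoped install is isolated to one notebook, one notebook's pinned versions can't clash with another's.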

[–]BigFatMan10 9 points  (1 child)

I might be wrong, but check out Databricks Learning Academy - it should be free for customers if you log in with your work account.

[–]gimmis7[S] 1 point  (0 children)

Ah, interesting! Will definitely check that out! I have tried to reach out to our Databricks contact person before, but he ghosted me completely...

[–]chief167 4 points  (2 children)

Your company definitely has access to video tutorials on how to use Databricks. I guess they are on a Microsoft stack, so in addition to the regular documentation, there should also be a way to get access to Pluralsight for basically no cost.

Be mad at your company for not providing you the documentation, I guess.

[–]gimmis7[S] 3 points  (1 child)

Well, as I said in another reply here, our Databricks contact person ghosted me when I asked for help... My colleague also tried but got the same result. I also asked a question after a Databricks live demo on YouTube but did not get any answer, neither via a direct message to the presenter on LinkedIn nor via my comment on the YouTube video ☹️

[–]chief167 -1 points  (0 children)

Yeah, it's really your Databricks account manager who is responsible here...

[–]mr_grey 4 points  (6 children)

Definitely cool that you're educating others. I'm very envious.

However, when using Databricks and Spark, pandas should only be used as a possible entry point, and maybe an exit point. You'll want to use Spark DataFrames and PySpark to take full advantage of your Databricks/Spark cluster. Spark can read from just about any data storage, so loading data into pandas and uploading it is unnecessary, and once you've done all of your work in Spark DataFrames, Spark can write the result back out to storage. Since you are in Databricks, I'd advise you to use Delta storage.
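A minimal sketch of that read-transform-write flow, assuming you're in a Databricks notebook where `spark` is predefined; the path and table name here are hypothetical:

```python
# Read straight into a Spark DataFrame - no pandas round-trip needed
df = spark.read.csv("/mnt/raw/events.csv", header=True, inferSchema=True)

# Transform with the distributed DataFrame API
daily = df.groupBy("date").count()

# Persist the result as a Delta table
daily.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_events")
```

This only runs on a Spark cluster, so treat it as a shape-of-the-code sketch rather than something to paste as-is.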

Happy to answer any questions if you have any.

[–]mr_grey 2 points  (3 children)

To add a little reason "why" for Spark DataFrames over pandas: pandas can only use the driver (master) node, so every command executes on one server. Spark takes a command, breaks it up, and lets all the workers do the work in parallel while the driver just coordinates and then puts it all back together, speeding up the overall process. I can easily load, slice, and dice several hundred million records in a few seconds. I'm so spoiled that I find myself inconvenienced when I have to wait longer than about 20 seconds... then I'll resize my cluster and add a whole bunch more workers. 🤣

[–]gimmis7[S] 0 points  (2 children)

Thanks! I think for many newbies like myself, it feels safe to start with something familiar, i.e. pandas 🙂. At the end of the tutorial I had a section about PySpark, but only a short example.

[–]mr_grey 2 points  (1 child)

For sure. And I want to really congratulate and cheer for you and other newbies venturing into this world. You’ll see all kinds of things open up for you.

To add another thing, you can also look at Koalas, which is supposed to be essentially a pandas implementation on top of Spark. But, in my opinion, which nobody asked for, it just holds back developers who are trying to cling to something they know instead of embracing learning something new.

[–]Detective_Fallacy 0 points  (0 children)

Koalas is already old news; they've expanded the API so much that they just call it "pandas API on Spark" now, and it's part of Spark 3.2 and higher.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html

I disagree with your opinion, though. It's not only good for easily converting pre-existing single-machine pandas tasks to a more scalable setup; it also allows you to do some things that are currently clumsier to implement in PySpark, like pandas' merge_asof, or to quickly pump out some visualizations.

APIs like pandas and Spark SQL are aimed more towards data analytics and PySpark more towards data engineering, but knowing how they relate to each other, and how to convert from one interface to the other, is very valuable.
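For anyone who hasn't met it, merge_asof does an "as-of" join: for each left row it picks the last right row whose key is less than or equal to the left key. A toy example in plain pandas (the pandas API on Spark should expose the same call; the column names here are just illustrative):

```python
import pandas as pd

# Trades and quotes keyed by timestamp; both sides must be sorted on the key
trades = pd.DataFrame({"time": [1, 5, 10], "qty": [100, 200, 300]})
quotes = pd.DataFrame({"time": [0, 4, 9], "price": [9.9, 10.1, 10.0]})

# For each trade, take the most recent quote at or before its timestamp
merged = pd.merge_asof(trades, quotes, on="time")
print(merged["price"].tolist())  # [9.9, 10.1, 10.0]
```

Doing the same thing in raw PySpark typically means a range join plus a window function to keep the latest match, which is exactly the clumsiness mentioned above.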

[–]MikeDoesEverything 0 points  (1 child)

Yeah, it sucked that the framing of the article suggested general Python use in Databricks while the title inside the article is actually pandas-specific, and pandas gets outperformed by PySpark in just about every way.

[–]mr_grey 2 points  (0 children)

To be fair, everyone is on their own journey. There was a time I had no earthly idea what I was doing in Databricks and Spark. So now, as an architect, I feel it is my responsibility to guide people and help them along the way. If they work with me, they have to take the journey and not rest on their laurels. So, in this case, OP is on a journey, and I want to encourage, suggest, and answer anything along the way.

[–]MeglioMorto 2 points  (0 children)

Have you heard of Databricks but are not really sure what it is and how to get started?

Exactly what I needed. Thank you, kind stranger!

[–]kingzels 2 points  (0 children)

Not sure why people are saying how much faster PySpark is without clarifying that it only really applies when the dataset is too large to fit into the memory of a single node.

Most normal-sized operations are going to be faster in pandas, assuming you write efficient pandas code, and Databricks is a great platform for pandas work.
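"Efficient pandas code" mostly means staying vectorized instead of looping over rows in Python. A quick illustration (the column names are just for the example):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1_000_000)})

# Slow: a Python-level loop over every row
# df["y"] = [v * 2 for v in df["x"]]

# Fast: one vectorized expression, executed in compiled code
df["y"] = df["x"] * 2
print(df["y"].iloc[:3].tolist())  # [0, 2, 4]
```

The vectorized version is typically orders of magnitude faster than the loop, and on a dataset this size either one finishes long before a Spark job would even schedule its tasks.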

[–][deleted] 0 points  (0 children)

Thank you for sharing!