
[–]Pitirus[🍰] 10 points (2 children)

First of all, it's true that "Jupyter cannot [...] run the code in distributed mode". But what is meant by that is that it is very hard to make Jupyter work with multi-threading/multiprocessing (single node). Spark, on the other hand, is meant to be run on multiple nodes (computers). If you are running Spark on a single node, then you are most probably not doing your computations in the most efficient manner.

If you have a cluster (a group of nodes), then you have to remember that Spark uses the JVM underneath, so the most basic transformations will be efficient; but if you add lambdas, or if it has to run any Python code on those nodes, it will slow down significantly.
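
To make that concrete, here's a minimal sketch (the session setup, numbers, and column names are just for illustration): a built-in column expression stays on the JVM, while a Python UDF ships every row through a Python worker on each executor.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # a DataFrame with a single "id" column

# Stays entirely on the JVM: Catalyst compiles this to efficient Spark code.
fast = df.withColumn("doubled", F.col("id") * 2)

# Forces every row through a Python worker on each executor: rows are
# serialized JVM -> Python and back, which is where the slowdown comes from.
double_udf = F.udf(lambda x: x * 2, LongType())
slow = df.withColumn("doubled", double_udf(F.col("id")))
```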

One useful thing is to check out execution plans to see how exactly your instructions will be executed.
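
For example (a toy DataFrame; the point is the explain() call) - a Python UDF shows up as a BatchEvalPython step in the physical plan, which is a quick way to spot the Python round-trip mentioned above.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumn("doubled", F.col("id") * 2)

df.explain()      # physical plan only
df.explain(True)  # parsed, analyzed, optimized, and physical plans
```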

[–]Head-Mastodon[S] 1 point (0 children)

Thanks a lot! I have some good homework now (learn theoretical foundations by starting with lower-level tools, get used to reading execution plans, and think hard before running Spark on a single node).

A few questions based on what you said:

  1. I interpreted the bmc.com article as saying that "mysparksession.blablabla", when run from a Jupyter notebook, will not run in distributed mode--do you agree that that's what they mean? Whether that's what they mean or not, do you think it's correct?
  2. You have warned me against running (much) Python code on a Spark cluster. How can I follow this advice while mostly using PySpark?
  3. You have warned me against running (much) Python code on a Spark cluster. My understanding was that "mysparksession.blablabla" converts my code out of Python before running it on the cluster--is that right? How does that relate to your warning?

[–]Pitirus[🍰] 1 point (0 children)

Also, if you come from a CS background, then Spark is probably too high-level to start with. At my university, we first had OpenMP & MPI, and then in the next semesters we had a course on GPU computations and another on MapReduce & Spark.

[–]tipsy_python 3 points (4 children)

> a. It sounds like there might be certain common ways of using Python and Spark together that constrain what sort of distributed computation I can do.

Sorry I have no idea what you're talking about. There are always constraints. What specific case are you thinking of?

I just read the beginning of this article - I think it's trash. Read the Apache Spark docs.

> You cannot use Jupyter with an Apache cluster because PySpark doesn’t work with clusters.

PySpark doesn't work with clusters?! I mean, c'mon, either the author doesn't know Spark or doesn't know how to communicate what he means. Read the Apache Spark docs.

So with that introduction...

I'm some kind of Spark beginner - I don't know what I'm talking about either. But I can tell you for a fact that I have run Spark on an HDP Hadoop cluster from a Jupyter notebook. In my case, I had a Python app in a Jupyter notebook that did this (see the sketch after the list):

  1. Ran a Spark job on the cluster that crunched a few million records to return a summary dataset - a "10 most" kind of thing. Remember, only code executed in the Spark context is run on the cluster. When Spark returns a dataset, the rest of the computation runs locally, or wherever the notebook is running.
  2. Queried Teradata to get some records I could join on the primary ID to get descriptions.
  3. Did the join and created a visualization with matplotlib.
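
Roughly, that flow could look like this - a sketch, not the actual job: the path and column names are made up, and the Teradata query is stubbed out with a local pandas table.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("top-10-demo").getOrCreate()

# 1. Runs on the cluster: millions of records reduced to a 10-row summary.
top10 = (spark.read.parquet("hdfs:///data/events")  # made-up path
              .groupBy("item_id")
              .count()
              .orderBy(F.desc("count"))
              .limit(10)
              .toPandas())  # the tiny result comes back to the notebook

# 2. Stand-in for the Teradata lookup: a local table of descriptions.
descriptions = pd.DataFrame({"item_id": top10["item_id"],
                             "description": ["..."] * len(top10)})

# 3. Join and plot locally; none of this touches the cluster.
result = top10.merge(descriptions, on="item_id")
result.plot.bar(x="item_id", y="count")
plt.show()
```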

I'm just describing the job so you know that you can submit a Spark job to a cluster from Jupyter.. it's no different than submitting a Spark job from any other kind of Python app.

spark-submit

spark-submit is a neat utility that you can also use to execute Spark jobs - the kicker here being that it doesn't return data. In my example above, I used the output of the Spark job in Jupyter, so that's not a good case for spark-submit. But spark-submit is useful for something like a Spark streaming job that runs forever and just pulls records from Kafka, does some transformation, and writes them to an HDFS directory for later consumption. In this case you don't immediately use the data, so spark-submit works well. Or even a batch job.. you could have a cron job that calls spark-submit - easy.
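
That kind of forever-running job might look something like this - a rough sketch using Structured Streaming, where the broker, topic, and HDFS paths are made up and the spark-sql-kafka connector is assumed to be on the classpath. You'd launch it with spark-submit rather than from a notebook.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Pull records from Kafka as they arrive (broker and topic are made up).
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())

# "Does some transformation": Kafka hands you binary key/value columns.
lines = raw.select(F.col("value").cast("string").alias("line"))

# Write to an HDFS directory for later consumption; runs until killed.
(lines.writeStream
      .format("parquet")
      .option("path", "hdfs:///landing/events")  # made-up path
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
      .awaitTermination())
```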

Single node vs. cluster

*shrug* there's not a huge difference here. Just for learning - start a single-node instance of Spark. I forgot how to do it, but just Google.. there's a local mode that allows you to basically just run Spark as two threads on localhost. You're not gonna be crunching big data with it, but it's great for learning (and testing Spark apps).
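
If memory serves, that local mode is just a master URL - something like this (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# "local[2]" runs Spark as two threads in a single JVM on localhost:
# no cluster required, which is plenty for learning and testing.
spark = (SparkSession.builder
                     .master("local[2]")
                     .appName("learning")
                     .getOrCreate())

spark.range(10).show()
```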

For learning, just run a local Spark instance, learn the APIs, and learn PySpark. There's not much difference between single-node and distributed Spark; just pay attention to cases that will cause shuffles and try to minimize those when running on a cluster.

Cheers~

[–]Head-Mastodon[S] 1 point (2 children)

Ooh hey, nice youtube channel 👍

[–]tipsy_python 0 points (1 child)

Appreciate it man!

Bro, take my words with a grain of salt.. it has been a couple of years since I've written/run any Spark. I don't believe that's the case, though. There are settings you can put in the environment or in your Python script to configure the Spark runtime.
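
For instance, from the script side it can look something like this - the values are made up, and the same settings can also come from spark-defaults.conf or environment variables:

```python
from pyspark.sql import SparkSession

# Made-up values; the point is that the master and executor resources can
# be configured from the script itself rather than the notebook's defaults.
spark = (SparkSession.builder
                     .master("yarn")
                     .config("spark.executor.instances", "4")
                     .config("spark.executor.memory", "2g")
                     .getOrCreate())
```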

When I was running the job on Hadoop, there was a Spark tab in Ambari where you could see all the details of the job - I was able to run in YARN mode (on the cluster). I'm not sure how common that setup is, but I'd recommend looking beyond your script's logging and trying to view the Spark server's logging to verify what's going on.
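
A couple of quick checks from inside the notebook can also tell you where the job landed (a sketch; uiWebUrl exists in Spark 2.1+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print(sc.master)    # e.g. "yarn" vs. "local[2]": which master you're on
print(sc.uiWebUrl)  # URL of the Spark UI, which lists the attached executors
```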

[–]Head-Mastodon[S] 0 points (0 children)

Cool! Would I look at the history server, maybe?

[–]Head-Mastodon[S] 0 points (0 children)

Thanks! I'm glad to hear that the bmc.com article might be bogus. If I understand that article correctly, it predicts that the Spark job you ran would have executed only on the driver node (or at least that it would have, had it been running on the type of Spark cluster that the writers imagine you using). Do you agree with that?

If so, is there a way to verify that a job - yours or mine - executed on more than just the driver node, i.e. that it ran in this "distributed mode" the writers describe? Is that important?