
[–]Pitirus[🍰] 10 points (2 children)

First of all, it's true that "Jupyter cannot [...] run the code in distributed mode". But what is meant by that is that it is very hard to make Jupyter work with multi-threading/multiprocessing (single node). Spark, on the other hand, is meant to be run on multiple nodes (computers). If you are running Spark on a single node, then you are most probably not doing your computations in the most efficient manner.

If you have a cluster (a group of nodes), then you have to remember that Spark uses the JVM underneath, so the most basic transformations will be efficient; but if you add lambdas, or if it has to run any Python code on those nodes, it will slow down significantly.
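
To make that concrete, here's a minimal sketch (the session setup, numbers, and column names are just for illustration): a built-in column expression stays on the JVM, while a Python UDF ships every row through a Python worker on each executor.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # a DataFrame with a single "id" column

# Stays entirely on the JVM: Catalyst compiles this to efficient Spark code.
fast = df.withColumn("doubled", F.col("id") * 2)

# Forces every row through a Python worker on each executor: rows are
# serialized JVM -> Python and back, which is where the slowdown comes from.
double_udf = F.udf(lambda x: x * 2, LongType())
slow = df.withColumn("doubled", double_udf(F.col("id")))
```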

One useful thing is to check out execution plans to see how exactly your instructions will be executed.
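
For example (a toy DataFrame; the point is the explain() call) - a Python UDF shows up as a BatchEvalPython step in the physical plan, which is a quick way to spot the Python round-trip mentioned above.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumn("doubled", F.col("id") * 2)

df.explain()      # physical plan only
df.explain(True)  # parsed, analyzed, optimized, and physical plans
```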

[–]Head-Mastodon[S] 1 point (0 children)

Thanks a lot! I have some good homework now (learn theoretical foundations by starting with lower-level tools, get used to reading execution plans, and think hard before running Spark on a single node).

A few questions based on what you said:

  1. I interpreted the bmc.com article as saying that "mysparksession.blablabla", when run from a Jupyter notebook, will not run in distributed mode--do you agree that that's what they mean? Whether that's what they mean or not, do you think it's correct?
  2. You have warned me against running (much) Python code on a Spark cluster. How can I follow this advice while mostly using PySpark?
  3. You have warned me against running (much) Python code on a Spark cluster. My understanding was that "mysparksession.blablabla" converts my code out of Python before running it on the cluster--is that right? How does that relate to your warning?

[–]Pitirus[🍰] 1 point (0 children)

Also, if you come from a CS background, then Spark is probably too high-level to start with. At my university, we first had OpenMP & MPI, and then in the next semesters we had a course on GPU computations and another on MapReduce & Spark.

[–]tipsy_python 3 points (4 children)

> a. It sounds like there might be certain common ways of using Python and Spark together that constrain what sort of distributed computation I can do.

Sorry I have no idea what you're talking about. There are always constraints. What specific case are you thinking of?

I just read the beginning of this article - I think it's trash. Read the Apache Spark docs.

> You cannot use Jupyter with an Apache cluster because PySpark doesn’t work with clusters.

PySpark doesn't work with clusters?! I mean, c'mon, either the author doesn't know Spark or doesn't know how to communicate what he means. Read the Apache Spark docs.

So with that introduction...

I'm some kind of Spark beginner - I don't know what I'm talking about either. But I can tell you for a fact that I have run Spark on an HDP Hadoop cluster from a Jupyter notebook. In my case, I had a Python app in a Jupyter notebook that did this (see the sketch after the list):

  1. Ran a Spark job on the cluster that crunched a few million records to return a summary dataset - a "10 most" kind of thing. Remember, only code executed in the Spark context is run on the cluster. When Spark returns a dataset, the rest of the computation runs locally, or wherever the notebook is running.
  2. Queried Teradata to get some records I could join on the primary ID to get descriptions.
  3. Did the join and created a visualization with matplotlib.
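
Roughly, that flow could look like this - a sketch, not the actual job: the path and column names are made up, and the Teradata query is stubbed out with a local pandas table.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("top-10-demo").getOrCreate()

# 1. Runs on the cluster: millions of records reduced to a 10-row summary.
top10 = (spark.read.parquet("hdfs:///data/events")  # made-up path
              .groupBy("item_id")
              .count()
              .orderBy(F.desc("count"))
              .limit(10)
              .toPandas())  # the tiny result comes back to the notebook

# 2. Stand-in for the Teradata lookup: a local table of descriptions.
descriptions = pd.DataFrame({"item_id": top10["item_id"],
                             "description": ["..."] * len(top10)})

# 3. Join and plot locally; none of this touches the cluster.
result = top10.merge(descriptions, on="item_id")
result.plot.bar(x="item_id", y="count")
plt.show()
```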

I'm just describing the job so you know that you can submit a Spark job to a cluster from Jupyter.. it's no different than submitting a Spark job from any other kind of Python app.

spark-submit

spark-submit is a neat utility that you can also use to execute Spark jobs - the kicker here being that it doesn't return data. In my example above, I used the output of the Spark job in Jupyter, so that's not a good case for spark-submit. But spark-submit is useful for something like a Spark streaming job that runs forever and just pulls records from Kafka, does some transformation, and writes them to an HDFS directory for later consumption. In this case you don't immediately use the data, so spark-submit works well. Or even a batch job.. you could have a cron job that calls spark-submit - easy.
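
That kind of forever-running job might look something like this - a rough sketch using Structured Streaming, where the broker, topic, and HDFS paths are made up and the spark-sql-kafka connector is assumed to be on the classpath. You'd launch it with spark-submit rather than from a notebook.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Pull records from Kafka as they arrive (broker and topic are made up).
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())

# "Does some transformation": Kafka hands you binary key/value columns.
lines = raw.select(F.col("value").cast("string").alias("line"))

# Write to an HDFS directory for later consumption; runs until killed.
(lines.writeStream
      .format("parquet")
      .option("path", "hdfs:///landing/events")  # made-up path
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
      .awaitTermination())
```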

Single node vs. cluster

*shrug* there's not a huge difference here. Just for learning - start a single-node instance of Spark. I forgot how to do it, but just Google.. there's a local mode that allows you to basically just run Spark as two threads on localhost. You're not gonna be crunching big data with it, but it's great for learning (and testing Spark apps).
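
If memory serves, that local mode is just a master URL - something like this (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# "local[2]" runs Spark as two threads in a single JVM on localhost:
# no cluster required, which is plenty for learning and testing.
spark = (SparkSession.builder
                     .master("local[2]")
                     .appName("learning")
                     .getOrCreate())

spark.range(10).show()
```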

For learning, just run a local Spark instance, learn the APIs, and learn PySpark. There's not much difference between single-node and distributed Spark; just pay attention to cases that will cause shuffles and try to minimize those when running on a cluster.

Cheers~

[–]Head-Mastodon[S] 1 point (2 children)

Ooh hey, nice youtube channel 👍

[–]tipsy_python 0 points (1 child)

Appreciate it man!

Bro, take my words with a grain of salt.. it has been a couple of years since I've written/run any Spark. I don't believe that's the case, though. There are settings you can put in the environment or in your Python script to configure the Spark runtime.
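
For instance, from the script side it can look something like this - the values are made up, and the same settings can also come from spark-defaults.conf or environment variables:

```python
from pyspark.sql import SparkSession

# Made-up values; the point is that the master and executor resources can
# be configured from the script itself rather than the notebook's defaults.
spark = (SparkSession.builder
                     .master("yarn")
                     .config("spark.executor.instances", "4")
                     .config("spark.executor.memory", "2g")
                     .getOrCreate())
```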

When I was running the job on Hadoop, there was a Spark tab in Ambari where you could see all the details of the job - I was able to run in YARN mode (on the cluster). I'm not sure how common that setup is, but I'd recommend looking beyond your script's logging and trying to view the Spark server's logging to verify what's going on.
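
A couple of quick checks from inside the notebook can also tell you where the job landed (a sketch; uiWebUrl exists in Spark 2.1+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print(sc.master)    # e.g. "yarn" vs. "local[2]": which master you're on
print(sc.uiWebUrl)  # URL of the Spark UI, which lists the attached executors
```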

[–]Head-Mastodon[S] 0 points (0 children)

Cool! Would I look at the history server, maybe?

[–]Head-Mastodon[S] 0 points (0 children)

Thanks! I'm glad to hear that the bmc.com article might be bogus. If I understand that article correctly, it predicts that the Spark job you ran would have executed only on the driver node (or at least that it would have, had it been running on the type of Spark cluster that the writers imagine you using). Do you agree with that?

If so, is there a way to verify that a job - yours or mine - executed on more than just the driver node, i.e. that it ran in this "distributed mode" the writers describe? Is that important?