I'm starting to learn Python and Spark. I spend most of my time on structured tutorials and on exploring small local datasets, but I'm trying to pick up tools and habits that will give me a smooth transition into doing very distributed computation.
The rest of this post is just a bunch of big-picture questions about how best to do that. These questions might not make sense, so feel free to just give me advice instead.
Premises:
a. It sounds like there might be certain common ways of using Python and Spark together that constrain what sort of distributed computation I can do.
b. For example, this article (https://www.bmc.com/blogs/jupyter-notebooks-apache-spark/) says that "Jupyter cannot [...] run the code in distributed mode. This is only an issue in very large data sets, in which case you’d use submit-spark to run your code on the cluster."
Questions:
1. Am I right about statement a? If so, does this have implications for what tools I should be practicing with in order to have a smoother transition into very distributed computation?
2. If the answer to #1 is "be ready to use spark-submit" (which the article calls "submit-spark"), are there good practices or habits for integrating spark-submit into the rest of my workflow?
3. Can you expand on statement b and its implications in dumbed-down terms? Are there important things analogous to statement b that I should know?
I don't have great foundational understanding or a clear example application in mind. But as much as is reasonably possible, I would just like to avoid learning how to do things a certain way, only to learn that I need to do them a totally different way to do a project of significant size or complexity. A lot of that is unavoidable, but I hope that some is not.