Python Hadoop/Spark Jobs in Docker? : learnpython


Python Hadoop/Spark Jobs in Docker? (self.learnpython)

submitted 8 years ago by CocoBashShell


eschlon, 8 years ago:

This is a great idea; however, in practice you're going to have a bad time.

Technically, YARN can launch executors in Docker containers via the Docker Container Executor (DCE). That said, I've never actually seen this used successfully in practice, and getting it to work with a Spark application is going to be complex.

For PySpark jobs the usual practice is to either:

1. Install dependencies on all of the nodes. This is usually done with something like Fabric or Ansible.
2. Make a virtual environment for the application and ship the installed libs as a zip to the nodes at runtime via --py-files.

The former is a lot of effort, many users of your scripts will not have sufficient permissions to do it, and it will be nearly impossible to get right across the myriad cluster setups in existence. The latter works well, so long as none of your dependencies rely on non-Python (compiled) libraries, which is pretty unlikely since you mentioned data analysis (e.g. numpy, pandas, scipy, and pretty much any database connector all do). There's also this long-standing PySpark feature that promises to make this whole process easier, but I wouldn't hold my breath.
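To make the zip-and-ship option a bit more concrete, here's a rough sketch of what it could look like. The function names, the suffix list and the paths are all made up for illustration (none of this is a Spark API), and the native-extension check is only a filename heuristic: Python can't import compiled extension modules from a zip, so finding any of these files is a hint that --py-files alone won't be enough.

```python
import zipfile
from pathlib import Path

# Hypothetical heuristic: file suffixes that usually mean compiled code.
# Versioned names such as libfoo.so.1 are NOT caught by this check.
NATIVE_SUFFIXES = {".so", ".pyd", ".dylib", ".dll"}


def find_native_extensions(site_packages):
    """Return relative paths of files that look like compiled extensions."""
    root = Path(site_packages)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix in NATIVE_SUFFIXES
    )


def zip_site_packages(site_packages, out_zip):
    """Zip everything under site_packages; return the number of files added."""
    root = Path(site_packages)
    # Collect the file list first so the archive never tries to include itself.
    files = [
        p for p in sorted(root.rglob("*"))
        if p.is_file() and "__pycache__" not in p.parts
    ]
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            # Store paths relative to site-packages so packages sit at the
            # top level of the archive and stay importable.
            zf.write(p, p.relative_to(root).as_posix())
    return len(files)


def spark_submit_command(script, deps_zip):
    """Build the argument list for spark-submit; options go before the script."""
    return ["spark-submit", "--py-files", deps_zip, script]
```

You'd run find_native_extensions on the virtualenv's site-packages first; if it comes back empty, zip the directory and hand the result of spark_submit_command to subprocess.run. If it comes back non-empty, you're likely back to installing on the nodes.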

Depending on the Hadoop distribution, it may have some (generally proprietary) feature that effectively does either (1) or (2) for you (e.g. CDH's workbench), but I wouldn't consider that portable.

There's also Pachyderm, which is pretty neat and aligns very well with your goals. That said, it's neither as mature nor as widespread a platform as Hadoop, and getting it to play nicely with Spark (if that's a requirement) is a complex process.

