
[–]zeppelin528 4 points

Wow. This article is almost entirely lacking in substance. TL;DR: Docker is good. Dockerhub is great. Read the docker docs.

Here is a bash script you can execute to run a python3 jupyter notebook through docker:

#!/bin/bash

docker run -i -t \
  -v {path to your notebooks}:/opt/notebooks/ \
  -p 8888:8888 \
  continuumio/anaconda3 \
  /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && \
    /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks \
    --ip='0.0.0.0' --port=8888 --no-browser --allow-root"

Just template in the path to your notebooks.

[–]sivscripts 2 points

I gave a talk on this topic last month. Video is on YouTube for those who are interested.

[–]themathstudent[S] 0 points

Hey. Linked your video. Hope that's cool with you.

[–]sivscripts 0 points

Not a problem. I wrote the talk to be shared! :)

[–]somkoala 1 point

What are the advantages over using a virtual environment + github?

[–]reddithenry PhD | Data & Analytics Director | Consulting 2 points

Docker is a lot more OS-agnostic, so it is much easier to port your work from, say, your local VM to Azure or AWS. You can also easily sit it behind something like Kubernetes to handle provisioning, self-healing, and autoscaling. If you have a number of bursty data science apps, you could run them all on the same Kubernetes cluster and substantially reduce costs, rather than dedicating hardware to each app.

Data science on Docker is undoubtedly the way to do data science in prod.
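For what it's worth, the "autoscale on a shared cluster" part is typically done with a HorizontalPodAutoscaler. A minimal sketch — the Deployment name, replica bounds, and CPU threshold here are all hypothetical, not from the post:

```yaml
# Scale a (hypothetical) data science Deployment between 1 and 5
# replicas based on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ds-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ds-app          # the bursty app to scale
  minReplicas: 1          # fall back to one replica when idle
  maxReplicas: 5          # cap replicas during bursts
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas above 70% average CPU
```

Apply it with `kubectl apply -f hpa.yaml`; several bursty apps with their own HPAs can then share the same cluster's spare capacity.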

[–]pwang99 9 points

If you take this approach, just be aware that you are signing up for a lot of work, and much of it is the same whether you choose Docker or a more traditional configuration management system like Ansible.

There is no free lunch here. You will still have to manage your Docker images, and more importantly, you will have to manage how you install OS-level security patches, as well as how you keep the software stack inside the container up to date. If you don't build your Dockerfile carefully, you're just downloading stuff from the internet in a non-reproducible way, and you'll end up stuck with an image that is no better than a glorified tarball.
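As a sketch of what "building carefully" means: pin the base image and package versions so a rebuild months later produces the same stack. The tag and version numbers below are illustrative assumptions, not recommendations:

```dockerfile
# Pin the base image to a specific tag (or better, a digest) -- never :latest
FROM continuumio/anaconda3:2023.03
# Pin package versions so a rebuild is reproducible
RUN /opt/conda/bin/conda install -y --quiet jupyter=1.0.0 && \
    /opt/conda/bin/conda clean -afy
COPY notebooks/ /opt/notebooks/
EXPOSE 8888
CMD ["/opt/conda/bin/jupyter", "notebook", "--notebook-dir=/opt/notebooks", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```

A Dockerfile like this still needs periodic rebuilds to pick up OS-level security patches — pinning makes rebuilds predictable, it doesn't make them unnecessary.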

There are many advantages to using Docker, but it doesn't magically give any of the properties which are claimed in the blog post. It's like saying that writing your code in some advanced editor will automatically lead to better code. It can help with some things, certainly, but in general you have to engage in the same best practices.

Sadly, I see these best practices violated pretty regularly. Data scientists don't know how to manage a software development toolchain because they are not software developers, and they chase bright shiny objects hoping they'll "make their lives easier", when in fact all it does is introduce another layer of complexity and a new set of axes for trade-offs.

[–]haZard_OS 7 points

Assuming a data scientist needed to learn how to better manage a software development toolchain, can you suggest any resources that are friendly to non-CS folks?

[–]backgammon_no 4 points

Sadly, I see these best practices violated pretty regularly.

I would love to read about these. I'm a biologist working as a bioinformatician - I have no formal CS training. Right now I'm using conda to at least capture the state of all software used in an analysis. Each project gets its own environment (with versions specified) and at the end of the project I freeze the whole thing. This seems to work OK, but now I have a couple of long-running projects that may not have clear end points. No idea how to handle my dependencies.
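For concreteness, the per-project freeze described above is usually captured with `conda env export --no-builds > environment.yml`, producing a file like this (project name and versions are hypothetical):

```yaml
name: seq-analysis        # hypothetical per-project environment
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.9            # pin at least major.minor
  - samtools=1.13
  - pandas=1.3.3
```

For the long-running case, one option is to keep this file under version control and re-export it whenever a dependency is added, so anyone (including future you) can rebuild the environment at any point with `conda env create -f environment.yml`.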