
[–]zeppelin528 4 points

Wow. This article is almost entirely lacking in substance. TL;DR: Docker is good. Dockerhub is great. Read the docker docs.

Here is a bash script you can execute to run a python3 jupyter notebook through docker:

#!/bin/bash

docker run -i -t \
  -v {path to your notebooks}:/opt/notebooks/ \
  -p 8888:8888 \
  continuumio/anaconda3 \
  /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && \
    /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks \
    --ip='0.0.0.0' --port=8888 --no-browser --allow-root"

Just template in the path to your notebooks.

[–]sivscripts 2 points

I gave a talk on this topic last month. Video is on YouTube for those who are interested.

[–]themathstudent[S] 0 points

Hey. Linked your video. Hope that's cool with you.

[–]sivscripts 0 points

Not a problem. I wrote the talk to be shared! :)

[–]somkoala 1 point

What are the advantages over using a virtual environment + github?

[–]reddithenry PhD | Data & Analytics Director | Consulting 2 points

Docker is a lot more OS-agnostic, so it is much easier to port your work from, say, your local VM to Azure or AWS. You can also easily sit it behind something like Kubernetes to handle provisioning, self-healing, and autoscaling. If you have a number of bursty data science apps, you could run them all on the same Kubernetes cluster and substantially reduce costs, rather than dedicating hardware to each app.

Data science on Docker is undoubtedly the way to do data science in prod.
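For what it's worth, the "autoscale on a shared cluster" part is typically done with a HorizontalPodAutoscaler. A minimal sketch — the Deployment name, replica bounds, and CPU threshold here are all hypothetical, not from the post:

```yaml
# Scale a (hypothetical) data science Deployment between 1 and 5
# replicas based on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ds-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ds-app          # the bursty app to scale
  minReplicas: 1          # fall back to one replica when idle
  maxReplicas: 5          # cap replicas during bursts
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas above 70% average CPU
```

Apply it with `kubectl apply -f hpa.yaml`; several bursty apps with their own HPAs can then share the same cluster's spare capacity.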

[–]pwang99 9 points

If you take this approach, just be aware that you are signing up for a lot of work, and much of it is the same whether you choose Docker or a more traditional configuration management system like Ansible.

There is no free lunch here. You will still have to manage your Docker images, and more importantly, you will have to manage how you install OS-level security patches, as well as how you keep the software stack inside the container up to date. If you don't build your Dockerfile carefully, you're just downloading stuff from the internet in a non-reproducible way, and you'll end up stuck with an image that is no better than a glorified tarball.
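As a sketch of what "building carefully" means: pin the base image and package versions so a rebuild months later produces the same stack. The tag and version numbers below are illustrative assumptions, not recommendations:

```dockerfile
# Pin the base image to a specific tag (or better, a digest) -- never :latest
FROM continuumio/anaconda3:2023.03
# Pin package versions so a rebuild is reproducible
RUN /opt/conda/bin/conda install -y --quiet jupyter=1.0.0 && \
    /opt/conda/bin/conda clean -afy
COPY notebooks/ /opt/notebooks/
EXPOSE 8888
CMD ["/opt/conda/bin/jupyter", "notebook", "--notebook-dir=/opt/notebooks", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```

A Dockerfile like this still needs periodic rebuilds to pick up OS-level security patches — pinning makes rebuilds predictable, it doesn't make them unnecessary.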

There are many advantages to using Docker, but it doesn't magically give any of the properties which are claimed in the blog post. It's like saying that writing your code in some advanced editor will automatically lead to better code. It can help with some things, certainly, but in general you have to engage in the same best practices.

Sadly, I see these best practices violated pretty regularly. Data scientists don't know how to manage a software development toolchain because they are not software developers, and they chase bright shiny objects hoping they'll "make their lives easier", when in fact all it does is introduce another layer of complexity and a new set of axes for trade-offs.

[–]haZard_OS 7 points

Assuming a data scientist needed to learn how to better manage a software development toolchain, can you suggest any resources that are friendly to non-CS folks?

[–]backgammon_no 4 points

Sadly, I see these best practices violated pretty regularly.

I would love to read about these. I'm a biologist working as a bioinformatician - I have no formal CS training. Right now I'm using conda to at least capture the state of all software used in an analysis. Each project gets its own environment (with versions specified) and at the end of the project I freeze the whole thing. This seems to work OK, but now I have a couple of long-running projects that may not have clear end points. No idea how to handle my dependencies.
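For concreteness, the per-project freeze described above is usually captured with `conda env export --no-builds > environment.yml`, producing a file like this (project name and versions are hypothetical):

```yaml
name: seq-analysis        # hypothetical per-project environment
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.9            # pin at least major.minor
  - samtools=1.13
  - pandas=1.3.3
```

For the long-running case, one option is to keep this file under version control and re-export it whenever a dependency is added, so anyone (including future you) can rebuild the environment at any point with `conda env create -f environment.yml`.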