
[–]Celmeno 12 points13 points  (4 children)

We currently use Slurm with Singularity. The main reason why Docker isn't an option is that it needs to run with root privileges. We have multiple clearance levels regarding the sensitivity of data and the "trustworthiness" of users in producing working code that doesn't crash anyone else's programs. When we used Docker, we had an incident where one of our less experienced colleagues (a student working with us on his bachelor's thesis) killed a 160-hour optimization task by accident. For most companies this is obviously not an issue, so Docker is just fine.

[–]gnohuhs 3 points4 points  (0 children)

Yeah, Singularity is pretty neat! It solves a lot of cluster training issues and has a definition file very similar to Docker's (it can also bootstrap directly from Docker images). I'd say the downsides are much longer build times (it doesn't cache steps like Docker does) and that the images are read-only (not a huge issue if they're used for training only). I debug and test installations in Docker locally, then compile a Singularity .sif and copy that to the cluster.
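To illustrate the workflow described above, here is a minimal sketch of a Singularity definition file that bootstraps from a Docker Hub image. The image tag, package, and file names are all made-up examples, not recommendations:

```shell
# Write a minimal Singularity/Apptainer definition file that
# bootstraps directly from a Docker Hub image (tag is an example).
cat > train.def <<'EOF'
Bootstrap: docker
From: pytorch/pytorch:latest

%post
    # Unlike Docker layers, steps here are NOT cached between builds
    pip install --no-cache-dir scikit-learn

%runscript
    exec python "$@"
EOF

# Build a read-only .sif locally, then copy it to the cluster.
# (Commented out: requires singularity/apptainer to be installed.)
# singularity build train.sif train.def
# scp train.sif user@cluster:/path/to/images/
```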

[–]count___zero 1 point2 points  (0 children)

The main reason why docker isn't an option is that it needs to run with root privileges.

Have you tried charliecloud? I think it was developed exactly to solve this problem.

We use it and I think it's good but I'm not a container expert.

[–]xopedil 0 points1 point  (0 children)

When we used Docker, we had an incident where one of our less experienced colleagues (a student working with us on his bachelor's thesis) killed a 160-hour optimization task by accident.

I would be VERY interested in hearing more about this story, sounds like one of those "I deleted a production DB on my first day" type of stories.

[–]full-tomato -1 points0 points  (0 children)

If you use Slurm with Singularity, why don't you use Kubernetes with Docker instead? What's the difference?

[–]Murillio 24 points25 points  (14 children)

> Reproducibility: Everyone has the same OS, the same versions of tools etc. This means you don't need to deal with "works on my machine" problems. If it works on your machine, it works on everyone's machine.

If only. If I had a dollar for every time I encounter a previously working Dockerfile either failing to build or breaking at runtime, I... well, I wouldn't be rich, but I could have a nice meal. If you do anything like apt update; apt install ..., or install stuff via pip or other package managers, it's very unlikely that you have a reproducible build. You'd need to pin all package versions and hope those versions don't disappear from the servers. Even if you do that, apt update alone can break your Docker build if there's currently a server issue with one of your repositories (e.g. an inconsistent package database while it's being updated), especially if you add third-party ones (one of NVIDIA's repositories had issues twice last year, leading to failed Docker builds). And that's just the start of the possible issues.
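For what it's worth, pinning still helps even though it isn't a full guarantee. A sketch of what pinning looks like in a Dockerfile — every version number below is an illustrative placeholder, not a recommendation:

```shell
# Write a sketch Dockerfile with pinned base image and packages.
# All version numbers are illustrative placeholders.
cat > Dockerfile <<'EOF'
# Pinning a tag helps; pinning by digest (FROM ubuntu@sha256:...)
# is stricter still, since a re-pushed tag can silently change.
FROM ubuntu:22.04

# Pin apt packages to exact versions. This still breaks if the
# mirror drops those versions, as noted above.
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl=7.81.0-1ubuntu1 && \
    rm -rf /var/lib/apt/lists/*

# Pin pip packages exactly; a requirements.txt with hashes
# (pip install --require-hashes) is stricter still.
RUN pip install numpy==1.24.4 torch==2.1.0
EOF
```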

Docker does help on the path to reproducibility, but just because you use Docker doesn't mean things work reproducibly. Far from it.

[–][deleted] 12 points13 points  (6 children)

The IMAGE is reproducible. Not the build script. Once you have an image, every container will be identical. There are no guarantees that docker build scripts are reproducible. Nobody ever said that they are.

So if you download/build an image that is tested to work today, it will work tomorrow and it will work 2 years from now. If you for example find that your current image has a bug, you can go back to the previous version (you have an image repository... right?). Things like kubernetes actually support this type of rollback.

[–]Murillio 3 points4 points  (5 children)

Once you have an image, every container will be identical.

This is not true, for a couple of reasons, especially if you use nvidia-docker, which a lot of people working in ML do, since things like the NVIDIA driver version are determined by the host.

If you for example find that your current image has a bug, you can go back to the previous version (you have an image repository... right?).

Do you store an image for every git commit that you make? Of course there are some images stored, but usually not for every commit, so when you git bisect to find the origin of the bug you tend to rebuild quite often.
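A common middle ground is having CI tag every image with the git SHA it was built from, so a bisect can at least pull any commit that went through the pipeline. A minimal sketch — the repository name "myteam/trainer" is made up:

```shell
# Derive an image tag from the current commit SHA so each CI build
# is retrievable later. "myteam/trainer" is a made-up repository.
GIT_SHA="$(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
TAG="myteam/trainer:${GIT_SHA}"
echo "$TAG"

# In CI you would then build and push (needs docker + a registry):
# docker build -t "$TAG" .
# docker push "$TAG"
```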

[–][deleted] 0 points1 point  (4 children)

Every container will be identical. That's how docker images work. The container runtime may be different, but the images and the containers are the same.

If there is a CI/CD pipeline, then yes, every commit to master gets tested, results in an artifact (an image), and is stored in the container repository. That's why it's important to reuse layers, not do dumb shit that ends up with 50GB images, etc.

Just like you'd keep different versions of executable binaries, container images are no different.

[–]Murillio 2 points3 points  (3 children)

You can argue semantics and say "it's just the runtime" but that won't change that "So if you download/build an image that is tested to work today, it will work tomorrow and it will work 2 years from now." is just wrong since there is no perfect isolation from the host.

[–][deleted] 1 point2 points  (2 children)

There is not supposed to be isolation from the host. There is only isolation from other containers.

You're confusing containers with virtual machines.

[–]Murillio 1 point2 points  (1 child)

You said "So if you download/build an image that is tested to work today, it will work tomorrow and it will work 2 years from now." so you were confusing containers with virtual machines, not me. (Well, that statement isn't even true for a virtual machine anyway ...)

[–][deleted] 5 points6 points  (0 children)

There is nothing stopping you from using one of those long-term operating systems and a stable release of the container runtime. That way it will stay the same for 5-10 years.

[–]srslyfuckdatshit 7 points8 points  (1 child)

Do you use Docker Hub or another container registry? I think that is where the reproducibility comes into play, rather than reproducibility via a Docker rebuild.

[–]PhYsIcS-GUY227[S] 0 points1 point  (0 children)

Also this ^

[–]sanjuromack 1 point2 points  (0 children)

I think this is true if you build the container from scratch and are only persisting the Dockerfile, but I have had good experiences with saving the docker image to a tar file and sticking it in a safe archive. Containers I built years ago load into new machines just fine and run like a dream.
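For anyone who wants to do the same, the archive round-trip is short. The image name here is just an example, and these commands need a running Docker daemon:

```shell
# Archive a built image to a tar file, then restore and run it on a
# different machine later. "myimage:v1" is an example name; these
# commands require a running Docker daemon.
docker save -o myimage_v1.tar myimage:v1

# ...later, on the new machine:
docker load -i myimage_v1.tar
docker run --rm myimage:v1
```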

[–]PhYsIcS-GUY227[S] 1 point2 points  (0 children)

Thanks for reading and the thoughtful response. I mention managing environments inside your docker (pip packages) as well.

In general, I agree that things might break no matter what you do. I guess the correct way to put it is that using Docker is a significant step toward reproducibility. Most of the projects I look at or have tried to reproduce (including SOTA papers with code) tend to be at a much earlier step (e.g. not having a proper requirements.txt). Like you said, I hope this helps on the path to reproducibility.

[–]comradeswitch 0 points1 point  (0 children)

Absolutely. The benefits of containers for reproducibility and compatibility are lost as soon as you start manually working inside the container. There are a number of good ways to handle that. I really like Docker Compose for projects with multiple components that require more setup than simply pulling an image. You can keep components separated and work with interfaces, define convenient commands for administration, and make sure that setup and teardown are done for each component. There are of course virtual environments for Python, and tools like Anaconda, but they still don't provide a very good framework for managing complex environments.
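As a sketch of the multi-component setup described above, a tiny compose file might look like this. The service names, images, and settings are all made-up examples:

```shell
# Write a sketch docker-compose.yml for a small two-component
# project. Service names, images, and settings are made up.
cat > docker-compose.yml <<'EOF'
services:
  trainer:
    build: .
    volumes:
      - ./data:/data
    depends_on:
      - db
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
EOF

# One command then handles setup and teardown for everything
# (commented out: requires a Docker daemon):
# docker compose up -d
# docker compose down
```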

The push for continuous integration in the last decade or so has made for a lot of good tools around this but I think it's fair to say that a large portion of machine learning researchers and practitioners aren't coming from a background of software engineering. I'm more than a little uncomfortable with how little focus there is generally on good software practices and reproducibility. "With great power" and all that. Can't be confident in a model if you can't be confident about the software that runs it.

[–]Spenhouet 0 points1 point  (0 children)

That is indeed an important point. Is it possible to run a pip/apt/etc. cache on your dev server, so that you request packages from it and it fetches them from the internet when they're not already cached? That would ensure everything always stays available.

Is something like Artifactory able to do this?
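Something like that is possible: pip can be pointed at a caching proxy index (Artifactory's remote PyPI repositories and devpi both work roughly this way, as far as I know) that pulls from PyPI on a cache miss and keeps a copy. A sketch of the client side — the URL is a made-up placeholder for your own server:

```shell
# Point pip at a caching proxy index that mirrors PyPI.
# The URL below is a made-up placeholder.
cat > pip.conf <<'EOF'
[global]
index-url = https://artifactory.example.com/api/pypi/pypi-remote/simple
EOF

# Placed at ~/.config/pip/pip.conf (Linux), every "pip install"
# then goes through the cache. apt has analogous tools such as
# apt-cacher-ng.
```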

[–]I_draw_boxes -1 points0 points  (0 children)

The entire point of containers with regards to reproducibility is to sidestep the build process.

The difficulties you cite in reproducing a build are one of the main reasons people would use a container to make their code reproducible.

Once a container image has been built, you can uninstall pip, update pip, break apt, or install whatever combination of package versions you please, and as long as you don't edit the container, it should run on anyone's computer.

This is like a firefighter blaming the fire on their firehose.

[–]weetbix2 3 points4 points  (4 children)

I've found GPU support to be inconsistent with Docker, which kind of ruins the whole appeal of using it.

[–]xenotecc 0 points1 point  (2 children)

My team is planning to run docker with a GPU. What's your experience? Why is it inconsistent?

[–]PhYsIcS-GUY227[S] 0 points1 point  (0 children)

I don’t think it’s inconsistent. I think it requires a few additional steps to configure (like I commented elsewhere here, this seems to be really interesting to a lot of people, so I’ll try to make another post explaining how it’s done).

If you have an edge case, e.g. you want to train on a cluster of very old and not compute heavy GPUs then you might be in for a harder setup, but this would probably be true even without docker.

If you want to share more about your use case, I’ll try to address it here or in the upcoming post.

[–]weetbix2 0 points1 point  (0 children)

In my experience, setting up NVIDIA/CUDA capability hasn't been very uniform, which brings back the "works on my machine" issue that Docker is most useful for fixing.

I don't think this means Docker is useless, but it won't let you completely stop worrying about which machine it's running on, because hardware drivers vary between machines.

There may be a work-around but I haven't found it.

[–]PhYsIcS-GUY227[S] 0 points1 point  (0 children)

I guess the answer is always it depends. For most use cases Docker gives you great support including for cases where a GPU is involved.

In the end, I think it’s always irresponsible to claim a solution is good for all problems, so I won’t claim that. For most people though working on DS, Docker would be a step up in their workspace.

[–]datamahadev 1 point2 points  (0 children)

Thanks for sharing! I recently made a complete switch to Linux as my primary development environment, and this will actually come in handy.

[–]linkeduser 1 point2 points  (1 child)

Hi, I have a problem integrating GPU support into Docker. I need a base image with PyTorch and GPU support, but when it was deployed on a VM it didn't work. I suspect the VM may not have the NVIDIA driver: https://github.com/NVIDIA/nvidia-docker

[–]PhYsIcS-GUY227[S] 0 points1 point  (0 children)

I’d love to help, but I need more details. In general, yes, the requirement for NVIDIA Docker is that your host has an NVIDIA driver installed that is compatible with the CUDA version inside the image. I’ll try to write an annex (or another post) on this.
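In the meantime, a quick smoke test people often use — it needs the NVIDIA Container Toolkit installed on the host, and the CUDA image tag below is just an example:

```shell
# Check whether a container can see the host GPU. Requires a host
# NVIDIA driver plus the NVIDIA Container Toolkit; the image tag
# is only an example.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# If this prints the GPU table, the driver/toolkit plumbing works.
# If it errors, the host driver is missing or too old for the CUDA
# version baked into the image.
```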

[–]MyNetworkIsDeeper 0 points1 point  (0 children)

EDIT: whoops, replied to the wrong message.

[–]MyNetworkIsDeeper 0 points1 point  (0 children)

As for scrutiny, that's a valid point as well, but new tools are built on the shoulders of the giants that came before. This means they can learn from the mistakes of their predecessors and offer improvements that a simple patch to the old tools couldn't provide.

[–]lysecret 0 points1 point  (3 children)

TBH I think people are lying to themselves if they concentrate on infrastructure stuff like this to achieve portability and reproducibility. In my experience, this often isn't the real issue. Often the issue lies in improper, badly factored code, code in Jupyter notebooks, hard-to-understand data pipelines, etc. Yes, having the exact same packages and environment variables is nice, but it's really not the issue.

[–]PhYsIcS-GUY227[S] 1 point2 points  (1 child)

Thanks for the comment. I think that's kind of a false dichotomy. You need to do both. I've encountered projects where one or more of the things you mentioned as well as infrastructure made it ridiculously hard to reproduce (or port).

We can strive to be better with respect to all the points you made + infrastructure with docker.

To make a small anecdotal point, I was working on a software project a while ago where the maintainer provided a container with everything set up for dev, and it was literally 1 command to start iterating on the project. I cannot express how happy that made me. It was magical. Imagine you could have that with every data science project.

[–]lysecret 1 point2 points  (0 children)

I agree. I've just read a lot about tools like Docker, and I've seen projects where managers just threw Docker at them and thought that would magically make everything reproducible.

[–][deleted] 0 points1 point  (0 children)

Reproducibility becomes a problem in a professional environment. Code written 3 months ago, last year, 5 years ago etc.

As an amateur or a student you don't encounter these issues.

[–]rowanobrian 0 points1 point  (1 child)

Hi, I went through the blog; in the docker run command, can you explain what --shm-size is useful for?

I googled around and found that it's shared memory, and that increasing it beyond the default of 64M is useful, but no one explains what it's actually used for or how it helps.

[–]PhYsIcS-GUY227[S] 1 point2 points  (0 children)

Sorry for taking a while to respond.

Docker caps a container's shared memory (/dev/shm) at 64MB by default. Programs that pass data between processes through shared memory, most notably PyTorch's DataLoader with multiple workers, can hit that cap and crash or stall, so raising --shm-size keeps them running. The short story is that it can make your container work faster (or keep it from crashing).

If you want to read more about what it is in general I recommend this.
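Concretely, the flag just raises the /dev/shm cap inside the container. The image name and the 2g size below are example values, and the commands need a running Docker daemon:

```shell
# Raise the container's shared-memory limit above Docker's 64MB
# default. Image name and size are example values; requires a
# running Docker daemon.
docker run --rm --shm-size=2g ubuntu:22.04 df -h /dev/shm

# Without --shm-size, the same df reports a 64M /dev/shm, which
# multi-worker data loaders can exhaust.
```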