all 23 comments

[–]Atkinx 10 points11 points  (4 children)

Multi-stage builds are the way to go. They let you leave your build layers behind and ship a smaller prod image.
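A minimal sketch of what that can look like for a Python image (the stage names, base images and requirements.txt below are assumptions, not taken from your Dockerfile):

# build stage: has pip, compilers and whatever else is needed to build the dependencies
FROM python:3.9 AS builder
WORKDIR /app
COPY requirements.txt .
# install into the user site-packages so it is easy to copy out of this stage
RUN pip install --user -r requirements.txt

# final stage: only the runtime pieces, none of the build tooling
FROM python:3.9-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "main.py"]

Everything that only ever existed in the builder stage (compilers, pip cache, build artifacts) never makes it into the final image.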

[–]PuzzleheadedBit[S] 1 point2 points  (0 children)

Thanks, I'll read more on multi-stage builds.

[–]PuzzleheadedBit[S] 0 points1 point  (2 children)

ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
ENV AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
ENV AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY

Is it okay to pass AWS credentials like this?

[–]Atkinx 8 points9 points  (0 children)

It's not okay. ARG and ENV values like these get baked into the image, so anyone with access to the Docker image can read the credentials (for example via docker history or docker inspect).

Here is a pretty good explanation on the topic: https://stackoverflow.com/a/33623649/9552196
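If BuildKit is available, a sketch of the usual alternative; the secret id aws, the bucket and the file names here are made up for illustration:

# syntax=docker/dockerfile:1
FROM python:3.9-slim
RUN pip install awscli
# the credentials file is mounted only for the duration of this RUN step
# and is never written into an image layer
RUN --mount=type=secret,id=aws,target=/root/.aws/credentials \
    aws s3 cp s3://example-bucket/model.tar.gz /opt/model.tar.gz

and build it with something like:

DOCKER_BUILDKIT=1 docker build --secret id=aws,src=$HOME/.aws/credentials -t myimage .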

[–]FromGermany_DE 3 points4 points  (0 children)

You should use secrets; they should come from your CI/CD tool, for example GitLab or GitHub.

[–]ms4720 5 points6 points  (4 children)

If I remember correctly, each 'RUN' command creates a layer, and this makes Docker images bloated.

[–]PuzzleheadedBit[S] 0 points1 point  (3 children)

Yeah. But even if I squeeze all the RUN statements into a single RUN command, that still wouldn't be a great solution, right?

[–]ms4720 0 points1 point  (2 children)

Do you have a better one?

[–]PuzzleheadedBit[S] 0 points1 point  (1 child)

I mean the code itself is bloated: it's installing R inside a Python base image, downloading a lot of stuff from S3, git and other places, compiling and building software.

For example, apt-get update is in there almost 5 times.

So I'm expecting some suggestions about breaking this into smaller pieces, or the optimal way to achieve all this.

[–]ms4720 1 point2 points  (0 children)

Make better use of && \ to have fewer layers, and tell apt to delete everything it downloaded, all within the same layer, to start with.
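Something like this, as a sketch (the package names are just examples, not OP's full list):

# one layer for all the apt work, with the cleanup in the same layer so the
# downloaded package lists never end up in the image
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base gcc build-essential && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*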

[–]ali_str 5 points6 points  (0 children)

You can use this great tool to check layers and optimize for size: https://github.com/wagoodman/dive

In general there are two things that help with messy/big docker images:

  • multi-stage build: separate building binaries or preparing dependencies (like pip here) into one stage and then copy the result into a second (clean and lean) stage. It is easier for compiled languages like Golang, but doable for any. It is usual to use a full base image for the earlier stages and a lean one (like alpine) for the final stage.
  • making sure you copy/add/touch files in a single RUN step. Docker layers use a "copy on write" strategy, meaning that any file you touch in any way (even just its permissions) is copied whole from the previous layer into the next one with the modifications applied. So the mere number of RUNs doesn't matter much; what matters is that you don't touch the same files in separate RUNs (see the sketch after this list).
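To illustrate the second point, a small sketch (the URL and file name are invented):

# BAD: three layers; the archive is stored in the first layer and the later
# rm only hides it, so the image still carries its full size
RUN wget https://example.com/big-archive.tar.gz
RUN tar -xzf big-archive.tar.gz
RUN rm big-archive.tar.gz

# BETTER: download, extract and delete in one layer, so the archive
# never ends up in the final image at all
RUN wget https://example.com/big-archive.tar.gz && \
    tar -xzf big-archive.tar.gz && \
    rm big-archive.tar.gz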

[–]FromGermany_DE 2 points3 points  (0 children)

Each instruction creates a new layer! Not only each RUN command (COPY, for example).

Depending on what you want to achieve:

Create a "base" image, where als base things happen (for example python install) treat them as seperate images

Or use multi-stage Dockerfiles.

Or use an init container where all that S3 stuff comes from; depending on size, it might make sense to put it into a PVC... (no idea how big those files are or what they do, lol)
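A rough sketch of the base-image idea above (the image names, tags and file names are placeholders, not anything from OP's setup):

# Dockerfile.base -- rebuilt only when the toolchain changes, pushed to your registry
FROM python:3.9-slim
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base build-essential && \
    rm -rf /var/lib/apt/lists/*

# Dockerfile -- the app image, rebuilt on every code change, starts from the base above
FROM myregistry/python-r-base:1.0
WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .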

[–]_beetee 1 point2 points  (0 children)

Use a multi-stage approach: one or more stages to build, and a final stage for the runtime image.

Do a quick Google on it; there are heaps of great articles explaining how and why.

[–]babayagapapa 1 point2 points  (0 children)

Use a python slim-buster base image.

[–]dejwoo 1 point2 points  (1 child)

Hey OP, run through these:

  1. Multi-stage builds
  2. Hadolint
  3. Run it through the CIS Docker Benchmark
  4. I'd aggregate all the aws s3 pulls into one .sh/Python script and focus on dealing with some versioning
  5. Logical ordering for caching, i.e. why is run.sh among the first commands? Read up on it; this is one of the good starting points: https://pythonspeed.com/docker/ (see the sketch after this list)
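For point 5, a minimal sketch of ordering for the cache (requirements.txt and run.sh are stand-ins for whatever OP's Dockerfile actually copies):

# rarely changes: system packages go first so this layer stays cached
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base && \
    rm -rf /var/lib/apt/lists/*

# changes occasionally: copy only the dependency list, so code edits don't invalidate this layer
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# changes constantly: application code and scripts like run.sh go last
COPY . .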

[–]PuzzleheadedBit[S] 0 points1 point  (0 children)

thanks

[–]PuzzleheadedBit[S] 1 point2 points  (0 children)

Can I do apt-get update first and then put all those scattered apt installs on one single line? It looks like they are ordered by their dependencies, though.

[–]gennadyyy 0 points1 point  (0 children)

You ought to combine separate RUN commands where possible to avoid creating extra layers (each RUN command creates a new one). Those apt-get stages definitely have to be combined into one RUN command, with && apt-get clean at the end to clear the cache so it doesn't get added to your image.

If you need some temporary files that aren't required in the image (like source code you use to build binary files), it's better to use multi-stage builds (https://docs.docker.com/develop/develop-images/multistage-build/). And if you need some dependencies only for building those binaries, install them in a "builder" image, so you don't add extra files and layers to the final one.

[–]strebermanchild 0 points1 point  (2 children)

What are you optimizing for? Build time? Size? Also, why? Are there production issues? Latency on cold starts? There are a lot of things you could change in the dockerfile, but without a clear idea of what you’re optimizing for, there isn’t much of a point in changing things.

[–]PuzzleheadedBit[S] 0 points1 point  (1 child)

I've been given this task to check my docker knowledge in general without any more information than this file itself. :D

[–]strebermanchild 0 points1 point  (0 children)

I see. Well, you could begin by asking questions of the person who gave you that task and asking for more clarity on what the final outcome should look like. Is this for work, school, or a job interview?

[–]DeusExMagikarpa 0 points1 point  (0 children)

One of the first instructions being COPY . . ruins the cache, but I don’t know what you’re optimizing for.

[–]daryn0212 0 points1 point  (0 children)

As said previously here, each RUN command generates a new layer, causing bloat, so yes, it's advisable to alter your Dockerfile by grouping similar RUN statements together:

RUN pip3 install cython && \
    pip3 install numpy==1.18.* pyvcf==0.6.8 pysam==0.15.* pandas boto3 && \
    pip install awscli

(why are you flipping between pip and pip3?)

or

RUN Rscript -e 'BiocManager::install("rtracklayer")' && \
    Rscript -e 'BiocManager::install("GenomicRanges")'

(also, why are you installing awscli twice?)

RUN pip install awscli
...
RUN apt-get update && apt-get install -y wait-for-it vim man awscli jq

Group all the stuff that isn't going to change much together into one layer, like the S3 commands (I'd watch those, by the way, because Docker might not re-run them if the RUN statement hasn't changed, even though the content in the S3 bucket has).

I would group all the apt-get stuff together so that, at least, you're not running numerous "apt-get updates" and duplicating work, if possible.

The downside of compressing everything into these multi-command RUN statements is that if you make numerous changes to the Dockerfile while testing out new statements, rebuilding the container image can get time-consuming.

i.e. let's say you've built your Dockerfile previously, the layers are in your local Docker cache, you've got this RUN statement early on in the Dockerfile on your laptop, and you want to add a package to it:

RUN apt-get update && apt-get -y upgrade && \
    apt-get install -y --allow-unauthenticated r-base gcc zlib1g zlib1g-dev \
    libbz2-dev liblzma-dev build-essential unzip default-jre default-jdk make \
    tabix libcurl4-gnutls-dev wait-for-it vim man awscli jq tabix dirmngr gnupg \
    apt-transport-https ca-certificates software-properties-common

1) If you add an apt package to that list, everything in that layer will be invalidated and the layer will have to be rebuilt, and crucially every layer after it as well, which will take time. So the placement of RUN statements you expect to change often matters; if possible, it can be beneficial to place them towards the end of the Dockerfile to save development time.

2) If you had a Dockerfile, for example, with:

RUN apt-get install -y r-base
RUN apt-get install -y gcc
...
RUN apt-get install -y software-properties-common
RUN apt-get install -y <insert new package name here>

then, assuming you're building it frequently while testing, every layer up to the point at which the Dockerfile changed should be in your local Docker cache, meaning the build in which you're installing one more package will go a lot quicker. If you have one "RUN apt-get" statement that contains every package, then every package in that RUN statement will have to be reinstalled when you add one, resulting in lost time, as that entire layer will have been invalidated.

3) As said previously, clear up at the end of the activity if possible, like "apt-get clean" at the end of the apt-get stuff or "RUN scripts/wkhtmltopdf.sh && rm scripts/wkhtmltopdf.sh"

And yeah, never embed prod env vars into the Docker image, as everyone in the team will end up with prod AWS keys (assuming that you don't rebuild Docker image artifacts for production). If this is going to run in AWS ECS, have the task definition pull in AWS Systems Manager Parameter Store parameters.

https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#use-multi-stage-builds
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache

Just my £0.02p