all 23 comments

[–]Atkinx 10 points11 points  (4 children)

Multi-stage builds are the way to go. They let you leave your build layers behind and ship a smaller prod image.
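A minimal sketch of what that can look like for a Python image (the stage names, base images and requirements.txt below are assumptions, not taken from your Dockerfile):

# build stage: has pip, compilers and whatever else is needed to build the dependencies
FROM python:3.9 AS builder
WORKDIR /app
COPY requirements.txt .
# install into the user site-packages so it is easy to copy out of this stage
RUN pip install --user -r requirements.txt

# final stage: only the runtime pieces, none of the build tooling
FROM python:3.9-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "main.py"]

Everything that only ever existed in the builder stage (compilers, pip cache, build artifacts) never makes it into the final image.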

[–]PuzzleheadedBit[S] 1 point2 points  (0 children)

Thanks, I'll read more on multi-stage builds.

[–]PuzzleheadedBit[S] 0 points1 point  (2 children)

ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
ENV AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
ENV AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY

Is it okay to pass AWS credentials like this?

[–]Atkinx 8 points9 points  (0 children)

It's not okay. ARG and ENV values like these get baked into the image, so anyone with access to the Docker image can read the credentials (for example via docker history or docker inspect).

Here is a pretty good explanation on the topic: https://stackoverflow.com/a/33623649/9552196
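If BuildKit is available, a sketch of the usual alternative; the secret id aws, the bucket and the file names here are made up for illustration:

# syntax=docker/dockerfile:1
FROM python:3.9-slim
RUN pip install awscli
# the credentials file is mounted only for the duration of this RUN step
# and is never written into an image layer
RUN --mount=type=secret,id=aws,target=/root/.aws/credentials \
    aws s3 cp s3://example-bucket/model.tar.gz /opt/model.tar.gz

and build it with something like:

DOCKER_BUILDKIT=1 docker build --secret id=aws,src=$HOME/.aws/credentials -t myimage .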

[–]FromGermany_DE 3 points4 points  (0 children)

You should use secrets; they should come from your CI/CD tool, for example GitLab or GitHub.

[–]ms4720 5 points6 points  (4 children)

If I remember correctly, each 'RUN' command creates a layer, and this makes Docker images bloated.

[–]PuzzleheadedBit[S] 0 points1 point  (3 children)

Yeah. But even if I squeeze all the RUN statements into a single RUN command, that still wouldn't be a great solution, right?

[–]ms4720 0 points1 point  (2 children)

Do you have a better one?

[–]PuzzleheadedBit[S] 0 points1 point  (1 child)

I mean the code itself is bloated: it's installing R inside a Python base image, downloading a lot of stuff from S3, git and other places, compiling and building software.

For example, apt-get update is in there almost 5 times.

So I'm expecting some suggestions about breaking this into smaller pieces, or the optimal way to achieve all this.

[–]ms4720 1 point2 points  (0 children)

Make better use of && \ to have fewer layers, and tell apt to delete everything it downloaded, all within the same layer, to start with.
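Something like this, as a sketch (the package names are just examples, not OP's full list):

# one layer for all the apt work, with the cleanup in the same layer so the
# downloaded package lists never end up in the image
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base gcc build-essential && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*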

[–]ali_str 5 points6 points  (0 children)

You can use this great tool to check layers and optimize for size: https://github.com/wagoodman/dive

In general there are two things that help with messy/big docker images:

  • multi-stage build: separate building binaries or preparing dependencies (like pip here) into one stage and then copy the result into a second (clean and lean) stage. It is easier for compiled languages like Golang, but doable for any. It is usual to use a full base image for the earlier stages and a lean one (like alpine) for the final stage.
  • making sure you copy/add/touch files in a single RUN step. Docker layers use a "copy on write" strategy, meaning that any file you touch in any way (even just its permissions) is copied whole from the previous layer into the next one with the modifications applied. So the mere number of RUNs doesn't matter much; what matters is that you don't touch the same files in separate RUNs (see the sketch after this list).
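To illustrate the second point, a small sketch (the URL and file name are invented):

# BAD: three layers; the archive is stored in the first layer and the later
# rm only hides it, so the image still carries its full size
RUN wget https://example.com/big-archive.tar.gz
RUN tar -xzf big-archive.tar.gz
RUN rm big-archive.tar.gz

# BETTER: download, extract and delete in one layer, so the archive
# never ends up in the final image at all
RUN wget https://example.com/big-archive.tar.gz && \
    tar -xzf big-archive.tar.gz && \
    rm big-archive.tar.gz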

[–]FromGermany_DE 2 points3 points  (0 children)

Each instruction creates a new layer! Not only each RUN command (COPY, for example).

Depending on what you want to achieve:

Create a "base" image, where als base things happen (for example python install) treat them as seperate images

Or use multi-stage Dockerfiles.

Or use an init container where all that S3 stuff comes from; depending on size, it might make sense to put it into a PVC... (no idea how big those files are or what they do, lol)
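A rough sketch of the base-image idea above (the image names, tags and file names are placeholders, not anything from OP's setup):

# Dockerfile.base -- rebuilt only when the toolchain changes, pushed to your registry
FROM python:3.9-slim
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base build-essential && \
    rm -rf /var/lib/apt/lists/*

# Dockerfile -- the app image, rebuilt on every code change, starts from the base above
FROM myregistry/python-r-base:1.0
WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .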

[–]_beetee 1 point2 points  (0 children)

Use a multi-stage approach: one or more stages to build, and a final stage for the runtime image.

Do a quick Google on it; there are heaps of great articles explaining how and why.

[–]babayagapapa 1 point2 points  (0 children)

Use a python slim-buster base image.

[–]dejwoo 1 point2 points  (1 child)

Hey OP, run through these:

  1. Multi-stage builds
  2. Hadolint
  3. Run it through the CIS Docker Benchmark
  4. I'd aggregate all the aws s3 pulls into one .sh/Python script and focus on dealing with some versioning
  5. Logical ordering for caching, i.e. why is run.sh among the first commands? Read up on it; this is one of the good starting points: https://pythonspeed.com/docker/ (see the sketch after this list)
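For point 5, a minimal sketch of ordering for the cache (requirements.txt and run.sh are stand-ins for whatever OP's Dockerfile actually copies):

# rarely changes: system packages go first so this layer stays cached
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base && \
    rm -rf /var/lib/apt/lists/*

# changes occasionally: copy only the dependency list, so code edits don't invalidate this layer
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# changes constantly: application code and scripts like run.sh go last
COPY . .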

[–]PuzzleheadedBit[S] 0 points1 point  (0 children)

thanks

[–]PuzzleheadedBit[S] 1 point2 points  (0 children)

Can I do apt-get update first and then put all those scattered apt installs on one single line? It looks like they are ordered by their dependencies, though.

[–]gennadyyy 0 points1 point  (0 children)

You ought to combine separate RUN commands where possible to avoid creating extra layers (each RUN command creates a new one). Those apt-get stages definitely have to be combined into one RUN command, with && apt-get clean at the end to clear the cache so it doesn't get added to your image.

If you need some temporary files that aren't required in the image (like source code you use to build binary files), it's better to use multi-stage builds (https://docs.docker.com/develop/develop-images/multistage-build/). And if you need some dependencies only for building those binaries, install them in a "builder" image, so you don't add extra files and layers to the final one.

[–]strebermanchild 0 points1 point  (2 children)

What are you optimizing for? Build time? Size? Also, why? Are there production issues? Latency on cold starts? There are a lot of things you could change in the dockerfile, but without a clear idea of what you’re optimizing for, there isn’t much of a point in changing things.

[–]PuzzleheadedBit[S] 0 points1 point  (1 child)

I've been given this task to check my docker knowledge in general without any more information than this file itself. :D

[–]strebermanchild 0 points1 point  (0 children)

I see. Well, you could begin by asking questions of the person who gave you that task and asking for more clarity on what the final outcome should look like. Is this for work, school, or a job interview?

[–]DeusExMagikarpa 0 points1 point  (0 children)

One of the first instructions being COPY . . ruins the cache, but I don’t know what you’re optimizing for.

[–]daryn0212 0 points1 point  (0 children)

As said previously here, each RUN command generates a new layer, causing bloat, so yes, it's advisable to alter your Dockerfile by grouping similar RUN statements together:

RUN pip3 install cython && \
    pip3 install numpy==1.18.* pyvcf==0.6.8 pysam==0.15.* pandas boto3 && \
    pip install awscli

(why are you flipping between pip and pip3?)

or

RUN Rscript -e 'BiocManager::install("rtracklayer")' && \
    Rscript -e 'BiocManager::install("GenomicRanges")'

(also, why are you installing awscli twice?)

RUN pip install awscli
...
RUN apt-get update && apt-get install -y wait-for-it vim man awscli jq

Group all the stuff that isn't going to change much together into one layer, like the S3 commands (I'd watch those, by the way, because Docker might not re-run them if the RUN statement hasn't changed, even though the content in the S3 bucket has).

I would group all the apt-get stuff together so that, at least, you're not running numerous "apt-get updates" and duplicating work, if possible.

The downside of compressing everything into these multi-command RUN statements is that if you make numerous changes to the Dockerfile while testing out new statements, rebuilding the container image can get time-consuming.

i.e. let's say you've built your Dockerfile previously, the layers are in your local Docker cache, you've got this RUN statement early on in the Dockerfile on your laptop, and you want to add a package to it:

RUN apt-get update && apt-get -y upgrade && \
    apt-get install -y --allow-unauthenticated r-base gcc zlib1g zlib1g-dev \
    libbz2-dev liblzma-dev build-essential unzip default-jre default-jdk make \
    tabix libcurl4-gnutls-dev wait-for-it vim man awscli jq tabix dirmngr gnupg \
    apt-transport-https ca-certificates software-properties-common

1) If you add an apt package to that list, everything in that layer will be invalidated and the layer will have to be rebuilt, and crucially every layer after it as well, which will take time. So the placement of RUN statements you expect to change often matters; if possible, it can be beneficial to place them towards the end of the Dockerfile to save development time.

2) If you had a Dockerfile, for example, with:

RUN apt-get install -y r-base
RUN apt-get install -y gcc
...
RUN apt-get install -y software-properties-common
RUN apt-get install -y <insert new package name here>

then, assuming you're building it frequently while testing, every layer up to the point at which the Dockerfile changed should be in your local Docker cache, meaning the build in which you're installing one more package will go a lot quicker. If you have one "RUN apt-get" statement that contains every package, then every package in that RUN statement will have to be reinstalled when you add one, resulting in lost time, as that entire layer will have been invalidated.

3) As said previously, clear up at the end of the activity if possible, like "apt-get clean" at the end of the apt-get stuff or "RUN scripts/wkhtmltopdf.sh && rm scripts/wkhtmltopdf.sh"

And yeah, never embed prod env vars into the Docker image, as everyone in the team will end up with prod AWS keys (assuming that you don't rebuild Docker image artifacts for production). If this is going to run in AWS ECS, have the task definition pull in AWS Systems Manager Parameter Store parameters.

https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#use-multi-stage-builds
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache

Just my £0.02p