Multiple RUN vs. single chained RUN in Dockerfile, which is better?

304

Dockerfile.1 executes multiple RUN:

FROM busybox
RUN echo This is the A > a
RUN echo This is the B > b
RUN echo This is the C > c

Dockerfile.2 joins them:

FROM busybox
RUN echo This is the A > a &&\
    echo This is the B > b &&\
    echo This is the C > c

Each RUN creates a layer, so I always assumed that fewer layers is better and thus Dockerfile.2 is better.

This is obviously true when a RUN removes something added by a previous RUN (e.g. yum install nano && yum clean all chained into one line), but in cases where every RUN adds something, there are a few points to consider:

  1. Layers are supposed to just add a diff on top of the previous one, so if a later layer does not remove something added in a previous one, there should not be much of a disk-space advantage between the two approaches.

  2. Layers are pulled in parallel from Docker Hub, so Dockerfile.1, although probably slightly bigger, would theoretically get downloaded faster.

  3. If I add a 4th command (e.g. echo This is the D > d) and rebuild locally, Dockerfile.1 would build faster thanks to the cache, but Dockerfile.2 would have to run all 4 commands again.

So, the question: Which is a better way to do a Dockerfile?

Caithness answered 30/8, 2016 at 9:9 Comment(1)
Can't be answered in general as it depends on the situation and on the use of the image (optimize for size, download speed, or building speed) – Burdened
223

When possible, I always merge commands that create files with the commands that delete those same files into a single RUN line. This is because each RUN line adds a layer to the image; its contents are quite literally the filesystem changes you could view with docker diff on the temporary container it creates. If you delete a file that was created in a different layer, all the union filesystem does is register the change in a new layer: the file still exists in the previous layer, and is still shipped over the network and stored on disk. So if you download source code, extract it, compile it into a binary, and then delete the tgz and source files at the end, you really want this all done in a single layer to reduce the image size.
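
A minimal sketch of that pattern (the download URL, project name, and build commands are made up; only the shape of the single RUN matters):

    FROM debian:bookworm
    # fetch, build, install, and clean up in one RUN, so the tarball, the source
    # tree, and the build tools never end up in any layer of the image
    RUN apt-get update && \
        apt-get install -y --no-install-recommends ca-certificates curl gcc make && \
        curl -fsSL -o /tmp/tool.tgz https://example.com/tool-1.0.tgz && \
        tar -xzf /tmp/tool.tgz -C /tmp && \
        make -C /tmp/tool-1.0 && \
        cp /tmp/tool-1.0/tool /usr/local/bin/ && \
        rm -rf /tmp/tool.tgz /tmp/tool-1.0 && \
        apt-get purge -y curl gcc make && apt-get autoremove -y && \
        rm -rf /var/lib/apt/lists/*

Had the same steps been split across several RUN lines, the tarball and source tree would still be stored in the earlier layers even after the final rm.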

Next, I personally split up layers based on their potential for reuse in other images and on expected cache usage. If I have 4 images that all share the same base image (e.g. debian), I may pull the collection of utilities common to most of those images into the first RUN command, so the other images benefit from caching.

Order in the Dockerfile is important for image cache reuse. I look for components that will update very rarely, possibly only when the base image updates, and put those high up in the Dockerfile. Towards the end of the Dockerfile, I include commands that run quickly and may change frequently, e.g. adding a user with a host-specific UID or creating folders and changing permissions. If the container includes interpreted code (e.g. JavaScript) that is being actively developed, that gets added as late as possible so that a rebuild only runs that single change.

Within each of these groups of changes, I consolidate as best I can to minimize layers. So if there are 4 different source code folders, they get placed inside a single folder so they can be added with a single command. Any package installs from something like apt-get get merged into a single RUN when possible, to minimize the package manager overhead (updating and cleaning up).
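
A sketch that combines the ordering and consolidation points above (the package names, paths, and UID are illustrative, not from the answer):

    FROM debian:bookworm
    # rarely-changing system packages: one consolidated RUN near the top, so this
    # layer stays cached across rebuilds and can be shared by sibling images
    RUN apt-get update && \
        apt-get install -y --no-install-recommends ca-certificates curl tini && \
        rm -rf /var/lib/apt/lists/*
    # quick setup steps that change occasionally (e.g. a host-specific UID)
    RUN useradd --uid 1000 appuser && mkdir -p /app && chown appuser /app
    USER appuser
    # actively-developed code goes last: the source folders live under a single
    # directory, so one COPY suffices and a code change only rebuilds this layer
    COPY --chown=appuser src/ /app/src/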


Update for multi-stage builds:

I worry much less about reducing image size in the non-final stages of a multi-stage build. Since these stages aren't tagged and shipped to other nodes, you can maximize the likelihood of cache reuse by splitting each command into a separate RUN line.

However, multi-stage builds aren't a perfect replacement for squashing layers, since all you copy between stages are the files, not the rest of the image metadata like environment variable settings, the entrypoint, and the command. And when you install packages in a Linux distribution, the libraries and other dependencies may be scattered throughout the filesystem, making a copy of all the dependencies difficult.

Because of this, I use multi-stage builds as a replacement for building binaries on a CI/CD server: the CI/CD server only needs the tooling to run docker build, and doesn't need a JDK, Node.js, Go, or any other compile tools installed.
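
A sketch of that pattern for a Go program (the module layout and binary name are assumptions, not something from the answer):

    # build stage: the full toolchain lives here and never reaches the final image
    FROM golang:1.22 AS build
    WORKDIR /src
    COPY go.mod go.sum ./
    RUN go mod download
    COPY . .
    RUN CGO_ENABLED=0 go build -o /out/app ./cmd/app

    # final stage: only the compiled binary is copied across
    FROM alpine:3.20
    COPY --from=build /out/app /usr/local/bin/app
    ENTRYPOINT ["app"]

The machine running docker build never needs the Go toolchain installed; it is pulled as part of the build-stage image and discarded with that stage.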

Bugeye answered 5/9, 2016 at 12:17 Comment(0)
77

The official answer is listed in Docker's best practices (official images MUST adhere to these):

Minimize the number of layers

You need to find the balance between readability (and thus long-term maintainability) of the Dockerfile and minimizing the number of layers it uses. Be strategic and cautious about the number of layers you use.

Since Docker 1.10, the COPY, ADD and RUN statements add a new layer to your image. Be cautious when using these statements. Try to combine commands into a single RUN statement. Separate this only if it's required for readability.

More info: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#minimize-the-number-of-layers

Update: Multi-stage builds in Docker 17.05 and newer

With multi-stage builds you can use multiple FROM statements in your Dockerfile. Each FROM statement is a stage and can have its own base image. In the final stage you use a minimal base image like alpine, copy the build artifacts from the previous stages, and install the runtime requirements. The end result of this stage is your image, so this is the stage where you worry about layers as described earlier.

As usual, Docker has great docs on multi-stage builds. Here's a quick excerpt:

With multi-stage builds, you use multiple FROM statements in your Dockerfile. Each FROM instruction can use a different base, and each of them begins a new stage of the build. You can selectively copy artifacts from one stage to another, leaving behind everything you don’t want in the final image.

A great blog post about this can be found here: https://blog.alexellis.io/mutli-stage-docker-builds/

To answer your points:

  1. Yes, layers are sort of like diffs. I don't think a layer is added if there are absolutely zero changes. The problem is that once you install or download something in layer #2, you cannot remove it in layer #3. So once something is written in a layer, the image size can no longer be decreased by removing it later.

  2. Although layers can be pulled in parallel, making it potentially faster, each layer undoubtedly increases the image size, even if it only removes files.

  3. Yes, caching is useful if you're updating your Dockerfile. But it works in one direction: if you have 10 layers and you change layer #6, you still have to rebuild everything from layer #6 to #10 (see the sketch below). So it won't speed the build process up all that often, but it is guaranteed to unnecessarily increase the size of your image.
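
A sketch of how that invalidation plays out (the package and file names are hypothetical):

    FROM debian:bookworm
    # cached as long as this line and the base image stay the same
    RUN apt-get update && apt-get install -y nginx && rm -rf /var/lib/apt/lists/*
    # editing nginx.conf invalidates the cache from this point onwards...
    COPY nginx.conf /etc/nginx/nginx.conf
    # ...so these later steps are rebuilt too, even though they did not change
    COPY site/ /usr/share/nginx/html/
    RUN nginx -t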


Thanks to @Mohan for reminding me to update this answer.

Railing answered 30/8, 2016 at 18:27 Comment(0)
42

More recent docs note this:

Prior to Docker 17.05, and even more, prior to Docker 1.10, it was important to minimize the number of layers in your image. The following improvements have mitigated this need:

[...]

Docker 17.05 and higher add support for multi-stage builds, which allow you to copy only the artifacts you need into the final image. This allows you to include tools and debug information in your intermediate build stages without increasing the size of the final image.

and this:

Notice that this example also artificially compresses two RUN commands together using the Bash && operator, to avoid creating an additional layer in the image. This is failure-prone and hard to maintain.

This guidance suggests using multi-stage builds and keeping Dockerfiles readable.

Hanker answered 23/11, 2017 at 8:44 Comment(4)
While multi-stage builds seem a good option to keep the balance, the actual fix to this question will come when the docker image build --squash option moves out of experimental. – Caithness
@Caithness - I'm skeptical about squash getting past experimental. It has many gimmicks and only made sense before multi-stage builds. With multi-stage builds you only need to optimise the final stage, which is very easy. – Railing
@Caithness To expand on that, only layers in the last stage make any difference to the size of the final image. So if you put all your builder gubbins in earlier stages and have the final stage just install packages and copy across files from earlier stages, everything works beautifully and squash isn't needed. – Hanker
Contrary to your opening statement that "answers above are outdated", other answers are still correct. You are selectively quoting the documentation and assuming everyone will just switch to multi-stage builds to get performance improvements. While multi-stage builds are great, they may not be the best solution for everyone. – Defray
12

It depends on what you include in your image layers. The key point is sharing as many layers as possible.

Bad Examples
  1. Dockerfile.1

    RUN yum install big-package && yum install package1
    
  2. Dockerfile.2

    RUN yum install big-package && yum install package2
    
Good Examples
  1. Dockerfile.1

    RUN yum install big-package
    RUN yum install package1
    
  2. Dockerfile.2

    RUN yum install big-package
    RUN yum install package2
    

Another suggestion: deleting files only helps if it happens in the same layer as the action that added or installed them.
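
For example, a sketch of the difference (big-package stands in for any real package):

    # cleanup in the same RUN: the package cache never reaches any layer
    RUN yum install big-package && yum clean all

    # cleanup in a later RUN: the cache is still stored in the previous layer,
    # so the image does not get any smaller
    RUN yum install big-package
    RUN yum clean all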

Scheer answered 30/8, 2016 at 9:35 Comment(4)
Would these 2 really share the RUN yum install big-package layer from cache? – Caithness
Yes, they would share the same layer, provided they start from the same base. – Cosmorama
Why did you provide 2 examples (with package1 and package2)? They are both the same; the package name does not make any difference. – Staceystaci
I updated the Dockerfile names again to keep them aligned with the question. The purpose of the demo is to show how two Dockerfiles can share cache with each other. – Scheer
