Docker at insane scale on IBM Power Systems
By: Seetharami Seelam
This post is an update on the Docker on IBM Power Systems work we discussed in our earlier articles, where we described how to build and run Docker on POWER and how to get pre-built Docker tools on POWER. Since those articles were published, we have made a lot of progress in enabling the Docker ecosystem and in optimizing the stack (the Linux kernel and the Go runtime). We have reached an incredible milestone, and we want to share how we got there.
What is this milestone, you say? We can run up to 10,000 Docker containers on a single IBM Power System! These are not toy containers that are read-only or have no networking or logging; these are real application containers with all of the functions that make them usable in the real world. If you are interested in how we got to this scale, you can skip to "How do you get 10,000 containers on a single system? Are you serious?" below.
Questions and Answers
Below we will discuss a few updates on the ecosystem work we have done since our earlier articles.
Q: How do I get Docker daemon on my POWER Linux system?
As we discussed in Docker for Linux on Power Systems, Docker is already part of Ubuntu 15.04 and later versions and Fedora version 23; experimental versions of the tool on Ubuntu and other distributions are also available.
Q: How do I get base images for my application?
We have worked with the community to make many base images available on Docker Hub at hub.docker.com/u/ppc64le/, and the list is growing by the day, so keep an eye on that location. There you will find base images for Ubuntu, Debian, BusyBox, and the rest of your favorites. If you need an image that is not among the base images, please reach out to us and we will make an effort to make it available. We also post base images based on IBM technology at hub.docker.com/u/ibmcom/.
Q: What about Docker registry?
See Setup a Docker Private Registry on POWER Servers running Linux. At that link you will find information on how to get a pre-built registry and instructions to build one from scratch on different Linux distributions.
Q: How do you get 10,000 containers on a single system? Are you serious?
Yes, we are very serious indeed!!
Getting to 10k containers has been a journey. We encountered different problems at different milestones, so let us go through the milestones, the issues we encountered, and our mitigation techniques. Before we go forward, keep in mind that you need a system with a sufficiently large amount of memory. How much memory is sufficient, you ask? That depends on a few factors, the major one being the memory footprint of your container. If your container has a 1MB footprint, 10k containers need 10GB of memory; if your container has a 100MB footprint, you might need 1TB. I know that I super-simplified it, but you get the idea. Some memory will be shared between containers for shared libraries, but again that is a function of your base image and application.
In addition to the memory occupied by the containers, the Docker daemon itself needs some memory (after all, it is the parent process for all your containers), as does the Linux kernel. We have done a lot of work to optimize the memory footprint of management functions like the Docker daemon and the Linux kernel. We will discuss these memory optimizations further once we cover some of the limits we encountered.
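The back-of-the-envelope arithmetic above can be written down as a small sketch. The function name and the overheadMB parameter (memory for the Docker daemon and the kernel) are hypothetical; the article gives no figure for that overhead, so the example passes zero.

```go
package main

import "fmt"

// estimateMemoryMB returns a rough lower bound, in MB, on the memory needed
// for n containers of footprintMB each. overheadMB is a hypothetical knob
// for the daemon and kernel; sharing between containers is ignored.
func estimateMemoryMB(n, footprintMB, overheadMB int) int {
	return n*footprintMB + overheadMB
}

func main() {
	// 10k containers at 1MB each: about 10GB.
	fmt.Println(estimateMemoryMB(10000, 1, 0), "MB")
	// 10k containers at 100MB each: about 1TB.
	fmt.Println(estimateMemoryMB(10000, 100, 0), "MB")
}
```

Because shared libraries reduce the real footprint and the daemon and kernel add to it, treat the result as a starting point for sizing rather than a prediction.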
The first major limit you will encounter as you start many containers is the per-user limit on system-wide resource usage. Two parameters in particular may constrain how many Docker containers you can launch: the number of open files and the maximum number of processes available to a single user. The Docker daemon runs as root, so in this case the limits are those of the root user. Each Docker container has multiple open files and three or more user processes, so for 10k containers these limits need to be at least about 10 times the number of containers, i.e., 100k. So, set these parameters to sufficiently large numbers using the ulimit utility, as below (1048576 is comfortably above that minimum):
ulimit -n 1048576   # sets the number of open files
ulimit -u 1048576   # sets the maximum number of user processes
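To check whether a process already has enough headroom, something like the following Go sketch could be used. It is Linux-only, it checks only the open-files limit, and requiredLimit is a hypothetical helper that encodes the "about 10 times the number of containers" rule of thumb from above.

```go
package main

import (
	"fmt"
	"syscall"
)

// requiredLimit applies the rule of thumb: containers times a per-container
// factor (the article uses roughly 10).
func requiredLimit(containers, perContainer uint64) uint64 {
	return containers * perContainer
}

func main() {
	var rl syscall.Rlimit
	// Read the soft and hard open-files limits of the current process.
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	need := requiredLimit(10000, 10)
	fmt.Printf("open files: soft=%d hard=%d need>=%d enough=%t\n",
		rl.Cur, rl.Max, need, uint64(rl.Cur) >= need)
}
```

Run it as the user that starts the Docker daemon (root in this article), since the limits are per user.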
The next limit is the number of IPs that the docker0 bridge is configured to allocate to containers. If each container needs an IP, we need 10K IPs. The Docker bridge should be configured with a 16-bit mask instead of a 24-bit mask so that it can assign not 256 but 65K IPs. If the bridge is not already configured with the proper netmask, you can reconfigure it as follows:
ifconfig docker0 down
ifconfig docker0 172.17.42.1/16 up
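The difference between the two masks is easy to verify. The sketch below uses Go's standard net package to count the addresses in a CIDR block; addressesInSubnet is a name made up for this illustration.

```go
package main

import (
	"fmt"
	"net"
)

// addressesInSubnet returns how many addresses a CIDR block spans,
// including the network and broadcast addresses.
func addressesInSubnet(cidr string) (int, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return 0, err
	}
	ones, bits := ipnet.Mask.Size()
	return 1 << uint(bits-ones), nil
}

func main() {
	for _, cidr := range []string{"172.17.42.1/24", "172.17.42.1/16"} {
		n, err := addressesInSubnet(cidr)
		if err != nil {
			fmt.Println("bad CIDR:", err)
			continue
		}
		fmt.Printf("%s spans %d addresses\n", cidr, n)
	}
}
```

A few of those addresses (the network address, the broadcast address, and the bridge itself) are not available to containers, so the usable count is slightly lower than the figure printed.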
The second major limit comes at the 1K (1024) container milestone. You cannot start a new container (the 1025th) at this point, especially if each container gets its own IP from the Linux bridge, because the maximum number of ports Linux allows on a bridge is 1K. This value is controlled by two hard-coded parameters in the Linux kernel (as of this writing): BR_PORT_BITS is 10 by default, so BR_MAX_PORTS is limited to 2^10, or 1024. To allow the bridge to use more ports, we need to change BR_PORT_BITS to 14, which allows 16K ports, so that 16K containers can get one IP each. Of course, after this change you need to recompile the kernel. Caution: Don’t do this on production systems or on a system where you have a service contract for the kernel, because recompiling the kernel may void your service contract. We discussed this port issue further here and described alternatives to recompiling the kernel. We got to the next milestone by recompiling the kernel.
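As a quick sanity check on the choice of 14, the sketch below computes the smallest number of bits whose power of two covers a given port count. portBitsFor is a hypothetical helper, not a kernel interface.

```go
package main

import (
	"fmt"
	"math/bits"
)

// portBitsFor returns the smallest b such that 2^b >= ports.
func portBitsFor(ports uint) int {
	if ports <= 1 {
		return 0
	}
	return bits.Len(ports - 1)
}

func main() {
	for _, ports := range []uint{1024, 1025, 10000} {
		b := portBitsFor(ports)
		fmt.Printf("%d ports need %d bits (capacity %d)\n", ports, b, 1<<uint(b))
	}
}
```

For 10,000 ports this yields 14 bits, since 13 bits would cap the bridge at 8192 ports.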
The third major limit is at around 2500 or 3333 containers, depending on what you have in the container. These numbers may look odd at first glance (they are not powers of 2, like all the other limits), but in fact they are not that odd. The developers of Go imposed a limit on how many OS threads a single Go program can spawn, and it is set to 10,000. The limit exists so that a rogue Go program cannot spawn an unlimited number of threads and bring the system to a halt. So, you might ask, why couldn’t we start 10,000 containers if each container is a process? I like the curiosity! Each container typically invokes multiple goroutines, and each time a goroutine is invoked, a new thread is created for it. In our experiments, a very simple bash shell container invoked three goroutines, so we could start 3333 bash containers. However, a container with an application inside it invoked four goroutines, so we could only start 2500 containers in that case.
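The 3333 and 2500 figures are just integer division of the thread limit by the number of threads each container consumes. A minimal sketch, with function names invented for illustration:

```go
package main

import "fmt"

// maxContainers returns how many containers fit under a thread limit when
// each container consumes threadsPerContainer threads in the daemon.
func maxContainers(threadLimit, threadsPerContainer int) int {
	return threadLimit / threadsPerContainer
}

// threadsNeeded is the inverse: the thread limit required for a target
// number of containers.
func threadsNeeded(containers, threadsPerContainer int) int {
	return containers * threadsPerContainer
}

func main() {
	fmt.Println(maxContainers(10000, 3)) // bash containers
	fmt.Println(maxContainers(10000, 4)) // application containers
	fmt.Println(threadsNeeded(10000, 4)) // limit needed for 10k app containers
}
```

Going the other way, 10,000 application containers at four threads each require a limit of 40,000 threads.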
So, how do we get around this limit? There are a couple of ways. One is to change the default number in the Go runtime (GCC-GO or GC) and recompile the runtime. In proc.go, you can change the line sched.maxmcount = 10000 to a number that is appropriate.
The other way is to modify your GO program (in this case the Docker daemon code) by making the API call SetMaxThreads to set the required number of threads. For example, you can set the number of threads to 40,000, as follows:
import "runtime/debug"
debug.SetMaxThreads(40000)
Obviously, this requires recompiling your Go program, in this case the Docker daemon. Once you make this change, recompile your Docker daemon, and start the containers, you can get up to 4K containers before you hit the next limit.
The final major limit is the number of pseudoterminal interfaces (aka pty). This is typically set to 4K, and one pty interface is used per container. You can increase this number as follows:
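For reference, a complete program that makes the SetMaxThreads call might look like the sketch below. The wrapper raiseThreadLimit is a hypothetical name; where exactly such a call would go in the Docker daemon source is not covered here.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// raiseThreadLimit sets the maximum number of OS threads the Go program may
// use and returns the previous setting.
func raiseThreadLimit(n int) int {
	return debug.SetMaxThreads(n)
}

func main() {
	prev := raiseThreadLimit(40000)
	fmt.Println("previous thread limit:", prev)
}
```

The call should run early in the program, before the goroutines that need the extra threads are started.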
echo 11000 > /proc/sys/kernel/pty/max
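Before raising the value, it can help to see how many ptys are in use. The Linux-only sketch below reads the current limit and the in-use count from /proc/sys/kernel/pty; the helper names are invented for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCount converts the contents of a single-number /proc file to an int.
func parseCount(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// readCount reads one integer from the given file.
func readCount(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseCount(string(data))
}

func main() {
	ptyMax, err := readCount("/proc/sys/kernel/pty/max")
	if err != nil {
		fmt.Println("cannot read pty max:", err)
		return
	}
	inUse, err := readCount("/proc/sys/kernel/pty/nr")
	if err != nil {
		fmt.Println("cannot read pty nr:", err)
		return
	}
	fmt.Printf("pty in use: %d of %d; room for 10000 more: %t\n",
		inUse, ptyMax, ptyMax-inUse >= 10000)
}
```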
Surprisingly, with these four major changes, we can now create more than 10,000 containers (we created 10,023 containers in total). As we briefly touched on before, we also had to solve a number of memory-related problems along the way to minimize memory usage and container launch time.
Q: Won’t the time for creating processes grow too fast as the number of processes increases?
Docker container creation is a non-trivial process that makes several calls into the OS to set up facilities such as a new namespace, cgroup structures, file system mount points, networking, and iptables rules. This results in several interactions between the kernel space of the OS and the user space of the Docker daemon. As you create more containers, more of these facilities exist, and creating new ones takes longer because the kernel has to walk through the existing ones first. As a result, container creation time grows linearly with the number of existing containers in the system. In our experiments, the first container took about 0.5 seconds to create, whereas the thousandth took about 8 seconds. Linear extrapolation showed us that at this rate it would take over a minute to create the 10,000th container. We found various per-CPU loops that cause this excess time, so we created patches to optimize those loops and upstreamed them into the kernel! If you don’t have the latest upstream kernel, you can get them from here:
As a result, with these patches together, we are able to cut the start-up time for the 10,000th container to less than 22 seconds (a 3x reduction).
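The extrapolation can be reproduced from the two measurements quoted above (about 0.5 seconds for the first container and about 8 seconds for the thousandth). This is only a sketch of a straight-line fit through those two points, not the analysis we actually ran.

```go
package main

import "fmt"

// extrapolate fits a line through (x1, y1) and (x2, y2) and evaluates it at x.
func extrapolate(x1, y1, x2, y2, x float64) float64 {
	slope := (y2 - y1) / (x2 - x1)
	return y1 + slope*(x-x1)
}

func main() {
	// Creation time in seconds for the 1st and 1000th container.
	predicted := extrapolate(1, 0.5, 1000, 8, 10000)
	fmt.Printf("predicted time for the 10,000th container: %.1f s\n", predicted)
	fmt.Printf("improvement at 22 s: %.1fx\n", predicted/22)
}
```

The fit lands at roughly 75 seconds for the 10,000th container, consistent with "over a minute" and with a reduction of about 3x at 22 seconds.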
Q: What other optimizations were necessary?
On POWER systems, we use the GCC-GO runtime to build Docker because there is no golang compiler for the platform (as of this writing). We worked with the community to optimize stack allocation on POWER using split-stack support. We used the gold linker and this split-stack support to minimize stack growth as the number of containers increases. IBM Advance Toolchain for PowerLinux 9.0-0 comes with GCC 5.2, which supports Go 1.4.2, split-stack support, and the gold linker. This was a substantial joint effort between our colleagues and the community, and it resulted in an over 5x improvement in memory consumption.
We also introduced an optimization that minimizes memory allocation for a large number of cgroups on a NUMA system such as POWER. This patch is already upstreamed and can be obtained from here.
With the optimizations described above, you can create over 10,000 Docker containers on a single system, and we have done just that on IBM Power Systems. We pushed this limit by an order of magnitude with optimizations to the entire stack (the Linux kernel, the Go runtime, and the Docker code), which allowed us to reduce memory usage, minimize container start-up time, and create a virtually limitless number of containers. These are actual application containers with networking, logging, and all the other functions necessary to use them in the real world! Now the question is: what would you innovate with this insane number of containers?
—Seetharami R. Seelam, IBM Research —Raghavendar K. Thimmappa, IBM Linux Technologies Center
This article is the culmination of many months of hard work by a world-class, worldwide “Docker on POWER” team from the IBM Systems, Cloud and Research organizations. The team includes Lynn Boger, Nishanth Aravamuda, Pradipta Kumar, Dipankar Sarma, Anton Blanchard, Bruce Anthony, Kazunori Ogata, Rina Nakazawa, Yohei Ueda, Tamiya Onodera, Steve VanderWiel, Mel Bakshi, Paul Mazzurana and finally Doug Davis and Srini Brahmaroutu. Our thanks to our open source partners at Docker, the Linux kernel and GCC. Our special thanks to Ian Lance Taylor of Google for helping us with the split-stack implementation.