Given a process that creates a large Linux kernel page cache via mmap'd files, running it in a Docker container (cgroup) with a memory limit causes kernel slab allocation errors:
Jul 18 21:29:01 ip-10-10-17-135 kernel: [186998.252395] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
Jul 18 21:29:01 ip-10-10-17-135 kernel: [186998.252402] cache: kmalloc-2048(2412:6c2c4ef2026a77599d279450517cb061545fa963ff9faab731daab2a1f672915), object size: 2048, buffer size: 2048, default order: 3, min order: 0
Jul 18 21:29:01 ip-10-10-17-135 kernel: [186998.252407] node 0: slabs: 135, objs: 1950, free: 64
Jul 18 21:29:01 ip-10-10-17-135 kernel: [186998.252409] node 1: slabs: 130, objs: 1716, free: 0
Watching slabtop, I can see that the number of buffer_head, radix_tree_node and kmalloc-* objects is heavily restricted in a container started with a memory limit. This appears to have pathological consequences for IO throughput in the application, observable with iostat. This does not happen when running outside a container, or in a container with no memory limit, even when the page cache consumes all available memory on the host OS.
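For reference, the observation setup looks roughly like this (the image name, mount path and memory limit are placeholders, not the exact values from my environment):
docker run --rm --memory=8g -v /mnt/data:/data my-pagecache-image
# on the host, in separate shells, while the workload runs:
watch -d slabtop -o    # buffer_head, radix_tree_node and kmalloc-* object counts
iostat -xm 5           # per-device IO throughput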
This appears to be an issue in the kernel memory accounting: the kernel page cache is not counted against the container's memory, but the SLAB objects that support it are. The behavior appears to be aberrant because, when a large slab object pool is preallocated, the memory-constrained container works fine, freely reusing the existing slab space; only slab allocated inside the container counts against the container. No combination of container options for memory and kernel-memory seems to fix the issue (except not setting a memory limit at all, or a limit so large that it does not restrict the slab, but this restricts the addressable space). I have also tried, without success, to disable kmem accounting entirely by passing cgroup.memory=nokmem at boot.
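To be concrete, these are the kinds of combinations I tried (the sizes are illustrative), none of which avoided the slab allocation failures:
docker run --memory=8g ...                       # memory limit only
docker run --memory=8g --kernel-memory=2g ...    # separate kernel-memory limit
docker run --memory=8g --kernel-memory=8g ...    # kernel-memory limit equal to the memory limit
# kernel boot parameter, added to GRUB_CMDLINE_LINUX in /etc/default/grub:
# cgroup.memory=nokmem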
System Info:
- Linux ip-10-10-17-135 4.4.0-1087-aws #98-Ubuntu SMP
- AMI ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20190204.3
- Docker version 18.09.3, build 774a1f4
- java 10.0.1 2018-04-17
To reproduce the issue you can use my PageCache java code. This is a bare-bones repro case of an embedded database library that heavily leverages memory-mapped files and is intended to be deployed on a very fast file system. The application is deployed on AWS i3.baremetal instances via ECS. I am mapping a large volume from the host into the Docker container, and the memory-mapped files are stored there. The AWS ECS agent requires setting a non-zero memory limit for all containers. The memory limit causes the pathological slab behavior, and the resulting application IO throughput is totally unacceptable.
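Roughly how the repro is run (the image tag, paths, class name and memory limit below are placeholders/assumptions; in production the container is started by the ECS agent rather than with docker run directly):
docker run --rm --memory=8g -v /mnt/nvme/data:/data openjdk:10 \
  java -cp /data PageCache /data/test.dat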
It is helpful to drop_caches between runs using echo 3 > /proc/sys/vm/drop_caches. This will clear the page cache and the associated pool of slab objects.
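Concretely, between measurements I run the following as root (the sync first flushes dirty pages so more of the cache is actually droppable):
sync
echo 3 > /proc/sys/vm/drop_caches   # drops the page cache plus dentries/inodes and their slab objects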
Suggestions on how to fix, work around, or even where to report this issue would be welcome.
UPDATE It appears that updating to Ubuntu 18.04 with the 4.15 kernel does fix the observed kmalloc allocation error; the version of Java seems to be irrelevant. This appears to be because each v1 cgroup can only allocate page cache up to its memory limit (with multiple cgroups it is more complicated, with only one cgroup being "charged" for an allocation via the Shared Page Accounting scheme). I believe this is now consistent with the intended behavior. In the 4.4 kernel we found that the observed kmalloc errors were an intersection of using software RAID0 in a v1 cgroup with a memory limit and a very large page cache. I believe cgroups in the 4.4 kernel were able to map an unlimited number of pages (a bug which we found useful) up to the point at which the kernel ran out of memory for the associated slab objects, but I still don't have a smoking gun for the cause.
With the 4.15 kernel, our Docker containers are still required to set a memory limit (via AWS ECS), so we have implemented a task that unsets the memory limit as soon as the container is created, by writing to /sys/fs/cgroup/memory/docker/{container_id}/memory.limit_in_bytes. This appears to work, though it is certainly not good practice. It allows the behavior we want: unlimited sharing of page cache resources on the host. Since we are running a JVM application with a fixed heap, the downside risk is limited.
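The workaround amounts to something like this (the container name is a placeholder; writing -1 means "no limit" for the v1 memory controller):
CID=$(docker inspect --format '{{.Id}}' my-container)
echo -1 > /sys/fs/cgroup/memory/docker/${CID}/memory.limit_in_bytes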
For our use case, it would be fantastic to have the option to discount the page cache (mmap'd disk space) and the associated slab objects entirely for a cgroup, while maintaining the limit on heap and stack for the Docker process. The present Shared Page Accounting scheme is rather hard to reason about, and we would prefer to allow the LRU page cache (and associated SLAB resources) to use the full extent of the host's memory, as is the case when no memory limit is set at all.
I have started following some conversations on LWN but I am a bit in the dark. Maybe this is a terrible idea? I don't know... advice on how to proceed or where to go next is welcome.