JVM in a container calculates the processor count incorrectly?

I recently did some research again, and stumbled upon this. Before crying about it to the OpenJDK team, I wanted to see if anyone else has observed this, or disagrees with my conclusions.

So, it's widely known that for a long time the JVM ignored memory limits applied to its cgroup. It's almost as widely known that it now takes them into account, starting with one of the later Java 8 updates and with JDK 9 and higher. Unfortunately, the defaults derived from the cgroup limits are so unhelpful that you still end up configuring everything by hand. See Google and the hundreds of articles on this.
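
To illustrate the kind of manual tuning meant here (my own sketch, not part of the original question; the flag values and class name are arbitrary examples), one can set the heap explicitly or as a fraction of the container memory limit and then check what the JVM actually settled on:

// Launch inside the container, e.g.:
//   java -XX:MaxRAMPercentage=75.0 PrintHeapLimit.java   (single-file launch needs JDK 11+)
// or bypass the heuristics entirely with plain -Xmx/-Xms.
public class PrintHeapLimit {
    public static void main(String[] args) {
        // Approximate maximum heap the JVM derived (or was told to use).
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxHeapMiB + " MiB");
    }
}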

What I only discovered a few days ago, and did not read in any of those articles, is how the JVM determines the processor count inside a cgroup. The processor count is used to decide on the number of threads for various tasks, including garbage collection. So getting it right is important.
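
A quick way to see what the JVM derived, and two of the defaults that are sized from it (my own sketch; the class name is arbitrary):

import java.util.concurrent.ForkJoinPool;

public class PrintCpuView {
    public static void main(String[] args) {
        // The number everything else is derived from.
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        // The common ForkJoinPool (used by parallel streams) is sized from it too.
        System.out.println("commonPool parallelism = " + ForkJoinPool.commonPool().getParallelism());
        // GC thread counts can be inspected with:
        //   java -XX:+PrintFlagsFinal -version | grep ParallelGCThreads
    }
}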

In a cgroup (as far as I understand, and I'm no expert) you can set a limit on the available CPU time (the --cpus Docker parameter). This limits time only, not parallelism. There are also cpu shares (the --cpu-shares Docker parameter), which are a relative weight used to distribute CPU time under load. Docker sets a default of 1024, but it's a purely relative scale.

Finally, there are cpu sets (--cpuset-cpus for Docker) to explicitly pin the cgroup, and thus the Docker container, to a subset of processors. This is independent of the other parameters and actually limits parallelism.
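
These knobs end up as plain files in the cgroup filesystem, so they are easy to inspect from inside the container. A minimal sketch of my own, assuming cgroup v1 mounted at the usual /sys/fs/cgroup (paths differ under cgroup v2):

import java.nio.file.Files;
import java.nio.file.Path;

public class PrintCgroupCpu {
    private static String read(String path) {
        try {
            return Files.readString(Path.of(path)).trim();
        } catch (Exception e) {
            return "<unavailable>";
        }
    }

    public static void main(String[] args) {
        // --cpus translates into a quota per accounting period (limits time, not parallelism)
        System.out.println("cpu.cfs_quota_us  = " + read("/sys/fs/cgroup/cpu/cpu.cfs_quota_us"));
        System.out.println("cpu.cfs_period_us = " + read("/sys/fs/cgroup/cpu/cpu.cfs_period_us"));
        // --cpu-shares is a purely relative weight (Docker default 1024)
        System.out.println("cpu.shares        = " + read("/sys/fs/cgroup/cpu/cpu.shares"));
        // --cpuset-cpus pins the container to specific cores (the real parallelism limit)
        System.out.println("cpuset.cpus       = " + read("/sys/fs/cgroup/cpuset/cpuset.cpus"));
    }
}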

So, when it comes to deciding how many threads my container can actually run in parallel, as far as I can tell only the cpu set is relevant. The JVM, however, ignores that and instead uses the cpu limit if set, otherwise the cpu shares (treating the 1024 default as an absolute scale). In other words, it sizes thread pools from available CPU time, which is IMHO already very wrong.

It gets worse in Kubernetes. It's AFAIK best practice to set no cpu limit, so that the cluster nodes achieve high utilization. Also, for most apps you should set a low cpu request, since they will be idle most of the time and you want to schedule many apps on one node. Kubernetes translates the request, given in milli-CPUs, into cpu shares, and for a request below 1000m that value is below 1024. The JVM then always assumes one processor, even if your node is running on some 64-core CPU monster.
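
The arithmetic behind that claim, as a small sketch of my own (assuming the Kubernetes conversion quoted further down, cores × 1024 with a floor of 2, and the JVM's shares/1024 heuristic):

public class SharesMath {
    public static void main(String[] args) {
        double requestCores = 0.25;                             // resources.requests.cpu: 250m
        int shares = Math.max((int) (requestCores * 1024), 2);  // what Kubernetes passes as --cpu-shares -> 256
        int jvmCpus = Math.max(shares / 1024, 1);               // the shares/1024 heuristic -> 1, on any node size
        System.out.println("cpu.shares = " + shares + ", JVM assumes " + jvmCpus + " CPU(s)");
    }
}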

Has anyone ever observed this as well? Am I missing something here? Or did the JVM devs actually make things worse when implementing cgroup limits for the cpu?

For reference:

Tiberius answered 9/1, 2020 at 10:12 Comment(1)
As of Java 19, a lot of the behaviours you mention are being removed or deprecated, see "Do not use CPU Shares to compute active processor count", bugs.openjdk.java.net/browse/JDK-8281571 (some other related details in there as well). – Externality

As a developer of a large-scale service (>15K containers running distributed Java applications in our own cloud), I also admit that the so-called "Java container support" is far from perfect. At the same time, I can understand the reasoning of the JVM developers who implemented the current resource detection algorithm.

The problem is that there are so many different cloud environments and use cases for running containerized applications that it's virtually impossible to address the whole variety of configurations. What you claim to be the "best practice" for most apps in Kubernetes is not necessarily typical for other deployments. E.g. it's definitely not the usual case for our service, where most containers require a certain guaranteed minimum amount of CPU and therefore also have a quota they cannot exceed, in order to guarantee CPU for other containers. This policy works well for low-latency tasks. OTOH, the policy you've described is better suited to high-throughput or batch tasks.

The goal of the current implementation in the HotSpot JVM is to support popular cloud environments out of the box, and to provide a mechanism for overriding the defaults.

There is an email thread where Bob Vandette explains the current choice. There is also a comment in the source code describing why the JVM looks at cpu.shares and divides it by 1024:

/*
 * PER_CPU_SHARES has been set to 1024 because CPU shares' quota
 * is commonly used in cloud frameworks like Kubernetes[1],
 * AWS[2] and Mesos[3] in a similar way. They spawn containers with
 * --cpu-shares option values scaled by PER_CPU_SHARES. Thus, we do
 * the inverse for determining the number of possible available
 * CPUs to the JVM inside a container. See JDK-8216366.
 *
 * [1] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu
 *     In particular:
 *        When using Docker:
 *          The spec.containers[].resources.requests.cpu is converted to its core value, which is potentially
 *          fractional, and multiplied by 1024. The greater of this number or 2 is used as the value of the
 *          --cpu-shares flag in the docker run command.
 * [2] https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerDefinition.html
 * [3] https://github.com/apache/mesos/blob/3478e344fb77d931f6122980c6e94cd3913c441d/src/docker/docker.cpp#L648
 *     https://github.com/apache/mesos/blob/3478e344fb77d931f6122980c6e94cd3913c441d/src/slave/containerizer/mesos/isolators/cgroups/constants.hpp#L30
 */

As to parallelism, I also side with the HotSpot developers in that the JVM should take cpu.quota and cpu.shares into account when estimating the number of available CPUs. When a container has a certain number of vcores assigned to it (in either way), it can rely only on that amount of resources, since there is no guarantee that more will ever be available to the process. Consider a container with 4 vcores running on a 64-core machine. Any CPU-intensive task (GC is one example) running in 64 parallel threads will quickly exhaust the quota, and the OS will throttle the container for a long period: the application will then spend roughly 94 out of every 100 ms in a stop-the-world pause, since the default quota accounting period (cpu.cfs_period_us) is 100 ms.
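
The numbers behind that example, as a rough sketch (mine, not the answerer's):

public class ThrottleMath {
    public static void main(String[] args) {
        double periodMs = 100;              // default cpu.cfs_period_us
        double quotaMs  = 4 * periodMs;     // 4 vcores -> 400 ms of CPU time per period
        int threads     = 64;               // CPU-bound threads running in parallel
        double runningMs   = quotaMs / threads;     // ~6.25 ms until the quota is exhausted
        double throttledMs = periodMs - runningMs;  // ~93.75 ms throttled in every period
        System.out.printf("runs ~%.2f ms, throttled ~%.2f ms of every %.0f ms period%n",
                runningMs, throttledMs, periodMs);
    }
}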

Anyway, if the algorithm does not work well in your particular case, it's always possible to override the number of available processors with the -XX:ActiveProcessorCount option, or to disable container awareness entirely with -XX:-UseContainerSupport.
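
For example, the effect of the override is directly visible (my own illustration; the class name is arbitrary):

// Launch e.g. as: java -XX:ActiveProcessorCount=4 CheckOverride.java
// or:             java -XX:-UseContainerSupport CheckOverride.java
public class CheckOverride {
    public static void main(String[] args) {
        // With -XX:ActiveProcessorCount=4 this prints 4 regardless of the cgroup settings;
        // with -XX:-UseContainerSupport the cgroup limits are ignored.
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}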

Cephalochordate answered 13/1, 2020 at 1:55 Comment(7)
"most containers require the certain minimum guaranteed amount of CPU resources, and thus also have a quota they cannot exceed, in order to guarantee CPU for other containers." As far as I understand scheduling, this is not necessary. If two containers need CPU time, it will be distributed between them according to the cpu shares. No one container can block the cpu completely.Tiberius
"Consider a container with 4 vcores running on a 64-core machine." I do. And the JVM should set available processors to 4 in that case. Not 64, as you rightly explained. But neither should it assume 1, which it currently does in many scenarios.Tiberius
I did not consider threads eating up the quota. That is a good point. But I feel like that is solving the problem from the wrong angle. If you have background tasks that kick in that often and block the actual tasks, you should consider fixing the app, not the platform. After all, having a quota on CPU time isn't conceptually too different from running on a slow machine. If your app cannot cope with that, either make it faster or give it a higher quota. – Tiberius
@Tiberius Well, if you accept the "agreement" that cpu-shares = vcores * 1024, the JVM won't assign 1 CPU. Just don't set cpu-shares too small; after all, the numbers are relative. – Cephalochordate
"If you have background tasks that kick in that often and block the actual tasks, you should consider fixing the app, not the platform." - I would argue that. There can be different classes of tasks: low-latency, high-throughput (batch), and idle (background). Linux already has corresponding scheduling policies. The problem is that even popular cloud environments do not support them, i.e. do not make difference between task classes. So, users need to "simulate" those kinds of tasks by setting weird cpu-quota and cpu-shares.Cephalochordate
In contrast, our own cloud knows about task classes and has different cgroups for them. So, e.g., idle tasks can utilize all available processors (without the app being aware that it will ever compete for CPU resources with other containers), but this will not affect low-latency tasks. – Cephalochordate
To summarize my point, I think this is more a problem of the container management system than of Java. The "protocol" for communicating resource constraints from the cgroup to the JVM is more or less clear, though it definitely has annoying caveats, e.g. cpu-shares=1024 means "do not use cpu-shares". – Cephalochordate
