JVM in a container calculates the processor count incorrectly?

I recently did some research again, and stumbled upon this. Before crying about it to the OpenJDK team, I wanted to see if anyone else has observed this, or disagrees with my conclusions.

So, it's widely known that for a long time the JVM ignored memory limits applied to its cgroup. It's almost as widely known that it now takes them into account, starting with one of the later Java 8 updates and with JDK 9 and higher. Unfortunately, the defaults derived from the cgroup limits are so unhelpful that you still end up configuring everything by hand. See Google and the hundreds of articles on this.
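
To illustrate the kind of manual tuning meant here (my own sketch, not part of the original question; the flag values and class name are arbitrary examples), one can set the heap explicitly or as a fraction of the container memory limit and then check what the JVM actually settled on:

// Launch inside the container, e.g.:
//   java -XX:MaxRAMPercentage=75.0 PrintHeapLimit.java   (single-file launch needs JDK 11+)
// or bypass the heuristics entirely with plain -Xmx/-Xms.
public class PrintHeapLimit {
    public static void main(String[] args) {
        // Approximate maximum heap the JVM derived (or was told to use).
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxHeapMiB + " MiB");
    }
}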

What I only discovered a few days ago, and did not read in any of those articles, is how the JVM determines the processor count inside a cgroup. The processor count is used to decide on the number of threads for various tasks, including garbage collection. So getting it right is important.
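
A quick way to see what the JVM derived, and two of the defaults that are sized from it (my own sketch; the class name is arbitrary):

import java.util.concurrent.ForkJoinPool;

public class PrintCpuView {
    public static void main(String[] args) {
        // The number everything else is derived from.
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        // The common ForkJoinPool (used by parallel streams) is sized from it too.
        System.out.println("commonPool parallelism = " + ForkJoinPool.commonPool().getParallelism());
        // GC thread counts can be inspected with:
        //   java -XX:+PrintFlagsFinal -version | grep ParallelGCThreads
    }
}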

In a cgroup (as far as I understand, and I'm no expert) you can set a limit on the available CPU time (the --cpus Docker parameter). This limits time only, not parallelism. There are also cpu shares (the --cpu-shares Docker parameter), which are a relative weight used to distribute CPU time under load. Docker sets a default of 1024, but it's a purely relative scale.

Finally, there are cpu sets (--cpuset-cpus for Docker) to explicitly pin the cgroup, and thus the Docker container, to a subset of processors. This is independent of the other parameters and actually limits parallelism.
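
These knobs end up as plain files in the cgroup filesystem, so they are easy to inspect from inside the container. A minimal sketch of my own, assuming cgroup v1 mounted at the usual /sys/fs/cgroup (paths differ under cgroup v2):

import java.nio.file.Files;
import java.nio.file.Path;

public class PrintCgroupCpu {
    private static String read(String path) {
        try {
            return Files.readString(Path.of(path)).trim();
        } catch (Exception e) {
            return "<unavailable>";
        }
    }

    public static void main(String[] args) {
        // --cpus translates into a quota per accounting period (limits time, not parallelism)
        System.out.println("cpu.cfs_quota_us  = " + read("/sys/fs/cgroup/cpu/cpu.cfs_quota_us"));
        System.out.println("cpu.cfs_period_us = " + read("/sys/fs/cgroup/cpu/cpu.cfs_period_us"));
        // --cpu-shares is a purely relative weight (Docker default 1024)
        System.out.println("cpu.shares        = " + read("/sys/fs/cgroup/cpu/cpu.shares"));
        // --cpuset-cpus pins the container to specific cores (the real parallelism limit)
        System.out.println("cpuset.cpus       = " + read("/sys/fs/cgroup/cpuset/cpuset.cpus"));
    }
}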

So, when it comes to deciding how many threads my container can actually run in parallel, as far as I can tell only the cpu set is relevant. The JVM, however, ignores that and instead uses the cpu limit if set, otherwise the cpu shares (treating the 1024 default as an absolute scale). In other words, it sizes thread pools from available CPU time, which is IMHO already very wrong.

It gets worse in Kubernetes. It's AFAIK best practice to set no cpu limit, so that the cluster nodes achieve high utilization. Also, for most apps you should set a low cpu request, since they will be idle most of the time and you want to schedule many apps on one node. Kubernetes translates the request, given in milli-CPUs, into cpu shares, and for a request below 1000m that value is below 1024. The JVM then always assumes one processor, even if your node is running on some 64-core CPU monster.
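
The arithmetic behind that claim, as a small sketch of my own (assuming the Kubernetes conversion quoted further down, cores × 1024 with a floor of 2, and the JVM's shares/1024 heuristic):

public class SharesMath {
    public static void main(String[] args) {
        double requestCores = 0.25;                             // resources.requests.cpu: 250m
        int shares = Math.max((int) (requestCores * 1024), 2);  // what Kubernetes passes as --cpu-shares -> 256
        int jvmCpus = Math.max(shares / 1024, 1);               // the shares/1024 heuristic -> 1, on any node size
        System.out.println("cpu.shares = " + shares + ", JVM assumes " + jvmCpus + " CPU(s)");
    }
}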

Has anyone ever observed this as well? Am I missing something here? Or did the JVM devs actually make things worse when implementing cgroup limits for the cpu?

For reference:

Tiberius answered 9/1, 2020 at 10:12 Comment(1)
As of Java 19, a lot of the behaviours you mention are being removed or deprecated, see "Do not use CPU Shares to compute active processor count", bugs.openjdk.java.net/browse/JDK-8281571 (some other related details in there as well). – Externality

As a developer of a large-scale service (>15K containers running distributed Java applications in our own cloud), I also admit that the so-called "Java container support" is far from perfect. At the same time, I can understand the reasoning of the JVM developers who implemented the current resource detection algorithm.

The problem is that there are so many different cloud environments and use cases for running containerized applications that it's virtually impossible to address the whole variety of configurations. What you claim to be the "best practice" for most apps in Kubernetes is not necessarily typical for other deployments. E.g. it's definitely not the usual case for our service, where most containers require a certain guaranteed minimum amount of CPU and therefore also have a quota they cannot exceed, in order to guarantee CPU for other containers. This policy works well for low-latency tasks. OTOH, the policy you've described is better suited to high-throughput or batch tasks.

The goal of the current implementation in the HotSpot JVM is to support popular cloud environments out of the box, and to provide a mechanism for overriding the defaults.

There is an email thread where Bob Vandette explains the current choice. There is also a comment in the source code describing why the JVM looks at cpu.shares and divides it by 1024:

/*
 * PER_CPU_SHARES has been set to 1024 because CPU shares' quota
 * is commonly used in cloud frameworks like Kubernetes[1],
 * AWS[2] and Mesos[3] in a similar way. They spawn containers with
 * --cpu-shares option values scaled by PER_CPU_SHARES. Thus, we do
 * the inverse for determining the number of possible available
 * CPUs to the JVM inside a container. See JDK-8216366.
 *
 * [1] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu
 *     In particular:
 *        When using Docker:
 *          The spec.containers[].resources.requests.cpu is converted to its core value, which is potentially
 *          fractional, and multiplied by 1024. The greater of this number or 2 is used as the value of the
 *          --cpu-shares flag in the docker run command.
 * [2] https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerDefinition.html
 * [3] https://github.com/apache/mesos/blob/3478e344fb77d931f6122980c6e94cd3913c441d/src/docker/docker.cpp#L648
 *     https://github.com/apache/mesos/blob/3478e344fb77d931f6122980c6e94cd3913c441d/src/slave/containerizer/mesos/isolators/cgroups/constants.hpp#L30
 */

As to parallelism, I also side with the HotSpot developers in that the JVM should take cpu.quota and cpu.shares into account when estimating the number of available CPUs. When a container has a certain number of vcores assigned to it (in either way), it can rely only on that amount of resources, since there is no guarantee that more will ever be available to the process. Consider a container with 4 vcores running on a 64-core machine. Any CPU-intensive task (GC is one example) running in 64 parallel threads will quickly exhaust the quota, and the OS will throttle the container for a long period: the application will then spend roughly 94 out of every 100 ms in a stop-the-world pause, since the default quota accounting period (cpu.cfs_period_us) is 100 ms.
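
The numbers behind that example, as a rough sketch (mine, not the answerer's):

public class ThrottleMath {
    public static void main(String[] args) {
        double periodMs = 100;              // default cpu.cfs_period_us
        double quotaMs  = 4 * periodMs;     // 4 vcores -> 400 ms of CPU time per period
        int threads     = 64;               // CPU-bound threads running in parallel
        double runningMs   = quotaMs / threads;     // ~6.25 ms until the quota is exhausted
        double throttledMs = periodMs - runningMs;  // ~93.75 ms throttled in every period
        System.out.printf("runs ~%.2f ms, throttled ~%.2f ms of every %.0f ms period%n",
                runningMs, throttledMs, periodMs);
    }
}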

Anyway, if the algorithm does not work well in your particular case, it's always possible to override the number of available processors with the -XX:ActiveProcessorCount option, or to disable container awareness entirely with -XX:-UseContainerSupport.
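
For example, the effect of the override is directly visible (my own illustration; the class name is arbitrary):

// Launch e.g. as: java -XX:ActiveProcessorCount=4 CheckOverride.java
// or:             java -XX:-UseContainerSupport CheckOverride.java
public class CheckOverride {
    public static void main(String[] args) {
        // With -XX:ActiveProcessorCount=4 this prints 4 regardless of the cgroup settings;
        // with -XX:-UseContainerSupport the cgroup limits are ignored.
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}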

Cephalochordate answered 13/1, 2020 at 1:55 Comment(7)
"most containers require the certain minimum guaranteed amount of CPU resources, and thus also have a quota they cannot exceed, in order to guarantee CPU for other containers." As far as I understand scheduling, this is not necessary. If two containers need CPU time, it will be distributed between them according to the cpu shares. No one container can block the cpu completely.Tiberius
"Consider a container with 4 vcores running on a 64-core machine." I do. And the JVM should set available processors to 4 in that case. Not 64, as you rightly explained. But neither should it assume 1, which it currently does in many scenarios.Tiberius
I did not consider threads eating up the quota. That is a good point. But I feel like that is solving the problem from the wrong angle. If you have background tasks that kick in that often and block the actual tasks, you should consider fixing the app, not the platform. After all, having a quota on CPU time isn't conceptually too different from running on a slow machine. If your app cannot cope with that, either make it faster or give it a higher quota. – Tiberius
@Tiberius Well, if you accept the "agreement" that cpu-shares = vcores * 1024, the JVM won't assign 1 CPU. Just don't set cpu-shares too small; after all, the numbers are relative. – Cephalochordate
"If you have background tasks that kick in that often and block the actual tasks, you should consider fixing the app, not the platform." - I would argue that. There can be different classes of tasks: low-latency, high-throughput (batch), and idle (background). Linux already has corresponding scheduling policies. The problem is that even popular cloud environments do not support them, i.e. do not make difference between task classes. So, users need to "simulate" those kinds of tasks by setting weird cpu-quota and cpu-shares.Cephalochordate
In contrast, our own cloud knows about task classes and has different cgroups for them. So, e.g., idle tasks can utilize all available processors (without the app being aware that it will ever compete for CPU resources with other containers), but this will not affect low-latency tasks. – Cephalochordate
To summarize my point, I think this is more a problem of the container management system than of Java. The "protocol" for communicating resource constraints from the cgroup to the JVM is more or less clear, though it definitely has annoying caveats, e.g. cpu-shares=1024 means "do not use cpu-shares". – Cephalochordate
