Why does select() consume so much CPU time in my program?

I have several Java applications that use MINA, and all of them use 20 MINA threads. One application serves around 10,000 concurrent connections that are usually idle but occasionally receive input. 20 is probably a reasonable thread count for that application, although I haven't exactly profiled it (which is what this question is getting at). Another application serves only around 15 connections at a time but initiates the IO work, so they are very busy; it has 20 MINA threads anyway, which is obviously too many.

The strange thing to me is that both applications consistently devote about 30%, sometimes as high as 60%, of their CPU time to MINA's select() method, as profiled in VisualVM. The call stack looks like this:

java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <40ca5d54> (a sun.nio.ch.Util$2)
- locked <24649fe8> (a java.util.Collections$UnmodifiableSet)
- locked <3fae9662> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:72)
at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1093)
at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

It looks like a busy poll, which seems really wrong to me.

Should I be worried when I see the number that high? What causes this? Is it something I need to optimize out or is it more akin to a sleep or idle routine? If it's more like a sleep routine is it somehow scheduled to be lower priority than other CPU work?

Update: this thread seems to describe the same issue. I followed its advice and am now running Java 1.7.0_45, but I am still seeing select() take as much as 90% of CPU time in an application with 10k connections.

We are using MINA 2.0.4, which means this relevant bug is fixed.

Lottielotto answered 9/12, 2013 at 16:14 Comment(3)
What version of MINA are you using? Maybe it is a bug in NioProcessor: issues.apache.org/jira/browse/DIRMINA-678 – Digest
@Digest we're on 2.0.4; it said fixed in 2.0.3. – Lottielotto
Maybe it is related to another bug solved in 2.0.5: issues.apache.org/jira/browse/DIRMINA-681 – Digest

Unfortunately, this is a misinterpretation of the numbers.

I've faced this situation many times (and asked a question on Stack Overflow about it, too).

The main reason is that VisualVM doesn't show actual CPU time: it shows the percentage of time the thread spends in the RUNNABLE state. But from the documentation on Thread.State:

Thread state for a runnable thread. A thread in the runnable state is executing in the Java virtual machine but it may be waiting for other resources from the operating system such as processor.

This is exactly what's going on: the thread is actually blocked inside the OS epoll_wait() call. On a Linux box there are several ways to confirm that this is the case.
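
You can see this for yourself with a minimal, self-contained sketch (my own illustration, not code from MINA; the class and method names are made up): a thread blocked in Selector.select() reports RUNNABLE even though it is parked in the kernel the whole time.

```java
import java.nio.channels.Selector;

public class SelectState {
    // Start a thread that blocks in Selector.select() and report its Thread.State.
    static Thread.State stateWhileSelecting() throws Exception {
        Selector selector = Selector.open();
        Thread t = new Thread(() -> {
            try {
                selector.select(); // blocks in epoll_wait()/kevent() inside the kernel
            } catch (Exception ignored) {
            }
        });
        t.start();
        Thread.sleep(500); // give the thread time to enter select()
        Thread.State state = t.getState();
        selector.wakeup();  // unblock the selector so the thread can exit
        t.join();
        selector.close();
        return state;
    }

    public static void main(String[] args) throws Exception {
        // Prints RUNNABLE, even though the thread consumed almost no CPU.
        System.out.println(stateWhileSelecting());
    }
}
```

A profiler that buckets samples by Thread.State will attribute all of that blocked time to select(), which is exactly the inflated percentage you are seeing.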

strace'ing the thread

$ strace -tttT -f -p [thread-id]

The thread id can be obtained from jstack output:

$ jstack [java-pid]
[...]
"Netty Builtin Server 1" #17 prio=5 os_prio=31 tid=0x00000001013dd800 nid=0xe12f runnable [0x0000700001fe4000]
  java.lang.Thread.State: RUNNABLE
  at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
  at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
[...]

In this case the thread id is 0xe12f (57647 in decimal; strace and ps expect the decimal form). You will see that the thread spends most of its time in the epoll_wait() call.
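
If you'd rather not do the hex conversion by hand, here is a trivial sketch (the class and method names are mine) using Integer.decode, which understands the 0x prefix that jstack prints:

```java
public class NidToTid {
    // Convert a jstack nid (hex, e.g. "0xe12f") to the decimal thread id
    // that strace, ps and pidstat expect.
    static int toDecimal(String nid) {
        return Integer.decode(nid); // Integer.decode handles the "0x" prefix
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("0xe12f")); // prints 57647
    }
}
```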

pidstat'ing the thread

$ pidstat -tu -p [java-pid] | grep [thread pid]

You will see low system as well as user CPU time for this thread, meaning it doesn't actually consume CPU.

polling the thread state with ps

$ ps -eL -o pid,tid,state | grep [thread-id]

You will see that most of the time the thread is in state S or Sl (interruptible sleep) rather than R (running).

In the end, you should not worry about this as long as there are no operational issues with the service.

Helgoland answered 25/5, 2016 at 10:31 Comment(6)
Sounds good, but the next question would be how to CPU profile. (FYI I don't work on this code anymore, so I can't follow up on whether something worked.) – Lottielotto
VisualVM will do, as will Java Flight Recorder, but only for CPU-bound threads, which can be confirmed with pidstat. Unfortunately, there is no universal method. One should use system-wide utilities (e.g. pidstat, vmstat) as well as application-level ones (JFR, VisualVM, jstack). – Helgoland
@DenisBazhenov a RUNNABLE Java thread is still consuming CPU (user space) even when the related native thread is in S state and not using CPU, because it is executing in the VM (maybe spinning waiting for native IO to return). Am I right? – Huba
@Huba Not exactly; the JVM has a 1:1 mapping between JVM threads and OS threads, so if the OS thread is sleeping (S state) the JVM thread is also sleeping and can't consume any CPU. It's the same thread, after all. – Helgoland
See Brendan Gregg's in-depth post: brendangregg.com/blog/2014-06-09/… – Colorcast
BTW there is a nice sampling profiler based on Brendan Gregg's perf work that tries to dig down to OS native calls: github.com/jvm-profiling-tools/async-profiler – Swett

Firstly, it's good that both applications have the same issue; it probably indicates that the issue is with either the JVM or the OS and not your application :-)

As jzd mentioned, nio.select() has had issues. The cross product of {various versions of Java} × {various platforms, kernel versions} makes it look like an all-over-the-place issue. I hope one of these works for you:

Try

  • {older version of JRE (6), older version of Linux kernel} or
  • {newer version of JRE (7), newer version of Linux kernel}

instead of mixing them, as in {older, newer} or {newer, older}.

Shirty answered 25/12, 2013 at 11:19 Comment(1)
On kernel version 3.2. – Lottielotto

One app is polling 10,000 connections, using very little CPU per connection, but all together it can add up to a good fraction of CPU time. All that priority does is let some other job get in line first.

The other app, which has fewer connections but does more crunching per connection, could also show a high percentage, but it should show a lower fraction of time waiting and a higher fraction actually computing.

Better answered 9/12, 2013 at 18:43 Comment(0)

As the answer in the linked question pointed out, the common problem is older JDK bugs. Since you are now on an updated version, I think this might actually be a hardware bottleneck.

Here is a link to a Glassfish issue in which they discuss the possibility of the hardware (network and servers) being the source of the problem:

https://www.java.net//forum/topic/glassfish/glassfish/glassfish-31-deadlock-epollarraywrapperepollwait-how-handle

Also, here is another similar question with no answer yet: SelectorImpl is BLOCKED

Relegate answered 20/12, 2013 at 20:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.