Performance tuning for Netty 4.1 on a Linux machine

I am building a messaging application, using Netty 4.1 Beta3 for my server; the server speaks the MQTT protocol.

This is my MqttServer.java class, which sets up the Netty server and binds it to a specific port:

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.nio.NioServerSocketChannel;

    EventLoopGroup bossPool = new NioEventLoopGroup();
    EventLoopGroup workerPool = new NioEventLoopGroup();

    try {
        ServerBootstrap boot = new ServerBootstrap();

        boot.group(bossPool, workerPool);
        boot.channel(NioServerSocketChannel.class);
        boot.childHandler(new MqttProxyChannel());

        // Bind to the port and block until the server channel closes.
        boot.bind(port).sync().channel().closeFuture().sync();

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        workerPool.shutdownGracefully();
        bossPool.shutdownGracefully();
    }

I then load tested the application on my Mac, which has the following configuration:

[screenshot: Mac machine configuration]

The Netty performance was exceptional. I took a jstack while the code was running and found that Netty's NIO transport spawns about 19 threads, and none of them appeared to be stuck waiting on channels or anything else.

Then I ran the same code on a Linux machine:

[screenshot: Linux machine configuration]

This is a machine with 2 cores and 15 GB of RAM. The problem is that packets sent by my MQTT client seem to take a long time to pass through the Netty pipeline, and on taking a jstack I found that there were 5 Netty threads, all stuck like this:

    ."nioEventLoopGroup-3-4" #112 prio=10 os_prio=0 tid=0x00007fb774008800 nid=0x2a0e runnable [0x00007fb768fec000]
        java.lang.Thread.State: RUNNABLE
             at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
             at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
             at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
             at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
             - locked <0x00000006d0fdc898> (a 
io.netty.channel.nio.SelectedSelectionKeySet)
             - locked <0x00000006d100ae90> (a java.util.Collections$UnmodifiableSet)
             - locked <0x00000006d0fdc7f0> (a sun.nio.ch.EPollSelectorImpl)
             at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
             at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:621)
             at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:309)
             at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:834)
             at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
             at java.lang.Thread.run(Thread.java:745)

Is this a performance issue related to epoll on Linux? If so, what changes should I make to the Netty configuration to handle it or to improve performance?

Edit

Java version on the local system:

    java version "1.8.0_40"
    Java(TM) SE Runtime Environment (build 1.8.0_40-b27)
    Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)

Java version on AWS:

    openjdk version "1.8.0_40-internal"
    OpenJDK Runtime Environment (build 1.8.0_40-internal-b09)
    OpenJDK 64-Bit Server VM (build 25.40-b13, mixed mode)

Hamlen answered 21/5, 2015 at 7:34 Comment(4)
Are you sure you have the same Java versions on both machines? Same JVM? – Selfcontrol
Try the newest version 4.1.0.Beta5; I read about some fixes for epoll. – Genous
@ArnaudPotier The JVM versions are different. – Hamlen
Could you run "java -version" on both machines and update your question please? – Selfcontrol

Here are my findings from implementing a very simple HTTP → Kafka forklift:

  1. Consider switching to EpollEventLoopGroup. Simply replacing NioEventLoopGroup with EpollEventLoopGroup gave me a 30% performance boost (see the sketch after this list).
  2. Removing LoggingHandler from the pipeline (if you have one) can give you a drop in CPU usage (in my case the drop was almost unbelievable: 80%).
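
For reference, here is a minimal sketch of that swap (my own illustration, not the answerer's code), assuming the netty-transport-native-epoll dependency is on the classpath; the Epoll.isAvailable() guard lets you fall back to NIO on platforms without native epoll:

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.ServerChannel;
    import io.netty.channel.epoll.Epoll;
    import io.netty.channel.epoll.EpollEventLoopGroup;
    import io.netty.channel.epoll.EpollServerSocketChannel;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.nio.NioServerSocketChannel;

    // Use the native epoll transport where available, otherwise fall back to NIO.
    boolean epoll = Epoll.isAvailable();
    EventLoopGroup bossPool = epoll ? new EpollEventLoopGroup() : new NioEventLoopGroup();
    EventLoopGroup workerPool = epoll ? new EpollEventLoopGroup() : new NioEventLoopGroup();
    Class<? extends ServerChannel> channelClass =
            epoll ? EpollServerSocketChannel.class : NioServerSocketChannel.class;

    ServerBootstrap boot = new ServerBootstrap();
    boot.group(bossPool, workerPool);
    // The channel class must come from the same transport as the event loop groups.
    boot.channel(channelClass);

Note that the channel class and the event loop groups must belong to the same transport; mixing them (for example, NioServerSocketChannel with an EpollEventLoopGroup) will fail when the channel is registered.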
Lap answered 2/8, 2018 at 12:45 Comment(0)

Play around with the worker thread count to see if that improves performance. The no-argument constructor of NioEventLoopGroup() creates the default number of event loop threads:

    DEFAULT_EVENT_LOOP_THREADS = Math.max(1, SystemPropertyUtil.getInt(
            "io.netty.eventLoopThreads", Runtime.getRuntime().availableProcessors() * 2));

As you can see, you can pass io.netty.eventLoopThreads as a launch argument, but I usually don't do that (a hypothetical example is sketched below).
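
For illustration only, a launch command that sets this system property; the jar name here is made up:

    java -Dio.netty.eventLoopThreads=8 -jar my-netty-server.jar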

You can also pass the number of threads directly in the constructor of NioEventLoopGroup().

In our environment we have Netty servers that accept connections from hundreds of clients. Usually one boss thread to accept the connections is enough, but the number of worker threads needs to be scaled. We use this:

    private static final int BOSS_THREADS = 1;
    private static final int MAX_WORKER_THREADS = 12;

    EventLoopGroup bossGroup = new NioEventLoopGroup(BOSS_THREADS);
    EventLoopGroup workerGroup = new NioEventLoopGroup(calculateThreadCount());

    private int calculateThreadCount() {
        int threadCount;
        // Honor an explicit -Dio.netty.eventLoopThreads=<n> launch argument if given.
        if ((threadCount = SystemPropertyUtil.getInt("io.netty.eventLoopThreads", 0)) > 0) {
            return threadCount;
        } else {
            // Otherwise default to cores * 2, capped at MAX_WORKER_THREADS.
            threadCount = Runtime.getRuntime().availableProcessors() * 2;
            return threadCount > MAX_WORKER_THREADS ? MAX_WORKER_THREADS : threadCount;
        }
    }

So in our case we use just one boss thread. The worker thread count depends on whether the launch argument has been given; if not, we use cores * 2, but never more than 12.

You will have to test for yourself which numbers work best in your environment.

Genous answered 21/5, 2015 at 10:32 Comment(3)
I have already tried this, but to no avail. We used about 10k worker threads ;) and also specified a CachedPoolExecutor, but this did not reduce the latency in any way. The problem persists. Thanks though :) – Hamlen
10k threads on a dual core might be counterproductive and could even cause slowness. https://mcmap.net/q/86257/-how-many-threads-is-too-many – Genous
I tried 12 threads as well; it still did not give me the required performance :( – Hamlen
