Application description:
- I'm using Apache HTTP Async Client (version 4.1.1), wrapped by Comsat's Quasar FiberHttpClient (version 0.7.0), to run a highly concurrent Java application that uses fibers to internally send HTTP requests to multiple HTTP end-points
- The application runs on top of Tomcat (however, fibers are used only for internal request dispatching; Tomcat servlet requests are still handled the standard blocking way)
- Each external request opens 15-20 fibers internally; each fiber builds an HTTP request and uses the FiberHttpClient to dispatch it (a rough sketch of this dispatch pattern follows the client setup code below)
- I'm using a c4.4xlarge server (16 cores) to test my application
- The end-points I'm connecting to preempt keep-alive connections, meaning if I try to reuse sockets, connections get closed during request execution attempts. Therefore, I disable connection recycling.
According to the above, here's the tuning for my fiber HTTP client (of which, of course, I'm using a single instance):
PoolingNHttpClientConnectionManager connectionManager =
        new PoolingNHttpClientConnectionManager(
                new DefaultConnectingIOReactor(
                        IOReactorConfig.custom()
                                .setIoThreadCount(16)
                                .setSoKeepAlive(false)
                                .setSoLinger(0)
                                .setSoReuseAddress(false)
                                .setSelectInterval(10)
                                .build()));

connectionManager.setDefaultMaxPerRoute(32768);
connectionManager.setMaxTotal(131072);

CloseableHttpClient fiberClient = FiberHttpClientBuilder
        .create()
        .setDefaultRequestConfig(
                RequestConfig.custom()
                        .setSocketTimeout(1500)
                        .setConnectTimeout(1000)
                        .build())
        .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
        .setConnectionManager(connectionManager)
        .build();
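For context, the internal fan-out from fibers looks roughly like the following. This is only a minimal sketch (the class, the endpoint list, and the error handling are illustrative, not my actual code), assuming the single fiberClient instance built above; FiberHttpClient exposes the blocking-style HttpClient API, but execute() parks the calling fiber instead of blocking a kernel thread.

import co.paralleluniverse.fibers.Fiber;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.List;

public class InternalDispatcher {

    private final CloseableHttpClient fiberClient; // the single FiberHttpClient instance built above

    public InternalDispatcher(CloseableHttpClient fiberClient) {
        this.fiberClient = fiberClient;
    }

    // Called once per external (servlet) request: spawns one fiber per internal end-point call.
    public void dispatch(List<String> endpointUrls) {
        for (String url : endpointUrls) {
            new Fiber<Void>(() -> {
                HttpGet request = new HttpGet(url);
                // execute() is fiber-blocking: it parks the fiber until the async client completes
                try (CloseableHttpResponse response = fiberClient.execute(request)) {
                    EntityUtils.consumeQuietly(response.getEntity());
                } catch (IOException e) {
                    // per-endpoint error handling/logging would go here
                }
            }).start();
        }
    }
}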
- ulimits for open files are set very high (131072 for both soft and hard limits)
- Eden is set to 18GB, total heap size is 24GB
- OS TCP stack is also well tuned:
kernel.printk = 8 4 1 7
kernel.printk_ratelimit_burst = 10
kernel.printk_ratelimit = 5
net.ipv4.ip_local_port_range = 8192 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 100000
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 1
Problem description
- Under low-medium load all is well: connections are leased, closed, and the pool replenishes
- Beyond some concurrency point, the IOReactor threads (16 of them) seem to stop functioning properly, prior to dying
- I've written a small thread that gets the pool stats and prints them each second (a rough sketch of this stats thread follows this list). At around 25K leased connections, actual data is no longer sent over the socket connections, and the Pending stat climbs to a sky-rocketing 30K pending connection requests as well
- This situation persists and basically renders the application useless. At some point the I/O reactor threads die; I'm not sure when, and I haven't been able to catch the exceptions so far
- lsof-ing the java process, I can see it has tens of thousands of file descriptors, almost all of them in CLOSE_WAIT (which makes sense, as the I/O reactor threads die/stop functioning and never get to actually close them)
- During the time the application breaks, the server is not heavily overloaded or CPU-stressed
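For reference, the stats-printing thread is essentially equivalent to the minimal sketch below, built on the connection manager's getTotalStats() (the class name and the use of a ScheduledExecutorService are illustrative details):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
import org.apache.http.pool.PoolStats;

public class PoolStatsLogger {

    // Prints leased/pending/available/max once per second for the shared connection manager.
    public static void start(PoolingNHttpClientConnectionManager connectionManager) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            PoolStats stats = connectionManager.getTotalStats();
            // leased = connections currently handed out, pending = lease requests waiting for a connection
            System.out.println("leased=" + stats.getLeased()
                    + " pending=" + stats.getPending()
                    + " available=" + stats.getAvailable()
                    + " max=" + stats.getMax());
        }, 0, 1, TimeUnit.SECONDS);
    }
}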
Questions
- I'm guessing I'm reaching some sort of boundary somewhere, though I'm rather clueless as to what or where it may reside, except for the following:
- Is it possible I'm reaching an OS port limit (all applicative requests originate from a single internal IP, after all), which creates an error that causes the I/O reactor threads to die (something similar to open-file limit errors)?