Application description:
- I'm using Apache HTTP Async Client (version 4.1.1), wrapped by Comsat's Quasar FiberHttpClient (version 0.7.0), to run a highly concurrent Java application that uses fibers to internally send HTTP requests to multiple HTTP end-points
- The application runs on top of Tomcat (however, fibers are used only for internal request dispatching; Tomcat servlet requests are still handled the standard blocking way)
- Each external request opens 15-20 fibers internally; each fiber builds an HTTP request and uses the FiberHttpClient to dispatch it (a rough sketch of this dispatch pattern follows the client setup code below)
- I'm using a c4.4xlarge server (16 cores) to test my application
- The end-points I'm connecting to preempt keep-alive connections, meaning if I try to reuse sockets, connections get closed during request execution attempts. Therefore, I disable connection recycling.
According to the above, here's the tuning for my fiber HTTP client (of which, of course, I'm using a single instance):
PoolingNHttpClientConnectionManager connectionManager =
        new PoolingNHttpClientConnectionManager(
                new DefaultConnectingIOReactor(
                        IOReactorConfig.custom()
                                .setIoThreadCount(16)
                                .setSoKeepAlive(false)
                                .setSoLinger(0)
                                .setSoReuseAddress(false)
                                .setSelectInterval(10)
                                .build()));

connectionManager.setDefaultMaxPerRoute(32768);
connectionManager.setMaxTotal(131072);

CloseableHttpClient fiberClient = FiberHttpClientBuilder
        .create()
        .setDefaultRequestConfig(
                RequestConfig.custom()
                        .setSocketTimeout(1500)
                        .setConnectTimeout(1000)
                        .build())
        .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
        .setConnectionManager(connectionManager)
        .build();
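For context, the internal fan-out from fibers looks roughly like the following. This is only a minimal sketch (the class, the endpoint list, and the error handling are illustrative, not my actual code), assuming the single fiberClient instance built above; FiberHttpClient exposes the blocking-style HttpClient API, but execute() parks the calling fiber instead of blocking a kernel thread.

import co.paralleluniverse.fibers.Fiber;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.List;

public class InternalDispatcher {

    private final CloseableHttpClient fiberClient; // the single FiberHttpClient instance built above

    public InternalDispatcher(CloseableHttpClient fiberClient) {
        this.fiberClient = fiberClient;
    }

    // Called once per external (servlet) request: spawns one fiber per internal end-point call.
    public void dispatch(List<String> endpointUrls) {
        for (String url : endpointUrls) {
            new Fiber<Void>(() -> {
                HttpGet request = new HttpGet(url);
                // execute() is fiber-blocking: it parks the fiber until the async client completes
                try (CloseableHttpResponse response = fiberClient.execute(request)) {
                    EntityUtils.consumeQuietly(response.getEntity());
                } catch (IOException e) {
                    // per-endpoint error handling/logging would go here
                }
            }).start();
        }
    }
}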
- ulimits for open files are set very high (131072 for both soft and hard limits)
- Eden is set to 18GB, total heap size is 24GB
- OS TCP stack is also well tuned:
kernel.printk = 8 4 1 7
kernel.printk_ratelimit_burst = 10
kernel.printk_ratelimit = 5
net.ipv4.ip_local_port_range = 8192 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 100000
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 1
Problem description
- Under low-medium load all is well: connections are leased, closed, and the pool replenishes
- Beyond some concurrency point, the IOReactor threads (16 of them) seem to stop functioning properly, prior to dying
- I've written a small thread that gets the pool stats and prints them each second (a rough sketch of this stats thread follows this list). At around 25K leased connections, actual data is no longer sent over the socket connections, and the Pending stat climbs to a sky-rocketing 30K pending connection requests as well
- This situation persists and basically renders the application useless. At some point the I/O reactor threads die; I'm not sure when, and I haven't been able to catch the exceptions so far
- lsof-ing the java process, I can see it has tens of thousands of file descriptors, almost all of them in CLOSE_WAIT (which makes sense, as the I/O reactor threads die/stop functioning and never get to actually close them)
- During the time the application breaks, the server is not heavily overloaded or CPU-stressed
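For reference, the stats-printing thread is essentially equivalent to the minimal sketch below, built on the connection manager's getTotalStats() (the class name and the use of a ScheduledExecutorService are illustrative details):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
import org.apache.http.pool.PoolStats;

public class PoolStatsLogger {

    // Prints leased/pending/available/max once per second for the shared connection manager.
    public static void start(PoolingNHttpClientConnectionManager connectionManager) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            PoolStats stats = connectionManager.getTotalStats();
            // leased = connections currently handed out, pending = lease requests waiting for a connection
            System.out.println("leased=" + stats.getLeased()
                    + " pending=" + stats.getPending()
                    + " available=" + stats.getAvailable()
                    + " max=" + stats.getMax());
        }, 0, 1, TimeUnit.SECONDS);
    }
}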
Questions
- I'm guessing I'm reaching some sort of boundary somewhere, though I'm rather clueless as to what or where it may reside, except for the following:
- Is it possible I'm reaching an OS port limit (all applicative requests originate from a single internal IP, after all), which creates an error that causes the I/O reactor threads to die (something similar to open-file limit errors)?