We are experiencing a problem where incoming client connections to our socket server are refused when a relatively small number of nodes (16 to 24, though we will need to handle more in the future) try to connect simultaneously.
Some specifics:
- the server runs on Windows Server 2008 or Windows 7
- our main server is written in Java using a ServerSocket
- the clients also run Windows, on grid nodes in our data center
When we do a test run on the grid, each client node connects to the server, sends a 40-100 KB payload, and then drops the connection. With between 16 and 24 nodes we start seeing clients fail to connect to the server. In other words, we are trying to handle at most 16-24 simultaneous client connections and failing, which does not seem right to us at all.
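To illustrate, the per-node client behavior amounts to something like the following (a minimal sketch, not our actual client code; the host name, port, and payload size are placeholders for what the real grid clients use):

```java
import java.io.OutputStream;
import java.net.Socket;

public class TestClient {
    public static void main(String[] args) throws Exception {
        // "server-host" and 9000 are placeholders for the real server address.
        try (Socket socket = new Socket("server-host", 9000)) {
            byte[] payload = new byte[64 * 1024]; // stand-in for the real 40-100 KB message
            OutputStream out = socket.getOutputStream();
            out.write(payload);
            out.flush();
        } // socket closed here: the connection is dropped right after the send
    }
}
```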
The main server loop listens on a regular ServerSocket; when it accepts a connection it spawns a new Thread to handle that connection and immediately returns to listening on the socket (sketched below). We also have a dummy Python server that simply reads and discards the incoming data, and a C++ server that logs the data before discarding it. Both show the same problem with clients being unable to connect, with minor variations in how many clients connect successfully before the failures start. This has led us to believe that no specific server implementation is at fault and that the cause is probably environmental.
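The accept loop is essentially the following (a simplified sketch; the port number and the handler body are placeholders, and the real handler does actual processing instead of draining the stream):

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class MainServer {
    public static void main(String[] args) throws Exception {
        ServerSocket serverSocket = new ServerSocket(9000); // placeholder port
        while (true) {
            final Socket client = serverSocket.accept();   // blocks until a client connects
            new Thread(() -> handle(client)).start();      // hand off, then go straight back to accept()
        }
    }

    private static void handle(Socket client) {
        try (Socket c = client; InputStream in = c.getInputStream()) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // the real server processes the incoming payload here; this sketch just drains it
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```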
Our first thought was to increase the TCP backlog on the listening socket. This did not alleviate the issue even when pushed to very high values. The default backlog for a Java ServerSocket is 50, which is already more than the 16-24 simultaneous connections we need to handle.
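For reference, we set the backlog via the ServerSocket constructor, roughly like this (the port and backlog values are just examples, not the ones we used):

```java
import java.net.ServerSocket;

public class BacklogTest {
    public static void main(String[] args) throws Exception {
        // The second argument is the requested accept backlog; the default when omitted is 50.
        // We tried values far larger than this example, with no improvement.
        ServerSocket serverSocket = new ServerSocket(9000, 1000); // placeholder port, example backlog
        System.out.println("Listening on port " + serverSocket.getLocalPort());
    }
}
```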
We have run the test between machines on the same subnet and disabled all local firewalls on the machines, in case a firewall was rate-limiting connections to the server; no success.
We have tried some tuning of the network settings on the Windows machine running the servers:
- Decreasing TcpTimedWaitDelay, to no effect (and in the Python test it shouldn't matter anyway, because that test only runs for a few milliseconds).
- Increasing MaxUserPort to a large value, around 65000, to no effect (which is odd, given that the Python test only ever sends 240 messages, so we shouldn't even be getting close to that kind of limit).
- Increasing TcpNumConnections to a large value (we can't remember the exact number). Again, we should never have more than 24 connections at a time, so this can't be the limit.
- Enabling the "Dynamic Backlog" feature, which allows the connection backlog to grow dynamically. I think we set the maximum to 2000 connections with a minimum of 1000, but to no effect. Again, the Python test should never make more than 240 connections, so we shouldn't even be activating the dynamic backlog.
- In addition to the above, disabling Windows TCP "autotuning". Again, to no effect.
My feeling is that Windows is somehow limiting the number of inbound connections, but we aren't sure what to modify to allow more. The theory that some agent on the network is limiting the connection rate doesn't seem to hold either, and we highly doubt that 16-24 simultaneous connections are overloading the physical gigabit network.
We're stumped. Has anybody else experienced a problem like this and found a solution?