I am experiencing an odd problem every now and then (too often actually).
I am running a server application, which is binding a socket for itself.
But once in a while, the socket is not released. The process dies, although Eclipse reports that Terminate failed, however it disappears properly from 'ps' and JConsole/JVisualVM. 'lsof' also displays nothing for the port anymore. But still, I get this error when I try to start the server again to the same port:
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
The problem is worst in my unit tests, which never run fully, because this will for sure occur after one of the tests (which all recreate the server).
I am running MacOSX 10.7.3
Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635) Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)
I have also Parallels, and often the problem looks like it's caused by the Parallels network adapter, but I am not sure if it has anything to do with this problem after all (I have contacted their support without any help so far).
The only thing that helps to resolve the situation is to reboot OSX.
Any ideas?
--
This is the relevant code to open the socket:
channel = (ServerSocketChannel) ServerSocketChannel.open().configureBlocking(false);
channel.socket().bind( addr, 0 );
and it is closed by
channel.close();
But I assume that the process gets stuck here and then Eclipse kills it.
--
netstat -an (for port 6007):
tcp4 73 0 127.0.0.1.6007 127.0.0.1.51549 ESTABLISHED
tcp4 0 0 127.0.0.1.51549 127.0.0.1.6007 ESTABLISHED
tcp4 73 0 127.0.0.1.6007 127.0.0.1.51544 CLOSE_WAIT
tcp4 0 0 127.0.0.1.6007 127.0.0.1.51543 CLOSE_WAIT
tcp4 0 0 10.37.129.2.6007 *.* LISTEN
tcp4 0 0 10.211.55.2.6007 *.* LISTEN
tcp4 0 0 127.0.0.1.6007 *.* LISTEN
tcp4 0 0 10.50.100.236.6007 *.* LISTEN
--
And now I get this exception after the socket is opened for every test (netstat output from this situation):
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.net.SocketInputStream.read(SocketInputStream.java:182)
--
Stopping the process from eclipse I got "Terminate failed", but lsof -i TCP:6007 is displaying nothing and the process is no longer found by 'ps'. netstat output did not change...
Can I somehow kill the socket without rebooting (that would help a litte bit already)?
--
UPDATE 5.5.12:
I ran the tests now in Eclipse debugger. This time the tests got stuck after 18 methods. I stopped the main thread after it was stuck around 15 minutes. This is the stack:
Thread [main] (Suspended)
FileDispatcher.preClose0(FileDescriptor) line: not available [native method]
SocketDispatcher.preClose(FileDescriptor) line: 41
ServerSocketChannelImpl.implCloseSelectableChannel() line: 208 [local variables unavailable]
ServerSocketChannelImpl(AbstractSelectableChannel).implCloseChannel() line: 201
ServerSocketChannelImpl(AbstractInterruptibleChannel).close() line: 97
...
--
Hmm, it looks like the process is not killed, after all - and does not die to kill -9 either (I noticed that process 712 and probably also 710 are the TestNG processes):
$ kill -9 712
$ ps xa | grep java
700 ?? ?E 0:00.00 (java)
712 ?? ?E 0:00.00 (java)
797 s005 S+ 0:00.00 grep java
-- Edit: 10.5.12:
?E in the ps output above means that the process is exiting. I could not find any means to kill such a process fully without rebooting. The same issue has been noticed with some other applications. No solutions found:
netstat -an
immediately after the test fails to see if the socket is in aTIME_WAIT
state. – Berzeliuskill -9
it. alternatively you run a VM on linux and you can still use eclipse to debug. – CruckSelector.select()
has been woken up, and has exited? – Ensconce