Java process on Mac OSX does not release socket

Asked 25/4, 2012 at 9:3 Answered 24/5, 2012 at 15:24

I am experiencing an odd problem every now and then (too often actually).

I am running a server application, which is binding a socket for itself.

But once in a while, the socket is not released. The process dies (although Eclipse reports that Terminate failed): it disappears properly from 'ps' and JConsole/JVisualVM, and 'lsof' no longer displays anything for the port. Still, I get this error when I try to start the server again on the same port:

Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)

The problem is worst in my unit tests, which never run fully, because this will for sure occur after one of the tests (which all recreate the server).

I am running MacOSX 10.7.3

Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635) Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)

I have also Parallels, and often the problem looks like it's caused by the Parallels network adapter, but I am not sure if it has anything to do with this problem after all (I have contacted their support without any help so far).

The only thing that helps to resolve the situation is to reboot OSX.

Any ideas?

This is the relevant code to open the socket:

channel = (ServerSocketChannel) ServerSocketChannel.open().configureBlocking(false);
channel.socket().bind(addr, 0);

and it is closed by

  channel.close();

But I assume that the process gets stuck here and then Eclipse kills it.
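
To check the open/bind/close sequence above in isolation, something like the following sketch might help. It binds an ephemeral loopback port, closes it, and tries to bind the same port again. The class and method names (`BindCloseSketch`, `open`, `canRebind`) are hypothetical, and the `setReuseAddress(true)` call is an assumption that is not in the original code. It only addresses leftovers such as connections in TIME_WAIT and would not be expected to help when a process is stuck inside `close()`.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

// Hypothetical helper class; the names are illustrative, not from the library in question.
class BindCloseSketch {

    // Opens a non-blocking server channel bound to addr, like the snippet above.
    static ServerSocketChannel open(InetSocketAddress addr) throws IOException {
        ServerSocketChannel channel = ServerSocketChannel.open();
        try {
            channel.configureBlocking(false);
            // Assumption: not in the original code. Must be set before bind().
            channel.socket().setReuseAddress(true);
            channel.socket().bind(addr, 0); // backlog 0 means "implementation default"
            return channel;
        } catch (IOException e) {
            channel.close(); // do not leak the descriptor if bind fails
            throw e;
        }
    }

    // Binds an ephemeral loopback port, closes it, and binds the same port again.
    // Returns false if any step fails, e.g. with "Address already in use".
    static boolean canRebind() {
        try {
            ServerSocketChannel first = open(new InetSocketAddress("127.0.0.1", 0));
            int port = first.socket().getLocalPort();
            first.close();
            ServerSocketChannel second = open(new InetSocketAddress("127.0.0.1", port));
            try {
                return second.socket().getLocalPort() == port;
            } finally {
                second.close();
            }
        } catch (IOException e) {
            return false;
        }
    }
}
```

If `canRebind()` returns true on a machine where the full server still fails to restart, that would suggest the problem is not in the plain bind/close path itself.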

netstat -an (for port 6007):

tcp4      73      0  127.0.0.1.6007         127.0.0.1.51549        ESTABLISHED
tcp4       0      0  127.0.0.1.51549        127.0.0.1.6007         ESTABLISHED
tcp4      73      0  127.0.0.1.6007         127.0.0.1.51544        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.6007         127.0.0.1.51543        CLOSE_WAIT 
tcp4       0      0  10.37.129.2.6007       *.*                    LISTEN     
tcp4       0      0  10.211.55.2.6007       *.*                    LISTEN     
tcp4       0      0  127.0.0.1.6007         *.*                    LISTEN     
tcp4       0      0  10.50.100.236.6007     *.*                    LISTEN

And now, for every test, I get this exception after the socket is opened (the netstat output above is from this situation):

Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.net.SocketInputStream.read(SocketInputStream.java:182)

Stopping the process from Eclipse gave "Terminate failed", but lsof -i TCP:6007 displays nothing and the process is no longer found by 'ps'. The netstat output did not change...

Can I somehow kill the socket without rebooting (that would help a little bit already)?

UPDATE 5.5.12:

I ran the tests now in Eclipse debugger. This time the tests got stuck after 18 methods. I stopped the main thread after it was stuck around 15 minutes. This is the stack:

Thread [main] (Suspended)   
    FileDispatcher.preClose0(FileDescriptor) line: not available [native method]    
    SocketDispatcher.preClose(FileDescriptor) line: 41  
    ServerSocketChannelImpl.implCloseSelectableChannel() line: 208 [local variables unavailable]    
    ServerSocketChannelImpl(AbstractSelectableChannel).implCloseChannel() line: 201 
    ServerSocketChannelImpl(AbstractInterruptibleChannel).close() line: 97  
...

Hmm, it looks like the process is not killed after all, and it does not die on kill -9 either (I noticed that process 712 and probably also 710 are the TestNG processes):

$ kill -9 712
$ ps xa | grep java
  700   ??  ?E     0:00.00 (java)
  712   ??  ?E     0:00.00 (java)
  797 s005  S+     0:00.00 grep java

-- Edit: 10.5.12:

?E in the ps output above means that the process is exiting. I could not find any means to kill such a process fully without rebooting. The same issue has been noticed with some other applications. No solutions found:

http://www.google.com/search?q=ps+process+is+exiting+osx

Montemayor answered 25/4, 2012 at 9:3 Comment(19)

In your tests, do you repeatedly bind and unbind to the socket? If you are doing this very quickly, maybe you are running into some timing-sensitive bug. – Giagiacamo 25/4, 2012 at 9:28

Can you show your code for creating and binding the socket plus any options you set. – Berzelius 25/4, 2012 at 11:4

Also run netstat -an immediately after the test fails to see if the socket is in a TIME_WAIT state. – Berzelius 25/4, 2012 at 11:5

Is your open() contained within a try/finally block? The code to close your socket should be in the finally{}. Please post more code to show how it is closed and Exceptions are handled. – Hulsey 1/5, 2012 at 13:58

No it's not in a try-finally block. This is server code. The server is started up and it is closed on request. The question is why the socket is not closed when the process dies, which I think should always happen. And also how can I release the socket for reuse without rebooting, if it happens. – Montemayor 1/5, 2012 at 16:43

As other people said, please put more code. It will help to determine whether it is a programming issue or you are facing other problems related with the socket life-cycle. – Hydrated 4/5, 2012 at 11:23

The problem is that the code is all in a pretty big library, and the socket channel is simply closed; I cannot drag in all the related code. But the same code has been running without problems in Windows and Linux for several years already. Now that I've recently switched to OSX I have seen this pretty often on my own computer. I have also heard odd complaints from our customers using the library that their server applications do not always close in OSX. I am not sure, but I've started to consider that this is probably the reason for that as well. – Montemayor 4/5, 2012 at 16:3

I edited the code part - to show the 'close', which is nothing special. – Montemayor 4/5, 2012 at 16:9

The process is most likely not terminated. Unfortunately, netstat on Mac can't show the owning process (to my very limited knowledge). If the process is terminated, you have hit a bug in OS X. If you know the pid of the process, you can kill -9 it. Alternatively, you can run a Linux VM and still use Eclipse to debug. – Cruck 5/5, 2012 at 6:15

The process is killed by eclipse, although it reports "Terminate failed". It just does not release the socket. It sounds like a bug in OSX to me, too. In which case, how can I proceed to get some action for it? – Montemayor 5/5, 2012 at 12:46

Added kill -9 and ps xa output – Montemayor 5/5, 2012 at 13:11

Try "sudo kill -9 712"; when killing stuff I prefer to make sure I run it as root (usually). By the way, if the process is a zombie, you need to kill the parent too (likely Eclipse). – Cruck 5/5, 2012 at 14:52

Again, if the whole affair ends up being a bug in OS X, run a Linux VM under OS X and debug the application under Linux. You said it's a server application; will you run it under OS X in production? – Cruck 5/5, 2012 at 14:58

sudo kill did not kill it either... This is actually library code and the servers created with it are run in Windows/Linux/OSX and probably in other unix variants too. OSX is the only one with this kind of problems. – Montemayor 5/5, 2012 at 17:17

Have you reproduced the problem on a system that is not running Parallels? – Pisces 10/5, 2012 at 6:14

Question: Can you post a thread dump of what application looks like when it is trying to be shut down by TestNG? Just a shot in the dark here, but can you also make sure that any thread that is waiting on Selector.select() has been woken up, and has exited? – Ensconce 10/5, 2012 at 14:49

@Sam Goldberg That tip seems to have hit to the correct address! The server was using a global Selector instance. It wasn't closed, but nevertheless in subsequent opens/closes of the server it somehow got stuck. I changed the server code to use a fresh Selector every time it's created and now the tests are all run through. I will need to study the code a bit more (not my own originally) and find out if the change has any other effects or if this is the way to go. – Montemayor 10/5, 2012 at 20:29

@jouniaro: I ran into a similar problem on Linux, where it seemed that threads were hanging on some of the SocketChannel methods when another thread was waiting on Selector.select(). Similar to what you saw, this problem also didn't happen on Windows. It seems particular to the Unix C Library selector implementation. – Ensconce 10/5, 2012 at 23:21

OK. I thought this was tested on Linux, but now I am not 100% certain if that was really the case. This has occurred with the normal server process as well, after several startup/shutdowns. Perhaps the selector hasn't been closed properly and that is the exact reason. On Linux the unit tests have not been run, but also the actual server process has never got stuck, which has happened in OSX. Maybe it's just more probable there or the server just hasn't been developed that much on Linux that this would have happened there. – Montemayor 11/5, 2012 at 7:50

So it seems that the problem lies in the implementation of Selector in the Mac version of JDK 6. Installing the new Oracle JDK 7u4 fixes the issue, independent of how the Selector is used.

Montemayor answered 24/5, 2012 at 15:24 Comment(3)

I need to add that I am still experiencing the same problem occasionally with the latest JDK7u21 as well, although much less frequently than with JDK6. – Montemayor 26/4, 2013 at 9:41

The issue has been experienced on Linux as well. I can reproduce it sometimes (much more rarely than on Java 6/OSX) on Ubuntu. Another person says he can reproduce it easily on Redhat Linux. – Montemayor 8/7, 2015 at 9:26

Turned out that the issue on Linux is a bit different to the original problem in Mac. Somehow related to Selector, but it still hasn't fully revealed itself. – Montemayor 12/10, 2015 at 7:29

Try closing the socket with http://docs.oracle.com/javase/1.4.2/docs/api/java/net/ServerSocket.html#close() after each test, in the teardown, if you're not already doing so.

Carilla answered 4/5, 2012 at 15:35 Comment(7)

See my new comment to the question. The socket should be closed by the server objects that's created for each test, when the server is closed. – Montemayor 4/5, 2012 at 16:5

Are you checking for exceptions on 'channel.close()' ? – Carilla 4/5, 2012 at 16:17

Yes, but they were eaten. I ran the tests now in debugger with a breakpoint set in the catch clause. It turned out that there occurred once an error in server finalization, which prevented the close to be called. However, TestNG (which I am using) stopped the process, and the socket was again left in a reserved state, so I had to assign a new socket for the tests to be able to run again. In Windows I have never had the problem that failing tests, however they fail, would prevent the socket from being used again. – Montemayor 4/5, 2012 at 16:43

The next run (with a new socket) got stuck again with a SocketTimeoutException: Read timed out, when the client tried to access the socket. And now all the tests time out for the same reason. The debugger did not stop at the catch clause in channel.close(), nor does it log any error (I've added logging there as well)... – Montemayor 4/5, 2012 at 16:47

And the same result for the next run with a new socket, after 14th test method (I have a few hundred in the suite) – Montemayor 4/5, 2012 at 16:52

"It turned out that there occurred once an error in server finalization, which prevented the close to be called." Sounds like the close needs to be called in a 'finally' block, in the test teardown, so that it's guaranteed to be called after every test. – Carilla 4/5, 2012 at 21:44

Yes you might argue like that. But it does not seem to be relevant, since this happened just once. Still the socket reads begin to timeout after a few tests have been executed, although there are no errors from socket closing or from anywhere else. – Montemayor 5/5, 2012 at 12:44

Just a shot in the dark here, but make sure that any thread that is waiting on Selector.select() has been woken up, and has exited.
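
A minimal sketch of that shutdown order, assuming a server that owns one Selector and one selector thread: set a stop flag, call `wakeup()`, wait for the thread to leave `select()`, and only then close the channel and the Selector. All names here (`SelectorShutdownSketch`, `runLoop`, `shutdown`, `startAndStop`) are hypothetical and not taken from the library discussed in the question.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical server skeleton; all names are illustrative.
class SelectorShutdownSketch {
    private final Selector selector;
    private final ServerSocketChannel channel;
    private final Thread loop;
    private volatile boolean running = true;

    SelectorShutdownSketch(InetSocketAddress addr) throws IOException {
        selector = Selector.open(); // a fresh Selector per server, not a global one
        channel = ServerSocketChannel.open();
        channel.configureBlocking(false);
        channel.socket().bind(addr, 0);
        channel.register(selector, SelectionKey.OP_ACCEPT);
        loop = new Thread(new Runnable() {
            public void run() { runLoop(); }
        }, "selector-loop");
        loop.start();
    }

    private void runLoop() {
        try {
            while (running) {
                selector.select(); // blocks until a key is ready or wakeup() is called
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isValid() && key.isAcceptable()) {
                        SocketChannel client = channel.accept();
                        if (client != null) client.close(); // placeholder for real handling
                    }
                }
                selector.selectedKeys().clear();
            }
        } catch (IOException | ClosedSelectorException e) {
            // The loop simply ends; a real server would log this.
        }
    }

    // Returns true if the selector thread exited within the timeout.
    boolean shutdown(long timeoutMillis) throws IOException, InterruptedException {
        running = false;
        selector.wakeup();        // unblock a pending select()
        loop.join(timeoutMillis); // wait for the thread to leave select()
        boolean exited = !loop.isAlive();
        try {
            channel.close();
        } finally {
            selector.close();
        }
        return exited;
    }

    // Starts a server on an ephemeral loopback port and shuts it down again.
    static boolean startAndStop() {
        try {
            SelectorShutdownSketch server =
                    new SelectorShutdownSketch(new InetSocketAddress("127.0.0.1", 0));
            return server.shutdown(5000);
        } catch (Exception e) {
            return false;
        }
    }
}
```

If `shutdown` returns false, the selector thread was still alive when the channel was closed, which is the situation the stack trace in the question points at.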

Ensconce answered 14/5, 2012 at 19:7 Comment(3)

I will need to verify it fully when I find the time. It seems the actual server is still blocking at close sometimes, although the tests began to run better. – Montemayor 18/5, 2012 at 20:59

It seems that the Selector is the guilty one here, but there seems to be something wrong with it. I have a version which is creating a new Selector for every test and closing it, but nevertheless, close may still occasionally hang. I have now installed the new Oracle JDK 7u4, which is the first one to include Mac support - and it seems to have fixed the issue, independent of how the selector is used. I would like to accept your answer, since you pointed to the correct direction, but it did not eventually help to solve it properly. – Montemayor 24/5, 2012 at 15:22

@jouniaro: Thanks for update. It's good to know that JDK 7 fixed the issue. Now that I remember, I should say also that I think the issue we saw with Selector was worse when we were using JRockit JDK, and the hanging was definitely in the JNI portion of the code. So it seems likely that new JDK could remove the entire problem. – Ensconce 25/5, 2012 at 13:30

I have also Parallels, and often the problem looks like it's caused by the Parallels network adapter....

I'd say that's a fair bet if this problem is not cropping up on other platforms. What have you done to exclude Parallels as the culprit?

Pisces answered 10/5, 2012 at 6:17 Comment(4)

Yes, good question. I have not been able to disable the Parallels network interfaces, yet, since if the VM is suspended or stopped it seems to leave the interfaces up anyway. I will need to retry that a bit more - somehow I got the impression that it may not affect it after all. Also what I just added in edit, gives the impression that it's a "common" OSX issue. – Montemayor 10/5, 2012 at 6:48

@jouniaro, unkillable exiting processes are often the result of kernel deadlocks, which result from buggy kernel extensions, such as (perhaps) the Parallels' network interfaces. You need to either uninstall Parallels or try to reproduce the problem on a different Mac that does not have Parallels (or Fusion) installed. – Pisces 10/5, 2012 at 7:4

I agree with Old Pro, you should test on a Mac without Parallels to determine whether Parallels has something to do with the problem. If it works, it seems the problem is on the Parallels side. If not, at least you know the problem is related to OSX. It would be helpful if you could provide a minimal class and test to reproduce the problem. – Hydrated 10/5, 2012 at 9:57

Yes, I will try to work on it. Unfortunately I am very busy with "real issues" and this is just a nasty side track that I want to get solved at some point as well. Not sure if I have time to test it in the next days... – Montemayor 10/5, 2012 at 16:41

If you think that the resources are not properly released, you can try to release them in a shutdown hook. That way the resources will at least be released when the JVM shuts down (though not if you hard-kill it).

An example of a very basic shutdown hook:

public void shutDownProcedure(){
    Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
            /* my shutdown code here */
        }
    });
}

This helped me release resources that somehow weren't entirely released before. I don't know if this works for sockets as well, but I think it should.

It also allowed me to see log output I hadn't seen before.
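
Applied to a socket, the hook could look roughly like the sketch below. It assumes the resource is anything `Closeable` (a `ServerSocketChannel` qualifies); the class and method names are hypothetical. As noted above, the hook does not run on a hard kill such as kill -9.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical helper; the names are illustrative.
class ShutdownHookSketch {

    // Registers a JVM shutdown hook that closes the given resource.
    // Returns the hook thread so the caller can remove it after a normal close.
    static Thread closeOnShutdown(final Closeable resource) {
        Thread hook = new Thread(new Runnable() {
            public void run() {
                try {
                    resource.close();
                } catch (IOException e) {
                    // Little can be done this late; report and carry on.
                    System.err.println("close failed in shutdown hook: " + e);
                }
            }
        }, "close-on-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

A caller would pass the server channel, for example `ShutdownHookSketch.closeOnShutdown(channel)`, and could call `Runtime.getRuntime().removeShutdownHook(hook)` once the server has been closed normally.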

Mol answered 10/5, 2012 at 19:49 Comment(0)
