Java process on Mac OSX does not release socket

I am experiencing an odd problem every now and then (too often actually).

I am running a server application, which is binding a socket for itself.

But once in a while, the socket is not released. Eclipse reports that Terminate failed, yet the process dies: it disappears from 'ps' and JConsole/JVisualVM, and 'lsof' no longer shows anything for the port. Still, I get this error when I try to start the server again on the same port:

Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)

The problem is worst in my unit tests, which never run to completion, because this is sure to occur after one of the tests (each of which recreates the server).

I am running MacOSX 10.7.3

Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635) Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)

I have also Parallels, and often the problem looks like it's caused by the Parallels network adapter, but I am not sure if it has anything to do with this problem after all (I have contacted their support without any help so far).

The only thing that helps to resolve the situation is to reboot OSX.

Any ideas?

--

This is the relevant code to open the socket:

channel = (ServerSocketChannel) ServerSocketChannel.open().configureBlocking(false);
channel.socket().bind(addr, 0);

and it is closed by

  channel.close();

But I assume that the process gets stuck here and then Eclipse kills it.

--

netstat -an (for port 6007):

tcp4      73      0  127.0.0.1.6007         127.0.0.1.51549        ESTABLISHED
tcp4       0      0  127.0.0.1.51549        127.0.0.1.6007         ESTABLISHED
tcp4      73      0  127.0.0.1.6007         127.0.0.1.51544        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.6007         127.0.0.1.51543        CLOSE_WAIT 
tcp4       0      0  10.37.129.2.6007       *.*                    LISTEN     
tcp4       0      0  10.211.55.2.6007       *.*                    LISTEN     
tcp4       0      0  127.0.0.1.6007         *.*                    LISTEN     
tcp4       0      0  10.50.100.236.6007     *.*                    LISTEN     

--

And now I get this exception after the socket is opened for every test (netstat output from this situation):

Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.net.SocketInputStream.read(SocketInputStream.java:182)

--

When I stopped the process from Eclipse I got "Terminate failed", but lsof -i TCP:6007 displays nothing and the process is no longer found by 'ps'. The netstat output did not change...

Can I somehow kill the socket without rebooting (that would help a little bit already)?

--

UPDATE 5.5.12:

I have now run the tests in the Eclipse debugger. This time the tests got stuck after 18 methods. I suspended the main thread after it had been stuck for around 15 minutes. This is the stack:

Thread [main] (Suspended)   
    FileDispatcher.preClose0(FileDescriptor) line: not available [native method]    
    SocketDispatcher.preClose(FileDescriptor) line: 41  
    ServerSocketChannelImpl.implCloseSelectableChannel() line: 208 [local variables unavailable]    
    ServerSocketChannelImpl(AbstractSelectableChannel).implCloseChannel() line: 201 
    ServerSocketChannelImpl(AbstractInterruptibleChannel).close() line: 97  
...

--

Hmm, it looks like the process is not killed after all, and kill -9 does not kill it either (I noticed that process 712 and probably also 710 are the TestNG processes):

$ kill -9 712
$ ps xa | grep java
  700   ??  ?E     0:00.00 (java)
  712   ??  ?E     0:00.00 (java)
  797 s005  S+     0:00.00 grep java

-- Edit: 10.5.12:

?E in the ps output above means that the process is exiting. I could not find any means to kill such a process fully without rebooting. The same issue has been noticed with some other applications. No solutions found:

http://www.google.com/search?q=ps+process+is+exiting+osx

Montemayor answered 25/4, 2012 at 9:3 Comment(19)
In your tests, do you repeatedly bind and unbind to the socket? If you are doing this very quickly, maybe you are running into some timing-sensitive bug. - Giagiacamo
Can you show your code for creating and binding the socket, plus any options you set? - Berzelius
Also run netstat -an immediately after the test fails to see if the socket is in a TIME_WAIT state. - Berzelius
Is your open() contained within a try/finally block? The code to close your socket should be in the finally{}. Please post more code to show how it is closed and how exceptions are handled. - Hulsey
No, it's not in a try-finally block. This is server code. The server is started up and it is closed on request. The question is why the socket is not closed when the process dies, which I think should always happen, and also how I can release the socket for reuse without rebooting if it happens. - Montemayor
As other people said, please post more code. It will help to determine whether it is a programming issue or you are facing other problems related to the socket life cycle. - Hydrated
The problem is that the code is all in a pretty big library, and the socket channel is simply closed; I cannot drag in all the related code. But the same code has been running without problems on Windows and Linux for several years already. Now that I've recently switched to OSX I have seen this pretty often on my own computer. I have also heard odd complaints from our customers using the library that their server applications do not always close on OSX. I am not sure, but I've started to consider that this is probably the reason for that as well. - Montemayor
I edited the code part to show the 'close', which is nothing special. - Montemayor
The process is most likely not terminated. Unfortunately netstat on Mac can't show the owning process (to my very limited knowledge). If the process is terminated, you have hit a bug in Mac OS. If you know the pid of the process, you can kill -9 it. Alternatively, you can run a Linux VM and still use Eclipse to debug. - Cruck
The process is killed by Eclipse, although it reports "Terminate failed". It just does not release the socket. It sounds like a bug in OSX to me, too. In which case, how can I proceed to get some action on it? - Montemayor
Added kill -9 and ps xa output. - Montemayor
Try "sudo kill -9 712"; when killing stuff I usually prefer to make sure I run it as root. By the way, if the process is a zombie, you need to kill the parent too (likely Eclipse). - Cruck
Again, if the whole affair ends up being a bug in Mac OS, run a Linux VM under Mac OS and debug the application under Linux. You said it's a server application; will you run it under Mac OS in production? - Cruck
sudo kill did not kill it either... This is actually library code, and the servers created with it are run on Windows/Linux/OSX and probably on other Unix variants too. OSX is the only one with this kind of problem. - Montemayor
Have you reproduced the problem on a system that is not running Parallels? - Pisces
Question: Can you post a thread dump of what the application looks like when TestNG is trying to shut it down? Just a shot in the dark here, but can you also make sure that any thread that is waiting on Selector.select() has been woken up, and has exited? - Ensconce
@Sam Goldberg That tip seems to have hit the mark! The server was using a global Selector instance. It wasn't closed, but nevertheless in subsequent opens/closes of the server it somehow got stuck. I changed the server code to use a fresh Selector every time it's created and now the tests all run through. I will need to study the code a bit more (not my own originally) and find out if the change has any other effects or if this is the way to go. - Montemayor
@jouniaro: I ran into a similar problem on Linux, where it seemed that threads were hanging on some of the SocketChannel methods when another thread was waiting on Selector.select(). Similar to what you saw, this problem also didn't happen on Windows. It seems particular to the Unix C Library selector implementation. - Ensconce
OK. I thought this was tested on Linux, but now I am not 100% certain if that was really the case. This has occurred with the normal server process as well, after several startups/shutdowns. Perhaps the selector hasn't been closed properly and that is the exact reason. On Linux the unit tests have not been run, but the actual server process has never got stuck there either, which has happened on OSX. Maybe it's just more probable there, or the server just hasn't been developed enough on Linux for this to have happened there. - Montemayor

So it seems that the problem lies in the implementation of Selector in the Mac version of JDK 6. Installing the new Oracle JDK 7u4 fixes the issue, independent of how the Selector is used.

Montemayor answered 24/5, 2012 at 15:24 Comment(3)
I need to add that I am still experiencing the same problem occasionally with the latest JDK7u21 as well, although much less frequently than with JDK6. - Montemayor
The issue has been experienced on Linux as well. I can reproduce it sometimes (much more rarely than on Java 6/OSX) on Ubuntu. Another person says he can reproduce it easily on Red Hat Linux. - Montemayor
It turned out that the issue on Linux is a bit different from the original problem on Mac. It is somehow related to Selector, but it still hasn't fully revealed itself. - Montemayor

Try closing the socket with ServerSocket.close() (http://docs.oracle.com/javase/1.4.2/docs/api/java/net/ServerSocket.html#close()) after each test, in the teardown, if you're not already doing so.
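
A minimal sketch of what that could look like, assuming the server boils down to a single ServerSocketChannel. The class and method names are made up for illustration and are not from the asker's library. Binding to port 0 asks the OS for a free port, so a port left behind by an earlier run does not block the next test:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

// Hypothetical names; only the java.nio / java.net calls are real API.
class TeardownCloseSketch {

    /** Opens a non-blocking server channel on an OS-chosen loopback port. */
    static ServerSocketChannel openServer() throws IOException {
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.configureBlocking(false);
        // Port 0 means "any free port"; read the real one via getLocalPort().
        channel.socket().bind(new InetSocketAddress("127.0.0.1", 0));
        return channel;
    }

    /** Runs one "test" and reports whether the channel ended up closed. */
    static boolean runOnce() throws IOException {
        ServerSocketChannel channel = openServer();
        try {
            int port = channel.socket().getLocalPort();
            if (port <= 0) {
                throw new IllegalStateException("server is not bound");
            }
            // ... the test body would talk to the server on 'port' here ...
        } finally {
            // Runs even if the test body throws, like a teardown method.
            channel.close();
        }
        return !channel.isOpen();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("channel closed after test: " + runOnce());
    }
}
```

This only guarantees that close() gets called. Judging by the stack trace in the question, close() itself can be the call that hangs, and a finally block cannot help with that.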

Carilla answered 4/5, 2012 at 15:35 Comment(7)
See my new comment to the question. The socket should be closed by the server object that's created for each test, when the server is closed. - Montemayor
Are you checking for exceptions on 'channel.close()'? - Carilla
Yes, but they were swallowed. I have now run the tests in the debugger with a breakpoint set in the catch clause. It turned out that an error occurred once in server finalization, which prevented close from being called. However, TestNG (which I am using) stopped the process, and the socket was again left in a reserved state, so I had to assign a new socket for the tests to be able to run again. On Windows I have never had the problem that failing tests, however they fail, prevent the socket from being used again. - Montemayor
The next run (with a new socket) got stuck again with a SocketTimeoutException: Read timed out, when the client tried to access the socket. And now all the tests time out for the same reason. The debugger did not stop at the catch clause around channel.close(), nor is any error logged (I've added logging there as well)... - Montemayor
And the same result for the next run with a new socket, after the 14th test method (I have a few hundred in the suite). - Montemayor
"It turned out that there occurred once an error in server finalization, which prevented the close to be called." Sounds like the close needs to be called in a 'finally' block, in the test teardown, so that it's guaranteed to be called after every test. - Carilla
Yes, you might argue that. But it does not seem to be relevant, since this happened just once. The socket reads still begin to time out after a few tests have been executed, although there are no errors from socket closing or from anywhere else. - Montemayor

Just a shot in the dark here, but make sure that any thread that is waiting on Selector.select() has been woken up, and has exited.
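
A rough sketch of that shutdown order, assuming one thread per server that blocks in Selector.select(). The class, field and method names are invented for illustration; the asker's library is not shown in the question. The idea is to stop the loop, call wakeup() so select() returns, wait for the thread to exit, and only then close the selector and the channel:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

// Hypothetical server skeleton; only the java.nio calls are real API.
class SelectorShutdownSketch {
    private final Selector selector;
    private final ServerSocketChannel channel;
    private final Thread loop;
    private volatile boolean running = true;

    SelectorShutdownSketch() throws IOException {
        selector = Selector.open(); // one Selector per server instance
        channel = ServerSocketChannel.open();
        channel.configureBlocking(false);
        channel.socket().bind(new InetSocketAddress("127.0.0.1", 0));
        channel.register(selector, SelectionKey.OP_ACCEPT);
        loop = new Thread(this::selectLoop, "select-loop");
        loop.start();
    }

    private void selectLoop() {
        try {
            while (running) {
                selector.select(); // blocks until an event or wakeup()
                // A real server would accept/read here; the sketch just clears.
                selector.selectedKeys().clear();
            }
        } catch (IOException | ClosedSelectorException e) {
            // Selector failed or was closed underneath us: leave the loop.
        }
    }

    /** Returns true if the loop thread exited and both resources closed. */
    boolean shutdown() throws IOException, InterruptedException {
        running = false;
        selector.wakeup();   // makes the current or next select() return
        loop.join(5000);     // wait for the thread to leave select()
        boolean exited = !loop.isAlive();
        selector.close();
        channel.close();
        return exited && !selector.isOpen() && !channel.isOpen();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("clean shutdown: " + new SelectorShutdownSketch().shutdown());
    }
}
```

Whether this ordering avoids the hang on the JDK 6 Mac build from the question is not something the sketch can prove; it only shows the sequence the answer describes.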

Ensconce answered 14/5, 2012 at 19:7 Comment(3)
I will need to verify it fully when I find the time. It seems the actual server is still blocking at close sometimes, although the tests began to run better. - Montemayor
It seems that the Selector is the guilty one here, but there seems to be something wrong with it. I have a version which creates a new Selector for every test and closes it, but nevertheless, close may still occasionally hang. I have now installed the new Oracle JDK 7u4, which is the first one to include Mac support, and it seems to have fixed the issue, independent of how the selector is used. I would like to accept your answer, since you pointed in the right direction, but in the end it did not solve the problem properly. - Montemayor
@jouniaro: Thanks for the update. It's good to know that JDK 7 fixed the issue. Now that I remember, I should also say that I think the issue we saw with Selector was worse when we were using the JRockit JDK, and the hanging was definitely in the JNI portion of the code. So it seems likely that a new JDK could remove the entire problem. - Ensconce

I have also Parallels, and often the problem looks like it's caused by the Parallels network adapter....

I'd say that's a fair bet if this problem is not cropping up on other platforms. What have you done to exclude Parallels as the culprit?

Pisces answered 10/5, 2012 at 6:17 Comment(4)
Yes, good question. I have not been able to disable the Parallels network interfaces yet, since suspending or stopping the VM seems to leave the interfaces up anyway. I will need to retry that a bit more; somehow I got the impression that it may not affect it after all. Also, what I just added in the edit gives the impression that it's a "common" OSX issue. - Montemayor
@jouniaro, unkillable exiting processes are often the result of kernel deadlocks, which result from buggy kernel extensions, such as (perhaps) the Parallels network interfaces. You need to either uninstall Parallels or try to reproduce the problem on a different Mac that does not have Parallels (or Fusion) installed. - Pisces
I agree with Old Pro, you should test on a Mac without Parallels to determine whether Parallels has something to do with the problem. If it works, it seems the problem is on the Parallels side. If not, at least you know the problem is related to OSX. It would be helpful if you could provide a minimal class and test to reproduce the problem. - Hydrated
Yes, I will try to work on it. Unfortunately I am very busy with "real issues" and this is just a nasty side track that I want to get solved at some point as well. Not sure if I have time to test it in the next few days... - Montemayor

If you think that the resources are not properly released, you can try to do the release in a shutdown hook. That way the resources will at least be released when the JVM shuts down (though not if you hard-kill it).

An example of a very basic shutdown hook:

public void shutDownProcedure(){
    Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
            /* my shutdown code here */
        }
    });
}

This helped me release resources that somehow weren't entirely released before. I don't know if this works for sockets as well; I think it should.

It also allowed me to see log output I hadn't seen before.
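
For sockets, a sketch along the same lines might look like the following. The helper name closeOnShutdown is hypothetical; Runtime.addShutdownHook and Closeable.close are the only real API used. Keep in mind this is untested against the hang in the question: if close() itself blocks, as in the stack trace posted there, the hook would block too.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.channels.ServerSocketChannel;

// Hypothetical helper; not part of any library mentioned in this thread.
class ShutdownHookSketch {

    /** Registers a hook that closes the resource when the JVM shuts down. */
    static Thread closeOnShutdown(final Closeable resource) {
        Thread hook = new Thread(() -> {
            try {
                resource.close();
            } catch (IOException e) {
                // Logging frameworks may already be down; use stderr.
                System.err.println("close failed in shutdown hook: " + e);
            }
        }, "close-on-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws IOException {
        ServerSocketChannel channel = ServerSocketChannel.open();
        closeOnShutdown(channel);
        System.out.println("hook registered, channel open: " + channel.isOpen());
        // On normal exit (or Ctrl-C) the hook runs and closes the channel.
    }
}
```

Shutdown hooks run on normal exit and on signals such as SIGTERM, but not on kill -9, so this does not cover the hard-kill case.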

Mol answered 10/5, 2012 at 19:49 Comment(0)
