What is the reason for Broken Pipe on Unix Domain Sockets?

I have a server application which receives requests and forwards them over a Unix Domain Socket. This works perfectly under reasonable usage, but when I run load tests with a few thousand requests I get a Broken Pipe error.

I am using Java 7 with junixsocket to send the requests. I have lots of concurrent requests, but I have a thread pool of 20 workers which is writing to the unix domain socket, so there is no issue of too many concurrent open connections.

For each request I am opening, sending and closing the connection with the Unix Domain Socket.

What is the reason that could cause a Broken Pipe on Unix Domain Sockets?

UPDATE:

Here is a code sample in case it helps:

byte[] mydata = new byte[1024];
//fill the data with bytes ...

AFUNIXSocketAddress socketAddress = new AFUNIXSocketAddress(new File("/tmp/my.sock"));
Socket socket = AFUNIXSocket.connectTo(socketAddress);
OutputStream out = new BufferedOutputStream(socket.getOutputStream());
InputStream in = new BufferedInputStream(socket.getInputStream());

out.write(mydata);
out.flush();  //The Broken Pipe occurs here, but only after a few thousand times

//read the response back...

out.close();
in.close();
socket.close();

I have a thread pool of 20 workers, and they are doing the above concurrently (so up to 20 concurrent connections to the same Unix Domain Socket), with each one opening, sending and closing. This works fine for a load test with a burst of 10,000 requests, but when I add a few thousand more I suddenly get this error, so I am wondering whether it's coming from some OS limit.

Keep in mind that this is a Unix Domain Socket, not a network TCP socket.

Gorget answered 15/4, 2012 at 16:17 Comment(3)
See "What causes the Broken Pipe Error" (#4585404)Insurgence
@Gorget I'm also seeing this behaviour from an AFUNIXSocket client. Did you ever find the underlying cause?Tollmann
@Tollmann It's been too long for me to remember exactly. However, I think I increased the file limits, such as the number of open files and open sockets.Gorget

'Broken pipe' means you have written to a connection that had already been closed by the other end. It is detected somewhat asynchronously due to buffering. It basically means you have an error in your application protocol.
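
The behaviour described above can be reproduced in a few lines. The following is only a minimal sketch, and it rests on assumptions: it uses a plain TCP loopback socket from `java.net` as a stand-in for the Unix Domain Socket (so it needs no junixsocket dependency), and the class name `BrokenPipeDemo` and method `writeUntilFailure` are made up for illustration. The peer accepts the connection and closes it straight away; the client keeps writing until the OS reports the failure. The first write may well succeed because of buffering, which is the "somewhat asynchronous" detection mentioned above. The exact exception message varies by platform.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BrokenPipeDemo {

    /**
     * Connects to a local peer that closes immediately, then keeps writing
     * until a write fails. Returns the exception, or null if no write
     * failed within the attempt limit.
     * Assumption: TCP loopback stands in for the Unix Domain Socket.
     */
    static IOException writeUntilFailure() throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            // The "peer": accept one connection and close it without reading.
            Thread peer = new Thread(() -> {
                try {
                    server.accept().close();
                } catch (IOException ignored) {
                    // Not relevant to the demonstration.
                }
            });
            peer.start();

            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                peer.join(); // make sure the peer has closed before we write
                OutputStream out = client.getOutputStream();
                byte[] data = new byte[1024];
                for (int attempt = 0; attempt < 100; attempt++) {
                    try {
                        out.write(data);
                        out.flush();
                    } catch (IOException e) {
                        return e; // the write hit a connection the peer closed
                    }
                    Thread.sleep(10); // give the OS time to register the peer's close
                }
                return null;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        IOException e = writeUntilFailure();
        System.out.println(e == null ? "no failure observed" : "write failed: " + e.getMessage());
    }
}
```

Running `main` should print the failure message once a write is rejected. Nothing on the client side "breaks" the connection; the error only appears because the other end has gone away.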

Hazelton answered 15/4, 2012 at 22:39 Comment(14)
Thanks, but this is a Unix Domain Socket; it's not a normal TCP socket where a broken pipe is typically caused by network issues or the server closing the connection in a non-graceful manner.Gorget
@Gorget None of that is true. 'Broken pipe' always means the peer closed the connection, and nothing else, whether TCP or Unix domain. Network errors do not cause this problem. Both graceful and ungraceful closes will cause this problem.Hazelton
Yes, but why is it happening on a Unix Domain Socket? It is essentially a local file handle on the OS. There is no other side which is closing anything; it's all local.Gorget
@Gorget Because the peer closed the connection. The peer in this case is another process in the same OS but it is still the peer.Hazelton
OK, and why is the process closing the connection? Why would it work for the first 10,000 requests, and then out of the blue this occurs? (10,000 is not the exact number; it's actually more than that, but it does not reach the 20,000 load test limit.)Gorget
@Gorget I don't know. It's your process, not mine. But it is closing. That's what the exception means.Hazelton
No, it's not closing; that's the whole point I am trying to understand. The listening process is not my process, it's the php fcgi, and it has no reason to close. Some load condition is triggering this, but there is nothing in the logs, which is why I asked the question.Gorget
@Gorget As they say at AA, the first step is to get out of denial mode. You are getting a 'broken pipe' error. That happens when the peer closes its socket, and in no other circumstance. Ergo the peer is closing its socket. Period. Punto basta. Finis. Ende. Why, is another question.Hazelton
Lol denial mode. OK. Could it be... just could it... that my load test is filling up some OS buffer, which is why I am getting the error exactly when I am trying to flush() the data through? I am just trying to find the relationship between my test and the behaviour.Gorget
@Gorget That's exactly when I would expect you to get it. There is an OS buffer all right, and it can fill up all right: that would block your write or your flush, not cause this error. The cause remains what I said above, several times.Hazelton
Disconnected on the other end is not the only possible reason for broken pipe on a domain socket. I'm in the middle of trying to solve this problem for a C++ program I'm working on. The read side of the domain socket is fine but the write side produces SIGPIPE. I've traced execution on both processes and neither one ever closes the socket.Drop
@BrianVandenberg What other cause did you come up with?Hazelton
@Hazelton I'm only ~70% sure what I'm about to say was the same problem. Before O_NONBLOCK there was FNDELAY and O_NDELAY. In Solaris at least the differences are easy to miss. For example, with the former if read() returns 0 that means EOF; for the latter it could just mean zero bytes were read. I know I learned this around the same time I solved that SIGPIPE problem, but if they're the same problem I don't remember how they're related.Drop
@BrianVandenberg Wow somehow this thread was revived after 9 years. In my case it was a case of reaching OS limits (that was why it was happening under load conditions). Increasing things like the max file handles and somaxconn solved it.Gorget

From the Linux Programmer's Manual (similar language is also in the socket man page on Mac):

The communications protocols which implement a SOCK_STREAM ensure that data is not lost or duplicated. If a piece of data for which the peer protocol has buffer space cannot be successfully transmitted within a reasonable length of time, then the connection is considered to be dead. When SO_KEEPALIVE is enabled on the socket the protocol checks in a protocol-specific manner if the other end is still alive. A SIGPIPE signal is raised if a process sends or receives on a broken stream; this causes naive processes, which do not handle the signal, to exit.

In other words, if data gets stuck in a stream socket for too long, you'll end up with a SIGPIPE. It's reasonable that you would end up with this if you can't keep up with your load test.

Vickey answered 14/9, 2021 at 17:21 Comment(3)
This is a very old post, but still... If you read my question, I clearly said it is not a TCP socket; it is a UNIX Domain Socket, so completely local interprocess communication.Gorget
@Gorget Apologies, I realize my terminology was wrong. UDP/datagram shouldn't even be part of this conversation, and I should have said "stream socket" instead of TCP. I've been inclined to use the latter since UNIX stream sockets present exactly like TCP sockets to your program once you plug in their address, since you use the same syscalls and listen/connect dynamic on them, and you can run TCP protocols such as HTTP on them. The bulk of my answer still stands, though, since it's taken from the documentation which does not mention TCP - the error is fully mine.Vickey
In my case there was no data getting stuck as such. If I remember correctly (it has been 9 years), the problem was that I was reaching the system's default maximum connection and file limits. I increased settings such as fs.file-max and net.core.somaxconn and the problem went away.Gorget
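
For anyone who wants to check the limits mentioned in these comments before a load test, here is a small sketch. It assumes a Linux system, where the sysctl keys `fs.file-max` and `net.core.somaxconn` are exposed as files under `/proc/sys`; the class name `LimitCheck` and the helper `readLimit` are hypothetical names used only for illustration. On other platforms, or if a file cannot be read, the helper returns -1 rather than failing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LimitCheck {

    /**
     * Reads a single numeric kernel setting from a /proc/sys file.
     * Returns -1 if the file is missing, unreadable, or not a number
     * (for example on a non-Linux system).
     */
    static long readLimit(String procPath) {
        try {
            byte[] raw = Files.readAllBytes(Paths.get(procPath));
            return Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
        } catch (IOException | NumberFormatException | SecurityException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        // Assumed Linux locations of the two sysctl values named in the thread.
        System.out.println("fs.file-max        = " + readLimit("/proc/sys/fs/file-max"));
        System.out.println("net.core.somaxconn = " + readLimit("/proc/sys/net/core/somaxconn"));
    }
}
```

This only reports the current values. Whether raising them helps depends on whether the listening process is actually being starved of file descriptors or backlog slots under load.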
