TCP: When is EPOLLHUP generated?
Also see this question, unanswered as of now.

There is a lot of confusion about EPOLLHUP, even in the man pages and kernel docs. People seem to believe it is returned when polling a descriptor that was locally closed for writing, i.e. after shutdown(SHUT_WR), the same call that causes an EPOLLRDHUP at the peer. But this is not true: in my experiments I get EPOLLOUT, and no EPOLLHUP, after shutdown(SHUT_WR) (yes, it's counterintuitive to get writable when the writing half is closed, but this is not the main point of the question).

The man page is poor: it says EPOLLHUP comes when "Hang up happened on the associated file descriptor", without saying what "hang up" means. What did the peer do? What packets were sent? This other article just confuses things further and seems outright wrong to me.

My experiments show that EPOLLHUP arrives once EOF (FIN packets) has been exchanged both ways, i.e. once both sides have issued shutdown(SHUT_WR). It has nothing to do with SHUT_RD, which I never call, and nothing to do with close either. In terms of packets, I suspect that EPOLLHUP is raised on the ACK of the host's own FIN, i.e. the termination initiator raises this event in step 3 of the 4-way shutdown handshake, and the peer in step 4 (see here). If confirmed, this is great, because it fills a gap I've been trying to fill, namely how to poll non-blocking sockets for the final ACK without LINGER. Is this correct?

(note: I'm using ET, but I don't think it's relevant for this)

Sample code and output.

The code being in a framework, I extracted the meat of it, with the exception of TcpSocket::createListener, TcpSocket::connect and TcpSocket::accept, which do what you'd expect (not shown here).

void registerFd(int pollFd, int fd, const char* description)
{
    epoll_event ev = {
        EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET,
        const_cast<char*>(description) // union aggregate initialisation, initialises first member (void* ptr)
    };
    epoll_ctl(pollFd, EPOLL_CTL_ADD, fd, &ev);
}

struct EventPrinter
{
    friend std::ostream& operator<<(std::ostream& stream, const EventPrinter& obj)
    {
        // std::hex is sticky on the stream, so switch back to decimal right after the value
        return stream << "0x" << std::hex << obj.events_ << std::dec << " = "
            << ((obj.events_& EPOLLIN) ? "EPOLLIN " : " ")
            << ((obj.events_& EPOLLOUT) ? "EPOLLOUT " : " ")
            << ((obj.events_& EPOLLERR) ? "EPOLLERR " : " ")
            << ((obj.events_& EPOLLRDHUP) ? "EPOLLRDHUP " : " ")
            << ((obj.events_& EPOLLHUP) ? "EPOLLHUP " : " ");
    }

    const uint32_t events_;
};

void processEvents(int pollFd)
{
    static int iterationCount = 0;
    ++iterationCount;

    std::array<epoll_event, 25> events;
    int eventCount;
    if (-1 ==
        (eventCount = epoll_wait(pollFd, events.data(), events.size(), 1)))
    {
        throw Exception("fatal: epoll_wait failed");
    }

    for (int i = 0; i < eventCount; ++i)
    {
        std::cout << "iteration #" << iterationCount << ": events on [" << static_cast<const char*>(events[i].data.ptr) << "]: [" << EventPrinter{events[i].events} << "]" << std::endl;
    }
}

TEST(EpollhupExample, SmokeTest)
{
    int pollFd_;
    if (-1 ==
        (pollFd_ = epoll_create1(0)))
    {
        throw Exception("fatal: could not create epoll socket");
    }

    const TcpSocket listener_ = TcpSocket::createListener(13500);
    if (!listener_.setFileStatusFlag(O_NONBLOCK, true))
        throw Exception("could not make listener socket non-blocking");
    registerFd(pollFd_, listener_.fd(), "listenerFD");

    const TcpSocket client = TcpSocket::connect("127.0.0.1", AF_INET, 13500);
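    // a bare `throw;` with no active exception calls std::terminate, so throw explicitly
    if (!client.valid()) throw Exception("could not connect client socket");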
    registerFd(pollFd_, client.fd(), "clientFD");





    //////////////////////////////////////////////
    /// start event processing ///////////////////
    //////////////////////////////////////////////

    processEvents(pollFd_); // iteration 1

    const TcpSocket conn = listener_.accept();
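    // a bare `throw;` with no active exception calls std::terminate, so throw explicitly
    if (!conn.valid()) throw Exception("could not accept connection");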
    registerFd(pollFd_, conn.fd(), "serverFD");

    processEvents(pollFd_); // iteration 2

    conn.shutdown(SHUT_WR);

    processEvents(pollFd_); // iteration 3

    client.shutdown(SHUT_WR);

    processEvents(pollFd_); // iteration 4
}

Output:

    Info| TCP connection established to [127.0.0.1:13500]
iteration #1: events on [listenerFD]: [1 = EPOLLIN     ]
iteration #1: events on [clientFD]: [4 =  EPOLLOUT    ]
    Info| TCP connection accepted from [127.0.0.1:35160]

iteration #2: events on [serverFD]: [4 =  EPOLLOUT    ]
    // calling serverFD.shutdown(SHUT_WR) here

iteration #3: events on [clientFD]: [2005 = EPOLLIN EPOLLOUT  EPOLLRDHUP  ]           // EPOLLRDHUP arrives, nice.
iteration #3: events on [serverFD]: [4 =  EPOLLOUT    ]                               // serverFD (on which I called SHUT_WR) just reported as writable, not cool... but not the main point of the question
    // calling clientFD.shutdown(SHUT_WR) here

iteration #4: events on [serverFD]: [2015 = EPOLLIN EPOLLOUT  EPOLLRDHUP EPOLLHUP ]   // EPOLLRDHUP arrives, nice. EPOLLHUP too!
iteration #4: events on [clientFD]: [2015 = EPOLLIN EPOLLOUT  EPOLLRDHUP EPOLLHUP ]   // EPOLLHUP on the other side as well. Why? What does EPOLLHUP mean actually?

There is no better way to phrase the question than: what does EPOLLHUP mean? I have made the case that the documentation is poor, and that information in other places (e.g. here and here) is wrong or useless.

Note: To consider the Q answered, I want confirmation that EPOLLHUP is raised on the final FIN-ACKs of both directions.

Shell answered 24/10, 2018 at 18:56 Comment(1)
(note: this question is a re-post after adding clarifications and sample code, as requested by community members) - Shell

For this kind of question, use the source! Among other interesting comments, there is this text:

EPOLLHUP is UNMASKABLE event (...). It means that after we received EOF, poll always returns immediately, making impossible poll() on write() in state CLOSE_WAIT. One solution is evident --- to set EPOLLHUP if and only if shutdown has been made in both directions.

And then the only code that sets EPOLLHUP:

if (sk->sk_shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
    mask |= EPOLLHUP;

Here SHUTDOWN_MASK is equal to RCV_SHUTDOWN | SEND_SHUTDOWN.

TL;DR: You are right, this flag is only set when the socket has been shut down for both reading and writing (I reckon that the peer shutting down its write side is equivalent to me shutting down my read side). Or when the connection is closed, of course.
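A self-contained way to check this on a Linux machine could look like the sketch below. It is not the asker's framework code: `makeTcpPair`, `pollUntil`, `Observed` and `runExperiment` are hypothetical names introduced here, the connection runs over loopback on a kernel-chosen port, and polling is level-triggered with a retry window of about one second to allow for FIN delivery.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>

// Hypothetical helper: returns a connected {client, server} TCP pair on loopback.
std::pair<int, int> makeTcpPair()
{
    const int listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) throw std::runtime_error("socket failed");
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; // let the kernel pick a free port
    socklen_t len = sizeof(addr);
    if (bind(listener, reinterpret_cast<sockaddr*>(&addr), len) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &len) < 0)
        throw std::runtime_error("listener setup failed");
    const int client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0 || connect(client, reinterpret_cast<sockaddr*>(&addr), len) < 0)
        throw std::runtime_error("connect failed");
    const int server = accept(listener, nullptr, nullptr);
    if (server < 0) throw std::runtime_error("accept failed");
    close(listener);
    return {client, server};
}

// Hypothetical helper: level-triggered poll of one fd, retried for up to ~1 s
// until every bit in `wanted` is reported. Returns the last mask seen.
uint32_t pollUntil(int fd, uint32_t wanted)
{
    const int ep = epoll_create1(0);
    assert(ep >= 0);
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP;
    const int rc = epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
    assert(rc == 0);
    (void)rc;
    epoll_event out{};
    for (int attempt = 0; attempt < 100; ++attempt) {
        out.events = 0;
        if (epoll_wait(ep, &out, 1, 10) == 1 && (out.events & wanted) == wanted)
            break;
        usleep(10000); // give the peer's FIN time to arrive
    }
    close(ep);
    return out.events;
}

struct Observed {
    uint32_t serverAfterOwnShutdown;  // server called shutdown(SHUT_WR)
    uint32_t clientAfterPeerShutdown; // client has received the server's FIN
    uint32_t clientAfterBoth;         // client called shutdown(SHUT_WR) as well
    uint32_t serverAfterBoth;         // server has received the client's FIN
};

Observed runExperiment()
{
    const std::pair<int, int> fds = makeTcpPair();
    const int client = fds.first;
    const int server = fds.second;
    Observed o{};
    shutdown(server, SHUT_WR);
    o.serverAfterOwnShutdown = pollUntil(server, EPOLLOUT);
    o.clientAfterPeerShutdown = pollUntil(client, EPOLLRDHUP);
    shutdown(client, SHUT_WR);
    o.clientAfterBoth = pollUntil(client, EPOLLHUP);
    o.serverAfterBoth = pollUntil(server, EPOLLHUP);
    close(client);
    close(server);
    return o;
}
```

If the kernel condition quoted above holds, EPOLLHUP should be absent from the first two masks and present in the last two.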

UPDATE: From reading the source code with more detail, these are my conclusions.

About shutdown:

  1. Doing shutdown(SHUT_WR) sends a FIN and marks the socket with SEND_SHUTDOWN.
  2. Doing shutdown(SHUT_RD) sends nothing and marks the socket with RCV_SHUTDOWN.
  3. Receiving a FIN marks the socket with RCV_SHUTDOWN.

And about epoll:

  1. If the socket is marked with SEND_SHUTDOWN and RCV_SHUTDOWN, poll will return EPOLLHUP.
  2. If the socket is marked with RCV_SHUTDOWN, poll will return EPOLLRDHUP.

So the HUP events can be read as:

  1. EPOLLRDHUP: you have received FIN or you have called shutdown(SHUT_RD). In any case your reading half-socket is hung, that is, you will read no more data.
  2. EPOLLHUP: both of your half-sockets are hung. The reading half-socket is as in the previous point. For the sending half-socket, you did something like shutdown(SHUT_WR).
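The conclusions above can be condensed into a small user-space model. This is only an illustration of the rules as I read them, not kernel code: `hangupEvents` and the `k...` constants are names made up here (the values are meant to mirror the kernel's internal RCV_SHUTDOWN and SEND_SHUTDOWN flags), and everything else the real poll routine reports, such as EPOLLOUT or EPOLLERR, is left out.

```cpp
#include <sys/epoll.h>
#include <cassert>
#include <cstdint>

// Illustrative stand-ins for the kernel-internal shutdown flags.
constexpr unsigned kRcvShutdown = 1;   // FIN received, or shutdown(SHUT_RD) called
constexpr unsigned kSendShutdown = 2;  // shutdown(SHUT_WR) called, FIN sent
constexpr unsigned kShutdownMask = kRcvShutdown | kSendShutdown;

// Simplified model: which hang-up related events a poll would report
// for a given shutdown state.
uint32_t hangupEvents(unsigned skShutdown, bool closed)
{
    assert((skShutdown & ~kShutdownMask) == 0);
    uint32_t mask = 0;
    if (skShutdown == kShutdownMask || closed)
        mask |= EPOLLHUP;              // both halves hung, or socket closed
    if (skShutdown & kRcvShutdown)
        mask |= EPOLLIN | EPOLLRDHUP;  // reading half hung: reads return EOF
    return mask;
}
```

Under this model, `hangupEvents(kSendShutdown, false)` is 0, which matches the question's iteration #3: the side that only called shutdown(SHUT_WR) sees no hang-up event at all.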

To complete a graceful shutdown I would do:

  1. Do shutdown(SHUT_WR) to send a FIN and mark the end of sending data.
  2. Wait for the peer to do the same by polling until you get an EPOLLRDHUP.
  3. Now you can close the socket with grace.
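The three steps could be sketched roughly as follows. `gracefulClose` and `demoGracefulClose` are hypothetical names, the timeout is an addition of mine, and the demo uses an AF_UNIX socketpair as a stand-in for a TCP connection purely to keep the example short (the function itself takes any connected stream socket). Real code would probably also read any pending data before closing.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>

// Hypothetical helper following the three steps above. Returns true if the
// peer's end-of-stream was seen before the timeout. Always closes fd.
bool gracefulClose(int fd, int timeoutMs)
{
    assert(fd >= 0);
    bool peerDone = false;
    const int ep = epoll_create1(0);
    if (ep >= 0 && shutdown(fd, SHUT_WR) == 0) {  // step 1: send our FIN
        epoll_event ev{};
        ev.events = EPOLLRDHUP;  // EPOLLHUP and EPOLLERR are reported regardless
        if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0) {
            epoll_event out{};
            // step 2: wait until the peer has finished sending too
            peerDone = epoll_wait(ep, &out, 1, timeoutMs) == 1 &&
                       (out.events & (EPOLLRDHUP | EPOLLHUP)) != 0 &&
                       (out.events & EPOLLERR) == 0;
        }
    }
    if (ep >= 0) close(ep);
    close(fd);  // step 3: close the socket
    return peerDone;
}

// Demo with a socketpair standing in for a TCP connection (an assumption
// made for brevity). If peerShutsDown is false, the peer never sends its
// end-of-stream and the wait is expected to time out.
bool demoGracefulClose(bool peerShutsDown)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return false;
    if (peerShutsDown) shutdown(sv[1], SHUT_WR);
    const bool ok = gracefulClose(sv[0], peerShutsDown ? 1000 : 50);
    close(sv[1]);
    return ok;
}
```

Registering only EPOLLRDHUP keeps the wait from waking up on readable or writable state, so the call returns only on the peer's shutdown, a hang-up, an error, or the timeout.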

PS: About your comment:

it's counterintuitive to get writable, as the writing half is closed

It is actually expected if you understand the output of epoll not as "ready" but as "will not block". That is, if you get EPOLLOUT you have the guarantee that calling write() will not block. And certainly, after shutdown(SHUT_WR), write() will return immediately: it fails with EPIPE (and raises SIGPIPE unless that is suppressed) rather than blocking.
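A small sketch of that "will not block" reading: shut down the writing half, poll for EPOLLOUT, then try to send. The names `WriteAfterShutdown` and `writeAfterShutdown` are hypothetical, and an AF_UNIX socketpair stands in for the TCP connection to keep the example short (the question's own iteration #3 shows the same EPOLLOUT on TCP). MSG_NOSIGNAL is used so the failed send reports an error code instead of raising SIGPIPE.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdint>

struct WriteAfterShutdown {
    uint32_t events;  // what epoll reported after shutdown(SHUT_WR)
    ssize_t sent;     // return value of send()
    int error;        // errno captured right after send()
};

WriteAfterShutdown writeAfterShutdown()
{
    int sv[2];
    const int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void)rc;
    shutdown(sv[0], SHUT_WR);  // close the writing half locally

    const int ep = epoll_create1(0);
    assert(ep >= 0);
    epoll_event ev{};
    ev.events = EPOLLOUT;
    epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);
    epoll_event out{};
    epoll_wait(ep, &out, 1, 1000);

    WriteAfterShutdown result{};
    result.events = out.events;
    errno = 0;
    // Returns at once with an error instead of blocking.
    result.sent = send(sv[0], "x", 1, MSG_NOSIGNAL);
    result.error = errno;

    close(ep);
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

So EPOLLOUT here says nothing about whether the write will succeed, only that it will return without waiting.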

Marseillaise answered 24/10, 2018 at 19:9 Comment(7)
"I reckon that the peer shutdowning the write equals to my shutdowning the read". No, you see, that is confusing. Why do you say that, when these are obviously two different things: receiving EPOLLRDHUP and calling shutdown(SHUT_RD)? - Shell
... the rest of what you say makes sense. But from RCV_SHUTDOWN | SEND_SHUTDOWN I gather that, in the kernel code you are pasting, SHUTDOWN refers to receiving and sending FIN (not to SHUT_RD and SHUT_WR). SHUT_RD is bollocks, from what I can tell. - Shell
@haelix. My understanding is that calling shutdown(SHUT_WR) sends a FIN and sets SEND_SHUTDOWN, calling shutdown(SHUT_RD) just sets RCV_SHUTDOWN, and getting a FIN from the peer also sets RCV_SHUTDOWN. And from the linked code, you can see that you get an EPOLLRDHUP only when RCV_SHUTDOWN is set. - Marseillaise
I understand, but I'm trying to highlight the (important, in my opinion) aspect that EPOLLHUP is generated as part of graceful shutdown. SHUT_RD is not graceful, it means "I'm refusing any more data", so the emphasis has to be on sending FIN and receiving FIN for the HUP to happen. I.e., the last paragraph in your answer should be improved. If we are correct, that is. - Shell
@haelix. I've browsed the source code a bit more and added my conclusions. Please let me know if you agree. - Marseillaise
Yep, it's good, particularly the paragraph "To complete a graceful shutdown I would do". One last question: is there nothing in the sources about FIN ACKs? I was also wondering when ACKs come into play. For example, if a FIN is not ACKed within a specific time, the FIN gets retransmitted, so I'm wondering if that still counts as SEND_SHUTDOWN. - Shell
@haelix: The retransmission code is complicated, the FIN code is complicated, the retransmission of FINs is... well, very complicated. But I don't think it makes any difference to the epoll usage, because the SEND_SHUTDOWN flag is set as soon as shutdown() is called. - Marseillaise
