Node echo server degrades 10x when stream pipes are used over buffering

On node v8.1.4 and v6.11.1

I started out with the following echo server implementation, which I will refer to as pipe.js or pipe.

const http = require('http');

const handler = (req, res) => req.pipe(res);
http.createServer(handler).listen(3001);

I benchmarked it with wrk and the following lua script (shortened for brevity), which sends a small body as the payload.

wrk.method = "POST"
wrk.body   = string.rep("a", 10)
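
The invocation itself looked roughly like this (the script filename, thread count, and duration here are my guesses; the 100 connections match the hundred connections that show up later under strace):

wrk -t2 -c100 -d30s -s post.lua http://localhost:3001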

At 2k requests per second and 44ms of average latency, performance is not great.

So I wrote another implementation that uses intermediate buffers until the request is finished and then writes those buffers out. I will refer to this as buffer.js or buffer.

const http = require('http');

const handler = (req, res) => {
  let buffs = [];
  req.on('data', (chunk) => {
    buffs.push(chunk);
  });
  req.on('end', () => {
    res.write(Buffer.concat(buffs));
    res.end();
  });
};
http.createServer(handler).listen(3001);

Performance drastically changed with buffer.js servicing 20k requests per second at 4ms of average latency.

Visually, the graph below depicts the average number of requests serviced over 5 runs and various latency percentiles (p50 is median).

[Figure: buffer vs pipe — average requests/sec over 5 runs and latency percentiles]

So, buffer is an order of magnitude better in all categories. My question is why?

What follows are my investigation notes; hopefully they are at least educational.

Response Behavior

Both implementations have been crafted so that they give the exact same response, as checked with curl -D - --raw. Given a body of 10 d's, both return the exact same response (with a different Date, of course):

HTTP/1.1 200 OK
Date: Thu, 20 Jul 2017 18:33:47 GMT
Connection: keep-alive
Transfer-Encoding: chunked

a
dddddddddd
0

Both output 128 bytes (remember this).
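
For reference, the response above can be reproduced with an invocation along these lines (the port comes from the server code; the exact flags beyond -D - --raw are my assumption):

curl -D - --raw -d 'dddddddddd' http://localhost:3001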

The Mere Fact of Buffering

Semantically, the only difference between the two implementations is that pipe.js writes data before the request has ended. This might make one suspect that there are multiple data events in buffer.js. This is not true.

req.on('data', (chunk) => {
  console.log(`chunk length: ${chunk.length}`);
  buffs.push(chunk);
});
req.on('end', () => {
  console.log(`buffs length: ${buffs.length}`);
  res.write(Buffer.concat(buffs));
  res.end();
});

Empirically:

  • Chunk length will always be 10
  • Buffers length will always be 1

Since there will only ever be one chunk, what happens if we remove buffering and implement a poor man's pipe:

const http = require('http');

const handler = (req, res) => {
  req.on('data', (chunk) => res.write(chunk));
  req.on('end', () => res.end());
};
http.createServer(handler).listen(3001);

Turns out, this has performance as abysmal as pipe.js. I find this interesting because the same number of res.write and res.end calls are made, with the same parameters. My best guess so far is that the performance difference is due to sending response data after the request data has ended.

Profiling

I profiled both applications using the simple profiling guide (--prof).
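
Concretely, that amounts to something like the following (the isolate log filename varies per run, hence the glob):

node --prof pipe.js                # run the server under the V8 tick profiler
node --prof-process isolate-*.log  # post-process the generated tick log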

I've included only the relevant lines:

pipe.js

 [Summary]:
   ticks  total  nonlib   name
   2043   11.3%   14.1%  JavaScript
  11656   64.7%   80.7%  C++
     77    0.4%    0.5%  GC
   3568   19.8%          Shared libraries
    740    4.1%          Unaccounted

 [C++]:
   ticks  total  nonlib   name
   6374   35.4%   44.1%  syscall
   2589   14.4%   17.9%  writev

buffer.js

 [Summary]:
   ticks  total  nonlib   name
   2512    9.0%   16.0%  JavaScript
  11989   42.7%   76.2%  C++
    419    1.5%    2.7%  GC
  12319   43.9%          Shared libraries
   1228    4.4%          Unaccounted

 [C++]:
   ticks  total  nonlib   name
   8293   29.6%   52.7%  writev
    253    0.9%    1.6%  syscall

We see that in both implementations, C++ dominates the time; however, the functions that dominate are swapped. Syscalls account for nearly half the time for pipe, yet only 1% for buffer (forgive my rounding). Next step: which syscalls are the culprit?

Strace Here We Come

Invoking strace like strace -c node pipe.js will give us a summary of the syscalls. Here are the top syscalls:

pipe.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.91    0.014974           2      9492           epoll_wait
 25.57    0.008720           0    405693           clock_gettime
 20.09    0.006851           0     61748           writev
  6.11    0.002082           0     61803       106 write

buffer.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 42.56    0.007379           0    121374           writev
 32.73    0.005674           0    617056           clock_gettime
 12.26    0.002125           0    121579           epoll_ctl
 11.72    0.002032           0    121492           read
  0.62    0.000108           0      1217           epoll_wait

The top syscall for pipe (epoll_wait), at 44% of the time, accounts for only 0.6% of the time for buffer (a ~140x difference in absolute time spent). While there is a large time discrepancy, the call counts are less lopsided, with pipe calling epoll_wait ~8x more often. We can derive a couple of useful bits from this: pipe calls epoll_wait constantly, and on average, these calls are heavier than the epoll_wait calls for buffer.

For buffer, the top syscall is writev, which is expected considering most of the time should be spent writing data to a socket.

Logically, the next step is to look at the individual epoll_wait calls with regular strace. This showed that for buffer, epoll_wait always returned 100 events (representing the hundred connections used with wrk), while pipe returned fewer than 100 most of the time. Like so:

pipe.js

epoll_wait(5, [.16 snip.], 1024, 0) = 16

buffer.js

epoll_wait(5, [.100 snip.], 1024, 0) = 100

Graphically:

[Figure: number of events returned per epoll_wait call, buffer vs pipe]

This explains why there are more epoll_wait calls in pipe: epoll_wait doesn't service all the connections in a single pass through the event loop. The epoll_wait calls that return zero events make the event loop look idle! None of this explains why epoll_wait takes up more time for pipe, since the man page states that epoll_wait should return immediately:

specifying a timeout equal to zero causes epoll_wait() to return immediately, even if no events are available.

While the man page says the function returns immediately, can we confirm this? strace -T to the rescue:

[Figure: distribution of epoll_wait call durations (strace -T), buffer vs pipe]

Besides confirming that buffer makes fewer calls, we can also see that nearly all of its calls took less than 100ns. Pipe has a much more interesting distribution, showing that while most calls take under 100ns, a non-negligible number take longer and land in microsecond territory.

Strace did find another oddity, and that's with writev. The return value is the number of bytes written.

pipe.js

writev(11, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20 J"..., 109},
  {"\r\n", 2}, {"dddddddddd", 10}, {"\r\n", 2}], 4) = 123

buffer.js

writev(11, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20 J"..., 109},
  {"\r\n", 2}, {"dddddddddd", 10}, {"\r\n", 2}, {"0\r\n\r\n", 5}], 5) = 128

Remember when I said that both output 128 bytes? Well, writev returned 123 bytes for pipe and 128 for buffer. The five-byte difference for pipe is reconciled by a subsequent write call for each writev.

write(44, "0\r\n\r\n", 5)

And if I'm not mistaken, write syscalls are blocking.

Conclusion

If I had to make an educated guess, I would say that piping while the request is not finished causes write calls. These blocking calls significantly reduce throughput, partially through more frequent epoll_wait calls. Why write is called instead of the single writev seen in buffer is beyond me. Can someone explain why everything I saw is happening?

The kicker? The official Node.js guide starts with the buffer implementation and then moves on to pipe! If the pipe implementation is in the official guide, there shouldn't be such a performance hit, right?

Aside: the real-world performance implications of this question should be minimal, as the question is quite contrived, especially in regard to the functionality and the body size, though that doesn't make it any less useful a question. Hypothetically, an answer could look like "Node.js uses write to allow for better performance under x situations (where x is a more real-world use case)".


Disclosure: this question was copied and slightly modified from my blog post in the hopes that this is a better avenue for getting it answered.


July 31st 2017 EDIT

My initial hypothesis that writing the echoed body after the request stream has finished increases performance has been disproved by @robertklep with his readable.js (or readable) implementation:

const http   = require('http');
const BUFSIZ = 2048;

const handler = (req, res) => {
  req.on('readable', _ => {
    let chunk;
    while (null !== (chunk = req.read(BUFSIZ))) {
      res.write(chunk);
    }
  });
  req.on('end', () => {
    res.end();
  });
};
http.createServer(handler).listen(3001);

Readable performed at the same level as buffer while writing data before the end event. If anything, this makes me more confused, because the only difference between readable and my initial poor man's pipe implementation is the choice between the data and readable events, and yet that caused a 10x performance increase. But we know that the data event isn't inherently slow, because we used it in our buffer code.

For the curious, strace on readable reported that writev outputs all 128 bytes, like buffer.

This is perplexing!

Melantha answered 25/7, 2017 at 13:57 Comment(9)
You dug further than I would have, and thank you for this. But have you considered the fact that piping will greatly reduce response times if you are echo'ing MB's of data? Also, having to buffer everything before echo'ing will saturate RAM and get slower than pipes on high concurrency.Olympia
That was my thought too, but what I get stuck on is when I did the poor man's pipe implementation I saw the same performance drop even though write and end were being called the same number of times with the same parameters. I realize the example is contrived but why would node choose write over writev when piping?Melantha
What version of node did you test this on? Have you tried other versions? Have you tried larger request bodies? I also highly recommend that you file this research as an issue, since this may very well be a performance regression.Varus
This is node v8.1.4 (so one of the newest). I tried with the latest LTS (6.11.1) and saw the same performance characteristics. I will edit post with this updated information. Also larger request bodies see performance differences shrink, but I chose the (contrived) body size to highlight differences and make the investigation easier.Melantha
I wish I could upvote this twice. Well-formed, well-researched questions such as this are unfortunately rare on this site.Inevitable
If you feel like it, try this: gist.github.com/robertklep/1682ae054ebac9e822a9f0c30ac2ab8f It's similar to your poor man's pipe, but on my MBP is the fastest of the three (pipe.js: 13.8K/s, buffer.js: 19K/s, readable.js: 21K/s)Conah
@Conah thanks for more info! I tested and found buffer and readable to not be too statistically different, so my initial hypothesis about the slowdown coming from writing while reading is now disproved!Melantha
BTW - the docs say that readable is considered faster than both pipe and buffer - fwiwIronstone
How would I use this in conjunction with Express?Pterodactyl

That's a funny question you have!

In fact, buffered vs piped is not the question here. You have a small chunk; it is processed in one event. To show the issue at hand, you can write your handler like this:

let chunk;
req.on('data', (dt) => {
  chunk = dt;
});
req.on('end', () => {
  res.write(chunk);
  res.end();
});

or

let chunk;
req.on('data', (dt) => {
  chunk = dt;
  res.write(chunk);
  res.end();
});
req.on('end', () => {
});

or

let chunk;
req.on('data', (dt) => {
  chunk = dt;
  res.write(chunk);
});
req.on('end', () => {
  res.end();
});

If write and end are in the same handler, latency is 10 times lower.

If you check the code of the write function, you will find these lines:

msg.connection.cork();
process.nextTick(connectionCorkNT, msg.connection);

It corks the connection and uncorks it on the next tick. This means that the data is kept in a cache, then forced out on the next tick, before other events are processed.
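
As a minimal sketch of that pattern (this is a simplification, not Node's actual internals; the port is made up):

const net = require('net');

const server = net.createServer((socket) => {
  socket.cork();  // buffer subsequent writes in memory
  socket.write('HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n');
  socket.write('hello');
  process.nextTick(() => {
    socket.uncork();  // flush both writes to the socket at once
    socket.end();
  });
});
server.listen(3002);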

To sum up, if you have write and end in different handlers, you will have:

  1. cork connection (+ create a tick to uncork)
  2. create buffer with data
  3. uncork connection from another event (send data)
  4. call end process (that send another packet with the final chunk and close)

If they are in the same handler, the end function is called before the uncork event is processed, so the final chunk will be in the cache.

  1. cork connection
  2. create buffer with data
  3. add "end" chunk on buffer
  4. uncork connection to send everything

Also, the end function runs cork / uncork synchronously, which will be a little bit faster.

Now why does this matter? Because on the TCP side, if you send a packet with data and wish to send more, the process will wait for an acknowledgement from the client before sending more:

write + end in different handlers:

About 40ms for ack

  • 0.044961s: POST / => it is the request
  • 0.045322s: HTTP/1.1 => 1st chunk : header + "aaaaaaaaa"
  • 0.088522s: acknowledge of packet
  • 0.088567s: Continuation => 2nd chunk (ending part, 0\r\n\r\n)

There is ~40 ms before ack after the 1st buffer is sent.

write + end in the same handler:

No ack needed

The data is complete in a single packet, no ack needed.

Why the 40ms before the ACK? It is a built-in feature of the OS to improve performance overall, described in section 4.2.3.2 of IETF RFC 1122, "When to Send an ACK Segment". Red Hat (Fedora/CentOS/RHEL) uses 40ms: it is a parameter and can be modified. On Debian (Ubuntu included), it seems to be hardcoded to 40ms, so it's not modifiable (except if you create a connection with the TCP_NODELAY option).
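
In Node, the Nagle side of this interaction can be turned off per socket with setNoDelay(), which sets TCP_NODELAY on the connection. A sketch (whether this helps here depends on which side is delaying; I have not benchmarked it against this case):

const http = require('http');

const handler = (req, res) => req.pipe(res);
const server = http.createServer(handler);
// Disable Nagle's algorithm so small writes go out immediately
// instead of waiting to coalesce with later data.
server.on('connection', (socket) => socket.setNoDelay(true));
server.listen(3001);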

I hope this is enough detail to understand a little bit more about the process. This answer is already big, so I will stop here, I guess.

Readable

I checked your note about readable. Wild guess: if readable detects an empty input it closes the stream on the same tick.

Edit: I read the code for readable. As I suspected:

https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L371

https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L1036

If read() finishes during an event, the end event is emitted immediately, to be processed next.

So the event processing is:

  1. readable event: reads data
  2. readable detects it has finished => creates end event
  3. You write data so that it creates an event to uncork
  4. end event processed (uncork done)
  5. uncork processed (but does nothing, as everything is already done)

If you reduce the buffer:

req.on('readable',()=> {
    let chunk2;
    while (null !== (chunk2 = req.read(5))) {
        res.write(chunk2);
    }
});

This forces two writes. The process will be:

  1. readable event: reads data. You get five a's.
  2. You write data that creates an uncork event
  3. You read data; readable detects it has finished => creates end event
  4. You write data and it is added to the buffered data
  5. uncork processed (because it was launched before end); you send data
  6. end event processed (uncork done) => wait for ACK to send final chunk
  7. The process will be slow (it is; I checked)
Convery answered 1/8, 2017 at 7:6 Comment(1)
Wow, thanks for cracking the case! I'll accept this answer and give you the bounty. With your lead I'll continue to investigate, especially considering that node v8.2.0 came out with changes related to corkMelantha

The clue is in the latency: the latencies differ by roughly 10x. I think that because the buffering approach moves the write call to req.on('end', ...), the server can optimize the response. Even though only one ten-byte buffer is read and written in any given request, there are many simultaneous requests being made.

Roughly estimating, with 2K 10-byte requests per second and ~50ms latency, I figure the time spent actually transmitting the data is negligible. That would suggest the server is handling about 100 simultaneous requests at any given time.

1 / .05 = 20.  2000/20 = 100

Now switch to ~5ms latency, and again considering the actual data tx time is 0.

1 / .005 = 200.  20000/200 = 100.

Still, we have the server handling roughly 100 concurrent requests at any point in time.
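
That back-of-the-envelope calculation is just Little's law (requests in flight ≈ throughput × latency) restated; as a sketch:

// Little's law: requests in flight ≈ throughput (req/s) × latency (s)
const inFlight = (rps, latencySeconds) => rps * latencySeconds;

console.log(inFlight(2000, 0.05));   // pipe:   ~100 concurrent requests
console.log(inFlight(20000, 0.005)); // buffer: ~100 concurrent requests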

I don't know the server internals, but if your server is hitting an upper limit like this, it is probably introducing latency to have the 'data' event handler also handle writing data to the response.

By buffering and returning immediately, the handler can be released sooner, drastically cutting latency on the read side. The open question in my mind is: would the handler really need almost 50ms of overhead to write the response? I would think not, but if 100 requests are competing for resources to write their data, that might begin to add up. Couple that with the fact that res.end() still needs to be called (in a different handler) and you've potentially found your latency hog.

On the 'end' processing, 20K 10-byte responses can hardly be called much of a data load, so it is again a question of resource management, i.e., the response completion handler(s). If res.write() and res.end() happen in the same handler, that is presumably more efficient than writing the 10 bytes in one and ending the response in another. Either way, I cannot imagine that the response completion code is introducing any lag. It is more likely that it is starved for work to do (even in the buffered approach).

EDIT

You could also try res.end(data) in your buffering approach instead of calling res.write(data) followed by res.end(), and see if that adds any clarifying data points to your analysis.
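
A sketch of that variant (buffer.js from the question, with the final write folded into end()):

const http = require('http');

const handler = (req, res) => {
  let buffs = [];
  req.on('data', (chunk) => buffs.push(chunk));
  req.on('end', () => {
    // Hand the whole payload to end() so write and end happen in one call.
    res.end(Buffer.concat(buffs));
  });
};
http.createServer(handler).listen(3001);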

EDIT

I just tried the same test on my systems. I used an Ubuntu Linux VM on another physical machine as the client, wrk as the test bench, a lua script like yours, and default settings. I used a Windows 8 desktop to run nodejs, with the same script except using port 8080. My performance for both pipe() and buffering was much lower than yours, but compared to each other, buffers were about 9x faster than pipe(). So, this is simply an independent confirmation.

Ironstone answered 1/8, 2017 at 2:25 Comment(2)
Node.js does not use threads in the event loop for socket connections.Isadora
@MatteoCollina - yes, poor choice of words, I used the word thread conceptually, but after re-reading, saw how it would be read concretely despite my summarizing the whole process as threadless. I have updated my answer, and since have run the same test as the OP.Ironstone
