Reporting upload progress from node.js

I'm writing a small node.js application that receives a multipart POST from an HTML form and pipes the incoming data to Amazon S3. The formidable module provides the multipart parsing, exposing each part as a Node stream. The knox module handles the PUT to S3.

var form = new formidable.IncomingForm()
  , s3 = knox.createClient(conf);

form.onPart = function(part) {
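    // filename and headers are assumed to be defined elsewhere; they're not
    // shown in this excerpt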
    var put = s3.putStream(part, filename, headers, handleResponse);
    put.on('progress', handleProgress);
};

form.parse(req);

I'm reporting the upload progress to the browser client via socket.io, but am having difficulty getting these numbers to reflect the real progress of the node to s3 upload.

When the browser-to-node upload happens near-instantaneously, as it does when the node process is running on the local network, the progress indicator reaches 100% immediately. If the file is large, e.g. 300 MB, the progress indicator rises slowly, but still faster than our upstream bandwidth would allow. After hitting 100% progress, the client then hangs, presumably waiting for the S3 upload to finish.

I know putStream uses Node's stream.pipe method internally, but I don't understand the details of how this really works. My assumption is that node gobbles up the incoming data as fast as it can, throwing it into memory. If the write stream can take the data fast enough, little data is kept in memory at once, since it can be written and discarded. If the write stream is slow, though, as it is here, we presumably have to keep all that incoming data in memory until it can be written. Since we're listening for data events on the read stream in order to emit progress, we end up reporting the upload as going faster than it really is.
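
For reference, my rough mental model of what pipe does with these classic streams is something like the sketch below (generic source/dest names, not the actual Node internals):

// Rough sketch of classic-streams piping, not the real Node source.
// Data is forwarded as fast as the source emits it; the source is only
// paused when the destination reports its write buffer is full.
source.on('data', function(chunk) {
    var flushed = dest.write(chunk);    // false means dest is buffering in memory
    if (!flushed && source.pause) {
        source.pause();                 // a no-op if pause isn't really implemented
    }
});

dest.on('drain', function() {
    if (source.resume) source.resume(); // safe to emit "data" again
});

source.on('end', function() {
    dest.end();
});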

Is my understanding of this problem anywhere close to the mark? How might I go about fixing it? Do I need to get down and dirty with write, drain and pause?

Combings asked 9/11, 2012 at 15:45 Comment(4)
Are you reporting the progress back to the browser inside the handleProgress callback? You have not posted any code that might have anything to do with the actual progress reporting. Posting more code is likely to help. – Assume
What version of Node.js are you using? Apparently there was a bug with request.pause() (in your case, the part variable) in Node.js v0.6.x, which is used internally by .pipe(). This should be fixed in v0.7+. – Alixaliza
@Assume - That's right. The actual implementation isn't really relevant though: for the purposes of the question it might as well be console.log. – Combings
@Alixaliza - Thanks, that's useful to know. I'm running 0.8.8 though. – Combings

Your problem is that stream.pause isn't implemented on the part, which is a very simple read stream of the output from the multipart form parser.

Knox instructs the S3 request to emit "progress" events whenever the part emits "data". However, since the part stream ignores pause, the progress events are emitted as fast as the form data is uploaded and parsed.

The formidable form, however, does know how to both pause and resume (it proxies the calls to the request it's parsing).

Something like this should fix your problem:

form.onPart = function(part) {

    // once pause is implemented, the part will be able to throttle the speed
    // of the incoming request
    part.pause = function() {
      form.pause();
    };

    // resume is the counterpart to pause, and will fire after the `put` emits
    // "drain", letting us know that it's ok to start emitting "data" again
    part.resume = function() {
      form.resume();
    };

    var put = s3.putStream(part, filename, headers, handleResponse);
    put.on('progress', handleProgress);
};
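
For completeness, relaying those progress events to the browser could look roughly like the sketch below. Here socket is assumed to be the client's connected socket.io socket, and the exact shape of the object knox hands to the progress handler can differ between versions, so verify the payload against your installed release:

// Sketch only: `socket` is this client's connected socket.io socket, and the
// progress object's fields (e.g. written/total/percent) depend on your knox version.
function handleProgress(progress) {
    socket.emit('upload-progress', progress);
}

function handleResponse(err, res) {
    if (err || res.statusCode !== 200) {
        socket.emit('upload-error');
        return;
    }
    socket.emit('upload-complete');
}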
Seroka answered 13/11, 2012 at 0:58 Comment(3)
Thanks @numbers1311407, great answer. I'm bound to ask: can you see any significant drawbacks to implementing pause and resume this way? In effect I suppose it makes our server at most only as responsive as S3 is. I've implemented it in the test code here. – Combings
As I'm no I/O whiz myself, I tend to wonder the same thing. But the node.js stream doc page does mention upload throttling as a useful case for pause. This newsgroup discussion about the request.pause "bug" is worth perusing (Mikeal and Marco's comments). – Seroka
In the end it solves two problems for you: 1) it keeps the client on the line until the actual upload has completed, and 2) it allows this to happen without buffering potentially large amounts of data on the server. You could also solve this problem by piping to a buffered stream before the S3 request, monitoring the progress there, and calling back the client when the upload finishes. But this throws out #2. – Seroka
