Streaming large files in a Java servlet

I am building a Java server that needs to scale. One of the servlets will serve images stored in Amazon S3.

Recently, under load, my VM ran out of memory. This happened after I added the code to serve the images, so I'm fairly sure that streaming larger servlet responses is causing my trouble.

My question is: is there a best practice for how to code a Java servlet to stream a large (>200k) response back to a browser when the data is read from a database or other cloud storage?

I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the Tomcat servlet thread can be re-used. This seems like it would be I/O heavy.

Any thoughts would be appreciated. Thanks.

Pearlene answered 11/9, 2008 at 2:40 Comment(0)

When possible, you should not store the entire contents of a file to be served in memory. Instead, acquire an InputStream for the data, and copy the data to the Servlet OutputStream in pieces. For example:

final int FILEBUFFERSIZE = 8 * 1024; // buffer size; 8 KB is a reasonable default

ServletOutputStream out = response.getOutputStream();
InputStream in = [ code to get source input stream ];
String mimeType = [ code to get mimetype of data to be served ];

response.setContentType(mimeType);

try {
    byte[] bytes = new byte[FILEBUFFERSIZE];
    int bytesRead;
    while ((bytesRead = in.read(bytes)) != -1) {
        out.write(bytes, 0, bytesRead);
    }
} finally {
    in.close();
    out.close();
}

I do agree with toby: you should instead "point them to the S3 url."

As for the OOM exception, are you sure it has to do with serving the image data? Let's say your JVM has 256MB of "extra" memory to use for serving image data. With Google's help, "256MB / 200KB" = 1310. For 2GB of "extra" memory (these days a very reasonable amount), over 10,000 simultaneous clients could be supported. Even so, 1300 simultaneous clients is a pretty large number. Is this the type of load you experienced? If not, you may need to look elsewhere for the cause of the OOM exception.

Edit - Regarding:

In this use case the images can contain sensitive data...

When I read through the S3 documentation a few weeks ago, I noticed that you can generate time-expiring keys that can be attached to S3 URLs. So, you would not have to open up the files on S3 to the public. My understanding of the technique is:

  1. Initial HTML page has download links to your webapp
  2. User clicks on a download link
  3. Your webapp generates an S3 URL that includes a key that expires in, let's say, 5 minutes (see the sketch after this list).
  4. Your webapp sends an HTTP redirect to the client with the URL from step 3.
  5. The user downloads the file from S3. This works even if the download takes more than 5 minutes; once a download starts it can continue through completion.
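
A minimal sketch of step 3, using the (much newer) AWS SDK for Java v1; the bucket and key names, the client setup, and the helper method name are illustrative assumptions, not part of the original answer:

import java.util.Date;
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

// Hypothetical helper: returns a time-limited S3 URL for the given object.
public static String createExpiringUrl(String bucketName, String key) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Expire the signed URL 5 minutes from now.
    Date expiration = new Date(System.currentTimeMillis() + 5 * 60 * 1000);

    GeneratePresignedUrlRequest request =
            new GeneratePresignedUrlRequest(bucketName, key)
                    .withMethod(HttpMethod.GET)
                    .withExpiration(expiration);

    return s3.generatePresignedUrl(request).toString();
}

Step 4 is then just response.sendRedirect(createExpiringUrl(bucketName, key)); from the servlet.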
Cytochemistry answered 11/9, 2008 at 3:53 Comment(4)
Hmm, since no content length is set, the servlet container must buffer because it needs to set the content length header before it can stream any data. So I'm not sure how much memory you save? – Deluge
Peter, if you cannot point users directly to a cloud service URL, and you want to set the content length header, and you don't already know the size, and you cannot query the cloud service for the size, then I guess your best bet is to stream to a temp file on the server first. Of course, saving a copy on the server before sending the first byte to the client may cause the user to think the request failed depending on how long the cloud -> server transfer takes. – Cytochemistry
@PeterKriens the content-length header is not mandatory. Also, you can use chunked transfers where you only need to specify the length of a chunk. – Astrogate
When the Content-Length header is set by the servlet, the output can be sent right away without any intermediate buffers. Otherwise the server must buffer or implement chunked transfers. In general, you should be able to get the needed info with a HEAD from the blob store. – Deluge
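
To illustrate that last comment for the S3 case: the object size and type are available from the object metadata (the SDK's equivalent of a HEAD request), so the headers can be set before streaming a single byte. A sketch, assuming an AmazonS3 client named s3 and hypothetical bucketName/key variables:

import com.amazonaws.services.s3.model.ObjectMetadata;

// Fetches only metadata; no object data is transferred (equivalent to a HEAD).
ObjectMetadata metadata = s3.getObjectMetadata(bucketName, key);

response.setContentType(metadata.getContentType());
response.setContentLengthLong(metadata.getContentLength()); // Servlet 3.1+; older containers can set the header directly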

Why wouldn't you just point them to the S3 URL? Taking an artifact from S3 and then streaming it through your own server, to me, defeats the purpose of using S3, which is to offload the bandwidth and processing of serving the images to Amazon.

Usance answered 11/9, 2008 at 2:45 Comment(1)
Pointing to the S3 URL - I assume you're saying just give the browser the S3 URL to begin with. What if your images or videos are medical artifacts and are sensitive? S3 indeed supports an expiring URL. But you can't send an expiring URL via email? A URL that has not yet expired can still be used by someone else who finds it in the history, all of which is insecure, e.g. for health care products. – Erse

I've seen a lot of code like john-vasilef's (currently accepted) answer: a tight while loop reading chunks from one stream and writing them to the other stream.

The argument I'd make is against needless code duplication, in favor of using Apache's IOUtils. If you are already using it elsewhere, or if another library or framework you're using is already depending on it, it's a single line that is known and well-tested.

In the following code, I'm streaming an object from Amazon S3 to the client in a servlet.

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;

InputStream in = null;
OutputStream out = null;

try {
    in = object.getObjectContent();
    out = response.getOutputStream();
    IOUtils.copy(in, out);
} finally {
    IOUtils.closeQuietly(in);
    IOUtils.closeQuietly(out);
}

6 lines of a well-defined pattern with proper stream closing seems pretty solid.

Spile answered 23/4, 2014 at 18:31 Comment(3)
I agree with using what is available, but your code has a problem: if response.getOutputStream() generates an exception, your InputStream in object will not be closed. Also, Java 7's try-with-resources feature should be the pattern now, and future CommonsIO will have this. Nice hat :) – Antiparticle
@Evandro Good catch. What about this? (Also, thanks :) – Spile
closeQuietly is now deprecated. commons.apache.org/proper/commons-io/apidocs/org/apache/commons/… – Mezzotint
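
Following up on the comments above: with Java 7's try-with-resources, the same copy needs no closeQuietly at all, since both streams are closed automatically even if the copy throws. A sketch under the same assumptions as the answer's snippet (an S3 object and a servlet response in scope):

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;

try (InputStream in = object.getObjectContent();
     OutputStream out = response.getOutputStream()) {
    IOUtils.copy(in, out);
}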

toby is right: you should be pointing straight to S3, if you can. If you cannot, the question is a little vague to give an accurate response: How big is your Java heap? How many streams are open concurrently when you run out of memory?
How big is your read/write buffer (8K is good)?
You are reading 8K from the stream, then writing 8K to the output, right? You are not trying to read the whole image from S3, buffer it in memory, then send the whole thing at once?

If you use 8K buffers, you could have 1000 concurrent streams going in ~8 MB of heap space, so you are definitely doing something wrong....

BTW, I did not pick 8K out of thin air; it is the default size for socket buffers. Send more data at once, say 1 MB, and you will be blocking on the TCP/IP stack while holding a large amount of memory.

Sulla answered 11/9, 2008 at 3:38 Comment(1)
What do you mean when you say "pointing straight to S3"? Do you mean pass the S3 URL to the browser so that they can stream it? – Erse

I agree strongly with both toby and John Vasileff: S3 is great for offloading large media objects if you can tolerate the associated issues. (An instance of my own app does that for 10-1000MB FLVs and MP4s.) For example: there are no partial requests (byte range header), so one has to handle that "manually"; there is occasional downtime; etc.

If that is not an option, John's code looks good. I have found that a 2k byte buffer (FILEBUFFERSIZE) is the most efficient in microbenchmarks. Another option might be a shared FileChannel. (FileChannels are thread-safe.)

That said, I'd also add that guessing at what caused an out of memory error is a classic optimization mistake. You would improve your chances of success by working with hard metrics.

  1. Place -XX:+HeapDumpOnOutOfMemoryError into your JVM startup parameters, just in case.
  2. Use jmap on the running JVM (jmap -histo <pid>) under load.
  3. Analyze the metrics (the jmap -histo output, or have jhat look at your heap dump). It may very well be that your out-of-memory error is coming from somewhere unexpected.

There are of course other tools out there, but jmap & jhat come with Java 5+ "out of the box".

I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the tomcat servlet thread can be re-used. This seems like it would be io heavy.

Ah, I don't think you can do that. And even if you could, it sounds dubious. The Tomcat thread that is managing the connection needs to stay in control. If you are experiencing thread starvation, then increase the number of available threads in ./conf/server.xml (see the snippet below). Again, metrics are the way to detect this; don't just guess.
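
For reference, a hedged example of where that knob lives in Tomcat; the maxThreads value here is arbitrary, and the other Connector attributes should stay whatever your existing conf/server.xml already contains:

<!-- conf/server.xml: HTTP connector thread pool (maxThreads value is only an example) -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="400" />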

Question: Are you also running on EC2? What are your Tomcat JVM startup parameters?

Ida answered 11/9, 2008 at 6:24 Comment(0)

You have to check two things:

  • Are you closing the stream? Very important.
  • Maybe you're giving out stream connections "for free". The stream is not large, but many, many streams at the same time can steal all your memory. Create a pool so that you cannot have more than a certain number of streams running at the same time (a sketch follows below).
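
A minimal sketch of that idea with java.util.concurrent.Semaphore; the servlet name, the cap of 100 concurrent streams, and the streamImage helper are illustrative assumptions:

import java.io.IOException;
import java.util.concurrent.Semaphore;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ImageServlet extends HttpServlet {
    // Illustrative cap: at most 100 responses streaming at the same time.
    private static final Semaphore STREAM_PERMITS = new Semaphore(100);

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        if (!STREAM_PERMITS.tryAcquire()) {
            // Reject (or queue) rather than letting unbounded streams eat the heap.
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        try {
            streamImage(req, resp); // copy loop as in the accepted answer
        } finally {
            STREAM_PERMITS.release(); // always give the permit back, even on error
        }
    }

    private void streamImage(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // ... acquire the source InputStream and copy it to resp.getOutputStream()
    }
}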
Hanson answered 11/9, 2008 at 3:23 Comment(0)

In addition to what John suggested, you should repeatedly flush the output stream. Depending on your web container, it is possible that it caches parts or even all of your output and flushes it all at once (for example, to calculate the Content-Length header). That would burn quite a bit of memory.
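
For illustration, a variant of the copy loop from the accepted answer with a periodic flush; the 64 KB flush interval is an arbitrary choice, not a recommendation from the original post:

byte[] buffer = new byte[8 * 1024];
int bytesRead;
long unflushed = 0;

while ((bytesRead = in.read(buffer)) != -1) {
    out.write(buffer, 0, bytesRead);
    unflushed += bytesRead;
    if (unflushed >= 64 * 1024) { // flush roughly every 64 KB
        out.flush();
        unflushed = 0;
    }
}
out.flush(); // push out whatever remains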

Dannadannel answered 11/9, 2008 at 6:16 Comment(0)

If you can structure your files so that the static files are separate and in their own bucket, the fastest performance today can likely be achieved by using Amazon's CDN, CloudFront, in front of S3.

Beanstalk answered 1/10, 2009 at 15:23 Comment(0)
