Best strategy to upload files with unknown size to S3
I have a server-side application that runs through a large number of image URLs and uploads the images from these URLs to S3. The files are served over HTTP. I download them using the InputStream I get from an HttpURLConnection via its getInputStream method, and I hand that InputStream to the AWS S3 client's putObject method (AWS Java SDK v1) to upload the stream to S3. So far so good.

I am trying to introduce a new external image data source. The problem with this data source is that the HTTP server serving these images does not return a Content-Length HTTP header. This means I cannot tell how many bytes the image will be, which is a number required by the AWS S3 client to validate the image was correctly uploaded from the stream to S3.

The only ways I can think of to deal with this issue are either to get the server owner to add a Content-Length HTTP header to their response (unlikely), or to download the file to a memory buffer first and then upload it to S3 from there.

These are not big files, but I have many of them.

When considering downloading the file first, I am worried about the memory footprint and concurrency implications (not being able to upload and download chunks of the same file at the same time).

Since I am dealing with many small files, I suspect that the concurrency issues might be "resolved" if I focus on concurrency across multiple files instead of within a single file. So instead of concurrently downloading and uploading chunks of the same file, I will use my I/O effectively by downloading one file while uploading another.

I would love your ideas on how to do this, best practices, pitfalls or any other thought on how to best tackle this issue.
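For concreteness, this is roughly the buffer-first approach I have in mind (just an untested sketch; the bucket and key are placeholders, and the client is the AWS Java SDK v1 AmazonS3):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BufferedUpload {
        // Downloads the whole image into memory, then uploads it with an exact, known length.
        public static void copyToS3(AmazonS3 s3, String imageUrl, String bucket, String key) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(imageUrl).openConnection();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            }
            byte[] bytes = buffer.toByteArray();

            ObjectMetadata metadata = new ObjectMetadata();
            metadata.setContentLength(bytes.length); // length is now known exactly
            s3.putObject(bucket, key, new ByteArrayInputStream(bytes), metadata);
        }
    }

My worry is that one such in-memory buffer per concurrent download could still add up, even with small files.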

Outpost answered 13/2, 2019 at 18:3 Comment(7)
putObject javadoc states that "if the caller doesn't provide [the content length], the library will make a best effort to compute the content length by buffering the contents of the input stream into memory". Did you try not specifying the length?Caracalla
@Caracalla That's a good point, thanks for pointing it out. However, without knowing the length, I am left without a way to validate the success of the upload. I will dig into the AWS SDK code to see if I can get some confidence from the way it computes the length.Outpost
For small files, an in-memory buffer looks to be the best solution. For big files, you can also buffer on disk in a temporary file. How big are they?Indignation
I don't see any concurrency issues, really. "not being able to upload and download chunks of the same file at the same time": You cannot do that anyway because you need to complete the download to count the bytes before you start uploading.Indignation
"not being able to upload and download chunks of the same file at the same time" There is a way to do this. I've done it. The multipart upload API allows you to upload objects in chunks as small as 5 MB each, with no knowledge of the final object size required. Each chunk must be a minimum of 5 MB except the last chunk, and if the last chunk also happens to be the first chunk (only true when the entire object is less than 5 MB total), that's still a valid "multipart" upload.Sissy
You might also be able to "convince" the other side to send a Content-Length header by setting Accept-Encoding: identity in your request headers. Whether that works depends on why they aren't sending it now, but it's a valid request, so it wouldn't hurt to try.Sissy
@Michael-sqlbot the question is old, but perhaps you could turn your comment about multipart into an answer?Anarthrous
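To illustrate the multipart suggestion in the comments above, here is a rough sketch of streaming an object of unknown size with the AWS Java SDK v1 low-level multipart API (bucket and key are placeholders; error handling, in particular abortMultipartUpload on failure, is left out):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.*;

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    public class StreamingMultipartUpload {
        private static final int PART_SIZE = 5 * 1024 * 1024; // 5 MB minimum for every part except the last

        public static void upload(AmazonS3 s3, InputStream in, String bucket, String key) throws IOException {
            String uploadId = s3.initiateMultipartUpload(
                    new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
            List<PartETag> etags = new ArrayList<>();
            byte[] buf = new byte[PART_SIZE];
            int partNumber = 1;
            int filled;
            // Fill a 5 MB buffer from the stream and upload it as one part; repeat until EOF.
            while ((filled = fill(in, buf)) > 0) {
                UploadPartRequest part = new UploadPartRequest()
                        .withBucketName(bucket).withKey(key)
                        .withUploadId(uploadId)
                        .withPartNumber(partNumber++)
                        .withInputStream(new ByteArrayInputStream(buf, 0, filled))
                        .withPartSize(filled);
                etags.add(s3.uploadPart(part).getPartETag());
            }
            s3.completeMultipartUpload(
                    new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
        }

        // Reads from the stream until the buffer is full or the stream ends; returns bytes read.
        private static int fill(InputStream in, byte[] buf) throws IOException {
            int total = 0;
            int n;
            while (total < buf.length && (n = in.read(buf, total, buf.length - total)) != -1) {
                total += n;
            }
            return total;
        }
    }

With this approach only one part (at most 5 MB per in-flight upload) is ever held in memory, and the total object size never needs to be known up front.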

I have used the MinIO S3 API before, and my conclusion was to store the content in a temporary file, determine its size, and then upload it to S3 specifying the content size. When I counted on S3 to compute the size, I had a hard time downloading the file afterwards; it was sometimes damaged. The flow with a temporary directory (mostly emptyDir in Kubernetes) has given me 100% correctness.
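A rough sketch of that flow, here shown with the AWS Java SDK v1 client from the question rather than the MinIO client I used (bucket, key and file names are placeholders):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class TempFileUpload {
        // Spools the stream to a temporary file so the exact size is known before the upload starts.
        public static void upload(AmazonS3 s3, InputStream in, String bucket, String key) throws IOException {
            Path tmp = Files.createTempFile("s3-upload-", ".tmp");
            try {
                long size = Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                ObjectMetadata metadata = new ObjectMetadata();
                metadata.setContentLength(size); // explicit size, as in the flow described above
                try (InputStream fileIn = Files.newInputStream(tmp)) {
                    s3.putObject(bucket, key, fileIn, metadata);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        }
    }

Buffering on disk instead of in memory keeps the footprint per file small even when many files are processed concurrently.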

Engels answered 2/7 at 6:42 Comment(0)
