Best strategy to upload files with unknown size to S3
I have a server-side application that runs through a large number of image URLs and uploads the images from these URLs to S3. The files are served over HTTP. I download them using the InputStream I get from an HttpURLConnection via its getInputStream method, and I hand that InputStream to the AWS S3 client's putObject method (AWS Java SDK v1) to upload the stream to S3. So far so good.

I am trying to introduce a new external image data source. The problem with this data source is that the HTTP server serving these images does not return a Content-Length HTTP header. This means I cannot tell how many bytes the image will be, which is a number required by the AWS S3 client to validate the image was correctly uploaded from the stream to S3.

The only ways I can think of to deal with this issue are either to get the server owner to add a Content-Length HTTP header to their response (unlikely), or to download the file to a memory buffer first and then upload it to S3 from there.

These are not big files, but I have many of them.

When considering downloading the file first, I am worried about the memory footprint and concurrency implications (not being able to upload and download chunks of the same file at the same time).

Since I am dealing with many small files, I suspect that the concurrency issues might be "resolved" if I focus on concurrency across multiple files instead of within a single file. So instead of concurrently downloading and uploading chunks of the same file, I will use my I/O effectively by downloading one file while uploading another.

I would love your ideas on how to do this, best practices, pitfalls or any other thought on how to best tackle this issue.
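For concreteness, this is roughly the buffer-first approach I have in mind (just an untested sketch; the bucket and key are placeholders, and the client is the AWS Java SDK v1 AmazonS3):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BufferedUpload {
        // Downloads the whole image into memory, then uploads it with an exact, known length.
        public static void copyToS3(AmazonS3 s3, String imageUrl, String bucket, String key) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(imageUrl).openConnection();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            }
            byte[] bytes = buffer.toByteArray();

            ObjectMetadata metadata = new ObjectMetadata();
            metadata.setContentLength(bytes.length); // length is now known exactly
            s3.putObject(bucket, key, new ByteArrayInputStream(bytes), metadata);
        }
    }

My worry is that one such in-memory buffer per concurrent download could still add up, even with small files.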

Outpost answered 13/2, 2019 at 18:3 Comment(7)
putObject javadoc states that "if the caller doesn't provide [the content length], the library will make a best effort to compute the content length by buffering the contents of the input stream into memory". Did you try not specifying the length?Caracalla
@Caracalla That's a good point, thanks for pointing it out. However, without knowing the length, I am left without a way to validate the success of the upload. I will dig into the AWS SDK code to see if I can get some confidence from the way it computes the length.Outpost
For small files, an in-memory buffer looks to be the best solution. For big files, you can also buffer on disk in a temporary file. How big are they?Indignation
I don't see any concurrency issues, really. "not being able to upload and download chunks of the same file at the same time": You cannot do that anyway because you need to complete the download to count the bytes before you start uploading.Indignation
"not being able to upload and download chunks of the same file at the same time" There is a way to do this. I've done it. The multipart upload API allows you to upload objects in chunks as small as 5 MB each, with no knowledge of the final object size required. Each chunk must be a minimum of 5 MB except the last chunk, and if the last chunk also happens to be the first chunk (only true when the entire object is less than 5 MB total), that's still a valid "multipart" upload.Sissy
You might also be able to "convince" the other side to send a Content-Length header by setting Accept-Encoding: identity in your request headers. Whether that works depends on why they aren't sending it now, but it's a valid request, so it wouldn't hurt to try.Sissy
@Michael-sqlbot the question is old, but perhaps you could turn your comment about multipart into an answer?Anarthrous
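To illustrate the multipart suggestion in the comments above, here is a rough sketch of streaming an object of unknown size with the AWS Java SDK v1 low-level multipart API (bucket and key are placeholders; error handling, in particular abortMultipartUpload on failure, is left out):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.*;

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    public class StreamingMultipartUpload {
        private static final int PART_SIZE = 5 * 1024 * 1024; // 5 MB minimum for every part except the last

        public static void upload(AmazonS3 s3, InputStream in, String bucket, String key) throws IOException {
            String uploadId = s3.initiateMultipartUpload(
                    new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
            List<PartETag> etags = new ArrayList<>();
            byte[] buf = new byte[PART_SIZE];
            int partNumber = 1;
            int filled;
            // Fill a 5 MB buffer from the stream and upload it as one part; repeat until EOF.
            while ((filled = fill(in, buf)) > 0) {
                UploadPartRequest part = new UploadPartRequest()
                        .withBucketName(bucket).withKey(key)
                        .withUploadId(uploadId)
                        .withPartNumber(partNumber++)
                        .withInputStream(new ByteArrayInputStream(buf, 0, filled))
                        .withPartSize(filled);
                etags.add(s3.uploadPart(part).getPartETag());
            }
            s3.completeMultipartUpload(
                    new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
        }

        // Reads from the stream until the buffer is full or the stream ends; returns bytes read.
        private static int fill(InputStream in, byte[] buf) throws IOException {
            int total = 0;
            int n;
            while (total < buf.length && (n = in.read(buf, total, buf.length - total)) != -1) {
                total += n;
            }
            return total;
        }
    }

With this approach only one part (at most 5 MB per in-flight upload) is ever held in memory, and the total object size never needs to be known up front.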

I have used the MinIO S3 API before, and my conclusion was to store the content in a temporary file, determine its size, and then upload it to S3 specifying the content size. When I counted on S3 to compute the size, I had a hard time downloading the file afterwards; it was sometimes damaged. The flow with a temporary directory (mostly emptyDir in Kubernetes) has given me 100% correctness.
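A rough sketch of that flow, here shown with the AWS Java SDK v1 client from the question rather than the MinIO client I used (bucket, key and file names are placeholders):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class TempFileUpload {
        // Spools the stream to a temporary file so the exact size is known before the upload starts.
        public static void upload(AmazonS3 s3, InputStream in, String bucket, String key) throws IOException {
            Path tmp = Files.createTempFile("s3-upload-", ".tmp");
            try {
                long size = Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                ObjectMetadata metadata = new ObjectMetadata();
                metadata.setContentLength(size); // explicit size, as in the flow described above
                try (InputStream fileIn = Files.newInputStream(tmp)) {
                    s3.putObject(bucket, key, fileIn, metadata);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        }
    }

Buffering on disk instead of in memory keeps the footprint per file small even when many files are processed concurrently.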

Engels answered 2/7 at 6:42 Comment(0)
