Is it possible to perform a batch upload to Amazon S3?

Does Amazon S3 support batch uploads? I have a job that needs to upload ~100K files each night; the files can be up to 1 GB each, but the set is strongly skewed towards small files (90% are less than 100 bytes and 99% are less than 1,000 bytes).

Does the s3 API support uploading multiple objects in a single HTTP call?

All the objects must be available in S3 as individual objects. I cannot host them anywhere else (FTP, etc.) or in another format (database, EC2 local drive, etc.). That is an external requirement that I cannot change.

Main answered 24/2, 2013 at 8:53 Comment(4)
is it ok for me to ask these questions?Circumcision
I am wondering why such a requirement appears. If you need to replace all files at once, maybe there's some way to upload them to a temporary bucket in a regular way and then change bucket names?Finnougrian
You could have a look at JetS3t, which is quite fully featured in regard to S3 syncing with multithreading.Corelation
Is the accepted answer for this question still valid? It's been 5 years so just curious if anything has changed in that time...Hajji

Does the s3 API support uploading multiple objects in a single HTTP call?

No, the S3 PUT operation only supports uploading one object per HTTP request.

You could install S3 Tools on the machine that you want to synchronize with the remote bucket, and run the following command:

s3cmd sync localdirectory s3://bucket/

Then you could place this command in a script and create a scheduled job to run this command each night.

This should do what you want.

The tool performs the file synchronization based on MD5 hashes and file size, so collisions should be rare (if you really want, you could just use the "s3cmd put" command to force blind overwriting of objects in your target bucket).

EDIT: Also make sure that you read the documentation on the site I linked for S3 Tools - there are different flags depending on whether you want files deleted locally to also be deleted from the bucket, ignored, etc.

Finn answered 24/2, 2013 at 10:40 Comment(7)
This method still uses individual put operations and is not inherently faster than anything else. The answer was accepted but it seems that all you've done is point to a tool that does the same thing he could do in code.Weigle
You could do a sync from the Node API also - check out node s3-clientAquanaut
s3cmd requires a license for continued useWeixel
Is your answer still valid about uploading one object at a time 5 years after?Somewhat
I agree with @WeigleGuardrail
I have approximately 350K images, which are 45gb of data in total... can I use sync to transfer all of them?Bridgman
Yes you can with a single call. See my answer below.Shoveler

Alternatively, you can upload to S3 via the AWS CLI tool using the sync command.

aws s3 sync local_folder s3://bucket-name

You can use this method to batch upload files to S3 very quickly.

Orten answered 16/6, 2014 at 19:7 Comment(2)
as with the previous answer, the implication here seems to be that these tools are somehow doing something that can't otherwise be accomplished with the API and I don't believe that is the caseWeigle
I'm currently using the AWS CLI tool to sync between a local directory and a S3 bucket. I'd like to know if there is an argument or parameter that can be used to output the "upload" or sync results to a local TXT file that I can then email to someone via blat.exe. All of this is to be put into a batch file for a scheduled sync of thousands of files that are to be downloaded by our other servers. (Using S3 bucket as a cloud source to overcome upload speed issues of our source server)Coagulant

Survey

Is it possible to perform a batch upload to Amazon S3?

Yes*.

Does the S3 API support uploading multiple objects in a single HTTP call?

No.

Explanation

The Amazon S3 API doesn't support bulk upload, but awscli supports concurrent (parallel) uploads. From the client's perspective and in terms of bandwidth efficiency, these options should perform roughly the same way (a Java sketch of the concurrent approach follows the diagram below).

 ────────────────────── time ────────────────────►

    1. Serial
 ------------------
   POST /resource
 ────────────────► POST /resource
   payload_1     └───────────────► POST /resource
                   payload_2     └───────────────►
                                   payload_3
    2. Bulk
 ------------------
   POST /bulk
 ┌────────────┐
 │resources:  │
 │- payload_1 │
 │- payload_2 ├──►
 │- payload_3 │
 └────────────┘

    3. Concurrent
 ------------------
   POST /resource
 ────────────────►
   payload_1

   POST /resource
 ────────────────►
   payload_2

   POST /resource
 ────────────────►
   payload_3
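
For illustration only (not part of the original answer): the concurrent pattern above can be reproduced in your own code with a bounded thread pool issuing one PUT per file. The following is a minimal sketch using the AWS SDK for Java v1; the bucket name and local directory are placeholders, credentials and region are assumed to come from the default provider chain, and the pool size of 10 simply mirrors the CLI's default max_concurrent_requests.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentUploadSketch {
    public static void main(String[] args) throws Exception {
        // Credentials and region come from the default provider chain
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();

        String bucket = "my-bucket";          // placeholder bucket name
        File dir = new File("local_folder");  // placeholder local directory

        // One PutObject request per file, issued from a bounded thread pool
        ExecutorService pool = Executors.newFixedThreadPool(10);
        List<Future<?>> futures = new ArrayList<>();
        for (File file : dir.listFiles()) {
            if (file.isFile()) {
                futures.add(pool.submit(() -> s3.putObject(bucket, file.getName(), file)));
            }
        }

        // Wait for every upload to finish; get() rethrows any failure
        for (Future<?> f : futures) {
            f.get();
        }
        pool.shutdown();
    }
}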

AWS Command Line Interface

The documentation on How can I improve the transfer performance of the sync command for Amazon S3? suggests increasing concurrency in two ways. One of them is this:

To potentially improve performance, you can modify the value of max_concurrent_requests. This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10, and you can increase it to a higher value. However, note the following:

  • Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
  • Too many concurrent requests can overwhelm a system, which might cause connection timeouts or slow the responsiveness of the system. To avoid timeout issues from the AWS CLI, you can try setting the --cli-read-timeout value or the --cli-connect-timeout value to 0.

A script setting max_concurrent_requests and uploading a directory can look like this:

aws configure set s3.max_concurrent_requests 64
aws s3 cp local_path_from s3://remote_path_to --recursive

To give a clue about how running more threads consumes more resources, I did a small measurement in a container running aws-cli (using procpath), uploading a directory with ~550 HTML files (~40 MiB in total, average file size ~72 KiB) to S3. The following chart shows the CPU usage, RSS, and number of threads of the uploading aws process.

[Chart: aws s3 cp --recursive, max_concurrent_requests=64]

Conjunctive answered 8/7, 2021 at 17:41 Comment(1)
what are the hardware specs of the container you used?Dentist

To add on to what everyone is saying, if you want your Java code (instead of the CLI) to do this without having to put all of the files in a single directory, you can create a list of files to upload and then supply that list to the AWS TransferManager's uploadFileList method.

https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html#uploadFileList-java.lang.String-java.lang.String-java.io.File-java.util.List-
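
A rough, hypothetical sketch of such a call (the bucket name, key prefix, and file paths below are made up; credentials and region are assumed to come from the default provider chain):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.MultipleFileUpload;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

import java.io.File;
import java.util.Arrays;
import java.util.List;

public class UploadFileListSketch {
    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        TransferManager tm = TransferManagerBuilder.standard().withS3Client(s3).build();

        // The files may live in different subdirectories of a common ancestor
        List<File> files = Arrays.asList(
                new File("/data/a/report.txt"),
                new File("/data/b/image.png"));

        // Each key is the virtual directory prefix plus the file's path relative to the given directory
        MultipleFileUpload upload = tm.uploadFileList(
                "my-bucket",       // bucket name (placeholder)
                "nightly-batch",   // virtual directory key prefix (placeholder)
                new File("/data"), // common ancestor directory used to build the keys
                files);

        upload.waitForCompletion(); // blocks until all transfers finish
        tm.shutdownNow();
    }
}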

Beckiebeckley answered 4/10, 2018 at 22:21 Comment(0)

Here's a comprehensive batch solution that copies files from one folder to another using a single call of CommandPool::batch, although under the hood it runs an executeAsync command for each file, so I'm not sure it counts as a single API call.

As I understand it, you should be able to copy hundreds of thousands of files using this method, as there's no way to send a batch to AWS to be processed there; but if you're hosting this on an AWS instance or even running it on Lambda, then it's "technically" processed at AWS.

Install the SDK:

composer require aws/aws-sdk-php

Then, in your script:

use Aws\CommandPool;
use Aws\Exception\AwsException;
use Aws\ResultInterface;
use Aws\S3\S3Client;
use Aws\S3\Exception\S3Exception;
use Aws\S3\Exception\DeleteMultipleObjectsException;

$bucket = 'my-bucket-name';

// Setup your credentials in the .aws folder
// See: https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/guide_credentials_profiles.html
$s3 = new S3Client([
    'profile' => 'default',
    'region'  => 'us-east-2',
    'version' => 'latest'
]);

// Get all files in S3
$files = array();
try {
    $results = $s3->getPaginator('ListObjects', [
        'Bucket' => $bucket,
        'Prefix' => 'existing-folder' // Folder within bucket, or remove this to get all files in the bucket
    ]);

    foreach ($results as $result) {
        foreach ($result['Contents'] as $object) {
            $files[] = $object['Key'];
        }
    }
} catch (AwsException $e) {
    error_log($e->getMessage());
}

if(count($files) > 0){
    // Perform a batch of CopyObject operations.
    $batch = [];
    foreach ($files as $file) {
        $batch[] = $s3->getCommand('CopyObject', array(
            'Bucket'     => $bucket,
            'Key'        => str_replace('existing-folder/', 'new-folder/', $file),
            'CopySource' => $bucket . '/' . $file,
        ));
    }

    try {
        $results = CommandPool::batch($s3, $batch);

        // Check if all files were copied in order to safely delete the old directory
        $count = 0;
        foreach($results as $result) {
            if ($result instanceof ResultInterface) {
                $count++;
            }
            if ($result instanceof AwsException) {
                // Log or otherwise handle individual copy failures here
            }
        }

        if($count === count($files)){
            // Delete old directory
            try {
                $s3->deleteMatchingObjects(
                    $bucket, // Bucket
                    'existing-folder' // Prefix, folder within bucket, as indicated above
                );
            } catch (DeleteMultipleObjectsException $exception) {
                return false;
            }

            return true;
        }

        return false;

    } catch (AwsException $e) {
        return $e->getMessage();
    }
}
Shoveler answered 20/2, 2020 at 2:41 Comment(0)

If you want to use a Java program to do it, you can do:

public  void uploadFolder(String bucket, String path, boolean includeSubDirectories) {
    File dir = new File(path);
    MultipleFileUpload upload = transferManager.uploadDirectory(bucket, "", dir, includeSubDirectories);
    try {
        upload.waitForCompletion();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}

To connect to a local S3 (e.g. MinIO) for testing, you can create the s3client and transfer manager as below:

AWSCredentials credentials = new BasicAWSCredentials(accessKey, token);
AmazonS3Client s3Client = new AmazonS3Client(credentials); // Deprecated, but you can create the client using the standard builders/beans provided by Spring/AWS
s3Client.setEndpoint("http://127.0.0.1:9000"); // If you wish to connect to a local S3 such as MinIO
TransferManager transferManager = TransferManagerBuilder.standard().withS3Client(s3Client).build();
Gresham answered 17/8, 2018 at 9:33 Comment(0)

One file (or part of a file) = one HTTP request, but the Java API now supports efficient multiple-file uploads without having to write the multithreading on your own, by using TransferManager.

Sanity answered 5/7, 2017 at 20:55 Comment(1)
"When possible, TransferManager attempts to use multiple threads to upload multiple parts of a single upload at once." It doesn't do a batch upload as far as I know.Somewhat
