Batch Uploading Huge Sets of Images to Azure Blob Storage

I have about 110,000 images of various formats (jpg, png and gif) and sizes (2-40KB) stored locally on my hard drive. I need to upload them to Azure Blob Storage. While doing this, I need to set some metadata and the blob's ContentType, but otherwise it's a straight up bulk upload.

I'm currently using the following to handle uploading one image at a time (paralleled over 5-10 concurrent Tasks).

static void UploadPhoto(Image pic, string filename, ImageFormat format)
{
    //convert image to bytes
    using(MemoryStream ms = new MemoryStream())
    {
        pic.Save(ms, format);
        ms.Position = 0;

        //create the blob, set metadata and properties
        var blob = container.GetBlobReference(filename);
        blob.Metadata["Filename"] = filename;
        blob.Properties.ContentType = MimeHandler.GetContentType(Path.GetExtension(filename));

        //upload!
        blob.UploadFromStream(ms);
        blob.SetMetadata();
        blob.SetProperties();
    }
}

I was wondering if there was another technique I could employ to handle the uploading, to make it as fast as possible. This particular project involves importing a lot of data from one system to another, and for customer reasons it needs to happen as quickly as possible.

Emerson answered 10/10, 2011 at 22:25 Comment(5)
The obvious answer is to find a faster (upload) connection. You could do a temporary upgrade of your connection, or perhaps try to borrow or rent time (e.g., via craigslist, a local professional group, etc.).Contrarily
I have a 50mb line up and down. The issue I'm having is the amount of time it's taking for UploadFromStream() to return, and I've run into some pretty strange garbage collection issues with the Azure Blob objects if I attempt to run more than 10 Tasks in parallel.Emerson
I know Rackspace lets you just FedEx a drive over to them and they'll put it on their cloud free of charge. Does Microsoft have anything similar?Uxoricide
MS do offer a similar service; azure.microsoft.com/en-gb/documentation/articles/…Subduct
One thing I did to optimize a very large bulk upload was to put everything on a VHD, upload it, attach to a VM in the same datacenter, then run the upload tool from there. Just another thing in addition to other optimizations.Boltrope
7

Okay, here's what I did. I tinkered around with running BeginUploadFromStream(), then BeginSetMetadata(), then BeginSetProperties() in an asynchronous chain, paralleled over 5-10 threads (a combination of ElvisLive's and knightpfhor's suggestions). This worked, but anything over 5 threads had terrible performance, taking upwards of 20 seconds for each thread (working on a page of ten images at a time) to complete.

So, to sum up the performance differences:

  • Asynchronous: 5 threads, each running an async chain, each working on ten images at a time (paged for statistical reasons): ~15.8 seconds (per thread).
  • Synchronous: 1 thread, ten images at a time (paged for statistical reasons): ~3.4 seconds

Okay, that's pretty interesting. One instance uploading blobs synchronously performed roughly 5x better per batch than each thread in the async approach, so even running the best async balance of 5 threads netted essentially the same overall throughput as a single synchronous instance.

So, I tweaked my image file importing to separate the images into folders containing 10,000 images each. Then I used Process.Start() to launch an instance of my blob uploader for each folder. I have 170,000 images to work with in this batch, so that means 17 instances of the uploader. When running all of those on my laptop, performance across all of them leveled out at ~4.3 seconds per set.
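
For reference, here's a rough sketch of the per-folder launcher. The executable name, the root path and the single folder-path argument are hypothetical placeholders, not my real tool's interface.

using System.Diagnostics;
using System.IO;

//launch one uploader instance per folder of ~10,000 images
//"BlobUploader.exe" and C:\ImageBatches are placeholders for the real tool and batch root
foreach (string folder in Directory.GetDirectories(@"C:\ImageBatches"))
{
    Process.Start(new ProcessStartInfo("BlobUploader.exe", "\"" + folder + "\"")
    {
        UseShellExecute = false
    });
}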

Long story short, instead of trying to get threading working optimally, I just run a blob uploader instance for every 10,000 images, all on the one machine at the same time. Total performance boost?

  • Async Attempts: 14-16 hours, based on average execution time when running it for an hour or two.
  • Synchronous with 17 separate instances: ~1 hour, 5 minutes.
Emerson answered 12/10, 2011 at 0:26 Comment(0)
3

You should definitely upload in parallel in several streams (i.e., post multiple files concurrently), but before you run any experiment showing (erroneously) that there is no benefit, make sure you actually increase the value of ServicePointManager.DefaultConnectionLimit:

The maximum number of concurrent connections allowed by a ServicePoint object. The default value is 2.

With a default value of 2, you can have at most two outstanding HTTP requests against any destination.
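
For example, a minimal sketch (the value of 48 is an arbitrary illustration, not a recommendation; set it once, before the first request is issued):

using System.Net;

//raise the per-host connection cap before any blob requests go out
//48 is an arbitrary example value
ServicePointManager.DefaultConnectionLimit = 48;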

Dishonor answered 12/10, 2011 at 0:37 Comment(3)
...I was not aware of that setting. Fascinating that I never found a mention of it near any of the async blob storage stuff I read. I've already executed my solution and don't have time to try this one, but I'll definitely keep it in mind in the future. That was probably the main bottleneck. /rage.Emerson
msdn.microsoft.com/en-us/library/7af54za5%28v=VS.100%29.aspx. The default value of 2 is to 'conform' with the HTTP/1.1 specification, but this usually misses the point that any server worth its salt is actually behind a network load balancer, so you would actually be targeting possibly hundreds of 'servers' with one URL (certainly the case with Azure Blob Storage).Dishonor
Setting this value in this way affects all HTTP connections for the assembly... you may want to set the service point manager just for the connection you're dealing with.Bechler
1

As the files that you're uploading are pretty small, I think the code that you've written is probably about as efficient as you can get. Based on your comment, it looks like you've tried running these uploads in parallel, which was really the only other code suggestion I had.

I suspect that getting the greatest throughput will come down to finding the right number of threads for your hardware, your connection and your file size. You could try using the Azure Throughput Analyzer to make finding this balance easier.

Microsoft's Extreme Computing group also has benchmarks and suggestions on improving throughput. It's focused on throughput from worker roles deployed on Azure, but it will give you an idea of the best you could hope for.

Phonics answered 11/10, 2011 at 21:3 Comment(1)
I ended up running a bunch of separate instances of the uploader, focused on different sets of images (10,000 at a time). Thanks for the pointers though, I upvoted your answer anyway :).Emerson
1

You may want to increase ParallelOperationThreadCount as shown below. I haven't checked the latest SDK, but in 1.3 the limit was 64. Not setting this value resulted in lower concurrent operations.

CloudBlobClient blobStorage = new CloudBlobClient(config.AccountUrl, creds);
// todo: set this in blob extensions
blobStorage.ParallelOperationThreadCount = 64;
Bechler answered 12/10, 2011 at 21:39 Comment(0)
1

If the parallel method takes 5 times longer to upload than the serial one, then you either

  • have awful bandwidth
  • have a very slow computer
  • are doing something wrong

My command-line util gets quite a boost when running in parallel, even though I don't use memory streams or any other nifty stuff like that; I simply generate a string array of the filenames, then upload them with Parallel.ForEach.
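
Roughly like this (container is a CloudBlobContainer as in the question, the folder path is a placeholder, and the degree of parallelism is an arbitrary example value):

using System.IO;
using System.Threading.Tasks;

string[] files = Directory.GetFiles(@"C:\Images");

Parallel.ForEach(files,
    new ParallelOptions { MaxDegreeOfParallelism = 8 },
    file =>
    {
        var blob = container.GetBlobReference(Path.GetFileName(file));
        blob.UploadFile(file); //reads straight from disk, no MemoryStream needed
    });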

Additionally, the Properties.ContentType call probably sets you back quite a bit. Personally I never set it, and I guess it shouldn't even matter unless you want to view the files right in the browser via direct URLs.

Rhizoid answered 1/2, 2014 at 13:1 Comment(0)
0

You could always try the async methods of uploading.

public override IAsyncResult BeginUploadFromStream(
    Stream source,
    AsyncCallback callback,
    Object state
)

http://msdn.microsoft.com/en-us/library/windowsazure/ee772907.aspx
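
A rough sketch of wiring that up, using the blob and MemoryStream from the question's code (note that the stream has to stay alive until the callback has finished):

//start the upload without blocking the calling thread
blob.BeginUploadFromStream(ms, asyncResult =>
{
    blob.EndUploadFromStream(asyncResult); //completes the upload and surfaces any errors
    blob.BeginSetMetadata(r => blob.EndSetMetadata(r), null);
}, null);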

Hinckley answered 11/10, 2011 at 5:25 Comment(3)
Tried this out. It works, but has some serious performance issues. For some reason running Blob uploads in parallel threads eats a ton of CPU and runs rather slowly beyond about 5 threads. See my answer for the details on my eventual approach.Emerson
Another follow up... Did you try turning off the nagle algorithm at all? I totally spaced this earlier :) blogs.msdn.com/b/windowsazurestorage/archive/2010/06/25/…Hinckley
Wow, that looks crazy. I don't have the time to employ that for this solution (this is one of those "needs to be fast but will only ever be run once in the history of ever" projects), but I've bookmarked it for future imports I'll be doing. Thanks dude!Emerson
