ContentHash not calculated in Azure Blob Storage v12
Continuing the saga, here is part I: ContentHash is null in Azure.Storage.Blobs v12.x.x

After a lot of debugging, the root cause appears to be that the content hash is not calculated when the blob is uploaded, so BlobContentInfo and BlobProperties return a null content hash. My whole flow is based on receiving the hash from Azure.

What I've discovered is that the behavior depends on which HttpRequest stream method I call and upload to Azure:

With HttpRequest.GetBufferlessInputStream(), the content hash is not calculated; even in Azure Storage Explorer, the ContentMD5 of the blob is empty.

With HttpRequest.InputStream, everything works as expected.


Do you know why the behavior differs, and how to receive a content hash for streams obtained via the GetBufferlessInputStream method?

So the code flow looks like this:

var stream = HttpContext.Current.Request.GetBufferlessInputStream(disableMaxRequestLength: true);

var container = _blobServiceClient.GetBlobContainerClient(containerName);
var blob = container.GetBlockBlobClient(blobPath);

BlobHttpHeaders blobHttpHeaders = null;
if (!string.IsNullOrWhiteSpace(fileContentType))
{
     blobHttpHeaders = new BlobHttpHeaders()
     {
          ContentType = fileContentType,
     };
}

// retry policy is already configured on the Azure Storage client
await blob.UploadAsync(stream, httpHeaders: blobHttpHeaders);

return await blob.GetPropertiesAsync();

In the snippet above, ContentHash is NOT calculated, but if I instead get the stream from the HTTP request as follows, ContentHash is calculated:

var stream = HttpContext.Current.Request.InputStream;

P.S. I think it's obvious, but with the old SDK the content hash was calculated for streams obtained via the GetBufferlessInputStream method.

P.S.2: There is also an open issue on GitHub: https://github.com/Azure/azure-sdk-for-net/issues/14037

P.S.3: Added a code snippet.

Suggestion answered 11/8/2020 at 5:25 · Comments (5)
Hello, do you still have any more issues with the question? – Almaraz
@IvanYang I did a quick test of your workaround and it's working. Right now I am running some performance tests to see how it is affected. – Suggestion
Please provide feedback later :) – Almaraz
Hello, any feedback? – Almaraz
@IvanYang For the moment I am going ahead with your proposed solution. Thanks! – Suggestion

A workaround: when you get the stream via the GetBufferlessInputStream() method, copy it into a MemoryStream, then upload the MemoryStream. The content hash is then generated. Sample code below:

        var requestStream = System.Web.HttpContext.Current.Request.GetBufferlessInputStream(disableMaxRequestLength: true);

        // Copy the non-seekable request stream into a seekable MemoryStream.
        var memoryStream = new MemoryStream();
        requestStream.CopyTo(memoryStream);
        memoryStream.Position = 0;

        // ... other code ...
        // retry policy is already configured on the Azure Storage client
        await blob.UploadAsync(memoryStream, httpHeaders: blobHttpHeaders);

I'm not sure why, but from my debugging I can see that when using GetBufferlessInputStream() with the latest SDK, the upload actually calls the Put Block API in the backend, and with that API the MD5 hash is not stored with the blob (see the Put Block documentation for details). Screenshot below:

[Screenshot: network trace showing the upload issuing a Put Block request]

However, when using InputStream, it calls the Put Blob API. Screenshot below:

[Screenshot: network trace showing the upload issuing a Put Blob request]

Almaraz answered 12/8/2020 at 7:56 · Comments (1)
I see. I will do some performance tests using MemoryStream. The main reason I want to use GetBufferlessInputStream is described very well by Microsoft: "The InputStream property waits until the whole request has been received before it returns a Stream object. In contrast, the GetBufferlessInputStream method returns the Stream object immediately. You can use the method to begin processing the entity body before the complete contents of the body have been received." – Suggestion
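If buffering the entire request into a MemoryStream defeats the point of GetBufferlessInputStream, one alternative is to stage the blocks yourself while hashing incrementally, then set the blob-level MD5 when committing the block list. This is a hedged sketch, not from the thread: the block size, helper name, and manual staging flow are my assumptions, though StageBlockAsync and CommitBlockListAsync are real BlockBlobClient methods.

        // Hedged sketch: stream the request body block-by-block while
        // accumulating a whole-blob MD5, then store it at commit time.
        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Security.Cryptography;
        using System.Threading.Tasks;
        using Azure.Storage.Blobs.Models;
        using Azure.Storage.Blobs.Specialized;

        public static async Task UploadStreamingWithMd5Async(
            BlockBlobClient blob, Stream source, BlobHttpHeaders headers)
        {
            const int BlockSize = 4 * 1024 * 1024; // 4 MiB per block (assumption)
            var blockIds = new List<string>();
            var buffer = new byte[BlockSize];

            if (headers == null)
                headers = new BlobHttpHeaders();

            using (var md5 = MD5.Create())
            {
                int read;
                // Reads may return fewer bytes than requested; each read
                // simply becomes its own (variable-sized) block.
                while ((read = await source.ReadAsync(buffer, 0, BlockSize)) > 0)
                {
                    // Fold this chunk into the whole-blob MD5.
                    md5.TransformBlock(buffer, 0, read, null, 0);

                    // Block IDs must be base64 and all the same length.
                    var blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
                    using (var chunk = new MemoryStream(buffer, 0, read))
                    {
                        await blob.StageBlockAsync(blockId, chunk);
                    }
                    blockIds.Add(blockId);
                }
                md5.TransformFinalBlock(new byte[0], 0, 0);

                // Stored as the blob's Content-MD5 property on commit.
                headers.ContentHash = md5.Hash;
                await blob.CommitBlockListAsync(blockIds, httpHeaders: headers);
            }
        }

This keeps memory usage bounded to one block at a time, at the cost of one Put Block call per chunk plus a final Put Block List.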

Ran into this today. From my digging, it appears this is a symptom of the type of Stream you use to upload, not really a bug. To generate a hash for your blob (which, by the looks of it, is done on the client side before uploading), the SDK needs to read the stream, which means it must then reset the stream's position back to 0 for the actual upload. Doing that requires the stream to support the Seek operation. If your stream doesn't support Seek, the hash is not generated.

To get around the issue, make sure the stream you provide supports Seek (CanSeek). If it doesn't, copy your data to a stream that does (for example, a MemoryStream). The alternative would be for the internals of the Blob SDK to do this for you.
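For illustration, a minimal sketch of that check; the helper name is mine, not part of the SDK:

        // Hedged sketch: hand the SDK a seekable stream so it can read the
        // data to hash it and then rewind for the actual upload.
        using System.IO;
        using System.Threading.Tasks;

        public static async Task<Stream> EnsureSeekableAsync(Stream source)
        {
            if (source.CanSeek)
                return source; // already rewindable

            var buffered = new MemoryStream();
            await source.CopyToAsync(buffered);
            buffered.Position = 0; // rewind so the upload starts at byte 0
            return buffered;
        }

In the question's flow, the result of GetBufferlessInputStream(...) would be passed through this helper before calling UploadAsync.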

Miyamoto answered 9/9/2020 at 20:13 · Comments (5)
Hmm, I'm not sure this is the whole story. I've just discovered this issue in one of my containers. All files are written using exactly the same mechanism (using a MemoryStream), and only some have a missing hash. It looks like the size of the blob may be a contributing factor: small blobs (<50k?) appear to be unaffected. – Boustrophedon
You may be encountering a different issue; I haven't noticed what you're describing. In my case I actually verify that the hash from Azure matches my local one before considering the task complete, and I haven't had any issues like you're experiencing. – Miyamoto
I posted an issue at github.com/Azure/azure-sdk-for-net/issues/17676. In my case it appears that the root cause is that I'm using StorageTransferOptions to optimise upload times. That causes the SDK to use a different upload mechanism on larger files, and this prevents the server from calculating/adding a checksum. The upshot is that clients should probably not rely on the server adding a checksum and should always add it explicitly with BlobUploadOptions.HttpHeaders.ContentHash if it is needed (a sketch follows these comments). – Boustrophedon
Thanks for the follow-up! – Miyamoto
Yep, for any kind of block (large) upload it doesn't calculate a "global MD5" for you: https://mcmap.net/q/1132508/-how-to-check-azure-storage-blob-file-uploaded-correctly. I assume that since it's uploading from an input stream in this case, it doesn't know how big it's going to be, so it uploads as if it's big, hence not generating an MD5. I also agree that it would be nice to at least have an option in the client to always set an MD5 for consistency. – Pituri
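Following up on the comment above about setting the hash explicitly, here is a minimal hedged sketch. BlobUploadOptions.HttpHeaders.ContentHash is the property named in the comment; the MemoryStream buffering and the helper name are my assumptions (MD5 needs a full pass over the data before the upload starts), and this assumes an SDK version that exposes the BlobUploadOptions overload of UploadAsync:

        // Hedged sketch: compute the MD5 locally and set it explicitly, so the
        // blob's Content-MD5 is populated regardless of the upload mechanism.
        using System.IO;
        using System.Security.Cryptography;
        using System.Threading.Tasks;
        using Azure.Storage.Blobs.Models;
        using Azure.Storage.Blobs.Specialized;

        public static async Task UploadWithExplicitMd5Async(
            BlockBlobClient blob, Stream source, string contentType)
        {
            using (var buffered = new MemoryStream())
            using (var md5 = MD5.Create())
            {
                await source.CopyToAsync(buffered);

                buffered.Position = 0;
                byte[] hash = md5.ComputeHash(buffered); // full pass over the data
                buffered.Position = 0; // rewind again for the upload

                var options = new BlobUploadOptions
                {
                    HttpHeaders = new BlobHttpHeaders
                    {
                        ContentType = contentType,
                        ContentHash = hash, // stored as the blob's Content-MD5
                    },
                };
                await blob.UploadAsync(buffered, options);
            }
        }

With the hash supplied up front, the flow no longer depends on whether the service computes one, which sidesteps the Put Blob vs. Put Block difference discussed above.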
