Azure Blob Storage : Virtual Folder structure vs Blob Index Tags

I am trying to figure out what the benefit of Index Tags is compared with creating a full Virtual Folder tree structure in Azure Blob Storage, given that I have full programmatic control over the creation of the blobs.

Inset answered 26/8, 2021 at 9:19 Comment(11)
Index Tags are a recent feature only added in the past year - and cost $$$ to maintain - whereas Virtual Folders are an ugly hack and slow to enumerate. I recommend avoiding both and instead using Azure Blob Storage as a Content Addressable Store only. - Beerbohm
What @Beerbohm said; if you have lots of files and a natural tree structure, consider enabling the hierarchical namespace feature and using the ADLS2 endpoint. Retrieving dir listings is a LOT faster that way. - Onstad
@HongOoi What's ADLS2? - Beerbohm
@Beerbohm Azure Data Lake Storage Gen2 - Led
I use it in one of my apps and have not had much trouble. I use Mongo to actually hold the indexes and just address blobs directly. Found it quick and cheap for my purpose. I think Gen2 is better than Gen1... But I didn't use Gen1 so I'm not positive. - Led
@Beerbohm ADLS2, not ADLS - Onstad
@HongOoi Thanks for the comments. I will look into this hierarchical namespace, since I have also been looking into putting a data lake on top. If a data lake on top of blob storage is the end goal, is the hierarchical feature the best option? - Inset
@FrederikVigen "best" depends on what your specific requirements are, but I haven't seen any major downsides with using the ADLS2 endpoint. Note that both the blob and ADLS2 endpoints give you access to the same storage, it's just a matter of choosing which API to use. - Onstad
@HongOoi Okay great, then I will look into the Hierarchical Namespace solution to get a better understanding of that :) - Inset
@HongOoi Is the ADLSv2 endpoint the same as the NFS 3.0 endpoint? I read that when using NFS 3.0 there's a long list of traditional blob storage functionality that's completely unavailable - so I wasn't sure if that also applied to other endpoints. (See the "Blob storage features" section on this page: learn.microsoft.com/en-us/azure/storage/blobs/… ) - Beerbohm
@dai learn.microsoft.com/en-us/azure/storage/blobs/… - Onstad

Virtual Folder structure vs Blob Index Tags

You're asking us to compare two separate features of Azure Blob Storage as though they were mutually exclusive, when in fact they can be used together, and there are more options for organizing blobs than just those two:

TL;DR:

  • Azure Blob Index Tags - arbitrary mutable tags on your blobs.
  • Virtual folder structure - this is just a naming convention where your blobs are named with slash-separated "directory" names.
  • NFS 3.0 Blob Storage and Data Lake Storage Gen2 - this is a major new version (or revision) of Azure Blob Storage that makes it behave almost exactly like a traditional disk file-system (hence the NFS 3.0-compliance) however it (currently) comes with major shortcomings.

In detail:

  • Azure Blob Index Tags is a recently introduced feature of Azure Blob Storage: it entered preview in May 2020 and left the preview stage in June 2021 (2 months ago at the time of writing).

    • Your storage account needs to be "General Purpose v2" - so if you have an older-style storage account you'll need to upgrade it.
    • Advantages:
      • It's built-in to Azure Blob Storage, so you don't need to maintain your own indexing infrastructure (which is what we used to have to do: I stored my own blob index in a table in Azure Table Storage in the same storage account, and had a process that ran on a disposable Azure VM nightly to index new blobs).
      • As it's a tagging system it means you can have your own taxonomy and don't have to force your nomenclature into a single hierarchy as with virtual folders.
      • Tags are mutable: you can add/remove/edit them as you like.
    • Disadvantages:
      • As with maintaining your own blob index, index updates are not instantaneous (unlike an RDBMS, where indexes are always up-to-date). The blog article linked handwaves this away by saying:

        "...and the account indexing engine exposes the new blob index shortly after."

        ...note that they don't define what "shortly" means.

      • As of August 2021, Azure charges $0.03 per 10,000 tags per month (regardless of the storage-tier in use). So if you have 1,000,000 blobs and 3 tags per blob, that's 3,000,000 tags, or 300 × $0.03 = $9/mo.

        • This isn't a significant cost by any means, but the cost-per-information-theoretic-unit is kinda-high, which is disappointing.
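
The tag-cost arithmetic above can be sketched as a small helper. This is only a back-of-the-envelope estimate: the price constant is the August 2021 figure quoted in this answer (check current pricing for your region), and `monthly_tag_cost` is a hypothetical name, not part of any Azure SDK.

```python
from decimal import Decimal

# Assumption: the August 2021 list price quoted above
# ($0.03 per 10,000 index tags per month). Real prices vary by region and date.
PRICE_PER_10K_TAGS = Decimal("0.03")


def monthly_tag_cost(blob_count: int, tags_per_blob: int,
                     price_per_10k: Decimal = PRICE_PER_10K_TAGS) -> Decimal:
    """Estimate the monthly index-tag charge in USD (hypothetical helper)."""
    total_tags = blob_count * tags_per_blob
    return price_per_10k * total_tags / 10_000


# 1,000,000 blobs with 3 tags each, as in the example above.
cost = monthly_tag_cost(1_000_000, 3)
print(f"${cost:.2f}/month")  # prints "$9.00/month"
```

`Decimal` is used instead of `float` so that the currency arithmetic stays exact.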
  • "Virtual Folder tree structure" - By this I assume you mean giving your blobs hierarchical, slash-separated names and using Azure Blob Storage's blob-name-prefix search filter.

    • Advantages:
      • Tried-and-tested. Simple.
      • Doesn't cost you anything.
      • No indexing delay.
    • Disadvantages:
      • It's still as slow as enumerating blobs lexicographically.
      • You cannot conceptually move or rename blobs.
        • (You can, technically, provided source and destination are in the same container by doing a copy+delete, and the copy operation should be instantaneous as I understand that Blob Storage uses COW for same-container copies, but it's still imperfect: the client API still exposes it as an asynchronous operation with an unbounded time-to-copy rather than giving hard guarantees)
        • The fact this has been a limitation of Azure Blob Storage for a decade now utterly confounds me.
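
To illustrate why virtual folders are "just a naming convention", here is a minimal local sketch of a prefix-plus-delimiter listing over a flat set of blob names. It does not talk to Azure at all: `list_virtual_dir` and the sample names are made up for illustration. As far as I know, the Python SDK exposes the server-side equivalent as `ContainerClient.walk_blobs(name_starts_with=..., delimiter="/")`, but treat that as an assumption and check the SDK docs.

```python
def list_virtual_dir(blob_names, prefix="", delimiter="/"):
    """Return (blobs, subfolders) directly under `prefix` in a flat namespace.

    Hypothetical helper: there are no real directories, only names that
    happen to contain the delimiter.
    """
    blobs, subfolders = [], []
    for name in sorted(blob_names):  # flat namespaces enumerate lexicographically
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A delimiter remains after the prefix: roll the blob up
            # into a single virtual "folder" entry.
            folder = prefix + head + delimiter
            if folder not in subfolders:
                subfolders.append(folder)
        else:
            blobs.append(name)
    return blobs, subfolders


# Made-up blob names for illustration.
names = [
    "invoices/2021/08/a.pdf",
    "invoices/2021/09/b.pdf",
    "invoices/readme.txt",
    "logs/app.log",
]
print(list_virtual_dir(names, "invoices/"))
# (['invoices/readme.txt'], ['invoices/2021/'])
```

Note that "renaming a folder" in this model means rewriting the name of every blob under that prefix, which is why the lack of a cheap rename hurts.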
  • NFS 3.0 Blob Storage - Also new in 2020/2021, alongside Blob Index Tags, is NFS 3.0 Blob Storage, which gives you a full "real" hierarchical filesystem for your blobs.

    • The Hierarchical Namespace feature is powered by Azure Data Lake Storage Gen 2. I don't know any technical details of this so I can't say anything.
    • Advantages:
      • NFS 3.0-compliant (that's huge!) so Linux clients can even mount it directly.
      • It's cheaper than normal blob storage (whaaaaat?!):
        • In West US 2, NFS+LRS+Hot is $0.018/GB while the old-school flat namespace with LRS+Hot is $0.0184/GB.
        • In other Azure locations and with other redundancy options NFS can be slightly more expensive, but otherwise they're generally within $0.01 of each other.
    • Disadvantages (notes from the Known Issues page):
      • NFS can only be used with new accounts: you cannot upgrade an existing account. You also cannot disable it once you enable it.
      • You cannot (currently) lock blobs/files - though this looks to come in a future version.
      • You cannot use both Blob Index Tags and NFS in the same storage account - or in fact most features of Blob Storage (ooo-er!).
      • The documentation for operations exclusive to Hierarchical Namespace blobs only lists Set Blob Expiry - there (still) doesn't seem to be a synchronous/atomic "move blob" or "rename blob" operation. Instead, the Protocol Support page implies that an operation to rename an NFS file will be translated into raw blob storage operations behind-the-scenes... so I'm curious how they do that atomically.

        When your application makes a request by using the NFS 3.0 protocol, that request is translated into combination of block blob operations. For example, NFS 3.0 read Remote Procedure Call (RPC) requests are translated into Get Blob operation. NFS 3.0 write RPC requests are translated into a combination of Get Block List, Put Block, and Put Block List.

  • Alternative concept: Content-addressable-storage

    • Because blobs cannot be atomically/synchronously renamed, a few years ago I simply gave up trying to come up with a perfect blob nomenclature that would stand the test of time, because business requirements always change.
    • Instead, I noticed that my blobs were invariably immutable: once they've been uploaded to storage they're never updated, or when they are updated they're saved to new, separate blobs - which means that a content-addressable naming strategy suited my projects perfectly.
    • In short: give your immutable blobs a name which is a string-representation of a hash of their content, and store their hashes in a traditional RDBMS where you have much greater flexibility (and ideally: performance) with how they're indexed and referenced by the rest of your system.
      • In my case, I set my blobs' names to the Base-16 representation of their SHA-256 hash.
    • Advantages:
      • You get de-duping for free: blobs with identical content will have identical hashes, so you can avoid uploading/downloading the same huge blob twice.
      • You get integrity checks for free: if you download a blob and its hash doesn't match its blob-name, then the blob has been corrupted or tampered with (e.g. your storage account got hacked).
    • Disadvantages:
      • You still need to maintain your own index in your RDBMS (if applicable) - but you can still use Blob Index Tags with content-addressable storage if you like.
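
The content-addressable idea above can be sketched with nothing but `hashlib`. This is a minimal in-memory stand-in, not my actual implementation: `ContentAddressableStore` is a hypothetical class and the dict replaces a real blob container, but the naming rule (lowercase Base-16 SHA-256 of the content) is the one described above.

```python
import hashlib


class ContentAddressableStore:
    """In-memory stand-in for a blob container (hypothetical, for illustration).

    Blob names are the lowercase Base-16 SHA-256 digest of the content.
    """

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        name = hashlib.sha256(content).hexdigest()
        # De-duping for free: identical content maps to the same name,
        # so a second upload of the same bytes is skipped.
        if name not in self._blobs:
            self._blobs[name] = content
        return name

    def get(self, name: str) -> bytes:
        content = self._blobs[name]
        # Integrity check for free: the name must match the content's hash.
        if hashlib.sha256(content).hexdigest() != name:
            raise ValueError(f"blob {name} failed its integrity check")
        return content

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentAddressableStore()
first = store.put(b"quarterly-report contents")
second = store.put(b"quarterly-report contents")  # same bytes, same name
assert first == second and len(store) == 1
assert store.get(first) == b"quarterly-report contents"
```

The returned name is what you would store in your RDBMS row, next to whatever human-friendly filename, owner, or tags the rest of your system needs.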
Beerbohm answered 26/8, 2021 at 9:57 Comment(1)
+1 Thanks for the very nicely detailed answer @Beerbohm. I will go straight to work looking into the Hierarchical Namespace stuff, because that seems to be the solution to what I was contemplating. - Inset
