I am trying to figure out what is the benefit of Index Tags vs Creating a full Virtual Folder tree structure in azure blob storage, when i have full programatic control over the creation of the blobs.
Virtual Folder structure vs Blob Index Tags
You're asking us to compare just two separate features of Azure Blob Storage as though they were mutually exclusive, when in-fact they can be used together, and there are more options for organizing blobs than just those 2 options:
TL;DR:
- Azure Blob Index Tags - arbitrary mutable tags on your blobs.
- Virtual folder structure - this is just a naming convention where your blobs are named with slash-separated "directory" names.
- NFS 3.0 Blob Storage and Data Lake Storage Gen2 - this is a major new version (or revision) of Azure Blob Storage that makes it behave almost exactly like a traditional disk file-system (hence the NFS 3.0-compliance) however it (currently) comes with major shortcomings.
In detail:
Azure Blob Index Tags is a recently introduced new feature to Azure Blob Storage: it entered preview in May 2020 and left the preview-stage in June 2021 (2 months ago at the time of writing).
- Your storage account needs to be "General Purpose v2" - so if you have a an older-style storage account you'll need to update it.
- Advantages:
- It's built-in to Azure Blob Storage, so you don't need to maintain your own indexing infrastructure (which is what we used to have to do: I stored my own blob index in a table in Azure Table Storage in the same storage account, and had a process that ran on a disposable Azure VM nightly to index new blobs).
- As it's a tagging system it means you can have your own taxonomy and don't have to force your nomenclature into a single hierarchy as with virtual folders.
- Tags are mutable: you can add/remove/edit them as you like.
- Disadvantages:
As with maintaining your own blob index the index updates are not instantaneous (unlike compared to an RDBMS where indexes are always up-to-date). The blog article linked handwaves this away by saying:
and the account indexing engine exposes the new blob index shortly after."
...note that they don't define what "shortly" means.
As of August 2021, Azure charges $0.03 per 10,000 tags (regardless of the storage-tier in use). So if you have 1,000,000 blobs and 3 tags per blob, then that's $9/mo.
- This isn't a significant cost by any means, but the cost-per-information-theoretic-unit is kinda-high, which is disappointing.
"Virtual Folder tree structure" - By this I assume you mean giving your blob's hierarchical naming system and using Azure Blob Storage's blob-name-prefix search filter.
- Advantages:
- Tried-and-tested. Simple.
- Doesn't cost you anything.
- No indexing delay.
- Disadvantages:
- It's still as slow as enumerating blobs lexicographically.
- You cannot conceptually move or rename blobs.
- (You can, technically, provided source and destination are in the same container by doing a copy+delete, and the copy operation should be instantaneous as I understand that Blob Storage uses COW for same-container copies, but it's still imperfect: the client API still exposes it as an asynchronous operation with an unbounded time-to-copy rather than giving hard guarantees)
- The fact this has been a limitation of Azure Blob Storage for a decade now utterly confounds me.
- Advantages:
NFS 3.0 Blob Storage - Also new in 2020/2021 with Blob Index Tags is NFS 3.0 Blob Storage, which gives you a full "real" hierarchical filesystem for your blobs.
- The Hierarchical Namespace feature is powered by Azure Data Lake Storage Gen 2. I don't know any technical details of this so I can't say anything.
- Advantages:
- NFS 3.0-compliant (that's huge!) so Linux clients can even mount it directly.
- It's cheaper than normal blob storage (whaaaaat?!):
- In West US 2, NFS+LRS+Hot is $0.018/GB while the old-school flat namespace with LRS+Hot is $0.0184/GB.
- In other Azure locations and with other redundancy options then NFS can be slightly more expensive, but otherwise they're generally within $0.01 of each other.
- Disadvantages:
- Apparently you're limited to only block-blobs: not page-blobs or append-blobs.
- Notes from the Known Issues page:
- NFS can only be used with new accounts: you cannot update an existing account. You also cannot disable it once you enable it.
- You cannot (currently) lock blobs/files - though this looks to come in a future version.
- You cannot use both Blob Index Tags and NFS in the same storage account - or in fact most features of Blob Storage (ooo-er!).
- The documentation for operations exclusively to Hierarchical namespace blobs only lists
Set Blob Expiry
- there (still) doesn't seem to be a synchronous/atomic "move blob" or "rename blob" operation, instead the Protocol Support page implies that an operation to rename an NFS file will be translated into raw blob storage operations behind-the-scenes... so I'm curious how they do that atomically.When your application makes a request by using the NFS 3.0 protocol, that request is translated into combination of block blob operations. For example, NFS 3.0 read Remote Procedure Call (RPC) requests are translated into Get Blob operation. NFS 3.0 write RPC requests are translated into a combination of Get Block List, Put Block, and Put Block List.
Alternative concept: Content-addressable-storage
- Because blobs cannot be atomically/synchronously renamed so a few years ago I simply gave up trying to come up with a perfect blob nomenclature that would stand the test of time because business requirements always change.
- Instead, I noticed that my blobs were invariably immutable: once they've been uploaded to storage they're never updated, or when they are updated they're saved to new, separate blobs - which means that a content-addressable naming strategy suited my projects perfectly.
- In short: give your immutable blobs a name which is a string-representation of a hash of their content, and store their hashes in a traditional RDBMS where you have much greater flexibility (and ideally: performance) with how they're indexed and referenced by the rest of your system.
- In my case, I set my blobs' names to the Base-16 representation of their SHA-256 hash.
- Advantages:
- You get de-duping for free: blobs with identical content will have identical hashes, so you can avoid uploading/downloading the same huge blob twice.
- You get integrity checks for free: if you download a blob and its hash doesn't match its blob-name then your storage account likely got hacked)
- Disadvantages:
- You still need to maintain your own index in your RDBMS (if applicable) - but you can still use Blob Index Tags with content-addressable storage if you like.
© 2022 - 2024 — McMap. All rights reserved.