Definitions of ABFS[S] and WASB[S] are easy to find, but there is no clear guidance on when to use which. What are the most appropriate use cases for each?
1) Blob Storage with HTTP
Azure introduced Blob Storage, an object store with a flat structure. There is no concept of folders or hierarchy, although using a slash (/) in blob names gives the illusion of one.
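To make the "illusion of hierarchy" concrete, here is a minimal sketch of how a flat list of blob names can be presented as folders by filtering on a prefix and splitting on a delimiter. The function name `list_virtual_dir` and the sample blob names are made up for illustration; this is plain Python over a list of strings, not a call to any Azure SDK.

```python
def list_virtual_dir(blob_names, prefix="", delimiter="/"):
    """Emulate a folder listing over a flat list of blob names.

    Returns (files, subdirs) that sit directly under `prefix`.
    """
    files, subdirs = [], set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # More path segments follow: report only the first one as a "folder".
            subdirs.add(prefix + head + delimiter)
        else:
            files.append(name)
    return sorted(files), sorted(subdirs)


# Hypothetical blob names; the store itself only knows these flat strings.
blobs = [
    "logs/2024/01/app.log",
    "logs/2024/02/app.log",
    "logs/readme.txt",
    "data.csv",
]

print(list_virtual_dir(blobs))           # (['data.csv'], ['logs/'])
print(list_virtual_dir(blobs, "logs/"))  # (['logs/readme.txt'], ['logs/2024/'])
```

The "folders" exist only in the listing logic. Renaming `logs/` would mean rewriting every blob whose name starts with that prefix, which is one reason a true hierarchical namespace matters for file-system workloads.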
The blob endpoint (blob.core.windows.net) can be used over HTTP to read and write blobs:
https://storageaccount.blob.core.windows.net/container/path/to/blob
2) Blob Storage with WASBS
When Hadoop applications needed to interact with Azure Blob Storage, HDFS compatibility was provided by the WASB[S] driver. This driver performed the complex task of mapping file system semantics (as required by the Hadoop FileSystem interface) onto the object-store-style interface exposed by Azure Blob Storage.
wasbs://container@storageaccount.blob.core.windows.net
With the WASB driver, tools such as HDInsight connect to Blob Storage on the same blob endpoint (blob.core.windows.net).
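Since WASB[S] URIs and plain HTTP[S] URLs address the same blob endpoint, one can be translated into the other. The sketch below assumes the usual URI shape `wasbs://<container>@<account>.blob.core.windows.net/<path>`; the helper name `wasb_to_http` is hypothetical and only the standard library is used.

```python
from urllib.parse import urlsplit


def wasb_to_http(uri):
    """Translate a wasb[s]:// URI into the equivalent http[s]:// blob URL."""
    parts = urlsplit(uri)
    schemes = {"wasb": "http", "wasbs": "https"}
    if parts.scheme not in schemes:
        raise ValueError(f"not a WASB URI: {uri!r}")
    if not parts.username or not parts.hostname:
        raise ValueError(f"expected <container>@<account host>: {uri!r}")
    # In a WASB URI the container sits before the '@';
    # in the HTTP form it becomes the first path segment.
    return f"{schemes[parts.scheme]}://{parts.hostname}/{parts.username}{parts.path}"


print(wasb_to_http("wasbs://container@storageaccount.blob.core.windows.net/path/to/blob"))
# https://storageaccount.blob.core.windows.net/container/path/to/blob
```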
3) ADLS with ABFSS
(Ignore ADLS Gen1, which is a separate service and is now deprecated.)
See the linked answer for the difference between Blob Storage and ADLS.
Then came ADLS Gen2 (Azure's HDFS offering), which supports hierarchical storage (real folders) along with features such as ACLs on files and folders. A storage account with the hierarchical namespace feature enabled is converted from Blob Storage to ADLS Gen2. To talk to ADLS Gen2, the DFS endpoint (dfs.core.windows.net) is used:
abfss://container@storageaccount.dfs.core.windows.net
Hadoop applications can now use the ABFS driver to connect to ADLS. Because of the new DFS endpoint, the driver is far more efficient and no longer needs a complex mapping layer. Solutions such as Hortonworks, HDInsight, and Azure Databricks can connect to ADLS much more efficiently using the ABFSS driver.
You will also notice that some tools, such as Power BI, support both WASBS and ABFSS.
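As a rough illustration of the ABFS[S] addressing scheme, the following sketch assembles a URI from its parts. It assumes the usual shape `abfs[s]://<container>@<account>.dfs.core.windows.net/<path>`; the function name `abfs_uri` and the sample values are placeholders, not part of any driver or SDK.

```python
def abfs_uri(account, container, path="", secure=True):
    """Build an abfs[s]:// URI pointing at the DFS endpoint of a storage account."""
    if not account or not container:
        raise ValueError("account and container are required")
    scheme = "abfss" if secure else "abfs"  # abfss = TLS-encrypted access
    clean_path = path.lstrip("/")           # avoid a double slash after the host
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{clean_path}"


print(abfs_uri("storageaccount", "container", "path/to/file"))
# abfss://container@storageaccount.dfs.core.windows.net/path/to/file
```

A string of this form is what a Hadoop-compatible tool would typically be given as an input or output location, provided the cluster has credentials configured for the account.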
What to use?
If ADLS is used:
- Hadoop / data-processing tools such as Databricks and HDInsight have to use ABFSS on the DFS endpoint.
- The ADLS HTTP REST endpoint (see its docs) can be used to make plain HTTP calls if needed, e.g. a Python app listing paths.
- ADLS is built on top of Blob Storage, so the blob endpoint can also be used to read and write the data.
If Blob Storage is used:
- Hadoop / data-processing tools can use WASBS on the blob endpoint. (WASB is deprecated; see Update 1 below.)
- The ABFS driver is also cross-compatible, so it can be used as well.
- Other use cases can simply use the HTTP endpoint without any special driver, e.g. a Python app reading and writing files in Blob Storage over HTTP.
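The guidance above can be condensed into a small decision helper. This is only a sketch of the rules as stated in this answer: the function name `recommend_access` and its two boolean parameters are invented for illustration, and it returns the primary option for each case, not every option that works.

```python
def recommend_access(hierarchical_namespace, hadoop_client):
    """Return (scheme, endpoint suffix) following the rules of thumb above.

    hierarchical_namespace: True for ADLS Gen2, False for plain Blob Storage.
    hadoop_client: True for Hadoop-style tools (Databricks, HDInsight, ...),
                   False for apps that just make HTTP calls.
    """
    if hierarchical_namespace:
        if hadoop_client:
            return ("abfss", "dfs.core.windows.net")
        return ("https", "dfs.core.windows.net")
    if hadoop_client:
        # WASB is deprecated, so abfss is the forward-looking alternative here.
        return ("wasbs", "blob.core.windows.net")
    return ("https", "blob.core.windows.net")


print(recommend_access(hierarchical_namespace=True, hadoop_client=True))
# ('abfss', 'dfs.core.windows.net')
print(recommend_access(hierarchical_namespace=False, hadoop_client=False))
# ('https', 'blob.core.windows.net')
```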
- ADLS - Azure Data Lake Storage
- WASB - Windows Azure Storage Blob (provides unencrypted access)
- WASBS - Windows Azure Storage Blob Secure (TLS encrypted access)
- ABFS - Azure Blob File System
- ABFSS - Azure Blob File System Secure
- DFS - Distributed File System
Update 1:
Microsoft has deprecated the Windows Azure Storage Blob driver (WASB) in favor of the Azure Blob Filesystem driver (ABFS). ABFS has numerous benefits over WASB. Use ABFS for both Blob Storage and Data Lake for newer workloads.
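When moving existing jobs from WASB to ABFS, the storage URIs in configs and scripts need rewriting. The sketch below shows one way to do that mechanically. It assumes the only changes needed are the scheme (`wasb` to `abfs`, `wasbs` to `abfss`) and the endpoint (`blob` to `dfs`), and that the account is reachable on the DFS endpoint; the helper name `wasb_to_abfs` is hypothetical.

```python
from urllib.parse import urlsplit

BLOB_SUFFIX = ".blob.core.windows.net"
DFS_SUFFIX = ".dfs.core.windows.net"


def wasb_to_abfs(uri):
    """Rewrite a wasb[s]:// URI as the corresponding abfs[s]:// URI."""
    parts = urlsplit(uri)
    schemes = {"wasb": "abfs", "wasbs": "abfss"}
    if parts.scheme not in schemes:
        raise ValueError(f"not a WASB URI: {uri!r}")
    if not parts.netloc.endswith(BLOB_SUFFIX):
        raise ValueError(f"unexpected host in {uri!r}")
    # Keep '<container>@<account>' and swap only the endpoint suffix.
    netloc = parts.netloc[: -len(BLOB_SUFFIX)] + DFS_SUFFIX
    return f"{schemes[parts.scheme]}://{netloc}{parts.path}"


print(wasb_to_abfs("wasbs://container@storageaccount.blob.core.windows.net/path/to/blob"))
# abfss://container@storageaccount.dfs.core.windows.net/path/to/blob
```

Rewriting the URI is only part of a migration: the authentication settings used by the ABFS driver differ from WASB's and still have to be configured separately.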
The differences and use cases are as follows:
ABFS[S] is used for Azure Data Lake Storage Gen2, which is built on regular Azure Storage (when creating the storage account, enable the hierarchical namespace option to get an Azure Data Lake Storage Gen2 account). An example is here.
WASB[S] is used for regular Azure Storage. An example is here.