Azure Data Factory connecting to Blob Storage via Access Key

I'm trying to build a very basic data flow in Azure Data Factory that pulls a JSON file from blob storage, performs a transformation on some columns, and stores the result in a SQL database. I originally authenticated to the storage account using Managed Identity, but I get the error below when attempting to test the connection to the source:

com.microsoft.dataflow.broker.MissingRequiredPropertyException: account is a required property for [myStorageAccountName]. com.microsoft.dataflow.broker.PropertyNotFoundException: Could not extract value from [myStorageAccountName] - RunId: xxx

I also see the following message in the Factory Validation Output:

[MyDataSetName] AzureBlobStorage does not support SAS, MSI, or Service principal authentication in data flow.

With this I assumed that all I would need to do was switch my Blob Storage linked service to the Account Key authentication method. However, after switching to Account Key authentication and selecting my subscription and storage account, I get the following error when testing the connection:

Connection failed Fail to connect to https://[myBlob].blob.core.windows.net/: Error Message: The remote server returned an error: (403) Forbidden. (ErrorCode: 403, Detail: This request is not authorized to perform this operation., RequestId: xxxx), make sure the credential provided is valid. The remote server returned an error: (403) Forbidden.StorageExtendedMessage=, The remote server returned an error: (403) Forbidden. Activity ID: xxx.
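To rule out a bad key (as opposed to a network block), it can help to hit the same endpoint outside ADF. Here is a rough sketch using the azure-storage-blob Python package; the account name and key below are placeholders:

```python
# Sketch: check the account key directly, outside ADF (placeholder values).
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://mystorageaccountname.blob.core.windows.net",
    credential="<storage-account-key>",
)

# A valid key used from an allowed network lists the containers; a network
# restriction instead surfaces as a 403 "not authorized" error like the one above.
for container in service.list_containers():
    print(container.name)
```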

I've tried selecting the account from Azure directly and entering the key manually, and I get the same error either way. One thing to note is that the storage account only allows access from selected networks. I tried connecting to a different, public storage account and can access it fine. The ADF account has the Storage Account Contributor role, and I've whitelisted both the IP address I'm currently working from and the Azure Data Factory IP ranges listed here: https://learn.microsoft.com/en-us/azure/data-factory/azure-integration-runtime-ip-addresses
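For completeness, the firewall configuration I'm describing corresponds to something like the following with the azure-mgmt-storage Python SDK (a sketch only; the subscription, resource group, account name, and IP address are placeholders, and note that the rule set you pass generally replaces the existing one rather than appending to it):

```python
# Sketch: storage firewall set to "selected networks" with one client IP allowed.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    IPRule,
    NetworkRuleSet,
    StorageAccountUpdateParameters,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.storage_accounts.update(
    resource_group_name="my-resource-group",
    account_name="mystorageaccountname",
    parameters=StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",  # only the listed networks/IPs may connect
            ip_rules=[IPRule(ip_address_or_range="203.0.113.25")],
        )
    ),
)
```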

Also note, I have about 5 copy data tasks working perfectly fine with Managed Identity currently, but I need to start doing more complex operations.

This seems like a similar issue to Unable to create a linked service in Azure Data Factory, but the Storage Account Contributor and Owner roles I have assigned should supersede the Reader role suggested in the reply. I'm also not sure whether that poster is using a public or private storage account.

Thank you in advance.

Undry answered 10/5, 2020 at 2:35 Comment(0)

At the very bottom of the article linked above about whitelisting the integration runtime IP ranges, Microsoft says the following:

When connecting to Azure Storage account, IP network rules have no effect on requests originating from the Azure integration runtime in the same region as the storage account. For more details, please refer this article.

I spoke to Microsoft support about this, and the issue is that whitelisting public IP addresses does not work for resources within the same region: because the resources are on the same network, they connect to each other over private IPs rather than public ones.

There are four options to resolve the original issue:

  • Allow access from all networks under Firewalls and Virtual Networks in the storage account (obviously this is a concern if you are storing sensitive data). I tested this and it works.
  • Create a new Azure-hosted integration runtime that runs in a different region. I tested this as well: my ADF data flow runs in the East region, and I created a runtime that runs in East 2 and it worked immediately. The issue for me here is that I would have to have this reviewed by security before pushing to prod, because we'd be sending data across the public network; even though it's encrypted, it's still not as secure as having two resources talk to each other within the same network.
  • Use a separate activity, such as an HDInsight Spark activity or an SSIS package. I'm sure this would work, but the issue with SSIS is cost, as we would have to spin up an SSIS DB and then pay for the compute. You also need to execute extra activities in the pipeline to start and stop the SSIS integration runtime before and after execution. And I don't feel like learning Spark just for this.
  • Finally, the solution I ended up using: I created a new connection that replaced the Blob Storage connection with a Data Lake Storage Gen2 connection for the dataset, and it worked like a charm. Unlike the Blob Storage connection, Managed Identity is supported for Azure Data Lake Storage Gen2 in data flows, as per this article (see the sketch after this list). In general, the more specific the connection type, the more likely its features will work for your specific need.
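For anyone who defines linked services programmatically rather than in the ADF UI, here is a rough sketch of what that ADLS Gen2 connection looks like using the azure-mgmt-datafactory Python SDK (the subscription, resource group, factory, and storage account names are placeholders, not values from my setup):

```python
# Sketch: an ADLS Gen2 linked service that uses the factory's managed identity.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import AzureBlobFSLinkedService, LinkedServiceResource

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

linked_service = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        # With no account key or service principal supplied, the factory's
        # managed identity is used to authenticate to the account.
        url="https://mystorageaccountname.dfs.core.windows.net",
    )
)

adf.linked_services.create_or_update(
    resource_group_name="my-resource-group",
    factory_name="my-data-factory",
    linked_service_name="AdlsGen2ViaManagedIdentity",
    linked_service=linked_service,
)
```

The factory's managed identity still needs a data-plane role on the account (e.g. Storage Blob Data Reader or Contributor) for the data flow to actually read the files.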
Undry answered 13/5, 2020 at 21:10 Comment(3)
Hi Mike, I am having the same issue. Can you please explain how you set up Option 2? My blob storage is in a different region from my ADF, but I still have the permission issue; if I set it to "All networks" it works. For Option 2, I don't understand which two resources you are running in different regions. Can a data factory only run in a single region, or am I missing something? – Comte
Thanks for researching this. The product appears to be sub-optimal. – Viewer
I came across this issue too and can't believe Azure hasn't sorted it out, as this is so critical for data security and privacy. – Torpedo

This is what you are facing now (screenshot omitted).

From the description, this is a storage connection error. I also assigned the Contributor role to the data factory, but still got the problem.

The problem comes from the network and firewall settings of your storage account. Please check them.

(screenshot omitted)

Make sure you have added your client IP address and the 'Trusted Microsoft services' exception.

Have a look at this doc:

https://learn.microsoft.com/en-us/azure/storage/common/storage-network-security#trusted-microsoft-services
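If you script your storage configuration, that trusted-services exception corresponds to the bypass setting on the account's network rules. A rough sketch with the azure-mgmt-storage Python SDK (the subscription, resource group, and account names are placeholders):

```python
# Sketch: keep "selected networks" but let trusted Microsoft services through.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import NetworkRuleSet, StorageAccountUpdateParameters

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.storage_accounts.update(
    resource_group_name="my-resource-group",
    account_name="mystorageaccountname",
    parameters=StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",
            bypass="AzureServices",  # the "Trusted Microsoft services" exception
        )
    ),
)
```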

Then go to your ADF and choose the corresponding settings (screenshot omitted).

After that, it should be ok.

Verduzco answered 10/5, 2020 at 11:56 Comment(2)
Hi Bowman, unfortunately this does not work. Managed Identity works perfectly fine for copy tasks within a pipeline, but when you try to run a data flow with Blob Storage as the source, you get the error I mentioned in the post. – Undry
Try assigning the Contributor role to the data factory. – Incudes
