Create External table in Azure databricks

I am new to Azure Databricks and am trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen2 location.

From a Databricks notebook I have tried to set the Spark configuration for ADLS access, but I am still unable to execute the DDL I created.

Note: One solution that works for me is to mount the ADLS account on the cluster and then use the mount location in the external table's DDL (a rough sketch of this is at the end of the post). But I wanted to check whether it is possible to create an external table DDL with the ADLS path directly, without a mount.

# Using Principal credentials
spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint", 
"https://login.microsoftonline.com/tenant_id/oauth2/token")

DDL

create external table test(
id string,
name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'

Error Received

Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);

Is it possible to refer to the ADLS location directly in the DDL?
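
For reference, the mount-based workaround mentioned in the note above looks roughly like this; the container, storage account, mount point and credential values are placeholders:

# Rough sketch of the mount workaround (placeholder values)
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "client_id",
  "fs.azure.account.oauth2.client.secret": "client_secret",
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/tenant_id/oauth2/token"
}

dbutils.fs.mount(
  source = "abfss://container@account_name.dfs.core.windows.net/",
  mount_point = "/mnt/employee_data",
  extra_configs = configs
)

# The external table DDL then points at the mount path instead,
# e.g. location '/mnt/employee_data/dev/data/employee'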

Thanks.

Batton answered 27/6, 2019 at 13:26 Comment(4)
Have you verified that your values for tenant_id, client_id and client_secret are correct, and that the service principal has the required permissions?Loraineloralee
Yes, because after setting up the Spark configuration I am able to read the file into a DataFrame and use it.Batton
It looks like the only way is to mount the ADLS Gen2 account.Batton
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);Batton

Sort of, if you can use Python (or Scala).

Start by making the connection:

TenantID = "blah"

def connectLake():
  spark.conf.set("fs.azure.account.auth.type", "OAuth")
  spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
  spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
  spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")

connectLake()
lakePath = "abfss://container@account_name.dfs.core.windows.net/"

Using Python you can register a table using:

spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")

You can now query that table if you have executed the connectLake() function - which is fine in your current session/notebook.

The problem now is that if a new session comes in and tries to select * from that table, it will fail unless it runs the connectLake() function first. There is no way around that limitation, as you have to provide credentials to access the lake.
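
For example, a fresh session/notebook would need something like this before querying the table registered above:

connectLake()                                    # re-establish lake credentials for this session
df = spark.sql("SELECT * FROM DimDate LIMIT 10")
display(df)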

You may want to consider ADLS Gen2 credential pass through: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html

Note that this requires using a High Concurrency cluster.
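
With passthrough enabled on such a cluster, users read the lake with their own Azure AD identity and no service principal configs in the notebook; roughly (path is a placeholder):

# No spark.conf.set(...) calls needed; the cluster passes the user's AAD credential through
df = spark.read.parquet("abfss://container@account_name.dfs.core.windows.net/PRESENTED/DIMDATE/V1")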

Bogbean answered 2/7, 2019 at 13:34 Comment(5)
This does not work for ADLS Gen2. This method works for Gen1 storage accounts.Batton
I tested the above on gen2, have you? If so please describe the error.Bogbean
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);Batton
That session has not run the connectLake() function. As explained, every session needs to run it.Bogbean
Do we need to run the connectLake() function even if we run the DDL in the same Spark session?Batton
