Using external parquet tables in a DBT pipeline

I'm trying to set up a simple DBT pipeline that uses a parquet table stored on Azure Data Lake Storage and creates another table that is also going to be stored in the same location.

Under my models/ directory (which is defined as my sources path) I have two files, datalake.yml and orders.sql. datalake.yml looks like this:

version: 2
sources:
  - name: datalake
    tables:
      - name: customers
        external:
          location: path/to/storage1 # I got this from the file properties in Azure
          file_format: parquet
        columns:
          - name: id
            data_type: int
            description: "ID"
          - name: ...

My orders.sql model looks like this:

{{ config(materialized='table', file_format='parquet', location_root='path/to/storage2') }}
select name, age from {{ source('datalake', 'customers') }}

I'm also using the dbt-external-tables package. Also note that when I run dbt debug everything is fine and I can connect to my database (which happens to be Databricks).
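
For reference, a package like this is usually pulled in via packages.yml followed by dbt deps; a minimal sketch (the namespace and version range here are placeholders, so check the listing on hub.getdbt.com for the ones that match your dbt version):

packages:
  - package: dbt-labs/dbt_external_tables   # placeholder namespace; older dbt versions used fishtown-analytics
    version: [">=0.7.0", "<1.0.0"]          # illustrative range only

The stage_external_sources operation mentioned below comes from that package.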

I tried running dbt run-operation stage_external_sources, which returns Error: staging external sources is not implemented for the default adapter. When I run dbt run, I get Error: UnresolvedRelation datalake.customers.

Or perhaps I could make use of the hive metastore instead somehow? Any tips on how I could fix this would be highly appreciated!

Amil answered 19/8, 2020 at 15:52 Comment(0)

I help maintain both the dbt-spark plugin and the dbt-external-tables package. I can confirm that their interoperability is still preliminary, and I'd strongly welcome contributions to improve it. I don't think it's a big lift, though one of the challenges is the fact that Spark/Databricks supports two different create external table syntaxes (as noted in that issue).
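
For context, the two variants look roughly like this (column names are taken from the question, their types are assumed; this is a hand-written sketch, not necessarily the DDL the package would generate):

-- Hive-style syntax
CREATE EXTERNAL TABLE datalake.customers (id INT, name STRING, age INT)
STORED AS PARQUET
LOCATION 'path/to/storage1';

-- Spark data source syntax (LOCATION makes it an external/unmanaged table)
CREATE TABLE datalake.customers (id INT, name STRING, age INT)
USING PARQUET
LOCATION 'path/to/storage1';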

FWIW I see that you specified path/to/storage as the external location of your source and the location_root config of your orders model. The former is the place to read data from, and the latter is the place to materialize the model as a table. I wasn't sure if you meant that to represent the same placeholder or different ones.

Edit: TIL that you can select from some file types directly in SparkSQL as select * from filetype.filepath. I believe you could register a source like:

version: 2
sources:
  - name: datalake
    schema: parquet
    tables:
      - name: customers
        identifier: "path/to/storage1"
        quoting:
          identifier: true

This means you could have templated code like:

select * from {{ source('datalake', 'customers') }}

Which will resolve to:

select * from parquet.`path/to/storage1`
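
If you want downstream models to keep using ref() instead of repeating the storage path, you could wrap that source in a simple view model. A minimal sketch (the file name and column list are assumptions based on the question):

-- models/staging/stg_customers.sql (hypothetical file name)
-- Materializes the direct-path source as a view so downstream models can ref() it
{{ config(materialized='view') }}

select
    id,    -- columns assumed from the question
    name,
    age
from {{ source('datalake', 'customers') }}

Then orders.sql could select from {{ ref('stg_customers') }} rather than from the source directly.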
Devest answered 19/8, 2020 at 18:55 Comment(5)
Thank you for looking into this, Jeremy, I really appreciate it. I updated my original question to reflect that path/to/storage in source and destination are different locations. I still consider myself to be a beginner with respect to DBT, but I love this tool and I will see if I (or my team) can contribute here.Amil
One more question: in the absence of an appropriate solution, are there any disadvantages to registering external tables as views with select a, b from parquet.`path/to/file` without using {{ref()}}?Amil
No real downside to that approach! I didn't know you could select from parquet files directly based on their path in SparkSQL. That's... really cool. I'll edit my answer above to include an approach that still leverages sources.Devest
Awesome! Well, that means that the problem is indeed solved :)Amil
@JeremyCohen is there a way to use freshness reporting with this approach? The error I'm getting is that it's unable to find the view/table hive_metastore.parquet.{path}.Feral

This might not yet be available. Looks like it's still an open issue with no dev work so far.

Related issue for dbt-external-tables package repo: Support Spark external tables

Do you have the dependencies from dbt-spark installed?
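
If not, the adapter ships extras for its different connection methods; something along these lines (the exact extra depends on your dbt-spark version and how you connect to Databricks, so check the dbt-spark README):

pip install "dbt-spark[PyHive]"   # or "dbt-spark[ODBC]", depending on the connection method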

Here are some relevant issues there:

Spark_connection_url do not contain workspace_id while connecting to databricks

Support Delta Lake format

I realize those aren't exactly helpful for the simple dbt-external-tables use case, but it looks like development is still ongoing to support the Azure Databricks / Data Lake stack.

Gonna try to dig into this a bit more later because this is a use case that's relevant to me as well.

Charqui answered 19/8, 2020 at 16:20 Comment(1)
Thank you for sharing your thoughts! My plan B is to create views for my external tables as select a, b from parquet.`path/to/file`, which works but isn't ideal. I was hoping that I could define these external dependencies in a YAML file, but as you said, it might still be in the backlog.Amil
