Using external parquet tables in a DBT pipeline

I'm trying to set up a simple DBT pipeline that uses a parquet table stored on Azure Data Lake Storage and creates another table that is also going to be stored in the same location.

Under my models/ directory (which is defined as my sources path) I have two files, datalake.yml and orders.sql. datalake.yml looks like this:

version: 2
sources:
  - name: datalake
    tables:
      - name: customers
        external:
          location: path/to/storage1 # I got this from the file properties in Azure
          file_format: parquet
        columns:
          - name: id
            data_type: int
            description: "ID"
          - name: ...

My orders.sql model looks like this:

{{ config(materialized='table', file_format='parquet', location_root='path/to/storage2') }}
select name, age from {{ source('datalake', 'customers') }}

I'm also using the dbt-external-tables package. Also note that when I run dbt debug everything is fine and I can connect to my database (which happens to be Databricks).
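
For reference, a package like this is usually pulled in via packages.yml followed by dbt deps; a minimal sketch (the namespace and version range here are placeholders, so check the listing on hub.getdbt.com for the ones that match your dbt version):

packages:
  - package: dbt-labs/dbt_external_tables   # placeholder namespace; older dbt versions used fishtown-analytics
    version: [">=0.7.0", "<1.0.0"]          # illustrative range only

The stage_external_sources operation mentioned below comes from that package.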

I tried running dbt run-operation stage_external_sources, which returns Error: staging external sources is not implemented for the default adapter. When I run dbt run, I get Error: UnresolvedRelation datalake.customers.

Or perhaps I could make use of the hive metastore instead somehow? Any tips on how I could fix this would be highly appreciated!

Amil answered 19/8, 2020 at 15:52 Comment(0)

I help maintain both the dbt-spark plugin and the dbt-external-tables package. I can confirm that their interoperability is still preliminary, and I'd strongly welcome contributions to improve it. I don't think it's a big lift, though one of the challenges is the fact that Spark/Databricks supports two different create external table syntaxes (as noted in that issue).
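
For context, the two variants look roughly like this (column names are taken from the question, their types are assumed; this is a hand-written sketch, not necessarily the DDL the package would generate):

-- Hive-style syntax
CREATE EXTERNAL TABLE datalake.customers (id INT, name STRING, age INT)
STORED AS PARQUET
LOCATION 'path/to/storage1';

-- Spark data source syntax (LOCATION makes it an external/unmanaged table)
CREATE TABLE datalake.customers (id INT, name STRING, age INT)
USING PARQUET
LOCATION 'path/to/storage1';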

FWIW I see that you specified path/to/storage as the external location of your source and the location_root config of your orders model. The former is the place to read data from, and the latter is the place to materialize the model as a table. I wasn't sure if you meant that to represent the same placeholder or different ones.

Edit: TIL that you can select from some file types directly in SparkSQL as select * from filetype.filepath. I believe you could register a source like:

version: 2
sources:
  - name: datalake
    schema: parquet
    tables:
      - name: customers
        identifier: "path/to/storage1"
        quoting:
          identifier: true

This means you could have templated code like:

select * from {{ source('datalake', 'customers') }}

Which will resolve to:

select * from parquet.`path/to/storage1`
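
If you want downstream models to keep using ref() instead of repeating the storage path, you could wrap that source in a simple view model. A minimal sketch (the file name and column list are assumptions based on the question):

-- models/staging/stg_customers.sql (hypothetical file name)
-- Materializes the direct-path source as a view so downstream models can ref() it
{{ config(materialized='view') }}

select
    id,    -- columns assumed from the question
    name,
    age
from {{ source('datalake', 'customers') }}

Then orders.sql could select from {{ ref('stg_customers') }} rather than from the source directly.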
Devest answered 19/8, 2020 at 18:55 Comment(5)
Thank you for looking into this, Jeremy, I really appreciate it. I updated my original question to reflect that path/to/storage in source and destination are different locations. I still consider myself to be a beginner with respect to DBT, but I love this tool and I will see if I (or my team) can contribute here.Amil
One more question: in the absence of an appropriate solution, are there any disadvantages to registering external tables as views with select a, b from parquet.`path/to/file` without using {{ref()}}?Amil
No real downside to that approach! I didn't know you could select from parquet files directly based on their path in SparkSQL. That's... really cool. I'll edit my answer above to include an approach that still leverages sources.Devest
Awesome! Well, that means that the problem is indeed solved :)Amil
@JeremyCohen is there a way to use freshness reporting with this approach? The error I'm getting is that it's unable to find the view/table hive_metastore.parquet.{path}.Feral

This might not yet be available. Looks like it's still an open issue with no dev work so far.

Related issue for dbt-external-tables package repo: Support Spark external tables

Do you have the dependencies from dbt-spark installed?
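
If not, the adapter ships extras for its different connection methods; something along these lines (the exact extra depends on your dbt-spark version and how you connect to Databricks, so check the dbt-spark README):

pip install "dbt-spark[PyHive]"   # or "dbt-spark[ODBC]", depending on the connection method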

Here are some relevant issues there:

Spark_connection_url do not contain workspace_id while connecting to databricks

Support Delta Lake format

I realize those aren't exactly helpful for the simple dbt-external-tables use case, but it looks like development is still ongoing to support the Azure Databricks / Data Lake stack.

Gonna try to dig into this a bit more later because this is a use case that's relevant to me as well.

Charqui answered 19/8, 2020 at 16:20 Comment(1)
Thank you for sharing your thoughts! My plan B is to create views for my external tables as select a, b from parquet.`path/to/file`, which works but isn't ideal. I was hoping that I could define these external dependencies in a YAML file, but as you said, it might still be in the backlog.Amil
