How to run dbt in Airflow without copying our repo
We use Airflow to orchestrate our workflows and dbt for our daily transformations in BigQuery. We have two separate git repos: one for our dbt project and one for Airflow.

The simplest approach to scheduling our daily dbt run seems to be a BashOperator in Airflow. However, that appears to require nesting our entire dbt project inside our Airflow project so that the dbt run bash command can point to it.

Is it possible to trigger dbt run and dbt test without moving our dbt directory inside our Airflow directory? With the airflow-dbt package, could the dir in default_args point to the GitHub link for the dbt project?
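For context, a minimal sketch of what the airflow-dbt approach mentioned above can look like, assuming the package's DbtRunOperator and DbtTestOperator and a local filesystem path for dir (the path, DAG id, and schedule are made up for illustration):

```python
# Hedged sketch of the airflow-dbt approach, assuming the airflow-dbt package is
# installed and that /path/to/dbt_project is a directory the Airflow workers can
# reach on their local filesystem (hypothetical path).
from datetime import datetime

from airflow import DAG
from airflow_dbt.operators.dbt_operator import DbtRunOperator, DbtTestOperator

default_args = {
    "dir": "/path/to/dbt_project",  # hypothetical local path, not a GitHub URL
    "start_date": datetime(2020, 11, 18),
}

with DAG("dbt_daily", default_args=default_args, schedule_interval="@daily") as dag:
    dbt_run = DbtRunOperator(task_id="dbt_run")
    dbt_test = DbtTestOperator(task_id="dbt_test")

    dbt_run >> dbt_test
```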

Pettifogging answered 18/11, 2020 at 9:18 Comment(1)
When you used the airflow-dbt package, how did you manage the service account key? Did you keep it in a GCS bucket? – Tenor

My advice would be to keep your dbt and Airflow codebases separate. There is indeed a better way:

  1. Dockerise your dbt project in a simple Python-based image where you COPY the codebase
  2. Push that image to DockerHub, ECR, or whichever Docker registry you use
  3. Use the DockerOperator in your Airflow DAG to run that Docker image with your dbt code

I'm assuming here that you use the Airflow LocalExecutor and want to execute your dbt run workload on the server where Airflow is running. If that's not the case and you have access to a Kubernetes cluster, I would suggest using the KubernetesPodOperator instead.
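A minimal sketch of the DockerOperator approach, assuming the dbt project has already been built and pushed as a (hypothetical) image named my-registry/dbt-project:latest; the import path below is the Airflow 2 Docker provider one, on Airflow 1.10 it lives at airflow.operators.docker_operator:

```python
# Hedged sketch of the Docker-based approach: each task runs the dbt CLI inside
# the pre-built dbt image, so the Airflow repo never contains the dbt codebase.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

DBT_IMAGE = "my-registry/dbt-project:latest"  # hypothetical image name

with DAG(
    dag_id="dbt_docker",
    start_date=datetime(2020, 11, 18),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = DockerOperator(
        task_id="dbt_run",
        image=DBT_IMAGE,
        command="dbt run",
    )
    dbt_test = DockerOperator(
        task_id="dbt_test",
        image=DBT_IMAGE,
        command="dbt test",
    )

    dbt_run >> dbt_test
```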

Kauffmann answered 18/11, 2020 at 9:35 Comment(5)
I heavily second this. I think this has become the de facto community standard for self-hosted dbt scheduling. – Thump
This is great, I knew there had to be a better way than combining codebases. I will definitely use this approach. – Pettifogging
Awesome approach, I will try this. – Tenor
Hi @louis_guitton, I have more questions for you regarding your implementation with the DockerOperator. I have included them as a separate question (#65465256) if you want to share more insights about your experience using that approach =) – Expanded
Hey, one more question: how do you manage the service account? Do you keep it in the image? – Tenor

I accepted the other answer based on the consensus via upvotes and the supporting comments; however, here is a second option we're currently using:

  • The dbt and Airflow repos / directories sit next to each other.
  • In our Airflow docker-compose.yml, we've added our dbt directory as a volume so that Airflow has access to it.
  • In our Airflow Dockerfile, we install dbt and copy our dbt code.
  • We use the BashOperator to run dbt run and dbt test (a sketch follows this list).
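A minimal sketch of that BashOperator setup, assuming dbt is installed in the Airflow image and the dbt project is mounted at the hypothetical path /opt/dbt; the import below is the Airflow 2 path, on Airflow 1.10 it is airflow.operators.bash_operator:

```python
# Hedged sketch of the volume-mount + BashOperator approach, assuming the dbt
# project is available inside the Airflow container at /opt/dbt (hypothetical
# mount point defined in docker-compose.yml) and contains profiles.yml.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/dbt"  # hypothetical path where the dbt repo is mounted

with DAG(
    dag_id="dbt_bash",
    start_date=datetime(2020, 11, 25),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {DBT_DIR} && dbt run --profiles-dir {DBT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"cd {DBT_DIR} && dbt test --profiles-dir {DBT_DIR}",
    )

    dbt_run >> dbt_test
```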
Pettifogging answered 25/11, 2020 at 15:9 Comment(0)

Since you’re on GCP, another option that is completely serverless is to run dbt with Cloud Build instead of Airflow. You can also add Workflows on top if you want more orchestration. There’s a post describing this in detail: https://robertsahlin.com/serverless-dbt-on-google-cloud-platform/

Shanan answered 27/11, 2021 at 11:8 Comment(0)