Airflow structure/organization of DAGs and tasks

My questions:

Snuff asked 7/6, 2017 at 23:47 Comment(0)

I would love to benchmark folder structures with other people as well. Maybe it depends on what you are using Airflow for, but I will share my case. I build data pipelines for a data warehouse, so at a high level I basically have two steps:

  1. Dump a lot of data into a data lake (directly accessible only to a few people)
  2. Load data from the data lake into an analytic database, where the data is modeled and exposed to dashboard applications (many SQL queries to model the data)

Today I organize the files into three main folders that try to reflect the logic above:

├── dags
│   ├── dag_1.py
│   └── dag_2.py
├── data-lake
│   ├── data-source-1
│   └── data-source-2
└── dw
    ├── cubes
    │   ├── cube_1.sql
    │   └── cube_2.sql
    ├── dims
    │   ├── dim_1.sql
    │   └── dim_2.sql
    └── facts
        ├── fact_1.sql
        └── fact_2.sql

This is more or less my basic folder structure.
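
To make the layout concrete, here is a minimal sketch of what one of the DAG files could look like, assuming a recent Airflow with the Postgres provider installed; the operator choice, connection id and paths are illustrative assumptions rather than my actual code:

# dags/dag_1.py - hypothetical sketch; operator, connection id and paths are
# assumptions for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="dw_load",
    start_date=datetime(2017, 6, 1),
    schedule_interval="@daily",
    # Lets Jinja resolve "dims/dim_1.sql" etc. relative to the dw folder.
    template_searchpath="/usr/local/airflow/dw",
    catchup=False,
) as dag:
    load_dim = PostgresOperator(
        task_id="load_dim_1",
        postgres_conn_id="analytics_db",  # hypothetical connection id
        sql="dims/dim_1.sql",
    )
    load_fact = PostgresOperator(
        task_id="load_fact_1",
        postgres_conn_id="analytics_db",
        sql="facts/fact_1.sql",
    )
    build_cube = PostgresOperator(
        task_id="build_cube_1",
        postgres_conn_id="analytics_db",
        sql="cubes/cube_1.sql",
    )

    load_dim >> load_fact >> build_cube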

Mattock answered 13/6, 2017 at 17:18 Comment(1)
Just restating that this is just my way of organizing. If anyone comes across this question, it would be great to benchmark with other ways to structure folders and files! :) – Mattock

I use something like this.

  • A project is normally something completely separate or unique. Perhaps DAGs to process files that we receive from a certain client, which will be completely unrelated to everything else (almost certainly a separate database schema).
  • I have my operators, hooks, and some helper scripts (delete all Airflow data for a certain DAG, etc.) in a common folder.
  • I used to have a single git repository for the entire Airflow folder, but now I have a separate git repository per project (it makes things more organized and it is easier to grant permissions on GitLab since the projects are so unrelated). This means that each project folder also has its own .git and .gitignore, etc.
  • I tend to save the raw data and then 'rest' a modified copy of the data, which is exactly what gets copied into the database. I have to heavily modify some of the raw data due to the different formats from different clients (Excel, web scraping, HTML email scraping, flat files, queries from SalesForce or other database sources...).

Example tree:

├───dags
│   ├───common
│   │   ├───hooks
│   │   │       pysftp_hook.py
│   │   │
│   │   ├───operators
│   │   │       docker_sftp.py
│   │   │       postgres_templated_operator.py
│   │   │
│   │   └───scripts
│   │           delete.py
│   │
│   ├───project_1
│   │   │   dag_1.py
│   │   │   dag_2.py
│   │   │
│   │   └───sql
│   │           dim.sql
│   │           fact.sql
│   │           select.sql
│   │           update.sql
│   │           view.sql
│   │
│   └───project_2
│       │   dag_1.py
│       │   dag_2.py
│       │
│       └───sql
│               dim.sql
│               fact.sql
│               select.sql
│               update.sql
│               view.sql
│
└───data
    ├───project_1
    │   ├───modified
    │   │       file_20180101.csv
    │   │       file_20180102.csv
    │   │
    │   └───raw
    │           file_20180101.csv
    │           file_20180102.csv
    │
    └───project_2
        ├───modified
        │       file_20180101.csv
        │       file_20180102.csv
        │
        └───raw
                file_20180101.csv
                file_20180102.csv

Update October 2021. I have a single repository for all projects now. All of my transformation scripts are in the plugins folder (which also contains hooks and operators - basically any code which I import into my DAGs). DAG code I try to keep pretty bare so it basically just dictates the schedules and where data is loaded to and from.

├───dags
│   │
│   ├───project_1
│   │     dag_1.py
│   │     dag_2.py
│   │
│   └───project_2
│         dag_1.py
│         dag_2.py
│
├───plugins
│   ├───hooks
│   │      pysftp_hook.py
│   │      servicenow_hook.py
│   │
│   ├───sensors
│   │      ftp_sensor.py
│   │      sql_sensor.py
│   │
│   ├───operators
│   │      servicenow_to_azure_blob_operator.py
│   │      postgres_templated_operator.py
│   │
│   └───scripts
│       ├───project_1
│       │      transform_cases.py
│       │      common.py
│       ├───project_2
│       │      transform_surveys.py
│       │      common.py
│       └───common
│              helper.py
│              dataset_writer.py
│
│   .airflowignore
│   Dockerfile
│   docker-stack-airflow.yml
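
As an example of what "bare" means here, a sketch of a project DAG under this layout; the function name, schedule and dag_id are assumptions, only the folder layout comes from the tree above:

# dags/project_1/dag_1.py - hypothetical sketch; transform_cases() is assumed
# to live in plugins/scripts/project_1/transform_cases.py.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# The plugins folder is on sys.path, so scripts can be imported directly.
from scripts.project_1.transform_cases import transform_cases

with DAG(
    dag_id="project_1_cases",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="transform_cases",
        python_callable=transform_cases,
    )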
Fluorosis answered 5/7, 2018 at 18:46 Comment(9)
I find this directory structure very useful; just one doubt: would the scheduler's performance be impacted by placing everything under the dags directory? Although even if I place the files containing operators (or anything except actual DAGs) outside the dags directory and import them in the DAG files, it would mean more or less the same thing. However, non-Python files (like the .sql files above) inside the dags folder can (in theory) cause unnecessary overhead for the scheduler. – Tedmann
Oh, I am not sure. I will have to look into it, though. Honestly, it never crossed my mind - I set it up this way more for version control (easy to clone or share a specific project end to end and not mix up projects). – Fluorosis
"I used to have a single git repository for the entire Airflow folder, but now I have a separate git per project" - how are you then gluing the different repos together in the main repo? (My current solution is using git submodules.) – Saskatchewan
Just following up on this thread. Has anyone seen any impact on the scheduler from this repository structure? – Suzisuzie
Curious to see a sample of your code. – Superinduce
I always run into trouble with my local env (Docker) and deploying to prod. What's your root path? It always forces me to add "airflow"; if not, it's not recognized by the DAGs when the scheduler loads the files. – Nanon
In my opinion this structure is good, but when your project gets bigger the scripts folder should become a project added to the PYTHONPATH (or installed as a Python package) to follow a clean architecture and scale better. – Oran
That would make sense. Do you just edit the Dockerfile to add something new to the PYTHONPATH? I also have a new folder in plugins called 'tasks' which holds TaskFlow API functions that I import these days, instead of traditional operators. – Fluorosis
@Fluorosis Are you constraining everything under scripts/ to use the same Python version and dependencies? I am trying to solve for different "scripts" (more so modules in my case) that may require different Pythons or have their deps versioned independently from one module to another. – Gorga
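
Following up on the last two comments: if the scripts folder were split out and installed as a Python package, a minimal setup.py could look like the sketch below; the package name and paths are placeholders, not taken from the answer above.

# Hypothetical setup.py for packaging the transformation code as its own
# installable package; names and paths are placeholders.
from setuptools import find_packages, setup

setup(
    name="airflow-transform-scripts",
    version="0.1.0",
    package_dir={"": "plugins/scripts"},
    packages=find_packages(where="plugins/scripts"),
)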

I am using Google Cloud Composer. I have to manage multiple projects with some additional SQL scripts, and I want to sync everything via gsutil rsync. Hence I use the following structure:

├───dags
│   │
│   ├───project_1
│       │
│       ├───dag_bag.py
│       │
│       ├───.airflowignore
│       │
│       ├───dag_1
│       │      dag.py
│       │      script.sql
│
├───plugins
│   │
│   ├───hooks
│   │      hook_1.py
│   │   
│   ├───sensors
│   │      sensor_1.py
│   │
│   ├───operators
│   │      operator_1.py

And the file dag_bag.py contains these lines:

from airflow.models import DagBag

dag_bag = DagBag(dag_folder="/home/airflow/gcs/dags/project_1", include_examples=False)
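
The usual companion to this snippet in the dynamic-DAG pattern is re-exporting each collected DAG at module level so the scheduler's file processor can discover it; treat the loop below as an assumed sketch rather than something taken verbatim from my setup.

# Assumed companion step: expose every DAG the DagBag collected as a
# module-level variable so the scheduler picks it up.
for dag_id, dag in dag_bag.dags.items():
    globals()[dag_id] = dag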
Masticate answered 29/12, 2022 at 16:33 Comment(0)
