I use something like this.
- A project is normally something completely separate or unique. Perhaps DAGs to process files that we receive from a certain client which will be completely unrelated to everything else (almost certainly a separate database schema)
- I have my operators, hooks, and some helper scripts (delete all Airflow data for a certain DAG, etc.) in a common folder
- I used to have a single git repository for the entire Airflow folder, but now I have a separate git per project (makes it more organized and easier to grant permissions on Gitlab since projects are so unrelated). This means that each project folder also as a .git and .gitignore, etc as well
- I tend to save the raw data and then 'rest' a modified copy of the data which is exactly what gets copied into the database. I have to heavily modify some of the raw data due to different formats from different clients (Excel, web scraping, HTML email scraping, flat files, queries from SalesForce or other database sources...)
Example tree:
├───dags
│ ├───common
│ │ ├───hooks
│ │ │ pysftp_hook.py
│ │ │
│ │ ├───operators
│ │ │ docker_sftp.py
│ │ │ postgres_templated_operator.py
│ │ │
│ │ └───scripts
│ │ delete.py
│ │
│ ├───project_1
│ │ │ dag_1.py
│ │ │ dag_2.py
│ │ │
│ │ └───sql
│ │ dim.sql
│ │ fact.sql
│ │ select.sql
│ │ update.sql
│ │ view.sql
│ │
│ └───project_2
│ │ dag_1.py
│ │ dag_2.py
│ │
│ └───sql
│ dim.sql
│ fact.sql
│ select.sql
│ update.sql
│ view.sql
│
└───data
├───project_1
│ ├───modified
│ │ file_20180101.csv
│ │ file_20180102.csv
│ │
│ └───raw
│ file_20180101.csv
│ file_20180102.csv
│
└───project_2
├───modified
│ file_20180101.csv
│ file_20180102.csv
│
└───raw
file_20180101.csv
file_20180102.csv
Update October 2021. I have a single repository for all projects now. All of my transformation scripts are in the plugins folder (which also contains hooks and operators - basically any code which I import into my DAGs). DAG code I try to keep pretty bare so it basically just dictates the schedules and where data is loaded to and from.
├───dags
│ │
│ ├───project_1
│ │ dag_1.py
│ │ dag_2.py
│ │
│ └───project_2
│ dag_1.py
│ dag_2.py
│
├───plugins
│ ├───hooks
│ │ pysftp_hook.py
| | servicenow_hook.py
│ │
│ ├───sensors
│ │ ftp_sensor.py
| | sql_sensor.py
| |
│ ├───operators
│ │ servicenow_to_azure_blob_operator.py
│ │ postgres_templated_operator.py
│ |
│ ├───scripts
│ ├───project_1
| | transform_cases.py
| | common.py
│ ├───project_2
| | transform_surveys.py
| | common.py
│ ├───common
| helper.py
| dataset_writer.py
| .airflowignore
| Dockerfile
| docker-stack-airflow.yml