A Dataset is really just a reference string that tracks the status of the task it is attached to (via the task's outlets parameter). It is a potentially more transparent and scalable way to create dependencies in Airflow.
Datasets allow the success of one task to trigger other DAGs with minimal code. They also provide some separation between applications, similar to the function of an API. However, Datasets do not connect to, track, or even know about your actual data.
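Here is a minimal sketch of that pattern (the DAG ids, URI, and callables below are made up for illustration): a producer task declares the Dataset in its outlets, and a consumer DAG schedules on that same Dataset.

from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

example_dataset = Dataset('s3://example-bucket/example.csv')  # hypothetical URI

# Producer: listing the Dataset in `outlets` tells Airflow this task "updates" it.
with DAG(dag_id='producer_dag', start_date=datetime(2023, 1, 1), schedule='@daily'):
    PythonOperator(
        task_id='update_dataset',
        python_callable=lambda: print('writing data...'),
        outlets=[example_dataset],
    )

# Consumer: scheduling on the Dataset runs this DAG whenever
# update_dataset completes successfully.
with DAG(dag_id='consumer_dag', start_date=datetime(2023, 1, 1), schedule=[example_dataset]):
    PythonOperator(
        task_id='use_dataset',
        python_callable=lambda: print('reading data...'),
    )

If update_dataset succeeds, Airflow triggers consumer_dag; if it fails, nothing happens.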
Datasets and URIs are summarized well by Marc Lamberti from Astronomer in this YouTube video on Datasets:
[You] can put pretty much whatever you want for the URI of your dataset. It's because Airflow doesn't care about if the actual data is updated or not. Indeed the only thing that Airflow monitors is if the task that updates the dataset successfully completes or not. If it successfully completes then the DAG is triggered. If it fails then the DAG is not triggered. It is as simple as that.
[T]hink of the URI as the unique identifier of a dataset and not as a way for Airflow to access the actual data of your dataset. That means if another tool like Spark updates that dataset, Airflow will not be aware of that and your DAG won't be triggered.
Note: If you need to listen for external data changes, Airflow Sensors are still the way to go.
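For instance, a FileSensor (one of the built-in Sensors) polls for the data itself rather than waiting on another task's success. A quick sketch, with a hypothetical path:

from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id='wait_for_file_dag', start_date=datetime(2023, 1, 1), schedule='@hourly'):
    FileSensor(
        task_id='wait_for_report',
        filepath='/data/incoming/report.csv',  # hypothetical path
        poke_interval=60,  # re-check every 60 seconds until the file exists
    )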
The Airflow Dataset documentation, while more technically worded, supports this:
Airflow makes no assumptions about the content or location of the data represented by the URI. It is treated as a string.
These are all valid ways to create Datasets:
from airflow.datasets import Dataset

mysql_data = Dataset('mysql://localhost:3306/database_name?table=table_name')
bigquery_table = Dataset('bigquery://gcp-project-name/dataset-name/table-name')
some_table = Dataset('table://database/table_name')
some_other_table = Dataset('table_name')
some_file = Dataset('file_name.csv')
While it technically doesn't matter what string you choose, it's usually to your benefit to define it clearly. I referenced the MySQL URI docs to create the mysql one. BigQuery doesn't have a standard URI scheme, so I made up my own to reference the table. You can even use simple strings. You determine how detailed you want to be.
Datasets have a very simple set of options for now (as of Airflow v2.7.1), but they likely lay the foundation for more data-aware pipelines in future Airflow versions.
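At that version, the constructor accepts little more than the URI and an optional extra dict of arbitrary metadata, which Airflow stores but does not interpret. A quick illustration (the values are made up):

from airflow.datasets import Dataset

# `extra` attaches metadata to the Dataset; Airflow does not act on it.
annotated = Dataset(
    's3://example-bucket/example.csv',  # hypothetical URI
    extra={'owner': 'data-team'},
)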