How to manage Python dependencies in Airflow?

On my local machine I created a virtualenv and installed Airflow. When a DAG or plugin requires a Python library, I pip install it into the same virtualenv.

How can I keep track of which libraries belong to a DAG and which are used by Airflow itself? I recently deleted a DAG and wanted to remove the libraries it was using. It was pretty time-consuming, and I was crossing my fingers that I didn't delete something that was being used by another DAG!

Vedis answered 4/7, 2019 at 15:33 Comment(1)
Btw, if you need to do the same for your custom libraries, take a look at airflow.apache.org/docs/stable/plugins.html (astronomer.io/guides/airflow-importing-custom-hooks-operators)Pembroke

Particularly for larger Airflow use cases, I'd recommend using Airflow to orchestrate tasks at a different layer of abstraction, so you aren't managing dependencies on the Airflow side at all.

I'd recommend taking a look at either the DockerOperator or the KubernetesPodOperator. With these, you build your Python tasks into Docker images and have Airflow run them as containers. That way you don't need to manage Python dependencies in Airflow itself, and you won't hit the disaster scenario where two DAGs have conflicting dependencies. The KubernetesPodOperator does, however, require you to be comfortable managing a Kubernetes cluster.
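
For illustration, here's a minimal sketch of a single containerized task using the DockerOperator (Airflow 1.10-era import path; the DAG id, image name, and command are placeholders of mine, not from the question). The task's Python dependencies live entirely inside the image, never in Airflow's virtualenv:

from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

with DAG(
    dag_id="containerized_example",         # hypothetical DAG name
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
) as dag:
    DockerOperator(
        task_id="run_job",
        image="my-registry/my-job:latest",  # image bundles its own Python deps
        command="python run_job.py",
        auto_remove=True,                   # remove the container when it exits
    )

The same pattern applies to the KubernetesPodOperator; only the operator class and its cluster/connection settings change.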

Understructure answered 5/7, 2019 at 8:28 Comment(5)
I see an issue with this though. Let's say you package an ETL pipeline in a Docker container. If it's a 3-step process and one of the steps fails, Airflow would have to rerun the entire process rather than just the step that failed, correct? Or within your dockerized task can you call separate pieces of your pipeline?Detestation
@Detestation You don't necessarily need to package all three tasks into one Docker invocation. You can separate them out into three images, or use one image with three different calls. I agree, having three Airflow tasks for the three stages is better practice (a sketch of that split follows this comment thread).Understructure
Agreed, that seems like the better practice. On that note, can you shed any additional light on why one might use the KubernetesPodOperator over the DockerOperator? They both seem like a good way to containerize your workflow.Detestation
DockerOperator vs KubernetesPodOperator really depends on your underlying infrastructure. In short, the DockerOperator runs a Docker image on the node the Airflow worker lives on; if that node's resources are fully consumed, your container can't run. A Kubernetes cluster manages a pool of nodes for running containers, so the KubernetesPodOperator has an entire cluster available to run your container. If you're using a large amount of machine resources, I'd recommend the KubernetesPodOperator.Understructure
I second @chris.mclennon. I have a related, detailed video on this that could help clear up the concepts: youtube.com/watch?v=9pykChPp-X4 Hallow
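
Building on the comment thread above, here's a sketch of splitting the pipeline into three containerized tasks with the KubernetesPodOperator (Airflow 1.10-era contrib import; the namespace, image, and script names are placeholder assumptions). Because each stage is its own Airflow task, a failure only reruns that stage:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="etl_pipeline",                   # hypothetical DAG name
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
) as dag:
    stages = {}
    for stage in ("extract", "transform", "load"):
        stages[stage] = KubernetesPodOperator(
            task_id=stage,
            name=stage,
            namespace="default",             # assumed namespace
            image="my-registry/etl:latest",  # one image, three entry points
            cmds=["python", "{}.py".format(stage)],
            get_logs=True,
        )
    stages["extract"] >> stages["transform"] >> stages["load"]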

There is airflow.operators.python_operator.PythonVirtualenvOperator, which you can use in DAGs where you would otherwise use a PythonOperator.

Using the PythonVirtualenvOperator in place of the PythonOperator isolates a DAG's dependencies in a virtualenv, and you can keep separate requirements files.
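
A minimal sketch of what that looks like (the package name, version, and DAG id are placeholders of mine). The callable runs in a freshly created virtualenv containing only the listed requirements, so the DAG's dependencies never touch Airflow's own environment:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator


def transform():
    # Imports must live inside the callable: it executes in the new
    # virtualenv, not in the Airflow worker's interpreter.
    import pandas as pd  # hypothetical dependency of this DAG only
    print(pd.__version__)


with DAG(
    dag_id="dag1",                          # hypothetical DAG name
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
) as dag:
    PythonVirtualenvOperator(
        task_id="transform",
        python_callable=transform,
        requirements=["pandas==0.24.2"],    # this DAG's own requirements
        system_site_packages=False,         # don't pull in Airflow's packages
    )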

You can use comments in the requirements file to mark which dependencies belong to which DAG, e.g.

package-one  # Dag1

...and when you delete the DAG, grep the requirements file for the DAG's name, uninstall those packages, then delete the lines.

With this approach, whenever you install a package for a DAG you need a process for adding the DAG's name as a comment in the requirements file. You could write a script to perform this.
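
For example, a small sketch of such a script (the file name and the comment convention from the example above are assumptions): it lists the packages tagged with a given DAG name, so you can uninstall them and delete their lines when the DAG goes away:

import sys


def packages_for_dag(requirements_path, dag_name):
    """Return the packages whose requirement line is tagged with dag_name."""
    packages = []
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            # Expect lines like "package-one  # Dag1"
            if "#" in line:
                requirement, comment = line.split("#", 1)
                if dag_name.lower() in comment.lower() and requirement.strip():
                    packages.append(requirement.strip())
    return packages


if __name__ == "__main__":
    # Usage: python list_dag_requirements.py requirements.txt Dag1
    print("\n".join(packages_for_dag(sys.argv[1], sys.argv[2])))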

Brendis answered 5/7, 2019 at 6:16 Comment(3)
Not practical if you have a big project with lots of packages. A virtualenv is created and destroyed for each PythonVirtualenvOperator task you run.Overindulge
Also, wouldn't these requirements pollute the Airflow environment?Ghiselin
@BrylieChristopherOxley The requirements are installed into a virtual environment in a temporary folder, so the Airflow environment is not polluted. You can read the implementation hereBrendis
