Apache Airflow DAG cannot import local module

I do not seem to understand how to import modules into an Apache Airflow DAG definition file. I want to do this to be able to create a library that makes declaring tasks with similar settings less verbose, for instance.

Here is the simplest example I can think of that replicates the issue: I modified the airflow tutorial (https://airflow.apache.org/tutorial.html#recap) to simply import a module and run a definition from that module. Like so:

Directory structure:

- dags/
-- __init__.py
-- lib.py
-- tutorial.py

tutorial.py:

"""
Code that goes along with the Airflow located at:
http://airflow.readthedocs.org/en/latest/tutorial.html
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

# Here is my added import
from lib import print_double

# And my usage of the imported def
print_double(2)

## -- snip, because this is just the tutorial code,
## i.e., some standard DAG definition stuff --

print_double is just a simple def which multiplies whatever input you give it by 2, and prints the result, but obviously that doesn't even matter because this is an import issue.
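For completeness, lib.py is nothing more than something like this (reconstructed from the description above):

```python
# lib.py -- the helper module imported by tutorial.py
def print_double(x):
    """Print twice the given value."""
    print(2 * x)
```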

I am able to run airflow test tutorial print_date 2015-06-01 as per the tutorial docs successfully - the dag runs, and moreover the print_double succeeds. 4 is printed to the console, as expected. All appears well.

Then I go to the web UI, and am greeted by Broken DAG: [/home/airflow/airflow/dags/tutorial.py] No module named 'lib'. Unpausing the DAG and attempting a manual run using the UI results in a "running" status, but it never succeeds or fails. It just sits on "running" forever. I can queue up as many runs as I'd like, but they'll all just sit on "running" status.

I've checked the airflow logs, and don't see any useful debug information there.

So what am I missing?

Systemize answered 27/12, 2017 at 20:55 Comment(3)
I tested this locally and it worked; are you sure you got the files in the proper directory? Could you be editing a tutorial.py file that's not really in the dag folder? The path looks dodgy with the two "airflow"s there: /home/airflow/airflow/dags/tutorial.py -- Kalin
I've context switched off of this problem, but I'll try a totally fresh airflow install in a VM and try to replicate again when I get a chance. However, I can confirm that airflow is the username and airflow/airflow is the install dir, so at least that part is not the issue. I can also confirm, just by cding into the dir, that the directory structure is as posted in the question. But I'll do my due diligence and replicate the whole thing in an isolated environment, since you are saying it works for you. -- Systemize
I did three strange things: added a trailing / to the [core] dags_folder = ... setting in airflow.cfg, ran chmod 777 on the __init__.py file in the dags folder, and rebooted the system. After these three steps Airflow started working, though I don't know why; maybe the reboot alone did it. -- Aerostatics
16

Adding the DAG file's directory to sys.path worked for me:

import os
import sys

sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))
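Note that this line has to run before the local imports it enables. A slightly fuller sketch of the top of a DAG file, with a guard against duplicate entries (the variable name here is illustrative):

```python
import os
import sys

# Directory containing this DAG file (dags/ in the question's layout)
dag_dir = os.path.abspath(os.path.dirname(__file__))
if dag_dir not in sys.path:
    sys.path.insert(0, dag_dir)

# Local imports such as `from lib import print_double` resolve from here on.
```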
Honora answered 8/1, 2019 at 22:43 Comment(10)
Why do you use join()? This has worked for me instead: import sys; sys.path.insert(0, os.path.abspath(os.path.dirname(__file__))) -- Layman
@alltej The DAG file itself, I think. Has worked for me before, though I'm currently seeing some weirdness trying it in another DAG right now (https://mcmap.net/q/453597/-use-separate-environ-and-sys-path-between-dags/8236733) -- Holarctic
Yes - this is in the DAG Python file. Works to this day for me. -- Honora
We've stopped using Airflow, so I cannot validate any of these answers. Based on the number of upvotes on this one after so much time, I'm going to accept it as the correct answer. -- Systemize
Where does this need to be added? -- Terryl
@Terryl - This needs to go along with the import block - typically before importing the local modules. -- Honora
If you don't like putting import sys; sys.path.insert(0, os.path.abspath(os.path.dirname(__file__))) into your DAG module, you can extend the PYTHONPATH environment variable with the return value of os.path.abspath(os.path.dirname(__file__)). In my case, using apache-airflow with Docker, I put the following into my Dockerfile: ENV PYTHONPATH "${PYTHONPATH}:blablabla:/opt/project". My custom DAG-helper module was located at /opt/project inside the Airflow container; extending PYTHONPATH makes Python also look for my custom modules at /opt/project whenever I import something. -- Mccubbin
Can you explain why you need to add to the sys path? As far as I know, when a module is imported in Python, the interpreter first searches for a built-in module and, if that's not found, then searches for a file in the list of directories given by sys.path. -- Cootch
Adding this line into the DAG worked for me, thank you. -- Patellate
It helped a lot. -- Fretted
10

Are you using Airflow 1.9.0? This might be fixed there.

The issue is caused by the way Airflow loads DAGs: it doesn't just import them as normal Python modules, because it wants to be able to reload them without restarting processes. As a result, . isn't in the Python search path.

If 1.9.0 doesn't fix this, the easiest change is to put export PYTHONPATH=/home/airflow/airflow/:$PYTHONPATH in the startup scripts. The exact format of that will depend on what you are using (systemd vs init scripts etc.)
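As a quick sanity check that PYTHONPATH is the mechanism at work, here is a self-contained shell sketch (a temp dir stands in for the real dags folder; the /home/airflow path above is specific to the question's install):

```shell
# Create a throwaway folder with a lib.py, then import it by name
# purely via PYTHONPATH -- the same thing the export above does for
# the scheduler/webserver processes.
tmpdir=$(mktemp -d)
printf 'def print_double(x):\n    print(2 * x)\n' > "$tmpdir/lib.py"
PYTHONPATH="$tmpdir" python3 -c "from lib import print_double; print_double(2)"
# prints 4
```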

Lawford answered 8/1, 2018 at 16:20 Comment(10)
Context switched out of this for the moment - will investigate this answer as soon as I can! -- Systemize
I'm having the same problem; any fixes yet? -- Kelley
I handed off the Airflow task to a coworker, so I won't get around to checking this answer yet. But it seems reasonable - have you tried it, @Kelley? -- Systemize
"it wants to be able to reload it without restarting processes" - couldn't Airflow use importlib.reload (docs.python.org/3/library/importlib.html#importlib.reload) if this were the only reason? -- Runion
It could, but that wouldn't help with deps/imported libs. There are other reasons too (a sys.exit, or worst case a segfault, in one DAG shouldn't bring down the scheduler, and it uses multiple parallel processes to improve performance). -- Lawford
@AshBerlin-Taylor This still seems to be an issue with 1.10.2. -- Iyre
Exporting the PYTHONPATH did not work either. I am using 1.10.2. -- Iyre
Is there any other workaround? The issue still persists in version 1.10.10. -- Pharyngoscope
The issue still persists in version 1.10.12. -- Eskilstuna
What do you mean by startup scripts and init scripts? -- Eskilstuna
2

If you're working with git-sync in Kubernetes and did not use it as an initContainer (only as a container, or not at all), then it is possible that the modules were never loaded into the webserver or scheduler.

Subsocial answered 10/9, 2020 at 17:39 Comment(2)
I'm using Airflow's Helm chart and I'm having this issue; maybe there's a way to overcome it? -- Volpe
It looks like they fixed it in this PR: github.com/apache/airflow/pull/16339. So it's a matter of time before this lands in a new Helm release. -- Volpe
1

I fixed this issue by putting my modules in a subfolder (scripts) of the plugins folder under Airflow. To import the module in my DAG files I used:

import sys

# Note: this relative path is resolved against the process's working
# directory, not the DAG file's location.
sys.path.append('../../plugins/scripts')
import <your_module_name> as ymn

Make sure that when you call your functions in the DAG, you add the prefix (ymn. here) if you specified one.
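Because the relative path above depends on the scheduler's working directory, it can silently fail. A variant anchored to the DAG file itself would look like this (a sketch, assuming plugins/ and dags/ are sibling folders under the Airflow home):

```python
import os
import sys

# Resolve plugins/scripts relative to this DAG file rather than the
# process working directory (assumes dags/ and plugins/ are siblings).
scripts_dir = os.path.abspath(
    os.path.join(os.path.dirname(__file__), "..", "plugins", "scripts")
)
if scripts_dir not in sys.path:
    sys.path.append(scripts_dir)
```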

Kimmy answered 7/2 at 6:34 Comment(0)
-1

Simply put your local module in the Airflow plugins folder and it will start working. To find the location of your plugins folder, use the command: airflow info

Antibaryon answered 30/9, 2021 at 11:58 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. -- Proulx