I have a program structured as a DAG that processes and cleans certain files, combines them, and then does additional calculations. I want a way to run the whole analysis pipeline, and to re-run it when anything changes, without having to re-process every single component.
I read about Makefiles and thought they sounded like the perfect solution. I am aware that they are probably outdated and that better alternatives likely exist, but I generally only find long lists of workflow-scheduler tools that, as far as I can tell, are not quite suited to this purpose (e.g., Airflow, Luigi, Nextflow, Dagobah, etc.).
Many of these seem like overkill, with schedulers, GUIs, etc. that I don't really need. I just want one file that does the following:
- makes it obvious which Python scripts need to run
- shows file dependencies, so that a full re-run only redoes the parts whose upstream inputs have changed
- allows for some parallelization (not strictly necessary)
- doesn't have too much boilerplate
Makefile example:
```make
.PHONY : dats
dats : isles.dat abyss.dat

isles.dat : books/isles.txt
	python countwords.py books/isles.txt isles.dat

abyss.dat : books/abyss.txt
	python countwords.py books/abyss.txt abyss.dat

.PHONY : clean
clean :
	rm -f *.dat
```
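Here `countwords.py` is just a placeholder for one of my per-file processing scripts; the name and the word-count logic below are only illustrative, but a minimal sketch of the kind of script the rules invoke would be:

```python
# countwords.py - illustrative stand-in for one per-file processing step:
# reads a text file and writes "word count" lines to the output path.
import sys
from collections import Counter


def main(input_path, output_path):
    with open(input_path, encoding="utf-8") as f:
        words = f.read().lower().split()
    counts = Counter(words)
    with open(output_path, "w", encoding="utf-8") as f:
        for word, count in counts.most_common():
            f.write(f"{word} {count}\n")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

With that in place, `make dats` rebuilds only the `.dat` files whose input books have changed, and `make -j2 dats` runs the two independent targets in parallel, which covers the parallelization point above.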
Is this the best way to run a pipeline like this in Python, or is there a better approach?