What is the state-of-the-art way to handle what Makefiles do for Python data analysis?

4

10

I have a program structured as a DAG that processes and cleans certain files, combines them, and then does additional calculations. I want a way to run the whole analysis pipeline and re-run it when anything changes, without having to re-process every single component.

I read about Makefiles and thought they sounded like the perfect solution. I am also aware that make is probably outdated and that better alternatives probably exist, but I generally only find long lists of workflow scheduler tools that, as far as I can tell, are not quite suited to this purpose (e.g., Airflow, Luigi, Nextflow, Dagobah, etc.).

It seems like many of these are overkill with schedulers, GUIs, etc. which I don't really need. I just want one file that does the following:

  • makes it obvious which Python scripts need to run
  • shows file dependencies so that a full re-run will only redo parts where something has been changed upstream
  • has the potential for some parallelization (not very necessary)
  • doesn't have too much boilerplate

Makefile example:

.PHONY : dats
dats : isles.dat abyss.dat

isles.dat : books/isles.txt
	python countwords.py books/isles.txt isles.dat

abyss.dat : books/abyss.txt
	python countwords.py books/abyss.txt abyss.dat

.PHONY : clean
clean :
	rm -f *.dat

Is this the best procedure to run something like this in Python, or is there a better way?

Nynorsk answered 8/11, 2019 at 0:14
11

DVC (Data Version Control) includes a modern re-implementation and extension of make that is particularly suited to data-science pipelines.

Handling pipelines in DVC has important benefits over make in many scenarios, such as relying on file checksums rather than modification times. On the other hand, make is simpler in some respects, and it has a powerful macro mechanism. Still, some elements of Makefile syntax are quite subtle (e.g., multiple outputs, intermediate files), and make generally doesn't support whitespace in filenames.
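
To illustrate the checksum-versus-modification-time point, here is a minimal Python sketch of the two staleness checks. It is not how DVC or make are actually implemented; the function names (`file_digest`, `stale_by_mtime`, `stale_by_checksum`) and the idea of keeping the last-seen digests in a plain dict are assumptions made for this example.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stale_by_mtime(target, dep):
    """make-style check: rebuild if the target is missing or older than the dependency."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(dep) > os.path.getmtime(target)


def stale_by_checksum(dep, recorded):
    """Checksum-style check: rebuild only if the dependency's content differs
    from the digest recorded after the last successful run."""
    return recorded.get(str(dep)) != file_digest(dep)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dep = Path(tmp) / "isles.txt"
        target = Path(tmp) / "isles.dat"
        dep.write_text("some words\n")
        target.write_text("2\n")
        recorded = {str(dep): file_digest(dep)}  # hypothetical "lock file"

        # Simulate `touch`: bump the dependency's timestamp, leave its content alone.
        later = os.path.getmtime(target) + 10
        os.utime(dep, (later, later))

        print(stale_by_mtime(target, dep))       # timestamp check asks for a rebuild
        print(stale_by_checksum(dep, recorded))  # content check does not
```

The practical difference shows up after operations such as a fresh checkout or a `touch`, which change timestamps without changing content: a timestamp-based tool re-runs the step, while a checksum-based one can skip it.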

Briannabrianne answered 8/4, 2021 at 13:59
2

Is this the best procedure to run something like this in python or is there a better way?

"Best" is surely in the eye of the beholder. However, if the make-based approach presented in the question is satisfactorily representative of the problem, then it is a good way. make implementations are very widely available, and their behavior is well understood and generally well suited to problems such as the one presented.

There are other build tools that compete with make, some written in Python, and there are undoubtedly some more esoteric software frameworks that could be applied to the task. Nevertheless, if you want to focus on doing the work instead of on building the framework to do the work, then I don't see any reason to look past the make-based solution you already have.

Washout answered 8/11, 2019 at 1:22
2

The way you present the question, I would say snakemake is the way to go. Having said that, GNU make may be old, but it is not going to disappear any time soon, and it has been tried and tested to death.

I don't speak make, but I think your example Makefile in snakemake would be something like this:

rule all:
    input:
        ['isles.dat', 'abyss.dat'],

rule make_isles:
    input:
        'books/isles.txt',
    output:
        'isles.dat',
    shell:
        r"""
        python countwords.py {input} {output}
        """

rule make_abyss:
    input:
        'books/abyss.txt',
    output:
        'abyss.dat',
    shell:
        r"""
        python countwords.py {input} {output}
        """

Save this in a file called Snakefile and execute it as:

snakemake --cores 1 # vanilla execution; newer snakemake releases require --cores

snakemake -p -n # Print shell commands (-p). Dry-run mode (-n)

snakemake --delete-all-output # Same-ish as .PHONY clean

snakemake is popular in bioinformatics, but it is a fairly general-purpose tool.

Nightie answered 10/11, 2019 at 18:29
0

Maybe not "state of the art", but here are two relatively lightweight alternative Python tools that match the OP's requirements.

In both, rule/task configuration is done in Python, which may be preferred over Make's dedicated rule definition syntax, and which adds flexibility when working with Python code. On the other hand, it's hard to beat Make's syntax in conciseness.

Invoking Python via the command line, and other details of the examples below, may not be idiomatic for these tools, but the implementations should be close to the OP's Makefile example.

Gird

Contents of girdfile.py:

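from pathlib import Path

from gird import Phony, rule

# Paths used by the rules below (not defined in the original snippet;
# assumed here to match the OP's Makefile).
PATH_ISLES_TXT = Path("books/isles.txt")
PATH_ISLES_DAT = Path("isles.dat")
PATH_ABYSS_TXT = Path("books/abyss.txt")
PATH_ABYSS_DAT = Path("abyss.dat")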

RULE_ISLES = rule(
    target=PATH_ISLES_DAT,
    deps=PATH_ISLES_TXT,
    recipe=f"python countwords.py {PATH_ISLES_TXT} {PATH_ISLES_DAT}",
)

RULE_ABYSS = rule(
    target=PATH_ABYSS_DAT,
    deps=PATH_ABYSS_TXT,
    recipe=f"python countwords.py {PATH_ABYSS_TXT} {PATH_ABYSS_DAT}",
)

rule(
    target=Phony("dats"),
    deps=(
        RULE_ISLES,
        RULE_ABYSS,
    ),
)

rule(
    target=Phony("clean"),
    recipe="rm -f *.dat",
)

doit

Contents of dodo.py:

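# Paths used by the tasks below (not defined in the original snippet;
# assumed here to match the OP's Makefile).
PATH_ISLES_TXT = "books/isles.txt"
PATH_ISLES_DAT = "isles.dat"
PATH_ABYSS_TXT = "books/abyss.txt"
PATH_ABYSS_DAT = "abyss.dat"

def task_isles():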
    return {
        "actions": [f"python countwords.py {PATH_ISLES_TXT} {PATH_ISLES_DAT}"],
        "file_dep": [PATH_ISLES_TXT],
        "targets": [PATH_ISLES_DAT],
    }

def task_abyss():
    return {
        "actions": [f"python countwords.py {PATH_ABYSS_TXT} {PATH_ABYSS_DAT}"],
        "file_dep": [PATH_ABYSS_TXT],
        "targets": [PATH_ABYSS_DAT],
    }

def task_dats():
    return {
        "task_dep": ["isles", "abyss"],
        "actions": None,
    }

def task_clean_all():
    return {
        "actions": ["rm -f *.dat"],
    }
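
Because the configuration is plain Python, the two near-identical tasks above could also be generated in a loop. The following is a sketch only, using doit's sub-task mechanism (a task function that yields dicts carrying a "name" key); the `BOOKS` list and the task name `count` are made up for this example, and the paths are assumed to follow the OP's Makefile.

```python
# dodo.py (alternative sketch): one sub-task per book instead of one function each.

BOOKS = ["isles", "abyss"]  # hypothetical list; extend it as the analysis grows


def task_count():
    """Yield one sub-task per book, addressable as count:isles and count:abyss."""
    for book in BOOKS:
        source = f"books/{book}.txt"
        target = f"{book}.dat"
        yield {
            "name": book,
            "actions": [f"python countwords.py {source} {target}"],
            "file_dep": [source],
            "targets": [target],
        }
```

With this layout, `doit count` should run every sub-task whose dependencies changed, and `doit -n 2` should run up to two of them in parallel, which covers the optional parallelization mentioned in the question.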

Alessandro answered 3/1, 2024 at 14:59
