Symlink (auto-generated) directories via Snakemake

I am trying to create a symlink-directory structure for aliasing output directories in a Snakemake workflow.

Let's consider the following example:

A long time ago in a galaxy far, far away, somebody wanted to find the best ice cream flavour in the universe and conducted a survey. Our example workflow aims at representing the votes by a directory structure. The survey was conducted in English (because that's what they all speak in that foreign galaxy), but the results should be understood by non-English speakers as well. Symbolic links come to the rescue.

To make the input parsable for us humans as well as Snakemake, we put it into a YAML file:

cat config.yaml
flavours:
  chocolate:
    - vader
    - luke
    - han
  vanilla:
    - yoda
    - leia
  berry:
    - windu
translations:
  french:
    chocolat: chocolate
    vanille: vanilla
    baie: berry
  german:
    schokolade: chocolate
    vanille: vanilla
    beere: berry

To create the corresponding directory tree, I started with this simple Snakefile:

### Setup ###

configfile: "config.yaml"


### Targets ###

votes = ["english/" + flavour + "/" + voter
         for flavour, voters in config["flavours"].items()
         for voter in voters]

translations = {language + "_translation/" + translation
                for language, translations in config["translations"].items()
                for translation in translations.keys()}


### Commands ###

create_file_cmd = "touch '{output}'"

relative_symlink_cmd = "ln --symbolic --relative '{input}' '{output}'"


### Rules ###

rule all:
    input: votes, translations

rule english:
    output: "english/{flavour}/{voter}"
    shell: create_file_cmd

rule translation:
    input: lambda wc: "english/" + config["translations"][wc.lang][wc.trans]
    output: "{lang}_translation/{trans}"
    shell: relative_symlink_cmd

I am sure there are more 'pythonic' ways to achieve what I wanted, but this is just a quick example to illustrate my problem.
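For reference, the target collections can be evaluated outside Snakemake. This is a minimal sketch in plain Python, assuming the configuration is available as an inline dict that mirrors config.yaml (inside the Snakefile, `configfile:` provides it). `translation_input` is a hypothetical name for the lambda used in rule translation. It shows that every translation input is a directory path, not a file:

```python
# Inline copy of config.yaml; in the Snakefile this dict comes from `configfile:`.
config = {
    "flavours": {
        "chocolate": ["vader", "luke", "han"],
        "vanilla": ["yoda", "leia"],
        "berry": ["windu"],
    },
    "translations": {
        "french": {"chocolat": "chocolate", "vanille": "vanilla", "baie": "berry"},
        "german": {"schokolade": "chocolate", "vanille": "vanilla", "beere": "berry"},
    },
}

# Same comprehensions as in the Snakefile, written with f-strings.
votes = [f"english/{flavour}/{voter}"
         for flavour, voters in config["flavours"].items()
         for voter in voters]

translations = {f"{language}_translation/{translation}"
                for language, mapping in config["translations"].items()
                for translation in mapping}


def translation_input(lang, trans):
    """Hypothetical stand-in for the input lambda of rule translation."""
    return "english/" + config["translations"][lang][trans]


print(votes)
print(sorted(translations))
print(translation_input("french", "vanille"))
```

The last line prints a path with only two components, which is the directory that the error message below complains about.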

Running the above workflow with snakemake, I get the following error:

Building DAG of jobs...
MissingInputException in line 33 of /tmp/snakemake.test/Snakefile
Missing input files for rule translation:
english/vanilla

So while Snakemake is clever enough to create the english/<flavour> directories when attempting to make an english/<flavour>/<voter> file, it seems to 'forget' about the existence of this directory when using it as an input to make a <language>_translation/<flavour> symlink.
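One way to picture why english/vanilla is reported as missing: no rule's output pattern matches that path, so Snakemake has no rule to produce it. The sketch below emulates the matching in plain Python, assuming each wildcard behaves like the regular expression `.+` (the default described in the Snakemake documentation). `pattern_to_regex` and `match_output` are hypothetical helpers, not part of Snakemake's API:

```python
import re


def pattern_to_regex(pattern):
    """Translate 'english/{flavour}/{voter}' into a regex (simplified emulation)."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))    # literal text
        else:
            pieces.append(f"(?P<{part}>.+)")  # wildcard, assumed to match '.+'
    return re.compile("".join(pieces))


def match_output(pattern, path):
    """Return the wildcard values if the whole path matches, else None."""
    found = pattern_to_regex(pattern).fullmatch(path)
    return found.groupdict() if found else None


print(match_output("english/{flavour}/{voter}", "english/vanilla/yoda"))
print(match_output("english/{flavour}/{voter}", "english/vanilla"))
```

The second call returns None because the pattern requires a second path separator, so the directory is never a declared output of rule english, even though running the rule creates it as a side effect.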

As an intermediate step, I applied the following patch to the Snakefile:

27c27
<     input: votes, translations
---
>     input: votes#, translations

Now, the workflow ran through and created the english directory as expected (snakemake -q output only):

Job counts:
        count   jobs
        1       all
        6       english
        7

Now with the target directories created, I went back to the initial version of the Snakefile and re-ran it:

Job counts:
        count   jobs
        1       all
        6       translation
        7
ImproperOutputException in line 33 of /tmp/snakemake.test/Snakefile
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule translation:
french_translation/chocolat
Exiting because a job execution failed. Look above for error message

While I am not sure if a symlink to a directory qualifies as a directory, I went ahead and applied a new patch to follow the suggestion:

35c35
<     output: "{lang}_translation/{trans}"
---
>     output: directory("{lang}_translation/{trans}")
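Whether a symlink to a directory counts as a directory depends on whether the check follows links. The following sketch uses only Python's standard library on a POSIX filesystem; it shows the operating-system view and makes no claim about how Snakemake performs its own check:

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "english", "chocolate")
    link = os.path.join(tmp, "french_translation", "chocolat")
    os.makedirs(target)
    os.makedirs(os.path.dirname(link))
    # Relative link, as `ln --symbolic --relative` would create it.
    os.symlink("../english/chocolate", link)

    is_link = os.path.islink(link)
    # os.path.isdir follows symlinks, so the link looks like a directory.
    dir_when_following = os.path.isdir(link)
    # os.lstat does not follow symlinks, so the link is not a directory here.
    dir_without_following = stat.S_ISDIR(os.lstat(link).st_mode)

print(is_link, dir_when_following, dir_without_following)
```

Any check that follows links will therefore see french_translation/chocolat as a directory, which is consistent with the ImproperOutputException above.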

With that, snakemake finally created the symlinks:

Job counts:
        count   jobs
        1       all
        6       translation
        7

As a confirmation, here is the resulting directory structure:

english
├── berry
│   └── windu
├── chocolate
│   ├── han
│   ├── luke
│   └── vader
└── vanilla
    ├── leia
    └── yoda
french_translation
├── baie -> ../english/berry
├── chocolat -> ../english/chocolate
└── vanille -> ../english/vanilla
german_translation
├── beere -> ../english/berry
├── schokolade -> ../english/chocolate
└── vanille -> ../english/vanilla

9 directories, 6 files

However, besides not being able to create this structure without running snakemake twice (and modifying the targets in between), even simply re-running the workflow results in an error:

Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
/tmp/snakemake.test/english/berry
/tmp/snakemake.test/english/berry/windu

running the translation rules again for no (good) reason:

Job counts:
        count   jobs
        1       all
        5       translation
        6

So my question is: How can I implement the above logic in a working Snakefile?

Note that I am not looking for advice to change the data representation in the YAML file and/or the Snakefile. This is just an example to highlight (and isolate) an issue I encountered in a more complex scenario.

Sadly, while I could not figure this out by myself so far, I managed to get a working GNU make version (even though the 'YAML parsing' is hackish at best):

### Setup ###

configfile := config.yaml


### Targets ###

votes := $(shell awk ' \
  NR == 1 { next } \
  /^[^ ]/ { exit } \
  NF == 1 { sub(":", "", $$1); dir = "english/" $$1 "/"; next } \
  { print dir $$2 } \
  ' '$(configfile)')

translations := $(shell awk ' \
  NR == 1 { next } \
  /^[^ ]/ { trans = 1; next } \
  ! trans { next } \
  { sub(":", "", $$1) } \
  NF == 1 { dir = $$1 "_translation/"; next } \
  { print dir $$1 } \
  ' '$(configfile)')


### Commands ###

create_file_cmd = touch '$@'

create_dir_cmd = mkdir --parent '$@'

relative_symlink_cmd = ln --symbolic --relative '$<' '$@'


### Rules ###

all : $(votes) $(translations)

$(sort $(dir $(votes) $(translations))) : % :
	$(create_dir_cmd)
$(foreach vote, $(votes), $(eval $(vote) : | $(dir $(vote))))
$(votes) : % :
	$(create_file_cmd)

translation_targets := $(shell awk ' \
  NR == 1 { next } \
  /^[^ ]/ { trans = 1; next } \
  ! trans { next } \
  NF != 1 { print "english/" $$2 "/"} \
  ' '$(configfile)')
define translation
$(word $(1), $(translations)) : $(word $(1), $(translation_targets)) | $(dir $(word $(1), $(translations)))
	$$(relative_symlink_cmd)
endef
$(foreach i, $(shell seq 1 $(words $(translations))), $(eval $(call translation, $(i))))

Running make on this works just fine:

mkdir --parent 'english/chocolate/'
touch 'english/chocolate/vader'
touch 'english/chocolate/luke'
touch 'english/chocolate/han'
mkdir --parent 'english/vanilla/'
touch 'english/vanilla/yoda'
touch 'english/vanilla/leia'
mkdir --parent 'english/berry/'
touch 'english/berry/windu'
mkdir --parent 'french_translation/'
ln --symbolic --relative 'english/chocolate/' 'french_translation/chocolat'
ln --symbolic --relative 'english/vanilla/' 'french_translation/vanille'
ln --symbolic --relative 'english/berry/' 'french_translation/baie'
mkdir --parent 'german_translation/'
ln --symbolic --relative 'english/chocolate/' 'german_translation/schokolade'
ln --symbolic --relative 'english/vanilla/' 'german_translation/vanille'
ln --symbolic --relative 'english/berry/' 'german_translation/beere'

The resulting tree is identical to the one shown above.

Also, running make again works as well:

make: Nothing to be done for 'all'.

So I really hope the solution is not to go back to old-fashioned GNU make with all the unreadable hacks I internalized over the years but that there is a way to convince Snakemake as well to do what I spelled out to do. ;-)

Just in case it is relevant: This was tested using Snakemake version 5.7.132.2.


Platt answered 9/7, 2020 at 15:53 Comment(10)
You should remove the makefile tag since this isn't a make / makefile question. Just to note, the way to fix your "more than once" error is to introduce a $(sort ...) which as a side-effect also uniquifies. - Nel
@MadScientist: Well I did not use the GNU make tag as I consider snakemake just another variation of make (which can be argued, I understand). Regarding the "more than once" error: I know (I even wrote it would be fixable), I just did not bother for the sake of this example. But thank you for reminding me that $(sort ...) has this side-effect that would make this less convoluted than my usual macro preserving the order. So I guess there was an advantage for the community getting your attention via the 'wrong' tag after all. Thank you for your feedback. - Platt
I personally consider makefile to be a tag for POSIX-derived makefiles, which snakemake is not... but I do not own SO and opinions vary :). Glad I could help; cheers! - Nel
I don't know snakemake but do I conclude correctly, that your translation task doesn't really consult the filesystem but creates the symlinks from the YAML structure? Is this task somehow depending on the 'english' task programmatically in a way I don't see? - Diatomite
snakemake.readthedocs.io/en/stable/project_info/… suggests using the -r flag on the ln command. That's the only difference I can see between your example and theirs. It also notes that a symlink is treated as a file by snakemake. - Workingwoman
@Nick: That is a neat feature to avoid the $(realpath [...]) hack. How could I miss this all those years. Unfortunately it doesn't change what is the problem here. - Platt
@Vroomfondel: Not sure I got your question correctly but i) the symlink command does not check for the existence of the target, but ii) the input (lambda) function used in the Snakefile for the corresponding rule, would make Snakemake fail if the target would a) not exist, and b) no other rule would describe how to generate it (which in turn would be run first). So to make french-translation/vanille, english/vanilla must exist or the english rule would be used to create it (as a side effect of generating any of the english/vanilla/<voter> files). Does that answer your question? - Platt
Have you tried with a newer Snakemake version? The cases triggering ChildIOException may have changed recently. - Assamese
@bli: Unfortunately I do not have root access on the machine I am running this on and the OS on my personal machine (Gentoo) doesn't provide a new version either. I am using GNU Guix to get reproducible environments across systems without root so if I knew a version that worked, I could try to upstream a patch to bump the version. Does my Snakefile work for you on a recent version? If so, which version are you on? - Platt
@Platt I tried something with 5.20.1, which will run a first time and generate all desired output in one go, but still fails if a second run is attempted (see my answer). - Assamese
A
1

I wanted to test with a newer version of Snakemake (5.20.1), and I came up with something similar to the answer proposed by Manalavan Gajapathy:

### Setup ###

configfile: "config.yaml"

VOTERS = list({voter for flavour in config["flavours"].keys() for voter in config["flavours"][flavour]})

### Targets ###

votes = ["english/" + flavour + "/" + voter
         for flavour, voters in config["flavours"].items()
         for voter in voters]

translations = {language + "_translation/" + translation
                for language, translations in config["translations"].items()
                for translation in translations.keys()}


### Commands ###

create_file_cmd = "touch '{output}'"

relative_symlink_cmd = "ln --symbolic --relative $(dirname '{input}') '{output}'"


### Rules ###

rule all:
    input: votes, translations

rule english:
    output: "english/{flavour}/{voter}"
    # To avoid considering ".done" as a voter
    wildcard_constraints:
        voter="|".join(VOTERS),
    shell: create_file_cmd

def get_voters(wildcards):
    return [f"english/{wildcards.flavour}/{voter}" for voter in config["flavours"][wildcards.flavour]]

rule flavour:
    input: get_voters
    output: "english/{flavour}/.done"
    shell: create_file_cmd

rule translation:
    input: lambda wc: "english/" + config["translations"][wc.lang][wc.trans] + "/.done"
    output: directory("{lang}_translation/{trans}")
    shell: relative_symlink_cmd

This runs and creates the desired output, but fails with ChildIOException when re-run (even if there would be nothing more to be done).
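The wildcard_constraints entry is what keeps rule english from claiming english/<flavour>/.done as one of its outputs. A simplified emulation in plain Python, assuming the constraint simply replaces the default `.+` for the voter wildcard (the voter names are copied from config.yaml and sorted here only to make the printed regex deterministic):

```python
import re

flavours = {
    "chocolate": ["vader", "luke", "han"],
    "vanilla": ["yoda", "leia"],
    "berry": ["windu"],
}

# Same idea as VOTERS in the Snakefile, sorted for a stable result.
VOTERS = sorted({voter for voters in flavours.values() for voter in voters})
constraint = "|".join(VOTERS)

# Emulated output pattern of rule english without and with the constraint.
unconstrained = re.compile(r"english/(?P<flavour>.+)/(?P<voter>.+)")
constrained = re.compile(rf"english/(?P<flavour>.+)/(?P<voter>{constraint})")

sentinel = "english/vanilla/.done"
print(constraint)
print(unconstrained.fullmatch(sentinel) is not None)  # rule english would also match
print(constrained.fullmatch(sentinel) is not None)    # only rule flavour is left
print(constrained.fullmatch("english/vanilla/yoda") is not None)
```

Without the constraint, both rule english and rule flavour could produce the .done file, which is the ambiguity the comment in the Snakefile refers to.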

Assamese answered 31/7, 2020 at 10:13 Comment(3)
Thank you for the answer. However, I do not really see how this solves the underlying issue of a false-positive circular dependency. It is nice to have an alternative to handle the directory-input issue (though I do not like cluttering the output with hidden files, but I guess one could easily hide those files in the .snakemake directory) but the core issue persists. - Platt
I agree the .done file is an ugly hack. This was merely to report what happens with a newer Snakemake. - Assamese
Besides cluttering the english/<flavour> directories with .done files, this does work with Snakemake v. 5.32.2. Since even my 'solution' would already introduce .snakemake_timestamp files in the same directories and I never got any response (at all) to my feature request, I'll go ahead and accept this as an answer. Thanks again. - Platt
A
1

Here is a way to solve your first question (i.e., have Snakemake run only once to get all desired outputs). I use the output files of rule english as input to rule translation and modified the latter rule's shell command to reflect that. In my experience, using directories as input doesn't work well with Snakemake, and if I remember correctly, the directory() flag on an input gets ignored.

Relevant code changes:

relative_symlink_cmd = """ln -s \
        "$(realpath --relative-to="$(dirname '{output}')" "$(dirname {input[0]})")" \
        '{output}'"""

rule translation:
    input: lambda wc: ["english/" + config["translations"][wc.lang][wc.trans] + "/" + voter for voter in config['flavours'][config["translations"][wc.lang][wc.trans]]]
    output: directory("{lang}_translation/{trans}")
    shell: relative_symlink_cmd
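The nested realpath/dirname call computes the link target relative to the directory that will contain the link. The same computation can be sketched in Python with os.path.relpath, which works purely on path strings (unlike realpath, it does not resolve existing symlinks). `relative_link_target` is a hypothetical helper name, and POSIX path separators are assumed:

```python
import os.path


def relative_link_target(first_input, output):
    """Path of the input's directory, relative to the directory holding the link."""
    source_dir = os.path.dirname(first_input)  # e.g. english/chocolate
    link_dir = os.path.dirname(output)         # e.g. french_translation
    return os.path.relpath(source_dir, start=link_dir)


print(relative_link_target("english/chocolate/vader", "french_translation/chocolat"))
```

On a POSIX system this prints ../english/chocolate, matching the link targets in the directory tree shown in the question.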

Your second question is tricky because when you run Snakemake again, it resolves the symlinks to their corresponding sources, and this leads to the ChildIOException error. This can be verified by changing relative_symlink_cmd to create real directories instead of symlinks, as shown below. In that case, Snakemake works as expected.

relative_symlink_cmd = """mkdir -p '{output}'"""

I'm not sure how to get around that.
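The resolution effect described above can be reproduced outside Snakemake. The sketch below is not Snakemake's actual check; it only shows, assuming a POSIX filesystem, that resolving the symlink output yields a path that is a parent of another rule's output:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)  # the temp dir itself may sit behind a symlink
    vote = os.path.join(tmp, "english", "berry", "windu")
    link = os.path.join(tmp, "french_translation", "baie")
    os.makedirs(os.path.dirname(vote))
    open(vote, "w").close()                 # stands in for rule english
    os.makedirs(os.path.dirname(link))
    os.symlink("../english/berry", link)    # stands in for rule translation

    # Resolving the link gives the real directory it points to.
    resolved = os.path.realpath(link)
    # The resolved output is a parent directory of the vote file.
    is_parent = os.path.commonpath([resolved, vote]) == resolved

print(os.path.relpath(resolved, tmp))
print(is_parent)
```

After resolution, the output of rule translation is english/berry, which contains english/berry/windu, the pair of paths listed in the ChildIOException message.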

Arrowy answered 13/7, 2020 at 1:38 Comment(2)
Yeah, I also had a similar idea already: Basically, I used output: "{lang}_translation/{trans}/{voter}" and the corresponding input via an appropriate input: lambda wc: [...] and modified rule_symlink to act on the directories. This would come with the extra issue of potentially creating the same symlink when running the rule in parallel, so your solution is a step forward but it still does not result in a working Snakefile as you point out yourself. So it seems not only is Snakemake bad with directories, but also with symlinks. Even basic cp has -P to handle such things. ;-) - Platt
I tried this as well with Snakemake v. 5.32.2 but got an error: ln: target 'german_translation/beere': No such file or directory. Did not try debugging this further since the other answer works. Leaving the upvote though since your answer did help understanding the problem and leading to a working solution. - Platt