I am running a snakemake pipeline on an HPC that uses slurm. The pipeline is rather long, consisting of ~22 steps. Periodically, snakemake encounters a problem when attempting to submit a job, which results in the error
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
Error submitting jobscript (exit code 1):
I run the pipeline via an sbatch file with the following snakemake call:
snakemake -j 999 -p --cluster-config cluster.json --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'
This produces an output file not only for the snakemake sbatch job itself, but also for each of the jobs that snakemake creates. The above error appears in the slurm.out for the sbatch file.
The specific job step the error points to actually runs successfully and produces its output, but the pipeline as a whole still fails. The logs for that job step show that the job ID ran without a problem. I have googled this error, and it appears to be common with slurm, especially when the scheduler is under heavy IO load, which suggests it will be a regular and unavoidable occurrence. I was hoping someone has encountered this problem and could offer suggestions for a workaround, so that the entire pipeline doesn't fail.
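The only workaround I have come up with myself, and I have not tested it, is to point --cluster at a small wrapper script that retries sbatch a few times before giving up, roughly like the sketch below. The script name sbatch_retry.sh, the retry count, and the sleep interval are all just placeholders I made up.

#!/bin/bash
# sbatch_retry.sh - hypothetical wrapper: retry sbatch a few times so a
# transient "Socket timed out" error doesn't kill the whole pipeline.
# The retry count and sleep interval below are arbitrary guesses.
for attempt in 1 2 3 4 5; do
    if output=$(sbatch "$@" 2>&1); then
        echo "$output"    # pass sbatch's normal output back through
        exit 0
    fi
    echo "sbatch attempt $attempt failed: $output" >&2
    sleep 30
done
exit 1

The idea would be to call snakemake with --cluster './sbatch_retry.sh --account {cluster.account} ...' instead of sbatch directly. I'm not sure this is safe, though: since the failing step apparently does run despite the error, the submission may have gone through before the timeout, and a blind retry could then submit the same rule twice.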
With -j 999, perhaps snakemake is trying to submit too many jobs for the capacity of the cluster? – Arty