sbatch: error: Batch job submission failed: Socket timed out on send/recv operation when running Snakemake
I am running a snakemake pipeline on an HPC that uses slurm. The pipeline is rather long, consisting of ~22 steps. Periodically, snakemake will encounter a problem when attempting to submit a job. This results in the error

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
Error submitting jobscript (exit code 1):

I run the pipeline via an sbatch file with the following snakemake call

snakemake -j 999 -p --cluster-config cluster.json --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}' 

This produces output not only for the snakemake sbatch job itself, but also for each job that snakemake creates. The above error appears in the slurm.out for the sbatch file.

The specific job step the error points to actually runs successfully and produces output, but the pipeline still fails; the logs for that job ID show it ran without a problem. I have googled this error, and it appears to happen often with slurm, especially when the scheduler is under high IO load, which suggests it will be an inevitable and regular occurrence. I was hoping someone has encountered this problem and could offer suggestions for a workaround, so that a single timed-out submission doesn't fail the entire pipeline.

Mihe answered 23/10, 2019 at 16:32 Comment(1)
Maybe with -j 999 snakemake is trying to submit too many jobs for the capacity of the cluster?Arty

snakemake has the options --max-jobs-per-second and --max-status-checks-per-second, each with a default of 10. Maybe try decreasing them to reduce strain on the scheduler? Also, maybe try reducing -j 999?
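For example, the call from the question could be adjusted like this (5 is the value the asker later reported as working; lowering -j to 100 is just an arbitrary illustration, pick whatever suits your cluster):

```shell
snakemake -j 100 -p \
    --max-jobs-per-second 5 \
    --max-status-checks-per-second 5 \
    --cluster-config cluster.json \
    --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} ...'
```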

Bishop answered 24/10, 2019 at 6:35 Comment(3)
Adding --max-jobs-per-second and --max-status-checks-per-second and reducing them to 5 fixed the problem. Thanks for the help!Mihe
I have a similar problem, but --max-jobs-per-second and --max-status-checks-per-second haven't fixed the issue...Kassie
Would the --retries flag be better? Or catch this error? snakemake.readthedocs.io/en/stable/snakefiles/rules.htmlKassie
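Along the lines of the last comment, another workaround is to put a retry loop between Snakemake and sbatch, so a transient socket timeout is retried rather than failing the pipeline. A minimal sketch, assuming a hypothetical helper (the function name retry_submit and the backoff values are my own, not from Snakemake or Slurm):

```shell
# retry_submit: run a submission command, retrying with a linear backoff
# if it exits nonzero. On success, print the command's output (so Snakemake
# can still read the sbatch job ID); on final failure, forward the error.
# Usage: retry_submit <max_attempts> <command...>
retry_submit() {
    local max_attempts=$1; shift
    local attempt=1 output status
    while true; do
        output=$("$@" 2>&1)
        status=$?
        if [ "$status" -eq 0 ]; then
            printf '%s\n' "$output"
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            printf '%s\n' "$output" >&2
            return "$status"
        fi
        sleep "$attempt"          # wait longer after each failed attempt
        attempt=$((attempt + 1))
    done
}
```

This could live in a small wrapper script that calls, e.g., `retry_submit 5 sbatch "$@"`, with the wrapper passed to --cluster in place of bare sbatch.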

© 2022 - 2024 — McMap. All rights reserved.