Why do I keep getting NonZeroExitCode when using sbatch SLURM?

I have a simple test.ksh that I am running with the command:

sbatch test.ksh

I keep getting "JobState=FAILED Reason=NonZeroExitCode" (using "scontrol show job")

I have already made sure of the following:

  1. slurmd and slurmctld are up and running correctly.
  2. The permissions on "test.ksh" are 777.
  3. The command "srun test.ksh" (by itself, without using sbatch) succeeds without problems.
  4. I tried putting "return 0" on the last line of "test.ksh", without luck.
  5. I tried putting "exit 0" on the last line of "test.ksh", without luck.
  6. I tried putting "hostname" on the last line of "test.ksh", without luck.
  7. I tried putting "srun hostname" on the last line of "test.ksh", without luck.
Holiness asked 22/1, 2015 at 16:29

I found out that I hadn't set --error and --output, so sbatch fell back to its default and wrote the job's output file to the directory from which I was issuing the command.

The problem was that I didn't have sufficient privileges to write to the current directory.

The solution was to point --error and --output at files in a directory where I had write permission.
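As a rough sketch of that fix: the snippet below checks that a log directory is writable before submitting, then passes explicit --output and --error paths to sbatch. The check_log_dir helper and the LOG_DIR location are assumptions made up for illustration, not part of SLURM; the directory should also be reachable from the compute nodes, since the files are written there.

```shell
#!/bin/sh
# check_log_dir DIR: succeed only if DIR exists (it is created if needed)
# and the current user can write to it. Hypothetical helper.
check_log_dir() {
    mkdir -p "$1" 2>/dev/null && [ -w "$1" ]
}

# Hypothetical log location (an assumption); use any directory you own.
LOG_DIR="${LOG_DIR:-${HOME:-/tmp}/slurm-logs}"

if check_log_dir "$LOG_DIR"; then
    if command -v sbatch >/dev/null 2>&1; then
        # %j in the file name is replaced by the job ID.
        sbatch --output="$LOG_DIR/test-%j.out" \
               --error="$LOG_DIR/test-%j.err" \
               test.ksh || echo "sbatch reported an error" >&2
    else
        echo "sbatch not found on this machine; nothing submitted"
    fi
else
    echo "cannot write to $LOG_DIR; choose another directory" >&2
fi
```

The same two options can instead be written as "#SBATCH --output=..." and "#SBATCH --error=..." lines at the top of test.ksh.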

Holiness answered 22/1, 2015 at 17:7

In my case it was because the folder was owned by root while I was submitting jobs as a different user. I had mistakenly created the folder as root inside that user's home folder. Running chown user:usergroup foldername fixed the problem.
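A quick way to spot this situation, sketched under the assumption that the job's output goes to the directory you submit from: the owned_by_me helper and the JOB_DIR variable are hypothetical names for illustration only.

```shell
#!/bin/sh
# owned_by_me DIR: succeed if DIR exists and is owned by the current user.
# Hypothetical helper.
owned_by_me() {
    [ -d "$1" ] && [ -O "$1" ]
}

# Directory to check (an assumption); defaults to the current directory.
JOB_DIR="${JOB_DIR:-$PWD}"

if owned_by_me "$JOB_DIR"; then
    echo "$JOB_DIR is owned by $(id -un)"
else
    # Show the actual owner, then print the fix to run as root.
    ls -ld "$JOB_DIR"
    echo "to fix, run as root: chown $(id -un):$(id -gn) $JOB_DIR"
fi
```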

Jaquez answered 15/4, 2023 at 23:45

Sometimes the issue is due to missing folders.

You can check where the job's output files are written by running scontrol show job <jobid> and looking at the StdOut and StdErr fields.

In my case the slurm folder was missing.

Resolve it by creating the missing folder(s).
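The check above could be scripted roughly as follows. This is only a sketch: missing_log_dirs is a hypothetical helper, and it assumes the scontrol output contains lines of the form StdOut=/some/path and StdErr=/some/path.

```shell
#!/bin/sh
# missing_log_dirs: read `scontrol show job` output on stdin and print the
# parent directory of every StdOut/StdErr path whose directory is missing.
# Hypothetical helper.
missing_log_dirs() {
    grep -E '^[[:space:]]*Std(Out|Err)=' | cut -d= -f2- |
    while IFS= read -r path; do
        dir=$(dirname "$path")
        [ -d "$dir" ] || echo "$dir"
    done | sort -u
}

# Usage (JOBID is a placeholder for your job's ID):
#   scontrol show job "$JOBID" | missing_log_dirs | while IFS= read -r d; do
#       mkdir -p "$d"
#   done
```

After creating the folders, resubmit the job; the creating user also needs write permission on them.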

Gasper answered 22/4 at 18:54
