Debugging job errors
September 14 2018
Job submission is a necessary part of using a cluster. To learn more about job submission and how to write a job script, go here. In this
Note that while the commands and emails shown below are for the PBS TORQUE job scheduler, the debugging steps are broadly the same for other job schedulers, such as SLURM on Bridges.
Ideally, include the flags below in your job submission script so that you are notified by email when the job begins, aborts or ends. That email is your first clue.
#PBS -m abe #mail me when job: a - aborts, b - begins, e - ends
#PBS -M <your email>
The first clue – check your email
An email similar to the one below will be sent to the address you entered.
1. First, take a look at the exit code; exit_status=0 is an indicator that nothing went wrong. Notice I don’t actually say success, because there are always a few cases where the exit code is 0 but the job actually failed. So exit_status is just the first checkpoint.
Some common exit codes from the TORQUE resource manager (more information available here).
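As a rough aid for this first checkpoint, here is a small shell sketch that turns an exit_status number into a hint. The function name describe_exit_status is made up for this post, and the meanings of the negative codes are my reading of the TORQUE job exit status table, so treat it as a starting point and check your site's documentation.

```shell
# Hypothetical helper: translate a TORQUE exit_status into a short hint.
# Negative codes are assumed to come from the scheduler, positive codes
# from the job script itself.
describe_exit_status() {
    local code="$1"
    case "$code" in
        ''|-|*[!0-9-]*|?*-*) echo "not a number: '$code'"; return 1 ;;
        0)        echo "nothing obviously wrong, but still check the logs" ;;
        -1|-2)    echo "job execution failed (no retry)" ;;
        -3)       echo "job execution failed (will be retried)" ;;
        -4|-5|-6) echo "job aborted while the compute node was initializing" ;;
        -7)       echo "job restart failed" ;;
        -8)       echo "the job's command could not be executed" ;;
        -9)       echo "could not create or open the stdout/stderr files" ;;
        -10)      echo "job exceeded a memory limit" ;;
        -11)      echo "job exceeded its walltime limit" ;;
        -12)      echo "job exceeded its CPU time limit" ;;
        -*)       echo "negative code not covered by this sketch" ;;
        *)
            if [ "$code" -ge 256 ]; then
                # Assumption: the scheduler reports 256 + signal number.
                echo "killed by signal $((code - 256)) (assuming the 256 + signal convention)"
            elif [ "$code" -gt 128 ]; then
                # Common shell convention: 128 + signal number.
                echo "killed by signal $((code - 128)) (assuming the 128 + signal convention)"
            else
                echo "exit code returned by the job script or program"
            fi ;;
    esac
}
```

For example, `describe_exit_status -11` prints the walltime hint. A positive code usually comes from the program itself, so look it up together with the program name.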
2. Next, take a look at the “resources used” lines. If the job was successful, it should have used some amount of resources. Now take a look at the email here; what do you think?
The job actually used 0kb of memory, which is another indication that the job likely failed.
Also, notice the
To sum up, job number “340249”, the one we are looking into in this post, likely failed. This is a start, but where can we find more information, for example a better description of why it failed?
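The two email checks can be sketched as a small shell helper. This assumes you have saved the body of the notification email to a text file, and that it contains lines of the form `Exit_status=...` and `resources_used.mem=...`; that layout is an assumption, and the function names email_field and triage_job_email are made up for illustration.

```shell
# Print the value of the first KEY=VALUE line for the given key.
# $1 = key (for example Exit_status), $2 = file holding the email body.
email_field() {
    sed -n "s/^[[:space:]]*$1=//p" "$2" | head -n 1 | tr -d '[:space:]'
}

# Apply the two checks from the post: the exit status, then the memory used.
triage_job_email() {
    local file="$1" status mem
    status=$(email_field Exit_status "$file")
    mem=$(email_field resources_used.mem "$file")
    if [ "$status" != "0" ]; then
        echo "FAILED: exit_status=${status:-missing}"
    elif [ "$mem" = "0kb" ]; then
        echo "SUSPECT: exit_status=0 but no memory was used"
    else
        echo "LOOKS OK: exit_status=0, mem=${mem:-unknown}"
    fi
}
```

Running `triage_job_email saved_email.txt` (a hypothetical file name) on an email with exit status 0 and 0kb of memory reports the job as suspect. Even a "LOOKS OK" result only means the first checkpoint passed, so the job logs remain the real source of truth.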
Taking a look at job logs
In most cases, the answer to the question “why did the job fail” is written in the job log files. If you are not sure what the job log files are, let me explain briefly.
You may have noticed that every time you submit a job, two files are generated, named <job name>.e<job number> and <job name>.o<job number>. The .e file holds the job's standard error and the .o file holds its standard output. The file paths to these log files are also written in the email.
Log back into the machine, and go to the file path mentioned in the email -
Note: this is not the exact directory from the email, nor the same tool or job ID; the change was made so we can work through some errors and understand how to debug them.
quast.log- the log from
Take a look at the error log first, here is a picture of the
Before you continue reading, what do you think caused this error?
The job failed because it could not open the input file “*.fastq”. The giveaway is the phrase “no such file or directory” in the error log.
Now go back to the job script, check the file name and make sure the input file is entered correctly with the correct path.
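One way to catch this kind of mistake early is to check the input files at the top of the job script, before the real program starts. The sketch below is only an illustration: check_inputs is a made-up function name. It relies on the fact that, by default, the shell passes a wildcard such as *.fastq through unchanged when it matches nothing, so the literal pattern shows up as a missing file.

```shell
# Report every argument that is not an existing, readable regular file.
# Returns non-zero if at least one input is missing.
check_inputs() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In a job script this could be used as `check_inputs /path/to/reads/*.fastq || exit 1`, where the path is a placeholder for your own input directory. The job then stops immediately with a clear message in the .e file.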
Here is another exercise: judging from the job error log below, what do you think went wrong with this job?
Walltime limit exceeded! So this job needed a more
It’s more complicated than that
The examples above were straightforward and easy to debug. There will be cases (most of the time) when you are not sure what the error is, or what the error log is even saying. In those cases, try these quick steps:
- Search for the word “Error”, read that line and see if you can figure it out. If you are not sure, try a quick Google search for the text on that line together with the program name.
- Look up the exit_status number (from your email, and only if it’s not 0) together with the program name on Google.
- Still not sure – send the HPC support team an email, or email us at email@example.com.
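The first step in the list above can be done from the command line with grep. This is a minimal sketch: search_logs is a made-up wrapper name, and the log file names in the usage note assume a job called myjob.

```shell
# Print every line matching a pattern (ignoring case), prefixed with
# the file name and line number so the message is easy to find again.
# $1 = pattern, remaining arguments = log files to search.
search_logs() {
    local pattern="$1"
    shift
    grep -H -n -i -e "$pattern" "$@"
}
```

For example, `search_logs error myjob.e340249 myjob.o340249` lists every line mentioning "error" in both log files. Copying the matching line into a search engine, together with the program name, is often the quickest next step.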
What to include in the email asking for help
- The program you are running, with version
- The compute cluster you are running the program on
- The job script and job logs (both the .e and .o files, and any other outputs you have handy)
- If you can easily share the input files you used, that would be a big help for us too!
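To make the attachments easy to send, the job script and logs can be packed into a single archive. The helper below is a sketch with a made-up name, make_support_bundle; it refuses to run if any listed file is missing, so you notice before sending an incomplete bundle.

```shell
# Pack the given files into one compressed archive for a support email.
# $1 = output archive name, remaining arguments = files to include.
make_support_bundle() {
    local out="$1" f
    shift
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "not found: $f" >&2
            return 1
        fi
    done
    tar -czf "$out" "$@"
}
```

A call might look like `make_support_bundle help_request.tar.gz myjob.pbs myjob.e340249 myjob.o340249`, where the file names are placeholders for your own. Input files are left out on purpose because they can be large; add them to the list only if they are small enough to email.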