Job States
As a job moves from submission to execution to completion, it passes through a variety of states, and at any given moment every job is in exactly one of them. Commands issued from the command line or through the Qube GUI instruct the Supervisor to generate an event that changes the state of the job. This change is called a transition. The description of all possible states and their transitions is called a state machine.
The key to using Qube effectively to manage jobs is understanding how different commands change the state of a job. Normally, submitting a job places it in an initial state called pending. The Supervisor takes over from there and, without any other intervention, places the job in running when it executes, and in either failed or complete when the job finishes.
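The default lifecycle described above can be sketched as a small transition table. This is only an illustrative model, not part of the Qube API: the names `DEFAULT_TRANSITIONS`, `can_transition`, and `is_valid_path` are hypothetical, and the successful end state is written as `complete`, as in the table of states.

```python
# Hypothetical model of the default job lifecycle; not part of the Qube API.
DEFAULT_TRANSITIONS = {
    "pending": {"running"},             # the Supervisor starts the job
    "running": {"complete", "failed"},  # the job finishes one way or the other
    "complete": set(),                  # no further automatic transitions
    "failed": set(),
}


def can_transition(current, target):
    """Return True if `target` directly follows `current` in the default lifecycle."""
    return target in DEFAULT_TRANSITIONS.get(current, set())


def is_valid_path(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))


print(is_valid_path(["pending", "running", "complete"]))  # True
print(is_valid_path(["pending", "complete"]))             # False
```

The sketch covers only what happens without user intervention. Commands such as kill or block add further transitions on top of it.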
Some of the commands have fairly straightforward effects. Killing, suspending, or blocking will change the state of a job to the corresponding state.
Initial Job States
Qube jobs can be submitted in one of two initial states: pending and blocked. Pending is the default and signals the Supervisor that the job may be started at any time. Blocked tells the system to hold the job until it is unblocked. The start state can be set by a user or developer through the job structure in the API, or on the command line:
Example:
% qbsub --state blocked ls
State | Meaning |
---|---|
pending | Default state for submitted jobs. Signals to the Supervisor that the job may be started at any time. Jobs that have been suspended are also marked as pending. |
blocked | Alternate state for submitted jobs. Tells the system to hold the job until it is unblocked by something, usually another job that this one depends on. |
running | Job that is doing work, with no failures. |
failing | Job that has not finished, but has at least one frame or instance that has failed. |
retrying | Jobs that have retry counts greater than zero, and have been retried (automatically) at least once, are marked as retrying. |
killed | Job that has been killed by a user. Killed jobs must be manually retried or resubmitted. |
complete | Job is no longer running, and all frames have succeeded. |
failed | Job is no longer running, and at least one frame or instance has failed. |
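
The four states that describe work in progress or finished work (running, failing, complete, failed) differ only in whether the job has finished and whether any frame or instance has failed. The sketch below captures that distinction. The function name `classify` and its two inputs are hypothetical and do not correspond to anything in the Qube API.

```python
def classify(finished: bool, failed_frames: int) -> str:
    """Map a job's progress onto one of four states from the table.

    finished      -- True once the job is no longer running (hypothetical input)
    failed_frames -- number of frames or instances that have failed
    """
    if failed_frames < 0:
        raise ValueError("failed_frames cannot be negative")
    if finished:
        # No longer running: any failure marks the whole job as failed.
        return "failed" if failed_frames else "complete"
    # Still working: a failure so far marks the job as failing.
    return "failing" if failed_frames else "running"


print(classify(finished=False, failed_frames=0))  # running
print(classify(finished=False, failed_frames=2))  # failing
print(classify(finished=True, failed_frames=0))   # complete
print(classify(finished=True, failed_frames=2))   # failed
```

The remaining states (pending, blocked, retrying, killed) depend on user or Supervisor actions rather than on frame results, so they are left out of this sketch.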
Actions
A job's state changes as a result of actions taken by users or by the Supervisor.
Action | Meaning |
---|---|
block | Hold a job or instance so that it is not started until it is unblocked. Typically done by users, but auto-wrangling will also block instances and jobs. |
interrupt | Kill the current frame and put the job into a pending state, where it can be picked up and rerun. |
kill | End the current frame and don't restart the job. A user must retry or resubmit this job. |
resubmit | Bring up the submission UI and possibly modify the job's parameters before sending it back to the Supervisor. |
retry | Put the job back onto the queue as-is, without modifying any of the submission parameters. |
suspend | Like "interrupt" except that it allows the current frame to finish first. |
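
As a rough illustration of how these actions map onto states, the sketch below replays a sequence of actions and records the state after each one. It is a simplified model, not Qube code: `ACTION_RESULT` and `replay` are hypothetical names, and treating retry as a return to pending is an assumption based on "back onto the queue". The resubmit action is omitted because it goes through the submission UI and may change the job's parameters.

```python
# Hypothetical mapping from an action to the state the job ends up in.
ACTION_RESULT = {
    "block": "blocked",
    "interrupt": "pending",  # current frame is killed, job can be picked up again
    "suspend": "pending",    # same result, but the current frame finishes first
    "kill": "killed",
    "retry": "pending",      # assumption: "back onto the queue" means pending
}


def replay(start, actions):
    """Return the list of states visited, beginning with `start`."""
    history = [start]
    for action in actions:
        if action not in ACTION_RESULT:
            raise ValueError(f"unmodelled action: {action}")
        history.append(ACTION_RESULT[action])
    return history


print(replay("running", ["suspend", "kill", "retry"]))
# ['running', 'pending', 'killed', 'pending']
```

Note that interrupt and suspend end in the same state. They differ only in what happens to the frame that is running when the action is issued.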