Synopsis
As a job moves from submission to execution to completion, it passes through a series of states, and at any given moment every job is in exactly one of them. Commands issued from the command line or through a Qube! GUI instruct the Supervisor to generate an event that changes the state of the job; this change is called a transition. The description of all possible states and their transitions is called a state machine.
Initial Job States
The key to using Qube! effectively to manage jobs is understanding how different commands change the state of a job. Qube! jobs can be submitted in one of two initial states: pending and blocked. Pending is the default and signals to the Supervisor that the job may be started at any time. The starting state of a job can be specified by the user or developer through the job structure in the API, or through the command line.
State | Meaning |
---|---|
pending | Default state for submitted jobs. Signals to the Supervisor that the job may be started at any time. Jobs which have been suspended will also be marked as pending. |
blocked | Alternate state for submitted jobs. Tells the system to hold the job until it is unblocked by something, usually another job that this one depends on. |
...
Intermediate Job States
Normally, submitting a job places it in the pending state. The Supervisor takes over from there and, without any other intervention, places the job in the running state when it executes. A running job can be killed or suspended by the user who submitted it, or by other users with the appropriate permissions. A job which is killed cannot run again unless it is retried or resubmitted. A suspended job returns to the pending state once its current frame has finished. A job can also be interrupted, which asks the Supervisor to force the job off a host, immediately killing the work in progress. The job is then placed back in the queue in the pending state, to be executed on another qualified host.
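The transitions described above can be sketched as a small lookup table. This is only an illustrative model, not Qube! code: the event names "start" and "unblock" are hypothetical labels for what the Supervisor does, and the assumption that a retried job re-enters the queue as pending is inferred from the description of retry rather than stated outright.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    BLOCKED = "blocked"
    RUNNING = "running"
    KILLED = "killed"


# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (State.BLOCKED, "unblock"): State.PENDING,
    (State.PENDING, "start"): State.RUNNING,      # Supervisor dispatches the job
    (State.RUNNING, "kill"): State.KILLED,
    (State.RUNNING, "suspend"): State.PENDING,
    (State.RUNNING, "interrupt"): State.PENDING,  # requeued for another host
    (State.KILLED, "retry"): State.PENDING,       # assumption: retry requeues as pending
}


def apply(state: State, event: str) -> State:
    """Return the next state, or raise if the transition is not modeled."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {event!r} from {state.value}") from None


state = State.PENDING
for event in ("start", "interrupt", "start", "kill", "retry"):
    state = apply(state, event)
print(state.value)  # pending
```

Keeping the transitions in one table makes it easy to see that a killed job has no path back to running except through retry.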
Final Job States
A job that completes successfully is marked as complete, and a job that completes unsuccessfully is marked as failed.
Example
To submit a job in the blocked state instead of the default pending state:
% qbsub --state blocked ls
State | Meaning |
---|---|
running | Job that is doing work, with no failures. |
failing | Job that has not finished, but has at least one frame or instance that has failed. |
retrying | Jobs that have retry counts greater than zero, and have been retried (automatically) at least once, are marked as retrying. |
killed | Job that has been killed by a user. Killed jobs must be manually retried or resubmitted. |
complete | Job is no longer running, and all frames have succeeded. |
failed | Job is no longer running, and at least one frame or instance has failed. |
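The distinction between running, failing, complete, and failed in the table above comes down to two questions: is the job still running, and has any frame or instance failed? A minimal sketch of that rule follows. The function name and its inputs are hypothetical and are not part of the Qube! API; killed and retrying jobs are deliberately left out.

```python
from typing import Iterable


def job_state(frame_failed: Iterable[bool], still_running: bool) -> str:
    """Classify a job from per-frame failure flags (True means the frame failed)."""
    any_failed = any(frame_failed)
    if still_running:
        # Work remains: a single failure is enough to mark the job as failing.
        return "failing" if any_failed else "running"
    # No work remains: the job is complete only if nothing failed.
    return "failed" if any_failed else "complete"


print(job_state([False, True, False], still_running=True))   # failing
print(job_state([False, False], still_running=False))        # complete
```

In this model, failing is simply the in-progress counterpart of failed, just as running is the in-progress counterpart of complete.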
Actions
A job's state changes in response to actions taken by users or by the Supervisor.
Action | Meaning |
---|---|
block | Hold the job in the blocked state until it is unblocked. Typically done by users, but auto-wrangling will also block instances and jobs. |
interrupt | Kill the current frame and put the job into a pending state, where it can be picked up and rerun. |
kill | End the current frame and don't restart the job. A user must retry or resubmit this job. |
resubmit | Bring up the submission UI and possibly modify the job's parameters before sending it back to the Supervisor. |
retry | Put the job back onto the queue as-is, without modifying any of the submission parameters. |
suspend | Like "interrupt" except that it allows the current frame to finish first. |
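The interrupt, suspend, and kill actions differ in what happens to the frame in progress and in whether the job returns to the queue on its own. The sketch below records those two properties as described in the table above. The class and field names are hypothetical and exist only for illustration; they are not Qube! identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    current_frame_finishes: bool  # is the in-flight frame allowed to complete?
    returns_to_pending: bool      # does the job go back to the queue by itself?


EFFECTS = {
    "interrupt": Effect(current_frame_finishes=False, returns_to_pending=True),
    "suspend": Effect(current_frame_finishes=True, returns_to_pending=True),
    "kill": Effect(current_frame_finishes=False, returns_to_pending=False),
}


def needs_manual_retry(action: str) -> bool:
    """True when a user must retry or resubmit the job after this action."""
    return not EFFECTS[action].returns_to_pending


print(needs_manual_retry("kill"))       # True
print(needs_manual_retry("interrupt"))  # False
```

Read this way, suspend is the gentlest of the three: it is the only action that lets the current frame finish before the job leaves the host.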