As a job moves from submission to execution to completion, it passes through a series of states; at any given moment, every job is in exactly one of them. The user or developer can specify a job's starting state through the job structure in the API or on the command line.
State | Meaning |
---|---|
pending | Default state for submitted jobs. Signals to the Supervisor that the job may be started at any time. Suspended jobs are also marked as pending. |
blocked | Alternate starting state for submitted jobs. Tells the system to hold the job until it is unblocked, usually by another job that this one depends on. |
running | Job that is doing work, with no failures. |
failing | Job that has not finished, but has at least one frame or instance that has failed. |
retrying | Job with a retry count greater than zero that has been retried automatically at least once. |
killed | Job that has been killed by a user. Killed jobs must be manually retried or resubmitted. |
complete | Job is no longer running, and all frames have succeeded. |
failed | Job is no longer running, and at least one frame or instance has failed. |
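Four of these states (running, failing, complete, failed) follow directly from frame outcomes. The sketch below is illustrative only and is not the Supervisor's API: it assumes a job exposes hypothetical counts of total, succeeded, and failed frames (or instances), and that a job is "no longer running" once every frame has either succeeded or failed. The pending, blocked, retrying, and killed states depend on submission settings, retry counts, or user actions, so they are not derived here.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    BLOCKED = "blocked"
    RUNNING = "running"
    FAILING = "failing"
    RETRYING = "retrying"
    KILLED = "killed"
    COMPLETE = "complete"
    FAILED = "failed"


def state_from_frames(total: int, succeeded: int, failed: int) -> JobState:
    """Derive running/failing/complete/failed from hypothetical frame counts.

    Assumption: the job has already started, and it is no longer running
    once every frame has either succeeded or failed.
    """
    if min(total, succeeded, failed) < 0 or succeeded + failed > total:
        raise ValueError("inconsistent frame counts")
    finished = succeeded + failed == total
    if finished:
        # No longer running: all frames succeeded -> complete, any failure -> failed.
        return JobState.FAILED if failed else JobState.COMPLETE
    # Frames remain: any failure so far -> failing, otherwise running.
    return JobState.FAILING if failed else JobState.RUNNING
```

For example, a 10-frame job with 5 successes and 1 failure is still failing, and becomes failed once the remaining 4 frames finish, whatever their outcome.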
Actions
A job's state changes in response to actions taken by users or by the Supervisor.
Action | Meaning |
---|---|
block | Put the job or instance into the blocked state. Typically done by users, but auto-wrangling also blocks instances and jobs. |
interrupt | Kill the current frame and put the job into a pending state, where it can be picked up and rerun. |
kill | End the current frame and do not restart the job; the job is marked as killed. A user must retry or resubmit it. |
resubmit | Bring up the submission UI and possibly modify the job's parameters before sending it back to the Supervisor. |
retry | Put the job back onto the queue as-is, without modifying any of the submission parameters. |
suspend | Like "interrupt" except that it allows the current frame to finish first. |
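The action table can be read as a small state machine. The following is a minimal sketch, not the Supervisor's actual interface: the function name `apply_action`, the action strings, and the frame-effect labels are hypothetical. It also assumes that retry places the job back in the pending state (the table only says it goes back onto the queue) and that a resubmitted job takes whichever starting state is chosen at submission time, pending by default or blocked.

```python
from enum import Enum
from typing import Optional, Tuple


class JobState(Enum):
    PENDING = "pending"
    BLOCKED = "blocked"
    RUNNING = "running"
    FAILING = "failing"
    RETRYING = "retrying"
    KILLED = "killed"
    COMPLETE = "complete"
    FAILED = "failed"


# Hypothetical labels for what happens to the frame that is currently executing.
KILL_FRAME = "kill current frame"
FINISH_FRAME = "let current frame finish"

# action -> (resulting state, effect on current frame); None means the table
# does not specify a frame effect for that action.
_TRANSITIONS = {
    "block": (JobState.BLOCKED, None),
    "interrupt": (JobState.PENDING, KILL_FRAME),
    "suspend": (JobState.PENDING, FINISH_FRAME),
    "kill": (JobState.KILLED, KILL_FRAME),
    "retry": (JobState.PENDING, None),  # assumption: re-queued as pending
}


def apply_action(
    action: str, starting_state: JobState = JobState.PENDING
) -> Tuple[JobState, Optional[str]]:
    """Return (new state, effect on the current frame) for an action.

    starting_state is only used by "resubmit", since a resubmitted job can be
    given a new starting state (pending by default, or blocked).
    """
    if action == "resubmit":
        if starting_state not in (JobState.PENDING, JobState.BLOCKED):
            raise ValueError("jobs can only be submitted as pending or blocked")
        return starting_state, None
    try:
        return _TRANSITIONS[action]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None
```

Keeping the transitions in a dictionary makes the difference between interrupt, suspend, and kill explicit: all three stop work on the job, but only suspend lets the current frame finish, and only kill leaves the job in a state that requires a manual retry or resubmit.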