Automatic Job Suspension
Jobs running under LSF can be suspended based on the load conditions on the execution hosts. Each host and each queue can be configured with a set of suspending conditions. If the load conditions on an execution host exceed either the corresponding host or queue suspending conditions, one or more jobs running on that host are suspended to reduce the load.
When LSF suspends a job, it invokes the SUSPEND action. The default SUSPEND action is to send the signal SIGSTOP.
By default, jobs are resumed when load levels fall below the suspending conditions. Each host and queue can be configured so that suspended checkpointable or rerunnable jobs are automatically migrated to another host instead.
If no suspending threshold is configured for a load index, LSF does not check the value of that load index when deciding whether to suspend jobs.
Suspending thresholds can also be used to enforce inter-queue priorities. For example, if you configure a low-priority queue with an r1m (1 minute CPU run queue length) scheduling threshold of 0.25 and an r1m suspending threshold of 1.75, this queue starts one job when the machine is idle. If the job is CPU intensive, it increases the run queue length from 0.25 to roughly 1.25. A high-priority queue configured with a scheduling threshold of 1.5 and an unlimited suspending threshold sends a second job to the same host, increasing the run queue to 2.25. This exceeds the suspending threshold for the low-priority job, so it is stopped. The run queue length stays above 0.25 until the high-priority job exits. After the high-priority job exits, the run queue index drops back to the idle level, so the low-priority job is resumed.
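The arithmetic in this example can be traced with a small sketch. This is an illustrative model only, not LSF code: the names IDLE_R1M, LOW, HIGH, r1m, can_start, and must_suspend are hypothetical, and the model assumes each CPU-intensive job adds exactly 1.0 to the run queue length.

```python
# Illustrative model of the example above (not LSF code).
# Assumption: an idle host reports r1m = 0.25 and each CPU-bound job adds 1.0.
IDLE_R1M = 0.25

LOW = {"sched": 0.25, "stop": 1.75}           # low-priority queue
HIGH = {"sched": 1.5, "stop": float("inf")}   # high-priority queue, unlimited stop


def r1m(running_jobs):
    """Run queue length for a given number of running CPU-bound jobs."""
    return IDLE_R1M + 1.0 * running_jobs


def can_start(queue, load):
    """A job may be dispatched (or resumed) while load is within loadSched."""
    return load <= queue["sched"]


def must_suspend(queue, load):
    """A job is suspended once load exceeds the queue's suspending threshold."""
    return load > queue["stop"]


print(can_start(LOW, r1m(0)))      # idle host: low-priority job starts
print(can_start(HIGH, r1m(1)))     # 1.25 <= 1.5: high-priority job starts
print(must_suspend(LOW, r1m(2)))   # 2.25 > 1.75: low-priority job is stopped
print(can_start(LOW, r1m(1)))      # 1.25 > 0.25: stays suspended
print(can_start(LOW, r1m(0)))      # back to idle: low-priority job resumes
```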
When jobs are running on a host, LSF periodically checks the load levels on that host. If any load index exceeds the corresponding per-host or per-queue suspending threshold for a job, LSF suspends the job. The job remains suspended until the load levels satisfy the scheduling thresholds.
At regular intervals, LSF gets the load levels for that host. The period is defined by the SBD_SLEEP_TIME parameter in the lsb.params file. Then, for each job running on the host, LSF compares the load levels against the host suspending conditions and the queue suspending conditions. If any suspending condition at either the corresponding host or queue level is satisfied as a result of increased load, the job is suspended. A job is only suspended if the load levels are too high for that particular job's suspending thresholds.
There is a time delay between when LSF suspends a job and when the changes to host load are seen by the LIM. To allow time for load changes to take effect, LSF suspends no more than one job at a time on each host.
Jobs from the lowest priority queue are checked first. If two jobs are running on a host and the host is too busy, the lower priority job is suspended and the higher priority job is allowed to continue. If the load levels are still too high on the next turn, the higher priority job is also suspended.
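The "one job per host per turn, lowest priority first" rule can be sketched as follows. This is a simplified model, not the LSF implementation: the Job class and the functions exceeds and suspend_one are hypothetical, a larger queue_priority is assumed to mean higher priority, and every load index is treated as "higher is worse" (as with r1m or pg). The special cases in which LSF does not suspend a job are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    queue_priority: int      # assumption: larger number = higher priority
    stop_thresholds: dict    # load index name -> suspending threshold
    suspended: bool = False


def exceeds(job, load):
    """True if any configured suspending threshold is exceeded for this job."""
    # Indices without a configured threshold are never checked.
    return any(load[idx] > limit
               for idx, limit in job.stop_thresholds.items()
               if idx in load)


def suspend_one(jobs, load):
    """One periodic check on a host: suspend at most one job, lowest priority first."""
    running = sorted((j for j in jobs if not j.suspended),
                     key=lambda j: j.queue_priority)
    for job in running:
        if exceeds(job, load):
            job.suspended = True
            return job
    return None


jobs = [Job("night", 10, {"r1m": 2.0}), Job("urgent", 50, {"r1m": 2.0})]
load = {"r1m": 3.0}
print(suspend_one(jobs, load).name)  # first turn: lower priority job
print(suspend_one(jobs, load).name)  # load still too high on the next turn
```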
If a job is suspended because of its own load, the load drops as soon as the job is suspended. When the load goes back within the thresholds, the job is resumed until it causes itself to be suspended again.
In some special cases, LSF does not automatically suspend jobs because of load levels. LSF does not suspend a job:
- Forced to run with
- If it is the only job running on a host, unless the host is being used interactively. When only one job is running on a host, it is not suspended for any reason except that the host is not interactively idle (the it interactive idle time load index is less than one minute). This means that once a job is started on a host, at least one job continues to run unless there is an interactive user on the host. Once the job is suspended, it is not resumed until all the scheduling conditions are met, so it should not interfere with the interactive user.
- Because of the paging rate, unless the host is being used interactively. When a host has interactive users, LSF suspends jobs with high paging rates, to improve the response time on the host for interactive users. When a host is idle, the pg (paging rate) load index is ignored. The PG_SUSP_IT parameter in lsb.params controls this behavior. If the host has been idle for more than PG_SUSP_IT minutes, the pg load index is not checked against the suspending threshold.
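The paging-rate exception can be expressed as a short predicate. This is a sketch of the rule as described above, not LSF code: the function names are hypothetical, and the idle time and PG_SUSP_IT are assumed to be given in the same unit.

```python
def pg_threshold_applies(idle_time, pg_susp_it):
    """pg is checked only while the host has NOT been idle longer than PG_SUSP_IT."""
    return idle_time <= pg_susp_it


def suspend_for_paging(pg, pg_stop, idle_time, pg_susp_it):
    """Suspend on paging rate only when the pg threshold applies and is exceeded."""
    return pg_threshold_applies(idle_time, pg_susp_it) and pg > pg_stop


# Hypothetical values: paging rate 40 against a suspending threshold of 30.
print(suspend_for_paging(40, 30, idle_time=0, pg_susp_it=3))    # host in use
print(suspend_for_paging(40, 30, idle_time=10, pg_susp_it=3))   # host idle: pg ignored
```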
LSF provides different alternatives for configuring suspending conditions. At the host level, suspending conditions are configured as load thresholds. At the queue level, they are configured as load thresholds, with the STOP_COND parameter in the lsb.queues file, or both.
The load indices most commonly used for suspending conditions are the CPU run queue lengths (r15m), paging rate (pg), and idle time (it). The swp and tmp indices are also considered for suspending jobs.
To give priority to interactive users, set the suspending threshold on the it (idle time) load index to a non-zero value. Jobs are stopped when any user is active, and resumed when the host has been idle for the time given in the
To tune the suspending threshold for paging rate, it helps to know the behavior of your application. On an otherwise idle machine, check the paging rate using lsload, and then start your application. Watch the paging rate as the application runs. By subtracting the idle paging rate from the active paging rate, you get a number for the paging rate of your application. The suspending threshold should allow at least 1.5 times that amount. A job can be scheduled at any paging rate up to the scheduling threshold, so the suspending threshold should be at least the scheduling threshold plus 1.5 times the application paging rate. This prevents the system from scheduling a job and then immediately suspending it because of its own paging.
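As a worked example of this rule of thumb, the following sketch computes the smallest suspending threshold that satisfies it. The function name and the sample numbers are hypothetical; the application's own paging rate is taken as the rate observed while it runs minus the rate observed on the idle machine.

```python
def min_suspend_threshold(sched_threshold, idle_rate, active_rate, factor=1.5):
    """Smallest suspending threshold under the 1.5x rule of thumb.

    idle_rate   : paging rate seen with lsload on the otherwise idle host
    active_rate : paging rate seen while the application is running
    """
    app_rate = active_rate - idle_rate          # paging caused by the application
    return sched_threshold + factor * app_rate


# Hypothetical measurements: idle 2.0, active 12.0, pg scheduling threshold 4.0.
print(min_suspend_threshold(4.0, idle_rate=2.0, active_rate=12.0))  # 19.0
```

The same calculation applies to the run queue length indices, using the load that one job adds in place of the application paging rate.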
The effective CPU run queue length condition should be configured like the paging rate. For CPU-intensive sequential jobs, the effective run queue length indices increase by approximately one for each job. For jobs that use more than one process, you should make some test runs to determine your job's effect on the run queue length indices. Again, the suspending threshold should be equal to at least the scheduling threshold plus 1.5 times the load for one job.
If new hosts are added for resizable jobs, LSF considers load threshold scheduling on those new hosts. If hosts are removed from allocation, LSF does not apply load threshold scheduling for resizing the jobs.
Configuring load thresholds at queue level
The queue definition (lsb.queues) can contain thresholds for 0 or more of the load indices. Any load index that does not have a configured threshold has no effect on job scheduling.
Each load index is configured on a separate line with the format:
Specify the name of the load index, for example r1m for the 1-minute CPU run queue length or pg for the paging rate.
loadSched is the scheduling threshold for this load index.
loadStop is the suspending threshold. The loadSched condition must be satisfied by a host before a job is dispatched to it and also before a job suspended on a host can be resumed. If the loadStop condition is satisfied, a job is suspended.
The loadSched and loadStop thresholds permit the specification of conditions using simple AND/OR logic. For example, the specification:
MEM=100/10 SWAP=200/30
translates into a loadSched condition of mem>=100 && swap>=200 and a loadStop condition of mem < 10 || swap < 30.
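The AND/OR translation for this example can be sketched in a few lines. This is an illustration, not an LSF parser: the function names are hypothetical, the first number of each pair is read as the scheduling threshold and the second as the suspending threshold, and the comparisons assume indices where more is better (such as mem and swap). Indices such as r1m or pg, where a higher value means more load, would need the comparisons reversed.

```python
def parse_thresholds(spec):
    """Parse 'MEM=100/10 SWAP=200/30' into {index: (loadSched, loadStop)}."""
    result = {}
    for item in spec.split():
        name, values = item.split("=")
        sched, stop = values.split("/")
        result[name.lower()] = (float(sched), float(stop))
    return result


def sched_ok(thresholds, load):
    """AND logic: every index must meet its scheduling threshold."""
    return all(load[name] >= sched for name, (sched, _) in thresholds.items())


def should_stop(thresholds, load):
    """OR logic: any index below its suspending threshold stops the job."""
    return any(load[name] < stop for name, (_, stop) in thresholds.items())


t = parse_thresholds("MEM=100/10 SWAP=200/30")
print(sched_ok(t, {"mem": 150, "swap": 250}))    # mem>=100 && swap>=200
print(should_stop(t, {"mem": 50, "swap": 20}))   # mem<10 || swap<30
```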
The r15m CPU run queue length conditions are compared to the effective queue length as reported by lsload -E, which is normalised for multiprocessor hosts. Thresholds for these parameters should be set at appropriate levels for single processor hosts.
- Configure load thresholds consistently across queues. If a low priority queue has higher suspension thresholds than a high priority queue, then jobs in the higher priority queue are suspended before jobs in the low priority queue.
Configuring load thresholds at host level
A shared resource cannot be used as a load threshold in the Hosts section of the
Configuring suspending conditions at queue level
The condition for suspending a job can be specified using the queue-level STOP_COND parameter. It is defined by a resource requirement string. Only the select section of the resource requirement string is considered when stopping a job. All other sections are ignored.
This parameter provides similar but more flexible functionality for loadStop. If loadStop thresholds have been specified, then a job is suspended if either the STOP_COND is TRUE or the loadStop thresholds are exceeded.
This queue suspends a job based on the idle time for desktop machines and based on availability of swap and memory on compute servers. Assume cs is a Boolean resource defined in the lsf.shared file and configured in the cluster_name file to indicate that a host is a compute server:
Begin Queue
.
STOP_COND= select[((!cs && it < 5) || (cs && mem < 15 && swap < 50))]
.
End Queue
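To see how the example condition behaves on different hosts, its Boolean logic can be mirrored in a small sketch. This is not how LSF evaluates resource requirement strings: the function names and the host dictionaries are hypothetical, and the load values are sample numbers.

```python
def stop_cond(host):
    """Mirror of select[((!cs && it < 5) || (cs && mem < 15 && swap < 50))]."""
    cs = host["cs"]
    desktop_busy = (not cs) and host["it"] < 5
    server_short = cs and host["mem"] < 15 and host["swap"] < 50
    return bool(desktop_busy or server_short)


def should_suspend(host, load_stop_exceeded=False):
    """Suspend if either STOP_COND is true or the loadStop thresholds are exceeded."""
    return stop_cond(host) or load_stop_exceeded


desktop = {"cs": False, "it": 2, "mem": 500, "swap": 900}
server = {"cs": True, "it": 0, "mem": 10, "swap": 100}
print(stop_cond(desktop))   # desktop with a recently active user
print(stop_cond(server))    # swap is still sufficient, so the condition is false
```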
Viewing host-level and queue-level suspending conditions
The suspending conditions are displayed by the
Viewing job-level suspending conditions
The thresholds that apply to a particular job are the more restrictive of the host and queue thresholds, and are displayed by the
Viewing suspend reason
The bjobs -lp command shows the load threshold that caused LSF to suspend a job, together with the scheduling parameters.
The use of STOP_COND affects the suspending reasons as displayed by the bjobs command. If STOP_COND is specified in the queue and the loadStop thresholds are not specified, the suspending reasons for each individual load index are not displayed.
Resuming suspended jobs
Jobs are suspended to prevent overloading hosts, to prevent batch jobs from interfering with interactive use, or to allow a more urgent job to run. When the host is no longer overloaded, suspended jobs should continue running.
When LSF automatically resumes a job, it invokes the RESUME action. The default action for RESUME is to send the signal SIGCONT.
If there are any suspended jobs on a host, LSF checks the load levels in each dispatch turn.
If the load levels are within the scheduling thresholds for the queue and the host, and all the resume conditions for the queue (RESUME_COND in lsb.queues) are satisfied, the job is resumed.
If RESUME_COND is not defined, then the loadSched thresholds are used to control resuming of jobs: all the loadSched thresholds must be satisfied for the job to be resumed. The loadSched thresholds are ignored if RESUME_COND is defined.
Jobs from higher priority queues are checked first. To prevent overloading the host again, only one job is resumed in each dispatch turn.
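The resume rules can be summarized in a sketch. This is a simplified model rather than LSF code: the job dictionaries and the functions may_resume and resume_one are hypothetical, a larger priority value is assumed to mean a higher priority queue, load indices are treated as "higher is worse", and RESUME_COND is modeled as a plain Python predicate that replaces the loadSched check when it is present.

```python
def may_resume(load, sched_thresholds, resume_cond=None):
    """Decide whether one suspended job may be resumed."""
    if resume_cond is not None:
        # loadSched thresholds are ignored when RESUME_COND is defined.
        return bool(resume_cond(load))
    # Otherwise all loadSched thresholds must be satisfied.
    return all(load[idx] <= limit for idx, limit in sched_thresholds.items())


def resume_one(suspended_jobs, load):
    """One dispatch turn: resume at most one job, highest priority first."""
    ordered = sorted(suspended_jobs, key=lambda j: j["priority"], reverse=True)
    for job in ordered:
        if may_resume(load, job["sched"], job.get("resume_cond")):
            return job
    return None


jobs = [
    {"name": "low", "priority": 10, "sched": {"r1m": 0.25}},
    {"name": "high", "priority": 50, "sched": {"r1m": 1.0}},
]
print(resume_one(jobs, {"r1m": 0.5})["name"])   # only the high-priority job qualifies
```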
Specify resume condition
- Use RESUME_COND in lsb.queues to specify the condition that must be satisfied on a host if a suspended job is to be resumed.
Only the select section of the resource requirement string is considered when resuming a job. All other sections are ignored.
Viewing resume thresholds
The bjobs -l command displays the scheduling thresholds that control when a job is resumed.
Platform Computing Inc.