Goal-Oriented SLA-Driven Scheduling
- Using Goal-Oriented SLA Scheduling
- Configuring Service Classes for SLA Scheduling
- View Information about SLAs and Service Classes
- Understanding Service Class Behavior
Using Goal-Oriented SLA Scheduling
Goal-oriented SLA scheduling policies help you configure your workload so that your jobs are completed on time and reduce the risk of missed deadlines. They enable you to focus on the "what and when" of your projects, not the low-level details of "how" resources need to be allocated to satisfy various workloads.
Service-level agreements in LSF
A service-level agreement (SLA) defines how a service is delivered and the parameters for the delivery of a service. It specifies what a service provider and a service recipient agree to, defining the relationship between the provider and recipient with respect to a number of issues.
An SLA in LSF is a "just-in-time" scheduling policy that defines an agreement between LSF administrators and LSF users. The SLA scheduling policy defines how many jobs should be run from each SLA to meet the configured goals.
Restriction: LSF MultiCluster does not support SLAs.
SLA definitions consist of service-level goals that are expressed in individual service classes. A service class is the actual configured policy that sets the service-level goals for the LSF system. The SLA defines the workload (jobs or other services) and users that need the work done, while the service class that addresses the SLA defines individual goals, and a time window when the service class is active.
You configure the following kinds of goals:
- Deadline goals: A specified number of jobs should be completed within a specified time window. For example, run all jobs submitted over a weekend.
- Velocity goals: Expressed as concurrently running jobs. For example: maintain 10 running jobs between 9:00 a.m. and 5:00 p.m. Velocity goals are well suited for short jobs (run time less than one hour). Such jobs leave the system quickly, and configuring a velocity goal ensures a steady flow of jobs through the system.
- Throughput goals: Expressed as number of finished jobs per hour. For example: finish 15 jobs per hour between the hours of 6:00 p.m. and 7:00 a.m. Throughput goals are suitable for medium to long running jobs. These jobs stay longer in the system, so you typically want to control their rate of completion rather than their flow.
Combining different types of goals
You might want to set velocity goals to maximize quick work during the day, and set deadline and throughput goals to manage longer running work on nights and over weekends.
How service classes perform goal-oriented scheduling
Goal-oriented scheduling makes use of other, lower level LSF policies like queues and host partitions to satisfy the service-level goal that the service class expresses. The decisions of a service class are considered first before any queue or host partition decisions. Limits are still enforced with respect to lower level scheduling objects like queues, hosts, and users.
Optimum number of running jobs
As jobs are submitted, LSF determines the optimum number of job slots (or concurrently running jobs) needed for the service class to meet its service-level goals. LSF schedules a number of jobs at least equal to the optimum number of slots calculated for the service class.
LSF attempts to meet SLA goals in the most efficient way, using the optimum number of job slots so that other service classes or other types of work in the cluster can still progress. For example, in a service class that defines a deadline goal, LSF spreads out the work over the entire time window for the goal instead of allocating as many slots as possible at the beginning to finish earlier than the deadline. This avoids blocking other work.
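The spreading behavior can be pictured with a rough back-of-the-envelope model. The sketch below is an assumption for illustration, not LSF's internal calculation: it estimates the slots needed for a deadline goal as the total remaining run time divided by the time left in the window, rounded up. The function name and parameters are hypothetical.

```python
import math


def optimum_running_jobs(pending_jobs: int, run_time_min: float, minutes_left: float) -> int:
    """Estimate the slots needed to finish pending_jobs before the window closes.

    Simplified model (an assumption, not LSF's actual algorithm):
    total work / time remaining, rounded up to whole slots.
    """
    if pending_jobs <= 0:
        return 0
    if minutes_left <= 0:
        raise ValueError("the time window has already closed")
    return max(1, math.ceil(pending_jobs * run_time_min / minutes_left))


# Six 5-minute jobs in a 30-minute window fit on a single slot.
print(optimum_running_jobs(6, 5, 30))
# Doubling the workload in the same window needs a second slot.
print(optimum_running_jobs(12, 5, 30))
```

Under this model, the slot count grows only when the remaining work no longer fits in the remaining time, which is the sense in which work is spread over the window instead of front-loaded.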
Submit jobs to a service class
You submit jobs to a service class as you would to a queue, except that a service class is a higher level scheduling policy that makes use of other, lower level LSF policies like queues and host partitions to satisfy the service-level goal that the service class expresses.
The service class name where the job is to run is configured in lsb.serviceclasses. If the SLA does not exist or the user is not a member of the service class, the job is rejected.
Outside of the configured time windows, the SLA is not active, and LSF schedules jobs without enforcing any service-level goals. Jobs will flow through queues following queue priorities even if they are submitted with -sla.
Run bsub -sla service_class_name to submit a job to a service class for SLA-driven scheduling. For example:
bsub -W 15 -sla Kyuquot sleep 100
submits the UNIX command sleep together with its argument 100 as a job to the service class named Kyuquot.
Submitting with a run limit
You should submit your jobs with a run time limit at the job level (-W option), the application level (RUNLIMIT parameter in the application definition in lsb.applications), or the queue level (RUNLIMIT parameter in the queue definition in lsb.queues). You can also submit the job with a run time estimate defined at the application level (RUNTIME parameter in lsb.applications) instead of or in conjunction with the run time limit.
The following table describes how LSF uses the values that you provide for SLA-driven scheduling.
| If you specify... | And... | Then... |
|---|---|---|
| A run time limit and a run time estimate | The run time estimate is less than or equal to the run time limit | LSF uses the run time estimate to compute the optimum number of running jobs. |
| A run time limit | You do not specify a run time estimate, or the estimate is greater than the limit | LSF uses the run time limit to compute the optimum number of running jobs. |
| A run time estimate | You do not specify a run time limit | LSF uses the run time estimate to compute the optimum number of running jobs. |
| Neither a run time limit nor a run time estimate | | LSF automatically adjusts the optimum number of running jobs according to the observed run time of finished jobs. |
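The selection rules in the table can be written as a small decision function. This is a minimal sketch of the documented rules only. The function name is hypothetical, and a return value of None stands for the case where LSF falls back to the observed run time of finished jobs.

```python
from typing import Optional


def scheduling_run_time(run_limit: Optional[float],
                        run_estimate: Optional[float]) -> Optional[float]:
    """Return the run time used to compute the optimum number of running jobs.

    Returns None when neither value is given; LSF then adapts to the
    observed run time of finished jobs.
    """
    if run_limit is not None and run_estimate is not None:
        # The estimate is used only when it does not exceed the limit.
        return run_estimate if run_estimate <= run_limit else run_limit
    if run_limit is not None:
        return run_limit
    return run_estimate  # may be None


print(scheduling_run_time(15, 10))    # estimate within the limit
print(scheduling_run_time(15, 20))    # estimate exceeds the limit
print(scheduling_run_time(None, None))
```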
Modify SLA jobs (bmod)
Run bmod -sla to modify the service class a job is attached to, or to attach a submitted job to a service class. Run bmod -slan to detach a job from a service class:
bmod -sla Kyuquot 2307
Attaches job 2307 to the service class Kyuquot.
bmod -slan 2307
Detaches job 2307 from the service class it is attached to.
You cannot:
- Move job array elements from one service class to another, only entire job arrays
- Modify the service class of jobs already attached to a job group
If a default SLA is configured, bmod -slan moves the job to the default SLA. If the job is already attached to the default SLA, bmod -slan has no effect on that job.
Configuring Service Classes for SLA Scheduling
Configure service classes in /configdir/lsb.serviceclasses. Each service class is defined in a ServiceClass section.
Each service class section begins with the line Begin ServiceClass and ends with the line End ServiceClass. You must specify:
- A service class name
- At least one goal (deadline, throughput, or velocity) and a time window when the goal is active
- A service class priority
All other parameters are optional. You can configure as many service class sections as you need.
Important: The name you use for your service classes cannot be the same as an existing host partition or user group name.
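The required parameters can be checked mechanically before a configuration is deployed. The following is a minimal sketch, not an LSF tool: it assumes the KEY = value layout and the backslash line continuation used by the service class sections in this document, and the helper names and the Broken service class are hypothetical.

```python
from typing import Dict, List

REQUIRED = ("NAME", "PRIORITY", "GOALS")


def parse_service_classes(text: str) -> List[Dict[str, str]]:
    """Collect KEY = value pairs from each Begin/End ServiceClass section."""
    sections: List[Dict[str, str]] = []
    current = None
    # Join continuation lines that end with a backslash.
    for line in text.replace("\\\n", " ").splitlines():
        line = line.strip()
        if line == "Begin ServiceClass":
            current = {}
        elif line == "End ServiceClass" and current is not None:
            sections.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return sections


def missing_parameters(section: Dict[str, str]) -> List[str]:
    """List the required parameters that are absent or empty."""
    return [key for key in REQUIRED if not section.get(key)]


SAMPLE = r"""
Begin ServiceClass
NAME = Tofino
PRIORITY = 20
GOALS = [VELOCITY 10 timeWindow (9:00-17:00)] \
        [VELOCITY 30 timeWindow (17:30-8:30)]
End ServiceClass
Begin ServiceClass
NAME = Broken
End ServiceClass
"""

for section in parse_service_classes(SAMPLE):
    print(section["NAME"], missing_parameters(section))
```

The sketch checks only that the three required parameters are present. It does not validate goal syntax, time windows, or name clashes with host partitions and user groups.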
User groups for service classes
You can control access to the SLA by configuring a user group for the service class. If LSF user groups are specified in lsb.users, each user in the group can submit jobs to this service class. If a group contains a subgroup, the service class policy applies to each member in the subgroup recursively. The group can define fairshare among its members, and the SLA defined by the service class enforces the fairshare policy among the users in the user group configured for the SLA.
By default, all users in the cluster can submit jobs to the service class.
Service class priority
A higher value indicates a higher priority, relative to other service classes. Similar to queue priority, service classes access the cluster resources in priority order.
LSF schedules jobs from one service class at a time, starting with the highest-priority service class. If multiple service classes have the same priority, LSF runs the jobs from these service classes in the order in which the service classes are configured.
Service class priority in LSF is completely independent of the UNIX scheduler's priority system for time-sharing processes. In LSF, the NICE parameter is used to set the UNIX time-sharing priority for batch jobs.
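The ordering rule (highest priority first, ties broken by configuration order) can be illustrated with a stable sort. This is a sketch of the rule as described above, not LSF code. The function name is hypothetical, and the priorities are taken from the service class examples in this document.

```python
from typing import List, Tuple


def scheduling_order(service_classes: List[Tuple[str, int]]) -> List[str]:
    """Order (name, priority) pairs by descending priority.

    Python's sort is stable, so classes with equal priority keep the
    order in which they were configured.
    """
    ranked = sorted(service_classes, key=lambda sc: sc[1], reverse=True)
    return [name for name, _ in ranked]


# Listed in configuration order.
configured = [("Uclulet", 20), ("Kyuquot", 23), ("Tofino", 20)]
print(scheduling_order(configured))
```

Kyuquot is considered first because of its higher priority, and Uclulet precedes Tofino because it is configured earlier.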
Service class configuration examples
- The service class Uclulet defines one deadline goal that is active during working hours between 8:30 AM and 4:00 PM. All jobs in the service class should complete by the end of the specified time window. Outside of this time window, the SLA is inactive and jobs are scheduled without any goal being enforced:
Begin ServiceClass
NAME = Uclulet
PRIORITY = 20
GOALS = [DEADLINE timeWindow (8:30-16:00)]
DESCRIPTION = "working hours"
End ServiceClass
- The service class Nanaimo defines a deadline goal that is active during the weekends and at nights:
Begin ServiceClass
NAME = Nanaimo
PRIORITY = 20
GOALS = [DEADLINE timeWindow (5:18:00-1:8:30 20:00-8:30)]
DESCRIPTION = "weekend nighttime regression tests"
End ServiceClass
- The service class Inuvik defines a throughput goal of 6 jobs per hour that is always active:
Begin ServiceClass
NAME = Inuvik
PRIORITY = 20
GOALS = [THROUGHPUT 6 timeWindow ()]
DESCRIPTION = "constant throughput"
End ServiceClass
Tip: To configure a time window that is always open, use the timeWindow keyword with empty parentheses.
- The service class Tofino defines two velocity goals in a 24 hour period. The first goal is to have a maximum of 10 concurrently running jobs during business hours (9:00 a.m. to 5:00 p.m.). The second goal is a maximum of 30 concurrently running jobs during off-hours (5:30 p.m. to 8:30 a.m.):
Begin ServiceClass
NAME = Tofino
PRIORITY = 20
GOALS = [VELOCITY 10 timeWindow (9:00-17:00)] \
        [VELOCITY 30 timeWindow (17:30-8:30)]
DESCRIPTION = "day and night velocity"
End ServiceClass
- The service class Kyuquot defines a velocity goal that is active during working hours (9:00 a.m. to 5:30 p.m.) and a deadline goal that is active during off-hours (5:30 p.m. to 9:00 a.m.). Only users user1 and user2 can submit jobs to this service class:
Begin ServiceClass
NAME = Kyuquot
PRIORITY = 23
USER_GROUP = user1 user2
GOALS = [VELOCITY 8 timeWindow (9:00-17:30)] \
        [DEADLINE timeWindow (17:30-9:00)]
DESCRIPTION = "Daytime/Nighttime SLA"
End ServiceClass
- The service class Tevere defines a combination similar to Kyuquot, but with a deadline goal that takes effect overnight and on weekends. During working hours on weekdays, the velocity goal favors a mix of short and medium jobs:
Begin ServiceClass
NAME = Tevere
PRIORITY = 20
GOALS = [VELOCITY 100 timeWindow (9:00-17:00)] \
        [DEADLINE timeWindow (17:30-8:30 5:17:30-1:8:30)]
DESCRIPTION = "nine to five"
End ServiceClass
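Several of these time windows wrap past midnight (for example 17:30-8:30), which is easy to misread. The sketch below shows one way to test whether a daily hour:minute window is open at a given time. It is an illustration under stated assumptions, not LSF's parser: it handles only the hour:minute form (not the day:hour:minute form used by Nanaimo and Tevere), treats the end of the window as exclusive, and uses hypothetical function names.

```python
from datetime import time


def parse_hhmm(text: str) -> time:
    """Convert 'H:MM' or 'HH:MM' into a datetime.time."""
    hour, minute = text.split(":")
    return time(int(hour), int(minute))


def window_is_active(window: str, now: time) -> bool:
    """Check a daily 'HH:MM-HH:MM' window; an empty string means always open."""
    if not window.strip():
        return True
    start_text, end_text = window.split("-")
    start, end = parse_hhmm(start_text), parse_hhmm(end_text)
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, for example 17:30-8:30.
    return now >= start or now < end


# Tofino's two velocity windows, checked at 17:15.
print(window_is_active("9:00-17:00", time(17, 15)))
print(window_is_active("17:30-8:30", time(17, 15)))
```

Both checks return False at 17:15: Tofino's windows leave a half-hour gap between 17:00 and 17:30 in which neither velocity goal is active.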
View Information about SLAs and Service Classes
Monitor the progress of an SLA (bsla)
Run bsla to display the properties of service classes configured in lsb.serviceclasses and dynamic information about the state of each configured service class.
- One velocity goal of service class Tofino is active and on time. The other configured velocity goal is inactive.
bsla
SERVICE CLASS NAME: Tofino -- day and night velocity
PRIORITY: 20

GOAL: VELOCITY 30
ACTIVE WINDOW: (17:30-8:30)
STATUS: Inactive
SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD

GOAL: VELOCITY 10
ACTIVE WINDOW: (9:00-17:00)
STATUS: Active:On time
SLA THROUGHPUT: 10.00 JOBS/CLEAN_PERIOD

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
  300   280   10      0      0      10
- The deadline goal of service class Uclulet is not being met, and bsla displays the status Active:Delayed:
bsla
SERVICE CLASS NAME: Uclulet -- working hours
PRIORITY: 20

GOAL: DEADLINE
ACTIVE WINDOW: (8:30-19:00)
STATUS: Active:Delayed
SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD
ESTIMATED FINISH TIME: (Tue Oct 28 06:17)
OPTIMUM NUMBER OF RUNNING JOBS: 6

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
   40    39    1      0      0       0
- The configured velocity goal of the service class Kyuquot is active and on time. The configured deadline goal of the service class is inactive.
bsla Kyuquot
SERVICE CLASS NAME: Kyuquot -- Daytime/Nighttime SLA
PRIORITY: 23
USER_GROUP: user1 user2

GOAL: VELOCITY 8
ACTIVE WINDOW: (9:00-17:30)
STATUS: Active:On time
SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD

GOAL: DEADLINE
ACTIVE WINDOW: (17:30-9:00)
STATUS: Inactive
SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
    0     0    0      0      0       0
- The throughput goal of service class Inuvik is always active.
bsla Inuvik
SERVICE CLASS NAME: Inuvik -- constant throughput
PRIORITY: 20

GOAL: THROUGHPUT 6
ACTIVE WINDOW: Always Open
STATUS: Active:On time
SLA THROUGHPUT: 10.00 JOBs/CLEAN_PERIOD
OPTIMUM NUMBER OF RUNNING JOBS: 5

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
  110    95    5      0      0      10
View jobs running in an SLA (bjobs)
Run bjobs -sla to display jobs running in a service class:
bjobs -sla Inuvik
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
136    user1  RUN   normal  hostA      hostA      sleep 100  Sep 28 13:24
137    user1  RUN   normal  hostA      hostB      sleep 100  Sep 28 13:25
Use -sla with -g to display job groups attached to a service class. Once a job group is attached to a service class, all jobs submitted to that group are subject to the SLA.
Track historical behavior of an SLA (bacct)
Run bacct to display historical performance of a service class. For example, service classes Inuvik and Tuktoyaktuk configure throughput goals.
bsla
SERVICE CLASS NAME: Inuvik -- throughput 6
PRIORITY: 20

GOAL: THROUGHPUT 6
ACTIVE WINDOW: Always Open
STATUS: Active:On time
SLA THROUGHPUT: 10.00 JOBs/CLEAN_PERIOD
OPTIMUM NUMBER OF RUNNING JOBS: 5

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
  111    94    5      0      0      12
--------------------------------------------------------------
SERVICE CLASS NAME: Tuktoyaktuk -- throughput 3
PRIORITY: 15

GOAL: THROUGHPUT 3
ACTIVE WINDOW: Always Open
STATUS: Active:On time
SLA THROUGHPUT: 4.00 JOBs/CLEAN_PERIOD
OPTIMUM NUMBER OF RUNNING JOBS: 4

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
  104    96    4      0      0       4
These two service classes have the following historical performance. For SLA Inuvik, bacct shows a total throughput of 8.94 jobs per hour over a period of 20.58 hours:
bacct -sla Inuvik
Accounting information about jobs that are:
  - submitted by users user1,
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on service classes Inuvik,
------------------------------------------------------------------------------
SUMMARY:      ( time unit: second )
 Total number of done jobs:       183      Total number of exited jobs:     1
 Total CPU time consumed:        40.0      Average CPU time consumed:     0.2
 Maximum CPU time of a job:       0.3      Minimum CPU time of a job:     0.1
 Total wait time in queues: 1947454.0
 Average wait time in queue:  10584.0
 Maximum wait time in queue:  18912.0      Minimum wait time in queue:    7.0
 Average turnaround time:       12268 (seconds/job)
 Maximum turnaround time:       22079      Minimum turnaround time:      1713
 Average hog factor of a job:    0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:    0.00      Minimum hog factor of a job:  0.00
 Total throughput:               8.94 (jobs/hour)  during   20.58 hours
 Beginning time:         Oct 11 20:23      Ending time:          Oct 12 16:58
For SLA Tuktoyaktuk, bacct shows a total throughput of 4.36 jobs per hour over a period of 19.95 hours:
bacct -sla Tuktoyaktuk
Accounting information about jobs that are:
  - submitted by users user1,
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on service classes Tuktoyaktuk,
------------------------------------------------------------------------------
SUMMARY:      ( time unit: second )
 Total number of done jobs:        87      Total number of exited jobs:     0
 Total CPU time consumed:        18.0      Average CPU time consumed:     0.2
 Maximum CPU time of a job:       0.3      Minimum CPU time of a job:     0.1
 Total wait time in queues: 2371955.0
 Average wait time in queue:  27263.8
 Maximum wait time in queue:  39125.0      Minimum wait time in queue:    7.0
 Average turnaround time:       30596 (seconds/job)
 Maximum turnaround time:       44778      Minimum turnaround time:      3355
 Average hog factor of a job:    0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:    0.00      Minimum hog factor of a job:  0.00
 Total throughput:               4.36 (jobs/hour)  during   19.95 hours
 Beginning time:         Oct 11 20:50      Ending time:          Oct 12 16:47
Because the run times are not uniform, both service classes actually achieve higher throughput than configured.
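The reported throughput figures can be cross-checked against the job counts in the same bacct summaries. The sketch below assumes that total throughput is the number of finished jobs (done plus exited) divided by the elapsed hours. That reading is consistent with the numbers shown, but it is an inference from the output, not a documented formula, and the function name is hypothetical.

```python
def throughput_per_hour(done: int, exited: int, hours: float) -> float:
    """Finished jobs (done + exited) divided by the elapsed hours."""
    return (done + exited) / hours


print(round(throughput_per_hour(183, 1, 20.58), 2))  # Inuvik
print(round(throughput_per_hour(87, 0, 19.95), 2))   # Tuktoyaktuk
```

The two values round to 8.94 and 4.36 jobs per hour, matching the bacct output and exceeding the configured goals of 6 and 3 jobs per hour.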
Understanding Service Class Behavior
A simple deadline goal
The following service class configures an SLA with a simple deadline goal with a half hour time window.
Begin ServiceClass
NAME = Quadra
PRIORITY = 20
GOALS = [DEADLINE timeWindow (16:15-16:45)]
DESCRIPTION = short window
End ServiceClass
Six jobs submitted with a run time of 5 minutes each will use 1 slot for the half hour time window.
bsla shows that the deadline can be met:
bsla Quadra
SERVICE CLASS NAME: Quadra -- short window
PRIORITY: 20

GOAL: DEADLINE
ACTIVE WINDOW: (16:15-16:45)
STATUS: Active:On time
ESTIMATED FINISH TIME: (Wed Jul 2 16:38)
OPTIMUM NUMBER OF RUNNING JOBS: 1

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
    6     5    1      0      0       0
The following illustrates the progress of the SLA to the deadline. The optimum number of running jobs in the service class (nrun) is maintained at a steady rate of 1 job at a time until near the completion of the SLA.
When the finished job curve (nfinished) meets the total number of jobs curve (njobs) the deadline is met. All jobs are finished well ahead of the actual configured deadline, and the goal of the SLA was met.
An overnight run with two service classes
bsla shows the configuration and status of two service classes, Qualicum and Comox:
bsla Qualicum
SERVICE CLASS NAME: Qualicum
PRIORITY: 23

GOAL: VELOCITY 8
ACTIVE WINDOW: (8:00-18:00)
STATUS: Inactive
SLA THROUGHPUT: 0.00 JOBS/CLEAN_PERIOD

GOAL: DEADLINE
ACTIVE WINDOW: (18:00-8:00)
STATUS: Active:On time
ESTIMATED FINISH TIME: (Thu Jul 10 07:53)
OPTIMUM NUMBER OF RUNNING JOBS: 2

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
  280   278    2      0      0       0
The following illustrates the progress of the deadline SLA Qualicum running 280 jobs overnight with random runtimes until the morning deadline. As with the simple deadline goal example, when the finished job curve (nfinished) meets the total number of jobs curve (njobs) the deadline is met with all jobs completed ahead of the configured deadline.
Comox has a velocity goal of 2 concurrently running jobs that is always active:
bsla Comox
SERVICE CLASS NAME: Comox
PRIORITY: 20

GOAL: VELOCITY 2
ACTIVE WINDOW: Always Open
STATUS: Active:On time
SLA THROUGHPUT: 2.00 JOBS/CLEAN_PERIOD

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
  100    98    2      0      0       0
The following illustrates the progress of the velocity SLA Comox running 100 jobs with random runtimes over a 14 hour period.
When an SLA is missing its goal
- Use the CONTROL_ACTION parameter in your service class to configure an action to be run if the SLA goal is delayed for a specified number of minutes.
CONTROL_ACTION=VIOLATION_PERIOD[minutes] CMD [action]
If the SLA goal is delayed for longer than VIOLATION_PERIOD, the action specified by CMD is invoked. The violation period is reset and the action runs again if the SLA is still active when the violation period expires again. If the SLA has multiple active goals that are in violation, the action is run for each of them.
Example:
CONTROL_ACTION=VIOLATION_PERIOD[minutes] CMD [echo `date`: SLA is in violation >> ! /tmp/sla_violation.log]
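The repeat behavior of the control action (run once per elapsed violation period while the goal stays delayed) can be sketched as a simple timeline. This is an illustrative model, not LSF's implementation: the function name, the minute-based clock, and the 10-minute period in the usage line are all hypothetical.

```python
from typing import List


def action_times(delay_start: int, delay_end: int, violation_period: int) -> List[int]:
    """Minutes at which the action would fire while the goal stays delayed.

    delay_start and delay_end are minutes on a common clock; the action
    fires each time a full violation period elapses within the delay.
    """
    if violation_period <= 0:
        raise ValueError("violation_period must be positive")
    return list(range(delay_start + violation_period, delay_end + 1, violation_period))


# A goal delayed from minute 0 to minute 35, with an assumed 10-minute period.
print(action_times(0, 35, 10))
```

In this model the action fires three times, at minutes 10, 20, and 30. A delay shorter than one violation period triggers nothing.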
Preemption and SLA policies
SLA jobs cannot be preempted. You should avoid running jobs belonging to an SLA in low priority queues.
Chunk jobs and SLA policies
SLA jobs will not get chunked. You should avoid submitting SLA jobs to a chunk job queue.
SLA statistics files
Each active SLA goal generates a statistics file for monitoring and analyzing the system. When the goal becomes inactive the file is no longer updated. The files are created in the LSB_SHAREDIR/cluster_name/logdir/SLA directory. Each file name consists of the name of the service class and the goal type.
For example, the file named Quadra.deadline is created for the deadline goal of the service class named Quadra. The following file named Tofino.velocity refers to a velocity goal of the service class named Tofino:
cat Tofino.velocity
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:7:34 1063782454 2 0 0 0 0
17/9 15:8:34 1063782514 2 0 0 0 0
17/9 15:9:34 1063782574 2 0 0 0 0
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:10:10 1063782610 2 0 0 0 0
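Because the statistics files are plain text, they are easy to post-process. The sketch below reads the layout shown in the Tofino.velocity sample: comment lines start with #, and each data line holds a date, a time, an epoch timestamp, and a series of integer counters. The meaning of the individual counters is inferred from the header comment and is not confirmed here, so the parser returns them as an unlabeled list. The function name is hypothetical.

```python
from typing import List, Tuple

SAMPLE = """\
# service class Tofino velocity, NJOBS, NPEND (NRUN + NSSUSP + NUSUSP), (NDONE + NEXIT)
17/9 15:7:34 1063782454 2 0 0 0 0
17/9 15:8:34 1063782514 2 0 0 0 0
"""


def parse_sla_stats(text: str) -> List[Tuple[int, List[int]]]:
    """Return (epoch_seconds, counters) for every data line, skipping comments."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fields[0] and fields[1] are the human-readable date and time.
        epoch = int(fields[2])
        counters = [int(value) for value in fields[3:]]
        records.append((epoch, counters))
    return records


records = parse_sla_stats(SAMPLE)
# Seconds between the first two samples.
print(records[1][0] - records[0][0])
```

The epoch column makes it simple to compute sampling intervals: the first two samples are 60 seconds apart.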
Resizable jobs and SLA scheduling
For resizable job allocation requests, since the job itself has already started to run, LSF bypasses dispatch rate checking and continues scheduling the allocation request.
Job groups and SLA scheduling
Job groups provide a method for assigning arbitrary labels to groups of jobs. Typically, job groups represent a project hierarchy. You can use -sla at job submission to attach all jobs in a job group to a service class and have them scheduled as SLA jobs and subject to the scheduling policy of the SLA. Within the job group, resources are allocated to jobs on a fairshare basis.
All jobs submitted to a group under an SLA automatically belong to the SLA itself. You cannot modify a job group of a job that is attached to an SLA.
A job group hierarchy can belong to only one SLA.
A job group cannot contain a mix of jobs that belong to the service class and jobs that do not. Multiple job groups can be created under the same SLA. You can submit additional jobs to the job group without specifying the service class name again.
If the specified job group does not exist, it is created and attached to the SLA.
You can also use -sla to specify a service class when you create a job group.
View job groups attached to an SLA (bjgroup)
Run bjgroup to display job groups attached to a service class:
bjgroup
GROUP_NAME  NJOBS  PEND  RUN  SSUSP  USUSP  FINISH  SLA      JLIMIT  OWNER
/fund1_grp      5     4    0      1      0       0  Venezia  1/5     user1
/fund2_grp     11     2    5      0      0       4  Venezia  5/5     user1
/bond_grp       2     2    0      0      0       0  Venezia  0/-     user2
/risk_grp       2     1    1      0      0       0  ()       1/-     user2
/admi_grp       4     4    0      0      0       0  ()       0/-     user2
bjgroup displays the name of the service class that the job group is attached to. If the job group is not attached to any service class, empty parentheses () are displayed in the SLA name column.
Platform Computing Inc.