Achieving Performance and Scalability
- Optimizing Performance in Large Sites
- Tuning UNIX for Large Clusters
- Tuning LSF for Large Clusters
- Monitoring Performance Metrics in Real Time
Optimizing Performance in Large Sites
As your site grows, you must tune your LSF cluster to support a large number of hosts and an increased workload.
This chapter discusses how to efficiently tune querying, scheduling, and event logging in a large cluster that scales to 5000 hosts and 100,000 jobs at any one time.
To target performance optimization to a cluster with 5000 hosts and 100,000 jobs, you must:
- Configure your operating system. See Tuning UNIX for Large Clusters
- Fine-tune LSF. See Tuning LSF for Large Clusters
What's new in LSF performance?
LSF provides parameters for tuning your cluster, which you will learn about in this chapter. However, before you calculate the values to use for tuning your cluster, consider the following enhancements to the general performance of LSF daemons, job dispatching, and event replaying:
- Both scheduling and querying are much faster
- Switching and replaying the events log file,
lsb.events, is much faster. The length of the events file no longer impacts performance
- Restarting and reconfiguring your cluster is much faster
- Job submission time is constant, regardless of how many jobs are already in the system
- The scalability of load updates from the slaves to the master has increased
- Load update intervals are scaled automatically
LIM startup time also improves significantly after the LSF performance enhancements.
Tuning UNIX for Large Clusters
The following hardware and software specifications are requirements for a large cluster that supports 5,000 hosts and 100,000 jobs at any one time.
LSF master host
To meet the performance requirements of a large cluster, increase the file descriptor limit of the operating system on the master host.
The file descriptor limit of most operating systems used to be fixed, with a limit of 1024 open files. Some operating systems, such as Linux and AIX, have removed this limit, allowing you to increase the number of file descriptors.
Increase the file descriptor limit
- To achieve efficiency of performance in LSF, follow the instructions in your operating system documentation to increase the number of file descriptors on the LSF master host.
Tip: To optimize your configuration, set your file descriptor limit to a value at least as high as the number of hosts in your cluster.
The following is an example configuration. The instructions for different operating systems, kernels, and shells are varied. You may have already configured the host to use the maximum number of file descriptors that are allowed by the operating system. On some operating systems, the limit is configured dynamically.
Your cluster size is 5000 hosts, and your master host runs Linux, kernel version 2.4:
- Log in to the LSF master host as the root user.
- Add the following line to your system startup script so that the setting persists across reboots:
echo -n "5120" > /proc/sys/fs/file-max
- Restart the operating system to apply the changes.
- In the bash shell, instruct the operating system to use the new file limits:
# ulimit -n unlimited
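The tip above can be sketched as a small sizing calculation: choose a descriptor limit at least as large as the cluster, here rounded up to a power of two. The rounding rule and the helper itself are illustrative assumptions, not LSF requirements.

```shell
# Illustrative sizing helper (not part of LSF): choose a file descriptor
# limit at least as large as the number of hosts, rounded up to a power
# of two so it matches typical kernel tuning values.
cluster_hosts=5000
limit=1024
while [ "$limit" -lt "$cluster_hosts" ]; do
  limit=$((limit * 2))
done
echo "fs.file-max should be at least $limit"
```

For a 5000-host cluster this suggests 8192, comfortably above the 5120 used in the example configuration.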
Tuning LSF for Large Clusters
To enable and sustain large clusters, you need to tune LSF for efficient querying, dispatching, and event log management.
In this section
- Managing scheduling performance
- Limiting the number of batch queries
- Improving the speed of host status updates
- Managing your users' ability to move jobs in a queue
- Managing the number of pending reasons
- Achieving efficient event switching
- Automatic load updating
- Managing the I/O performance of the info directory
- Processor binding for LSF job processes
- Increasing the job ID limit
Managing scheduling performance
For fast job dispatching in a large cluster, configure the following parameters:
LSB_MAX_JOB_DISPATCH_PER_SESSION
The maximum number of jobs the scheduler can dispatch in one scheduling session. Some operating systems, such as Linux and AIX, let you increase the number of file descriptors that can be allocated on the master host. You do not need to limit the number of file descriptors to 1024 if you want fast job dispatching. To take advantage of the greater number of file descriptors, set LSB_MAX_JOB_DISPATCH_PER_SESSION to a value greater than 300 and at most one-half the value of MAX_SBD_CONNS. This setting configures mbatchd to dispatch jobs at a high rate while maintaining the processing speed of its other tasks.

MAX_SBD_CONNS
The maximum number of open file connections between mbatchd and sbatchd. Specify a value equal to the number of hosts in your cluster plus a buffer. For example, if your cluster includes 4000 hosts, set MAX_SBD_CONNS=4100.

LSF_SERVER_HOSTS
Highly recommended for large clusters to decrease the load on the master LIM. Forces the client sbatchd to contact the local LIM for host status and load information. The client sbatchd contacts the master LIM or a LIM on one of the LSF_SERVER_HOSTS only if it cannot find the information locally.
Enable fast job dispatch
- Log in to the LSF master host as the root user.
- Increase the system-wide file descriptor limit of your operating system if you have not already done so.
- In lsb.params, set MAX_SBD_CONNS equal to the number of hosts in the cluster plus a buffer.
- In lsf.conf, set the parameter LSB_MAX_JOB_DISPATCH_PER_SESSION to a value greater than 300 and less than or equal to one-half the value of MAX_SBD_CONNS. For example, for a cluster with 4000 hosts:
LSB_MAX_JOB_DISPATCH_PER_SESSION=2050
MAX_SBD_CONNS=4100
- In lsf.conf, define the parameter LSF_SERVER_HOSTS to decrease the load on the master LIM.
- In the shell you used to increase the file descriptor limit, shut down the LSF batch daemons on the master host, then run badmin mbdrestart to restart them.
- Run badmin hrestart all to restart every sbatchd in the cluster.
Note: When you shut down the batch daemons on the master host, all LSF services are temporarily unavailable, but existing jobs are not affected. When mbatchd is later started by sbatchd, its previous status is restored and job scheduling continues.
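The sizing rules above (MAX_SBD_CONNS = hosts plus a buffer, LSB_MAX_JOB_DISPATCH_PER_SESSION at most half of MAX_SBD_CONNS) can be sketched as follows; the buffer of 100 is an illustrative assumption that matches the 4000-host example.

```shell
# Illustrative sizing sketch for a 4000-host cluster (buffer value assumed).
hosts=4000
buffer=100
max_sbd_conns=$((hosts + buffer))              # MAX_SBD_CONNS (lsb.params)
dispatch_per_session=$((max_sbd_conns / 2))    # LSB_MAX_JOB_DISPATCH_PER_SESSION (lsf.conf)
echo "MAX_SBD_CONNS=$max_sbd_conns"
echo "LSB_MAX_JOB_DISPATCH_PER_SESSION=$dispatch_per_session"
```

This reproduces the example values MAX_SBD_CONNS=4100 and LSB_MAX_JOB_DISPATCH_PER_SESSION=2050.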
Enable continuous scheduling
- To enable the scheduler to run continuously, define the parameter JOB_SCHEDULING_INTERVAL=0 in lsb.params.
Limiting the number of batch queries
In large clusters, the job query load can grow very quickly. If your site sees a lot of high-traffic job querying, you can tune LSF to limit the number of job queries that mbatchd can handle. This helps decrease the load on the master host.

If a job information query is sent after the limit has been reached, an error message is displayed and mbatchd keeps retrying, at one-second intervals. If the number of job queries later drops below the limit, mbatchd handles the query.

You define the maximum number of concurrent job queries to be handled by mbatchd in the parameter MAX_CONCURRENT_JOB_QUERY in lsb.params. If mbatchd is using multithreading (a dedicated query port is defined by the parameter LSB_QUERY_PORT in lsf.conf), the value of MAX_CONCURRENT_JOB_QUERY sets the maximum number of queries that can be handled by each child mbatchd that is forked by mbatchd. This means that the total number of job queries handled can be more than the number specified by MAX_CONCURRENT_JOB_QUERY (MAX_CONCURRENT_JOB_QUERY multiplied by the number of child daemons forked by mbatchd). If mbatchd is not using multithreading, the value of MAX_CONCURRENT_JOB_QUERY sets the maximum total number of job queries that can be handled by mbatchd.

MAX_CONCURRENT_JOB_QUERY specifies the maximum number of job queries that can be handled by mbatchd. Valid values are positive integers between 1 and 100; the default value is unlimited. For example, MAX_CONCURRENT_JOB_QUERY=20 specifies that no more than 20 queries can be handled at one time. Any other value is incorrect; the default value is used, and an unlimited number of job queries is handled.
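For example, a minimal lsb.params fragment capping concurrent job queries (the value 20 is illustrative):

```
# lsb.params
MAX_CONCURRENT_JOB_QUERY=20
```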
Improving the speed of host status updates
To improve the speed with which mbatchd obtains and reports host status, configure the parameter LSB_SYNC_HOST_STAT_LIM in the file lsb.params. This also improves the speed with which LSF reschedules jobs: the sooner LSF knows that a host has become unavailable, the sooner LSF reschedules any rerunnable jobs executing on that host.

For example, during maintenance operations, the cluster administrator might need to shut down half of the hosts at once. LSF can quickly update the host status and reschedule any rerunnable jobs that were running on the unavailable hosts.

When you define this parameter, mbatchd periodically obtains the host status from the master LIM, and then verifies the status by polling each sbatchd at an interval defined by the parameters MBD_SLEEP_TIME and LSB_MAX_PROBE_SBD.
Managing your users' ability to move jobs in a queue
JOB_POSITION_CONTROL_BY_ADMIN=Y allows an LSF administrator to control whether users can use btop and bbot to move jobs to the top and bottom of queues. When set, only the LSF administrator (including any queue administrators) can use btop and bbot to move jobs within a queue. A user attempting to use btop or bbot receives the error "User permission denied."
Remember: You must be an LSF administrator to set this parameter.
Managing the number of pending reasons
For efficient, scalable management of pending reasons, use CONDENSE_PENDING_REASONS=Y in lsb.params to condense all the host-based pending reasons into one generic pending reason.
If a job has no other main pending reason, bjobs -l displays the following: Individual host based reasons
If you condense host-based pending reasons but require a full pending reason list, you can run badmin diagnose job_ID.
Remember: You must be an LSF administrator or a queue administrator to run this command.
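A minimal lsb.params fragment for condensed pending reasons (parameter name as commonly documented for LSF; verify against your version):

```
# lsb.params
CONDENSE_PENDING_REASONS=Y
```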
Achieving efficient event switching
Periodic switching of the event file can weaken the performance of mbatchd, which automatically backs up and rewrites the events file after every 1000 batch job completions. The old lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1.

Change the frequency of event switching with the following two parameters in lsb.params:
- MAX_JOB_NUM specifies the number of batch jobs to complete before lsb.events is backed up and moved to lsb.events.1. The default value is 1000.
- MIN_SWITCH_PERIOD controls how frequently mbatchd checks the number of completed batch jobs.

The two parameters work together. Specify the MIN_SWITCH_PERIOD value in seconds.

For example:
MAX_JOB_NUM=1000
MIN_SWITCH_PERIOD=7200
This example instructs mbatchd to check every two hours whether the events file has logged 1000 batch job completions. The two parameters control the frequency of events file switching as follows:
- After two hours, mbatchd checks the number of completed batch jobs. If 1000 completed jobs have been logged, it switches the events file.
- If 1000 jobs complete after five minutes, mbatchd does not switch the events file until the end of the two-hour period.

Tip: For large clusters, set MIN_SWITCH_PERIOD to a value equal to or greater than 600. This causes mbatchd to fork a child process that handles event switching, thereby reducing the load on mbatchd. mbatchd terminates the child process and appends delta events to the new events file after the MIN_SWITCH_PERIOD has elapsed. If you define a value less than 600 seconds, mbatchd does not fork a child process for event switching.
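A sketch of the switching rule described above, under the simplifying assumption that mbatchd evaluates both thresholds at each MIN_SWITCH_PERIOD check:

```shell
# Simplified model of the event-switch decision (illustrative, not LSF code).
MAX_JOB_NUM=1000
MIN_SWITCH_PERIOD=7200
completed_jobs=1000      # jobs logged since the last switch
elapsed_seconds=300      # only five minutes into the period
if [ "$completed_jobs" -ge "$MAX_JOB_NUM" ] && [ "$elapsed_seconds" -ge "$MIN_SWITCH_PERIOD" ]; then
  decision="switch lsb.events"
else
  decision="wait until the period ends"
fi
echo "$decision"
```

With 1000 jobs logged but only five minutes elapsed, the model waits, matching the second bullet above.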
Automatic load updating
Periodically, the LIM daemons exchange load information. In large clusters, let LSF adjust the load update period automatically and dynamically, based on the load.
Important: For automatic tuning of the load update interval, make sure the parameter EXINTERVAL in lsf.cluster.cluster_name is not defined. Do not configure your cluster to exchange load information at specific intervals.
Managing the I/O performance of the info directory
In large clusters, users submit large numbers of jobs. Since each job generally has a job file, this results in a large number of job files stored in the LSF_SHAREDIR/cluster_name/logdir/info directory at any time. When the total size of the job files reaches a certain point, you will notice a significant delay when performing I/O operations in the info directory.

This delay is caused by a limit on the total size of files that can reside in a file server directory. The limit depends on the file system implementation. A high load on the file server delays master batch daemon operations, and therefore slows down overall cluster throughput.

You can prevent this delay by creating and using subdirectories under the parent directory. Each new subdirectory is subject to the file size limit, but the parent directory is not subject to the total file size of its subdirectories. Since the total file size of the info directory is divided among its subdirectories, your cluster can process more job operations before reaching the total size limit of the job files.
If your cluster has a lot of jobs resulting in a large info directory, you can tune your cluster by enabling LSF to create subdirectories in the info directory. Set MAX_INFO_DIRS=num_subdirs in lsb.params to create the subdirectories and enable mbatchd to distribute the job files evenly throughout the subdirectories. num_subdirs specifies the number of subdirectories that you want to create under the LSF_SHAREDIR/cluster_name/logdir/info directory. Valid values are positive integers between 1 and 1024. By default, MAX_INFO_DIRS is not defined.
Run badmin reconfig to create and use the subdirectories.
Duplicate event logging
Note: If you enabled duplicate event logging, you must run badmin mbdrestart instead of badmin reconfig to restart mbatchd.
Run bparams -l to display the value of the MAX_INFO_DIRS parameter. For example, if MAX_INFO_DIRS=10, mbatchd creates ten subdirectories, from LSF_SHAREDIR/cluster_name/logdir/info/0 through LSF_SHAREDIR/cluster_name/logdir/info/9, and distributes job files among them.
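One way to picture the even distribution of job files across subdirectories is a simple modulo scheme. The scheme below is an assumption for illustration only; the actual mbatchd placement policy is internal to LSF.

```shell
# Hypothetical even-distribution sketch across MAX_INFO_DIRS subdirectories.
MAX_INFO_DIRS=10
placements=""
for job_id in 1001 1002 1003; do
  placements="$placements job:$job_id->info/$((job_id % MAX_INFO_DIRS))"
done
echo "$placements"
```

Each consecutive job ID lands in a different subdirectory, so no single directory accumulates all of the job files.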
Processor binding for LSF job processes
See also Processor Binding for Parallel Jobs.
Rapid progress of modern processor manufacture technologies has enabled the low cost deployment of LSF on hosts with multicore and multithread processors. The default soft affinity policy enforced by the operating system scheduler may not give optimal job performance. For example, the operating system scheduler may place all job processes on the same processor or core leading to poor performance. Frequently switching processes as the operating system schedules and reschedules work between cores can cause cache invalidations and cache miss rates to grow large.
Processor binding for LSF job processes takes advantage of the power of multiple processors and multiple cores to provide hard processor binding functionality for sequential LSF jobs and parallel jobs that run on a single host.
restriction:Processor binding is supported on hosts running Linux with kernel version 2.6 or higher.
For multi-host parallel jobs, LSF sets two environment variables ($LSB_BIND_JOB and $LSB_BIND_CPU_LIST) but does not attempt to bind the job to any host.
When processor binding for LSF job processes is enabled on supported hosts, job processes of an LSF job are bound to a processor according to the binding policy of the host. When an LSF job is completed (exited or done successfully) or suspended, the corresponding processes are unbound from the processor.
When a suspended LSF job is resumed, the corresponding processes are bound again to a processor. The process is not guaranteed to be bound to the same processor it was bound to before the job was suspended.
The processor binding affects the whole job process group. All job processes forked from the root job process (the job RES) are bound to the same processor.
Processor binding for LSF job processes does not bind daemon processes.
If processor binding is enabled, but the execution hosts do not support processor affinity, the configuration has no effect on the running processes. Processor binding has no effect on a single-processor host.
Processor, core, and thread-based CPU binding
By default, the number of CPUs on a host represents the number of physical processors the machine has. For LSF hosts with multiple cores, threads, and processors, the cluster administrator can define ncpus to mean processors, cores, or threads.

Globally, this definition is controlled by the parameter EGO_DEFINE_NCPUS in ego.conf. The default behavior for ncpus is to consider only the number of physical processors (EGO_DEFINE_NCPUS=procs).

When PARALLEL_SCHED_BY_SLOT=Y is set in lsb.params, the resource requirement string keyword ncpus refers to the number of slots instead of the number of processors; however, lshosts output continues to show ncpus as defined by EGO_DEFINE_NCPUS.
Binding job processes randomly to multiple processors, cores, or threads may affect job performance. Processor binding, configured with LSF_BIND_JOB in lsf.conf or BIND_JOB in lsb.applications, detects the EGO_DEFINE_NCPUS policy to bind the job processes by processor, core, or thread (PCT).
For example, if a host's PCT policy is set to processor (EGO_DEFINE_NCPUS=procs) and the binding option is set to BALANCE, the first job process is bound to the first physical processor, the second job process is bound to the second physical processor and so on.
If the host's PCT policy is set to core level (EGO_DEFINE_NCPUS=cores) and the binding option is set to BALANCE, the first job process is bound to the first core on the first physical processor, the second job process is bound to the first core on the second physical processor, the third job process is bound to the second core on the first physical processor, and so on.
If the host's PCT policy is set to thread level (EGO_DEFINE_NCPUS=threads) and the binding option is set to BALANCE, the first job process is bound to the first thread on the first physical processor, the second job process is bound to the first thread on the second physical processor, the third job process is bound to the second thread on the first physical processor, and so on.
The BIND_JOB=BALANCE option instructs LSF to bind the job based on the load of the available processors/cores/threads. For each slot:
- If the PCT level is set to processor, the lowest loaded physical processor runs the job.
- If the PCT level is set to core, the lowest loaded core on the lowest loaded processor runs the job.
- If the PCT level is set to thread, the lowest loaded thread on the lowest loaded core on the lowest loaded processor runs the job.
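The core-level BALANCE order described above can be sketched as a round-robin across physical processors, then across cores. This is a simplified model that ignores load, which the real BALANCE policy takes into account.

```shell
# Simplified BALANCE placement at core-level PCT: 2 processors x 4 cores.
# Process i goes to processor (i mod nprocs), core (i div nprocs).
nprocs=2
order=""
for i in 0 1 2 3; do
  order="$order p$((i % nprocs))c$((i / nprocs))"
done
echo "binding order:$order"
```

The resulting order (p0c0, p1c0, p0c1, p1c1) matches the core-level description: first cores of each processor first, then second cores.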
For example, if there is a single 2-processor quad-core host and you submit a parallel job with -n 2 -R "span[hosts=1]" when the PCT level is core, the job is bound to the first core on the first processor and the first core on the second processor. The same pattern continues as you submit another three jobs with -n 2 -R "span[hosts=1]".

When PARALLEL_SCHED_BY_SLOT=Y is set in lsb.params, the job specifies a maximum and minimum number of job slots instead of processors. If the MXJ value is set to 16 for this host (there are 16 job slots on this host), LSF can dispatch more jobs to it. Another job submitted to this host is bound to the first core on the first processor and the first core on the second processor.
The BIND_JOB=PACK option instructs LSF to try to pack all the processes onto a single processor. If this cannot be done, LSF tries to use as few processors as possible. Email is sent to you after job dispatch and when the job finishes. If no processors, cores, or threads are free (when the PCT level is processor, core, or thread level), LSF tries to use the BALANCE policy for the new job.
LSF depends on the order of processor IDs to pack jobs to a single processor.
If the PCT level is processor (the default value after installation), there is no difference between BALANCE and PACK.
The PACK option binds jobs to a single processor where it makes sense, but does not oversubscribe the processors, cores, or threads. The other processors are used when they are needed. For instance, when the PCT level is core level, if there is a single 4-processor quad-core host and four sequential jobs are already bound to the first processor, the fifth through eighth sequential jobs are bound to the second processor.
If you submit three single-host parallel jobs with -n 2 -R "span[hosts=1]" when the PCT level is core level, the first job is bound to the first and second cores of the first processor, and the second job is bound to the third and fourth cores of the first processor. Binding the third job to the first processor would oversubscribe its cores, so the third job is bound to the first and second cores of the second processor.

After JOB1 and JOB2 have finished, if you submit one single-host parallel job with -n 2 -R "span[hosts=1]", the job is bound to the third and fourth cores of the second processor.
BIND_JOB=ANY binds the job to the first N available processors, cores, or threads with no regard for locality. If the PCT level is core, LSF binds the first N available cores regardless of whether they are on the same processor or not. LSF arranges the order based on APIC ID.

If the PCT level is processor (the default value after installation), there is no difference between ANY and BALANCE.

For example, consider a single 2-processor quad-core host whose APIC IDs map in order to the logical processor and core IDs. If the PCT level is core level and you submit two jobs to this host with -n 3 -R "span[hosts=1]", the first job is bound to the first, second, and third cores of the first physical processor, and the second job is bound to the fourth core of the first physical processor and the first and second cores of the second physical processor.
BIND_JOB=USER binds the job to the value of $LSB_USER_BIND_JOB as specified in the user's submission environment. This allows the administrator to delegate binding decisions to the actual user. The value must be one of Y, N, NONE, BALANCE, PACK, or ANY. Any other value is treated as ANY.
BIND_JOB=USER_CPU_LIST binds the job to the explicit logical CPUs specified in the environment variable $LSB_USER_BIND_CPU_LIST. LSF does not check that the value is valid for the execution hosts; it is the user's responsibility to correctly specify the CPU list for the hosts they select.

The correct format of $LSB_USER_BIND_CPU_LIST is a comma-separated list of CPU IDs and ranges, for example 0,5,7,9-11. If the value's format is not correct, or the environment variable is not set, jobs are not bound to any processor.
If the format is correct but cannot be mapped to any logical CPU, the binding fails. If it can be mapped to some CPUs, the job is bound to the mapped CPUs. For example, with a two-processor quad-core host whose logical CPU IDs are 0-7:
- If user1 specifies 9,10 in $LSB_USER_BIND_CPU_LIST, the job is not bound to any CPUs.
- If user2 specifies 1,2,9 in $LSB_USER_BIND_CPU_LIST, the job is bound to CPUs 1 and 2.
If the value's format is not correct, or it does not apply to the execution host, the related information is added to the email sent to users after job dispatch and job finish.
If the user specifies a minimum and a maximum number of processors for a single-host parallel job, LSF may allocate any number of processors between these two numbers for the job. In this case, LSF binds the job according to the CPU list specified by the user.
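A sketch of how a $LSB_USER_BIND_CPU_LIST value expands into individual logical CPU IDs. The helper function is illustrative, not part of LSF.

```shell
# Expand a CPU list such as "0,5,7,9-11" into individual logical CPU IDs.
expand_cpu_list() {
  # Split on commas, then treat "lo-hi" ranges; a bare ID is a range of one.
  echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
    seq "$lo" "${hi:-$lo}"
  done
}
cpus=$(expand_cpu_list "0,5,7,9-11" | xargs)
echo "$cpus"
```

For the example value above, this yields CPUs 0, 5, 7, 9, 10, and 11, which on the 0-7 host in the example would bind only CPUs 0, 5, and 7.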
BIND_JOB=NONE is functionally equivalent to the former BIND_JOB=N, where processor binding is disabled.
- Existing CPU affinity features
- Processor binding of LSF job processes will not take effect on a master host with the following parameters configured.
- IRIX cpusets
- Processor binding cannot be used with IRIX cpusets. If an execution host is configured as part of a cpuset, processor binding is disabled on that host.
- Job requeue, rerun, and migration
- When a job is requeued, rerun or migrated, a new job process is created. If processor binding is enabled when the job runs, the job processes will be bound to a processor.
badmin hrestart starts a new sbatchd. If a job process has already been bound to a processor, processor binding for the job processes is restored after sbatchd is restarted.
- If the BIND_JOB parameter is modified in an application profile,
badmin reconfigonly affects pending jobs. The change does not affect running jobs.
- MultiCluster job forwarding model
- In a MultiCluster environment, the behavior is similar to the current application profile behavior. If the application profile name specified in the submission cluster is not defined in the execution cluster, the job is rejected. If the execution cluster has the same application profile name, but does not enable processor binding, the job processes are not bound at the execution cluster.
Enable processor binding for LSF job processes
LSF supports the following binding options for sequential jobs and parallel jobs that run on a single host:
- Enable processor binding cluster-wide or in an application profile.
Cluster-wide configuration (LSF_BIND_JOB)
Define LSF_BIND_JOB in lsf.conf to enable processor binding for all execution hosts in the cluster. On the execution hosts that support this feature, job processes are hard bound to selected processors.
Application profile configuration (BIND_JOB)
Define BIND_JOB in an application profile configuration in lsb.applications to enable processor binding for all jobs submitted to the application profile. On the execution hosts that support this feature, job processes are hard bound to selected processors.
If BIND_JOB is not set in an application profile in lsb.applications, the value of LSF_BIND_JOB in lsf.conf takes effect. The BIND_JOB parameter configured in an application profile overrides the cluster-wide LSF_BIND_JOB setting in lsf.conf.
Increasing the job ID limit
By default, LSF assigns job IDs up to 6 digits. This means that no more than 999999 jobs can be in the system at once. The job ID limit is the highest job ID that LSF will ever assign, and also the maximum number of jobs in the system.
LSF assigns job IDs in sequence. When the job ID limit is reached, the count rolls over, so the next job submitted gets job ID "1". If the original job 1 remains in the system, LSF skips that number and assigns job ID "2", or the next available job ID. If you have so many jobs in the system that the low job IDs are still in use when the maximum job ID is assigned, jobs with sequential numbers could have different submission times.
Increase the maximum job ID
You cannot lower the job ID limit, but you can raise it to 10 digits. This allows longer-term job accounting and analysis, lets you have more jobs in the system, and makes job ID numbers roll over less often.
Use MAX_JOBID in
lsb.paramsto specify any integer from 999999 to 2147483646 (for practical purposes, you can use any 10-digit integer less than this value).
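For example, a lsb.params fragment raising the limit to a 10-digit value (the specific number is illustrative):

```
# lsb.params
MAX_JOBID=2000000000
```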
Increase the job ID display length
By default, bjobs and bhist display job IDs with a maximum length of 7 characters. Job IDs greater than 9999999 are truncated on the left.
Use LSB_JOBID_DISP_LENGTH in lsf.conf to increase the width of the JOBID column in the bjobs and bhist display. When LSB_JOBID_DISP_LENGTH=10, the width of the JOBID column in bjobs and bhist increases to 10 characters.
Monitoring Performance Metrics in Real Time
Enable metric collection
Set SCHED_METRIC_ENABLE=Y in lsb.params to enable performance metric collection.
Start performance metric collection dynamically:
badmin perfmon start sample_period
Optionally, you can set a sampling period, in seconds. If no sample period is specified, the default sample period defined in SCHED_METRIC_SAMPLE_PERIOD in lsb.params is used.
Stop performance metric collection dynamically:
badmin perfmon stop
SCHED_METRIC_SAMPLE_PERIOD can be specified independently of SCHED_METRIC_ENABLE. That is, you can specify SCHED_METRIC_SAMPLE_PERIOD and not specify SCHED_METRIC_ENABLE. In this case, when you turn on the feature dynamically (using badmin perfmon start), the sampling period defined in SCHED_METRIC_SAMPLE_PERIOD is used.
badmin perfmon start and badmin perfmon stop override the configuration setting in lsb.params. Even if SCHED_METRIC_ENABLE is set, if you run badmin perfmon start, performance metric collection is started, and if you run badmin perfmon stop, performance metric collection is stopped.
Tune the metric sampling period
Set SCHED_METRIC_SAMPLE_PERIOD in lsb.params to specify an initial cluster-wide performance metric sampling period.
Set a new sampling period in seconds:
badmin perfmon setperiod seconds
Collecting and recording performance metric data may affect the performance of LSF. Smaller sampling periods result in the lsb.streams file growing faster.
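Putting the commands in this section together (the sample period values are illustrative):

```
badmin perfmon start 60      # start collection with a 60-second sample period
badmin perfmon setperiod 120 # change the sampling period to 120 seconds
badmin perfmon view          # display the metrics collected so far
badmin perfmon stop          # stop collection
```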
Display current performance
Run badmin perfmon view to view real-time performance metric information. The following metrics are collected and recorded in each sample period:
- The number of queries handled by mbatchd
- The number of queries for each of jobs, queues, and hosts (bjobs, bqueues, and bhosts, as well as other daemon requests)
- The number of jobs submitted (divided into job submission requests and jobs actually submitted)
- The number of jobs dispatched
- The number of jobs completed
- The number of jobs sent to remote clusters
- The number of jobs accepted from remote clusters
badmin perfmon view
Performance monitor start time: Fri Jan 19 15:07:54
End time of last sample period: Fri Jan 19 15:25:55
Sample period : 60 Seconds
------------------------------------------------------------------
Metrics                            Last    Max    Min    Avg  Total
------------------------------------------------------------------
Total queries                         0     25      0      8    159
Jobs information queries              0     13      0      2     46
Hosts information queries             0      0      0      0      0
Queue information queries             0      0      0      0      0
Job submission requests               0     10      0      0     10
Jobs submitted                        0    100      0      5    100
Jobs dispatched                       0      0      0      0      0
Jobs completed                        0     13      0      5    100
Jobs sent to remote cluster           0     12      0      5    100
Jobs accepted from remote cluster     0      0      0      0      0
------------------------------------------------------------------
File Descriptor Metrics            Free   Used  Total
------------------------------------------------------------------
MBD file descriptor usage           800    424   1024
Performance metric information is calculated at the end of each sampling period. Running badmin perfmon view before the end of the sampling period displays metric data collected from the sampling start time to the end of the last sample period.
If no metrics have been collected because the first sampling period has not yet ended, badmin perfmon view displays:
badmin perfmon view
Performance monitor start time: Thu Jan 25 22:11:12
End time of last sample period: Thu Jan 25 22:11:12
Sample period : 120 Seconds
------------------------------------------------------------------
No performance metric data available. Please wait until first sample period ends.
badmin perfmon output
Sample period — the current sample period.
Performance monitor start time — the start time of sampling.
End time of last sample period — the end time of the last sampling period.
Metrics — the name of each metric.
Total — the accumulated counter value for each metric, counted from the performance monitor start time to the end time of the last sample period.
Last — the last sampling value of the metric, calculated per sampling period and normalized as the metric value per period.
Max — the maximum sampling value of the metric, re-evaluated in each sampling period by comparing Max and Last. It is represented as the metric value per period.
Min — the minimum sampling value of the metric, re-evaluated in each sampling period by comparing Min and Last. It is represented as the metric value per period.
Avg — the average sampling value of the metric, recalculated in each sampling period and normalized as the metric value per period.
Reconfiguring your cluster with performance metric sampling enabled
If performance metric sampling is enabled dynamically with badmin perfmon start, you must enable it again after running badmin mbdrestart. If performance metric sampling is enabled by default, the performance monitor start time is reset to the point at which mbatchd restarts. If the SCHED_METRIC_ENABLE and SCHED_METRIC_SAMPLE_PERIOD parameters are changed, badmin reconfig has the same effect as badmin mbdrestart.
Performance metric logging in lsb.streams
By default, collected metrics are written to lsb.streams. However, performance metric collection can still be turned on even if ENABLE_EVENT_STREAM=N is defined; in this case, no metric data is logged. If EVENT_STREAM_FILE is defined and valid, collected metrics are written to the defined file instead. If ENABLE_EVENT_STREAM=N is defined, metric data is not logged.
For job arrays, only one submission request is counted; element jobs are counted for jobs submitted, jobs dispatched, and jobs completed.

Job rerun occurs when an execution host becomes unavailable while a job is running. The job is first put back into its original queue and later dispatched when a suitable host is available. In this case, only one submission request, one job submitted, and n jobs completed are counted (n represents the number of times the job reruns before it finishes successfully).

Requeued jobs may be dispatched, run, and exit due to some special errors again and again. The job data always exists in memory, so LSF counts only one job submission request and one job submitted, but counts more than one job dispatched.

For jobs completed, if a job is requeued with brequeue, LSF counts two jobs completed, since requeuing a job first kills the job and later puts it into the pending list. If the job is automatically requeued, LSF counts one job completed when the job finishes successfully.

When job replay is finished, replayed jobs are not counted in job submission requests or jobs submitted, but are counted in jobs dispatched and jobs finished.
Platform Computing Inc.