Working with Your Cluster
Contents
- Viewing cluster information
- Example directory structures
- Cluster administrators
- Controlling daemons
- Controlling mbatchd
- Reconfiguring your cluster
Viewing cluster information
LSF provides commands for users to access information about the cluster. Cluster information includes the cluster master host, cluster name, cluster resource definitions, cluster administrator, and so on.
| To view the ... | Run ... |
| --- | --- |
| Version of LSF | lsid |
| Cluster name | lsid |
| Current master host | lsid |
| Cluster administrators | lsclusters |
| Configuration parameters | bparams |
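As a rough illustration of how a script might consume this information, the sketch below pulls the cluster name and master host out of lsid-style text. The sample text is the lsid output shown in this document; `parse_lsid` is a hypothetical helper, and the one-item-per-line layout of the output is an assumption of the sketch.

```python
import re

# Sample text taken from the lsid example in this document.
# The line breaks are assumed; real output may be laid out differently.
SAMPLE_LSID_OUTPUT = """\
Platform LSF 7 Update 6 May 6 2009
Copyright 1992-2009 Platform Computing Corporation
My cluster name is cluster1
My master name is hostA
"""


def parse_lsid(text):
    """Return the cluster and master names found in lsid-style text.

    Hypothetical helper: it only recognizes the two "My ... name is ..."
    lines and ignores everything else.
    """
    info = {}
    for line in text.splitlines():
        match = re.match(r"My (cluster|master) name is (\S+)", line.strip())
        if match:
            info[match.group(1)] = match.group(2)
    return info


if __name__ == "__main__":
    print(parse_lsid(SAMPLE_LSID_OUTPUT))
```

In practice the text would come from running lsid on an LSF host rather than from a constant.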
View LSF version, cluster name, and current master host
- Run lsid to display the version of LSF, the name of your cluster, and the current master host:
    lsid
    Platform LSF 7 Update 6 May 6 2009
    Copyright 1992-2009 Platform Computing Corporation
    My cluster name is cluster1
    My master name is hostA
View cluster administrators
- Run lsclusters to find out who your cluster administrator is and see a summary of your cluster:
    lsclusters
    CLUSTER_NAME   STATUS   MASTER_HOST   ADMIN      HOSTS   SERVERS
    cluster1       ok       hostA         lsfadmin   6       6
  If you are using the LSF MultiCluster product, the output of lsclusters shows one line for each of the clusters that your local cluster is connected to.
View configuration parameters
- Run bparams to display the generic configuration parameters of LSF. These include default queues, job dispatch interval, job checking interval, and job accepting interval.
    bparams
    Default Queues: normal idle
    Job Dispatch Interval: 20 seconds
    Job Checking Interval: 15 seconds
    Job Accepting Interval: 20 seconds
- Run bparams -l to display the information in long format, which gives a brief description of each parameter and the name of the parameter as it appears in lsb.params.
    bparams -l
    System default queues for automatic queue selection:
        DEFAULT_QUEUE = normal idle
    The interval for dispatching jobs by master batch daemon:
        MBD_SLEEP_TIME = 20 (seconds)
    The interval for checking jobs by slave batch daemon:
        SBD_SLEEP_TIME = 15 (seconds)
    The interval for a host to accept two batch jobs subsequently:
        JOB_ACCEPT_INTERVAL = 1 (* MBD_SLEEP_TIME)
    The idle time of a host for resuming pg suspended jobs:
        PG_SUSP_IT = 180 (seconds)
    The amount of time during which finished jobs are kept in core:
        CLEAN_PERIOD = 3600 (seconds)
    The maximum number of finished jobs that are logged in current event file:
        MAX_JOB_NUM = 2000
    The maximum number of retries for reaching a slave batch daemon:
        MAX_SBD_FAIL = 3
    The number of hours of resource consumption history:
        HIST_HOURS = 5
    The default project assigned to jobs.
        DEFAULT_PROJECT = default
    Sync up host status with master LIM is enabled:
        LSB_SYNC_HOST_STAT_LIM = Y
    MBD child query processes will only run on the following CPUs:
        MBD_QUERY_CPUS=1 2 3
- Run bparams -a to display all configuration parameters and their values in lsb.params.
  For example:
    bparams -a
    lsb.params configuration at Fri Jun 8 10:27:52 CST 2007
    MBD_SLEEP_TIME = 20
    SBD_SLEEP_TIME = 15
    JOB_ACCEPT_INTERVAL = 1
    SUB_TRY_INTERVAL = 60
    LSB_SYNC_HOST_STAT_LIM = N
    MAX_JOBINFO_QUERY_PERIOD = 2147483647
    PEND_REASON_UPDATE_INTERVAL = 30
Viewing daemon parameter configuration
- Display all configuration settings for running LSF daemons.
  - Use lsadmin showconf to display all configured parameters and their values in lsf.conf or ego.conf for LIM.
  - Use badmin showconf to display all configured parameters and their values in lsf.conf or ego.conf for mbatchd and sbatchd.
  In a MultiCluster environment, lsadmin showconf and badmin showconf only display the parameters of daemons on the local cluster.
  Running lsadmin showconf and badmin showconf from a master candidate host will reach all server hosts in the cluster. Running lsadmin showconf and badmin showconf from a slave-only host may not be able to reach other slave-only hosts.
  You cannot run lsadmin showconf and badmin showconf from client hosts. lsadmin shows only server host configuration, not client host configuration.
  lsadmin showconf and badmin showconf display only the values used by LSF.
  lsadmin showconf and badmin showconf display EGO_MASTER_LIST from wherever it is defined. You can define either LSF_MASTER_LIST in lsf.conf or EGO_MASTER_LIST in ego.conf. LIM reads lsf.conf first, and ego.conf if EGO is enabled in the LSF cluster. LIM only takes the value of LSF_MASTER_LIST if EGO_MASTER_LIST is not defined at all in ego.conf.
  For example, if EGO is enabled in the LSF cluster, and you define LSF_MASTER_LIST in lsf.conf, and EGO_MASTER_LIST in ego.conf, lsadmin showconf and badmin showconf display the value of EGO_MASTER_LIST in ego.conf.
  If EGO is disabled, ego.conf is not loaded, so whatever is defined in lsf.conf is displayed.
- Display mbatchd and root sbatchd configuration.
  - Use badmin showconf mbd to display the parameters configured in lsf.conf or ego.conf that apply to mbatchd.
  - Use badmin showconf sbd to display the parameters configured in lsf.conf or ego.conf that apply to root sbatchd.
- Display LIM configuration.
  Use lsadmin showconf lim to display the parameters configured in lsf.conf or ego.conf that apply to root LIM.
  By default, lsadmin displays the local LIM parameters. You can specify the host to display the LIM parameters.
Examples
- Show mbatchd configuration:
    badmin showconf mbd
    MBD configuration at Fri Jun 8 10:27:52 CST 2007
    LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
    LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_LOG_MASK=LOG_WARNING
    LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_EGO_DAEMON_CONTROL=N
    ...
- Show sbatchd configuration on a specific host:
    badmin showconf sbd hosta
    SBD configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
    LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
    LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_LOG_MASK=LOG_WARNING
    LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_EGO_DAEMON_CONTROL=N
    ...
- Show sbatchd configuration for all hosts:
    badmin showconf sbd all
    SBD configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
    LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
    LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_LOG_MASK=LOG_WARNING
    LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_EGO_DAEMON_CONTROL=N
    ...
    SBD configuration for host <hostb> at Fri Jun 8 10:27:52 CST 2007
    LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
    LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_LOG_MASK=LOG_WARNING
    LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_EGO_DAEMON_CONTROL=N
    ...
- Show lim configuration:
    lsadmin showconf lim
    LIM configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
    LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
    LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_LOG_MASK=LOG_WARNING
    LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_EGO_DAEMON_CONTROL=N
    ...
- Show lim configuration for a specific host:
    lsadmin showconf lim hosta
    LIM configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
    LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
    LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_LOG_MASK=LOG_WARNING
    LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_EGO_DAEMON_CONTROL=N
    ...
- Show lim configuration for all hosts:
    lsadmin showconf lim all
    LIM configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
    LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
    LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_LOG_MASK=LOG_WARNING
    LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_EGO_DAEMON_CONTROL=N
    ...
    LIM configuration for host <hostb> at Fri Jun 8 10:27:52 CST 2007
    LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
    LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_LOG_MASK=LOG_WARNING
    LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
    LSF_EGO_DAEMON_CONTROL=N
    ...
Example directory structures
UNIX and Linux
The following figures show typical directory structures for a new UNIX or Linux installation with
lsfinstall. Depending on which products you have installed and platforms you have selected, your directory structure may vary.
![]()
Microsoft Windows
The following diagram shows an example directory structure for a Windows installation.
![]()
Cluster administrators
Primary cluster administrator
Required. The first cluster administrator, specified during installation. The primary LSF administrator account owns the configuration and log files. The primary LSF administrator has permission to perform cluster-wide operations, change configuration files, reconfigure the cluster, and control jobs submitted by all users.
Other cluster administrators
Optional. May be configured during or after installation.
Cluster administrators can perform administrative operations on all jobs and queues in the cluster. Cluster administrators have the same cluster-wide operational privileges as the primary LSF administrator except that they do not have permission to change LSF configuration files.
Add cluster administrators
- In the ClusterAdmins section of lsf.cluster.cluster_name, specify the list of cluster administrators following ADMINISTRATORS, separated by spaces.
  You can specify user names and group names.
  The first administrator in the list is the primary LSF administrator. All others are cluster administrators.
  For example:
    Begin ClusterAdmins
    ADMINISTRATORS = lsfadmin admin1 admin2
    End ClusterAdmins
- Save your changes.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
Controlling daemons
Permissions required
To control all daemons in the cluster, you must:
- Be logged on as root or as a user listed in the /etc/lsf.sudoers file. See the Platform LSF Configuration Reference for configuration details of lsf.sudoers.
- Be able to run the rsh or ssh commands across all LSF hosts without having to enter a password. See your operating system documentation for information about configuring the rsh and ssh commands. The shell command specified by LSF_RSH in lsf.conf is used before rsh is tried.
Daemon commands
The following is an overview of commands you use to control LSF daemons.
sbatchd
Restarting sbatchd on a host does not affect jobs that are running on that host.
If sbatchd is shut down, the host is not available to run new jobs. Existing jobs running on that host continue, but the results are not sent to the user until sbatchd is restarted.
LIM and RES
Jobs running on the host are not affected by restarting the daemons.
If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case you must kill and restart the daemon manually.
If the LIM and the other daemons on the current master host shut down, another host automatically takes over as master.
If the RES is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted.
Controlling mbatchd
You use the badmin command to control mbatchd.
Reconfigure mbatchd
If you add a host to a host group, a host to a queue, or change resource configuration in the Hosts section of lsf.cluster.cluster_name, the change is not recognized by jobs that were submitted before you reconfigured. If you want the new host to be recognized, you must restart mbatchd.
- Run badmin reconfig.
  When you reconfigure the cluster, mbatchd is not restarted. Only configuration files are reloaded.
Restart mbatchd
- Run badmin mbdrestart.
  LSF checks configuration files for errors and prints the results to stderr. If no errors are found, the following occurs:
  - Configuration files are reloaded
  - mbatchd is restarted
  - Events in lsb.events are reread and replayed to recover the running state of the last mbatchd
tip: Whenever mbatchd is restarted, it is unavailable to service requests. In large clusters where there are many events in lsb.events, restarting mbatchd can take some time. To avoid replaying events in lsb.events, use the command badmin reconfig.
Log a comment when restarting mbatchd
- Use the -C option of badmin mbdrestart to log an administrator comment in lsb.events.
  For example:
    badmin mbdrestart -C "Configuration change"
  The comment text Configuration change is recorded in lsb.events.
- Run badmin hist or badmin mbdhist to display administrator comments for mbatchd restart.
Shut down mbatchd
- Run badmin hshutdown to shut down sbatchd on the master host.
  For example:
    badmin hshutdown hostD
    Shut down slave batch daemon on <hostD> .... done
- Run badmin mbdrestart:
    badmin mbdrestart
    Checking configuration files ...
    No errors found.
  This causes mbatchd and mbschd to exit. mbatchd cannot be restarted, because sbatchd is shut down. All LSF services are temporarily unavailable, but existing jobs are not affected. When mbatchd is later started by sbatchd, its previous status is restored from the event log file and job scheduling continues.
Customize batch command messages
LSF displays error messages when a batch command cannot communicate with mbatchd. Users see these messages when the batch command retries the connection to mbatchd.
You can customize three of these messages to provide LSF users with more detailed information and instructions.
- In the file lsf.conf, identify the parameter for the message that you want to customize.
  The following lists the parameters you can use to customize messages when a batch command does not receive a response from mbatchd.
- Specify a message string, or specify an empty string:
  - To specify a message string, enclose the message text in quotation marks (") as shown in the following example:
      LSB_MBD_BUSY_MSG="The mbatchd daemon is busy. Your command will retry every 5 minutes. No action required."
  - To specify an empty string, type quotation marks (") as shown in the following example:
      LSB_MBD_BUSY_MSG=""
  Whether you specify a message string or an empty string, or leave the parameter undefined, the batch command retries the connection to mbatchd at the intervals specified by the parameters LSB_API_CONNTIMEOUT and LSB_API_RECVTIMEOUT.
  note: Before Version 7.0, LSF displayed the following message for all three message types: "batch daemon not responding...still trying." To display the previous default message, you must define each of the three message parameters and specify "batch daemon not responding...still trying" as the message string.
- Save and close the lsf.conf file.
Reconfiguring your cluster
After changing LSF configuration files, you must tell LSF to reread the files to update the configuration. Use the following commands to reconfigure a cluster:
- lsadmin reconfig
- badmin reconfig
- badmin mbdrestart
The reconfiguration commands you use depend on which files you change in LSF. The following table is a quick reference.
Reconfigure the cluster with lsadmin and badmin
To make a configuration change take effect, use this method to reconfigure the cluster.
- Log on to the host as root or the LSF administrator.
- Run lsadmin reconfig to reconfigure LIM:
    lsadmin reconfig
  The lsadmin reconfig command checks for configuration errors.
  If no errors are found, you are prompted to either restart lim on master host candidates only, or to confirm that you want to restart lim on all hosts. If fatal errors are found, reconfiguration is aborted.
- Run badmin reconfig to reconfigure mbatchd:
    badmin reconfig
  The badmin reconfig command checks for configuration errors.
  If fatal errors are found, reconfiguration is aborted.
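The order of the steps above could be scripted along the following lines. This is a minimal sketch, not a supported tool: `reconfigure_cluster` and `run_command` are hypothetical names, and treating a nonzero exit status as "reconfiguration was aborted" is an assumption that the text above does not confirm. The commands still prompt for confirmation, so the sketch leaves them attached to the terminal.

```python
import subprocess

# Commands named in the steps above, in the order the steps give them.
RECONFIG_STEPS = [
    ["lsadmin", "reconfig"],  # reconfigure LIM
    ["badmin", "reconfig"],   # reconfigure mbatchd
]


def run_command(argv):
    """Run one command attached to the terminal and return its exit status."""
    return subprocess.run(argv).returncode


def reconfigure_cluster(run=run_command):
    """Run each step in order and stop at the first nonzero exit status.

    Returns the commands that completed. Treating nonzero as "aborted"
    is an assumption of this sketch.
    """
    completed = []
    for argv in RECONFIG_STEPS:
        if run(argv) != 0:
            break
        completed.append(" ".join(argv))
    return completed
```

Passing the runner as a parameter keeps the ordering logic separate from process execution, so the sequence can be exercised on a machine without LSF installed.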
Reconfigure the cluster by restarting mbatchd
To replay and recover the running state of the cluster, use this method to reconfigure the cluster.
- Run badmin mbdrestart to restart mbatchd:
    badmin mbdrestart
  The badmin mbdrestart command checks for configuration errors.
  If no fatal errors are found, you are asked to confirm mbatchd restart. If fatal errors are found, the command exits without taking any action.
tip: If the lsb.events file is large, or many jobs are running, restarting mbatchd can take some time. In addition, mbatchd is not available to service requests while it is restarted.
View configuration errors
- Run lsadmin ckconfig -v.
- Run badmin ckconfig -v.
  This reports all errors to your terminal.
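To keep the output of both checks for later review instead of reading it only on the terminal, a wrapper along these lines is one option. It is a sketch under assumptions: `check_configuration` and `capture` are hypothetical names, the sketch assumes the two commands do not wait for input when run with -v, and it records the exit status without assuming what a particular value means.

```python
import subprocess

# The two checks listed above.
CHECK_COMMANDS = [
    ["lsadmin", "ckconfig", "-v"],
    ["badmin", "ckconfig", "-v"],
]


def capture(argv):
    """Run a command and return (exit status, combined stdout and stderr text)."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


def check_configuration(run=capture):
    """Run both checks, including the second when the first reports problems.

    Returns a dict keyed by command line, holding each status and output.
    """
    report = {}
    for argv in CHECK_COMMANDS:
        status, output = run(argv)
        report[" ".join(argv)] = {"status": status, "output": output}
    return report
```

Running both checks regardless of the first result means problems in LIM and batch configuration files show up in a single pass.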
How reconfiguring the cluster affects licenses
If the license server goes down, LSF can continue to operate for a period of time until it attempts to renew licenses.
Reconfiguring causes LSF to renew licenses. If no license server is available, LSF does not reconfigure the system because the system would lose all its licenses and stop working.
If you have multiple license servers, reconfiguration proceeds provided LSF can contact at least one license server. In this case, LSF still loses the licenses on servers that are down, so LSF may have fewer licenses available after reconfiguration.
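The rule described above can be written out as a small model. This is only an illustration of the reasoning, not LSF code: the function name, the server names, and the license counts are all hypothetical.

```python
def licenses_after_reconfig(servers):
    """Model the effect of reconfiguration on available licenses.

    servers maps a license server name to (reachable, license_count).
    Returns (proceeds, licenses_available):
      - reconfiguration proceeds only if at least one server is reachable;
      - afterwards, only licenses from reachable servers are available.
    When reconfiguration does not proceed, the count is None because
    nothing was renewed.
    """
    reachable_counts = [count for reachable, count in servers.values() if reachable]
    if not reachable_counts:
        # No license server can be contacted: LSF does not reconfigure.
        return False, None
    return True, sum(reachable_counts)


if __name__ == "__main__":
    # Hypothetical cluster with two license servers, one of them down.
    print(licenses_after_reconfig({"licA": (True, 10), "licB": (False, 6)}))
```

With one of two servers down, the model reports that reconfiguration proceeds but with fewer licenses, which matches the multiple-server case described above.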
Platform Computing Inc.
www.platform.com