Working with Hosts
Contents
- Host status
- How LIM Determines Host Models and Types
- Viewing Host Information
- Controlling Hosts
- Adding a Host
- Remove a Host
- Adding Hosts Dynamically
- Automatically Detect Operating System Types and Versions
- Add Host Types and Host Models to lsf.shared
- Registering Service Ports
- Host Naming
- Hosts with Multiple Addresses
- Using IPv6 Addresses
- Specify host names with condensed notation
- Host Groups
- Compute Units
- Tuning CPU Factors
- Handling Host-level Job Exceptions
Host status
Host status describes the ability of a host to accept and run batch jobs in terms of daemon states, load levels, and administrative controls. The bhosts and lsload commands display host status.
bhosts
Displays the current status of the host:
    bhosts
    HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
    hostA      ok      -     55   2      2    0      0      0
    hostB      closed  -     20   16     16   0      0      0
    ...
bhosts -l
Displays the closed reasons. A closed host does not accept new batch jobs:
    bhosts -l hostB
    HOST  hostB
    STATUS      CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
    closed_Adm  23.10  -     55   2      2    0      0      0    -

    CURRENT LOAD USED FOR SCHEDULING:
              r15s  r1m   r15m  ut  pg   io   ls  it  tmp    swp   mem
    Total     1.0   -0.0  -0.0  4%  9.4  148  2   3   4231M  698M  233M
    Reserved  0.0   0.0   0.0   0%  0.0  0    0   0   0M     0M    0M

    LOAD THRESHOLD USED FOR SCHEDULING:
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -

               cpuspeed  bandwidth
    loadSched  -         -
    loadStop   -         -
lsload
Displays the current state of the host:
    lsload
    HOST_NAME  status  r15s  r1m  r15m  ut  pg   ls  it    tmp    swp   mem
    hostA      ok      0.0   0.0  0.0   4%  0.4  0   4316  10G    302M  252M
    hostB      ok      1.0   0.0  0.0   4%  8.2  2   14    4231M  698M  232M
    ...
How LIM Determines Host Models and Types
The LIM (load information manager) daemon/service automatically collects information about hosts in an LSF cluster, and accurately determines running host models and types. At most, 1024 model types can be manually defined in lsf.shared.
If lsf.shared is not fully defined with all known host models and types found in the cluster, LIM attempts to match an unrecognized running host to one of the models and types that is defined.
LIM supports both exact matching of host models and types, and "fuzzy" matching, where an entered host model name or type is slightly different from what is defined in lsf.shared (or in ego.shared if EGO is enabled in the LSF cluster).
How does "fuzzy" matching work?
LIM reads host models and types that have been manually configured in lsf.shared. The format for entering host models and types is model_bogomips_architecture (for example, x15_4604_OpterontmProcessor142, IA64_2793, or SUNWUltra510_360_sparc). Names can be up to 64 characters long.
When LIM attempts to match running host model with what is entered in lsf.shared, it first attempts an exact match, then proceeds to make a fuzzy match.
How LIM attempts to make matches
Viewing Host Information
LSF uses some or all of the hosts in a cluster as execution hosts. The host list is configured by the LSF administrator. Use the bhosts command to view host information. Use the lsload command to view host load information.
View all hosts in the cluster and their status
- Run bhosts to display information about all hosts and their status.
bhosts displays condensed information for hosts that belong to condensed host groups. When displaying members of a condensed host group, bhosts lists the host group name instead of the name of the individual host. For example, in a cluster with a condensed host group (groupA), an uncondensed host group (groupB containing hostC and hostE), and a host that is not in any host group (hostF), bhosts displays the following:
    bhosts
    HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
    groupA     ok      5     8    4      2    0      1      1
    hostC      ok      -     3    0      0    0      0      0
    hostE      ok      2     4    2      1    0      0      1
    hostF      ok      -     2    2      1    0      1      0
Define condensed host groups in the HostGroups section of lsb.hosts. To find out more about condensed host groups and to see the configuration for the above example, see Defining condensed host groups.
View uncondensed host information
- Run bhosts -X to display all hosts in an uncondensed format, including those belonging to condensed host groups:
    bhosts -X
    HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
    hostA      ok      2     2    0      0    0      0      0
    hostD      ok      2     4    2      1    0      0      1
    hostB      ok      1     2    2      1    0      1      0
    hostC      ok      -     3    0      0    0      0      0
    hostE      ok      2     4    2      1    0      0      1
    hostF      ok      -     2    2      1    0      1      0
View detailed server host information
- Run bhosts -l host_name and lshosts -l host_name to display all information about each server host such as the CPU factor and the load thresholds to start, suspend, and resume jobs:
    bhosts -l hostB
    HOST  hostB
    STATUS  CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOWS
    ok      20.20  -     -    0      0    0      0      0    -

    CURRENT LOAD USED FOR SCHEDULING:
              r15s  r1m  r15m  ut  pg   io  ls  it  tmp   swp   mem
    Total     0.1   0.1  0.1   9%  0.7  24  17  0   394M  396M  12M
    Reserved  0.0   0.0  0.0   0%  0.0  0   0   0   0M    0M    0M

    LOAD THRESHOLD USED FOR SCHEDULING:
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -

               cpuspeed  bandwidth
    loadSched  -         -
    loadStop   -         -

    lshosts -l hostB
    HOST_NAME:  hostB
    type     model   cpuf   ncpus  ndisks  maxmem  maxswp  maxtmp  rexpri  server  nprocs  ncores  nthreads
    LINUX86  PC6000  116.1  2      1       2016M   1983M   72917M  0       Yes     1       2       2

    RESOURCES: Not defined
    RUN_WINDOWS: (always open)

    LICENSES_ENABLED: (LSF_Base LSF_Manager LSF_MultiCluster)
    LICENSE_NEEDED: Class(E)

    LOAD_THRESHOLDS:
    r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    -     1.0  -     -   -   -   -   -   -    -    4M
View host load by host
The lsload command reports the current status and load levels of hosts in a cluster. The lshosts -l command shows the load thresholds.
The lsmon command provides a dynamic display of the load information. The LSF administrator can find unavailable or overloaded hosts with these tools.
- Run lsload to see load levels for each host:
    lsload
    HOST_NAME  status  r15s  r1m   r15m  ut   pg   ls  it  tmp  swp   mem
    hostD      ok      1.3   1.2   0.9   92%  0.0  2   20  5M   148M  88M
    hostB      -ok     0.1   0.3   0.7   0%   0.0  1   67  45M  25M   34M
    hostA      busy    8.0   *7.0  4.9   84%  4.6  6   17  1M   81M   27M
The first line lists the load index names, and each following line gives the load levels for one host.
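Because the first line of lsload output names the columns and each later line describes one host, the output is easy to post-process. The following is a minimal Python sketch, not part of LSF, that parses text shaped like the sample above and lists the hosts whose status is anything other than ok. The function names and the embedded SAMPLE text are illustrative assumptions; in practice the text would come from running lsload on a cluster host.

```python
# Sample text copied from the lsload example; real input would be captured
# from the lsload command itself.
SAMPLE = """\
HOST_NAME  status  r15s  r1m   r15m  ut   pg   ls  it  tmp  swp   mem
hostD      ok      1.3   1.2   0.9   92%  0.0  2   20  5M   148M  88M
hostB      -ok     0.1   0.3   0.7   0%   0.0  1   67  45M  25M   34M
hostA      busy    8.0   *7.0  4.9   84%  4.6  6   17  1M   81M   27M
"""


def parse_lsload(text):
    """Return one dict per host, keyed by the column names in the first line."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        # Skip lines that do not line up with the header (for example "...").
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


def hosts_not_ok(rows):
    """Return the names of hosts whose status column is not exactly 'ok'."""
    return [row["HOST_NAME"] for row in rows if row["status"] != "ok"]


if __name__ == "__main__":
    for name in hosts_not_ok(parse_lsload(SAMPLE)):
        print(name)
```

Values are kept as strings on purpose, since fields such as 92%, 148M, and *7.0 carry units or markers that a caller may want to interpret differently.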
Viewing host architecture (type and model) information
An LSF cluster may consist of hosts of differing architectures and speeds. The lshosts command displays configuration information about hosts. All these parameters are defined by the LSF administrator in the LSF configuration files, or determined by the LIM directly from the system.
Host types represent binary compatible hosts; all hosts of the same type can run the same executable. Host models give the relative CPU performance of different processors. For example:
    lshosts
    HOST_NAME  type    model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
    hostD      SUNSOL  SunSparc  6.0   1      64M     112M    Yes     (solaris cserver)
    hostM      RS6K    IBM350    7.0   1      64M     124M    Yes     (cserver aix)
    hostC      SGI6    R10K      14.0  16     1024M   1896M   Yes     (irix cserver)
    hostA      HPPA    HP715     6.0   1      98M     200M    Yes     (hpux fserver)
In the above example, the host type SUNSOL represents Sun SPARC systems running Solaris, and SGI6 represents an SGI server running IRIX 6. The lshosts command also displays the resources available on each host.
type
The host CPU architecture. Hosts that can run the same binary programs should have the same type.
An UNKNOWN type or model indicates the host is down, or LIM on the host is down. See UNKNOWN host type or model for instructions on measures to take.
When automatic detection of host type or model fails (the host type configured in lsf.shared cannot be found), the type or model is set to DEFAULT. LSF will work on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.
View host history
- Run badmin hhist to view the history of a host such as when it is opened or closed:
    badmin hhist hostB
    Wed Nov 20 14:41:58: Host <hostB> closed by administrator <lsf>.
    Wed Nov 20 15:23:39: Host <hostB> opened by administrator <lsf>.
View host model and type information
- Run lsinfo -m to display information about host models that exist in the cluster:
    lsinfo -m
    MODEL_NAME      CPU_FACTOR  ARCHITECTURE
    PC1133          23.10       x6_1189_PentiumIIICoppermine
    HP9K735         4.50        HP9000735_125
    HP9K778         5.50        HP9000778
    Ultra5S         10.30       SUNWUltra510_270_sparcv9
    Ultra2          20.20       SUNWUltra2_300_sparc
    Enterprise3000  20.00       SUNWUltraEnterprise_167_sparc
- Run lsinfo -M to display all host models defined in lsf.shared:
    lsinfo -M
    MODEL_NAME           CPU_FACTOR  ARCHITECTURE
    UNKNOWN_AUTO_DETECT  1.00        UNKNOWN_AUTO_DETECT
    DEFAULT              1.00
    LINUX133             2.50        x586_53_Pentium75
    PC200                4.50        i86pc_200
    Intel_IA64           12.00       ia64
    Ultra5S              10.30       SUNWUltra5_270_sparcv9
    PowerPC_G4           12.00       x7400G4
    HP300                1.00
    SunSparc             12.00
- Run lim -t to display the type, model, and matched type of the current host. You must be the LSF administrator to use this command:
    lim -t
    Host Type             : NTX64
    Host Architecture     : EM64T_1596
    Physical Processors   : 2
    Cores per Processor   : 4
    Threads per Core      : 2
    License Needed        : Class(B),Multi-cores
    Matched Type          : NTX64
    Matched Architecture  : EM64T_3000
    Matched Model         : Intel_EM64T
    CPU Factor            : 60.0
View job exit rate and load for hosts
- Run bhosts -l to display the exception threshold for job exit rate and the current load value for hosts.
In the following example, EXIT_RATE for hostA is configured as 4 jobs per minute. hostA does not currently exceed this rate:
    bhosts -l hostA
    HOST  hostA
    STATUS  CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
    ok      18.60  -     1    0      0    0      0      0    -

    CURRENT LOAD USED FOR SCHEDULING:
              r15s  r1m  r15m  ut  pg   io  ls  it  tmp   swp   mem
    Total     0.0   0.0  0.0   0%  0.0  0   1   2   646M  648M  115M
    Reserved  0.0   0.0  0.0   0%  0.0  0   0   0   0M    0M    0M

              share_rsrc  host_rsrc
    Total     3.0         2.0
    Reserved  0.0         0.0

    LOAD THRESHOLD USED FOR SCHEDULING:
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -

               cpuspeed  bandwidth
    loadSched  -         -
    loadStop   -         -

    THRESHOLD AND LOAD USED FOR EXCEPTIONS:
               JOB_EXIT_RATE
    Threshold  4.00
    Load       0.00
- Use bhosts -x to see hosts whose job exit rate has exceeded the threshold for longer than JOB_EXIT_RATE_DURATION, and is still high. By default, these hosts are closed the next time LSF checks host exceptions and invokes eadmin.
If no hosts exceed the job exit rate, bhosts -x displays:
    There is no exceptional host found
View dynamic host information
- Use lshosts to display information on dynamically added hosts.
An LSF cluster may consist of static and dynamic hosts. The lshosts command displays configuration information about hosts. All these parameters are defined by the LSF administrator in the LSF configuration files, or determined by the LIM directly from the system.
Host types represent binary compatible hosts; all hosts of the same type can run the same executable. Host models give the relative CPU performance of different processors. Server represents the type of host in the cluster. "Yes" is displayed for LSF servers, "No" is displayed for LSF clients, and "Dyn" is displayed for dynamic hosts.
For example:
    lshosts
    HOST_NAME  type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
    hostA      SOL64    Ultra60F  23.5  1      64M     112M    Yes     ()
    hostB      LINUX86  Opteron8  60.0  1      94M     168M    Dyn     ()
In the above example, hostA is a static host while hostB is a dynamic host.
Controlling Hosts
Hosts are opened and closed by an LSF Administrator or root issuing a command or through configured dispatch windows.
Close a host
- Run badmin hclose:
    badmin hclose hostB
    Close <hostB> ...... done
If the command fails, it may be because the host is unreachable through network problems, or because the daemons on the host are not running.
Open a host
- Run badmin hopen:
    badmin hopen hostB
    Open <hostB> ...... done
Configure Dispatch Windows
A dispatch window specifies one or more time periods during which a host will receive new jobs. The host will not receive jobs outside of the configured windows. Dispatch windows do not affect job submission and running jobs (they are allowed to run until completion). By default, dispatch windows are not configured.
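To make the behavior concrete, here is a minimal Python sketch of the check a dispatch window implies: a host receives new jobs only while the current time falls inside a configured window. It is an illustration only and is not how LSF implements dispatch windows. It assumes the simple hour:minute-hour:minute form such as (4:30-12:00), treats both ends as inclusive, and does not handle windows that wrap past midnight or include day fields. The names parse_window and accepts_new_jobs are hypothetical.

```python
from datetime import time


def parse_window(window):
    """Parse a window such as "(4:30-12:00)" into (start, end) times.

    Assumption: only the hour[:minute]-hour[:minute] form is handled.
    """
    start_text, end_text = window.strip("() ").split("-")

    def to_time(text):
        hour, _, minute = text.partition(":")
        return time(int(hour), int(minute or 0))

    return to_time(start_text), to_time(end_text)


def accepts_new_jobs(window, now):
    """Return True if 'now' falls inside the window (ends inclusive)."""
    start, end = parse_window(window)
    return start <= now <= end


if __name__ == "__main__":
    print(accepts_new_jobs("(4:30-12:00)", time(9, 0)))
    print(accepts_new_jobs("(4:30-12:00)", time(13, 0)))
```

Jobs that are already running are unaffected by the window, so a check like this would apply only to the decision to dispatch a new job.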
To configure dispatch windows:
- Edit lsb.hosts.
- Specify one or more time windows in the DISPATCH_WINDOW column:
    Begin Host
    HOST_NAME  r1m      pg   ls     tmp  DISPATCH_WINDOW
    ...
    hostB      3.5/4.5  15/  12/15  0    (4:30-12:00)
    ...
    End Host
- Reconfigure the cluster:
  - Run lsadmin reconfig to reconfigure LIM.
  - Run badmin reconfig to reconfigure mbatchd.
- Run bhosts -l to display the dispatch windows.
Log a comment when closing or opening a host
- Use the -C option of badmin hclose and badmin hopen to log an administrator comment in lsb.events:
    badmin hclose -C "Weekly backup" hostB
The comment text Weekly backup is recorded in lsb.events. If you close or open a host group, each host group member displays with the same comment string.
A new event record is recorded for each host open or host close event. For example:
    badmin hclose -C "backup" hostA
followed by
    badmin hclose -C "Weekly backup" hostA
generates the following records in lsb.events:
    "HOST_CTRL" "7.0" 1050082346 1 "hostA" 32185 "lsfadmin" "backup"
    "HOST_CTRL" "7.0" 1050082373 1 "hostA" 32185 "lsfadmin" "Weekly backup"
- Use badmin hist or badmin hhist to display administrator comments for closing and opening hosts:
    badmin hhist
    Fri Apr 4 10:35:31: Host <hostB> closed by administrator <lsfadmin> Weekly backup.
bhosts -l also displays the comment text:
    bhosts -l
    HOST  hostA
    STATUS      CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
    closed_Adm  1.00  -     -    0      0    0      0      0    -

    CURRENT LOAD USED FOR SCHEDULING:
              r15s  r1m  r15m  ut  pg   io  ls  it  tmp    swp   mem
    Total     0.0   0.0  0.0   2%  0.0  64  2   11  7117M  512M  432M
    Reserved  0.0   0.0  0.0   0%  0.0  0   0   0   0M     0M    0M

    LOAD THRESHOLD USED FOR SCHEDULING:
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -

               cpuspeed  bandwidth
    loadSched  -         -
    loadStop   -         -

    THRESHOLD AND LOAD USED FOR EXCEPTIONS:
               JOB_EXIT_RATE
    Threshold  2.00
    Load       0.00

    ADMIN ACTION COMMENT: "Weekly backup"
How events are displayed and recorded in MultiCluster lease model
In the MultiCluster resource lease model, host control administrator comments are recorded only in the lsb.events file on the local cluster. badmin hist and badmin hhist display only events that are recorded locally. Host control messages are not passed between clusters in the MultiCluster lease model. For example, if you close an exported host in both the consumer and the provider cluster, the host close events are recorded separately in their local lsb.events.
Adding a Host
You use the lsfinstall command to add a host to an LSF cluster.
Add a host of an existing type using lsfinstall
restriction: lsfinstall is not compatible with clusters installed with lsfsetup. To add a host to a cluster originally installed with lsfsetup, you must upgrade your cluster.
- Verify that the host type already exists in your cluster:
  - Log on to any host in the cluster. You do not need to be root.
  - List the contents of the LSF_TOP/7.0 directory. The default is /usr/share/lsf/7.0. If the host type currently exists, there is a subdirectory with the name of the host type. If it does not exist, go to Add a host of a new type using lsfinstall.
- Add the host information to lsf.cluster.cluster_name:
  - Log on to the LSF master host as root.
  - Edit LSF_CONFDIR/lsf.cluster.cluster_name, and specify the following in the Host section:
    - The name of the host.
    - The model and type, or specify ! to automatically detect the type or model.
    - Specify 1 for LSF server or 0 for LSF client.
        Begin Host
        HOSTNAME  model  type      server  r1m  mem  RESOURCES  REXPRI
        hosta     !      SUNSOL6   1       1.0  4    ()         0
        hostb     !      SUNSOL6   0       1.0  4    ()         0
        hostc     !      HPPA1132  1       1.0  4    ()         0
        hostd     !      HPPA1164  1       1.0  4    ()         0
        End Host
  - Save your changes.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- From /usr/share/lsf/7.0/install, run hostsetup to set up the new host and configure the daemons to start automatically at boot:
    ./hostsetup --top="/usr/share/lsf" --boot="y"
- Start LSF on the new host:
    lsadmin limstartup
    lsadmin resstartup
    badmin hstartup
- Run bhosts and lshosts to verify your changes.
  - If any host type or host model is UNKNOWN, follow the steps in UNKNOWN host type or model to fix the problem.
  - If any host type or host model is DEFAULT, follow the steps in DEFAULT host type or model to fix the problem.
Add a host of a new type using lsfinstall
restriction: lsfinstall is not compatible with clusters installed with lsfsetup. To add a host to a cluster originally installed with lsfsetup, you must upgrade your cluster.
- Verify that the host type does not already exist in your cluster:
  - Log on to any host in the cluster. You do not need to be root.
  - List the contents of the LSF_TOP/7.0 directory. The default is /usr/share/lsf/7.0. If the host type currently exists, there will be a subdirectory with the name of the host type. If the host type already exists, go to Add a host of an existing type using lsfinstall.
- Get the LSF distribution tar file for the host type you want to add.
- Log on as root to any host that can access the LSF install directory.
- Change to the LSF install directory. The default is /usr/share/lsf/7.0/install
- Edit install.config:
  - For LSF_TARDIR, specify the path to the tar file. For example:
        LSF_TARDIR="/usr/share/lsf_distrib/7.0"
  - For LSF_ADD_SERVERS, list the new host names enclosed in quotes and separated by spaces. For example:
        LSF_ADD_SERVERS="hosta hostb"
  - Run ./lsfinstall -f install.config. This automatically creates the host information in lsf.cluster.cluster_name.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- From /usr/share/lsf/7.0/install, run hostsetup to set up the new host and configure the daemons to start automatically at boot:
    ./hostsetup --top="/usr/share/lsf" --boot="y"
- Start LSF on the new host:
    lsadmin limstartup
    lsadmin resstartup
    badmin hstartup
- Run bhosts and lshosts to verify your changes.
  - If any host type or host model is UNKNOWN, follow the steps in UNKNOWN host type or model to fix the problem.
  - If any host type or host model is DEFAULT, follow the steps in DEFAULT host type or model to fix the problem.
Remove a Host
Removing a host from LSF involves preventing any additional jobs from running on the host, removing the host from LSF, and removing the host from the cluster.
caution: Never remove the master host from LSF. If you want to remove your current default master from LSF, change lsf.cluster.cluster_name to assign a different default master host. Then remove the host that was once the master host.
- Log on to the LSF host as root.
- Run badmin hclose to close the host. This prevents jobs from being dispatched to the host and allows running jobs to finish.
- Stop all running daemons manually.
- Remove any references to the host in the Host section of LSF_CONFDIR/lsf.cluster.cluster_name.
- Remove any other references to the host, if applicable, from the following LSF configuration files:
  - LSF_CONFDIR/lsf.shared
  - LSB_CONFDIR/cluster_name/configdir/lsb.hosts
  - LSB_CONFDIR/cluster_name/configdir/lsb.queues
  - LSB_CONFDIR/cluster_name/configdir/lsb.resources
- Log off the host to be removed, and log on as root or the primary LSF administrator to any other host in the cluster.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- If you configured LSF daemons to start automatically at system startup, remove the LSF section from the host's system startup files.
- If any users of the host use lstcsh as their login shell, change their login shell to tcsh or csh. Remove lstcsh from the /etc/shells file.
Remove a Host from Master Candidate List
You can remove a host from the master candidate list so that it can no longer be the master should failover occur. You can choose to either keep it as part of the cluster or remove it.
- Shut down the current LIM:
    limshutdown host_name
If the host was the current master, failover occurs.
- In lsf.conf, remove the host name from LSF_MASTER_LIST.
- Run lsadmin reconfig for the remaining master candidates.
- If the host you removed as a master candidate still belongs to the cluster, start up the LIM again:
    limstartup host_name
Adding Hosts Dynamically
By default, all configuration changes made to LSF are static. To add or remove hosts within the cluster, you must manually change the configuration and restart all master candidates.
Dynamic host configuration allows you to add and remove hosts without manual reconfiguration. To enable dynamic host configuration, all of the parameters described in the following table must be defined.
important: If you choose to enable dynamic hosts when you install LSF, the installer adds the parameter LSF_HOST_ADDR_RANGE to lsf.cluster.cluster_name using a default value that allows any host to join the cluster. To enable security, configure LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name after installation to restrict the hosts that can join your cluster.
How dynamic host configuration works
Master LIM
The master LIM runs on the master host for the cluster. The master LIM receives requests to add hosts, and tells the master host candidates defined by the parameter LSF_MASTER_LIST to update their configuration information when a host is dynamically added or removed.
Upon startup, both static and dynamic hosts wait to receive an acknowledgement from the master LIM. This acknowledgement indicates that the master LIM has added the host to the cluster. Static hosts normally receive an acknowledgement because the master LIM has access to static host information in the LSF configuration files. Dynamic hosts do not receive an acknowledgement, however, until they announce themselves to the master LIM. The parameter LSF_DYNAMIC_HOST_WAIT_TIME in lsf.conf determines how long a dynamic host waits before sending a request to the master LIM to add the host to the cluster.
Master candidate LIMs
The parameter LSF_MASTER_LIST defines the list of master host candidates. These hosts receive updated host information from the master LIM so that any master host candidate can take over as master host for the cluster.
important: Master candidate hosts should share LSF configuration and binaries.
Dynamic hosts cannot be master host candidates. By defining the parameter LSF_MASTER_LIST, you ensure that LSF limits the list of master host candidates to specific, static hosts.
mbatchd
mbatchd gets host information from the master LIM; when it detects the addition or removal of a dynamic host within the cluster, mbatchd automatically reconfigures itself.
tip: After adding a host dynamically, you might have to wait for mbatchd to detect the host and reconfigure. Depending on system load, mbatchd might wait up to a maximum of 10 minutes before reconfiguring.
lsadmin command
Use the command lsadmin limstartup to start the LIM on a newly added dynamic host.
Allowing only certain hosts to join the cluster
By default, any host can be dynamically added to the cluster. To enable security, define LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name to identify a range of IP addresses for hosts that are allowed to dynamically join the cluster as LSF hosts. IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format. You can use IPv6 addresses if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf; you do not have to map IPv4 addresses to an IPv6 format.
Configure LSF to run batch jobs on dynamic hosts
Before you run batch jobs on a dynamic host, complete any or all of the following steps, depending on your cluster configuration.
- Configure queues to accept all hosts by defining the HOSTS parameter in lsb.queues using the keyword all.
- Define host groups that will accept wild cards in the HostGroup section of lsb.hosts.
For example, define linuxrack* as a GROUP_MEMBER within a host group definition.
- Add a dynamic host to a host group using the command badmin hghostadd.
Changing a dynamic host to a static host
If you want to change a dynamic host to a static host, first use the command badmin hghostdel to remove the dynamic host from any host group that it belongs to, and then configure the host as a static host in lsf.cluster.cluster_name.
Adding dynamic hosts
Add a dynamic host in a shared file system environment
In a shared file system environment, you do not need to install LSF on each dynamic host. The master host will recognize a dynamic host as an LSF host when you start the daemons on the dynamic host.
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_WAIT_TIME, in seconds, and assign a value greater than zero.
LSF_DYNAMIC_HOST_WAIT_TIME specifies the length of time a dynamic host waits before sending a request to the master LIM to add the host to the cluster.
For example:
    LSF_DYNAMIC_HOST_WAIT_TIME=60
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.
note: For very large clusters, defining this parameter could decrease system performance.
For example:
    LSF_DYNAMIC_HOST_TIMEOUT=60m
- In lsf.cluster.cluster_name on the master host, define the parameter LSF_HOST_ADDR_RANGE.
LSF_HOST_ADDR_RANGE enables security by defining a list of hosts that can join the cluster. Specify IP addresses or address ranges for hosts that you want to allow in the cluster.
tip: If you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf, IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format; you do not have to map IPv4 addresses to an IPv6 format.
For example:
    LSF_HOST_ADDR_RANGE=100-110.34.1-10.4-56
All hosts belonging to a domain with an address having the first number between 100 and 110, then 34, then a number between 1 and 10, then a number between 4 and 56 will be allowed access. In this example, no IPv6 hosts are allowed.
- Log on as root to each host you want to join the cluster.
- Source the LSF environment:
  - For csh or tcsh:
        source LSF_TOP/conf/cshrc.lsf
  - For sh, ksh, or bash:
        . LSF_TOP/conf/profile.lsf
- Do you want LSF to start automatically when the host reboots?
  - If no, go to step 7.
  - If yes, run the hostsetup command. For example:
        cd /usr/share/lsf/7.0/install
        ./hostsetup --top="/usr/share/lsf" --boot="y"
    For complete hostsetup usage, enter hostsetup -h.
- Use the following commands to start LSF:
    lsadmin limstartup
    lsadmin resstartup
    badmin hstartup
Add a dynamic host in a non-shared file system environment
In a non-shared file system environment, you must install LSF binaries, a localized lsf.conf file, and shell environment scripts (cshrc.lsf and profile.lsf) on each dynamic host.
Specify installation options in the slave.config file
All dynamic hosts are slave hosts, because they cannot serve as master host candidates. The slave.config file contains parameters for configuring all slave hosts.
- Define the required parameters.
    LSF_SERVER_HOSTS="host_name [host_name ...]"
    LSF_ADMINS="user_name [user_name ...]"
    LSF_TOP="/path"
- Define the optional parameters.
    LSF_LIM_PORT=port_number
important: If the master host does not use the default LSF_LIM_PORT, you must specify the same LSF_LIM_PORT defined in lsf.conf on the master host.
Add local resources on a dynamic host to the cluster
Prerequisites: Ensure that the resource name and type are defined in lsf.shared, and that the ResourceMap section of lsf.cluster.cluster_name contains at least one resource mapped to at least one static host. LSF can add local resources as long as the ResourceMap section is defined; you do not need to map the local resources.
- In the slave.config file, define the parameter LSF_LOCAL_RESOURCES.
For numeric resources, define name-value pairs:
    "[resourcemap value*resource_name]"
For Boolean resources, the value is the resource name in the following format:
    "[resource resource_name]"
For example:
    LSF_LOCAL_RESOURCES="[resourcemap 1*verilog] [resource linux]"
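As a way to check a value before putting it in slave.config, the following Python sketch splits an LSF_LOCAL_RESOURCES string into its numeric and Boolean entries, following the two bracketed forms described above. It is an illustration under stated assumptions and not an LSF tool: the function name parse_local_resources is hypothetical, and the sketch recognizes only the [resourcemap value*resource_name] and [resource resource_name] forms, ignoring anything else.

```python
import re

# One bracketed entry: "[resourcemap 1*verilog]" or "[resource linux]".
ENTRY = re.compile(r"\[\s*(resourcemap|resource)\s+([^\]]+?)\s*\]")


def parse_local_resources(value):
    """Return (numeric, boolean) for an LSF_LOCAL_RESOURCES value.

    numeric maps resource name to its value (kept as text);
    boolean is the list of Boolean resource names in order of appearance.
    """
    numeric = {}
    boolean = []
    for kind, body in ENTRY.findall(value):
        if kind == "resourcemap":
            amount, _, name = body.partition("*")
            numeric[name.strip()] = amount.strip()
        else:
            boolean.append(body.strip())
    return numeric, boolean


if __name__ == "__main__":
    print(parse_local_resources("[resourcemap 1*verilog] [resource linux]"))
```

Numeric values are left as strings so that the sketch makes no assumption about which number formats LSF accepts.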
tip: If LSF_LOCAL_RESOURCES are already defined in a local lsf.conf on the dynamic host, lsfinstall does not add resources you define in LSF_LOCAL_RESOURCES in slave.config.
When the dynamic host sends a request to the master host to add it to the cluster, the dynamic host also reports its local resources. If the local resource is already defined in lsf.cluster.cluster_name as default or all, it cannot be added as a local resource.
Install LSF on a dynamic host
- Run lsfinstall -s -f slave.config.
lsfinstall creates a local lsf.conf for the dynamic host, which sets the following parameters:
    LSF_CONFDIR="/path"
    LSF_GET_CONF=lim
    LSF_LIM_PORT=port_number (same as the master LIM port number)
    LSF_LOCAL_RESOURCES="resource ..."
tip: Do not duplicate LSF_LOCAL_RESOURCES entries in lsf.conf. If local resources are defined more than once, only the last definition is valid.
    LSF_SERVER_HOSTS="host_name [host_name ...]"
    LSF_VERSION=7.0
important: If LSF_STRICT_CHECKING is defined in lsf.conf to protect your cluster in untrusted environments, and your cluster has dynamic hosts, LSF_STRICT_CHECKING must be configured in the local lsf.conf on all dynamic hosts.
Configure dynamic host parameters
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_WAIT_TIME, in seconds, and assign a value greater than zero.
LSF_DYNAMIC_HOST_WAIT_TIME specifies the length of time a dynamic host waits before sending a request to the master LIM to add the host to the cluster.
For example:
    LSF_DYNAMIC_HOST_WAIT_TIME=60
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.
note: For very large clusters, defining this parameter could decrease system performance.
For example:
    LSF_DYNAMIC_HOST_TIMEOUT=60m
- In lsf.cluster.cluster_name on the master host, define the parameter LSF_HOST_ADDR_RANGE.
LSF_HOST_ADDR_RANGE enables security by defining a list of hosts that can join the cluster. Specify IP addresses or address ranges for hosts that you want to allow in the cluster.
tip: If you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf, IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format; you do not have to map IPv4 addresses to an IPv6 format.
For example:
    LSF_HOST_ADDR_RANGE=100-110.34.1-10.4-56
All hosts belonging to a domain with an address having the first number between 100 and 110, then 34, then a number between 1 and 10, then a number between 4 and 56 will be allowed access. No IPv6 hosts are allowed.
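The address rule in the LSF_HOST_ADDR_RANGE example can be checked by hand, field by field. The Python sketch below does the same comparison for the dotted IPv4 form shown above, where each field is either a single number or a low-high range. It is a simplified illustration rather than LSF's own matching logic: it assumes exactly the syntax of the example, does not cover any other range syntax the parameter may accept, and rejects anything that is not a dotted numeric address. The function names are hypothetical.

```python
def parse_addr_range(spec):
    """Split a spec such as "100-110.34.1-10.4-56" into (low, high) pairs."""
    fields = []
    for part in spec.split("."):
        low, _, high = part.partition("-")
        # A field without "-" is a single value, so low and high are equal.
        fields.append((int(low), int(high or low)))
    return fields


def host_allowed(address, spec):
    """Return True if every field of the address lies inside the spec."""
    fields = parse_addr_range(spec)
    try:
        octets = [int(octet) for octet in address.split(".")]
    except ValueError:
        # Not a dotted numeric address (for example an IPv6 address).
        return False
    if len(octets) != len(fields):
        return False
    return all(low <= octet <= high
               for octet, (low, high) in zip(octets, fields))


if __name__ == "__main__":
    spec = "100-110.34.1-10.4-56"
    for address in ("105.34.5.20", "111.34.5.20", "::1"):
        print(address, host_allowed(address, spec))
```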
Start LSF daemons
- Log on as root to each host you want to join the cluster.
- Source the LSF environment:
  - For csh or tcsh:
        source LSF_TOP/conf/cshrc.lsf
  - For sh, ksh, or bash:
        . LSF_TOP/conf/profile.lsf
- Do you want LSF to start automatically when the host reboots?
  - If no, go to step 4.
  - If yes, run the hostsetup command. For example:
        cd /usr/share/lsf/7.0/install
        ./hostsetup --top="/usr/share/lsf" --boot="y"
    For complete hostsetup usage, enter hostsetup -h.
- Is this the first time the host is joining the cluster?
  - If no, use the following commands to start LSF:
        lsadmin limstartup
        lsadmin resstartup
        badmin hstartup
  - If yes, you must start the daemons from the local host. For example, if you want to start the daemons on hostB from hostA, use the following commands:
        rsh hostB lsadmin limstartup
        rsh hostB lsadmin resstartup
        rsh hostB badmin hstartup
Removing dynamic hosts
To remove a dynamic host from the cluster, you can either set a timeout value, or you can edit the hostcache file.
Remove a host by setting a timeout value
LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.
note: For very large clusters, defining this parameter could decrease system performance. If you want to use this parameter to remove dynamic hosts from a very large cluster, disable the parameter after LSF has removed the unwanted hosts.
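The value of LSF_DYNAMIC_HOST_TIMEOUT is read as hours unless m or M is appended, in which case it is minutes, and the minimum is 10 minutes. The Python sketch below converts a value to minutes so the units are easy to double-check before editing lsf.conf. It is only an illustration: the function name is hypothetical, and raising an error for values under the minimum is an assumption of this sketch, since this section does not say how LSF treats such values.

```python
MINIMUM_MINUTES = 10  # minimum stated for LSF_DYNAMIC_HOST_TIMEOUT


def timeout_minutes(value):
    """Convert an LSF_DYNAMIC_HOST_TIMEOUT value to minutes.

    A trailing m or M means minutes; otherwise the number is hours.
    Assumption: values under the minimum are reported as an error here.
    """
    text = value.strip()
    if text[-1:] in ("m", "M"):
        minutes = int(text[:-1])
    else:
        minutes = int(text) * 60
    if minutes < MINIMUM_MINUTES:
        raise ValueError("timeout is below the 10 minute minimum")
    return minutes


if __name__ == "__main__":
    print(timeout_minutes("60m"))
    print(timeout_minutes("2"))
```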
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
To specify minutes rather than hours, append m or M to the value.
For example:
LSF_DYNAMIC_HOST_TIMEOUT=60m
Remove a host by editing the hostcache file
Dynamic hosts remain in the cluster unless you intentionally remove them. Only the cluster administrator can modify the hostcache file.
- Shut down the cluster.
lsfshutdown
This shuts down LSF on all hosts in the cluster and prevents LIMs from trying to write to the hostcache file while you edit it.
- In the hostcache file $EGO_WORKDIR/lim/hostcache, delete the line for the dynamic host that you want to remove.
- If EGO is enabled, the hostcache file is in $EGO_WORKDIR/lim/hostcache.
- If EGO is not enabled, the hostcache file is in $LSB_SHAREDIR.
- Close the hostcache file, and then start up the cluster.
lsfrestart
Automatically Detect Operating System Types and Versions
LSF can automatically detect most operating system types and versions so that you do not need to add them to the lsf.shared file manually. The list of automatically detected operating systems is updated regularly.
- Edit lsf.shared.
- In the Resource section, remove the comment from the following line:
ostype String () () () (Operating system and version)
- In $LSF_SERVERDIR, rename tmp.eslim.ostype to eslim.ostype.
- Run the following commands to restart the LIM and master batch daemon:
lsadmin reconfig
badmin mbdrestart
- To view operating system types and versions, run lshosts -l or lshosts -s.
LSF displays the operating system types and versions in your cluster, including any that LSF automatically detects as well as those you have defined manually in the HostType section of lsf.shared.
You can specify ostype in your resource requirement strings. For example, when submitting a job you can specify the following resource requirement: -R "select[ostype=RHEL2.6]".
Modify how long LSF waits for new operating system types and versions
Prerequisites: You must enable LSF to automatically detect operating system types and versions.
You can configure how long LSF waits for OS type and version detection.
- In lsf.conf, modify the value for EGO_ESLIM_TIMEOUT.
The value is time in seconds.
Add Host Types and Host Models to lsf.shared
The lsf.shared file contains a list of host type and host model names for most operating systems. You can add to this list or customize the host type and host model names. A host type and host model name can be any alphanumeric string up to 39 characters long.
Add a custom host type or model
- Log on as the LSF administrator on any host in the cluster.
- Edit lsf.shared:
- For a new host type, modify the HostType section:
Begin HostType
TYPENAME # Keyword
DEFAULT
IBMAIX564
LINUX86
LINUX64
NTX64
NTIA64
SUNSOL
SOL732
SOL64
SGI658
SOLX86
HPPA11
HPUXIA64
MACOSX
End HostType
- For a new host model, modify the HostModel section:
Add the new model and its CPU speed factor relative to other models. For more details on tuning CPU factors, see Tuning CPU Factors.
Begin HostModel
MODELNAME CPUFACTOR ARCHITECTURE # keyword
# x86 (Solaris, Windows, Linux): approximate values, based on SpecBench results
# for Intel processors (Sparc/Win) and BogoMIPS results (Linux).
PC75      1.5   (i86pc_75 i586_75 x586_30)
PC90      1.7   (i86pc_90 i586_90 x586_34 x586_35 x586_36)
HP9K715   4.2   (HP9000715_100)
SunSparc  12.0  ()
CRAYJ90   18.0  ()
IBM350    18.0  ()
End HostModel
- Save the changes to lsf.shared.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
Registering Service Ports
LSF uses dedicated UDP and TCP ports for communication. All hosts in the cluster must use the same port numbers to communicate with each other.
The service port numbers can be any numbers ranging from 1024 to 65535 that are not already used by other services. To make sure that the port numbers you supply are not already used by applications registered in your service database, check /etc/services or use the command ypcat services.
By default, port numbers for LSF services are defined in the lsf.conf file. You can also configure ports by modifying /etc/services or the NIS or NIS+ database. If you define port numbers in lsf.conf, port numbers defined in the service database are ignored.
lsf.conf
- Log on to any host as root.
- Edit lsf.conf and add the following lines:
LSF_RES_PORT=3878
LSB_MBD_PORT=3881
LSB_SBD_PORT=3882
- Add the same entries to lsf.conf on every host.
- Save lsf.conf.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
/etc/services
Configure services manually
tip: During installation, use the hostsetup --boot="y" option to set up the LSF port numbers in the service database.
- Use the LSF_TOP/version/install/instlib/example.services file as a guide for adding LSF entries to the services database.
If any other service listed in your services database has the same port number as one of the LSF services, you must change the port number for the LSF service. You must use the same port numbers on every LSF host.
- Log on to any host as root.
- Edit the /etc/services file by adding the contents of the LSF_TOP/version/install/instlib/example.services file:
# /etc/services entries for LSF daemons
#
res      3878/tcp # remote execution server
lim      3879/udp # load information manager
mbatchd  3881/tcp # master lsbatch daemon
sbatchd  3882/tcp # slave lsbatch daemon
#
# Add this if ident is not already defined
# in your /etc/services file
ident    113/tcp auth tap # identd
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
NIS or NIS+ database
If you are running NIS, you only need to modify the services database once per NIS master. On some hosts the NIS database and commands are in the /var/yp directory; on others, NIS is found in /etc/yp.
- Log on to any host as root.
- Run lsfshutdown to shut down all the daemons in the cluster.
- To find the name of the NIS master host, use the command:
ypwhich -m services
- Log on to the NIS master host as root.
- Edit the /var/yp/src/services or /etc/yp/src/services file on the NIS master host, adding the contents of the LSF_TOP/version/install/instlib/example.services file:
# /etc/services entries for LSF daemons.
#
res      3878/tcp # remote execution server
lim      3879/udp # load information manager
mbatchd  3881/tcp # master lsbatch daemon
sbatchd  3882/tcp # slave lsbatch daemon
#
# Add this if ident is not already defined
# in your /etc/services file
ident    113/tcp auth tap # identd
Make sure that all the lines you add either contain valid service entries or begin with a comment character (#). Blank lines are not allowed.
- Change the directory to /var/yp or /etc/yp.
- Use the following command:
ypmake services
On some hosts the master copy of the services database is stored in a different location.
On systems running NIS+ the procedure is similar. Refer to your system documentation for more information.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
Host Naming
LSF needs to match host names with the corresponding Internet host addresses.
LSF looks up host names and addresses in the following ways:
- In the /etc/hosts file
- Sun Network Information Service/Yellow Pages (NIS or YP)
- Internet Domain Name Service (DNS).
- DNS is also known as the Berkeley Internet Name Domain (BIND) or named, which is the name of the BIND daemon.
Each host is configured to use one or more of these mechanisms.
Network addresses
Each host has one or more network addresses; usually one for each network to which the host is directly connected. Each host can also have more than one name.
Official host name
The first name configured for each address is called the official name.
Host name aliases
Other names for the same host are called aliases.
LSF uses the configured host naming system on each host to look up the official host name for any alias or host address. This means that you can use aliases as input to LSF, but LSF always displays the official name.
Using host name ranges as aliases
The default host file syntax
ip_address official_name [alias [alias ...]]
is powerful and flexible, but it is difficult to configure in systems where a single host name has many aliases, and in multihomed host environments.
In these cases, the hosts file can become very large and unmanageable, and configuration is prone to error.
The syntax of the LSF hosts file supports host name ranges as aliases for an IP address. This simplifies the host name alias specification.
To use host name ranges as aliases, the host names must consist of a fixed node group name prefix and node indices, specified in a form like:
host_name[index_x-index_y,index_m,index_a-index_b]
For example:
atlasD0[0-3,4,5-6, ...]
is equivalent to:
atlasD0[0-6, ...]
The node list does not need to be a continuous range (some nodes can be configured out). Node indices can be numbers or letters (both upper case and lower case).
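To make the range syntax concrete, here is a small Python sketch that expands a range alias into the individual names it stands for. It is an illustration under stated assumptions, not the LSF implementation: the function name is hypothetical, only numeric indices are handled (letter indices are left out), and only one bracketed range per name is accepted.

```python
import re

# prefix[body]suffix, with exactly one bracketed index list
RANGE_PATTERN = re.compile(r"^([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)$")


def expand_host_range(spec):
    """Expand 'atlas[0-3,5]' into ['atlas0', 'atlas1', 'atlas2', 'atlas3', 'atlas5']."""
    match = RANGE_PATTERN.match(spec)
    if match is None:
        return [spec]  # plain name without a range
    prefix, body, suffix = match.groups()
    names = []
    for part in body.split(","):
        start, _, end = part.strip().partition("-")
        first, last = int(start), int(end or start)
        if first > last:
            raise ValueError(f"invalid range {part!r} in {spec!r}")
        for index in range(first, last + 1):
            names.append(f"{prefix}{index}{suffix}")
    return names


print(expand_host_range("atlas[0-3,4,5-6]"))
print(expand_host_range("atlas[0-3,4,5-6]") == expand_host_range("atlas[0-6]"))  # True
```

The second call shows the equivalence stated above: adjacent sub-ranges and single indices collapse to one continuous range.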
Example
Some systems map internal compute nodes to single LSF host names. A host file might contain 64 lines, each specifying an LSF host name and 32 node names that correspond to each LSF host:
...
177.16.1.1 atlasD0 atlas0 atlas1 atlas2 atlas3 atlas4 ... atlas31
177.16.1.2 atlasD1 atlas32 atlas33 atlas34 atlas35 atlas36 ... atlas63
...
In the new format, you still map the nodes to the LSF hosts, so the number of lines remains the same, but the format is simplified because you only have to specify ranges for the nodes, not each node individually as an alias:
...
177.16.1.1 atlasD0 atlas[0-31]
177.16.1.2 atlasD1 atlas[32-63]
...
You can use either an IPv4 or an IPv6 format for the IP address (if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf).
Host name services
Solaris
On Solaris systems, the /etc/nsswitch.conf file controls the name service.
Other UNIX platforms
On other UNIX platforms, the following rules apply:
- If your host has an /etc/resolv.conf file, your host is using DNS for name lookups
- If the command ypcat hosts prints out a list of host addresses and names, your system is looking up names in NIS
- Otherwise, host names are looked up in the /etc/hosts file
For more information
The man pages for the gethostbyname function, the ypbind and named daemons, the resolver functions, and the hosts, svc.conf, nsswitch.conf, and resolv.conf files explain host name lookups in more detail.
Hosts with Multiple Addresses
Multi-homed hosts
Hosts that have more than one network interface usually have one Internet address for each interface. Such hosts are called multi-homed hosts. For example, dual-stack hosts are multi-homed because they have both an IPv4 and an IPv6 network address.
LSF identifies hosts by name, so it needs to match each of these addresses with a single host name. To do this, the host name information must be configured so that all of the Internet addresses for a host resolve to the same name.
There are two ways to do it:
- Modify the system hosts file (/etc/hosts) and the changes will affect the whole system
- Create an LSF hosts file (LSF_CONFDIR/hosts) and LSF will be the only application that resolves the addresses to the same host
Multiple network interfaces
Some system manufacturers recommend that each network interface, and therefore, each Internet address, be assigned a different host name. Each interface can then be directly accessed by name. This setup is often used to make sure NFS requests go to the nearest network interface on the file server, rather than going through a router to some other interface. Configuring this way can confuse LSF, because there is no way to determine that the two different names (or addresses) mean the same host. LSF provides a workaround for this problem.
All host naming systems can be configured so that host address lookups always return the same name, while still allowing access to network interfaces by different names. Each host has an official name and a number of aliases, which are other names for the same host. By configuring all interfaces with the same official name but different aliases, you can refer to each interface by a different alias name while still providing a single official name for the host.
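One way to spot the problematic setup described above is to scan hosts-format text for names that appear under more than one official name. The Python sketch below does that; it is only an illustration, the function names are hypothetical, and the sample addresses are documentation placeholders rather than values from a real cluster.

```python
from collections import defaultdict


def official_names_by_name(hosts_text):
    """Map each name in hosts-format text to the set of official names it appears with."""
    names = defaultdict(set)
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue
        official = fields[1]  # first name after the address is the official name
        for name in fields[1:]:
            names[name].add(official)
    return names


def ambiguous_names(hosts_text):
    """Return names that are tied to more than one official name."""
    return {
        name: sorted(officials)
        for name, officials in official_names_by_name(hosts_text).items()
        if len(officials) > 1
    }


per_interface = """
192.0.2.1     host-AA host   # first interface
198.51.100.1  host-BB host   # second interface
"""
single_official = """
192.0.2.1     host host-AA   # first interface
198.51.100.1  host host-BB   # second interface
"""
print(ambiguous_names(per_interface))    # {'host': ['host-AA', 'host-BB']}
print(ambiguous_names(single_official))  # {}
```

An empty result means every name leads back to a single official name, which is the arrangement LSF needs.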
Configuring the LSF hosts file
If your LSF clusters include hosts that have more than one interface and are configured with more than one official host name, you must either modify the host name configuration, or create a private hosts file for LSF to use.
The LSF hosts file is stored in LSF_CONFDIR. The format of LSF_CONFDIR/hosts is the same as for /etc/hosts.
In the LSF hosts file, duplicate the system hosts database information, except make all entries for the host use the same official name. Configure all the other names for the host as aliases so that you can still refer to the host by any name.
Example
For example, if your /etc/hosts file contains:
AA.AA.AA.AA host-AA host # first interface
BB.BB.BB.BB host-BB # second interface
then the LSF_CONFDIR/hosts file should contain:
AA.AA.AA.AA host host-AA # first interface
BB.BB.BB.BB host host-BB # second interface
Example /etc/hosts entries
No unique official name
The following example is for a host with two interfaces, where the host does not have a unique official name.
# Address      Official name    Aliases
# Interface on network A
AA.AA.AA.AA    host-AA.domain   host.domain host-AA host
# Interface on network B
BB.BB.BB.BB    host-BB.domain   host-BB host
Looking up the address AA.AA.AA.AA finds the official name host-AA.domain. Looking up address BB.BB.BB.BB finds the name host-BB.domain. No information connects the two names, so there is no way for LSF to determine that both names, and both addresses, refer to the same host.
To resolve this case, you must configure these addresses using a unique host name. If you cannot make this change to the system file, you must create an LSF hosts file and configure these addresses using a unique host name in that file.
Both addresses have the same official name
Here is the same example, with both addresses configured for the same official name.
# Address      Official name    Aliases
# Interface on network A
AA.AA.AA.AA    host.domain      host-AA.domain host-AA host
# Interface on network B
BB.BB.BB.BB    host.domain      host-BB.domain host-BB host
With this configuration, looking up either address returns host.domain as the official name for the host. LSF (and all other applications) can determine that all the addresses and host names refer to the same host. Individual interfaces can still be specified by using the host-AA and host-BB aliases.
Example for a dual-stack host
Dual-stack hosts have more than one IP address. You must associate the host name with both addresses, as shown in the following example:
# Address                               Official name    Aliases
# Interface IPv4
AA.AA.AA.AA                             host.domain      host-AA.domain
# Interface IPv6
BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB     host.domain      host-BB.domain
With this configuration, looking up either address returns host.domain as the official name for the host. LSF (and all other applications) can determine that all the addresses and host names refer to the same host. Individual interfaces can still be specified by using the host-AA and host-BB aliases.
Sun Solaris example
For example, Sun NIS uses the /etc/hosts file on the NIS master host as input, so the format for NIS entries is the same as for the /etc/hosts file. Since LSF can resolve this case, you do not need to create an LSF hosts file.
DNS configuration
The configuration format is different for DNS. The same result can be produced by configuring two address (A) records for each Internet address. Following the previous example:
# name           class  type  address
host.domain      IN     A     AA.AA.AA.AA
host.domain      IN     A     BB.BB.BB.BB
host-AA.domain   IN     A     AA.AA.AA.AA
host-BB.domain   IN     A     BB.BB.BB.BB
Looking up the official host name can return either address. Looking up the interface-specific names returns the correct address for each interface.
For a dual-stack host, the IPv6 address is published in an AAAA record rather than an A record:
# name           class  type  address
host.domain      IN     A     AA.AA.AA.AA
host.domain      IN     AAAA  BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB
host-AA.domain   IN     A     AA.AA.AA.AA
host-BB.domain   IN     AAAA  BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB
PTR records in DNS
Address-to-name lookups in DNS are handled using PTR records. The PTR records for both addresses should be configured to return the official name:
# address                  class  type  name
AA.AA.AA.AA.in-addr.arpa   IN     PTR   host.domain
BB.BB.BB.BB.in-addr.arpa   IN     PTR   host.domain
For a dual-stack host, the PTR record for the IPv6 address belongs under ip6.arpa, with the address written as reversed nibbles:
# address                                       class  type  name
AA.AA.AA.AA.in-addr.arpa                        IN     PTR   host.domain
<reversed nibbles of the IPv6 address>.ip6.arpa IN     PTR   host.domain
If it is not possible to change the system host name database, create the hosts file local to the LSF system, and configure entries for the multi-homed hosts only. Host names and addresses not found in the hosts file are looked up in the standard name system on your host.
Using IPv6 Addresses
IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format. You can use IPv6 addresses if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf; you do not have to map IPv4 addresses to an IPv6 format.
LSF supports IPv6 addresses for the following platforms:
- Linux 2.4
- Linux 2.6
- Solaris 10
- Windows
- XP
- 2003
- 2000 with Service Pack 1 or higher
- AIX 5
- HP-UX
- 11i
- 11iv1
- 11iv2
- 11.11
- SGI Altix ProPack 3, 4, and 5
- IRIX 6.5.19 and higher, Trusted IRIX 6.5.19 and higher
- Mac OS 10.2 and higher
- Cray XT3
- IBM Power 5 Series
Enable both IPv4 and IPv6 support
- Configure the parameter LSF_ENABLE_SUPPORT_IPV6=Y in lsf.conf.
Configure hosts for IPv6
Follow the steps in this procedure if you do not have an IPv6-enabled DNS server or an IPv6-enabled router. IPv6 is supported on some Linux 2.4 kernels and on all Linux 2.6 kernels.
- Configure the kernel.
- Does the entry /proc/net/if_inet6 exist?
- To load the IPv6 module into the kernel, execute the following command as root:
modprobe ipv6
- To check that the module loaded correctly, execute the command lsmod | grep -w 'ipv6'
- Add an IPv6 address to the host by executing the following command as root:
/sbin/ifconfig eth0 inet6 add 3ffe:ffff:0:f101::2/64
- Display the IPv6 address using ifconfig.
- Repeat step 1 through step 3 for other hosts in the cluster.
- To configure IPv6 networking, add the addresses for all IPv6 hosts to /etc/hosts on each host.
note: For IPv6 networking, hosts must be on the same subnet.
- Test IPv6 communication between hosts using the command ping6.
Specify host names with condensed notation
A number of commands often require you to specify host names. You can now specify host name ranges instead. You can use condensed notation with the following commands:
- bacct
- bhist
- bjobs
- bmig
- bmod
- bpeek
- brestart
- brsvadd
- brsvmod
- brsvs
- brun
- bsub
- bswitch
You must specify a valid range of hosts, where the start number is smaller than the end number.
- Run the command you want and specify the host names as a range.
For example:
bsub -m "host[1-100].corp.com"
The job is submitted to host1.corp.com, host2.corp.com, host3.corp.com, all the way to host100.corp.com.
- Run the command you want and specify host names as a combination of ranges and individuals.
For example:
bsub -m "host[1-10,12,20-25].corp.com"
The job is submitted to host1.corp.com, host2.corp.com, host3.corp.com, up to and including host10.corp.com. It is also submitted to host12.corp.com and the hosts between and including host20.corp.com and host25.corp.com.
Host Groups
You can define a host group within LSF or use an external executable to retrieve host group members.
Use bhosts to view a list of existing hosts. Use bmgroup to view host group membership.
Where to use host groups
LSF host groups can be used in defining the following parameters in LSF configuration files:
- HOSTS in lsb.queues for authorized hosts for the queue
- HOSTS in lsb.hosts in the HostPartition section to list host groups that are members of the host partition
Configure host groups
- Log in as the LSF administrator to any host in the cluster.
- Open lsb.hosts.
- Add the HostGroup section if it does not exist.
Begin HostGroup
GROUP_NAME    GROUP_MEMBER
groupA        (all)
groupB        (groupA ~hostA ~hostB)
groupC        (hostX hostY hostZ)
groupD        (groupC ~hostX)
groupE        (all ~groupC ~hostB)
groupF        (hostF groupC hostK)
desk_tops     (hostD hostE hostF hostG)
Big_servers   (!)
End HostGroup
- Enter a group name under the GROUP_NAME column.
External host groups must be defined in the egroup executable.
- Specify hosts in the GROUP_MEMBER column.
(Optional) To tell LSF that the group members should be retrieved using egroup, put an exclamation mark (!) in the GROUP_MEMBER column.
- Save your changes.
- Run badmin ckconfig to check the group definition. If any errors are reported, fix the problem and check the configuration again.
- Run badmin mbdrestart to apply the new configuration.
Using wildcards and special characters to define host names
You can use special characters when defining host group members under the GROUP_MEMBER column to specify hosts. These are useful to define several hosts in a single entry, such as for a range of hosts, or for all host names with a certain text string.
If a host matches more than one host group, that host is a member of all groups. If any host group is a condensed host group, the status and other details of the hosts are counted towards all of the matching host groups.
When defining host group members, you can use string literals and the following special characters:
- Use a tilde (~) to exclude specified hosts or host groups from the list. The tilde can be used in conjunction with the other special characters listed below. The following example matches all hosts in the cluster except for hostA, hostB, and all members of the groupA host group:
... (all ~hostA ~hostB ~groupA)
- Use an asterisk (*) as a wildcard character to represent any number of characters. The following example matches all hosts beginning with the text string "hostC" (such as hostCa, hostC1, or hostCZ1):
... (hostC*)
- Use square brackets with a hyphen ([integer1-integer2]) to define a range of non-negative integers at the end of a host name. The first integer must be less than the second integer. The following example matches all hosts from hostD51 to hostD100:
... (hostD[51-100])
- Use square brackets with commas ([integer1,integer2...]) to define individual non-negative integers at the end of a host name. The following example matches hostD101, hostD123, and hostD321:
... (hostD[101,123,321])
- Use square brackets with commas and hyphens (such as [integer1-integer2,integer3,integer4-integer5]) to define different ranges of non-negative integers at the end of a host name. The following example matches all hosts from hostD1 to hostD100, hostD102, all hosts from hostD201 to hostD300, and hostD320:
... (hostD[1-100,102,201-300,320])
Restrictions
You cannot use more than one set of square brackets in a single host group definition.
The following example is not correct:
... (hostA[1-10]B[1-20] hostC[101-120])
The following example is correct:
... (hostA[1-20] hostC[101-120])
You cannot define subgroups that contain wildcards and special characters. The following definition for groupB is not correct because groupA defines hosts with a wildcard:
Begin HostGroup
GROUP_NAME    GROUP_MEMBER
groupA        (hostA*)
groupB        (groupA)
End HostGroup
Defining condensed host groups
You can define condensed host groups to display information for their hosts as a summary for the entire group. This is useful because it allows you to see the total statistics of the host group as a whole instead of having to add up the data yourself. This allows you to better plan the distribution of jobs submitted to the hosts and host groups in your cluster.
To define condensed host groups, add a CONDENSE column to the HostGroup section. Under this column, enter Y to define a condensed host group or N to define an uncondensed host group, as shown in the following:
Begin HostGroup
GROUP_NAME   CONDENSE   GROUP_MEMBER
groupA       Y          (hostA hostB hostD)
groupB       N          (hostC hostE)
End HostGroup
The following commands display condensed host group information:
- bhosts
- bhosts -w
- bjobs
- bjobs -w
For the bhosts output of this configuration, see Viewing Host Information.
Use bmgroup -l to see whether host groups are condensed or not.
Hosts belonging to multiple condensed host groups
If you configure a host to belong to more than one condensed host group using wildcards, bjobs can display any of the host groups as the execution host name.
For example, host groups hg1 and hg2 include the same hosts:
Begin HostGroup
GROUP_NAME   CONDENSE   GROUP_MEMBER   # Key words
hg1          Y          (host*)
hg2          Y          (hos*)
End HostGroup
Submit jobs using bsub -m:
bsub -m "hg2" sleep 1001
bjobs displays hg1 as the execution host instead of hg2:
bjobs
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
520    user1  RUN   normal  host5      hg1        sleep 1001  Apr 15 13:50
521    user1  RUN   normal  host5      hg1        sleep 1001  Apr 15 13:50
522    user1  PEND  normal  host5                 sleep 1001  Apr 15 13:51
Importing external host groups (egroup)
When the membership of a host group changes frequently, or when the group contains a large number of members, you can use an external executable called egroup to retrieve a list of members rather than having to configure the group membership manually. You can write a site-specific egroup executable that retrieves host group names and the hosts that belong to each group. For information about how to use the external host and user groups feature, see the Platform LSF Configuration Reference.
Compute Units
Compute units are similar to host groups, with the added feature of granularity allowing the construction of clusterwide structures that mimic network architecture. Job scheduling using compute unit resource requirements optimizes job placement based on the underlying system architecture, minimizing communications bottlenecks. Compute units are especially useful when running communication-intensive parallel jobs spanning several hosts.
Resource requirement strings can specify compute unit requirements such as running a job exclusively (excl), spreading a job evenly over multiple compute units (balance), or choosing compute units based on other criteria.
For a complete description of compute units see Controlling Job Locality using Compute Units in Chapter 34, "Running Parallel Jobs".
Compute unit configuration
To enforce consistency, compute unit configuration has the following requirements:
- Hosts and host groups appear in the finest granularity compute unit type, and nowhere else.
- Hosts appear in the membership list of at most one compute unit of the finest granularity.
- All compute units of the same type have the same type of compute units (or hosts) as members.
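The three requirements can be checked mechanically. The Python sketch below is one possible check, not LSF's own validation: the function name and the data layout (a list of unit types, finest first, plus a dictionary of unit name to type and members) are assumptions made for illustration, and host groups are treated as opaque names rather than expanded into hosts.

```python
def check_compute_units(unit_types, units):
    """Return a list of problems; an empty list means the layout is consistent.

    unit_types: compute unit types ordered finest first.
    units: dict mapping unit name -> (unit type, list of member names).
    """
    finest = unit_types[0]
    problems = []
    first_seen_in = {}        # host or host group -> finest unit that lists it
    member_type_for = {}      # coarser unit type -> type of its members
    for name, (unit_type, members) in units.items():
        for member in members:
            is_unit = member in units
            if unit_type == finest:
                if is_unit:
                    problems.append(f"{name}: finest unit contains compute unit {member}")
                elif member in first_seen_in:
                    problems.append(
                        f"{member}: listed in both {first_seen_in[member]} and {name}")
                else:
                    first_seen_in[member] = name
            elif not is_unit:
                problems.append(f"{name}: {member} is not a compute unit")
            else:
                expected = member_type_for.setdefault(unit_type, units[member][0])
                if units[member][0] != expected:
                    problems.append(f"{name}: mixes member types ({member})")
    return problems


types = ["enclosure", "rack", "cabinet"]
layout = {
    "encl1": ("enclosure", ["hostA", "hg1"]),
    "encl2": ("enclosure", ["hostC", "hostD"]),
    "rack1": ("rack", ["encl1", "encl2"]),
    "cab1": ("cabinet", ["rack1"]),
}
print(check_compute_units(types, layout))  # []
```

A host that is listed in two enclosures, or a rack that lists a host directly, shows up as an entry in the returned list.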
tip: Configure each individual host as a compute unit to use the compute unit features for host level job allocation.
Where to use compute units
LSF compute units can be used in defining the following parameters in LSF configuration files:
- EXCLUSIVE in lsb.queues for the compute unit type allowed for the queue.
- HOSTS in lsb.queues for the hosts on which jobs from this queue can be run.
- RES_REQ in lsb.queues for queue compute unit resource requirements.
- RES_REQ in lsb.applications for application profile compute unit resource requirements.
Configure compute units
- Log in as the LSF administrator to any host in the cluster.
- Open lsb.params.
- Add the COMPUTE_UNIT_TYPES parameter if it does not already exist and list your compute unit types in order of granularity (finest first).
COMPUTE_UNIT_TYPES=enclosure rack cabinet
- Save your changes.
- Open lsb.hosts.
- Add the ComputeUnit section if it does not exist.
Begin ComputeUnit
NAME    MEMBER           TYPE
encl1   (hostA hg1)      enclosure
encl2   (hostC hostD)    enclosure
encl3   (hostE hostF)    enclosure
encl4   (hostG hg2)      enclosure
rack1   (encl1 encl2)    rack
rack2   (encl3 encl4)    rack
cab1    (rack1 rack2)    cabinet
End ComputeUnit
- Enter a compute unit name under the NAME column.
External compute units must be defined in the egroup executable.
- Specify hosts or host groups in the MEMBER column of the finest granularity compute unit type. Specify compute units in the MEMBER column of coarser compute unit types.
(Optional) To tell LSF that the compute unit members of a finest granularity compute unit should be retrieved using egroup, put an exclamation mark (!) in the MEMBER column.
- Specify the type of compute unit in the TYPE column.
- Save your changes.
- Run badmin ckconfig to check the compute unit definition. If any errors are reported, fix the problem and check the configuration again.
- Run badmin mbdrestart to apply the new configuration.
To view configured compute units, run bmgroup -cu.
Using wildcards and special characters to define names in compute units
You can use special characters when defining compute unit members under the MEMBER column to specify hosts, host groups, and compute units. These are useful to define several names in a single entry such as a range of hosts, or for all names with a certain text string.
When defining host, host group, and compute unit members of compute units, you can use string literals and the following special characters:
- Use a tilde (~) to exclude specified hosts, host groups, or compute units from the list. The tilde can be used in conjunction with the other special characters listed below. The following example matches all hosts in group12 except for hostA and hostB:
... (group12 ~hostA ~hostB)
- Use an asterisk (*) as a wildcard character to represent any number of characters. The following example matches all hosts beginning with the text string "hostC" (such as hostCa, hostC1, or hostCZ1):
... (hostC*)
- Use square brackets with a hyphen ([integer1-integer2]) to define a range of non-negative integers at the end of a name. The first integer must be less than the second integer. The following example matches all hosts from hostD51 to hostD100:
... (hostD[51-100])
- Use square brackets with commas ([integer1,integer2...]) to define individual non-negative integers at the end of a name. The following example matches hostD101, hostD123, and hostD321:
... (hostD[101,123,321])
- Use square brackets with commas and hyphens (such as [integer1-integer2,integer3,integer4-integer5]) to define different ranges of non-negative integers at the end of a name. The following example matches all hosts from hostD1 to hostD100, hostD102, all hosts from hostD201 to hostD300, and hostD320:
... (hostD[1-100,102,201-300,320])
Restrictions
You cannot use more than one set of square brackets in a single compute unit definition.
The following example is not correct:
... (hostA[1-10]B[1-20] hostC[101-120])
The following example is correct:
... (hostA[1-20] hostC[101-120])
The keywords all, allremote, all@cluster, other and default cannot be used when defining compute units.
Defining condensed compute units
You can define condensed compute units to display information for their hosts as a summary for the entire group, including the slot usage for each compute unit. This is useful because it allows you to see statistics of the compute unit as a whole instead of having to add up the data yourself. This allows you to better plan the distribution of jobs submitted to the hosts and compute units in your cluster.
To define condensed compute units, add a CONDENSE column to the ComputeUnit section. Under this column, enter Y to define a condensed compute unit or N to define an uncondensed compute unit, as shown in the following:
Begin ComputeUnit
NAME    CONDENSE   MEMBER                TYPE
enclA   Y          (hostA hostB hostD)   enclosure
enclB   N          (hostC hostE)         enclosure
End ComputeUnit
The following commands display condensed host information:
- bhosts
- bhosts -w
- bjobs
- bjobs -w
For the bhosts output of this configuration, see Viewing Host Information.
Use bmgroup -l to see whether host groups are condensed or not.
Importing external host groups (egroup)
When the membership of a compute unit changes frequently, or when the compute unit contains a large number of members, you can use an external executable called egroup to retrieve a list of members rather than having to configure the membership manually. You can write a site-specific egroup executable that retrieves compute unit names and the hosts that belong to each group, and compute units of the finest granularity can contain egroups as members. For information about how to use the external host and user groups feature, see the Platform LSF Configuration Reference.
Using compute units with advance reservation
When running exclusive compute unit jobs (with the resource requirement cu[excl]), the advance reservation can affect hosts outside the advance reservation but in the same compute unit as follows:
- An exclusive compute unit job dispatched to a host inside the advance reservation will lock the entire compute unit, including any hosts outside the advance reservation.
- An exclusive compute unit job dispatched to a host outside the advance reservation will lock the entire compute unit, including any hosts inside the advance reservation.
Ideally, the hosts belonging to a compute unit should be either all inside or all outside an advance reservation.
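A quick way to check this guideline is to expand each compute unit's member list and compare it with the hosts in a reservation. The following is a minimal sketch, not an LSF tool: the function names and the input dictionaries are hypothetical, it handles only the bracket notation described earlier in this section (ranges and comma lists at the end of a name), and it does not handle the tilde or asterisk characters.

```python
import re

# Matches a name that ends in exactly one bracket expression, e.g. hostD[1-100,102].
_BRACKET = re.compile(r"^([^\[\]]*)\[([^\[\]]+)\]$")


def expand_name(name):
    """Expand one condensed member name into a list of host names."""
    if name.count("[") > 1:
        # Mirrors the restriction shown by the "not correct" example
        # hostA[1-10]B[1-20]: one set of square brackets per name.
        raise ValueError("more than one set of square brackets: " + name)
    match = _BRACKET.match(name)
    if match is None:
        if "[" in name or "]" in name:
            raise ValueError("brackets must close the name: " + name)
        return [name]  # plain host name, nothing to expand
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low >= high:
                raise ValueError("first integer must be less than second: " + part)
            hosts.extend(f"{prefix}{i}" for i in range(low, high + 1))
        else:
            hosts.append(f"{prefix}{int(part)}")
    return hosts


def expand_members(member_list):
    """Expand a MEMBER value such as "(hostA[1-2] hostC)" into a set of hosts."""
    hosts = set()
    for name in member_list.strip("()").split():
        hosts.update(expand_name(name))
    return hosts


def straddling_units(compute_units, reserved_hosts):
    """Return the compute units that are only partly inside the reservation."""
    reserved = set(reserved_hosts)
    partial = []
    for cu_name, members in compute_units.items():
        hosts = expand_members(members)
        inside = hosts & reserved
        if inside and inside != hosts:
            partial.append(cu_name)
    return sorted(partial)


# Illustrative data only: enclA has four hosts, two of them reserved.
units = {"enclA": "(hostA[1-4])", "enclB": "(hostC hostE)"}
print(straddling_units(units, ["hostA1", "hostA2", "hostC", "hostE"]))
```

Any compute unit the sketch reports is one where an exclusive compute unit job could lock hosts on both sides of the reservation boundary.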
Tuning CPU Factors
CPU factors are used to differentiate the relative speed of different machines. LSF runs jobs on the best possible machines so that response time is minimized.
To achieve this, it is important that you define correct CPU factors for each machine model in your cluster.
How CPU factors affect performance
Incorrect CPU factors can reduce performance in the following ways:
- If the CPU factor for a host is too low, that host may not be selected for job placement when a slower host is available. This means that jobs would not always run on the fastest available host.
- If the CPU factor is too high, jobs are run on the fast host even when they would finish sooner on a slower but lightly loaded host. This causes the faster host to be overused while the slower hosts are underused.
Both of these conditions are somewhat self-correcting. If the CPU factor for a host is too high, jobs are sent to that host until the CPU load threshold is reached. LSF then marks that host as busy, and no further jobs will be sent there. If the CPU factor is too low, jobs may be sent to slower hosts. This increases the load on the slower hosts, making LSF more likely to schedule future jobs on the faster host.
Guidelines for setting CPU factors
CPU factors should be set based on a benchmark that reflects your workload. If there is no such benchmark, CPU factors can be set based on raw CPU power.
The CPU factor of the slowest hosts should be set to 1, and the CPU factors of faster hosts should be proportional to their speed relative to the slowest.
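The proportional rule can be sketched as a small calculation over benchmark run times. This is only an illustration: the function name and the input dictionary are hypothetical, and the 30-second and 15-second timings are sample values rather than measurements.

```python
def cpu_factors(benchmark_seconds):
    """Map each host model to a CPU factor, with the slowest model fixed at 1.0.

    benchmark_seconds: dict of model name -> time in seconds to run the
    same benchmark (smaller means faster).
    """
    if not benchmark_seconds:
        raise ValueError("no benchmark results supplied")
    if any(t <= 0 for t in benchmark_seconds.values()):
        raise ValueError("benchmark times must be positive")
    slowest = max(benchmark_seconds.values())
    # A model that finishes in half the time gets twice the factor.
    return {model: slowest / t for model, t in benchmark_seconds.items()}


print(cpu_factors({"hostA": 30.0, "hostB": 15.0}))
# prints {'hostA': 1.0, 'hostB': 2.0}
```

The resulting values are what you would enter in the CPUFACTOR column for each model.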
Example
Consider a cluster with two hosts: hostA and hostB. In this cluster, hostA takes 30 seconds to run a benchmark and hostB takes 15 seconds to run the same test. The CPU factor for hostA should be 1, and the CPU factor of hostB should be 2 because it is twice as fast as hostA.
View normalized ratings
Run lsload -N to display normalized ratings.
LSF uses a normalized CPU performance rating to decide which host has the most available CPU power. Hosts in your cluster are displayed in order from best to worst. Normalized CPU run queue length values are based on an estimate of the time it would take each host to run one additional unit of work, given that an unloaded host with CPU factor 1 runs one unit of work in one unit of time.
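One way to read the estimate above is that a host with run queue length r and CPU factor f needs roughly (r + 1) / f time units to finish one more unit of work. The sketch below ranks hosts on that basis. The formula is an interpretation of the description, not a confirmed statement of how LSF computes the value, and the function names and sample data are hypothetical.

```python
def normalized_run_queue(run_queue_length, cpu_factor):
    """Estimated time for one additional unit of work (assumed formula)."""
    if cpu_factor <= 0:
        raise ValueError("CPU factor must be positive")
    if run_queue_length < 0:
        raise ValueError("run queue length cannot be negative")
    # An unloaded host (r = 0) with CPU factor 1 yields exactly 1.0.
    return (run_queue_length + 1.0) / cpu_factor


def rank_hosts(hosts):
    """Order host names from best (lowest estimate) to worst.

    hosts: dict of host name -> (run queue length, CPU factor)
    """
    return sorted(hosts, key=lambda name: normalized_run_queue(*hosts[name]))


sample = {"hostA": (0.0, 1.0), "hostB": (0.5, 2.0), "hostC": (3.0, 2.0)}
print(rank_hosts(sample))
# prints ['hostB', 'hostA', 'hostC']
```

Under this reading, a lightly loaded fast host (hostB) ranks ahead of an idle slow host (hostA), while a heavily loaded fast host (hostC) ranks last.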
Tune CPU factors
- Log in as the LSF administrator on any host in the cluster.
- Edit lsf.shared, and change the HostModel section:
Begin HostModel
MODELNAME  CPUFACTOR  ARCHITECTURE # keyword
#HPUX (HPPA)
HP9K712S   2.5        (HP9000712_60)
HP9K712M   2.5        (HP9000712_80)
HP9K712F   4.0        (HP9000712_100)
See the Platform LSF Configuration Reference for information about the lsf.shared file.
- Save the changes to lsf.shared.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
Handling Host-level Job Exceptions
You can configure hosts so that LSF detects exceptional conditions while jobs are running, and takes appropriate action automatically. You can customize which exceptions are detected, and the corresponding actions. By default, LSF does not detect any exceptions.
Host exceptions LSF can detect
If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically, jobs dispatched to such "black hole" or "job-eating" hosts exit abnormally. LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).
If EXIT_RATE is specified for the host, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.
Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.
Configuring host exception handling (lsb.hosts)
EXIT_RATE
Specify a threshold for exited jobs. If the job exit rate is exceeded for 5 minutes or the period specified by JOB_EXIT_RATE_DURATION in lsb.params, LSF invokes eadmin to trigger a host exception.
Example
The following Host section defines a job exit rate of 20 jobs for all hosts, and an exit rate of 10 jobs on hostA.
Begin Host
HOST_NAME  MXJ  EXIT_RATE  # Keywords
Default    !    20
hostA      !    10
End Host
Configuring thresholds for host exception handling
By default, LSF checks the number of exited jobs every 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change this default.
Tuning
Tip: Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.
Example
In the following diagram, the job exit rate of hostA exceeds the configured threshold (EXIT_RATE for hostA in lsb.hosts). LSF monitors hostA from time t1 to time t2 (t2=t1 + JOB_EXIT_RATE_DURATION in lsb.params). At t2, the exit rate is still high, and a host exception is detected. At t3 (EADMIN_TRIGGER_DURATION in lsb.params), LSF invokes eadmin and the host exception is handled. By default, LSF closes hostA and sends email to the LSF administrator. Since hostA is closed and cannot accept any new jobs, the exit rate drops quickly.
![]()
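The threshold lookup and the t1-to-t2 monitoring window described in this section can be sketched as follows. This is a simplified model rather than LSF's implementation: the function names, the per-minute sampling, and the sample values are all assumptions, and the later eadmin step at t3 (EADMIN_TRIGGER_DURATION) is not modeled.

```python
def effective_exit_rate(host, host_exit_rates, global_exit_rate=None):
    """Return the threshold that applies to a host, or None if there is none.

    A host-level EXIT_RATE overrides GLOBAL_EXIT_RATE.
    """
    rate = host_exit_rates.get(host)
    return global_exit_rate if rate is None else rate


def detect_exception(samples, threshold, duration_minutes=5):
    """Return the time t2 at which a host exception is detected, or None.

    samples: list of (minute, job exit rate) pairs in time order.
    The rate must stay above the threshold from t1 until
    t2 = t1 + duration_minutes (JOB_EXIT_RATE_DURATION).
    """
    if threshold is None:
        return None  # no EXIT_RATE configured, nothing to detect
    t1 = None
    for minute, rate in samples:
        if rate > threshold:
            if t1 is None:
                t1 = minute  # rate first exceeds the threshold
            if minute - t1 >= duration_minutes:
                return minute  # still high at t2
        else:
            t1 = None  # rate recovered, restart the window
    return None


# Illustrative values: hostA has its own threshold of 10, others use 20.
host_rates = {"hostA": 10}
samples = [(0, 4), (1, 12), (2, 15), (3, 14), (4, 13), (5, 16), (6, 18)]

threshold = effective_exit_rate("hostA", host_rates, global_exit_rate=20)
print(threshold, detect_exception(samples, threshold))
# prints 10 6
```

With the same samples, a host that falls back to the cluster-wide threshold of 20 never triggers an exception, because the rate stays below that value.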
|
Platform Computing Inc.
www.platform.com |