freebsd_amp_hwpstate/contrib/ofed/management/opensm/man/opensm.8.in

1013 lines
39 KiB
Groff

.TH OPENSM 8 "June 13, 2008" "OpenIB" "OpenIB Management"
.SH NAME
opensm \- InfiniBand subnet manager and administration (SM/SA)
.SH SYNOPSIS
.B opensm
[\-\-version]]
[\-F | \-\-config <file_name>]
[\-c(reate-config) <file_name>]
[\-g(uid) <GUID in hex>]
[\-l(mc) <LMC>]
[\-p(riority) <PRIORITY>]
[\-smkey <SM_Key>]
[\-r(eassign_lids)]
[\-R <engine name(s)> | \-\-routing_engine <engine name(s)>]
[\-A | \-\-ucast_cache]
[\-z | \-\-connect_roots]
[\-M <file name> | \-\-lid_matrix_file <file name>]
[\-U <file name> | \-\-lfts_file <file name>]
[\-S | \-\-sadb_file <file name>]
[\-a | \-\-root_guid_file <path to file>]
[\-u | \-\-cn_guid_file <path to file>]
[\-X | \-\-guid_routing_order_file <path to file>]
[\-m | \-\-ids_guid_file <path to file>]
[\-o(nce)]
[\-s(weep) <interval>]
[\-t(imeout) <milliseconds>]
[\-maxsmps <number>]
[\-console [off | local | socket | loopback]]
[\-console-port <port>]
[\-i(gnore-guids) <equalize-ignore-guids-file>]
[\-f <log file path> | \-\-log_file <log file path> ]
[\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)]
[\-P(config) <partition config file> ]
[\-N | \-\-no_part_enforce]
[\-Q | \-\-qos [\-Y | \-\-qos_policy_file <file name>]]
[\-y | \-\-stay_on_fatal]
[\-B | \-\-daemon]
[\-I | \-\-inactive]
[\-\-perfmgr]
[\-\-perfmgr_sweep_time_s <seconds>]
[\-\-prefix_routes_file <path>]
[\-\-consolidate_ipv6_snm_req]
[\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>]
[\-h(elp)] [\-?]
.SH DESCRIPTION
.PP
opensm is an InfiniBand compliant Subnet Manager and Administration,
and runs on top of OpenIB.
opensm provides an implementation of an InfiniBand Subnet Manager and
Administration. Such a software entity is required to run for in order
to initialize the InfiniBand hardware (at least one per each
InfiniBand subnet).
opensm also now contains an experimental version of a performance
manager as well.
opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
fabric, initialize it, and sweep occasionally for changes.
opensm attaches to a specific IB port on the local machine and configures only
the fabric connected to it. (If the local machine has other IB ports,
opensm will ignore the fabrics connected to those other ports). If no port is
specified, it will select the first "best" available port.
opensm can present the available ports and prompt for a port number to
attach to.
By default, the run is logged to two files: /var/log/messages and /var/log/opensm.log.
The first file will register only general major events, whereas the second
will include details of reported errors. All errors reported in this second
file should be treated as indicators of IB fabric health issues.
(Note that when a fatal and non-recoverable error occurs, opensm will exit.)
Both log files should include the message "SUBNET UP" if opensm was able to
setup the subnet correctly.
.SH OPTIONS
.PP
.TP
\fB\-\-version\fR
Prints OpenSM version and exits.
.TP
\fB\-F\fR, \fB\-\-config\fR <config file>
The name of the OpenSM config file. When not specified
\fB\% @OPENSM_CONFIG_DIR@/@OPENSM_CONFIG_FILE@\fP will be used (if exists).
.TP
\fB\-c\fR, \fB\-\-create-config\fR <file name>
OpenSM will dump its configuration to the specified file and exit.
This is a way to generate OpenSM configuration file template.
.TP
\fB\-g\fR, \fB\-\-guid\fR <GUID in hex>
This option specifies the local port GUID value
with which OpenSM should bind. OpenSM may be
bound to 1 port at a time.
If GUID given is 0, OpenSM displays a list
of possible port GUIDs and waits for user input.
Without -g, OpenSM tries to use the default port.
.TP
\fB\-l\fR, \fB\-\-lmc\fR <LMC value>
This option specifies the subnet's LMC value.
The number of LIDs assigned to each port is 2^LMC.
The LMC value must be in the range 0-7.
LMC values > 0 allow multiple paths between ports.
LMC values > 0 should only be used if the subnet
topology actually provides multiple paths between
ports, i.e. multiple interconnects between switches.
Without -l, OpenSM defaults to LMC = 0, which allows
one path between any two ports.
.TP
\fB\-p\fR, \fB\-\-priority\fR <Priority value>
This option specifies the SM\'s PRIORITY.
This will effect the handover cases, where master
is chosen by priority and GUID. Range goes from 0
(default and lowest priority) to 15 (highest).
.TP
\fB\-smkey\fR <SM_Key value>
This option specifies the SM\'s SM_Key (64 bits).
This will effect SM authentication.
Note that OpenSM version 3.2.1 and below used the default value '1'
in a host byte order, it is fixed now but you may need this option to
interoperate with old OpenSM running on a little endian machine.
.TP
\fB\-r\fR, \fB\-\-reassign_lids\fR
This option causes OpenSM to reassign LIDs to all
end nodes. Specifying -r on a running subnet
may disrupt subnet traffic.
Without -r, OpenSM attempts to preserve existing
LID assignments resolving multiple use of same LID.
.TP
\fB\-R\fR, \fB\-\-routing_engine\fR <Routing engine names>
This option chooses routing engine(s) to use instead of Min Hop
algorithm (default). Multiple routing engines can be specified
separated by commas so that specific ordering of routing algorithms
will be tried if earlier routing engines fail.
Supported engines: minhop, updn, file, ftree, lash, dor
.TP
\fB\-A\fR, \fB\-\-ucast_cache\fR
This option enables unicast routing cache and prevents routing
recalculation (which is a heavy task in a large cluster) when
there was no topology change detected during the heavy sweep, or
when the topology change does not require new routing calculation,
e.g. when one or more CAs/RTRs/leaf switches going down, or one or
more of these nodes coming back after being down.
A very common case that is handled by the unicast routing cache
is host reboot, which otherwise would cause two full routing
recalculations: one when the host goes down, and the other when
the host comes back online.
.TP
\fB\-z\fR, \fB\-\-connect_roots\fR
This option enforces a routing engine (currently up/down
only) to make connectivity between root switches and in
this way to be fully IBA complaint. In many cases this can
violate "pure" deadlock free algorithm, so use it carefully.
.TP
\fB\-M\fR, \fB\-\-lid_matrix_file\fR <file name>
This option specifies the name of the lid matrix dump file
from where switch lid matrices (min hops tables will be
loaded.
.TP
\fB\-U\fR, \fB\-\-lfts_file\fR <file name>
This option specifies the name of the LFTs file
from where switch forwarding tables will be loaded.
.TP
\fB\-S\fR, \fB\-\-sadb_file\fR <file name>
This option specifies the name of the SA DB dump file
from where SA database will be loaded.
.TP
\fB\-a\fR, \fB\-\-root_guid_file\fR <file name>
Set the root nodes for the Up/Down or Fat-Tree routing
algorithm to the guids provided in the given file (one to a line).
.TP
\fB\-u\fR, \fB\-\-cn_guid_file\fR <file name>
Set the compute nodes for the Fat-Tree routing algorithm
to the guids provided in the given file (one to a line).
.TP
\fB\-m\fR, \fB\-\-ids_guid_file\fR <file name>
Name of the map file with set of the IDs which will be used
by Up/Down routing algorithm instead of node GUIDs
(format: <guid> <id> per line).
.TP
\fB\-X\fR, \fB\-\-guid_routing_order_file\fR <file name>
Set the order port guids will be routed for the MinHop
and Up/Down routing algorithms to the guids provided in the
given file (one to a line).
.TP
\fB\-o\fR, \fB\-\-once\fR
This option causes OpenSM to configure the subnet
once, then exit. Ports remain in the ACTIVE state.
.TP
\fB\-s\fR, \fB\-\-sweep\fR <interval value>
This option specifies the number of seconds between
subnet sweeps. Specifying -s 0 disables sweeping.
Without -s, OpenSM defaults to a sweep interval of
10 seconds.
.TP
\fB\-t\fR, \fB\-\-timeout\fR <value>
This option specifies the time in milliseconds
used for transaction timeouts.
Specifying -t 0 disables timeouts.
Without -t, OpenSM defaults to a timeout value of
200 milliseconds.
.TP
\fB\-maxsmps\fR <number>
This option specifies the number of VL15 SMP MADs
allowed on the wire at any one time.
Specifying -maxsmps 0 allows unlimited outstanding
SMPs.
Without -maxsmps, OpenSM defaults to a maximum of
4 outstanding SMPs.
.TP
\fB\-console [off | local | socket | loopback]\fR
This option brings up the OpenSM console (default off).
Note that the socket and loopback options will only be available
if OpenSM was built with --enable-console-socket.
.TP
\fB\-console-port\fR <port>
Specify an alternate telnet port for the socket console (default 10000).
Note that this option only appears if OpenSM was built with
--enable-console-socket.
.TP
\fB\-i\fR, \fB\-ignore-guids\fR <equalize-ignore-guids-file>
This option provides the means to define a set of ports
(by node guid and port number) that will be ignored by the link load
equalization algorithm.
.TP
\fB\-x\fR, \fB\-\-honor_guid2lid\fR
This option forces OpenSM to honor the guid2lid file,
when it comes out of Standby state, if such file exists
under OSM_CACHE_DIR, and is valid.
By default, this is FALSE.
.TP
\fB\-f\fR, \fB\-\-log_file\fR <file name>
This option defines the log to be the given file.
By default, the log goes to /var/log/opensm.log.
For the log to go to standard output use -f stdout.
.TP
\fB\-L\fR, \fB\-\-log_limit\fR <size in MB>
This option defines maximal log file size in MB. When
specified the log file will be truncated upon reaching
this limit.
.TP
\fB\-e\fR, \fB\-\-erase_log_file\fR
This option will cause deletion of the log file
(if it previously exists). By default, the log file
is accumulative.
.TP
\fB\-P\fR, \fB\-\-Pconfig\fR <partition config file>
This option defines the optional partition configuration file.
The default name is \fB\%@OPENSM_CONFIG_DIR@/@PARTITION_CONFIG_FILE@\fP.
.TP
\fB\-\-prefix_routes_file\fR <file name>
Prefix routes control how the SA responds to path record queries for
off-subnet DGIDs. By default, the SA fails such queries. The
.B PREFIX ROUTES
section below describes the format of the configuration file.
The default path is \fB\%@OPENSM_CONFIG_DIR@/prefix\-routes.conf\fP.
.TP
\fB\-Q\fR, \fB\-\-qos\fR
This option enables QoS setup. It is disabled by default.
.TP
\fB\-Y\fR, \fB\-\-qos_policy_file\fR <file name>
This option defines the optional QoS policy file. The default
name is \fB\%@OPENSM_CONFIG_DIR@/@QOS_POLICY_FILE@\fP.
.TP
\fB\-N\fR, \fB\-\-no_part_enforce\fR
This option disables partition enforcement on switch external ports.
.TP
\fB\-y\fR, \fB\-\-stay_on_fatal\fR
This option will cause SM not to exit on fatal initialization
issues: if SM discovers duplicated guids or a 12x link with
lane reversal badly configured.
By default, the SM will exit on these errors.
.TP
\fB\-B\fR, \fB\-\-daemon\fR
Run in daemon mode - OpenSM will run in the background.
.TP
\fB\-I\fR, \fB\-\-inactive\fR
Start SM in inactive rather than init SM state. This
option can be used in conjunction with the perfmgr so as to
run a standalone performance manager without SM/SA. However,
this is NOT currently implemented in the performance manager.
.TP
\fB\-perfmgr\fR
Enable the perfmgr. Only takes effect if --enable-perfmgr was specified at
configure time.
.TP
\fB\-perfmgr_sweep_time_s\fR <seconds>
Specify the sweep time for the performance manager in seconds
(default is 180 seconds). Only takes
effect if --enable-perfmgr was specified at configure time.
.TP
.BI --consolidate_ipv6_snm_req
Consolidate IPv6 Solicited Node Multicast group join requests into one
multicast group per MGID PKey.
.TP
\fB\-v\fR, \fB\-\-verbose\fR
This option increases the log verbosity level.
The -v option may be specified multiple times
to further increase the verbosity level.
See the -D option for more information about
log verbosity.
.TP
\fB\-V\fR
This option sets the maximum verbosity level and
forces log flushing.
The -V option is equivalent to \'-D 0xFF -d 2\'.
See the -D option for more information about
log verbosity.
.TP
\fB\-D\fR <value>
This option sets the log verbosity level.
A flags field must follow the -D option.
A bit set/clear in the flags enables/disables a
specific log level as follows:
BIT LOG LEVEL ENABLED
---- -----------------
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x04 - VERBOSE (interesting stuff, moderate volume)
0x08 - DEBUG (diagnostic, high volume)
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - currently unused.
Without -D, OpenSM defaults to ERROR + INFO (0x3).
Specifying -D 0 disables all messages.
Specifying -D 0xFF enables all messages (see -V).
High verbosity levels may require increasing
the transaction timeout with the -t option.
.TP
\fB\-d\fR, \fB\-\-debug\fR <value>
This option specifies a debug option.
These options are not normally needed.
The number following -d selects the debug
option to enable as follows:
OPT Description
--- -----------------
-d0 - Ignore other SM nodes
-d1 - Force single threaded dispatching
-d2 - Force log flushing after each log message
-d3 - Disable multicast support
.TP
\fB\-h\fR, \fB\-\-help\fR
Display this usage info then exit.
.TP
\fB\-?\fR
Display this usage info then exit.
.SH ENVIRONMENT VARIABLES
.PP
The following environment variables control opensm behavior:
OSM_TMP_DIR - controls the directory in which the temporary files generated by
opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and
opensm.mcfdbs. By default, this directory is /var/log.
OSM_CACHE_DIR - opensm stores certain data to the disk such that subsequent
runs are consistent. The default directory used is /var/cache/opensm.
The following file is included in it:
guid2lid - stores the LID range assigned to each GUID
.SH NOTES
.PP
When opensm receives a HUP signal, it starts a new heavy sweep as if a trap was received or a topology change was found.
.PP
Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for
logrotate purposes.
.SH PARTITION CONFIGURATION
.PP
The default name of OpenSM partitions configuration file is
\fB\%@OPENSM_CONFIG_DIR@/@PARTITION_CONFIG_FILE@\fP. The default may be changed by using
--Pconfig (-P) option with OpenSM.
The default partition will be created by OpenSM unconditionally even
when partition configuration file does not exist or cannot be accessed.
The default partition has P_Key value 0x7fff. OpenSM\'s port will have
full membership in default partition. All other end ports will have
partial membership.
File Format
Comments:
Line content followed after \'#\' character is comment and ignored by
parser.
General file format:
<Partition Definition>:<PortGUIDs list> ;
Partition Definition:
[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
PartitionName - string, will be used with logging. When omitted
empty string will be used.
PKey - P_Key value for this partition. Only low 15 bits will
be used. When omitted will be autogenerated.
flag - used to indicate IPoIB capability of this partition.
defmember=full|limited - specifies default membership for port guid
list. Default is limited.
Currently recognized flags are:
ipoib - indicates that this partition may be used for IPoIB, as
result IPoIB capable MC group will be created.
rate=<val> - specifies rate for this IPoIB MC group
(default is 3 (10GBps))
mtu=<val> - specifies MTU for this IPoIB MC group
(default is 4 (2048))
sl=<val> - specifies SL for this IPoIB MC group
(default is 0)
scope=<val> - specifies scope for this IPoIB MC group
(default is 2 (link local)). Multiple scope settings
are permitted for a partition.
Note that values for rate, mtu, and scope should be specified as
defined in the IBTA specification (for example, mtu=4 for 2048).
PortGUIDs list:
PortGUID - GUID of partition member EndPort. Hexadecimal
numbers should start from 0x, decimal numbers
are accepted too.
full or limited - indicates full or limited membership for this
port. When omitted (or unrecognized) limited
membership is assumed.
There are two useful keywords for PortGUID definition:
- 'ALL' means all end ports in this subnet.
- 'SELF' means subnet manager's port.
Empty list means no ports in this partition.
Notes:
White space is permitted between delimiters ('=', ',',':',';').
The line can be wrapped after ':' followed after Partition Definition and
between.
PartitionName does not need to be unique, PKey does need to be unique.
If PKey is repeated then those partition configurations will be merged
and first PartitionName will be used (see also next note).
It is possible to split partition configuration in more than one
definition, but then PKey should be explicitly specified (otherwise
different PKey values will be generated for those definitions).
Examples:
Default=0x7fff : ALL, SELF=full ;
NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
YetAnotherOne = 0x300 : SELF=full ;
YetAnotherOne = 0x300 : ALL=limited ;
ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
# 0x123453, 0x123454 will be limited
ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
# 0x123456, 0x123457 will be limited
ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
Note:
The following rule is equivalent to how OpenSM used to run prior to the
partition manager:
Default=0x7fff,ipoib:ALL=full;
.SH QOS CONFIGURATION
.PP
There are a set of QoS related low-level configuration parameters.
All these parameter names are prefixed by "qos_" string. Here is a full
list of these parameters:
qos_max_vls - The maximum number of VLs that will be on the subnet
qos_high_limit - The limit of High Priority component of VL
Arbitration table (IBA 7.6.9)
qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
template
qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
template
Both VL arbitration templates are pairs of
VL and weight
qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
a list of VLs corresponding to SLs 0-15 (Note
that VL15 used here means drop this SL)
Typical default values (hard-coded in OpenSM initialization) are:
qos_max_vls 15
qos_high_limit 0
qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
The syntax is compatible with rest of OpenSM configuration options and
values may be stored in OpenSM config file (cached options file).
In addition to the above, we may define separate QoS configuration
parameters sets for various target types. As targets, we currently support
CAs, routers, switch external ports, and switch's enhanced port 0. The
names of such specialized parameters are prefixed by "qos_<type>_"
string. Here is a full list of the currently supported sets:
qos_ca_ - QoS configuration parameters set for CAs.
qos_rtr_ - parameters set for routers.
qos_sw0_ - parameters set for switches' port 0.
qos_swe_ - parameters set for switches' external ports.
Examples:
qos_sw0_max_vls=2
qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
qos_swe_high_limit=0
.SH PREFIX ROUTES
.PP
Prefix routes control how the SA responds to path record queries for
off-subnet DGIDs. By default, the SA fails such queries.
Note that IBA does not specify how the SA should obtain off-subnet path
record information.
The prefix routes configuration is meant as a stop-gap until the
specification is completed.
.PP
Each line in the configuration file is a 64-bit prefix followed by a
64-bit GUID, separated by white space.
The GUID specifies the router port on the local subnet that will
handle the prefix.
Blank lines are ignored, as is anything between a \fB#\fP character
and the end of the line.
The prefix and GUID are both in hex, the leading 0x is optional.
Either, or both, can be wild-carded by specifying an
asterisk instead of an explicit prefix or GUID.
.PP
When responding to a path record query for an off-subnet DGID,
opensm searches for the first prefix match in the configuration file.
Therefore, the order of the lines in the configuration file is important:
a wild-carded prefix at the beginning of the configuration file renders
all subsequent lines useless.
If there is no match, then opensm fails the query.
It is legal to repeat prefixes in the configuration file,
opensm will return the path to the first available matching router.
A configuration file with a single line where both prefix and GUID
are wild-carded means that a path record query specifying any
off-subnet DGID should return a path to the first available router.
This configuration yields the same behaviour formerly achieved by
compiling opensm with -DROUTER_EXP.
.SH ROUTING
.PP
OpenSM now offers five routing engines:
1. Min Hop Algorithm - based on the minimum hops to each node where the
path length is optimized.
2. UPDN Unicast routing algorithm - also based on the minimum hops to each
node, but it is constrained to ranking rules. This algorithm should be chosen
if the subnet is not a pure Fat Tree, and deadlock may occur due to a
loop in the subnet.
3. Fat Tree Unicast routing algorithm - this algorithm optimizes routing
for congestion-free "shift" communication pattern.
It should be chosen if a subnet is a symmetrical or almost symmetrical
fat-tree of various types, not just K-ary-N-Trees: non-constant K, not
fully staffed, any Constant Bisectional Bandwidth (CBB) ratio.
Similar to UPDN, Fat Tree routing is constrained to ranking rules.
4. LASH unicast routing algorithm - uses Infiniband virtual layers
(SL) to provide deadlock-free shortest-path routing while also
distributing the paths between layers. LASH is an alternative
deadlock-free topology-agnostic routing algorithm to the non-minimal
UPDN algorithm avoiding the use of a potentially congested root node.
5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
avoids port equalization except for redundant links between the same
two switches. This provides deadlock free routes for hypercubes when
the fabric is cabled as a hypercube and for meshes when cabled as a
mesh (see details below).
OpenSM also supports a file method which
can load routes from a table. See \'Modular Routing Engine\' for more
information on this.
The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation
How many hops are required to get from each port to each LID ?
The algorithm to fill these tables is different if you run standard
(min hop) or Up/Down.
For standard routing, a "relaxation" algorithm is used to propagate
min hop from every destination LID through neighbor switches
For Up/Down routing, a BFS from every target is used. The BFS tracks link
direction (up or down) and avoid steps that will perform up after a down
step was used.
2. Once MinHop matrices exist, each switch is visited and for each target LID a
decision is made as to what port should be used to get to that LID.
This step is common to standard and Up/Down routing. Each port has a
counter counting the number of target LIDs going through it.
When there are multiple alternative ports with same MinHop to a LID,
the one with less previously assigned ports is selected.
If LMC > 0, more checks are added: Within each group of LIDs assigned to
same target port,
a. use only ports which have same MinHop
b. first prefer the ones that go to different systemImageGuid (then
the previous LID of the same LMC group)
c. if none - prefer those which go through another NodeGuid
d. fall back to the number of paths method (if all go to same node).
Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no change in
the fabric switches unless the -r (--reassign_lids) option is specified.
-r
.br
--reassign_lids
This option causes OpenSM to reassign LIDs to all
end nodes. Specifying -r on a running subnet
may disrupt subnet traffic.
Without -r, OpenSM attempts to preserve existing
LID assignments resolving multiple use of same LID.
If a link is added or removed, OpenSM does not recalculate
the routes that do not have to change. A route has to change
if the port is no longer UP or no longer the MinHop. When routing changes
are performed, the same algorithm for balancing the routes is invoked.
In the case of using the file based routing, any topology changes are
currently ignored The 'file' routing engine just loads the LFTs from the file
specified, with no reaction to real topology. Obviously, this will not be able
to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent
switches will be skipped. Multicast is not affected by 'file' routing engine
(this uses min hop tables).
Min Hop Algorithm
The Min Hop algorithm is invoked by default if no routing algorithm is
specified. It can also be invoked by specifying '-R minhop'.
The Min Hop algorithm is divided into two stages: computation of
min-hop tables on every switch and LFT output port assignment. Link
subscription is also equalized with the ability to override based on
port GUID. The latter is supplied by:
-i <equalize-ignore-guids-file>
.br
-ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports
(by guid) that will be ignored by the link load
equalization algorithm. Note that only endports (CA,
switch port 0, and router ports) and not switch external
ports are supported.
LMC awareness routes based on (remote) system or switch basis.
Purpose of UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in loops
of the subnet. A loop-deadlock is a situation in which it is no longer
possible to send data between any two hosts connected through the loop. As
such, the UPDN routing algorithm should be used if the subnet is not a pure
Fat Tree, and one of its loops may experience a deadlock (due, for example,
to high pressure).
The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes - based on the CA hop length from any switch in
the subnet, a statistical histogram is built for each switch (hop num vs
number of occurrences). If the histogram reflects a specific column (higher
than others) for a certain node, then it is marked as a root node. Since
the algorithm is statistical, it may not find any root nodes. The list of
the root nodes found by this auto-detect stage is used by the ranking
process stage.
Note 1: The user can override the node list manually.
Note 2: If this stage cannot find any root nodes, and the user did
not specify a guid list file, OpenSM defaults back to the
Min Hop routing algorithm.
2. Ranking process - All root switch nodes (found in stage 1) are assigned
a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the
subnet are ranked incrementally. This ranking aids in the process of enforcing
rules that ensure loop-free paths.
3. Min Hop Table setting - after ranking is done, a BFS algorithm is run from
each (CA or switch) node in the subnet. During the BFS process, the FDB table
of each switch node traversed by BFS is updated, in reference to the starting
node, based on the ranking rules and guid values.
At the end of the process, the updated FDB tables ensure loop-free paths
through the subnet.
Note: Up/Down routing does not allow LID routing communication between
switches that are located inside spine "switch systems".
The reason is that there is no way to allow a LID route between them
that does not break the Up/Down rule.
One ramification of this is that you cannot run SM on switches other
than the leaf switches of the fabric.
UPDN Algorithm Usage
Activation through OpenSM
Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
Use '-a <root_guid_file>' for adding an UPDN guid file that contains the
root nodes for ranking.
If the `-a' option is not used, OpenSM uses its auto-detect root nodes
algorithm.
Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid
format will be discarded.
.br
2. The user should specify the root switch guids. However, it is also
possible to specify CA guids; OpenSM will use the guid of the switch (if
it exists) that connects the CA to the subnet as a root node.
Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for "shift" communication pattern.
It should be chosen if a subnet is a symmetrical or almost symmetrical
fat-tree of various types.
It supports not just K-ary-N-Trees, by handling for non-constant K,
cases where not all leafs (CAs) are present, any CBB ratio.
As in UPDN, fat-tree also prevents credit-loop-deadlocks.
If the root guid file is not provided ('-a' or '--root_guid_file' options),
the topology has to be pure fat-tree that complies with the following rules:
- Tree rank should be between two and eight (inclusively)
- Switches of the same rank should have the same number
of UP-going port groups*, unless they are root switches,
in which case the shouldn't have UP-going ports at all.
- Switches of the same rank should have the same number
of DOWN-going port groups, unless they are leaf switches.
- Switches of the same rank should have the same number
of ports in each UP-going port group.
- Switches of the same rank should have the same number
of ports in each DOWN-going port group.
- All the CAs have to be at the same tree level (rank).
If the root guid file is provided, the topology doesn't have to be pure
fat-tree, and it should only comply with the following rules:
- Tree rank should be between two and eight (inclusively)
- All the Compute Nodes** have to be at the same tree level (rank).
Note that non-compute node CAs are allowed here to be at different
tree ranks.
* ports that are connected to the same remote switch are referenced as
\'port group\'.
** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\'
OpenSM options.
Topologies that do not comply cause a fallback to min hop routing.
Note that this can also occur on link failures which cause the topology
to no longer be "pure" fat-tree.
Note that although fat-tree algorithm supports trees with non-integer CBB
ratio, the routing will not be as balanced as in case of integer CBB ratio.
In addition to this, although the algorithm allows leaf switches to have any
number of CAs, the closer the tree is to be fully populated, the more
effective the "shift" communication pattern will be.
In general, even if the root list is provided, the closer the topology to a
pure and symmetrical fat-tree, the more optimal the routing will be.
The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
in the same directory where the OpenSM log resides. This ordering file provides
the CN order that may be used to create efficient communication pattern, that
will match the routing tables.
Activation through OpenSM
Use '-R ftree' option to activate the fat-tree algorithm.
Use '-a <root_guid_file>' to provide root nodes for ranking. If the `-a' option
is not used, routing algorithm will detect roots automatically.
Use '-u <root_cn_file>' to provide the list of compute nodes. If the `-u' option
is not used, all the CAs are considered as compute nodes.
Note: LMC > 0 is not supported by fat-tree routing. If this is
specified, the default routing algorithm is invoked instead.
LASH Routing Algorithm
LASH is an acronym for LAyered SHortest Path Routing. It is a
deterministic shortest path routing algorithm that enables topology
agnostic deadlock-free routing within communication networks.
When computing the routing function, LASH analyzes the network
topology for the shortest-path routes between all pairs of sources /
destinations and groups these paths into virtual layers in such a way
as to avoid deadlock.
Note LASH analyzes routes and ensures deadlock freedom between switch
pairs. The link from HCA between and switch does not need virtual
layers as deadlock will not arise between switch and HCA.
In more detail, the algorithm works as follows:
1) LASH determines the shortest-path between all pairs of source /
destination switches. Note, LASH ensures the same SL is used for all
SRC/DST - DST/SRC pairs and there is no guarantee that the return
path for a given DST/SRC will be the reverse of the route SRC/DST.
2) LASH then begins an SL assignment process where a route is assigned
to a layer (SL) if the addition of that route does not cause deadlock
within that layer. This is achieved by maintaining and analysing a
channel dependency graph for each layer. Once the potential addition
of a path could lead to deadlock, LASH opens a new layer and continues
the process.
3) Once this stage has been completed, it is highly likely that the
first layers processed will contain more paths than the latter ones.
To better balance the use of layers, LASH moves paths from one layer
to another so that the number of paths in each layer averages out.
Note, the implementation of LASH in opensm attempts to use as few layers
as possible. This number can be less than the number of actual layers
available.
In general LASH is a very flexible algorithm. It can, for example,
reduce to Dimension Order Routing in certain topologies, it is topology
agnostic and fares well in the face of faults.
It has been shown that for both regular and irregular topologies, LASH
outperforms Up/Down. The reason for this is that LASH distributes the
traffic more evenly through a network, avoiding the bottleneck issues
related to a root node and always routes shortest-path.
The algorithm was developed by Simula Research Laboratory.
Use '-R lash -Q ' option to activate the LASH algorithm.
Note: QoS support has to be turned on in order that SL/VL mappings are
used.
Note: LMC > 0 is not supported by the LASH routing. If this is
specified, the default routing algorithm is invoked instead.
DOR Routing Algorithm
The Dimension Order Routing algorithm is based on the Min Hop
algorithm and so uses shortest paths. Instead of spreading traffic
out across different paths with the same shortest distance, it chooses
among the available shortest paths based on an ordering of dimensions.
Each port must be consistently cabled to represent a hypercube
dimension or a mesh dimension. Paths are grown from a destination
back to a source using the lowest dimension (port) of available paths
at each step. This provides the ordering necessary to avoid deadlock.
When there are multiple links between any two switches, they still
represent only one dimension and traffic is balanced across them
unless port equalization is turned off. In the case of hypercubes,
the same port must be used throughout the fabric to represent the
hypercube dimension and match on both ends of the cable. In the case
of meshes, the dimension should consistently use the same pair of
ports, one port on one end of the cable, and the other port on the
other end, continuing along the mesh dimension.
Use '-R dor' option to activate the DOR algorithm.
Routing References
To learn more about deadlock-free routing, see the article
"Deadlock Free Message Routing in Multiprocessor Interconnection Networks"
by William J Dally and Charles L Seitz (1985).
To learn more about the up/down algorithm, see the article
"Effective Strategy to Compute Forwarding Tables for InfiniBand Networks"
by Jose Carlos Sancho, Antonio Robles, and Jose Duato at the
Universidad Politecnica de Valencia.
To learn more about LASH and the flexibility behind it, the requirement
for layers, performance comparisons to other algorithms, see the
following articles:
"Layered Routing in Irregular Networks", Lysne et al, IEEE
Transactions on Parallel and Distributed Systems, VOL.16, No12,
December 2005.
"Routing for the ASI Fabric Manager", Solheim et al. IEEE
Communications Magazine, Vol.44, No.7, July 2006.
"Layered Shortest Path (LASH) Routing in Irregular System Area
Networks", Skeie et al. IEEE Computer Society Communication
Architecture for Clusters 2002.
Modular Routine Engine
Modular routing engine structure allows for the ease of
"plugging" new routing modules.
Currently, only unicast callbacks are supported. Multicast
can be added later.
One existing routing module is up-down "updn", which may be
activated with '-R updn' option (instead of old '-u').
General usage is:
$ opensm -R 'module-name'
There is also a trivial routing module which is able
to load LFT tables from a file.
Main features:
- this will load switch LFTs and/or LID matrices (min hops tables)
- this will load switch LFTs according to the path entries introduced
in the file
- no additional checks will be performed (such as "is port connected",
etc.)
- in case when fabric LIDs were changed this will try to reconstruct
LFTs correctly if endport GUIDs are represented in the file
(in order to disable this, GUIDs may be removed from the file
or zeroed)
The file format is compatible with output of 'ibroute' util and for
whole fabric can be generated with dump_lfts.sh script.
To activate file based routing module, use:
opensm -R file -U /path/to/lfts_file
If the lfts_file is not found or is in error, the default routing
algorithm is utilized.
The ability to dump switch lid matrices (aka min hops tables) to file and
later to load these is also supported.
The usage is similar to unicast forwarding tables loading from a lfts
file (introduced by 'file' routing engine), but new lid matrix file
name should be specified by -M or --lid_matrix_file option. For example:
opensm -R file -M ./opensm-lid-matrix.dump
The dump file is named \'opensm-lid-matrix.dump\' and will be generated
in standard opensm dump directory (/var/log by default) when
OSM_LOG_ROUTING logging flag is set.
When routing engine 'file' is activated, but the lfts file is not specified
or not cannot be open default lid matrix algorithm will be used.
There is also a switch forwarding tables dumper which generates
a file compatible with dump_lfts.sh output. This file can be used
as input for forwarding tables loading by 'file' routing engine.
Both or one of options -U and -M can be specified together with \'-R file\'.
.SH FILES
.TP
.B @OPENSM_CONFIG_DIR@/@OPENSM_CONFIG_FILE@
default OpenSM config file.
.TP
.B @OPENSM_CONFIG_DIR@/@NODENAMEMAPFILE@
default node name map file. See ibnetdiscover for more information on format.
.TP
.B @OPENSM_CONFIG_DIR@/@PARTITION_CONFIG_FILE@
default partition config file
.TP
.B @OPENSM_CONFIG_DIR@/@QOS_POLICY_FILE@
default QOS policy config file
.TP
.B @OPENSM_CONFIG_DIR@/@PREFIX_ROUTES_FILE@
default prefix routes file.
.SH AUTHORS
.TP
Hal Rosenstock
.RI < hal.rosenstock@gmail.com >
.TP
Sasha Khapyorsky
.RI < sashak@voltaire.com >
.TP
Eitan Zahavi
.RI < eitan@mellanox.co.il >
.TP
Yevgeny Kliteynik
.RI < kliteyn@mellanox.co.il >
.TP
Thomas Sodring
.RI < tsodring@simula.no >
.TP
Ira Weiny
.RI < weiny2@llnl.gov >