NAME

       opensm - InfiniBand subnet manager and administration (SM/SA)

SYNOPSIS

       opensm  [--version]  [-F  |  --config  <file_name>]  [-c(reate-config)
       <file_name>]  [-g(uid)  <GUID  in  hex>]  [-l(mc)  <LMC>]  [-p(riority)
       <PRIORITY>] [-smkey <SM_Key>] [-r(eassign_lids)] [-R <engine name(s)> |
       --routing_engine  <engine  name(s)>]  [-A  |   --ucast_cache]   [-z   |
       --connect_roots]  [-M  <file name> | --lid_matrix_file <file name>] [-U
       <file name> | --lfts_file <file name>] [-S | --sadb_file  <file  name>]
       [-a  |  --root_guid_file  <path to file>] [-u | --cn_guid_file <path to
       file>]  [-X  |  --guid_routing_order_file  <path  to   file>]   [-m   |
       --ids_guid_file   <path   to  file>]  [-o(nce)]  [-s(weep)  <interval>]
       [-t(imeout) <milliseconds>] [-maxsmps <number>] [-console [off |  local
       |   socket   |   loopback]]   [-console-port  <port>]  [-i(gnore-guids)
       <equalize-ignore-guids-file>] [-f <log file  path>  |  --log_file  <log
       file  path>  ]  [-L  |  --log_limit  <size  in MB>] [-e(rase_log_file)]
       [-P(config) <partition config file> ] [-N |  --no_part_enforce]  [-Q  |
       --qos  [-Y | --qos_policy_file <file name>]] [-y | --stay_on_fatal] [-B
       |  --daemon]  [-I  |  --inactive]  [--perfmgr]  [--perfmgr_sweep_time_s
       <seconds>]  [--prefix_routes_file  <path>] [--consolidate_ipv6_snm_req]
       [-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]

DESCRIPTION

       opensm is an InfiniBand compliant Subnet  Manager  and  Administration,
       and runs on top of OpenIB.

       opensm provides an implementation of an InfiniBand Subnet Manager and
       Administration. Such a software entity is required to run in order to
       initialize the InfiniBand hardware (at least one per InfiniBand
       subnet).

       opensm also contains an experimental version of a performance manager.

       opensm defaults were designed to meet the common case usage on clusters
       with up to a few hundred nodes. Thus, in this default mode, opensm will
       scan  the IB fabric, initialize it, and sweep occasionally for changes.

       opensm attaches to  a  specific  IB  port  on  the  local  machine  and
       configures  only  the fabric connected to it. (If the local machine has
       other IB ports, opensm will ignore the fabrics connected to those other
       ports).  If  no  port  is  specified,  it  will select the first "best"
       available port.

       opensm can present the available ports and prompt for a port number  to
       attach to.

       By  default,  the  run  is  logged  to two files: /var/log/messages and
       /var/log/opensm.log.  The first file will register only  general  major
       events, whereas the second will include details of reported errors. All
       errors reported in this second file should be treated as indicators  of
       IB  fabric  health issues.  (Note that when a fatal and non-recoverable
       error occurs, opensm will exit.)  Both log  files  should  include  the
       message "SUBNET UP" if opensm was able to set up the subnet correctly.

OPTIONS

       --version
              Prints OpenSM version and exits.

       -F, --config <config file>
              The name of the OpenSM config file. When not specified,
              /etc/opensm/opensm.conf will be used (if it exists).

       -c, --create-config <file name>
              OpenSM will dump its configuration to the specified file and
              exit.  This is a way to generate an OpenSM configuration file
              template.

       -g, --guid <GUID in hex>
              This option specifies the  local  port  GUID  value  with  which
              OpenSM  should  bind.   OpenSM may be bound to 1 port at a time.
              If GUID given is 0, OpenSM displays  a  list  of  possible  port
              GUIDs and waits for user input.  Without -g, OpenSM tries to use
              the default port.

       -l, --lmc <LMC value>
              This option specifies the subnet’s LMC  value.   The  number  of
              LIDs  assigned  to each port is 2^LMC.  The LMC value must be in
              the range 0-7.  LMC values >  0  allow  multiple  paths  between
              ports.   LMC  values  >  0  should  only  be  used if the subnet
              topology actually provides multiple paths  between  ports,  i.e.
              multiple  interconnects  between  switches.   Without -l, OpenSM
              defaults to LMC = 0, which  allows  one  path  between  any  two
              ports.
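
       As an illustration of the 2^LMC rule above, the following Python
       sketch lists the LIDs one port would answer to. The function name
       lids_for_port is hypothetical, and the sketch assumes the base LID is
       aligned to a multiple of 2^LMC; it is not taken from OpenSM itself.

```python
def lids_for_port(base_lid: int, lmc: int) -> list[int]:
    """Return the LIDs assigned to one port for a given base LID and LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    count = 1 << lmc  # 2^LMC LIDs per port
    # Assumption: the base LID has its low LMC bits cleared.
    if base_lid % count:
        raise ValueError("base LID is not aligned to 2^LMC")
    return list(range(base_lid, base_lid + count))


if __name__ == "__main__":
    print(lids_for_port(8, 2))  # four consecutive LIDs starting at 8
```

       With LMC = 0 the list always has exactly one entry, which matches the
       single-path default.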

       -p, --priority <Priority value>
              This option specifies the SM´s PRIORITY.  This will affect the
              handover cases, where the master is chosen by priority and GUID.
              Range goes from 0 (default and lowest priority) to 15 (highest).

       -smkey <SM_Key value>
              This option specifies the SM´s SM_Key (64 bits).  This will
              affect SM authentication.  Note that OpenSM version 3.2.1 and
              below used the default value ’1’ in host byte order.  This is
              fixed now, but you may need this option to interoperate with an
              old OpenSM running on a little-endian machine.

       -r, --reassign_lids
              This option causes OpenSM to reassign LIDs  to  all  end  nodes.
              Specifying  -r  on  a running subnet may disrupt subnet traffic.
              Without -r, OpenSM attempts to preserve existing LID assignments
              resolving multiple use of same LID.

       -R, --routing_engine <Routing engine names>
              This  option chooses routing engine(s) to use instead of Min Hop
              algorithm (default).  Multiple routing engines can be  specified
              separated  by  commas  so  that  specific  ordering  of  routing
              algorithms will  be  tried  if  earlier  routing  engines  fail.
              Supported engines: minhop, updn, file, ftree, lash, dor

       -A, --ucast_cache
              This option enables the unicast routing cache and prevents
              routing recalculation (which is a heavy task in a large cluster)
              when no topology change was detected during the heavy sweep, or
              when the topology change does not require a new routing
              calculation, e.g. when one or more CAs/RTRs/leaf switches go
              down, or one or more of these nodes come back after being down.
              A very common case that is handled by the unicast routing cache
              is host reboot, which otherwise would cause two full routing
              recalculations: one when the host goes down, and the other when
              the host comes back online.

       -z, --connect_roots
              This option forces a routing engine (currently Up/Down only) to
              provide connectivity between root switches and in this way to
              be fully IBA compliant. In many cases this can violate the
              "pure" deadlock-free algorithm, so use it carefully.

       -M, --lid_matrix_file <file name>
              This option specifies the name of the lid matrix dump file from
              which switch lid matrices (min hops tables) will be loaded.

       -U, --lfts_file <file name>
              This option specifies the name  of  the  LFTs  file  from  where
              switch forwarding tables will be loaded.

       -S, --sadb_file <file name>
              This option specifies the name of the SA DB dump file from where
              SA database will be loaded.

       -a, --root_guid_file <file name>
              Set the root nodes for the Up/Down or Fat-Tree routing algorithm
              to the guids provided in the given file (one to a line).

       -u, --cn_guid_file <file name>
              Set  the compute nodes for the Fat-Tree routing algorithm to the
              guids provided in the given file (one to a line).

       -m, --ids_guid_file <file name>
              Name of the map file with set of the IDs which will be  used  by
              Up/Down  routing algorithm instead of node GUIDs (format: <guid>
              <id> per line).

       -X, --guid_routing_order_file <file name>
              Set the order port guids will  be  routed  for  the  MinHop  and
              Up/Down  routing  algorithms  to the guids provided in the given
              file (one to a line).

       -o, --once
              This option causes OpenSM to configure  the  subnet  once,  then
              exit.  Ports remain in the ACTIVE state.

       -s, --sweep <interval value>
              This  option  specifies  the  number  of  seconds between subnet
              sweeps.  Specifying -s 0 disables sweeping.  Without -s,  OpenSM
              defaults to a sweep interval of 10 seconds.

       -t, --timeout <value>
              This   option  specifies  the  time  in  milliseconds  used  for
              transaction  timeouts.   Specifying  -t  0  disables   timeouts.
              Without   -t,   OpenSM  defaults  to  a  timeout  value  of  200
              milliseconds.

       -maxsmps <number>
              This option specifies the number of VL15 SMP MADs allowed on the
              wire  at  any  one time.  Specifying -maxsmps 0 allows unlimited
              outstanding  SMPs.   Without  -maxsmps,  OpenSM  defaults  to  a
              maximum of 4 outstanding SMPs.

       -console [off | local | socket | loopback]
              This  option  brings  up the OpenSM console (default off).  Note
              that the socket and loopback options will only be  available  if
              OpenSM was built with --enable-console-socket.

       -console-port <port>
              Specify an alternate telnet port for the socket console (default
              10000).  Note that this option only appears if OpenSM was  built
              with --enable-console-socket.

       -i, -ignore-guids <equalize-ignore-guids-file>
              This option provides the means to define a set of ports (by node
              guid and port number) that will be  ignored  by  the  link  load
              equalization algorithm.

       -x, --honor_guid2lid
              This  option  forces  OpenSM to honor the guid2lid file, when it
              comes  out  of  Standby  state,  if  such  file   exists   under
              OSM_CACHE_DIR, and is valid.  By default, this is FALSE.

       -f, --log_file <file name>
              This  option  defines the log to be the given file.  By default,
              the log goes to /var/log/opensm.log.   For  the  log  to  go  to
              standard output use -f stdout.

       -L, --log_limit <size in MB>
              This  option defines maximal log file size in MB. When specified
              the log file will be truncated upon reaching this limit.

       -e, --erase_log_file
              This  option  will  cause  deletion  of  the  log  file  (if  it
              previously exists). By default, the log file is accumulative.

       -P, --Pconfig <partition config file>
              This  option  defines the optional partition configuration file.
              The default name is /etc/opensm/partitions.conf.

       --prefix_routes_file <file name>
              Prefix routes control how the SA responds to path record queries
              for  off-subnet  DGIDs.   By default, the SA fails such queries.
              The PREFIX ROUTES section below  describes  the  format  of  the
              configuration      file.       The      default      path     is
              /etc/opensm/prefix-routes.conf.

       -Q, --qos
              This option enables QoS setup. It is disabled by default.

       -Y, --qos_policy_file <file name>
              This option defines the optional QoS policy  file.  The  default
              name is /etc/opensm/qos-policy.conf.

       -N, --no_part_enforce
              This  option  disables  partition enforcement on switch external
              ports.

       -y, --stay_on_fatal
              This option will cause SM not to exit  on  fatal  initialization
              issues: if SM discovers duplicated guids or a 12x link with lane
              reversal badly configured.  By default,  the  SM  will  exit  on
              these errors.

       -B, --daemon
              Run in daemon mode - OpenSM will run in the background.

       -I, --inactive
              Start SM in inactive rather than init SM state.  This option can
              be used  in  conjunction  with  the  perfmgr  so  as  to  run  a
              standalone  performance manager without SM/SA.  However, this is
              NOT currently implemented in the performance manager.

       --perfmgr
              Enable the perfmgr.  Only takes effect if  --enable-perfmgr  was
              specified at configure time.

       --perfmgr_sweep_time_s <seconds>
              Specify  the  sweep  time for the performance manager in seconds
              (default is 180 seconds).  Only takes effect if --enable-perfmgr
              was specified at configure time.

       --consolidate_ipv6_snm_req
              Consolidate  IPv6  Solicited  Node Multicast group join requests
              into one multicast group per MGID PKey.

       -v, --verbose
              This option increases the log verbosity level.   The  -v  option
              may   be  specified  multiple  times  to  further  increase  the
              verbosity level.  See the -D option for more  information  about
              log verbosity.

       -V     This  option  sets  the  maximum  verbosity level and forces log
              flushing.  The -V option is equivalent to ´-D 0xFF -d  2´.   See
              the -D option for more information about log verbosity.

       -D <value>
              This  option  sets  the log verbosity level.  A flags field must
              follow  the  -D  option.   A  bit   set/clear   in   the   flags
              enables/disables a specific log level as follows:

               BIT    LOG LEVEL ENABLED
               ----   -----------------
               0x01 - ERROR (error messages)
               0x02 - INFO (basic messages, low volume)
               0x04 - VERBOSE (interesting stuff, moderate volume)
               0x08 - DEBUG (diagnostic, high volume)
               0x10 - FUNCS (function entry/exit, very high volume)
               0x20 - FRAMES (dumps all SMP and GMP frames)
               0x40 - ROUTING (dump FDB routing information)
               0x80 - currently unused.

              Without  -D,  OpenSM defaults to ERROR + INFO (0x3).  Specifying
              -D 0 disables all messages.   Specifying  -D  0xFF  enables  all
              messages (see -V).  High verbosity levels may require increasing
              the transaction timeout with the -t option.
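
       The flags value is a plain bitwise OR of the levels in the table
       above. The following Python sketch, with hypothetical helper names
       log_flags and describe, shows how a value for -D could be composed
       and decoded; it only mirrors the table and is not part of OpenSM.

```python
# Bit values copied from the -D table; 0x80 is currently unused.
LOG_LEVELS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
}


def log_flags(*levels: str) -> int:
    """OR together the bits for the named log levels."""
    flags = 0
    for name in levels:
        flags |= LOG_LEVELS[name]
    return flags


def describe(flags: int) -> list[str]:
    """List the log levels enabled by a flags value."""
    return [name for name, bit in LOG_LEVELS.items() if flags & bit]


if __name__ == "__main__":
    value = log_flags("ERROR", "INFO", "DEBUG")
    print(f"-D {value:#x} enables {describe(value)}")
```

       For example, the default of ERROR + INFO corresponds to
       log_flags("ERROR", "INFO"), which is 0x3.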

       -d, --debug <value>
              This option specifies a debug option.   These  options  are  not
              normally  needed.   The  number  following  -d selects the debug
              option to enable as follows:

               OPT   Description
               ---    -----------------
               -d0  - Ignore other SM nodes
               -d1  - Force single threaded dispatching
               -d2  - Force log flushing after each log message
               -d3  - Disable multicast support

       -h, --help
              Display this usage info then exit.

       -?     Display this usage info then exit.

ENVIRONMENT VARIABLES

       The following environment variables control opensm behavior:

       OSM_TMP_DIR - controls the  directory  in  which  the  temporary  files
       generated  by  opensm  are created. These files are: opensm-subnet.lst,
       opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.

       OSM_CACHE_DIR  -  opensm  stores  certain  data  to  the disk such that
       subsequent  runs  are  consistent.  The  default  directory   used   is
       /var/cache/opensm.  The following file is included in it:

        guid2lid - stores the LID range assigned to each GUID

NOTES

       When  opensm receives a HUP signal, it starts a new heavy sweep as if a
       trap was received or a topology change was found.

       Also, SIGUSR1 can be used to trigger a  reopen  of  /var/log/opensm.log
       for logrotate purposes.

PARTITION CONFIGURATION

       The   default   name   of   OpenSM  partitions  configuration  file  is
       /etc/opensm/partitions.conf.  The  default  may  be  changed  by  using
       --Pconfig (-P) option with OpenSM.

       The  default  partition  will be created by OpenSM unconditionally even
       when partition configuration file does not exist or cannot be accessed.

       The  default  partition has P_Key value 0x7fff. OpenSM´s port will have
       full membership in default partition. All other  end  ports  will  have
       partial membership.

       File Format

       Comments:

       Anything following a ´#´ character on a line is a comment and is
       ignored by the parser.

       General file format:

       <Partition Definition>:<PortGUIDs list> ;

       Partition Definition:

       [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]

        PartitionName - string, will be used with logging. When omitted
                        empty string will be used.
        PKey          - P_Key value for this partition. Only low 15 bits will
                        be used. When omitted will be autogenerated.
        flag          - used to indicate IPoIB capability of this partition.
        defmember=full|limited - specifies default membership for port guid
                        list. Default is limited.

       Currently recognized flags are:

        ipoib       - indicates that this partition may be used for IPoIB, as
                      result IPoIB capable MC group will be created.
        rate=<val>  - specifies rate for this IPoIB MC group
                      (default is 3 (10 Gb/s))
        mtu=<val>   - specifies MTU for this IPoIB MC group
                      (default is 4 (2048))
        sl=<val>    - specifies SL for this IPoIB MC group
                      (default is 0)
        scope=<val> - specifies scope for this IPoIB MC group
                      (default is 2 (link local)).  Multiple scope settings
                      are permitted for a partition.

       Note that values for rate,  mtu,  and  scope  should  be  specified  as
       defined in the IBTA specification (for example, mtu=4 for 2048).

       PortGUIDs list:

        PortGUID         - GUID of partition member EndPort. Hexadecimal
                           numbers should start from 0x, decimal numbers
                           are accepted too.
        full or limited  - indicates full or limited membership for this
                           port.  When omitted (or unrecognized) limited
                           membership is assumed.

       There are two useful keywords for PortGUID definition:

        - ’ALL’ means all end ports in this subnet.
        - ’SELF’ means subnet manager’s port.

       Empty list means no ports in this partition.

       Notes:

       White space is permitted between delimiters (’=’, ’,’,’:’,’;’).

       A line can be wrapped after the ’:’ that follows the Partition
       Definition and between entries of the PortGUIDs list.

       PartitionName does not need to be unique, PKey does need to be  unique.
       If  PKey is repeated then those partition configurations will be merged
       and first PartitionName will be used (see also next note).

       It is possible to  split  partition  configuration  in  more  than  one
       definition,  but  then  PKey  should be explicitly specified (otherwise
       different PKey values will be generated for those definitions).

       Examples:

        Default=0x7fff : ALL, SELF=full ;

        NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;

        YetAnotherOne = 0x300 : SELF=full ;
        YetAnotherOne = 0x300 : ALL=limited ;

        ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
        # 0x123453, 0x123454 will be limited
        ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
        # 0x123456, 0x123457 will be limited
        ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
        ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
        ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

       Note:

       The following rule is equivalent to how OpenSM used to run prior to the
       partition manager:

        Default=0x7fff,ipoib:ALL=full;
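
       To make the file format concrete, the following Python sketch parses
       a single, unwrapped partition line into its parts. It is a simplified
       reading of the grammar above, not the OpenSM parser: the function
       name parse_partition is hypothetical, a port with no membership is
       given the defmember value, and an unrecognized membership (such as
       "limi") is treated as limited.

```python
def parse_partition(line: str) -> dict:
    """Parse one '<Partition Definition>:<PortGUIDs list> ;' line."""
    # Drop comments, surrounding white space, and the terminating ';'.
    line = line.split("#", 1)[0].strip().rstrip(";").strip()
    definition, _, port_list = line.partition(":")
    head, *flags = [part.strip() for part in definition.split(",")]
    name, _, pkey = (part.strip() for part in head.partition("="))

    defmember = "limited"  # documented default membership
    for flag in flags:
        key, _, value = flag.partition("=")
        if key.strip() == "defmember":
            defmember = value.strip()

    members = {}
    for entry in port_list.split(","):
        guid, _, membership = (part.strip() for part in entry.partition("="))
        if not guid:
            continue  # empty list means no ports in this partition
        if not membership:
            membership = defmember
        elif membership not in ("full", "limited"):
            membership = "limited"
        members[guid] = membership

    return {
        "name": name,
        # Only the low 15 bits of the P_Key are used; None means autogenerated.
        "pkey": int(pkey, 0) & 0x7FFF if pkey else None,
        "flags": flags,
        "members": members,
    }


if __name__ == "__main__":
    print(parse_partition("Default=0x7fff : ALL, SELF=full ;"))
```

       The sketch does not merge repeated PKey definitions or handle lines
       wrapped across several physical lines.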

QOS CONFIGURATION

       There is a set of QoS related low-level configuration parameters.  All
       these parameter names are prefixed by the "qos_" string. Here is a
       full list of these parameters:

        qos_max_vls    - The maximum number of VLs that will be on the subnet
        qos_high_limit - The limit of High Priority component of VL
                         Arbitration table (IBA 7.6.9)
        qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
                         template
        qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
                         template
                         Both VL arbitration templates are pairs of
                         VL and weight
        qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
                         a list of VLs corresponding to SLs 0-15 (Note
                         that VL15 used here means drop this SL)

       Typical default values (hard-coded in OpenSM initialization) are:

        qos_max_vls 15
        qos_high_limit 0
        qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
        qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
        qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

       The syntax is compatible with the rest of the OpenSM configuration
       options, and values may be stored in the OpenSM config file (cached
       options file).
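
       The value formats above can be read mechanically. The following
       Python sketch, with hypothetical helper names parse_vlarb and
       parse_sl2vl, turns a VL arbitration template into (VL, weight) pairs
       and an SL2VL template into an SL-to-VL map in which VL15 is reported
       as None ("drop this SL"). It illustrates the syntax only and is not
       OpenSM code.

```python
from typing import Optional


def parse_vlarb(value: str) -> list[tuple[int, int]]:
    """Parse 'VL:weight,VL:weight,...' into a list of (VL, weight) pairs."""
    pairs = []
    for item in value.split(","):
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(value: str) -> dict[int, Optional[int]]:
    """Map SL index to VL; VL15 means the SL is dropped (None here)."""
    vls = [int(v) for v in value.split(",")]
    return {sl: (None if vl == 15 else vl) for sl, vl in enumerate(vls)}


if __name__ == "__main__":
    print(parse_vlarb("0:0,1:4,2:4"))
    print(parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7"))
```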

       In  addition  to  the  above,  we may define separate QoS configuration
       parameters sets for various target  types.  As  targets,  we  currently
       support CAs, routers, switch external ports, and switch’s enhanced port
       0.  The  names  of  such  specialized  parameters   are   prefixed   by
       "qos_<type>_"  string.  Here  is a full list of the currently supported
       sets:

        qos_ca_  - QoS configuration parameters set for CAs.
        qos_rtr_ - parameters set for routers.
        qos_sw0_ - parameters set for switches’ port 0.
        qos_swe_ - parameters set for switches’ external ports.

       Examples:
        qos_sw0_max_vls=2
        qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
        qos_swe_high_limit=0

PREFIX ROUTES

       Prefix routes control how the SA responds to path  record  queries  for
       off-subnet  DGIDs.   By  default, the SA fails such queries.  Note that
       IBA does not specify how the SA should obtain  off-subnet  path  record
       information.   The  prefix  routes configuration is meant as a stop-gap
       until the specification is completed.

       Each line in the configuration file is a 64-bit prefix  followed  by  a
       64-bit  GUID,  separated by white space.  The GUID specifies the router
       port on the local subnet that will handle the prefix.  Blank lines  are
       ignored,  as is anything between a # character and the end of the line.
       The prefix and GUID are both  in  hex,  the  leading  0x  is  optional.
       Either,  or  both, can be wild-carded by specifying an asterisk instead
       of an explicit prefix or GUID.

       When responding to a path record query for an off-subnet  DGID,  opensm
       searches  for  the  first  prefix  match  in  the  configuration  file.
       Therefore, the  order  of  the  lines  in  the  configuration  file  is
       important:  a  wild-carded prefix at the beginning of the configuration
       file renders all subsequent lines useless.  If there is no match,  then
       opensm  fails  the  query.   It  is  legal  to  repeat  prefixes in the
       configuration file, opensm will return the path to the first  available
       matching  router.   A  configuration file with a single line where both
       prefix and  GUID  are  wild-carded  means  that  a  path  record  query
       specifying  any  off-subnet  DGID  should  return  a  path to the first
       available  router.   This  configuration  yields  the  same   behaviour
       formerly achieved by compiling opensm with -DROUTER_EXP.
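
       The first-match rule can be sketched in Python as follows. The helper
       names parse_prefix_routes and first_match are hypothetical, a
       wildcard is represented as None, and router availability is not
       modelled; this is an illustration of the lookup order, not the
       OpenSM implementation.

```python
from typing import Optional

Route = tuple[Optional[int], Optional[int]]  # (prefix, GUID); None = '*'


def _field(token: str) -> Optional[int]:
    # Hex value with optional leading 0x, or '*' for a wildcard.
    return None if token == "*" else int(token, 16)


def parse_prefix_routes(text: str) -> list[Route]:
    """Parse prefix-routes file content into an ordered list of routes."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if not line:
            continue  # blank lines are ignored
        prefix, guid = line.split()
        routes.append((_field(prefix), _field(guid)))
    return routes


def first_match(routes: list[Route], dgid_prefix: int) -> Optional[Route]:
    """Return the first route whose prefix matches, or None (query fails)."""
    for prefix, guid in routes:
        if prefix is None or prefix == dgid_prefix:
            return (prefix, guid)
    return None


if __name__ == "__main__":
    conf = "fe80000000000001 0x0002c90000000001\n* *   # catch-all\n"
    print(first_match(parse_prefix_routes(conf), 0xFE80000000000001))
```

       Because the search stops at the first match, placing the "* *" line
       first in this example would shadow the more specific route.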

ROUTING

       OpenSM now offers five routing engines:

       1.   Min  Hop  Algorithm - based on the minimum hops to each node where
       the path length is optimized.

       2.  UPDN Unicast routing algorithm - also based on the minimum hops  to
       each  node,  but  it  is  constrained  to ranking rules. This algorithm
       should be chosen if the subnet is not a pure Fat Tree, and deadlock may
       occur due to a loop in the subnet.

       3.   Fat  Tree  Unicast  routing  algorithm  - this algorithm optimizes
       routing for congestion-free "shift" communication pattern.   It  should
       be  chosen  if a subnet is a symmetrical or almost symmetrical fat-tree
       of various types, not just K-ary-N-Trees:  non-constant  K,  not  fully
       staffed,  any  Constant  Bisectional Bandwidth (CBB) ratio.  Similar to
       UPDN, Fat Tree routing is constrained to ranking rules.

       4. LASH unicast routing algorithm - uses InfiniBand virtual layers (SL)
       to  provide deadlock-free shortest-path routing while also distributing
       the  paths  between  layers.  LASH  is  an  alternative   deadlock-free
       topology-agnostic  routing  algorithm to the non-minimal UPDN algorithm
       avoiding the use of a potentially congested root node.

       5. DOR Unicast routing algorithm - based on the Min Hop algorithm,  but
       avoids  port  equalization  except for redundant links between the same
       two switches.  This provides deadlock free routes for  hypercubes  when
       the  fabric  is  cabled  as a hypercube and for meshes when cabled as a
       mesh (see details below).

       OpenSM also supports a file method which can load routes from a  table.
       See ´Modular Routing Engine´ for more information on this.

       The basic routing algorithm is comprised of two stages:

       1. MinHop matrix calculation
          How many hops are required to get from each port to each LID ?
          The  algorithm to fill these tables is different if you run standard
       (min hop) or Up/Down.
          For standard routing, a "relaxation" algorithm is used to  propagate
       min hop from every destination LID through neighbor switches.
          For Up/Down routing, a BFS from every target is used. The BFS tracks
       link direction (up or down) and avoids steps that would go up after a
       down step was used.

       2.  Once  MinHop  matrices  exist,  each switch is visited and for each
       target LID a decision is made as to what port should be used to get  to
       that LID.
          This step is common to standard and Up/Down routing. Each port has a
       counter counting the number of target LIDs going through it.
          When there are multiple alternative ports with same MinHop to a LID,
       the one with fewer previously assigned ports is selected.
          If  LMC  >  0,  more  checks  are  added:  Within each group of LIDs
       assigned to same target port,
          a. use only ports which have same MinHop
          b. first prefer the ones that go to different systemImageGuid  (than
       the previous LID of the same LMC group)
          c. if none - prefer those which go through another NodeGuid
          d. fall back to the number of paths method (if all go to same node).
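
       The second stage for LMC = 0 can be sketched in Python as follows.
       The function name assign_ports and the input layout (a per-switch
       table min_hops[lid][port]) are assumptions made for the example, ties
       are broken by the lower port number, and the additional LMC > 0
       checks (a to d above) are left out.

```python
def assign_ports(min_hops: dict[int, dict[int, int]]) -> dict[int, int]:
    """Pick an output port per target LID on one switch.

    min_hops[lid][port] is the hop count to `lid` when leaving via `port`.
    Returns a mapping lid -> chosen port (a simplified LFT).
    """
    load: dict[int, int] = {}  # port -> number of target LIDs routed through it
    lft: dict[int, int] = {}
    for lid in sorted(min_hops):
        hops = min_hops[lid]
        best = min(hops.values())
        # Only ports that reach the LID in the minimum number of hops compete.
        candidates = [port for port, h in hops.items() if h == best]
        # Prefer the least-loaded candidate; lower port number breaks ties.
        port = min(candidates, key=lambda p: (load.get(p, 0), p))
        lft[lid] = port
        load[port] = load.get(port, 0) + 1
    return lft


if __name__ == "__main__":
    table = {1: {1: 2, 2: 2}, 2: {1: 2, 2: 2}, 3: {1: 1, 2: 3}}
    print(assign_ports(table))
```

       In the example, LIDs 1 and 2 are spread over ports 1 and 2 because
       both ports offer the same MinHop, while LID 3 can only use port 1.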

       Effect of Topology Changes

       OpenSM will preserve existing routing in any case  where  there  is  no
       change in the fabric switches unless the -r (--reassign_lids) option is
       specified.

       -r
       --reassign_lids
                 This option causes OpenSM to reassign LIDs to all
                 end nodes. Specifying -r on a running subnet
                 may disrupt subnet traffic.
                 Without -r, OpenSM attempts to preserve existing
                 LID assignments resolving multiple use of same LID.

       If a link is added or removed, OpenSM does not recalculate  the  routes
       that  do  not  have  to change. A route has to change if the port is no
       longer UP or no longer the MinHop. When routing changes are  performed,
       the same algorithm for balancing the routes is invoked.

       When file-based routing is used, any topology changes are currently
       ignored. The ’file’ routing engine just loads the LFTs from the file
       specified, with no reaction to the real topology. Obviously, this
       will not be able to recheck LIDs (by GUID) for disconnected nodes, and
       LFTs for non-existent switches will be skipped. Multicast is not
       affected by the ’file’ routing engine (it uses min hop tables).

       Min Hop Algorithm

       The Min Hop algorithm is invoked by default if no routing algorithm  is
       specified.  It can also be invoked by specifying ’-R minhop’.

       The  Min  Hop algorithm is divided into two stages: computation of min-
       hop tables on  every  switch  and  LFT  output  port  assignment.  Link
       subscription  is  also  equalized with the ability to override based on
       port GUID. The latter is supplied by:

       -i <equalize-ignore-guids-file>
       -ignore-guids <equalize-ignore-guids-file>
                 This option provides the means to define a set of ports
                 (by guid) that will be ignored by the link load
                 equalization algorithm. Note that only endports (CA,
                 switch port 0, and router ports) and not switch external
                 ports are supported.

       LMC-aware routing is done on a (remote) system or switch basis.

       Purpose of UPDN Algorithm

       The UPDN algorithm is designed to prevent deadlocks from  occurring  in
       loops  of  the subnet. A loop-deadlock is a situation in which it is no
       longer possible to send data between any two  hosts  connected  through
       the  loop.  As  such,  the UPDN routing algorithm should be used if the
       subnet is not a pure Fat Tree, and one of its loops  may  experience  a
       deadlock (due, for example, to high pressure).

       The UPDN algorithm is based on the following main stages:

       1.  Auto-detect root nodes - based on the CA hop length from any switch
       in the subnet, a statistical histogram is built for  each  switch  (hop
       num  vs  number  of  occurrences). If the histogram reflects a specific
       column (higher than others) for a certain node, then it is marked as  a
       root node. Since the algorithm is statistical, it may not find any root
       nodes. The list of the root nodes found by this  auto-detect  stage  is
       used by the ranking process stage.

           Note 1: The user can override the node list manually.
           Note 2: If this stage cannot find any root nodes, and the user did
                   not specify a guid list file, OpenSM defaults back to the
                   Min Hop routing algorithm.

       2.   Ranking  process  -  All  root switch nodes (found in stage 1) are
       assigned a rank of 0. Using the BFS algorithm, the rest of  the  switch
       nodes  in the subnet are ranked incrementally. This ranking aids in the
       process of enforcing rules that ensure loop-free paths.

       3.  Min Hop Table setting - after ranking is done, a BFS  algorithm  is
       run  from  each  (CA  or  switch)  node  in  the subnet. During the BFS
       process, the FDB table of each switch node traversed by BFS is updated,
       in  reference to the starting node, based on the ranking rules and guid
       values.

       At the end of the process, the  updated  FDB  tables  ensure  loop-free
       paths through the subnet.

       Note:  Up/Down routing does not allow LID routing communication between
       switches that are located inside spine "switch systems".  The reason is
       that  there  is  no way to allow a LID route between them that does not
       break the Up/Down rule.  One ramification of this is  that  you  cannot
       run SM on switches other than the leaf switches of the fabric.
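
       The ranking stage and the Up/Down rule can be illustrated with a short
       Python sketch. This is not OpenSM's implementation: the adjacency
       dictionary, the switch names, and the function names are hypothetical,
       and ordering equal-rank switches by identifier is an assumption made
       only to keep the example self-contained.

```python
from collections import deque

def rank_switches(adjacency, roots):
    """Multi-source BFS: roots get rank 0, every other switch gets
    1 + the rank of the neighbour it was first reached from."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in rank:
                rank[neighbour] = rank[node] + 1
                queue.append(neighbour)
    return rank

def is_up(rank, src, dst):
    """A hop goes 'up' when it moves toward the roots (lower rank).
    Assumption: equal ranks are ordered by node identifier."""
    return (rank[dst], dst) < (rank[src], src)

def is_legal_updn_path(rank, path):
    """Legal Up/Down path: zero or more up hops, then zero or more down hops."""
    gone_down = False
    for src, dst in zip(path, path[1:]):
        if is_up(rank, src, dst):
            if gone_down:
                return False  # turning up again after going down
        else:
            gone_down = True
    return True

# Hypothetical two-level fabric: spines s1, s2 and leaves l1, l2, l3.
fabric = {
    "s1": ["l1", "l2", "l3"],
    "s2": ["l1", "l2", "l3"],
    "l1": ["s1", "s2"],
    "l2": ["s1", "s2"],
    "l3": ["s1", "s2"],
}
ranks = rank_switches(fabric, roots=["s1", "s2"])
leaf_to_leaf_ok = is_legal_updn_path(ranks, ["l1", "s1", "l2"])
spine_to_spine_ok = is_legal_updn_path(ranks, ["s1", "l1", "s2"])
```

       In this sketch the leaf-to-leaf path is legal, while the only path
       between the two spines goes down and then up, which is the situation
       described in the note above.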

       UPDN Algorithm Usage

       Activation through OpenSM

       Use the ’-R updn’ option (instead of the old ’-u’) to activate the UPDN
       algorithm.  Use ’-a <root_guid_file>’ to add a UPDN guid file that
       contains the root nodes for ranking.  If the ’-a’ option is not used,
       OpenSM uses its auto-detect root nodes algorithm.

       Notes on the guid list file:

       1.   A valid guid file specifies one guid in each line. Lines  with  an
       invalid format will be discarded.
       2.   The user should specify the root switch guids. However, it is also
       possible to specify CA guids; OpenSM will use the guid  of  the  switch
       (if it exists) that connects the CA to the subnet as a root node.
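
       The "one guid per line, invalid lines discarded" rule can be sketched
       in Python as follows. The accepted syntax here (up to 16 hexadecimal
       digits with an optional 0x prefix) is an assumption for illustration,
       and the sample guids are made up; the format OpenSM actually accepts
       is defined by its own parser.

```python
import re

# Assumption: a guid is 1-16 hex digits with an optional 0x prefix.
GUID_LINE = re.compile(r"^\s*(?:0[xX])?([0-9a-fA-F]{1,16})\s*$")

def parse_guid_lines(lines):
    """Return the guids (as integers) found one per line; skip invalid lines."""
    guids = []
    for line in lines:
        match = GUID_LINE.match(line)
        if match:
            guids.append(int(match.group(1), 16))
    return guids

# Hypothetical file contents: two valid lines, two lines that are discarded.
sample = [
    "0x0008f10400411a08",
    "0x0008f10400411a0c",
    "not a guid",
    "",
]
roots = parse_guid_lines(sample)
```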

       Fat-tree Routing Algorithm

       The fat-tree algorithm optimizes routing for the "shift" communication
       pattern.  It should be chosen if a subnet is a symmetrical or almost
       symmetrical fat-tree of various types.  It supports more than just
       K-ary-N-Trees: it handles non-constant K, cases where not all leaves
       (CAs) are present, and any CBB ratio.  As in UPDN, fat-tree routing
       also prevents credit-loop deadlocks.

       If the root guid file  is  not  provided  (’-a’  or  ’--root_guid_file’
       options),  the  topology has to be pure fat-tree that complies with the
       following rules:
         - Tree rank should be between two and eight (inclusive)
         - Switches of the same rank should have the same number
           of UP-going port groups*, unless they are root switches,
           in which case they shouldn’t have UP-going ports at all.
         - Switches of the same rank should have the same number
           of DOWN-going port groups, unless they are leaf switches.
         - Switches of the same rank should have the same number
           of ports in each UP-going port group.
         - Switches of the same rank should have the same number
           of ports in each DOWN-going port group.
         - All the CAs have to be at the same tree level (rank).

       If the root guid file is provided, the topology doesn’t have to be pure
       fat-tree, and it should only comply with the following rules:
         - Tree rank should be between two and eight (inclusive)
         - All the Compute Nodes** have to be at the same tree level (rank).
           Note that non-compute node CAs are allowed here to be at different
           tree ranks.

       *  ports that are connected to the same remote switch are referred to
       as a ´port group´.

       **  list  of  compute  nodes  (CNs)  can  be  specified  by   ´-u´   or
       ´--cn_guid_file´ OpenSM options.
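
       The port group definition and the same-rank rules can be made concrete
       with a small Python sketch. The data layout (a port-to-remote-switch
       dictionary per switch), the switch names, and the function names are
       hypothetical. The check reads the rules strictly, requiring every port
       group on every switch of the rank to have the same size; it is meant
       only to illustrate the rules, not to reproduce OpenSM's checks.

```python
from collections import defaultdict

def port_groups(links):
    """links maps a local port number to the remote switch it connects to.
    Returns remote switch -> sorted local ports, one entry per port group."""
    groups = defaultdict(list)
    for port, remote in sorted(links.items()):
        groups[remote].append(port)
    return dict(groups)

def rank_is_uniform(links_by_switch):
    """For one rank and one direction (UP or DOWN): all switches have the
    same number of port groups, and all port groups have the same size."""
    group_counts = set()
    group_sizes = set()
    for links in links_by_switch.values():
        groups = port_groups(links)
        group_counts.add(len(groups))
        group_sizes.update(len(ports) for ports in groups.values())
    return len(group_counts) <= 1 and len(group_sizes) <= 1

# Hypothetical UP-going links of two leaf switches: port -> remote switch.
leaf_up_links = {
    "leaf1": {1: "spine1", 2: "spine1", 3: "spine2", 4: "spine2"},
    "leaf2": {1: "spine1", 2: "spine1", 3: "spine2", 4: "spine2"},
}
leaf1_groups = port_groups(leaf_up_links["leaf1"])
uniform = rank_is_uniform(leaf_up_links)
```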

       Topologies  that  do  not  comply  cause a fallback to min hop routing.
       Note that this can also occur on link failures which cause the topology
       to no longer be "pure" fat-tree.

       Note that although the fat-tree algorithm supports trees with a
       non-integer CBB ratio, the routing will not be as balanced as in the
       case of an integer CBB ratio.  In addition, although the algorithm
       allows leaf switches to have any number of CAs, the closer the tree is
       to being fully populated, the more effective the "shift" communication
       pattern will be.  In general, even if the root list is provided, the
       closer the topology is to a pure and symmetrical fat-tree, the more
       optimal the routing will be.

       The algorithm also dumps a compute node ordering file
       (opensm-ftree-ca-order.dump) in the same directory where the OpenSM
       log resides.  This ordering file provides the CN order that may be
       used to create an efficient communication pattern that matches the
       routing tables.
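
       A "shift" pattern is commonly described as a sequence of stages in
       which, at stage k, the node at position i in an ordering sends to the
       node at position (i + k) mod N. The Python sketch below builds such a
       schedule from an ordered list of compute nodes. The node names are
       hypothetical, and reading the dump file is left out because its format
       is not described here.

```python
def shift_schedule(ordered_nodes):
    """Return one list of (sender, receiver) pairs per shift stage.
    Stage k (1 <= k < N) pairs position i with position (i + k) mod N."""
    n = len(ordered_nodes)
    schedule = []
    for k in range(1, n):
        stage = [
            (ordered_nodes[i], ordered_nodes[(i + k) % n]) for i in range(n)
        ]
        schedule.append(stage)
    return schedule

# Hypothetical CN order, as it might be taken from the ordering file.
cn_order = ["cn0", "cn1", "cn2"]
stages = shift_schedule(cn_order)
```

       Each stage is a permutation: every node sends once and receives once,
       which is the property the ordering is meant to help exploit.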

       Activation through OpenSM

       Use the ’-R ftree’ option to activate the fat-tree algorithm.  Use ’-a
       <root_guid_file>’ to provide root nodes for ranking.  If the ’-a’
       option is not used, the routing algorithm will detect roots
       automatically.  Use ’-u <root_cn_file>’ to provide the list of compute
       nodes.  If the ’-u’ option is not used, all the CAs are considered
       compute nodes.

       Note:  LMC  >  0  is  not  supported  by  fat-tree  routing. If this is
       specified, the default routing algorithm is invoked instead.

       LASH Routing Algorithm

       LASH is  an  acronym  for  LAyered  SHortest  Path  Routing.  It  is  a
       deterministic  shortest  path  routing  algorithm that enables topology
       agnostic deadlock-free routing within communication networks.

       When computing the routing function, LASH analyzes the network topology
       for   the   shortest-path   routes  between  all  pairs  of  sources  /
       destinations and groups these paths into virtual layers in such  a  way
       as to avoid deadlock.

       Note that LASH analyzes routes and ensures deadlock freedom between
       switch pairs.  The link between an HCA and a switch does not need
       virtual layers, as deadlock will not arise between a switch and an HCA.

       In more detail, the algorithm works as follows:

       1)  LASH  determines  the  shortest-path  between all pairs of source /
       destination switches. Note, LASH ensures the same SL is  used  for  all
       SRC/DST  - DST/SRC pairs and there is no guarantee that the return path
       for a given DST/SRC will be the reverse of the route SRC/DST.

       2) LASH then begins an SL assignment process where a route is  assigned
       to  a  layer (SL) if the addition of that route does not cause deadlock
       within that layer. This is achieved  by  maintaining  and  analysing  a
       channel dependency graph for each layer. Once the potential addition of
       a path could lead to deadlock, LASH opens a new layer and continues the
       process.

       3)  Once  this  stage  has been completed, it is highly likely that the
       first layers processed will contain more paths than  the  latter  ones.
       To better balance the use of layers, LASH moves paths from one layer to
       another so that the number of paths in each layer averages out.
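
       Step 2 can be illustrated with the following Python sketch, which keeps
       one channel dependency graph per layer and opens a new layer when no
       existing layer can take a route without creating a cycle. It is a
       simplified illustration, not the opensm code: routes are given as
       lists of switch names, all names are hypothetical, trying existing
       layers in order is an assumption, and the balancing of step 3 is
       omitted.

```python
def route_dependencies(route):
    """Channel dependencies of a route given as a list of switches:
    each link (a, b) depends on the next link (b, c) along the route."""
    channels = list(zip(route, route[1:]))
    return list(zip(channels, channels[1:]))

def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph of edge pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(graph, WHITE)

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True  # back edge: a dependency cycle exists
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

def assign_layers(routes):
    """Place each route in the first layer whose dependency graph stays
    acyclic; open a new layer if none qualifies."""
    layers = []  # each layer: {"routes": [...], "deps": set of edges}
    for route in routes:
        deps = set(route_dependencies(route))
        for layer in layers:
            if not has_cycle(layer["deps"] | deps):
                break
        else:
            layer = {"routes": [], "deps": set()}
            layers.append(layer)
        layer["routes"].append(route)
        layer["deps"] |= deps
    return [layer["routes"] for layer in layers]

# Hypothetical ring of four switches with two-hop clockwise routes.
ring_routes = [
    ["A", "B", "C"],
    ["B", "C", "D"],
    ["C", "D", "A"],
    ["D", "A", "B"],
]
layers = assign_layers(ring_routes)
```

       In the ring example the first three routes share a layer, and the
       fourth would close the dependency cycle, so it is placed in a second
       layer.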

       Note, the implementation of LASH in  opensm  attempts  to  use  as  few
       layers  as  possible. This number can be less than the number of actual
       layers available.

       In general, LASH is a very flexible algorithm.  It can, for example,
       reduce to Dimension Order Routing in certain topologies.  It is
       topology agnostic and fares well in the face of faults.

       It has been shown that for both regular and irregular topologies, LASH
       outperforms Up/Down.  The reason for this is that LASH distributes the
       traffic more evenly through a network, avoiding the bottleneck issues
       related to a root node, and always routes along shortest paths.

       The algorithm was developed by Simula Research Laboratory.

       Use the ’-R lash -Q’ options to activate the LASH algorithm.

       Note: QoS support has to be turned on for SL/VL mappings to be used.

       Note: LMC > 0 is not supported by LASH routing.  If this is
       specified, the default routing algorithm is invoked instead.

       DOR Routing Algorithm

       The Dimension Order Routing algorithm is based on the Min Hop algorithm
       and so uses shortest paths.  Instead of spreading  traffic  out  across
       different  paths  with the same shortest distance, it chooses among the
       available shortest paths based on an ordering of dimensions.  Each port
       must  be  consistently  cabled  to represent a hypercube dimension or a
       mesh dimension.  Paths are grown from a destination back  to  a  source
       using  the  lowest  dimension  (port)  of available paths at each step.
       This provides the ordering necessary to avoid deadlock.  When there are
       multiple  links between any two switches, they still represent only one
       dimension and traffic is balanced across them unless port  equalization
       is  turned  off.  In the case of hypercubes, the same port must be used
       throughout the fabric to represent the hypercube dimension and match on
       both  ends  of  the cable.  In the case of meshes, the dimension should
       consistently use the same pair of ports, one port on  one  end  of  the
       cable,  and  the other port on the other end, continuing along the mesh
       dimension.
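
       The idea of ordering dimensions can be shown with a textbook form of
       dimension order routing on a mesh, sketched in Python below. It works
       on coordinates rather than on switch ports and builds the path from
       source to destination, so it is only an illustration of the ordering
       principle and not the port-based procedure OpenSM uses; the function
       name and coordinates are hypothetical.

```python
def dor_path(src, dst):
    """Return the mesh coordinates visited from src to dst, correcting
    dimension 0 completely first, then dimension 1, and so on."""
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same number of dimensions")
    current = list(src)
    path = [tuple(current)]
    for dim in range(len(src)):
        step = 1 if dst[dim] > current[dim] else -1
        while current[dim] != dst[dim]:
            current[dim] += step
            path.append(tuple(current))
    return path

# Hypothetical 2D mesh coordinates.
forward = dor_path((0, 0), (2, 1))
backward = dor_path((2, 1), (0, 0))
```

       Because a path never returns to a lower dimension after moving in a
       higher one, the turns that would be needed to close a cycle of
       dependencies are never taken.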

       Use the ’-R dor’ option to activate the DOR algorithm.

       Routing References

       To learn more about deadlock-free routing, see  the  article  "Deadlock
       Free  Message  Routing  in  Multiprocessor Interconnection Networks" by
       William J Dally and Charles L Seitz (1985).

       To learn more about the up/down algorithm, see the  article  "Effective
       Strategy  to Compute Forwarding Tables for InfiniBand Networks" by Jose
       Carlos Sancho, Antonio  Robles,  and  Jose  Duato  at  the  Universidad
       Politecnica de Valencia.

       To learn more about LASH and the flexibility behind it, the requirement
       for layers, and performance comparisons to other algorithms, see the
       following articles:

       "Layered Routing in Irregular Networks", Lysne et al., IEEE
       Transactions on Parallel and Distributed Systems, Vol. 16, No. 12,
       December 2005.

       "Routing for the ASI Fabric Manager", Solheim et al., IEEE
       Communications Magazine, Vol. 44, No. 7, July 2006.

       "Layered Shortest Path (LASH) Routing in Irregular System Area
       Networks", Skeie et al., IEEE Computer Society Communication
       Architecture for Clusters 2002.

       Modular Routing Engine

       The modular routing engine structure makes it easy to "plug in" new
       routing modules.

       Currently, only unicast callbacks are supported. Multicast can be added
       later.

       One  existing  routing module is up-down "updn", which may be activated
       with ’-R updn’ option (instead of old ’-u’).

       General usage is: $ opensm -R ’module-name’

       There is also a trivial routing module which is able to load LFT tables
       from a file.

       Main features:

        - this will load switch LFTs and/or LID matrices (min hops tables)
        - this will load switch LFTs according to the path entries introduced
          in the file
        - no additional checks will be performed (such as "is port connected",
          etc.)
        - if fabric LIDs have changed, this will try to reconstruct
          LFTs correctly if endport GUIDs are represented in the file
          (in order to disable this, GUIDs may be removed from the file
           or zeroed)

       The file format is compatible with the output of the ’ibroute’ utility,
       and for the whole fabric it can be generated with the dump_lfts.sh
       script.

       To activate the file based routing module, use:

         opensm -R file -U /path/to/lfts_file

       If  the  lfts_file  is  not  found  or is in error, the default routing
       algorithm is utilized.

       The ability to dump switch lid matrices (aka min hops tables) to a file
       and later to load them is also supported.

       The usage is similar to loading unicast forwarding tables from an lfts
       file (introduced by the ’file’ routing engine), but the lid matrix file
       name should be specified with the -M or --lid_matrix_file option.  For
       example:

         opensm -R file -M ./opensm-lid-matrix.dump

       The dump file is named ´opensm-lid-matrix.dump´ and will be generated
       in the standard opensm dump directory (/var/log by default) when the
       OSM_LOG_ROUTING logging flag is set.

       When the ’file’ routing engine is activated, but the lfts file is not
       specified or cannot be opened, the default lid matrix algorithm will
       be used.

       There is also a switch forwarding tables dumper which generates a file
       compatible with dump_lfts.sh output.  This file can be used as input
       for forwarding table loading by the ’file’ routing engine.  Either or
       both of the -U and -M options can be specified together with
       ´-R file´.

FILES

       /etc/opensm/opensm.conf
              default OpenSM config file.

       /etc/opensm/ib-node-name-map
              default   node  name  map  file.   See  ibnetdiscover  for  more
              information on format.

       /etc/opensm/partitions.conf
              default partition config file

       /etc/opensm/qos-policy.conf
              default QOS policy config file

       /etc/opensm/prefix-routes.conf
              default prefix routes file.

AUTHORS

       Hal Rosenstock
              <hal.rosenstock@gmail.com>

       Sasha Khapyorsky
              <sashak@voltaire.com>

       Eitan Zahavi
              <eitan@mellanox.co.il>

       Yevgeny Kliteynik
              <kliteyn@mellanox.co.il>

       Thomas Sodring
              <tsodring@simula.no>

       Ira Weiny
              <weiny2@llnl.gov>