With the 6.5.0 release, \charmpp has been augmented with support for
partitioning. The key idea is to divide the allocated set of nodes into
subsets that run independent \charmpp instances. These \charmpp instances
(called partitions from now on) have a unique identifier, can be programmed
to do different tasks, and can interact with each other. Addition of the
partitioning scheme does not affect the existing code base or codes that do
not use partitioning. Use cases of partitioning include replicated NAMD,
replica-based fault tolerance, and studying mapping performance. In some
aspects, partitioning is similar to disjoint communicator creation in MPI.
\section{Overview}
The \charmpp stack has three components: Charm++, Converse, and the machine
layer. In general, the machine layer handles the exchange of messages among
nodes and interacts with the next layer in the stack, Converse. Converse is
responsible for scheduling tasks (including user code) and is used by Charm++
to execute the user application. Charm++ is the top-most level, in which
applications are written. Under partitioning, the Charm++ and machine layers
are unaware of the partitioning: Charm++ assumes its partition to be the
entire world, whereas the machine layer considers the whole set of allocated
nodes as one partition. During startup, Converse divides the allocated set of
nodes into partitions, in each of which a Charm++ instance is run. It performs
the necessary translations as interactions happen with Charm++ and the machine
layer. The partitions can communicate with each other using the Converse API
described later.
\section{Ranking}
The \charmpp stack assigns a rank to every processing element (PE). In the
non-partitioned version, the rank assigned to a PE is the same at all three
layers of the \charmpp stack. This rank also (generally) coincides with the
rank provided to processors/cores by the underlying job scheduler. The
importance of these ranks derives from the fact that they are used for
multiple purposes. Partitioning leads to a segregation of the notion of ranks
at the different levels of the \charmpp stack. What used to be the PE rank is
now a local rank within a partition running a Charm++ instance. Existing
methods such as {\tt CkMyPe()}, {\tt CkMyNode()}, {\tt CmiMyPe()}, etc.\
continue to provide these local ranks. Hence, existing codes do not require
any change as long as inter-partition interaction is not required.

On the other hand, the machine layer is provided with target ranks that are
globally unique. These ranks can be obtained using functions with the
\emph{Global} suffix, such as {\tt CmiNumNodesGlobal()},
{\tt CmiMyNodeGlobal()}, {\tt CmiMyPeGlobal()}, etc.
Converse, which operates at a layer between Charm++ and the machine layer,
performs the required translations. It maintains the relevant information for
any conversion. Information related to partitions can be obtained using
Converse level functions such as {\tt CmiMyPartition()},
{\tt CmiNumPartitions()}, etc. If required, one can also obtain the mapping of
a local rank to a global rank using functions such as
{\tt CmiGetPeGlobal(int perank, int partition)} and
{\tt CmiGetNodeGlobal(int noderank, int partition)}. These functions take two
arguments: the local rank and the partition number. For example,
{\tt CmiGetNodeGlobal(5, 2)} will return the global rank of the node that has
local rank 5 in partition 2. The inverse translation, from global rank to
local rank, is not supported.
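As an illustration, the following minimal sketch queries the partition-related
information described above and translates the local node rank of the calling
PE to its globally unique counterpart. The function name
{\tt printPartitionInfo()} is only a placeholder; the Converse calls are the
ones documented in this section.

\begin{verbatim}
// Minimal sketch: query partition information and translate the local
// node rank of this PE to its globally unique counterpart.
#include "converse.h"   // CmiMyPartition, CmiGetNodeGlobal, ...

void printPartitionInfo() {
  int myPart   = CmiMyPartition();     // partition this PE belongs to
  int numParts = CmiNumPartitions();   // total number of partitions

  // Local ranks, as also reported by CkMyPe()/CkMyNode() at the
  // Charm++ level.
  int localPe   = CmiMyPe();
  int localNode = CmiMyNode();

  // Globally unique ranks, as seen by the machine layer.
  int globalPe   = CmiMyPeGlobal();
  int globalNode = CmiGetNodeGlobal(localNode, myPart);

  CmiPrintf("Partition %d of %d: local PE %d (node %d) -> "
            "global PE %d (node %d)\n",
            myPart, numParts, localPe, localNode, globalPe, globalNode);
}
\end{verbatim}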
\section{Startup and Partitioning}
A number of compile time and runtime parameters are available for users who
want to run multiple partitions in a single job.
\begin{itemize}
\item Runtime parameter: {\tt +partitions <part\_number>} or {\tt +replicas
<replica\_number>} - number of partitions to be created. If no further options
are provided, the allocated cores/nodes are divided equally among the
partitions. Only this option is supported in the 6.5.0 release; the remaining
options are supported starting with the 6.6.0 release.
\item Runtime parameter: {\tt +master\_partition} - assign one core/node as
the master partition (partition 0), and divide the remaining cores/nodes
equally among the remaining partitions.
\item Runtime parameter: {\tt +partition\_sizes L[-U[:S[.R]]]\#W[,...]} -
defines the sizes of partitions. A single number identifies a particular
partition. Two numbers separated by a dash identify an inclusive range
(\emph{lower bound} and \emph{upper bound}). If they are followed by a colon
and another number (a \emph{stride}), that range will be stepped through in
increments of the additional number. Within each stride, a dot followed by a
\emph{run} indicates how many partitions to use from that starting point.
Finally, a compulsory number sign (\#) followed by a \emph{width} defines the
size of each of the partitions identified so far. For example, the sequence
{\tt 0-4:2\#10,1\#5,3\#15} states that partitions 0, 2, and 4 should be of
size 10, partition 1 of size 5, and partition 3 of size 15. In SMP mode,
these sizes are in terms of nodes; all worker threads associated with a node
are assigned to the partition of that node. This option conflicts with
{\tt +master\_partition}.
\item Runtime parameter: {\tt +partition\_topology} - use a default topology
aware scheme to partition the allocated nodes.
\item Runtime parameter: {\tt +partition\_topology\_scheme <scheme>} - use the
given scheme to partition the allocated nodes. Currently, two generalized
schemes are supported that should be useful on torus networks. If the scheme
is set to 1, the allocated nodes are traversed plane by plane during
partitioning. A Hilbert curve based traversal is used with scheme 2.
\item Compilation parameter: {\tt -custom-part}, runtime parameter:
{\tt +use\_custom\_partition} - enables the use of user defined partitioning.
In order to implement a new partitioning scheme, a user must link an object
exporting a C function with the following prototype (a sketch of such a
function is shown after this list): \\

{\tt extern "C" void createCustomPartitions(int numparts, int *partitionSize,
int *nodeMap);}\\

{\tt numparts} (input) - number of partitions to be created. \\
{\tt partitionSize} (input) - an array that contains the size of each
partition. \\
{\tt nodeMap} (output, preallocated) - a preallocated array of length
{\tt CmiNumNodesGlobal()}. Entry \emph{i} in this array specifies the new
global node rank of the node with default node rank \emph{i}. The entries in
this array are divided blockwise to create partitions, i.e., entries 0 to
partitionSize[0]-1 belong to partition 0, entries partitionSize[0] to
partitionSize[0]+partitionSize[1]-1 belong to partition 1, and so on.\\

When this function is invoked to create partitions, TopoManager is configured
to view all the allocated nodes as one partition. The partition-based API is
not yet initialized at this point and should not be used. The link time
parameter {\tt -custom-part} must be passed to {\tt charmc} for successful
compilation.
\end{itemize}
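The following is a minimal sketch of a custom partitioning function under the
semantics described above. The reversal of the default node ordering is purely
illustrative, and the consistency check is optional.

\begin{verbatim}
#include "converse.h"

// Minimal sketch of a user defined partitioning function. nodeMap[i] is
// filled with the new global node rank assigned to the node whose default
// rank is i; the runtime then carves the resulting ordering into blocks of
// the requested sizes. This toy scheme simply reverses the default order.
extern "C" void createCustomPartitions(int numparts, int *partitionSize,
                                       int *nodeMap) {
  int numNodes = CmiNumNodesGlobal();  // total nodes across all partitions

  // Optional sanity check: the requested sizes must cover all nodes.
  int total = 0;
  for (int p = 0; p < numparts; p++) total += partitionSize[p];
  if (total != numNodes)
    CmiAbort("createCustomPartitions: sizes do not sum to the node count");

  // Reverse the default node ordering before the blockwise split.
  for (int i = 0; i < numNodes; i++)
    nodeMap[i] = numNodes - 1 - i;
}
\end{verbatim}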
\section{Redirecting output from individual partitions}
Output to standard output (stdout) from the various partitions can be directed
to separate files by passing the target path as a command line option. The
runtime parameter {\tt +stdout <path>} is used for this purpose. The
{\tt <path>} may contain the C format specifier \emph{\%d}, which will be
replaced by the partition number. If \emph{\%d} is specified multiple times,
only the first three instances from the left will be replaced by the partition
number (other or additional format specifiers result in undefined behavior).
If no format specifier is given, the partition number will be appended as a
suffix to the specified path. Example usage:
\begin{itemize}
\item {\tt +stdout out/\%d/log} will write to \emph{out/0/log, out/1/log,
out/2/log,} $\cdots$.
\item {\tt +stdout log} will write to \emph{log.0, log.1, log.2,} $\cdots$.
\item {\tt +stdout out/\%d/log\%d} will write to \emph{out/0/log0, out/1/log1,
out/2/log2,} $\cdots$.
\end{itemize}
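For instance, a launch line along the following lines (the program name and
core count are only placeholders) creates four partitions and redirects each
partition's stdout to its own file:

\begin{verbatim}
./charmrun +p16 ./myprogram +partitions 4 +stdout out/%d/log
\end{verbatim}

Here, partition \emph{i} writes its output to \emph{out/i/log}.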
\section{Inter-partition Communication}
A new API was added to Converse to enable sending messages from one partition
to another. Currently, the following functions are available for this purpose:
\begin{itemize}
\item {\tt CmiInterSyncSend(local\_rank, partition, size, message)}
\item {\tt CmiInterSyncSendAndFree(local\_rank, partition, size, message)}
\item {\tt CmiInterSyncNodeSend(local\_node, partition, size, message)}
\item {\tt CmiInterSyncNodeSendAndFree(local\_node, partition, size, message)}
\end{itemize}
Users who have coded in Converse will find these functions very similar to the
basic Converse send functions, {\tt CmiSyncSend} and {\tt CmiSyncSendAndFree}.
Given the local rank of a PE and the partition it belongs to, these functions
pass the message to the machine layer. {\tt CmiInterSyncSend} does not return
until {\tt message} is ready for reuse. {\tt CmiInterSyncSendAndFree} passes
ownership of {\tt message} to the Charm++ RTS, which will free the message
when the send is complete. Each Converse message contains a message header,
which makes these messages active, i.e., they contain information about their
handlers. These handlers can be registered using the existing Charm++ API,
{\tt CmiRegisterHandler}. {\tt CmiInterSyncNodeSend} and
{\tt CmiInterSyncNodeSendAndFree} are the node-level counterparts of these
functions, which allow sending a message to a node (in SMP mode).
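As an illustration, the following minimal sketch sends an integer from the
current PE to local PE 0 of the next partition. The message layout, the
handler, and the helper functions {\tt registerPingHandler()} and
{\tt sendPing()} are hypothetical; only the Converse calls themselves are part
of the API described above.

\begin{verbatim}
#include "converse.h"

CpvDeclare(int, pingHandlerIdx);   // handler index, one copy per PE

// Handler invoked on the receiving PE when the message arrives.
void pingHandler(void *msg) {
  int *payload = (int *)((char *)msg + CmiMsgHeaderSizeBytes);
  CmiPrintf("[partition %d, PE %d] received value %d\n",
            CmiMyPartition(), CmiMyPe(), *payload);
  CmiFree(msg);
}

// Must run on every PE during initialization so that the handler index
// is consistent everywhere.
void registerPingHandler() {
  CpvInitialize(int, pingHandlerIdx);
  CpvAccess(pingHandlerIdx) = CmiRegisterHandler((CmiHandler)pingHandler);
}

// Send an int to local PE 0 of the next partition (wrapping around).
void sendPing(int value) {
  int destPartition = (CmiMyPartition() + 1) % CmiNumPartitions();
  int size = CmiMsgHeaderSizeBytes + sizeof(int);

  char *msg = (char *)CmiAlloc(size);
  *(int *)(msg + CmiMsgHeaderSizeBytes) = value;
  CmiSetHandler(msg, CpvAccess(pingHandlerIdx));  // make the message "active"

  // Ownership of msg passes to the runtime, which frees it after the send.
  CmiInterSyncSendAndFree(0 /* local PE rank */, destPartition, size, msg);
}
\end{verbatim}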