With the 6.5.0 release, \charmpp has been augmented with support for
partitioning. The key idea is to divide the allocated set of nodes into
subsets that run independent \charmpp instances. These \charmpp instances
(called partitions from now on) have a unique identifier, can be programmed
to do different tasks, and can interact with each other. Addition of the
partitioning scheme does not affect the existing code base or codes that do
not use partitioning. Use cases of partitioning include replicated NAMD,
replica-based fault tolerance, and studying mapping performance. In some
aspects, partitioning is similar to disjoint communicator creation in MPI.
\section{Overview}
The \charmpp stack has three components: Charm++, Converse, and the machine
layer. In general, the machine layer handles the exchange of messages among
nodes and interacts with the next layer in the stack, Converse. Converse is
responsible for scheduling tasks (including user code) and is used by Charm++
to execute the user application. Charm++ is the top-most level, in which
applications are written. Under partitioning, the Charm++ and machine layers
are unaware of the partitioning: Charm++ assumes its partition to be the
entire world, whereas the machine layer considers the whole set of allocated
nodes as one partition. During startup, Converse divides the allocated set of
nodes into partitions, in each of which a Charm++ instance is run. It performs
the necessary translations as interactions happen with Charm++ and the machine
layer. The partitions can communicate with each other using the Converse API
described later.
\section{Ranking}
The \charmpp stack assigns a rank to every processing element (PE). In the
non-partitioned version, the rank assigned to a PE is the same at all three
layers of the \charmpp stack. This rank also (generally) coincides with the
rank provided to processors/cores by the underlying job scheduler. The
importance of these ranks derives from the fact that they are used for
multiple purposes. Partitioning leads to a segregation of the notion of ranks
at the different levels of the \charmpp stack. What used to be the PE rank is
now a local rank within a partition running a Charm++ instance. Existing
methods such as {\tt CkMyPe()}, {\tt CkMyNode()}, {\tt CmiMyPe()}, etc.\
continue to provide these local ranks. Hence, existing codes do not require
any change as long as inter-partition interaction is not required.

On the other hand, the machine layer is provided with target ranks that are
globally unique. These ranks can be obtained using functions with the
\emph{Global} suffix, such as {\tt CmiNumNodesGlobal()},
{\tt CmiMyNodeGlobal()}, {\tt CmiMyPeGlobal()}, etc.
Converse, which operates at a layer between Charm++ and the machine layer,
performs the required translations. It maintains the relevant information for
any conversion. Information related to partitions can be obtained using
Converse level functions such as {\tt CmiMyPartition()},
{\tt CmiNumPartitions()}, etc. If required, one can also obtain the mapping of
a local rank to a global rank using functions such as
{\tt CmiGetPeGlobal(int perank, int partition)} and
{\tt CmiGetNodeGlobal(int noderank, int partition)}. These functions take two
arguments: the local rank and the partition number. For example,
{\tt CmiGetNodeGlobal(5, 2)} will return the global rank of the node that has
local rank 5 in partition 2. The inverse translation, from global rank to
local rank, is not supported.
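As an illustration, the following minimal sketch queries the partition-related
information described above and translates the local node rank of the calling
PE to its globally unique counterpart. The function name
{\tt printPartitionInfo()} is only a placeholder; the Converse calls are the
ones documented in this section.

\begin{verbatim}
// Minimal sketch: query partition information and translate the local
// node rank of this PE to its globally unique counterpart.
#include "converse.h"   // CmiMyPartition, CmiGetNodeGlobal, ...

void printPartitionInfo() {
  int myPart   = CmiMyPartition();     // partition this PE belongs to
  int numParts = CmiNumPartitions();   // total number of partitions

  // Local ranks, as also reported by CkMyPe()/CkMyNode() at the
  // Charm++ level.
  int localPe   = CmiMyPe();
  int localNode = CmiMyNode();

  // Globally unique ranks, as seen by the machine layer.
  int globalPe   = CmiMyPeGlobal();
  int globalNode = CmiGetNodeGlobal(localNode, myPart);

  CmiPrintf("Partition %d of %d: local PE %d (node %d) -> "
            "global PE %d (node %d)\n",
            myPart, numParts, localPe, localNode, globalPe, globalNode);
}
\end{verbatim}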
\section{Startup and Partitioning}
A number of compile time and runtime parameters are available for users who
want to run multiple partitions in a single job.
\begin{itemize}
\item Runtime parameter: {\tt +partitions <part\_number>} or {\tt +replicas
<replica\_number>} - number of partitions to be created. If no further options
are provided, the allocated cores/nodes are divided equally among the
partitions. Only this option is supported in the 6.5.0 release; the remaining
options are supported starting with the 6.6.0 release.
\item Runtime parameter: {\tt +master\_partition} - assign one core/node as
the master partition (partition 0), and divide the remaining cores/nodes
equally among the remaining partitions.
\item Runtime parameter: {\tt +partition\_sizes L[-U[:S[.R]]]\#W[,...]} -
defines the sizes of partitions. A single number identifies a particular
partition. Two numbers separated by a dash identify an inclusive range
(\emph{lower bound} and \emph{upper bound}). If they are followed by a colon
and another number (a \emph{stride}), that range will be stepped through in
increments of the additional number. Within each stride, a dot followed by a
\emph{run} indicates how many partitions to use from that starting point.
Finally, a compulsory number sign (\#) followed by a \emph{width} defines the
size of each of the partitions identified so far. For example, the sequence
{\tt 0-4:2\#10,1\#5,3\#15} states that partitions 0, 2, and 4 should be of
size 10, partition 1 of size 5, and partition 3 of size 15. In SMP mode,
these sizes are in terms of nodes; all worker threads associated with a node
are assigned to the partition of that node. This option conflicts with
{\tt +master\_partition}.
\item Runtime parameter: {\tt +partition\_topology} - use a default topology
aware scheme to partition the allocated nodes.
\item Runtime parameter: {\tt +partition\_topology\_scheme <scheme>} - use the
given scheme to partition the allocated nodes. Currently, two generalized
schemes are supported that should be useful on torus networks. If the scheme
is set to 1, the allocated nodes are traversed plane by plane during
partitioning. A Hilbert curve based traversal is used with scheme 2.
\item Compilation parameter: {\tt -custom-part}, runtime parameter:
{\tt +use\_custom\_partition} - enables the use of user defined partitioning.
In order to implement a new partitioning scheme, a user must link an object
exporting a C function with the following prototype (a sketch of such a
function is shown after this list): \\

{\tt extern "C" void createCustomPartitions(int numparts, int *partitionSize,
int *nodeMap);}\\

{\tt numparts} (input) - number of partitions to be created. \\
{\tt partitionSize} (input) - an array that contains the size of each
partition. \\
{\tt nodeMap} (output, preallocated) - a preallocated array of length
{\tt CmiNumNodesGlobal()}. Entry \emph{i} in this array specifies the new
global node rank of the node with default node rank \emph{i}. The entries in
this array are divided blockwise to create partitions, i.e., entries 0 to
partitionSize[0]-1 belong to partition 0, entries partitionSize[0] to
partitionSize[0]+partitionSize[1]-1 belong to partition 1, and so on.\\

When this function is invoked to create partitions, TopoManager is configured
to view all the allocated nodes as one partition. The partition-based API is
not yet initialized at this point and should not be used. The link time
parameter {\tt -custom-part} must be passed to {\tt charmc} for successful
compilation.
\end{itemize}
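The following is a minimal sketch of a custom partitioning function under the
semantics described above. The reversal of the default node ordering is purely
illustrative, and the consistency check is optional.

\begin{verbatim}
#include "converse.h"

// Minimal sketch of a user defined partitioning function. nodeMap[i] is
// filled with the new global node rank assigned to the node whose default
// rank is i; the runtime then carves the resulting ordering into blocks of
// the requested sizes. This toy scheme simply reverses the default order.
extern "C" void createCustomPartitions(int numparts, int *partitionSize,
                                       int *nodeMap) {
  int numNodes = CmiNumNodesGlobal();  // total nodes across all partitions

  // Optional sanity check: the requested sizes must cover all nodes.
  int total = 0;
  for (int p = 0; p < numparts; p++) total += partitionSize[p];
  if (total != numNodes)
    CmiAbort("createCustomPartitions: sizes do not sum to the node count");

  // Reverse the default node ordering before the blockwise split.
  for (int i = 0; i < numNodes; i++)
    nodeMap[i] = numNodes - 1 - i;
}
\end{verbatim}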
\section{Redirecting output from individual partitions}
Output to standard output (stdout) from the various partitions can be directed
to separate files by passing the target path as a command line option. The
runtime parameter {\tt +stdout <path>} is used for this purpose. The
{\tt <path>} may contain the C format specifier \emph{\%d}, which will be
replaced by the partition number. If \emph{\%d} is specified multiple times,
only the first three instances from the left will be replaced by the partition
number (other or additional format specifiers result in undefined behavior).
If no format specifier is given, the partition number will be appended as a
suffix to the specified path. Example usage:
\begin{itemize}
\item {\tt +stdout out/\%d/log} will write to \emph{out/0/log, out/1/log,
out/2/log,} $\cdots$.
\item {\tt +stdout log} will write to \emph{log.0, log.1, log.2,} $\cdots$.
\item {\tt +stdout out/\%d/log\%d} will write to \emph{out/0/log0, out/1/log1,
out/2/log2,} $\cdots$.
\end{itemize}
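For instance, a launch line along the following lines (the program name and
core count are only placeholders) creates four partitions and redirects each
partition's stdout to its own file:

\begin{verbatim}
./charmrun +p16 ./myprogram +partitions 4 +stdout out/%d/log
\end{verbatim}

Here, partition \emph{i} writes its output to \emph{out/i/log}.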
\section{Inter-partition Communication}
A new API was added to Converse to enable sending messages from one partition
to another. Currently, the following functions are available for this purpose:
\begin{itemize}
\item {\tt CmiInterSyncSend(local\_rank, partition, size, message)}
\item {\tt CmiInterSyncSendAndFree(local\_rank, partition, size, message)}
\item {\tt CmiInterSyncNodeSend(local\_node, partition, size, message)}
\item {\tt CmiInterSyncNodeSendAndFree(local\_node, partition, size, message)}
\end{itemize}
Users who have coded in Converse will find these functions very similar to the
basic Converse send functions, {\tt CmiSyncSend} and {\tt CmiSyncSendAndFree}.
Given the local rank of a PE and the partition it belongs to, these functions
pass the message to the machine layer. {\tt CmiInterSyncSend} does not return
until {\tt message} is ready for reuse. {\tt CmiInterSyncSendAndFree} passes
ownership of {\tt message} to the Charm++ RTS, which will free the message
when the send is complete. Each Converse message contains a message header,
which makes these messages active, i.e., they contain information about their
handlers. These handlers can be registered using the existing Charm++ API,
{\tt CmiRegisterHandler}. {\tt CmiInterSyncNodeSend} and
{\tt CmiInterSyncNodeSendAndFree} are the node-level counterparts of these
functions, which allow sending a message to a node (in SMP mode).
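As an illustration, the following minimal sketch sends an integer from the
current PE to local PE 0 of the next partition. The message layout, the
handler, and the helper functions {\tt registerPingHandler()} and
{\tt sendPing()} are hypothetical; only the Converse calls themselves are part
of the API described above.

\begin{verbatim}
#include "converse.h"

CpvDeclare(int, pingHandlerIdx);   // handler index, one copy per PE

// Handler invoked on the receiving PE when the message arrives.
void pingHandler(void *msg) {
  int *payload = (int *)((char *)msg + CmiMsgHeaderSizeBytes);
  CmiPrintf("[partition %d, PE %d] received value %d\n",
            CmiMyPartition(), CmiMyPe(), *payload);
  CmiFree(msg);
}

// Must run on every PE during initialization so that the handler index
// is consistent everywhere.
void registerPingHandler() {
  CpvInitialize(int, pingHandlerIdx);
  CpvAccess(pingHandlerIdx) = CmiRegisterHandler((CmiHandler)pingHandler);
}

// Send an int to local PE 0 of the next partition (wrapping around).
void sendPing(int value) {
  int destPartition = (CmiMyPartition() + 1) % CmiNumPartitions();
  int size = CmiMsgHeaderSizeBytes + sizeof(int);

  char *msg = (char *)CmiAlloc(size);
  *(int *)(msg + CmiMsgHeaderSizeBytes) = value;
  CmiSetHandler(msg, CpvAccess(pingHandlerIdx));  // make the message "active"

  // Ownership of msg passes to the runtime, which frees it after the send.
  CmiInterSyncSendAndFree(0 /* local PE rank */, destPartition, size, msg);
}
\end{verbatim}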