1 /* Copyright (C) 2012-2014 Free Software Foundation, Inc.
2 This file is part of the UPC runtime library.
3 Written by Gary Funck <gary@intrepid.com>
4 and Nenad Vukicevic <nenad@intrepid.com>
6 This file is part of GCC.
8 GCC is free software; you can redistribute it and/or modify
9 it under the terms of the GNU General Public License as published by
10 the Free Software Foundation; either version 3, or (at your option)
13 GCC is distributed in the hope that it will be useful,
14 but WITHOUT ANY WARRANTY; without even the implied warranty of
15 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 GNU General Public License for more details.
18 Under Section 7 of GPL version 3, you are granted additional
19 permissions described in the GCC Runtime Library Exception, version
20 3.1, as published by the Free Software Foundation.
22 You should have received a copy of the GNU General Public License and
23 a copy of the GCC Runtime Library Exception along with this program;
24 see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
25 <http://www.gnu.org/licenses/>. */
27 /*****************************************************************************/
29 /* Copyright (c) 2004, Michigan Technological University */
30 /* All rights reserved. */
32 /* Redistribution and use in source and binary forms, with or without */
33 /* modification, are permitted provided that the following conditions */
36 /* * Redistributions of source code must retain the above copyright */
37 /* notice, this list of conditions and the following disclaimer. */
38 /* * Redistributions in binary form must reproduce the above */
39 /* copyright notice, this list of conditions and the following */
40 /* disclaimer in the documentation and/or other materials provided */
41 /* with the distribution. */
42 /* * Neither the name of the Michigan Technological University */
43 /* nor the names of its contributors may be used to endorse or promote */
44 /* products derived from this software without specific prior written */
47 /* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS */
48 /* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT */
49 /* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A */
50 /* PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER */
51 /* OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, */
52 /* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, */
53 /* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR */
54 /* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF */
55 /* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING */
56 /* NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS */
57 /* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */
59 /*****************************************************************************/
63 This is a reference implementation of the UPC V1.0 collective functions.
64 This implementation conforms to UPC V1.1. The specification documents
65 can be found at http://www.upc.gwu/~upc/documentation.html.
69 The Makefile compiles the collective functions to collective.o and it
70 compiles three main programs, try_*.c, that exercise the collective
71 functions. The Makefile has macros that control the number of threads,
72 whether argument checking is turned on, and whether certain collective
73 functions "push" or "pull". Assuming it can find your upc compiler,
74 typing "make" will make collective.o with the "pull" versions of the
75 collective function for the dynamic threads environment and with
76 argument checking turned on.
78 2) Implementation notes
80 This implementation is written in UPC. The 6 "relocalization"
81 functions (upc_all_broadcast(), upc_all_scatter(), etc.) are
82 implemented using memcpy. Each function has a "push" and a "pull"
83 version. "Push" means that the source thread(s) memcpy's the data to
84 the destination thread(s); "pull" means that the destination thread(s)
85 memcpy's the data from the source thread(s). For example, in a "pull"
86 broadcast, each thread does one memcpy. In the "push" version the src
87 thread does THREADS memcpys. Since there is more explicit parallelism
88 in a "pull", that is the default. For upc_all_reduceT(), "push" means
89 that each thread writes its local result to the shared dst argument;
90 "pull" means that the dst thread reads each thread's local result to
91 compute the final result. No performance measurements have been made
92 yet to compare these alternatives. upc_all_prefix_reduceT() and
93 upc_all_sort() do not distinguish "push" and "pull".
97 The relocalization functions are implemented as simple copies. Tree-
98 based recursive algorithms, or other more sophisticated approaches,
99 are not used. Similarly, in upc_all_reduceT() the local "sums"
100 are combined sequentially. upc_all_prefix_reduceT() is also uses a linear
101 algorithm. Performance of this function is likely to suffer if the src
102 and dst arrays "wrap" a lot, such as when their block size is small.
103 The same is true if the affinities of the src and dst are different.
104 The sort algorithm in upc_all_sort is the simplest possible and all
105 sorting is done on thread 0. Users are advised to implement their
106 own parallel sort if performance is a concern.
110 MYSYNC and ALLSYNC (IN and OUT) are all implemented as barriers.
111 NOSYNC is implemented as no barrier, of course. upc_all_reduceT()
112 has one internal barrier to synchronize access to the local "sums".
113 upc_prefix_reduceT() has a minimum of three barriers. Two synchronize
114 access to the "sums" and the third synchronizes the freeing of a
115 dynamically allocated array. Two additional barriers are incurred
116 for each "wrap" of the src array.
120 An initialization function is provided. The first time a collective
121 function is called the initialization function is called automatically.
122 However, the initialization function currently does not do anything
123 and it is safe to eliminate it.
127 If the _UPC_COLL_DEBUG macro is defined at compile time the
128 upc_coll_err() function will be invoked before each call to a
129 collective function. This function checks the single-valuedness of
130 arguments to the collective function and it does a few other simple
131 sanity checks. It uses two barriers to coordinate inter-thread
132 checking. The results of the collective functions are not checked.
136 upc_all_reduceT(), upc_all_prefix_reduceT(), and upc_coll_err()
137 dynamically allocate a block of memory proportional to the number of
138 threads. That memory is freed on function exit.
140 8) Compilation environment
142 This set of functions can be compiled in both the dynamic and the
143 static (fixed number of threads) environment.
147 upc_all_prefix_reduceT() violates the collectives spec because it
148 cannot handle the case of src and dst arguments that do not have
149 the same phase. This condition is detected if argument checking is
152 The PUSH version of upc_all_reduceT() has exhibited an instability
153 on the Alphaserver platform due to Elan's handling of locks.
157 Three small test programs are provided. try_all.c exercises all
158 of the functions at least once, except only one instance of each
159 of upc_all_reduceT() and upc_all_prefix_reduceT() is called.
160 try_reduce.c and try_prefix.c exercise upc_all_reduceT() and
161 upc_all_prefix_reduceT(). Compliance, correctness, and performance
162 tests are not provided.