# Efficient and performance-portable SIMD

Highway is a C++ library for SIMD (Single Instruction, Multiple Data), i.e.
applying the same operation to multiple 'lanes' using a single CPU instruction.

- more portable (same source code) than platform-specific intrinsics,
- works on a wider range of compilers than compiler-specific vector extensions,
- more dependable than autovectorization,
- easier to write/maintain than assembly language,
- supports **runtime dispatch**,
- supports **variable-length vector** architectures.

Supported targets: scalar, SSSE3, SSE4, AVX2, AVX-512, AVX3_DL (~Icelake,
requires opt-in by defining `HWY_WANT_AVX3_DL`), NEON (ARMv7 and v8), SVE,

SVE is tested using farm-sve (see acknowledgments). SVE2 is implemented but not
yet validated. A subset of RVV is implemented and tested with GCC and QEMU.
Work is underway to compile using LLVM, which has different intrinsics with AVL
(application vector length).

Version 0.11 is considered stable enough to use in other projects, and is
expected to remain backwards compatible unless serious issues are discovered
while finishing the RVV target. After that, Highway will reach version 1.0.

Continuous integration tests build with a recent version of Clang (running on
x86 and QEMU for ARM) and MSVC from VS2015 (running on x86).

Before releases, we also test on x86 with Clang and GCC, and ARMv7/8 via
GCC cross-compile and QEMU. See the
[testing process](g3doc/release_testing_process.md) for details.

The `contrib` directory contains SIMD-related utilities: an image class with
aligned rows, and a math library (16 functions already implemented, mostly

This project uses CMake to generate and build. On a Debian-based system you can
install it via:

    sudo apt install cmake

Highway's unit tests use [googletest](https://github.com/google/googletest).
By default, Highway's CMake downloads this dependency at configuration time.
You can disable this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON and
installing gtest separately:

    sudo apt install libgtest-dev

To build and test the library, the standard CMake workflow can be used:

    mkdir -p build && cd build

Or you can run `run_tests.sh` (`run_tests.bat` on Windows).

Bazel is also supported for building, but it is not as widely used/tested.

You can use the `benchmark` inside examples/ as a starting point.

A [quick-reference page](g3doc/quick_reference.md) briefly lists all operations
and their parameters, and the [instruction_matrix](g3doc/instruction_matrix.pdf)
indicates the number of instructions per operation.

We recommend using full SIMD vectors whenever possible for maximum performance
portability. To obtain them, pass a `HWY_FULL(float)` tag to functions such as
`Zero`, `Set` or `Load`. There is also the option of a vector of up to `N` lanes
of type `T`, where `N` is a power of two no greater than `16/sizeof(T)`:
`HWY_CAPPED(T, N)`. If `HWY_TARGET == HWY_SCALAR`, the vector always has one
lane. For all other targets, up to 128-bit vectors are guaranteed to be
available.

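
The lane-count rules above can be checked with a few lines of plain C++. This is only a model of the arithmetic, not Highway code: `MaxCappedLanes`, `IsValidCap` and `CappedLanes` are hypothetical helper names, and `full_lanes` stands in for whatever the target's full vector provides (1 on a scalar-only target).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not part of Highway) that model the lane-count rules
// for capped vectors described in the text.

// Upper bound on the cap N for elements of the given size: 16 / sizeof(T).
constexpr std::size_t MaxCappedLanes(std::size_t element_bytes) {
  return 16 / element_bytes;
}

// True if n is a power of two that does not exceed 16 / element_bytes.
constexpr bool IsValidCap(std::size_t n, std::size_t element_bytes) {
  return n != 0 && (n & (n - 1)) == 0 && n <= MaxCappedLanes(element_bytes);
}

// Lanes actually obtained: "up to" n, limited by what the target provides.
// full_lanes is assumed to be 1 on a scalar-only target.
inline std::size_t CappedLanes(std::size_t full_lanes, std::size_t n,
                               std::size_t element_bytes) {
  assert(IsValidCap(n, element_bytes));
  return n < full_lanes ? n : full_lanes;
}
```

For 4-byte elements such as `float`, the valid caps under this model are 1, 2 and 4; a cap of 4 yields four lanes on a target with 8 full lanes and a single lane on a scalar-only target.
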
Functions using Highway must be inside `namespace HWY_NAMESPACE {`
(possibly nested in one or more other namespaces defined by the project), and
additionally either prefixed with `HWY_ATTR`, or placed between
`HWY_BEFORE_NAMESPACE()` and `HWY_AFTER_NAMESPACE()`.

* For static dispatch, `HWY_TARGET` will be the best available target among
  `HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see
  [quick-reference](g3doc/quick_reference.md)). Functions inside `HWY_NAMESPACE`
  can be called using `HWY_STATIC_DISPATCH(func)(args)` within the same module
  they are defined in. You can call the function from other modules by
  wrapping it in a regular function and declaring the regular function in a
  header.

* For dynamic dispatch, a table of function pointers is generated via the
  `HWY_EXPORT` macro that is used by `HWY_DYNAMIC_DISPATCH(func)(args)` to
  call the best function pointer for the current CPU's supported targets. A
  module is automatically compiled for each target in `HWY_TARGETS` (see
  [quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is
  defined and foreach_target.h is included.

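
To illustrate the idea behind dynamic dispatch, here is a minimal sketch in plain C++ of a function-pointer table indexed by the best supported target. It does not use Highway's macros, and it is not how `HWY_EXPORT` is implemented; the `Target` bits, `kSumTable`, `BestTargetIndex` and `DispatchSum` are hypothetical names chosen for the example, and the per-target functions share one body because the sketch is compiled only once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical target bits, best first (lower bit = preferred). These are
// illustrative and are not Highway's HWY_* constants.
enum Target : std::uint32_t {
  kAVX2 = 1u << 0,
  kSSE4 = 1u << 1,
  kScalar = 1u << 2
};

using SumFunc = int (*)(const int*, std::size_t);

// Shared body. In a real multi-target build, each variant would instead be
// compiled from the same source with different instruction-set settings.
inline int SumImpl(const int* p, std::size_t n) {
  int total = 0;
  for (std::size_t i = 0; i < n; ++i) total += p[i];
  return total;
}
inline int SumAVX2(const int* p, std::size_t n) { return SumImpl(p, n); }
inline int SumSSE4(const int* p, std::size_t n) { return SumImpl(p, n); }
inline int SumScalar(const int* p, std::size_t n) { return SumImpl(p, n); }

// Table of function pointers, in the same order as the bits in Target.
static const SumFunc kSumTable[3] = {SumAVX2, SumSSE4, SumScalar};

// Index of the lowest set bit, i.e. the best target the CPU supports.
inline std::size_t BestTargetIndex(std::uint32_t supported) {
  assert((supported & kScalar) != 0);  // the scalar fallback must exist
  std::size_t i = 0;
  while ((supported & (1u << i)) == 0) ++i;
  return i;
}

// Calls the best available implementation for the given supported-target mask.
inline int DispatchSum(std::uint32_t supported, const int* p, std::size_t n) {
  return kSumTable[BestTargetIndex(supported)](p, n);
}
```

In this model, a CPU reporting `kSSE4 | kScalar` selects table entry 1. The caller never names a target, which is the property the dispatch macros provide.
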
Applications should be compiled with optimizations enabled: without inlining,
SIMD code may slow down by factors of 10 to 100. For Clang and GCC, `-O2` is
generally sufficient.

For MSVC, we recommend compiling with `/Gv` to allow non-inlined functions to
pass vector arguments in registers. If intending to use the AVX2 target together
with half-width vectors (e.g. for `PromoteTo`), it is also important to compile
with `/arch:AVX2`. This seems to be the only way to generate VEX-encoded SSE4
instructions on MSVC. Otherwise, mixing VEX-encoded AVX2 instructions and
non-VEX SSE4 may cause severe performance degradation. Unfortunately, the
resulting binary will then require AVX2. Note that no such flag is needed for
Clang and GCC because they support target-specific attributes, which we use to
ensure proper VEX code generation for AVX2 targets.

## Strip-mining loops

To vectorize a loop, "strip-mining" transforms it into an outer loop and an
inner loop whose number of iterations matches the preferred vector width.

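
As a rough illustration of the transformation itself, the following plain C++ sketch shows a scalar loop and its strip-mined form. It does not use Highway: `kN` is a hypothetical stand-in for the preferred vector width, the function names are made up for the example, and `count` is assumed to be a multiple of `kN` so that remainder handling can be left aside.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the preferred vector width, in lanes.
constexpr std::size_t kN = 4;

// Original loop: out[i] = 2 * in[i].
void DoubleScalar(const float* in, float* out, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) out[i] = 2.0f * in[i];
}

// Strip-mined form. The outer loop advances one strip of kN elements; the
// inner loop covers the lanes of that strip, which is the part a single
// vector instruction would replace.
void DoubleStripMined(const float* in, float* out, std::size_t count) {
  assert(count % kN == 0);  // assumption: no remainder in this sketch
  for (std::size_t i = 0; i < count; i += kN) {
    for (std::size_t lane = 0; lane < kN; ++lane) {
      out[i + lane] = 2.0f * in[i + lane];
    }
  }
}
```

Both functions produce the same output; only the loop structure differs.
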
In this section, let `T` denote the element type, `d` an instance of the tag
type `HWY_FULL(T)`, `count` the number of elements to process, and
`N = Lanes(d)` the number of lanes in a full vector. Assume the loop body is
given as a function `template<bool partial, class D> void LoopBody(D d, size_t
max_n)`.

Highway offers several ways to express loops where `N` need not divide `count`:

* Ensure all inputs/outputs are padded. Then the loop is simply

      for (size_t i = 0; i < count; i += N) LoopBody<false>(d, 0);

  Here, the template parameter and second function argument are not needed.

  This is the preferred option, unless `N` is in the thousands and vector
  operations are pipelined with long latencies. This was the case for
  supercomputers in the 1990s, but nowadays ALUs are cheap and we see most
  implementations split vectors into 1, 2 or 4 parts, so there is little cost
  to processing entire vectors even if we do not need all their lanes. Indeed
  this avoids the (potentially large) cost of predication or partial
  loads/stores on older targets, and does not duplicate code.

* Process whole vectors as above, followed by a scalar loop:

      size_t i = 0;
      for (; i + N <= count; i += N) LoopBody<false>(d, 0);
      for (; i < count; ++i) LoopBody<false>(HWY_CAPPED(T, 1)(), 0);

  The template parameter and second function argument are again not needed.

  This avoids duplicating code, and is reasonable if `count` is large.
  If `count` is small, the second loop may be slower than the next option.

* Process whole vectors as above, followed by a single call to a modified
  `LoopBody` with masking:

      size_t i = 0;
      for (; i + N <= count; i += N) {
        LoopBody<false>(d, 0);
      }
      if (i < count) {
        LoopBody<true>(d, count - i);
      }

  Now the template parameter and second function argument can be used inside
  `LoopBody` to 'blend' the new partial vector with previous memory contents:
  `Store(IfThenElse(FirstN(d, max_n), partial, prev_full), d, aligned_pointer);`.

  This is a good default when it is infeasible to ensure vectors are padded.
  In contrast to the scalar loop, only a single final iteration is needed.

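
The masked variant can be emulated in plain C++ to show what the blend does. This is a sketch under stated assumptions rather than Highway code: `Vec`, `Mask`, `FirstNLanes`, `Blend`, `LoadVec`, `StoreVec` and `DoubleAll` are hypothetical names that model a vector, a mask, `FirstN`, `IfThenElse`, `Load` and `Store`, and `kN` stands in for `Lanes(d)`. It also assumes the buffer has room for a whole final vector, because the last iteration loads and stores `kN` elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for Lanes(d): a fixed vector width of 4 lanes.
constexpr std::size_t kN = 4;

// Emulated vector and mask types; each array element models one lane.
struct Vec { float lane[kN]; };
struct Mask { bool lane[kN]; };

// Models FirstN: the first n lanes are true, the rest false.
Mask FirstNLanes(std::size_t n) {
  Mask m;
  for (std::size_t i = 0; i < kN; ++i) m.lane[i] = i < n;
  return m;
}

// Models IfThenElse: per lane, take `yes` where the mask is set, else `no`.
Vec Blend(const Mask& m, const Vec& yes, const Vec& no) {
  Vec r;
  for (std::size_t i = 0; i < kN; ++i) r.lane[i] = m.lane[i] ? yes.lane[i] : no.lane[i];
  return r;
}

Vec LoadVec(const float* p) {
  Vec v;
  for (std::size_t i = 0; i < kN; ++i) v.lane[i] = p[i];
  return v;
}

void StoreVec(const Vec& v, float* p) {
  for (std::size_t i = 0; i < kN; ++i) p[i] = v.lane[i];
}

// Doubles kN elements at p in place. If `partial` is true, only the first
// max_n lanes are updated; the others keep their previous memory contents.
template <bool partial>
void LoopBody(float* p, std::size_t max_n) {
  const Vec prev = LoadVec(p);
  Vec doubled;
  for (std::size_t i = 0; i < kN; ++i) doubled.lane[i] = 2.0f * prev.lane[i];
  if (partial) {
    assert(max_n < kN);
    StoreVec(Blend(FirstNLanes(max_n), doubled, prev), p);
  } else {
    StoreVec(doubled, p);
  }
}

// Assumption: `data` has capacity for count rounded up to a multiple of kN.
void DoubleAll(float* data, std::size_t count) {
  std::size_t i = 0;
  for (; i + kN <= count; i += kN) LoopBody<false>(data + i, 0);
  if (i < count) LoopBody<true>(data + i, count - i);
}
```

With an 8-element buffer and `count = 6`, the first vector is processed whole and the second is blended, so elements 6 and 7 remain unchanged.
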
## Additional resources

* [Highway introduction (slides)](g3doc/highway_intro.pdf)
* [Overview of instructions per operation on different architectures](g3doc/instruction_matrix.pdf)
* [Design philosophy and comparison](g3doc/design_philosophy.md)

## Acknowledgments

We have used [farm-sve](https://gitlab.inria.fr/bramas/farm-sve) by Berenger
Bramas; it has proved useful for checking the SVE port on an x86 development
machine.

This is not an officially supported Google product.
Contact: janwas@google.com