1 # Efficient and performance-portable vector software
3 [//]: # (placeholder, do not remove)
5 Highway is a C++ library that provides portable SIMD/vector intrinsics.
7 [Documentation](https://google.github.io/highway/en/master/)
11 We are passionate about high-performance software. We see major untapped
12 potential in CPUs (servers, mobile, desktops). Highway is for engineers who want
13 to reliably and economically push the boundaries of what is possible in
18 CPUs provide SIMD/vector instructions that apply the same operation to multiple
19 data items. This can reduce energy usage e.g. *fivefold* because fewer
20 instructions are executed. We also often see *5-10x* speedups.
22 Highway makes SIMD/vector programming practical and workable according to these
25 **Does what you expect**: Highway is a C++ library with carefully-chosen
26 functions that map well to CPU instructions without extensive compiler
27 transformations. The resulting code is more predictable and robust to code
28 changes/compiler updates than autovectorization.
30 **Works on widely-used platforms**: Highway supports four architectures; the
31 same application code can target eight instruction sets, including those with
32 'scalable' vectors (size unknown at compile time). Highway only requires C++11
33 and supports four families of compilers. If you would like to use Highway on
34 other platforms, please raise an issue.
36 **Flexible to deploy**: Applications using Highway can run on heterogeneous
37 clouds or client devices, choosing the best available instruction set at
38 runtime. Alternatively, developers may choose to target a single instruction set
39 without any runtime overhead. In both cases, the application code is the same
40 except for swapping `HWY_STATIC_DISPATCH` with `HWY_DYNAMIC_DISPATCH` plus one
43 **Suitable for a variety of domains**: Highway provides an extensive set of
44 operations, used for image processing (floating-point), compression, video
45 analysis, linear algebra, cryptography, sorting and random generation. We
46 recognise that new use-cases may require additional ops and are happy to add
47 them where it makes sense (e.g. no performance cliffs on some architectures). If
48 you would like to discuss, please file an issue.
50 **Rewards data-parallel design**: Highway provides tools such as Gather,
51 MaskedLoad, and FixedTag to enable speedups for legacy data structures. However,
52 the biggest gains are unlocked by designing algorithms and data structures for
53 scalable vectors. Helpful techniques include batching, structure-of-array
54 layouts, and aligned/padded allocations.
58 Online demos using Compiler Explorer:
60 - [multiple targets with dynamic dispatch](https://gcc.godbolt.org/z/zP7MYe9Yf)
62 - [single target using -m flags](https://gcc.godbolt.org/z/rGnjMevKG)
64 We observe that Highway is referenced in the following open source projects,
65 found via sourcegraph.com. Most are Github repositories. If you would like to
66 add your project or link to it directly, feel free to raise an issue or contact
67 us via the below email.
69 * Browsers: Chromium (+Vivaldi), Firefox (+floorp / foxhound / librewolf / Waterfox)
70 * Cryptography: google/distributed_point_functions
71 * Image codecs: eustas/2im, [Grok JPEG 2000](https://github.com/GrokImageCompression/grok), [JPEG XL](https://github.com/libjxl/libjxl), OpenHTJ2K
72 * Image processing: cloudinary/ssimulacra2, m-ab-s/media-autobuild_suite
73 * Image viewers: AlienCowEatCake/ImageViewer, mirillis/jpegxl-wic,
74 [Lux panorama/image viewer](https://bitbucket.org/kfj/pv/)
75 * Information retrieval: [iresearch database index](https://github.com/iresearch-toolkit/iresearch/blob/e7638e7a4b99136ca41f82be6edccf01351a7223/core/utils/simd_utils.hpp), michaeljclark/zvec
79 * [zimt](https://github.com/kfjahnke/zimt): C++11 template library to process n-dimensional arrays with multi-threaded SIMD code
80 * [vectorized Quicksort](https://github.com/google/highway/tree/master/hwy/contrib/sort) ([paper](https://arxiv.org/abs/2205.05982))
82 If you'd like to get Highway, in addition to cloning from this Github repository
83 or using it as a Git submodule, you can also find it in the following package
84 managers or repositories: alpinelinux, conan-io, conda-forge, DragonFlyBSD,
85 freebsd, ghostbsd, microsoft/vcpkg, MidnightBSD, NetBSD, openSUSE, opnsense,
86 Xilinx/Vitis_Libraries.
92 Highway supports 19 targets, listed in alphabetical order of platform:
94 - Any: `EMU128`, `SCALAR`;
95 - Arm: `NEON` (Armv7+), `SVE`, `SVE2`, `SVE_256`, `SVE2_128`;
96 - POWER: `PPC8` (v2.07), `PPC9` (v3.0), `PPC10` (v3.1B, not yet supported
97 due to compiler bugs, see #1207; also requires QEMU 7.2);
98 - RISC-V: `RVV` (1.0);
99 - WebAssembly: `WASM`, `WASM_EMU256` (a 2x unrolled version of wasm128,
100 enabled if `HWY_WANT_WASM2` is defined. This will remain supported until it
101 is potentially superseded by a future version of WASM.);
104 - `SSSE3` (~Intel Core)
105 - `SSE4` (~Nehalem, also includes AES + CLMUL).
106 - `AVX2` (~Haswell, also includes BMI2 + F16 + FMA)
107 - `AVX3` (~Skylake, AVX-512F/BW/CD/DQ/VL)
108 - `AVX3_DL` (~Icelake, includes BitAlg + CLMUL + GFNI + VAES + VBMI +
109 VBMI2 + VNNI + VPOPCNT; requires opt-in by defining `HWY_WANT_AVX3_DL`
110 unless compiling for static dispatch),
111 - `AVX3_ZEN4` (like AVX3_DL but optimized for AMD Zen4; requires opt-in by
112 defining `HWY_WANT_AVX3_ZEN4` if compiling for static dispatch)
114 SVE was initially tested using farm_sve (see acknowledgments).
118 Highway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH),
119 incrementing MINOR after backward-compatible additions and PATCH after
120 backward-compatible fixes. We recommend using releases (rather than the Git tip)
121 because they are tested more extensively, see below.
123 The current version 1.0 signals an increased focus on backwards compatibility.
124 Applications using documented functionality will remain compatible with future
125 updates that have the same major version number.
129 Continuous integration tests build with a recent version of Clang (running on
130 native x86, or QEMU for RISC-V and Arm) and MSVC 2019 (v19.28, running on native
133 Before releases, we also test on x86 with Clang and GCC, and Armv7/8 via GCC
134 cross-compile. See the [testing process](g3doc/release_testing_process.md) for
139 The `contrib` directory contains SIMD-related utilities: an image class with
140 aligned rows, a math library (16 functions already implemented, mostly
141 trigonometry), and functions for computing dot products and sorting.
145 If you only require x86 support, you may also use Agner Fog's
146 [VCL vector class library](https://github.com/vectorclass). It includes many
147 functions including a complete math library.
149 If you have existing code using x86/NEON intrinsics, you may be interested in
150 [SIMDe](https://github.com/simd-everywhere/simde), which emulates those
151 intrinsics using other platforms' intrinsics or autovectorization.
155 This project uses CMake to generate and build. In a Debian-based system you can
159 sudo apt install cmake
162 Highway's unit tests use [googletest](https://github.com/google/googletest).
163 By default, Highway's CMake downloads this dependency at configuration time.
164 You can disable this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON and
165 installing gtest separately:
168 sudo apt install libgtest-dev
171 Running cross-compiled tests requires support from the OS, which on Debian is
172 provided by the `qemu-user-binfmt` package.
174 To build Highway as a shared or static library (depending on BUILD_SHARED_LIBS),
175 the standard CMake workflow can be used:
178 mkdir -p build && cd build
183 Or you can run `run_tests.sh` (`run_tests.bat` on Windows).
185 Bazel is also supported for building, but it is not as widely used/tested.
187 When building for Armv7, a limitation of current compilers requires you to add
188 `-DHWY_CMAKE_ARM7:BOOL=ON` to the CMake command line; see #834 and #1032. We
189 understand that work is underway to remove this limitation.
191 Building on 32-bit x86 is not officially supported, and AVX2/3 are disabled by
192 default there. Note that johnplatts has successfully built and run the Highway
193 tests on 32-bit x86, including AVX2/3, on GCC 7/8 and Clang 8/11/12. On Ubuntu
194 22.04, Clang 11 and 12, but not later versions, require extra compiler flags
195 `-m32 -isystem /usr/i686-linux-gnu/include`. Clang 10 and earlier require the
196 above plus `-isystem /usr/i686-linux-gnu/include/c++/12/i686-linux-gnu`. See
201 You can use the `benchmark` inside examples/ as a starting point.
203 A [quick-reference page](g3doc/quick_reference.md) briefly lists all operations
204 and their parameters, and the [instruction_matrix](g3doc/instruction_matrix.pdf)
205 indicates the number of instructions per operation.
207 The [FAQ](g3doc/faq.md) answers questions about portability, API design and
208 where to find more information.
210 We recommend using full SIMD vectors whenever possible for maximum performance
211 portability. To obtain them, pass a `ScalableTag<float>` (or equivalently
212 `HWY_FULL(float)`) tag to functions such as `Zero/Set/Load`. There are two
213 alternatives for use-cases requiring an upper bound on the lanes:
215 - For up to `N` lanes, specify `CappedTag<T, N>` or the equivalent
216 `HWY_CAPPED(T, N)`. The actual number of lanes will be `N` rounded down to
217 the nearest power of two, such as 4 if `N` is 5, or 8 if `N` is 8. This is
218 useful for data structures such as a narrow matrix. A loop is still required
219 because vectors may actually have fewer than `N` lanes.
221 - For exactly a power of two `N` lanes, specify `FixedTag<T, N>`. The largest
222 supported `N` depends on the target, but is guaranteed to be at least
225 Due to ADL restrictions, user code calling Highway ops must either:
226 * Reside inside `namespace hwy { namespace HWY_NAMESPACE {`; or
227 * prefix each op with an alias such as `namespace hn = hwy::HWY_NAMESPACE;
229 * add using-declarations for each op used: `using hwy::HWY_NAMESPACE::Add;`.
231 Additionally, each function that calls Highway ops (such as `Load`) must either
232 be prefixed with `HWY_ATTR`, OR reside between `HWY_BEFORE_NAMESPACE()` and
233 `HWY_AFTER_NAMESPACE()`. Lambda functions currently require `HWY_ATTR` before
236 The entry points into code using Highway differ slightly depending on whether
237 they use static or dynamic dispatch.
239 * For static dispatch, `HWY_TARGET` will be the best available target among
240 `HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see
241 [quick-reference](g3doc/quick_reference.md)). Functions inside
242 `HWY_NAMESPACE` can be called using `HWY_STATIC_DISPATCH(func)(args)` within
243 the same module they are defined in. You can call the function from other
244 modules by wrapping it in a regular function and declaring the regular
245 function in a header.
247 * For dynamic dispatch, a table of function pointers is generated via the
248 `HWY_EXPORT` macro that is used by `HWY_DYNAMIC_DISPATCH(func)(args)` to
249 call the best function pointer for the current CPU's supported targets. A
250 module is automatically compiled for each target in `HWY_TARGETS` (see
251 [quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is
252 defined and `foreach_target.h` is included.
254 When using dynamic dispatch, `foreach_target.h` is included from translation
255 units (.cc files), not headers. Headers containing vector code shared between
256 several translation units require a special include guard, for example the
257 following taken from `examples/skeleton-inl.h`:
260 #if defined(HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_) == defined(HWY_TARGET_TOGGLE)
261 #ifdef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
262 #undef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
264 #define HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
267 #include "hwy/highway.h"
272 By convention, we name such headers `-inl.h` because their contents (often
273 function templates) are usually inlined.
277 Applications should be compiled with optimizations enabled - without inlining,
278 SIMD code may slow down by factors of 10 to 100. For clang and GCC, `-O2` is
279 generally sufficient.
281 For MSVC, we recommend compiling with `/Gv` to allow non-inlined functions to
282 pass vector arguments in registers. If intending to use the AVX2 target together
283 with half-width vectors (e.g. for `PromoteTo`), it is also important to compile
284 with `/arch:AVX2`. This seems to be the only way to generate VEX-encoded SSE4
285 instructions on MSVC. Otherwise, mixing VEX-encoded AVX2 instructions and
286 non-VEX SSE4 may cause severe performance degradation. Unfortunately, the
287 resulting binary will then require AVX2. Note that no such flag is needed for
288 clang and GCC because they support target-specific attributes, which we use to
289 ensure proper VEX code generation for AVX2 targets.
291 ## Strip-mining loops
293 When vectorizing a loop, an important question is whether and how to deal with
294 a number of iterations ('trip count', denoted `count`) that does not evenly
295 divide the vector size `N = Lanes(d)`. For example, it may be necessary to avoid
296 writing past the end of an array.
298 In this section, let `T` denote the element type and `d = ScalableTag<T>`.
299 Assume the loop body is given as a function `template<bool partial, class D>
300 void LoopBody(D d, size_t index, size_t max_n)`.
302 "Strip-mining" is a technique for vectorizing a loop by transforming it into an
303 outer loop and inner loop, such that the number of iterations in the inner loop
304 matches the vector width. Then, the inner loop is replaced with vector
307 Highway offers several strategies for loop vectorization:
309 * Ensure all inputs/outputs are padded. Then the (outer) loop is simply
312 for (size_t i = 0; i < count; i += N) LoopBody<false>(d, i, 0);
314 Here, the template parameter and second function argument are not needed.
316 This is the preferred option, unless `N` is in the thousands and vector
317 operations are pipelined with long latencies. This was the case for
318 supercomputers in the 90s, but nowadays ALUs are cheap and we see most
319 implementations split vectors into 1, 2 or 4 parts, so there is little cost
320 to processing entire vectors even if we do not need all their lanes. Indeed
321 this avoids the (potentially large) cost of predication or partial
322 loads/stores on older targets, and does not duplicate code.
324 * Process whole vectors and include previously processed elements
327 for (size_t i = 0; i < count; i += N) LoopBody<false>(d, HWY_MIN(i, count - N), 0);
330 This is the second preferred option provided that `count >= N`
331 and `LoopBody` is idempotent. Some elements might be processed twice, but
332 a single code path and full vectorization is usually worth it. Even if
333 `count < N`, it usually makes sense to pad inputs/outputs up to `N`.
335 * Use the `Transform*` functions in hwy/contrib/algo/transform-inl.h. This
336 takes care of the loop and remainder handling and you simply define a
337 generic lambda function (C++14) or functor which receives the current vector
338 from the input/output array, plus optionally vectors from up to two extra
339 input arrays, and returns the value to write to the input/output array.
341 Here is an example implementing the BLAS function SAXPY (`alpha * x + y`):
344 Transform1(d, x, n, y, [](auto d, const auto v, const auto v1) HWY_ATTR {
345 return MulAdd(Set(d, alpha), v, v1);
349 * Process whole vectors as above, followed by a scalar loop:
353 for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
354 for (; i < count; ++i) LoopBody<false>(CappedTag<T, 1>(), i, 0);
356 The template parameter and second function arguments are again not needed.
358 This avoids duplicating code, and is reasonable if `count` is large.
359 If `count` is small, the second loop may be slower than the next option.
361 * Process whole vectors as above, followed by a single call to a modified
362 `LoopBody` with masking:
366 for (; i + N <= count; i += N) {
367 LoopBody<false>(d, i, 0);
370 LoopBody<true>(d, i, count - i);
373 Now the template parameter and third function argument can be used inside
374 `LoopBody` to non-atomically 'blend' the first `num_remaining` lanes of `v`
375 with the previous contents of memory at subsequent locations:
376 `BlendedStore(v, FirstN(d, num_remaining), d, pointer);`. Similarly,
377 `MaskedLoad(FirstN(d, num_remaining), d, pointer)` loads the first
378 `num_remaining` elements and returns zero in other lanes.
380 This is a good default when it is infeasible to ensure vectors are padded,
381 but is only safe `#if !HWY_MEM_OPS_MIGHT_FAULT`!
382 In contrast to the scalar loop, only a single final iteration is needed.
383 The increased code size from two loop bodies is expected to be worthwhile
384 because it avoids the cost of masking in all but the final iteration.
386 ## Additional resources
388 * [Highway introduction (slides)](g3doc/highway_intro.pdf)
389 * [Overview of instructions per operation on different architectures](g3doc/instruction_matrix.pdf)
390 * [Design philosophy and comparison](g3doc/design_philosophy.md)
391 * [Implementation details](g3doc/impl_details.md)
395 We have used [farm-sve](https://gitlab.inria.fr/bramas/farm-sve) by Berenger
396 Bramas; it has proved useful for checking the SVE port on an x86 development
399 This is not an officially supported Google product.
400 Contact: janwas@google.com