third_party/highway/README.md

   1 # Efficient and performance-portable vector software
   2
   3 [//]: # (placeholder, do not remove)
   4
   5 Highway is a C++ library that provides portable SIMD/vector intrinsics.
   6
   7 [Documentation](https://google.github.io/highway/en/master/)
   8
   9 ## Why
  10
  11 We are passionate about high-performance software. We see major untapped
  12 potential in CPUs (servers, mobile, desktops). Highway is for engineers who want
  13 to reliably and economically push the boundaries of what is possible in
  14 software.
  15
  16 ## How
  17
  18 CPUs provide SIMD/vector instructions that apply the same operation to multiple
  19 data items. This can reduce energy usage e.g. *fivefold* because fewer
  20 instructions are executed. We also often see *5-10x* speedups.
  21
  22 Highway makes SIMD/vector programming practical and workable according to these
  23 guiding principles:
  24
  25 **Does what you expect**: Highway is a C++ library with carefully-chosen
  26 functions that map well to CPU instructions without extensive compiler
  27 transformations. The resulting code is more predictable and robust to code
  28 changes/compiler updates than autovectorization.
  29
  30 **Works on widely-used platforms**: Highway supports four architectures; the
  31 same application code can target eight instruction sets, including those with
  32 'scalable' vectors (size unknown at compile time). Highway only requires C++11
  33 and supports four families of compilers. If you would like to use Highway on
  34 other platforms, please raise an issue.
  35
  36 **Flexible to deploy**: Applications using Highway can run on heterogeneous
  37 clouds or client devices, choosing the best available instruction set at
  38 runtime. Alternatively, developers may choose to target a single instruction set
  39 without any runtime overhead. In both cases, the application code is the same
  40 except for swapping `HWY_STATIC_DISPATCH` with `HWY_DYNAMIC_DISPATCH` plus one
  41 line of code.
  42
  43 **Suitable for a variety of domains**: Highway provides an extensive set of
  44 operations, used for image processing (floating-point), compression, video
  45 analysis, linear algebra, cryptography, sorting and random generation. We
  46 recognise that new use-cases may require additional ops and are happy to add
  47 them where it makes sense (e.g. no performance cliffs on some architectures). If
  48 you would like to discuss, please file an issue.
  49
  50 **Rewards data-parallel design**: Highway provides tools such as Gather,
  51 MaskedLoad, and FixedTag to enable speedups for legacy data structures. However,
  52 the biggest gains are unlocked by designing algorithms and data structures for
  53 scalable vectors. Helpful techniques include batching, structure-of-array
  54 layouts, and aligned/padded allocations.
  55
  56 ## Examples
  57
  58 Online demos using Compiler Explorer:
  59
  60 -   [multiple targets with dynamic dispatch](https://gcc.godbolt.org/z/zP7MYe9Yf)
  61     (recommended)
  62 -   [single target using -m flags](https://gcc.godbolt.org/z/rGnjMevKG)
  63
  64 We observe that Highway is referenced in the following open source projects,
  65 found via sourcegraph.com. Most are Github repositories. If you would like to
  66 add your project or link to it directly, feel free to raise an issue or contact
  67 us via the below email.
  68
  69 *   Browsers: Chromium (+Vivaldi), Firefox (+floorp / foxhound / librewolf / Waterfox)
  70 *   Cryptography: google/distributed_point_functions
  71 *   Image codecs: eustas/2im, [Grok JPEG 2000](https://github.com/GrokImageCompression/grok), [JPEG XL](https://github.com/libjxl/libjxl), OpenHTJ2K
  72 *   Image processing: cloudinary/ssimulacra2, m-ab-s/media-autobuild_suite
  73 *   Image viewers: AlienCowEatCake/ImageViewer, mirillis/jpegxl-wic,
  74     [Lux panorama/image viewer](https://bitbucket.org/kfj/pv/)
  75 *   Information retrieval: [iresearch database index](https://github.com/iresearch-toolkit/iresearch/blob/e7638e7a4b99136ca41f82be6edccf01351a7223/core/utils/simd_utils.hpp), michaeljclark/zvec
  76
  77 Other
  78
  79 *   [zimt](https://github.com/kfjahnke/zimt): C++11 template library to process n-dimensional arrays with multi-threaded SIMD code
  80 *   [vectorized Quicksort](https://github.com/google/highway/tree/master/hwy/contrib/sort) ([paper](https://arxiv.org/abs/2205.05982))
  81
  82 If you'd like to get Highway, in addition to cloning from this Github repository
  83 or using it as a Git submodule, you can also find it in the following package
  84 managers or repositories: alpinelinux, conan-io, conda-forge, DragonFlyBSD,
  85 freebsd, ghostbsd, microsoft/vcpkg, MidnightBSD, NetBSD, openSUSE, opnsense,
  86 Xilinx/Vitis_Libraries.
  87
  88 ## Current status
  89
  90 ### Targets
  91
  92 Highway supports 19 targets, listed in alphabetical order of platform:
  93
  94 -   Any: `EMU128`, `SCALAR`;
  95 -   Arm: `NEON` (Armv7+), `SVE`, `SVE2`, `SVE_256`, `SVE2_128`;
  96 -   POWER: `PPC8` (v2.07), `PPC9` (v3.0), `PPC10` (v3.1B, not yet supported
  97     due to compiler bugs, see #1207; also requires QEMU 7.2);
  98 -   RISC-V: `RVV` (1.0);
  99 -   WebAssembly: `WASM`, `WASM_EMU256` (a 2x unrolled version of wasm128,
 100     enabled if `HWY_WANT_WASM2` is defined. This will remain supported until it
 101     is potentially superseded by a future version of WASM.);
 102 -   x86:
 103     -   `SSE2`
 104     -   `SSSE3` (~Intel Core)
 105     -   `SSE4` (~Nehalem, also includes AES + CLMUL).
 106     -   `AVX2` (~Haswell, also includes BMI2 + F16 + FMA)
 107     -   `AVX3` (~Skylake, AVX-512F/BW/CD/DQ/VL)
 108     -   `AVX3_DL` (~Icelake, includes BitAlg + CLMUL + GFNI + VAES + VBMI +
 109         VBMI2 + VNNI + VPOPCNT; requires opt-in by defining `HWY_WANT_AVX3_DL`
 110         unless compiling for static dispatch),
 111     -   `AVX3_ZEN4` (like AVX3_DL but optimized for AMD Zen4; requires opt-in by
 112         defining `HWY_WANT_AVX3_ZEN4` if compiling for static dispatch)
 113
 114 SVE was initially tested using farm_sve (see acknowledgments).
 115
 116 ### Versioning
 117
 118 Highway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH),
 119 incrementing MINOR after backward-compatible additions and PATCH after
 120 backward-compatible fixes. We recommend using releases (rather than the Git tip)
 121 because they are tested more extensively, see below.
 122
 123 The current version 1.0 signals an increased focus on backwards compatibility.
 124 Applications using documented functionality will remain compatible with future
 125 updates that have the same major version number.
 126
 127 ### Testing
 128
 129 Continuous integration tests build with a recent version of Clang (running on
 130 native x86, or QEMU for RISC-V and Arm) and MSVC 2019 (v19.28, running on native
 131 x86).
 132
 133 Before releases, we also test on x86 with Clang and GCC, and Armv7/8 via GCC
 134 cross-compile. See the [testing process](g3doc/release_testing_process.md) for
 135 details.
 136
 137 ### Related modules
 138
 139 The `contrib` directory contains SIMD-related utilities: an image class with
 140 aligned rows, a math library (16 functions already implemented, mostly
 141 trigonometry), and functions for computing dot products and sorting.
 142
 143 ### Other libraries
 144
 145 If you only require x86 support, you may also use Agner Fog's
 146 [VCL vector class library](https://github.com/vectorclass). It includes many
 147 functions including a complete math library.
 148
 149 If you have existing code using x86/NEON intrinsics, you may be interested in
 150 [SIMDe](https://github.com/simd-everywhere/simde), which emulates those
 151 intrinsics using other platforms' intrinsics or autovectorization.
 152
 153 ## Installation
 154
 155 This project uses CMake to generate and build. In a Debian-based system you can
 156 install it via:
 157
 158 ```bash
 159 sudo apt install cmake
 160 ```
 161
 162 Highway's unit tests use [googletest](https://github.com/google/googletest).
 163 By default, Highway's CMake downloads this dependency at configuration time.
 164 You can disable this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON and
 165 installing gtest separately:
 166
 167 ```bash
 168 sudo apt install libgtest-dev
 169 ```
 170
 171 Running cross-compiled tests requires support from the OS, which on Debian is
 172 provided by the `qemu-user-binfmt` package.
 173
 174 To build Highway as a shared or static library (depending on BUILD_SHARED_LIBS),
 175 the standard CMake workflow can be used:
 176
 177 ```bash
 178 mkdir -p build && cd build
 179 cmake ..
 180 make -j && make test
 181 ```
 182
 183 Or you can run `run_tests.sh` (`run_tests.bat` on Windows).
 184
 185 Bazel is also supported for building, but it is not as widely used/tested.
 186
 187 When building for Armv7, a limitation of current compilers requires you to add
 188 `-DHWY_CMAKE_ARM7:BOOL=ON` to the CMake command line; see #834 and #1032. We
 189 understand that work is underway to remove this limitation.
 190
 191 Building on 32-bit x86 is not officially supported, and AVX2/3 are disabled by
 192 default there. Note that johnplatts has successfully built and run the Highway
 193 tests on 32-bit x86, including AVX2/3, on GCC 7/8 and Clang 8/11/12. On Ubuntu
 194 22.04, Clang 11 and 12, but not later versions, require extra compiler flags
 195 `-m32 -isystem /usr/i686-linux-gnu/include`. Clang 10 and earlier require the
 196 above plus `-isystem /usr/i686-linux-gnu/include/c++/12/i686-linux-gnu`. See
 197 #1279.
 198
 199 ## Quick start
 200
 201 You can use the `benchmark` inside examples/ as a starting point.
 202
 203 A [quick-reference page](g3doc/quick_reference.md) briefly lists all operations
 204 and their parameters, and the [instruction_matrix](g3doc/instruction_matrix.pdf)
 205 indicates the number of instructions per operation.
 206
 207 The [FAQ](g3doc/faq.md) answers questions about portability, API design and
 208 where to find more information.
 209
 210 We recommend using full SIMD vectors whenever possible for maximum performance
 211 portability. To obtain them, pass a `ScalableTag<float>` (or equivalently
 212 `HWY_FULL(float)`) tag to functions such as `Zero/Set/Load`. There are two
 213 alternatives for use-cases requiring an upper bound on the lanes:
 214
 215 -   For up to `N` lanes, specify `CappedTag<T, N>` or the equivalent
 216     `HWY_CAPPED(T, N)`. The actual number of lanes will be `N` rounded down to
 217     the nearest power of two, such as 4 if `N` is 5, or 8 if `N` is 8. This is
 218     useful for data structures such as a narrow matrix. A loop is still required
 219     because vectors may actually have fewer than `N` lanes.
 220
 221 -   For exactly a power of two `N` lanes, specify `FixedTag<T, N>`. The largest
 222     supported `N` depends on the target, but is guaranteed to be at least
 223     `16/sizeof(T)`.
 224
 225 Due to ADL restrictions, user code calling Highway ops must either:
 226 *   Reside inside `namespace hwy { namespace HWY_NAMESPACE {`; or
 227 *   prefix each op with an alias such as `namespace hn = hwy::HWY_NAMESPACE;
 228     hn::Add()`; or
 229 *   add using-declarations for each op used: `using hwy::HWY_NAMESPACE::Add;`.
 230
 231 Additionally, each function that calls Highway ops (such as `Load`) must either
 232 be prefixed with `HWY_ATTR`, OR reside between `HWY_BEFORE_NAMESPACE()` and
 233 `HWY_AFTER_NAMESPACE()`. Lambda functions currently require `HWY_ATTR` before
 234 their opening brace.
 235
 236 The entry points into code using Highway differ slightly depending on whether
 237 they use static or dynamic dispatch.
 238
 239 *   For static dispatch, `HWY_TARGET` will be the best available target among
 240     `HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see
 241     [quick-reference](g3doc/quick_reference.md)). Functions inside
 242     `HWY_NAMESPACE` can be called using `HWY_STATIC_DISPATCH(func)(args)` within
 243     the same module they are defined in. You can call the function from other
 244     modules by wrapping it in a regular function and declaring the regular
 245     function in a header.
 246
 247 *   For dynamic dispatch, a table of function pointers is generated via the
 248     `HWY_EXPORT` macro that is used by `HWY_DYNAMIC_DISPATCH(func)(args)` to
 249     call the best function pointer for the current CPU's supported targets. A
 250     module is automatically compiled for each target in `HWY_TARGETS` (see
 251     [quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is
 252     defined and `foreach_target.h` is included.
 253
 254 When using dynamic dispatch, `foreach_target.h` is included from translation
 255 units (.cc files), not headers. Headers containing vector code shared between
 256 several translation units require a special include guard, for example the
 257 following taken from `examples/skeleton-inl.h`:
 258
 259 ```
 260 #if defined(HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_) == defined(HWY_TARGET_TOGGLE)
 261 #ifdef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
 262 #undef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
 263 #else
 264 #define HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
 265 #endif
 266
 267 #include "hwy/highway.h"
 268 // Your vector code
 269 #endif
 270 ```
 271
 272 By convention, we name such headers `-inl.h` because their contents (often
 273 function templates) are usually inlined.
 274
 275 ## Compiler flags
 276
 277 Applications should be compiled with optimizations enabled - without inlining,
 278 SIMD code may slow down by factors of 10 to 100. For clang and GCC, `-O2` is
 279 generally sufficient.
 280
 281 For MSVC, we recommend compiling with `/Gv` to allow non-inlined functions to
 282 pass vector arguments in registers. If intending to use the AVX2 target together
 283 with half-width vectors (e.g. for `PromoteTo`), it is also important to compile
 284 with `/arch:AVX2`. This seems to be the only way to generate VEX-encoded SSE4
 285 instructions on MSVC. Otherwise, mixing VEX-encoded AVX2 instructions and
 286 non-VEX SSE4 may cause severe performance degradation. Unfortunately, the
 287 resulting binary will then require AVX2. Note that no such flag is needed for
 288 clang and GCC because they support target-specific attributes, which we use to
 289 ensure proper VEX code generation for AVX2 targets.
 290
 291 ## Strip-mining loops
 292
 293 When vectorizing a loop, an important question is whether and how to deal with
 294 a number of iterations ('trip count', denoted `count`) that does not evenly
 295 divide the vector size `N = Lanes(d)`. For example, it may be necessary to avoid
 296 writing past the end of an array.
 297
 298 In this section, let `T` denote the element type and `d = ScalableTag<T>`.
 299 Assume the loop body is given as a function `template<bool partial, class D>
 300 void LoopBody(D d, size_t index, size_t max_n)`.
 301
 302 "Strip-mining" is a technique for vectorizing a loop by transforming it into an
 303 outer loop and inner loop, such that the number of iterations in the inner loop
 304 matches the vector width. Then, the inner loop is replaced with vector
 305 operations.
 306
 307 Highway offers several strategies for loop vectorization:
 308
 309 *   Ensure all inputs/outputs are padded. Then the (outer) loop is simply
 310
 311     ```
 312     for (size_t i = 0; i < count; i += N) LoopBody<false>(d, i, 0);
 313     ```
 314     Here, the template parameter and second function argument are not needed.
 315
 316     This is the preferred option, unless `N` is in the thousands and vector
 317     operations are pipelined with long latencies. This was the case for
 318     supercomputers in the 90s, but nowadays ALUs are cheap and we see most
 319     implementations split vectors into 1, 2 or 4 parts, so there is little cost
 320     to processing entire vectors even if we do not need all their lanes. Indeed
 321     this avoids the (potentially large) cost of predication or partial
 322     loads/stores on older targets, and does not duplicate code.
 323
 324 *   Process whole vectors and include previously processed elements
 325     in the last vector:
 326     ```
 327     for (size_t i = 0; i < count; i += N) LoopBody<false>(d, HWY_MIN(i, count - N), 0);
 328     ```
 329
 330     This is the second preferred option provided that `count >= N`
 331     and `LoopBody` is idempotent. Some elements might be processed twice, but
 332     a single code path and full vectorization is usually worth it. Even if
 333     `count < N`, it usually makes sense to pad inputs/outputs up to `N`.
 334
 335 *   Use the `Transform*` functions in hwy/contrib/algo/transform-inl.h. This
 336     takes care of the loop and remainder handling and you simply define a
 337     generic lambda function (C++14) or functor which receives the current vector
 338     from the input/output array, plus optionally vectors from up to two extra
 339     input arrays, and returns the value to write to the input/output array.
 340
 341     Here is an example implementing the BLAS function SAXPY (`alpha * x + y`):
 342
 343     ```
 344     Transform1(d, x, n, y, [](auto d, const auto v, const auto v1) HWY_ATTR {
 345       return MulAdd(Set(d, alpha), v, v1);
 346     });
 347     ```
 348
 349 *   Process whole vectors as above, followed by a scalar loop:
 350
 351     ```
 352     size_t i = 0;
 353     for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
 354     for (; i < count; ++i) LoopBody<false>(CappedTag<T, 1>(), i, 0);
 355     ```
 356     The template parameter and second function arguments are again not needed.
 357
 358     This avoids duplicating code, and is reasonable if `count` is large.
 359     If `count` is small, the second loop may be slower than the next option.
 360
 361 *   Process whole vectors as above, followed by a single call to a modified
 362     `LoopBody` with masking:
 363
 364     ```
 365     size_t i = 0;
 366     for (; i + N <= count; i += N) {
 367       LoopBody<false>(d, i, 0);
 368     }
 369     if (i < count) {
 370       LoopBody<true>(d, i, count - i);
 371     }
 372     ```
 373     Now the template parameter and third function argument can be used inside
 374     `LoopBody` to non-atomically 'blend' the first `num_remaining` lanes of `v`
 375     with the previous contents of memory at subsequent locations:
 376     `BlendedStore(v, FirstN(d, num_remaining), d, pointer);`. Similarly,
 377     `MaskedLoad(FirstN(d, num_remaining), d, pointer)` loads the first
 378     `num_remaining` elements and returns zero in other lanes.
 379
 380     This is a good default when it is infeasible to ensure vectors are padded,
 381     but is only safe `#if !HWY_MEM_OPS_MIGHT_FAULT`!
 382     In contrast to the scalar loop, only a single final iteration is needed.
 383     The increased code size from two loop bodies is expected to be worthwhile
 384     because it avoids the cost of masking in all but the final iteration.
 385
 386 ## Additional resources
 387
 388 *   [Highway introduction (slides)](g3doc/highway_intro.pdf)
 389 *   [Overview of instructions per operation on different architectures](g3doc/instruction_matrix.pdf)
 390 *   [Design philosophy and comparison](g3doc/design_philosophy.md)
 391 *   [Implementation details](g3doc/impl_details.md)
 392
 393 ## Acknowledgments
 394
 395 We have used [farm-sve](https://gitlab.inria.fr/bramas/farm-sve) by Berenger
 396 Bramas; it has proved useful for checking the SVE port on an x86 development
 397 machine.
 398
 399 This is not an officially supported Google product.
 400 Contact: janwas@google.com