third_party/highway/README.md

   1 # Efficient and performance-portable vector software
   2
   3 [//]: # (placeholder, do not remove)
   4
   5 Highway is a C++ library that provides portable SIMD/vector intrinsics.
   6
   7 [Documentation](https://google.github.io/highway/en/master/)
   8
   9 Previously licensed under Apache 2, now dual-licensed as Apache 2 / BSD-3.
  10
  11 ## Why
  12
  13 We are passionate about high-performance software. We see major untapped
  14 potential in CPUs (servers, mobile, desktops). Highway is for engineers who want
  15 to reliably and economically push the boundaries of what is possible in
  16 software.
  17
  18 ## How
  19
  20 CPUs provide SIMD/vector instructions that apply the same operation to multiple
  21 data items. This can reduce energy usage e.g. *fivefold* because fewer
  22 instructions are executed. We also often see *5-10x* speedups.
  23
  24 Highway makes SIMD/vector programming practical and workable according to these
  25 guiding principles:
  26
  27 **Does what you expect**: Highway is a C++ library with carefully-chosen
  28 functions that map well to CPU instructions without extensive compiler
  29 transformations. The resulting code is more predictable and robust to code
  30 changes/compiler updates than autovectorization.
  31
  32 **Works on widely-used platforms**: Highway supports five architectures; the
  33 same application code can target various instruction sets, including those with
  34 'scalable' vectors (size unknown at compile time). Highway only requires C++11
  35 and supports four families of compilers. If you would like to use Highway on
  36 other platforms, please raise an issue.
  37
  38 **Flexible to deploy**: Applications using Highway can run on heterogeneous
  39 clouds or client devices, choosing the best available instruction set at
  40 runtime. Alternatively, developers may choose to target a single instruction set
  41 without any runtime overhead. In both cases, the application code is the same
  42 except for swapping `HWY_STATIC_DISPATCH` with `HWY_DYNAMIC_DISPATCH` plus one
  43 line of code.
  44
  45 **Suitable for a variety of domains**: Highway provides an extensive set of
  46 operations, used for image processing (floating-point), compression, video
  47 analysis, linear algebra, cryptography, sorting and random generation. We
  48 recognise that new use-cases may require additional ops and are happy to add
  49 them where it makes sense (e.g. no performance cliffs on some architectures). If
  50 you would like to discuss, please file an issue.
  51
  52 **Rewards data-parallel design**: Highway provides tools such as Gather,
  53 MaskedLoad, and FixedTag to enable speedups for legacy data structures. However,
  54 the biggest gains are unlocked by designing algorithms and data structures for
  55 scalable vectors. Helpful techniques include batching, structure-of-array
  56 layouts, and aligned/padded allocations.
  57
  58 ## Examples
  59
  60 Online demos using Compiler Explorer:
  61
  62 -   [multiple targets with dynamic dispatch](https://gcc.godbolt.org/z/KM3ben7ET)
  63     (more complicated, but flexible and uses best available SIMD)
  64 -   [single target using -m flags](https://gcc.godbolt.org/z/rGnjMevKG)
  65     (simpler, but requires/only uses the instruction set enabled by compiler
  66     flags)
  67
  68 We observe that Highway is referenced in the following open source projects,
  69 found via sourcegraph.com. Most are Github repositories. If you would like to
  70 add your project or link to it directly, feel free to raise an issue or contact
  71 us via the below email.
  72
  73 *   Browsers: Chromium (+Vivaldi), Firefox (+floorp / foxhound / librewolf / Waterfox)
  74 *   Cryptography: google/distributed_point_functions
  75 *   Data structures: bkille/BitLib
  76 *   Image codecs: eustas/2im, [Grok JPEG 2000](https://github.com/GrokImageCompression/grok), [JPEG XL](https://github.com/libjxl/libjxl), OpenHTJ2K, [JPEGenc](https://github.com/osamu620/JPEGenc)
  77 *   Image processing: cloudinary/ssimulacra2, m-ab-s/media-autobuild_suite, [libvips](https://github.com/libvips/libvips)
  78 *   Image viewers: AlienCowEatCake/ImageViewer, mirillis/jpegxl-wic,
  79     [Lux panorama/image viewer](https://bitbucket.org/kfj/pv/)
  80 *   Information retrieval: [iresearch database index](https://github.com/iresearch-toolkit/iresearch/blob/e7638e7a4b99136ca41f82be6edccf01351a7223/core/utils/simd_utils.hpp), michaeljclark/zvec
  81 *   Machine learning: Tensorflow, Numpy, zpye/SimpleInfer
  82 *   Voxels: rools/voxl
  83
  84 Other
  85
  86 *   [Evaluation of C++ SIMD Libraries](https://www.mnm-team.org/pub/Fopras/rock23/):
  87     "Highway excelled with a strong performance across multiple SIMD extensions
  88     [..]. Thus, Highway may currently be the most suitable SIMD library for many
  89     software projects."
  90 *   [zimt](https://github.com/kfjahnke/zimt): C++11 template library to process n-dimensional arrays with multi-threaded SIMD code
  91 *   [vectorized Quicksort](https://github.com/google/highway/tree/master/hwy/contrib/sort) ([paper](https://arxiv.org/abs/2205.05982))
  92
  93 If you'd like to get Highway, in addition to cloning from this Github repository
  94 or using it as a Git submodule, you can also find it in the following package
  95 managers or repositories: alpinelinux, conan-io, conda-forge, DragonFlyBSD,
  96 freebsd, ghostbsd, microsoft/vcpkg, MidnightBSD, MSYS2, NetBSD, openSUSE,
  97 opnsense, Xilinx/Vitis_Libraries. See also the list at
  98 https://repology.org/project/highway-simd-library/versions .
  99
 100 ## Current status
 101
 102 ### Targets
 103
 104 Highway supports 22 targets, listed in alphabetical order of platform:
 105
 106 -   Any: `EMU128`, `SCALAR`;
 107 -   Arm: `NEON` (Armv7+), `SVE`, `SVE2`, `SVE_256`, `SVE2_128`;
 108 -   IBM Z: `Z14`, `Z15`;
 109 -   POWER: `PPC8` (v2.07), `PPC9` (v3.0), `PPC10` (v3.1B, not yet supported
 110     due to compiler bugs, see #1207; also requires QEMU 7.2);
 111 -   RISC-V: `RVV` (1.0);
 112 -   WebAssembly: `WASM`, `WASM_EMU256` (a 2x unrolled version of wasm128,
 113     enabled if `HWY_WANT_WASM2` is defined. This will remain supported until it
 114     is potentially superseded by a future version of WASM.);
 115 -   x86:
 116     -   `SSE2`
 117     -   `SSSE3` (~Intel Core)
 118     -   `SSE4` (~Nehalem, also includes AES + CLMUL).
 119     -   `AVX2` (~Haswell, also includes BMI2 + F16 + FMA)
 120     -   `AVX3` (~Skylake, AVX-512F/BW/CD/DQ/VL)
 121     -   `AVX3_DL` (~Icelake, includes BitAlg + CLMUL + GFNI + VAES + VBMI +
 122         VBMI2 + VNNI + VPOPCNT; requires opt-in by defining `HWY_WANT_AVX3_DL`
 123         unless compiling for static dispatch),
 124     -   `AVX3_ZEN4` (like AVX3_DL but optimized for AMD Zen4; requires opt-in by
 125         defining `HWY_WANT_AVX3_ZEN4` if compiling for static dispatch)
 126     -   `AVX3_SPR` (~Sapphire Rapids, includes AVX-512FP16)
 127
 128 Our policy is that unless otherwise specified, targets will remain supported as
 129 long as they can be (cross-)compiled with currently supported Clang or GCC, and
 130 tested using QEMU. If the target can be compiled with LLVM trunk and tested
 131 using our version of QEMU without extra flags, then it is eligible for inclusion
 132 in our continuous testing infrastructure. Otherwise, the target will be manually
 133 tested before releases with selected versions/configurations of Clang and GCC.
 134
 135 SVE was initially tested using farm_sve (see acknowledgments).
 136
 137 ### Versioning
 138
 139 Highway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH),
 140 incrementing MINOR after backward-compatible additions and PATCH after
 141 backward-compatible fixes. We recommend using releases (rather than the Git tip)
 142 because they are tested more extensively, see below.
 143
 144 The current version 1.0 signals an increased focus on backwards compatibility.
 145 Applications using documented functionality will remain compatible with future
 146 updates that have the same major version number.
 147
 148 ### Testing
 149
 150 Continuous integration tests build with a recent version of Clang (running on
 151 native x86, or QEMU for RISC-V and Arm) and MSVC 2019 (v19.28, running on native
 152 x86).
 153
 154 Before releases, we also test on x86 with Clang and GCC, and Armv7/8 via GCC
 155 cross-compile. See the [testing process](g3doc/release_testing_process.md) for
 156 details.
 157
 158 ### Related modules
 159
 160 The `contrib` directory contains SIMD-related utilities: an image class with
 161 aligned rows, a math library (16 functions already implemented, mostly
 162 trigonometry), and functions for computing dot products and sorting.
 163
 164 ### Other libraries
 165
 166 If you only require x86 support, you may also use Agner Fog's
 167 [VCL vector class library](https://github.com/vectorclass). It includes many
 168 functions including a complete math library.
 169
 170 If you have existing code using x86/NEON intrinsics, you may be interested in
 171 [SIMDe](https://github.com/simd-everywhere/simde), which emulates those
 172 intrinsics using other platforms' intrinsics or autovectorization.
 173
 174 ## Installation
 175
 176 This project uses CMake to generate and build. In a Debian-based system you can
 177 install it via:
 178
 179 ```bash
 180 sudo apt install cmake
 181 ```
 182
 183 Highway's unit tests use [googletest](https://github.com/google/googletest).
 184 By default, Highway's CMake downloads this dependency at configuration time.
 185 You can disable this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON and
 186 installing gtest separately:
 187
 188 ```bash
 189 sudo apt install libgtest-dev
 190 ```
 191
 192 Running cross-compiled tests requires support from the OS, which on Debian is
 193 provided by the `qemu-user-binfmt` package.
 194
 195 To build Highway as a shared or static library (depending on BUILD_SHARED_LIBS),
 196 the standard CMake workflow can be used:
 197
 198 ```bash
 199 mkdir -p build && cd build
 200 cmake ..
 201 make -j && make test
 202 ```
 203
 204 Or you can run `run_tests.sh` (`run_tests.bat` on Windows).
 205
 206 Bazel is also supported for building, but it is not as widely used/tested.
 207
 208 When building for Armv7, a limitation of current compilers requires you to add
 209 `-DHWY_CMAKE_ARM7:BOOL=ON` to the CMake command line; see #834 and #1032. We
 210 understand that work is underway to remove this limitation.
 211
 212 Building on 32-bit x86 is not officially supported, and AVX2/3 are disabled by
 213 default there. Note that johnplatts has successfully built and run the Highway
 214 tests on 32-bit x86, including AVX2/3, on GCC 7/8 and Clang 8/11/12. On Ubuntu
 215 22.04, Clang 11 and 12, but not later versions, require extra compiler flags
 216 `-m32 -isystem /usr/i686-linux-gnu/include`. Clang 10 and earlier require the
 217 above plus `-isystem /usr/i686-linux-gnu/include/c++/12/i686-linux-gnu`. See
 218 #1279.
 219
 220 ## Building highway - Using vcpkg
 221
 222 highway is now available in [vcpkg](https://github.com/Microsoft/vcpkg)
 223
 224 ```bash
 225 vcpkg install highway
 226 ```
 227
 228 The highway port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.
 229
 230 ## Quick start
 231
 232 You can use the `benchmark` inside examples/ as a starting point.
 233
 234 A [quick-reference page](g3doc/quick_reference.md) briefly lists all operations
 235 and their parameters, and the [instruction_matrix](g3doc/instruction_matrix.pdf)
 236 indicates the number of instructions per operation.
 237
 238 The [FAQ](g3doc/faq.md) answers questions about portability, API design and
 239 where to find more information.
 240
 241 We recommend using full SIMD vectors whenever possible for maximum performance
 242 portability. To obtain them, pass a `ScalableTag<float>` (or equivalently
 243 `HWY_FULL(float)`) tag to functions such as `Zero/Set/Load`. There are two
 244 alternatives for use-cases requiring an upper bound on the lanes:
 245
 246 -   For up to `N` lanes, specify `CappedTag<T, N>` or the equivalent
 247     `HWY_CAPPED(T, N)`. The actual number of lanes will be `N` rounded down to
 248     the nearest power of two, such as 4 if `N` is 5, or 8 if `N` is 8. This is
 249     useful for data structures such as a narrow matrix. A loop is still required
 250     because vectors may actually have fewer than `N` lanes.
 251
 252 -   For exactly a power of two `N` lanes, specify `FixedTag<T, N>`. The largest
 253     supported `N` depends on the target, but is guaranteed to be at least
 254     `16/sizeof(T)`.
 255
 256 Due to ADL restrictions, user code calling Highway ops must either:
 257 *   Reside inside `namespace hwy { namespace HWY_NAMESPACE {`; or
 258 *   prefix each op with an alias such as `namespace hn = hwy::HWY_NAMESPACE;
 259     hn::Add()`; or
 260 *   add using-declarations for each op used: `using hwy::HWY_NAMESPACE::Add;`.
 261
 262 Additionally, each function that calls Highway ops (such as `Load`) must either
 263 be prefixed with `HWY_ATTR`, OR reside between `HWY_BEFORE_NAMESPACE()` and
 264 `HWY_AFTER_NAMESPACE()`. Lambda functions currently require `HWY_ATTR` before
 265 their opening brace.
 266
 267 The entry points into code using Highway differ slightly depending on whether
 268 they use static or dynamic dispatch. In both cases, we recommend that the
 269 top-level function receives one or more pointers to arrays, rather than
 270 target-specific vector types.
 271
 272 *   For static dispatch, `HWY_TARGET` will be the best available target among
 273     `HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see
 274     [quick-reference](g3doc/quick_reference.md)). Functions inside
 275     `HWY_NAMESPACE` can be called using `HWY_STATIC_DISPATCH(func)(args)` within
 276     the same module they are defined in. You can call the function from other
 277     modules by wrapping it in a regular function and declaring the regular
 278     function in a header.
 279
 280 *   For dynamic dispatch, a table of function pointers is generated via the
 281     `HWY_EXPORT` macro that is used by `HWY_DYNAMIC_DISPATCH(func)(args)` to
 282     call the best function pointer for the current CPU's supported targets. A
 283     module is automatically compiled for each target in `HWY_TARGETS` (see
 284     [quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is
 285     defined and `foreach_target.h` is included. Note that the first invocation
 286     of `HWY_DYNAMIC_DISPATCH`, or each call to the pointer returned by the first
 287     invocation of `HWY_DYNAMIC_POINTER`, involves some CPU detection overhead.
 288     You can prevent this by calling the following before any invocation of
 289     `HWY_DYNAMIC_*`: `hwy::GetChosenTarget().Update(hwy::SupportedTargets());`.
 290
 291 When using dynamic dispatch, `foreach_target.h` is included from translation
 292 units (.cc files), not headers. Headers containing vector code shared between
 293 several translation units require a special include guard, for example the
 294 following taken from `examples/skeleton-inl.h`:
 295
 296 ```
 297 #if defined(HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_) == defined(HWY_TARGET_TOGGLE)
 298 #ifdef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
 299 #undef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
 300 #else
 301 #define HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_
 302 #endif
 303
 304 #include "hwy/highway.h"
 305 // Your vector code
 306 #endif
 307 ```
 308
 309 By convention, we name such headers `-inl.h` because their contents (often
 310 function templates) are usually inlined.
 311
 312 ## Compiler flags
 313
 314 Applications should be compiled with optimizations enabled - without inlining,
 315 SIMD code may slow down by factors of 10 to 100. For clang and GCC, `-O2` is
 316 generally sufficient.
 317
 318 For MSVC, we recommend compiling with `/Gv` to allow non-inlined functions to
 319 pass vector arguments in registers. If intending to use the AVX2 target together
 320 with half-width vectors (e.g. for `PromoteTo`), it is also important to compile
 321 with `/arch:AVX2`. This seems to be the only way to generate VEX-encoded SSE4
 322 instructions on MSVC. Otherwise, mixing VEX-encoded AVX2 instructions and
 323 non-VEX SSE4 may cause severe performance degradation. Unfortunately, the
 324 resulting binary will then require AVX2. Note that no such flag is needed for
 325 clang and GCC because they support target-specific attributes, which we use to
 326 ensure proper VEX code generation for AVX2 targets.
 327
 328 ## Strip-mining loops
 329
 330 When vectorizing a loop, an important question is whether and how to deal with
 331 a number of iterations ('trip count', denoted `count`) that does not evenly
 332 divide the vector size `N = Lanes(d)`. For example, it may be necessary to avoid
 333 writing past the end of an array.
 334
 335 In this section, let `T` denote the element type and `d = ScalableTag<T>`.
 336 Assume the loop body is given as a function `template<bool partial, class D>
 337 void LoopBody(D d, size_t index, size_t max_n)`.
 338
 339 "Strip-mining" is a technique for vectorizing a loop by transforming it into an
 340 outer loop and inner loop, such that the number of iterations in the inner loop
 341 matches the vector width. Then, the inner loop is replaced with vector
 342 operations.
 343
 344 Highway offers several strategies for loop vectorization:
 345
 346 *   Ensure all inputs/outputs are padded. Then the (outer) loop is simply
 347
 348     ```
 349     for (size_t i = 0; i < count; i += N) LoopBody<false>(d, i, 0);
 350     ```
 351     Here, the template parameter and second function argument are not needed.
 352
 353     This is the preferred option, unless `N` is in the thousands and vector
 354     operations are pipelined with long latencies. This was the case for
 355     supercomputers in the 90s, but nowadays ALUs are cheap and we see most
 356     implementations split vectors into 1, 2 or 4 parts, so there is little cost
 357     to processing entire vectors even if we do not need all their lanes. Indeed
 358     this avoids the (potentially large) cost of predication or partial
 359     loads/stores on older targets, and does not duplicate code.
 360
 361 *   Process whole vectors and include previously processed elements
 362     in the last vector:
 363     ```
 364     for (size_t i = 0; i < count; i += N) LoopBody<false>(d, HWY_MIN(i, count - N), 0);
 365     ```
 366
 367     This is the second preferred option provided that `count >= N`
 368     and `LoopBody` is idempotent. Some elements might be processed twice, but
 369     a single code path and full vectorization is usually worth it. Even if
 370     `count < N`, it usually makes sense to pad inputs/outputs up to `N`.
 371
 372 *   Use the `Transform*` functions in hwy/contrib/algo/transform-inl.h. This
 373     takes care of the loop and remainder handling and you simply define a
 374     generic lambda function (C++14) or functor which receives the current vector
 375     from the input/output array, plus optionally vectors from up to two extra
 376     input arrays, and returns the value to write to the input/output array.
 377
 378     Here is an example implementing the BLAS function SAXPY (`alpha * x + y`):
 379
 380     ```
 381     Transform1(d, x, n, y, [](auto d, const auto v, const auto v1) HWY_ATTR {
 382       return MulAdd(Set(d, alpha), v, v1);
 383     });
 384     ```
 385
 386 *   Process whole vectors as above, followed by a scalar loop:
 387
 388     ```
 389     size_t i = 0;
 390     for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
 391     for (; i < count; ++i) LoopBody<false>(CappedTag<T, 1>(), i, 0);
 392     ```
 393     The template parameter and second function arguments are again not needed.
 394
 395     This avoids duplicating code, and is reasonable if `count` is large.
 396     If `count` is small, the second loop may be slower than the next option.
 397
 398 *   Process whole vectors as above, followed by a single call to a modified
 399     `LoopBody` with masking:
 400
 401     ```
 402     size_t i = 0;
 403     for (; i + N <= count; i += N) {
 404       LoopBody<false>(d, i, 0);
 405     }
 406     if (i < count) {
 407       LoopBody<true>(d, i, count - i);
 408     }
 409     ```
 410     Now the template parameter and third function argument can be used inside
 411     `LoopBody` to non-atomically 'blend' the first `num_remaining` lanes of `v`
 412     with the previous contents of memory at subsequent locations:
 413     `BlendedStore(v, FirstN(d, num_remaining), d, pointer);`. Similarly,
 414     `MaskedLoad(FirstN(d, num_remaining), d, pointer)` loads the first
 415     `num_remaining` elements and returns zero in other lanes.
 416
 417     This is a good default when it is infeasible to ensure vectors are padded,
 418     but is only safe `#if !HWY_MEM_OPS_MIGHT_FAULT`!
 419     In contrast to the scalar loop, only a single final iteration is needed.
 420     The increased code size from two loop bodies is expected to be worthwhile
 421     because it avoids the cost of masking in all but the final iteration.
 422
 423 ## Additional resources
 424
 425 *   [Highway introduction (slides)](g3doc/highway_intro.pdf)
 426 *   [Overview of instructions per operation on different architectures](g3doc/instruction_matrix.pdf)
 427 *   [Design philosophy and comparison](g3doc/design_philosophy.md)
 428 *   [Implementation details](g3doc/impl_details.md)
 429
 430 ## Acknowledgments
 431
 432 We have used [farm-sve](https://gitlab.inria.fr/bramas/farm-sve) by Berenger
 433 Bramas; it has proved useful for checking the SVE port on an x86 development
 434 machine.
 435
 436 This is not an officially supported Google product.
 437 Contact: janwas@google.com