third_party/highway/README.md

   1 # Efficient and performance-portable SIMD
   2
   3 Highway is a C++ library for SIMD (Single Instruction, Multiple Data), i.e.
   4 applying the same operation to multiple 'lanes' using a single CPU instruction.
   5
   6 ## Why Highway?
   7
   8 - more portable (same source code) than platform-specific intrinsics,
   9 - works on a wider range of compilers than compiler-specific vector extensions,
  10 - more dependable than autovectorization,
  11 - easier to write/maintain than assembly language,
  12 - supports **runtime dispatch**,
  13 - supports **variable-length vector** architectures.
  14
  15 ## Current status
  16
  17 Supported targets: scalar, SSSE3, SSE4, AVX2, AVX-512, AVX3_DL (~Icelake,
  18 requires opt-in by defining `HWY_WANT_AVX3_DL`), NEON (ARMv7 and v8), SVE,
  19 WASM SIMD.
  20
  21 SVE is tested using farm-sve (see acknowledgments). SVE2 is implemented but not
  22 yet validated. A subset of RVV is implemented and tested with GCC and QEMU.
  23 Work is underway to compile using LLVM, which has different intrinsics with AVL (application vector length).
  24
  25 Version 0.11 is considered stable enough to use in other projects, and is
  26 expected to remain backwards compatible unless serious issues are discovered
  27 while finishing the RVV target. After that, Highway will reach version 1.0.
  28
  29 Continuous integration tests build with a recent version of Clang (running on
  30 x86 and QEMU for ARM) and MSVC from VS2015 (running on x86).
  31
  32 Before releases, we also test on x86 with Clang and GCC, and ARMv7/8 via
  33 GCC cross-compile and QEMU. See the
  34 [testing process](g3doc/release_testing_process.md) for details.
  35
  36 The `contrib` directory contains SIMD-related utilities: an image class with
  37 aligned rows, and a math library (16 functions already implemented, mostly
  38 trigonometry).
  39
  40 ## Installation
  41
  42 This project uses CMake to generate build files and build. On a Debian-based
  43 system you can install CMake via:
  44
  45 ```bash
  46 sudo apt install cmake
  47 ```
  48
  49 Highway's unit tests use [googletest](https://github.com/google/googletest).
  50 By default, Highway's CMake downloads this dependency at configuration time.
  51 You can disable this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON and
  52 installing gtest separately:
  53
  54 ```bash
  55 sudo apt install libgtest-dev
  56 ```
  57
  58 To build and test the library, use the standard CMake workflow:
  59
  60 ```bash
  61 mkdir -p build && cd build
  62 cmake ..
  63 make -j && make test
  64 ```
  65
  66 Or you can run `run_tests.sh` (`run_tests.bat` on Windows).
  67
  68 Bazel is also supported for building, but it is not as widely used/tested.
  69
  70 ## Quick start
  71
  72 You can use the `benchmark` example inside `examples/` as a starting point.
  73
  74 A [quick-reference page](g3doc/quick_reference.md) briefly lists all operations
  75 and their parameters, and the [instruction_matrix](g3doc/instruction_matrix.pdf)
  76 indicates the number of instructions per operation.
  77
  78 We recommend using full SIMD vectors whenever possible for maximum performance
  79 portability. To obtain them, pass a `HWY_FULL(float)` tag to functions such as
  80 `Zero/Set/Load`. There is also the option of a vector of up to `N` (a power of
  81 two <= 16/sizeof(T)) lanes of type `T`: `HWY_CAPPED(T, N)`. If `HWY_TARGET ==
  82 HWY_SCALAR`, the vector always has one lane. For all other targets, up to
  83 128-bit vectors are guaranteed to be available.
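
To make the `HWY_CAPPED(T, N)` constraint concrete, the following sketch works through the arithmetic for a few lane types. The helpers `MaxCap` and `IsValidCap` are hypothetical names introduced for illustration; they are not part of Highway and only restate the rule "a power of two <= 16/sizeof(T)".

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of Highway: the largest cap N permitted for
// lane type T, i.e. the number of T lanes in a 128-bit (16-byte) vector.
template <typename T>
constexpr std::size_t MaxCap() {
  return 16 / sizeof(T);
}

// Hypothetical helper: true if n is a power of two and n <= MaxCap<T>().
template <typename T>
constexpr bool IsValidCap(std::size_t n) {
  return n != 0 && (n & (n - 1)) == 0 && n <= MaxCap<T>();
}

static_assert(MaxCap<std::uint8_t>() == 16, "sixteen 8-bit lanes");
static_assert(MaxCap<float>() == 4, "four 32-bit lanes");
static_assert(MaxCap<double>() == 2, "two 64-bit lanes");
```

Under this reading, `HWY_CAPPED(float, 2)` is a valid request, whereas a cap of 3 (not a power of two) or 8 (more than 128 bits of `float`) is not.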
  84
  85 Functions using Highway must be inside `namespace HWY_NAMESPACE {`
  86 (possibly nested in one or more other namespaces defined by the project), and
  87 additionally either prefixed with `HWY_ATTR`, or residing between
  88 `HWY_BEFORE_NAMESPACE()` and `HWY_AFTER_NAMESPACE()`.
  89
  90 *   For static dispatch, `HWY_TARGET` will be the best available target among
  91     `HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see
  92     [quick-reference](g3doc/quick_reference.md)). Functions inside `HWY_NAMESPACE`
  93     can be called using `HWY_STATIC_DISPATCH(func)(args)` within the same module
  94     they are defined in. You can call the function from other modules by
  95     wrapping it in a regular function and declaring the regular function in a
  96     header.
  97
  98 *   For dynamic dispatch, a table of function pointers is generated via the
  99     `HWY_EXPORT` macro that is used by `HWY_DYNAMIC_DISPATCH(func)(args)` to
 100     call the best function pointer for the current CPU's supported targets. A
 101     module is automatically compiled for each target in `HWY_TARGETS` (see
 102     [quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is
 103     defined and `foreach_target.h` is included.
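
The idea behind dynamic dispatch can be sketched in plain C++ without any Highway headers. This is only an analogy under stated assumptions, not Highway's implementation: the target indices, `SumImpl`, `kSumTable`, `ChooseBest` and `Sum` are hypothetical names, and the supported-target bits are passed in by the caller instead of being detected from the CPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical target indices, ordered from most to least capable.
// Illustrative only; they do not correspond to Highway's HWY_* values.
enum Target { kWide = 0, kNarrow = 1, kScalar = 2, kNumTargets = 3 };

// Stand-in for "the same source compiled once per target": the lane count is
// a template parameter instead of a property of the instruction set.
template <std::size_t kLanes>
float SumImpl(const float* p, std::size_t n) {
  float acc[kLanes] = {};  // one partial sum per lane
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) acc[lane] += p[i + lane];
  }
  float total = 0.0f;
  for (std::size_t lane = 0; lane < kLanes; ++lane) total += acc[lane];
  for (; i < n; ++i) total += p[i];  // leftover elements
  return total;
}

using SumFunc = float (*)(const float*, std::size_t);

// Table of function pointers, one entry per target (the role the text
// assigns to HWY_EXPORT).
static const SumFunc kSumTable[kNumTargets] = {&SumImpl<8>, &SumImpl<4>,
                                               &SumImpl<1>};

// Returns the entry for the best target whose bit is set in `supported_bits`
// (bit t set means target t is usable). A real library would derive these
// bits from CPU detection; here the caller supplies them.
SumFunc ChooseBest(unsigned supported_bits) {
  for (int t = 0; t < kNumTargets; ++t) {
    if (supported_bits & (1u << t)) return kSumTable[t];
  }
  return kSumTable[kScalar];  // scalar fallback is always available
}

// Counterpart of HWY_DYNAMIC_DISPATCH(func)(args) in this sketch.
float Sum(const float* p, std::size_t n, unsigned supported_bits) {
  assert(p != nullptr || n == 0);
  return ChooseBest(supported_bits)(p, n);
}
```

Calling `Sum(data, count, bits)` gives the same result whichever entry is chosen; only the code path differs, which is the property that makes per-target compilation transparent to callers.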
 104
 105 ## Compiler flags
 106
 107 Applications should be compiled with optimizations enabled. Without inlining,
 108 SIMD code may slow down by factors of 10 to 100. For Clang and GCC, `-O2` is
 109 generally sufficient.
 110
 111 For MSVC, we recommend compiling with `/Gv` to allow non-inlined functions to
 112 pass vector arguments in registers. If intending to use the AVX2 target together
 113 with half-width vectors (e.g. for `PromoteTo`), it is also important to compile
 114 with `/arch:AVX2`. This seems to be the only way to generate VEX-encoded SSE4
 115 instructions on MSVC. Otherwise, mixing VEX-encoded AVX2 instructions and
 116 non-VEX SSE4 may cause severe performance degradation. Unfortunately, the
 117 resulting binary will then require AVX2. Note that no such flag is needed for
 118 Clang and GCC because they support target-specific attributes, which we use to
 119 ensure proper VEX code generation for AVX2 targets.
 120
 121 ## Strip-mining loops
 122
 123 To vectorize a loop, "strip-mining" transforms it into an outer loop and an
 124 inner loop whose number of iterations matches the preferred vector width.
 125
 126 In this section, let `T` denote the element type, `d = HWY_FULL(T)`, `count` the
 127 number of elements to process, and `N = Lanes(d)` the number of lanes in a full
 128 vector. Assume the loop body is given as a function `template<bool partial,
 129 class D> void LoopBody(D d, size_t max_n)`.
 130
 131 Highway offers several ways to express loops where `N` need not divide `count`:
 132
 133 *   Ensure all inputs/outputs are padded. Then the loop is simply
 134
 135     ```
 136     for (size_t i = 0; i < count; i += N) LoopBody<false>(d, 0);
 137     ```
 138     Here, the template parameter and second function argument are not needed.
 139
 140     This is the preferred option, unless `N` is in the thousands and vector
 141     operations are pipelined with long latencies. This was the case for
 142     supercomputers in the 90s, but nowadays ALUs are cheap and we see most
 143     implementations split vectors into 1, 2 or 4 parts, so there is little cost
 144     to processing entire vectors even if we do not need all their lanes. Indeed
 145     this avoids the (potentially large) cost of predication or partial
 146     loads/stores on older targets, and does not duplicate code.
 147
 148 *   Process whole vectors as above, followed by a scalar loop:
 149
 150     ```
 151     size_t i = 0;
 152     for (; i + N <= count; i += N) LoopBody<false>(d, 0);
 153     for (; i < count; ++i) LoopBody<false>(HWY_CAPPED(T, 1)(), 0);
 154     ```
 155     The template parameter and second function argument are again not needed.
 156
 157     This avoids duplicating code, and is reasonable if `count` is large.
 158     If `count` is small, the second loop may be slower than the next option.
 159
 160 *   Process whole vectors as above, followed by a single call to a modified
 161     `LoopBody` with masking:
 162
 163     ```
 164     size_t i = 0;
 165     for (; i + N <= count; i += N) {
 166       LoopBody<false>(d, 0);
 167     }
 168     if (i < count) {
 169       LoopBody<true>(d, count - i);
 170     }
 171     ```
 172     Now the template parameter and second function argument can be used inside
 173     `LoopBody` to 'blend' the new partial vector with previous memory contents:
 174     `Store(IfThenElse(FirstN(d, max_n), partial, prev_full), d, aligned_pointer);`.
 175
 176     This is a good default when it is infeasible to ensure vectors are padded.
 177     In contrast to the scalar loop, only a single final iteration is needed.
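
The third option (whole vectors followed by one masked iteration) can be emulated in plain C++ to show the control flow and the blend. This is a sketch under stated assumptions, not Highway code: `kN` stands in for `N = Lanes(d)`, `LoopBody` takes a pointer instead of a tag, the operation (doubling each element) is arbitrary, and the buffer is assumed to be readable and writable up to the next multiple of `kN` even though only `count` elements are logically valid.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kN = 4;  // stand-in for N = Lanes(d)

// Hypothetical loop body: doubles the kN elements starting at `p`.
// If `partial` is true, only the first `max_n` lanes are updated; the
// remaining lanes are written back with their previous contents.
template <bool partial>
void LoopBody(float* p, std::size_t max_n) {
  assert(max_n <= kN);
  for (std::size_t lane = 0; lane < kN; ++lane) {
    const float prev = p[lane];                        // previous memory
    const float updated = 2.0f * prev;                 // new value
    const bool selected = !partial || lane < max_n;    // emulates FirstN
    p[lane] = selected ? updated : prev;               // emulates IfThenElse
  }
}

// Strip-mined loop: whole vectors first, then a single masked iteration.
void DoubleAll(float* data, std::size_t count) {
  std::size_t i = 0;
  for (; i + kN <= count; i += kN) {
    LoopBody<false>(data + i, 0);
  }
  if (i < count) {
    LoopBody<true>(data + i, count - i);
  }
}
```

With a buffer of 8 floats and `count = 6`, the first call handles elements 0 to 3, the masked call updates elements 4 and 5, and elements 6 and 7 keep their previous values.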
 178
 179 ## Additional resources
 180
 181 *   [Highway introduction (slides)](g3doc/highway_intro.pdf)
 182 *   [Overview of instructions per operation on different architectures](g3doc/instruction_matrix.pdf)
 183 *   [Design philosophy and comparison](g3doc/design_philosophy.md)
 184
 185 ## Acknowledgments
 186
 187 We have used [farm-sve](https://gitlab.inria.fr/bramas/farm-sve) by Berenger
 188 Bramas; it has proved useful for checking the SVE port on an x86 development
 189 machine.
 190
 191 This is not an officially supported Google product.
 192 Contact: janwas@google.com