third-party/folly/src/folly/detail/base64_detail/README.md

   1 *This file contains explanation of how SIMD implementation of Base64 functions*
   2
   3 Our solution is based on 0x80 blog: [encoding](http://0x80.pl/notesen/2016-01-12-sse-base64-encoding.html), [decoding](http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html).
   4
   5 This is a different version of the same explanation + we have some differences.
   6
   7 *NOTE: We are using 1 digit for two bits when writing data*
   8
   9 # Understanding Base64 encoding.
  10
  11 *If you find it easier, [Wikipedia](https://en.wikipedia.org/wiki/Base64).*
  12
  13 The goal of base64 is to encode arbitrary byte data as text that can be passed around.
  14 It works as follows:
  15 * we select a 64 character alphabet.
  16 * we insert two 0 bits after each 6 bits. 2^6 is 64.
  17 * because we now only need to encode 64 values, our alphabet is sufficient.
  18
  19 # Encoding
  20
  21 So, we have 2 steps needed to do the conversion: 1. get indexes 0-64, 2. lookup the corresponding char.
  22 There is also a little bit of tail handling, see the apropriate section.
  23
  24 If one letter represents two bits, it looks smth like this (remember - our machines are little endian):
  25
  26 ```
  27 `cddd`bbcc`aaab => 0ddd`0ccc`0bbb`0aaa
  28 ```
  29
  30 After this we just have to convert the bytes into the selected alphabet.
  31
  32 This means we have two steps to the algorithm: encodeToIndexes and lookupByIndex.
  33
  34 ## encodeToIndexes
  35
  36 *NOTE: assume 16 byte registers.*
  37
  38 We will only be converting 12 bytes to output 16 bytes.
  39 The last 4 loaded bytes are ignored.
  40
  41 1. We need to shuffle bytes in a way that each dword contains all the bits to construct output for that dword. Specifically:
  42
  43 ```
  44 PONM,LKJI,HGFE,DCBA => KLJK,GIGH,EFDE,BCAB
  45 ```
  46
  47 > STEP BY STEP:
  48 > * We had bytes in memory ABCD.
  49 > * Loaded on a LE machine they become DCBA.
  50 > * The first output dword will only fit ABC,
  51     D goes into the second dword.
  52 > * We will also reshuffle CBA as BCAB becuase this will make
  53     futher operations simpler.
  54
  55 After this we will mutate each dword (letter+digit indicate 2 bits):
  56
  57 ```
  58 shuffled b2b3c1c2'c3d1d2d3'a1a2a3b1'b2b3c1c2
  59 exepcted 00d1d2d3'00c1c2c3'00b1b2b3'00a1a2a3
  60 ```
  61
  62 At the moment of writing this we only have an SSE4.2 version. We can do it fairly straightforwardly with shifts and blends. However the 0x80 blogs suggests that using a tricky mutliplication scheme is superior, and that's what we do.
  63
  64 For neon there are by element shift instructions which would help.
  65 On avx2 there are `srlv` family of instructions that can come in handy as well.
  66 The 0x80 blog also talks about a BMI implementation (pdep/pext instructions). They had issues on AMD but apparently not on the AMDs in Meta's fleet, so in the future we can consider it as a possibility as well.
  67
  68 ## lookupByIndex
  69
  70 The idea is to use table lookup/shuffle instructions (`pshuvb` on sse4.2). They allow you to shuffle a register by index (`0..16`).
  71
  72 We have values `0..64`. However, we can utilize that the output values are not random. The are split into a few contigious groups. We don't have to lookup the complete encoding, we just need to lookup the offset to add.
  73
  74
  75 | Index | Maps to | Add to encode |
  76 | ----- | --------| ------------- |
  77 | 0-25  | 'A'-'Z' | 'A' - 0       |
  78 | 26-51 | 'a'-'z' | 'a' - 26      |
  79 | 52-61 | '0'-'9' | '0' - 52      |
  80 | 62    | '+'     | '+' - 62      |
  81 | 63    | '/'     | '/' - 63      |
  82
  83
  84 *NOTE: the URL version of Base64 encoding uses -_ instead of +/*.
  85
  86 Since we have 16 values to map to, we are not going to bother with consolidating 52-61 to a single base and instead will them up individually.
  87
  88 This is our final lookup table:
  89
  90 ```
  91 0: 'A' - 0
  92 1: 'a' - 26
  93 2: '0' - 52,
  94 3: '1' - 53
  95 ...
  96 11: '9' - 61
  97 12: '+' - 62
  98 13: '/' - 63
  99 14: -
 100 15: -
 101 ```
 102
 103 We even have two extra indexes `14` and `15` that are unused.
 104
 105 ## Tail handling
 106
 107 What if we have less than register size to load?
 108 What if we have less than 3 elements to encode?
 109
 110 Base64 for incomplete 3 bytes works as if there were
 111 extra 0s at the end, except the garbage elements are replaced
 112 with either a padding character or just removed.
 113
 114 Example:
 115
 116 |input bytes (be) | Basic  | Url  |
 117 |-----------------|--------|------|
 118 | (1)             |  AQ==  | AQ   |
 119 | (1, 0)          |  AQA=  | AQA  |
 120 | (1, 0, 0)       |  AQAA  | AQAA |
 121
 122 After measuring different options, using scalar loop peeling
 123 proved to be the fastest.
 124
 125 # Decoding
 126
 127 Overall the decoding process is similar to inverse of the encoding: 1. from chars get numbers `0..63` 2. pack them
 128 into bytes.
 129
 130 However, unlike encoding, not all the sequences of bytes are valid.
 131 We have to decide which ones to accept and how to handle errors.
 132
 133 *NOTE: error detection is fairly expensive, if in the
 134 future there is a need for a decode that can sometimes
 135 swallow invalid values, it can be done faster*.
 136
 137 Decisions:
 138 `base64Decode` requires `=` padding and accepts only [`+`, `/`] as 62/63 character.
 139 `base64URLDecode`: allows everything encoded with plain base64 and base64URL.
 140 It will succesfully decode some combinations that are not either.
 141
 142 *NOTE:* details on `baseURLDecode` rules:
 143 * Padding bytes follow regual base64: if they are present they are allowed to be only in positions that
 144   regular base64 allows them to be in: 4th or 3rd and 4th in the 4 byte chunk. "aa==", "aaa=" - OK, "aa=", "==" - not OK. This decision follows the previous most common implementation of base64 in the codebase.
 145 * Both ['+', '/'] and ['-', '_'] are allowed in any combinations. "+-" is legal for example.
 146
 147
 148 *NOTE: this follows the previous design and not necessarily how other places define
 149 base64URLDecode. The difference comes to decoding invalid sequence of padding characters.
 150 Example: "ba=" we consider invalid while some other implmenetations might just strip '='. We
 151 suspect it won't be an issue*.
 152
 153 This behaviour matches what the previous implementation did.
 154
 155 We only have a SIMD version of this algorithm for `base64Decode` because the URL decode has
 156 very little use (so far at least).
 157
 158 The SIMD algorithm has 2 steps:
 159 - `decodeCharToIndex`: convert to numbers from `0..63`
 160 - `packIndexesToBytes`: packs `0..63` into the originally encoded bytes.
 161
 162 If we encounter an error, we are not going to stop processing data. We are just going to mark it and then check at the end.
 163
 164 ## decodeCharToIndex
 165
 166 Convertions we want
 167
 168 | Char      | Char's Hex code | Res     | Res Hex     |
 169 |-----------|-----------------|---------|-------------|
 170 | '+'       |   0x2B          | 62      | 0x3E        |
 171 | '/'       |   0x2F          | 63      | 0x3F        |
 172 | '0' - '9' |   0x30 - 0x39   | 52 - 61 | 0x34 - 0x3D |
 173 | 'A' - 'Z' |   0x41 - 0x5A   | 0  - 25 | 0x00 - 0x19 |
 174 | 'a' - 'z' |   0x61 - 0x7A   | 26 - 51 | 0x1A - 0x33 |
 175
 176 ### Idea
 177
 178 We have unique higher 4 bits (nibble) per consecutive group,
 179 except for '+' and '/'.
 180 We can find a unique offset with pshuvb by it.
 181
 182 ### Deal with '+' and '/' nibbles
 183
 184 We need to move '+' and '/' into their own nibbles.
 185 It's important that all previously invalid chars stay invalid.
 186
 187 The best way I see to do it is to substact 0x0f from everything less than or equal to +.
 188 The reason to use 0x0f is that it's a constant we will need anyways ot get 4 higher nibbles.
 189
 190 However on intel there is no unsigned comparison, so all the negative values will be considered as <= '+'. To deal with it, we can use a signed comparison but subtract with satuation.
 191
 192  * <= '+' gives -1 where '+'
 193  * and 0x0f
 194  * subs
 195
 196 Everything bigger than '+' will stay the same.
 197 Everything smaller than '+' is invalid and will become even smaller.
 198 It will saturate to -127 which is still ivalid, so we are OK.
 199
 200  *NOTE: I was also thinking if this is possible to do with just one subtruction and no mask. Unfortunatly we also need to keep a separate nibble for '0' and '/' and '0' == '/' + 1, so that subtraction will guarantee put '/' and '0' on the same higher nibble.*
 201
 202  *NOTE: If we drop the requirement on invalid elements, again - this is much easier*.
 203
 204 New conversion table:
 205
 206 | Char      | New Hex code  | Res     | Res Hex     |
 207 |-----------|---------------|---------|-------------|
 208 | '+'       |   0x1C        | 62      | 0x3E        |
 209 | '/'       |   0x2F        | 63      | 0x3F        |
 210 | '0' - '9' |   0x30 - 0x39 | 52 - 61 | 0x34 - 0x3D |
 211 | 'A' - 'Z' |   0x41 - 0x5A | 0  - 25 | 0x00 - 0x19 |
 212 | 'a' - 'z' |   0x61 - 0x7A | 26 - 51 | 0x1A - 0x33 |
 213
 214 ### Detect invalid values.
 215
 216 The idea is that we have valid higher nibbles in range from `1..7`.
 217 We can convert higher nibbles into a bitmask `0000'0010 ... 1000'0000` (`1 << nibble`)
 218 Due to a lack of proper shift instruction on x86, we will use a table lookup (`pshuvb`).
 219 All other higher nibbles can become 0, indicating invalid input.
 220
 221 Now for all possible lower nibbles we have a set of valid higher nibbles.
 222
 223
 224 | Lower Nibble | Valid higher nibble |
 225 | ------------ | ------------------- |
 226 | 0            | 3, 5, 7             |
 227 | 1 - 9        | 3, 4, 5, 6, 7       |
 228 | A            | 4, 5, 6, 7          |
 229 | B            | 4, 6                |
 230 | C            | 1, 4, 6             |
 231 | D, E         | 4, 6                |
 232 | F            | 2, 4, 6             |
 233
 234
 235 So: by higher nibble lookup a bit that indicates that higher nibble
 236     by lower nible lookup a set of valid bits
 237     and
 238
 239 This will give us a non-zero when it's OK.
 240 And 0 when it's not.
 241
 242 Between iterations we can accumulate error with unsigned min.
 243 0 - is the smallest value - so we will get 0 if there was one.
 244
 245 ### Offset to correct values.
 246
 247 Similar idea to encode - we just lookup the proper offset and add it.
 248 We'll need to lookup by higher nibble.
 249
 250 | Char      | New Hex code  | Res     | Offset (- means minus) |
 251 |-----------|---------------|---------|------------------------|
 252 | '+'       |   0x1C        | 62      | 62 - 0x1C              |
 253 | '/'       |   0x2F        | 63      | 63 - '/'               |
 254 | '0' - '9' |   0x30 - 0x39 | 52 - 61 | 52 - '0'               |
 255 | 'A' - 'O' |   0x41 - 0x4F | 0  - 14 | 0  - 'A'               |
 256 | 'P' - 'Z' |   0x50 - 0x5A | 15 - 25 | 0  - 'A'               |
 257 | 'a' - 'o' |   0x61 - 0x6F | 26 - 40 | 26 - 'a'               |
 258 | 'p' - 'z' |   0x70 - 0x7A | 26 - 51 | 26 - 'a'               |
 259
 260 ## packIndexesToBytes
 261
 262 *NOTE: assume 16 byte registers.*
 263
 264 From a register of 16 bytes we need to get 12 bytes.
 265 I think the only way to do it is to convert each 4 bytes to
 266 appropriate 3 bytes and then put them all together.
 267 After we doing conversion 4 to 3 we'll remove the leftover byte
 268 with one shuffle.
 269
 270 Goal (*One letter - 2 bits*):
 271
 272 - Input: `0ddd'0ccc'0bbb'0aaa`
 273 - Needed bytes: `'cddd'bbcc'aaab`
 274
 275 We can obviously shift and or to get the desired result
 276 (and maybe on ARM we should) but again, shifting story for x86 is poor.
 277
 278 The 0x80 blog suggests the following trick.
 279
 280 ### Multiply add (MADD) solution
 281
 282 On SSE there are `madd` instrucitons for adjacent bytes (`_mm_maddubs_epi16`)  and words (`_mm_madd_epi16`). Multiply is a strictly more powerful operation than shift, so this will work.
 283
 284 *NOTE: all of our numbers are positive so the sign of the operations doesn't matter*
 285
 286 Here is the scalar equivalent of the code we'll do.
 287 (*I omit indicating leading 0s and ignore casts*).
 288
 289 ```
 290 std::uint16_t aaabbb = aaa << 6 + bbb;
 291 std::uint16_t cccddd = ccc << 6 + ddd;
 292 std::uint32_t aaabbbcccddd = aaabbb << 12 + cccddd;
 293 ```
 294
 295 Multiplication equivalent of `<< 6` is `* 0x40` and of `<< 12` is `* 0x1000`.
 296
 297 Now that we have `0000'aaab'bbcc'cddd`.
 298 The correct order `cddd'bccc'aaab` which we can mix
 299 into the final shuffle.
 300
 301 ## looping
 302
 303 We can convert only in registers. There is a question: when can we write the whole
 304 register to the output?
 305 For example, for 16 bytes of input (no padding) we only have 12 bytes of output space.
 306
 307 A simple way to answer if we have enough space is to compute how much output space we'll need overall.
 308 However, this operation is somewhat expensive and is likely already done for the allocation. So computing output size ideally should be avoided.
 309
 310 We can prove that as long as the `Register` is bigger than 16 bytes, `1.5 Register` of input space is enough to guarantee at least one register of the output sopace.
 311
 312 Proof.
 313
 314 Out of `1.5 Register`, `1 Register` converts to `0.75 Register` in the output. So we are left with proving that `0.5 Register` will produce at least `0.25 Register` of the output.
 315
 316 If there is no padding - `0.5` converts to `3/8`, which is `> 0.25`.
 317 If there is padding, the last dword of input might produce just `1` output byte.
 318
 319 So, we need to prove that:
 320
 321 `convertedSize(RegisterSize / 2 - 4) >= RegisterSize / 4 - 1`
 322
 323 For `16` bytes it works `convertedSize(4) == 3`.
 324 For `Register > 16` bytes, we can split:
 325
 326 ```
 327 convertedSize(RegisterSize / 4) +
 328 convertedSize(RegisterSize / 4 - 4) >=
 329 RegisterSize / 8 +
 330 RegisterSize / 8 - 1
 331 ```
 332
 333 Since `convertedSize(RegisterSize / 4) > RegisterSize / 8` we just reduced the problem to 16 bytes.
 334
 335 Proven.
 336
 337 ## Decoding SWAR
 338
 339 SWAR stands for SIMD Within A Register i.e. using bit tricks to
 340 process multiple bytes in a bigger integer register.
 341 Previous version of Base64Decode was using some bit tricks and, as a result,
 342 was beating our performance for tails.
 343
 344 On top of that, if the platform lacks the necessary SIMD capabilities, SWAR is
 345 benefitial, so we decided to implement it.
 346
 347 How does it work?
 348
 349 The simple version of the decoding alogirhtm matches from the input char to
 350 the index: let's say char `a` is matched to the number `26`. So we would first match
 351 chars to the corresponding indexes and then blend those indexes toghether.
 352
 353 Let's say (one letter still 2 bits):
 354
 355 ```
 356 0AAA'0BBB'0CCC'0DDD  // input chars
 357 0aaa'0bbb'0ccc'0ddd  // correspoding bytes
 358 cddd'bbcc'aaab'____  // after a bunch of bit manipulations
 359                      // (last byte is w/e)
 360 ```
 361
 362 However, what if we incorporated both position and character in the lookup table?
 363
 364 ```
 365 char:                 0AAA
 366 corresponds to index: 0aaa
 367
 368 [0][0AAA] = 0000'0000'aaa0'0000
 369 [1][0AAA] = 0000'aa00'000a'0000
 370 [2][0AAA] = a000'00aa'0000'0000
 371 [4][0AAA] = 0aaa'0000'0000'0000
 372 ```
 373
 374 This way we can just `or` results for all input chars.
 375 Since we have one byte, which is always `0000`, we can use that byte for marking errors.
 376 So for a busted input we can just store `0xffffffff`. We will `or` everything we output - and then check for `0xffffffff` in the end.
 377
 378 This is the formula for each position (d - decoded index for char):
 379
 380 ```
 381 d << 2;                                   // 0: 0000'0000'0000'aaa0
 382 d >> 4 | (d << 12 & 0xff00);              // 1: 0000'0000'bb00'000b
 383 (d << 6 & 0xff00) | (d << 22 & 0xff0000); // 2: 0000'c000'00cc'0000
 384 d << 16;                                  // 3: 0000'0ddd'0000'0000
 385 ```
 386
 387 This is a memory/speed trade off. For a simple table lookup decoding we store 256 bytes. For this decoding we store 256 * 4 * 4 = 4kb. On big enough data this is a win of about a third over naive code. As far as SIMD clean up code is concerned, this is more questionable but for now we decided to go with it given that it shows an
 388 improvement in microbenchmarks.