third_party/rust/encoding_rs/README.md

   1 # encoding_rs
   2
   3 [![Build Status](https://travis-ci.org/hsivonen/encoding_rs.svg?branch=master)](https://travis-ci.org/hsivonen/encoding_rs)
   4 [![crates.io](https://img.shields.io/crates/v/encoding_rs.svg)](https://crates.io/crates/encoding_rs)
   5 [![docs.rs](https://docs.rs/encoding_rs/badge.svg)](https://docs.rs/encoding_rs/)
   6
   7 encoding_rs is an implementation of (the non-JavaScript parts of) the
   8 [Encoding Standard](https://encoding.spec.whatwg.org/) written in Rust.
   9
  10 The Encoding Standard defines the Web-compatible set of character encodings,
  11 which means this crate can be used to decode Web content. encoding_rs is
  12 used in Gecko starting with Firefox 56. Due to the notable overlap between
  13 the legacy encodings on the Web and the legacy encodings used on Windows,
  14 this crate may be of use for non-Web-related situations as well; see below
  15 for links to adjacent crates.
  16
  17 Additionally, the `mem` module provides various operations for dealing with
  18 in-RAM text (as opposed to data that's coming from or going to an IO boundary).
  19 The `mem` module is a module instead of a separate crate due to internal
  20 implementation detail efficiencies.
  21
  22 ## Functionality
  23
  24 Due to the Gecko use case, encoding_rs supports decoding to and encoding from
  25 UTF-16 in addition to supporting the usual Rust use case of decoding to and
  26 encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly
  27 to accommodate the C++ side of Gecko.
  28
  29 Specifically, encoding_rs does the following:
  30
  31 * Decodes a stream of bytes in an Encoding Standard-defined character encoding
  32   into valid aligned native-endian in-RAM UTF-16 (units of `u16` / `char16_t`).
  33 * Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16
  34   (units of `u16` / `char16_t`) into a sequence of bytes in an Encoding
  35   Standard-defined character encoding as if the lone surrogates had been
  36   replaced with the REPLACEMENT CHARACTER before performing the encode.
  37   (Gecko's UTF-16 is potentially invalid.)
  38 * Decodes a stream of bytes in an Encoding Standard-defined character
  39   encoding into valid UTF-8.
  40 * Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding
  41   Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
  42 * Does the above in streaming (input and output split across multiple
  43   buffers) and non-streaming (whole input in a single buffer and whole
  44   output in a single buffer) variants.
  45 * Avoids copying (borrows) when possible in the non-streaming cases when
  46   decoding to or encoding from UTF-8.
  47 * Resolves textual labels that identify character encodings in
  48   protocol text into type-safe objects representing those encodings
  49   conceptually.
  50 * Maps the type-safe encoding objects onto strings suitable for
  51   returning from `document.characterSet`.
  52 * Validates UTF-8 (in common instruction set scenarios a bit faster for Web
  53   workloads than the standard library; hopefully will get upstreamed some
  54   day) and ASCII.
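
The lone-surrogate behavior described in the UTF-16 encode bullet above can be illustrated with the standard library alone. The following is a minimal sketch, not the crate's implementation, and the function name `utf16_to_string_lossy` is hypothetical. It turns potentially-invalid UTF-16 into a valid `String`, substituting U+FFFD for each unpaired surrogate, which is the input the encoders behave as if they had received.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical helper (not part of encoding_rs): converts
/// potentially-invalid UTF-16 into a valid `String`, replacing every
/// lone surrogate with U+FFFD REPLACEMENT CHARACTER.
pub fn utf16_to_string_lossy(units: &[u16]) -> String {
    decode_utf16(units.iter().copied())
        // `decode_utf16` yields `Err` for each unpaired surrogate.
        .map(|result| result.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}
```

A well-formed surrogate pair passes through unchanged, while a high surrogate that is not followed by a low surrogate, or a low surrogate on its own, becomes a single U+FFFD.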
  55
  56 Additionally, `encoding_rs::mem` does the following:
  57
  58 * Checks if a byte buffer contains only ASCII.
  59 * Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
  60 * Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
  61   buffer contains only Latin1 code points (below U+0100).
  62 * Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
  63   buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior
  64   (suitable for checking if the Unicode Bidirectional Algorithm can be optimized
  65   out).
  66 * Combined versions of the above two checks.
  67 * Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
  68 * Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
  69 * Converts UTF-8 and UTF-16 to Latin1 (if in range).
  70 * Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
  71 * Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
  72 * Copies ASCII from one buffer to another up to the first non-ASCII byte.
  73 * Converts ASCII to UTF-16 up to the first non-ASCII byte.
  74 * Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.
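
To make two of the operations listed above concrete, here is a std-only sketch of a Basic Latin check over UTF-16 and of a Latin1-to-UTF-8 conversion that borrows when it can. The names `basic_latin_only` and `latin1_to_str` are hypothetical, and the bodies are plain illustrations rather than the optimized code in `encoding_rs::mem`. "Latin1" here means that each byte value equals the code point it represents.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in: true if every UTF-16 code unit is below U+0080.
pub fn basic_latin_only(units: &[u16]) -> bool {
    units.iter().all(|&unit| unit < 0x80)
}

/// Hypothetical stand-in: converts Latin1 bytes to UTF-8 text.
/// ASCII-only input is already valid UTF-8, so it is borrowed;
/// anything else is copied into a new `String`.
pub fn latin1_to_str(bytes: &[u8]) -> Cow<'_, str> {
    if bytes.is_ascii() {
        Cow::Borrowed(std::str::from_utf8(bytes).expect("ASCII is valid UTF-8"))
    } else {
        // Each Latin1 byte maps to the code point with the same value.
        Cow::Owned(bytes.iter().map(|&byte| char::from(byte)).collect())
    }
}
```

Returning `Cow` lets the common ASCII-only case avoid an allocation, which is the same idea as the borrowing behavior of the non-streaming API described earlier.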
  75
  76 ## Integration with `std::io`
  77
  78 Notably, the above feature list doesn't include the capability to wrap
  79 a `std::io::Read`, decode it into UTF-8 and present the result via
  80 `std::io::Read`. The [`encoding_rs_io`](https://crates.io/crates/encoding_rs_io)
  81 crate provides that capability.
  82
  83 ## `no_std` Environment
  84
  85 The crate works in a `no_std` environment. By default, the `alloc` feature,
  86 which assumes that an allocator is present, is enabled. For a no-allocator
  87 environment, the default features (i.e. `alloc`) can be turned off. This
  88 makes the part of the API that returns `Vec`/`String`/`Cow` unavailable.
  89
  90 ## Decoding Email
  91
  92 For decoding character encodings that occur in email, use the
  93 [`charset`](https://crates.io/crates/charset) crate instead of using this
  94 one directly. (It wraps this crate and adds UTF-7 decoding.)
  95
  96 ## Windows Code Page Identifier Mappings
  97
  98 For mappings to and from Windows code page identifiers, use the
  99 [`codepage`](https://crates.io/crates/codepage) crate.
 100
 101 ## DOS Encodings
 102
 103 This crate does not support single-byte DOS encodings that aren't required by
 104 the Web Platform, but the [`oem_cp`](https://crates.io/crates/oem_cp) crate does.
 105
 106 ## Preparing Text for the Encoders
 107
 108 Normalizing text into Unicode Normalization Form C prior to encoding text into
 109 a legacy encoding minimizes unmappable characters. Text can be normalized to
 110 Unicode Normalization Form C using the
 111 [`unic-normal`](https://crates.io/crates/unic-normal) crate.
 112
 113 The exception is windows-1258, which after normalizing to Unicode Normalization
 114 Form C requires tone marks to be decomposed in order to minimize unmappable
 115 characters. Vietnamese tone marks can be decomposed using the
 116 [`detone`](https://crates.io/crates/detone) crate.
 117
 118 ## Licensing
 119
 120 TL;DR: `(Apache-2.0 OR MIT) AND BSD-3-Clause` for the code and data combination.
 121
 122 Please see the file named
 123 [COPYRIGHT](https://github.com/hsivonen/encoding_rs/blob/master/COPYRIGHT).
 124
 125 The non-test code that isn't generated from the WHATWG data in this crate is
 126 under Apache-2.0 OR MIT. Test code is under CC0.
 127
 128 This crate contains code/data generated from WHATWG-supplied data. The WHATWG
 129 upstream changed its license for portions of specs incorporated into source code
 130 from CC0 to BSD-3-Clause between the initial release of this crate and the present
 131 version of this crate. The in-source licensing legends have been updated for the
 132 parts of the generated code that have changed since the upstream license change.
 133
 134 ## Documentation
 135
 136 Generated [API documentation](https://docs.rs/encoding_rs/) is available
 137 online.
 138
 139 There is a [long-form write-up](https://hsivonen.fi/encoding_rs/) about the
 140 design and internals of the crate.
 141
 142 ## C and C++ bindings
 143
 144 An FFI layer for encoding_rs is available as a
 145 [separate crate](https://github.com/hsivonen/encoding_c). The crate comes
 146 with a [demo C++ wrapper](https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h)
 147 using the C++ standard library and [GSL](https://github.com/Microsoft/GSL/) types.
 148
 149 The bindings for the `mem` module are in the
 150 [encoding_c_mem crate](https://github.com/hsivonen/encoding_c_mem).
 151
 152 For the Gecko context, there's a
 153 [C++ wrapper using the MFBT/XPCOM types](https://searchfox.org/mozilla-central/source/intl/Encoding.h#100).
 154
 155 There's a [write-up](https://hsivonen.fi/modern-cpp-in-rust/) about the C++
 156 wrappers.
 157
 158 ## Sample programs
 159
 160 * [Rust](https://github.com/hsivonen/recode_rs)
 161 * [C](https://github.com/hsivonen/recode_c)
 162 * [C++](https://github.com/hsivonen/recode_cpp)
 163
 164 ## Optional features
 165
 166 There are currently these optional cargo features:
 167
 168 ### `simd-accel`
 169
 170 Enables SIMD acceleration using the nightly-dependent `packed_simd_2` crate.
 171
 172 This is an opt-in feature, because enabling this feature _opts out_ of Rust's
 173 guarantees of future compilers compiling old code (aka. "stability story").
 174
 175 Currently, this has not been tested to be an improvement except for these
 176 targets:
 177
 178 * x86_64
 179 * i686
 180 * aarch64
 181 * thumbv7neon
 182
 183 If you use nightly Rust, you use targets whose first component is one of the
 184 above, and you are prepared _to have to revise your configuration when updating
 185 Rust_, you should enable this feature. Otherwise, please _do not_ enable this
 186 feature.
 187
 188 _Note!_ If you are compiling for a target that does not have 128-bit SIMD
 189 enabled as part of the target definition and you are enabling 128-bit SIMD
 190 using `-C target_feature`, you need to enable the `core_arch` Cargo feature
 191 for `packed_simd_2` to compile a crates.io snapshot of `core_arch` instead of
 192 using the standard-library copy of `core::arch`, because the `core::arch`
 193 module of the pre-compiled standard library has been compiled with the
 194 assumption that the CPU doesn't have 128-bit SIMD. At present this applies
 195 mainly to 32-bit ARM targets whose first component does not include the
 196 substring `neon`.
 197
 198 The encoding_rs side of things has not been properly set up for POWER,
 199 PowerPC, MIPS, etc., SIMD at this time, so even if you were to follow
 200 the advice from the previous paragraph, you probably shouldn't use
 201 the `simd-accel` option on the less mainstream architectures at this
 202 time.
 203
 204 Used by Firefox.
 205
 206 ### `serde`
 207
 208 Enables support for serializing and deserializing `&'static Encoding`-typed
 209 struct fields using [Serde][1].
 210
 211 [1]: https://serde.rs/
 212
 213 Not used by Firefox.
 214
 215 ### `fast-legacy-encode`
 216
 217 A catch-all option for enabling the fastest legacy encode options. _Does not
 218 affect decode speed or UTF-8 encode speed._
 219
 220 At present, this option is equivalent to enabling the following options:
 221  * `fast-hangul-encode`
 222  * `fast-hanja-encode`
 223  * `fast-kanji-encode`
 224  * `fast-gb-hanzi-encode`
 225  * `fast-big5-hanzi-encode`
 226
 227 Adds 176 KB to the binary size.
 228
 229 Not used by Firefox.
 230
 231 ### `fast-hangul-encode`
 232
 233 Changes encoding of precomposed Hangul syllables into EUC-KR from binary
 234 search over the decode-optimized tables to lookup by index, making Korean
 235 plain-text encode about 4 times as fast as without this option.
 236
 237 Adds 20 KB to the binary size.
 238
 239 Does _not_ affect decode speed.
 240
 241 Not used by Firefox.
 242
 243 ### `fast-hanja-encode`
 244
 245 Changes encoding of Hanja into EUC-KR from linear search over the
 246 decode-optimized table to lookup by index. Since Hanja is practically absent
 247 in modern Korean text, this option doesn't affect performance in the common
 248 case and mainly makes sense if you want to make your application resilient
 249 against denial of service by someone intentionally feeding it a lot of Hanja
 250 to encode into EUC-KR.
 251
 252 Adds 40 KB to the binary size.
 253
 254 Does _not_ affect decode speed.
 255
 256 Not used by Firefox.
 257
 258 ### `fast-kanji-encode`
 259
 260 Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear
 261 search over the decode-optimized tables to lookup by index making Japanese
 262 plain-text encode to legacy encodings 30 to 50 times as fast as without this
 263 option (about 2 times as fast as with `less-slow-kanji-encode`).
 264
 265 Takes precedence over `less-slow-kanji-encode`.
 266
 267 Adds 36 KB to the binary size (24 KB compared to `less-slow-kanji-encode`).
 268
 269 Does _not_ affect decode speed.
 270
 271 Not used by Firefox.
 272
 273 ### `less-slow-kanji-encode`
 274
 275 Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and
 276 ISO-2022-JP) encode less slow (binary search instead of linear search) making
 277 Japanese plain-text encode to legacy encodings 14 to 23 times as fast as
 278 without this option.
 279
 280 Adds 12 KB to the binary size.
 281
 282 Does _not_ affect decode speed.
 283
 284 Not used by Firefox.
 285
 286 ### `fast-gb-hanzi-encode`
 287
 288 Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and
 289 gb18030 from linear search over a part of the decode-optimized tables followed
 290 by a binary search over another part of the decode-optimized tables to lookup
 291 by index making Simplified Chinese plain-text encode to the legacy encodings
 292 100 to 110 times as fast as without this option (about 2.5 times as fast as
 293 with `less-slow-gb-hanzi-encode`).
 294
 295 Takes precedence over `less-slow-gb-hanzi-encode`.
 296
 297 Adds 36 KB to the binary size (24 KB compared to `less-slow-gb-hanzi-encode`).
 298
 299 Does _not_ affect decode speed.
 300
 301 Not used by Firefox.
 302
 303 ### `less-slow-gb-hanzi-encode`
 304
 305 Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode
 306 less slow (binary search instead of linear search) making Simplified Chinese
 307 plain-text encode to the legacy encodings about 40 times as fast as without
 308 this option.
 309
 310 Adds 12 KB to the binary size.
 311
 312 Does _not_ affect decode speed.
 313
 314 Not used by Firefox.
 315
 316 ### `fast-big5-hanzi-encode`
 317
 318 Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from
 319 linear search over a part of the decode-optimized tables to lookup by index
 320 making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast
 321 as without this option (about 3 times as fast as with
 322 `less-slow-big5-hanzi-encode`).
 323
 324 Takes precedence over `less-slow-big5-hanzi-encode`.
 325
 326 Adds 40 KB to the binary size (20 KB compared to `less-slow-big5-hanzi-encode`).
 327
 328 Does _not_ affect decode speed.
 329
 330 Not used by Firefox.
 331
 332 ### `less-slow-big5-hanzi-encode`
 333
 334 Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow
 335 (binary search instead of linear search) making Traditional Chinese
 336 plain-text encode to Big5 about 36 times as fast as without this option.
 337
 338 Adds 20 KB to the binary size.
 339
 340 Does _not_ affect decode speed.
 341
 342 Not used by Firefox.
 343
 344 ## Performance goals
 345
 346 For decoding to UTF-16, the goal is to perform at least as well as Gecko's old
 347 uconv. For decoding to UTF-8, the goal is to perform at least as well as
 348 rust-encoding. These goals have been achieved.
 349
 350 Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent
 351 to `memcpy` and UTF-16 to UTF-8 should be fast.)
 352
 353 Speed is a non-goal when encoding to legacy encodings. By default, encoding to
 354 legacy encodings should not be optimized for speed at the expense of code size
 355 as long as form submission and URL parsing in Gecko don't become noticeably
 356 too slow in real-world use.
 357
 358 In the interest of binary size, by default, encoding_rs does not have
 359 encode-specific data tables beyond 32 bits of encode-specific data for each
 360 single-byte encoding. Therefore, encoders search the decode-optimized data
 361 tables. This is a linear search in most cases. As a result, by default, encode
 362 to legacy encodings varies from slow to extremely slow relative to other
 363 libraries. Still, with realistic workloads, this seemed fast enough not to be
 364 user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing)
 365 in the Web-exposed encoder use cases.
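
The trade-off described above can be sketched with a toy table. Everything in this example is an assumption made for illustration: the five-entry `DECODE_TABLE`, the function names, and the sorted side table are hypothetical and much simpler than the crate's real data. The sketch only contrasts encoding by linear search over a decode-ordered table with encoding through an extra, encode-ordered table that allows binary search.

```rust
/// Hypothetical decode-optimized table: the position is the pointer in
/// the legacy encoding and the value is the code point it decodes to.
const DECODE_TABLE: [char; 5] = ['\u{8A9E}', '\u{65E5}', '\u{672C}', '\u{5B57}', '\u{6F22}'];

/// Default-style encode: no extra data, linear search over the decode table.
pub fn encode_linear(c: char) -> Option<usize> {
    DECODE_TABLE.iter().position(|&entry| entry == c)
}

/// Builds an encode-specific table sorted by code point. This costs
/// additional memory, which mirrors the binary size cost of the
/// optional features.
pub fn build_encode_table() -> Vec<(char, usize)> {
    let mut table: Vec<(char, usize)> = DECODE_TABLE
        .iter()
        .enumerate()
        .map(|(pointer, &c)| (c, pointer))
        .collect();
    table.sort_unstable();
    table
}

/// Faster encode: binary search over the sorted encode table.
pub fn encode_binary(table: &[(char, usize)], c: char) -> Option<usize> {
    table
        .binary_search_by_key(&c, |&(entry, _)| entry)
        .ok()
        .map(|index| table[index].1)
}
```

Both functions return the same pointer for every mappable character. The difference is the search cost per character and the memory spent on the second table.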
 366
 367 See the cargo features above for optionally making CJK legacy encode fast.
 368
 369 A framework for measuring performance is [available separately][2].
 370
 371 [2]: https://github.com/hsivonen/encoding_bench/
 372
 373 ## Rust Version Compatibility
 374
 375 It is a goal to support the latest stable Rust, the latest nightly Rust and
 376 the version of Rust that's used for Firefox Nightly.
 377
 378 At this time, there is no firm commitment to support a version older than
 379 what's required by Firefox, and there is no commitment to treat MSRV changes
 380 as semver-breaking, because this crate depends on `cfg-if`, which doesn't
 381 appear to treat MSRV changes as semver-breaking, so it would be useless for
 382 this crate to treat MSRV changes as semver-breaking.
 383
 384 As of 2021-02-04, MSRV appears to be Rust 1.36.0 for using the crate and
 385 1.42.0 for doc tests to pass without errors about the global allocator.
 386
 387 ## Compatibility with rust-encoding
 388
 389 A compatibility layer that implements the rust-encoding API on top of
 390 encoding_rs is
 391 [provided as a separate crate](https://github.com/hsivonen/encoding_rs_compat)
 392 (cannot be uploaded to crates.io). The compatibility layer was originally
 393 written with the assumption that Firefox would need it, but it is not currently
 394 used in Firefox.
 395
 396 ## Regenerating Generated Code
 397
 398 To regenerate the generated code:
 399
 400  * Have Python 2 installed.
 401  * Clone [`https://github.com/hsivonen/encoding_c`](https://github.com/hsivonen/encoding_c)
 402    next to the `encoding_rs` directory.
 403  * Clone [`https://github.com/hsivonen/codepage`](https://github.com/hsivonen/codepage)
 404    next to the `encoding_rs` directory.
 405  * Clone [`https://github.com/whatwg/encoding`](https://github.com/whatwg/encoding)
 406    next to the `encoding_rs` directory.
 407  * Checkout revision `be3337450e7df1c49dca7872153c4c4670dd8256` of the `encoding` repo.
 408    (Note: `f381389` was the revision of `encoding` used from before the `encoding` repo
 409    license change. So far, only output changed since then has been updated to
 410    the new license legend.)
 411  * With the `encoding_rs` directory as the working directory, run
 412    `python generate-encoding-data.py`.
 413
 414 ## Roadmap
 415
 416 - [x] Design the low-level API.
 417 - [x] Provide Rust-only convenience features.
 418 - [x] Provide an stl/gsl-flavored C++ API.
 419 - [x] Implement all decoders and encoders.
 420 - [x] Add unit tests for all decoders and encoders.
 421 - [x] Finish BOM sniffing variants in Rust-only convenience features.
 422 - [x] Document the API.
 423 - [x] Publish the crate on crates.io.
 424 - [x] Create a solution for measuring performance.
 425 - [x] Accelerate ASCII conversions using SSE2 on x86.
 426 - [x] Accelerate ASCII conversions using ALU register-sized operations on
 427       non-x86 architectures (process an `usize` instead of `u8` at a time).
 428 - [x] Split FFI into a separate crate so that the FFI doesn't interfere with
 429       LTO in pure-Rust usage.
 430 - [x] Compress CJK indices by making use of sequential code points as well
 431       as Unicode-ordered parts of indices.
 432 - [x] Make lookups by label or name use binary search that searches from the
 433       end of the label/name to the start.
 434 - [x] Make labels with non-ASCII bytes fail fast.
 435 - [ ] ~Parallelize UTF-8 validation using [Rayon](https://github.com/nikomatsakis/rayon).~
 436       (This turned out to be a pessimization in the ASCII case due to memory bandwidth reasons.)
 437 - [x] Provide an XPCOM/MFBT-flavored C++ API.
 438 - [x] Investigate accelerating single-byte encode with a single fast-tracked
 439       range per encoding.
 440 - [x] Replace uconv with encoding_rs in Gecko.
 441 - [x] Implement the rust-encoding API in terms of encoding_rs.
 442 - [x] Add SIMD acceleration for Aarch64.
 443 - [x] Investigate the use of NEON on 32-bit ARM.
 444 - [ ] ~Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as
 445       adapted to Rust in rust-encoding.~
 446 - [x] Add actually fast CJK encode options.
 447 - [ ] ~Investigate [Bob Steagall's lookup table acceleration for UTF-8](https://github.com/BobSteagall/CppNow2018/blob/master/FastConversionFromUTF-8/Fast%20Conversion%20From%20UTF-8%20with%20C%2B%2B%2C%20DFAs%2C%20and%20SSE%20Intrinsics%20-%20Bob%20Steagall%20-%20C%2B%2BNow%202018.pdf).~
 448 - [x] Provide a build mode that works without `alloc` (with lesser API surface).
 449 - [ ] Migrate to `std::simd` once it is stable and declare 1.0.
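
The roadmap item about accelerating ASCII handling with ALU register-sized operations (processing a `usize` instead of a `u8` at a time) can be illustrated as follows. This is a simplified sketch with a hypothetical function name, `is_ascii_wordwise`. It ignores the pointer alignment handling and loop unrolling that a production implementation would care about.

```rust
use std::convert::TryInto;
use std::mem::size_of;

/// A word with the high bit of every byte set (0x8080...80).
const HIGH_BITS: usize = usize::MAX / 0xFF * 0x80;

/// Hypothetical sketch: checks for ASCII one machine word at a time.
/// A byte is non-ASCII exactly when its high bit is set, so a single
/// mask test covers `size_of::<usize>()` bytes.
pub fn is_ascii_wordwise(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(size_of::<usize>());
    for chunk in &mut chunks {
        let word = usize::from_ne_bytes(chunk.try_into().expect("chunk is word-sized"));
        if (word & HIGH_BITS) != 0 {
            return false;
        }
    }
    // Check the tail that did not fill a whole word byte by byte.
    chunks.remainder().iter().all(|&byte| byte < 0x80)
}
```

Native endianness is sufficient here because the test only asks whether any high bit is set, not which byte it belongs to.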
 450
 451 ## Release Notes
 452
 453 ### 0.8.31
 454
 455 * Use SPDX with parentheses now that crates.io supports parentheses.
 456
 457 ### 0.8.30
 458
 459 * Update the licensing information to take into account the WHATWG data license change.
 460
 461 ### 0.8.29
 462
 463 * Make the parts that use an allocator optional.
 464
 465 ### 0.8.28
 466
 467 * Fix error in Serde support introduced as part of `no_std` support.
 468
 469 ### 0.8.27
 470
 471 * Make the crate work in a `no_std` environment (with `alloc`).
 472
 473 ### 0.8.26
 474
 475 * Fix oversights in edition 2018 migration that broke the `simd-accel` feature.
 476
 477 ### 0.8.25
 478
 479 * Do pointer alignment checks in a way where intermediate steps aren't defined to be Undefined Behavior.
 480 * Update the `packed_simd` dependency to `packed_simd_2`.
 481 * Update the `cfg-if` dependency to 1.0.
 482 * Address warnings that have been introduced by newer Rust versions along the way.
 483 * Update to edition 2018, since even prior to 1.0 `cfg-if` updated to edition 2018 without a semver break.
 484
 485 ### 0.8.24
 486
 487 * Avoid computing an intermediate (not dereferenced) pointer value in a manner designated as Undefined Behavior when computing pointer alignment.
 488
 489 ### 0.8.23
 490
 491 * Remove year from copyright notices. (No features or bug fixes.)
 492
 493 ### 0.8.22
 494
 495 * Formatting fix and new unit test. (No features or bug fixes.)
 496
 497 ### 0.8.21
 498
 499 * Fixed a panic with invalid UTF-16[BE|LE] input at the end of the stream.
 500
 501 ### 0.8.20
 502
 503 * Make `Decoder::latin1_byte_compatible_up_to` return `None` in more
 504   cases to make the method actually useful. While this could be argued
 505   to be a breaking change due to the bug fix changing semantics, it does
 506   not break callers that had to handle the `None` case in a reasonable
 507   way anyway.
 508
 509 ### 0.8.19
 510
 511 * Removed a bunch of bound checks in `convert_str_to_utf16`.
 512 * Added `mem::convert_utf8_to_utf16_without_replacement`.
 513
 514 ### 0.8.18
 515
 516 * Added `mem::utf8_latin1_up_to` and `mem::str_latin1_up_to`.
 517 * Added `Decoder::latin1_byte_compatible_up_to`.
 518
 519 ### 0.8.17
 520
 521 * Update `bincode` (dev dependency) version requirement to 1.0.
 522
 523 ### 0.8.16
 524
 525 * Switch from the `simd` crate to `packed_simd`.
 526
 527 ### 0.8.15
 528
 529 * Adjust documentation for `simd-accel` (README-only release).
 530
 531 ### 0.8.14
 532
 533 * Made UTF-16 to UTF-8 encode conversion fill the output buffer as
 534   closely as possible.
 535
 536 ### 0.8.13
 537
 538 * Made the UTF-8 to UTF-16 decoder compare the number of code units written
 539   with the length of the right slice (the output slice) to fix a panic
 540   introduced in 0.8.11.
 541
 542 ### 0.8.12
 543
 544 * Removed the `clippy::` prefix from clippy lint names.
 545
 546 ### 0.8.11
 547
 548 * Changed minimum Rust requirement to 1.29.0 (for the ability to refer
 549   to the interior of a `static` when defining another `static`).
 550 * Explicitly aligned the lookup tables for single-byte encodings and
 551   UTF-8 to cache lines in the hope of freeing up one cache line for
 552   other data. (Perhaps the tables were already aligned and this is
 553   placebo.)
 554 * Added 32 bits of encode-oriented data for each single-byte encoding.
 555   The change was performance-neutral for non-Latin1-ish Latin legacy
 556   encodings, improved Latin1-ish and Arabic legacy encode speed
 557   somewhat (new speed is 2.4x the old speed for German, 2.3x for
 558   Arabic, 1.7x for Portuguese and 1.4x for French) and improved
 559   non-Latin1, non-Arabic legacy single-byte encode a lot (7.2x for
 560   Thai, 6x for Greek, 5x for Russian, 4x for Hebrew).
 561 * Added compile-time options for fast CJK legacy encode options (at
 562   the cost of binary size (up to 176 KB) and run-time memory usage).
 563   These options still retain the overall code structure instead of
 564   rewriting the CJK encoders totally, so the speed isn't as good as
 565   what could be achieved by using even more memory / making the
 566   binary even larger.
 567 * Made UTF-8 decode and validation faster.
 568 * Added method `is_single_byte()` on `Encoding`.
 569 * Added `mem::decode_latin1()` and `mem::encode_latin1_lossy()`.
 570
 571 ### 0.8.10
 572
 573 * Disabled a unit test that tests a panic condition when the assertion
 574   being tested is disabled.
 575
 576 ### 0.8.9
 577
 578 * Made `--features simd-accel` work with stable-channel compiler to
 579   simplify the Firefox build system.
 580
 581 ### 0.8.8
 582
 583 * Made the `is_foo_bidi()` functions not treat U+FEFF (ZERO WIDTH NO-BREAK SPACE
 584   aka. BYTE ORDER MARK) as right-to-left.
 585 * Made the `is_foo_bidi()` functions report `true` if the input contains
 586   Hebrew presentation forms (which are right-to-left but not in a
 587   right-to-left-roadmapped block).
 588
 589 ### 0.8.7
 590
 591 * Fixed a panic in the UTF-16LE/UTF-16BE decoder when decoding to UTF-8.
 592
 593 ### 0.8.6
 594
 595 * Temporarily removed the debug assertion added in version 0.8.5 from
 596   `convert_utf16_to_latin1_lossy`.
 597
 598 ### 0.8.5
 599
 600 * If debug assertions are enabled but fuzzing isn't enabled, lossy conversions
 601   to Latin1 in the `mem` module assert that the input is in the range
 602   U+0000...U+00FF (inclusive).
 603 * In the `mem` module provide conversions from Latin1 and UTF-16 to UTF-8
 604   that can deal with insufficient output space. The idea is to use them
 605   first with an allocation rounded up to jemalloc bucket size and do the
 606   worst-case allocation only if the jemalloc rounding up was insufficient
 607   as the first guess.
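
The two-step allocation strategy described in the second item above can be sketched as follows for the Latin1 to UTF-8 case. This is an illustration under stated assumptions, not the crate's code: the function names are hypothetical, and `next_power_of_two` merely stands in for rounding up to an allocator bucket size.

```rust
/// Hypothetical sketch: converts as much Latin1 as fits into `dst` and
/// returns (bytes read, bytes written).
pub fn latin1_to_utf8_partial(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
    let mut read = 0;
    let mut written = 0;
    for &byte in src {
        if byte < 0x80 {
            if written == dst.len() {
                break;
            }
            dst[written] = byte;
            written += 1;
        } else {
            // U+0080..=U+00FF takes two bytes in UTF-8.
            if dst.len() - written < 2 {
                break;
            }
            dst[written] = 0xC0 | (byte >> 6);
            dst[written + 1] = 0x80 | (byte & 0x3F);
            written += 2;
        }
        read += 1;
    }
    (read, written)
}

/// First guess: a buffer rounded up from the input length. Only if that
/// is insufficient, grow to the worst case (two bytes per remaining
/// input byte) and convert the rest.
pub fn latin1_to_string(src: &[u8]) -> String {
    let mut buf = vec![0u8; src.len().next_power_of_two()];
    let (read, mut written) = latin1_to_utf8_partial(src, &mut buf);
    if read < src.len() {
        buf.resize(written + (src.len() - read) * 2, 0);
        let (_, more) = latin1_to_utf8_partial(&src[read..], &mut buf[written..]);
        written += more;
    }
    buf.truncate(written);
    String::from_utf8(buf).expect("only well-formed UTF-8 is written")
}
```

For mostly-ASCII input the first guess is usually enough, so the worst-case allocation is only paid for when the text really is dense in non-ASCII characters.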
 608
 609 ### 0.8.4
 610
 611 * Fix SSE2-specific, `simd-accel`-specific memory corruption introduced in
 612   version 0.8.1 in conversions between UTF-16 and Latin1 in the `mem` module.
 613
 614 ### 0.8.3
 615
 616 * Removed an `#[inline(never)]` annotation that was not meant for release.
 617
 618 ### 0.8.2
 619
 620 * Made non-ASCII UTF-16 to UTF-8 encode faster by manually omitting bound
 621   checks and manually adding branch prediction annotations.
 622
 623 ### 0.8.1
 624
 625 * Tweaked loop unrolling and memory alignment for SSE2 conversions between
 626   UTF-16 and Latin1 in the `mem` module to increase the performance when
 627   converting long buffers.
 628
 629 ### 0.8.0
 630
 631 * Changed the minimum supported version of Rust to 1.21.0 (semver breaking
 632   change).
 633 * Flipped around the defaults vs. optional features for controlling the size
 634   vs. speed trade-off for Kanji and Hanzi legacy encode (semver breaking
 635   change).
 636 * Added NEON support on ARMv7.
 637 * SIMD-accelerated x-user-defined to UTF-16 decode.
 638 * Made UTF-16LE and UTF-16BE decode a lot faster (including SIMD
 639   acceleration).
 640
 641 ### 0.7.2
 642
 643 * Add the `mem` module.
 644 * Refactor SIMD code which can affect performance outside the `mem`
 645   module.
 646
 647 ### 0.7.1
 648
 649 * When encoding from invalid UTF-16, correctly handle U+DC00 followed by
 650   another low surrogate.
 651
 652 ### 0.7.0
 653
 654 * [Make `replacement` a label of the replacement
 655   encoding.](https://github.com/whatwg/encoding/issues/70) (Spec change.)
 656 * Remove `Encoding::for_name()`. (`Encoding::for_label(foo).unwrap()` is
 657   now close enough after the above label change.)
 658 * Remove the `parallel-utf8` cargo feature.
 659 * Add optional Serde support for `&'static Encoding`.
 660 * Performance tweaks for ASCII handling.
 661 * Performance tweaks for UTF-8 validation.
 662 * SIMD support on aarch64.
 663
 664 ### 0.6.11
 665
 666 * Make `Encoder::has_pending_state()` public.
 667 * Update the `simd` crate dependency to 0.2.0.
 668
 669 ### 0.6.10
 670
 671 * Reserve enough space for NCRs when encoding to ISO-2022-JP.
 672 * Correct max length calculations for multibyte decoders.
 673 * Correct max length calculations before BOM sniffing has been
 674   performed.
 675 * Correctly calculate max length when encoding from UTF-16 to GBK.
 676
 677 ### 0.6.9
 678
 679 * [Don't prepend anything when gb18030 range decode
 680   fails](https://github.com/whatwg/encoding/issues/110). (Spec change.)
 681
 682 ### 0.6.8
 683
 684 * Correctly handle the case where the first buffer contains a potentially
 685   partial BOM and the next buffer is the last buffer.
 686 * Decode byte `7F` correctly in ISO-2022-JP.
 687 * Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
 688 * Implement `Hash` for `Encoding`.
 689
 690 ### 0.6.7
 691
 692 * [Map half-width katakana to full-width katakana in ISO-2022-JP
 693   encoder](https://github.com/whatwg/encoding/issues/105). (Spec change.)
 694 * Give `InputEmpty` correct precedence over `OutputFull` when encoding
 695   with replacement and the output buffer passed in is too short or the
 696   remaining space in the output buffer is too small after a replacement.
 697
 698 ### 0.6.6
 699
 700 * Correct max length calculation when a partial BOM prefix is part of
 701   the decoder's state.
 702
 703 ### 0.6.5
 704
 705 * Correct max length calculation in various encoders.
 706 * Correct max length calculation in the UTF-16 decoder.
 707 * Derive `PartialEq` and `Eq` for the `CoderResult`, `DecoderResult`
 708   and `EncoderResult` types.
 709
 710 ### 0.6.4
 711
 712 * Avoid panic when encoding with replacement and the destination buffer is
 713   too short to hold one numeric character reference.
 714
 715 ### 0.6.3
 716
 717 * Add support for 32-bit big-endian hosts. (For real this time.)
 718
 719 ### 0.6.2
 720
 721 * Fix a panic from subslicing with bad indices in
 722   `Encoder::encode_from_utf16`. (Due to an oversight, it lacked the fix that
 723   `Encoder::encode_from_utf8` already had.)
 724 * Micro-optimize error status accumulation in non-streaming case.
 725
 726 ### 0.6.1
 727
 728 * Avoid panic near integer overflow in a case that's unlikely to actually
 729   happen.
 730 * Address Clippy lints.
 731
 732 ### 0.6.0
 733
 734 * Make the methods for computing worst-case buffer size requirements check
 735   for integer overflow.
 736 * Upgrade rayon to 0.7.0.
 737
 738 ### 0.5.1
 739
 740 * Reorder methods for better documentation readability.
 741 * Add support for big-endian hosts. (Only 64-bit case actually tested.)
 742 * Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.
 743
 744 ### 0.5.0
 745
 746 * Avoid allocating excessively long buffers in non-streaming decode.
 747 * Fix the behavior of ISO-2022-JP and replacement decoders near the end of the
 748   output buffer.
 749 * Annotate the result structs with `#[must_use]`.
 750
 751 ### 0.4.0
 752
 753 * Split FFI into a separate crate.
 754 * Performance tweaks.
 755 * CJK binary size and encoding performance changes.
 756 * Parallelize UTF-8 validation in the case of long buffers (with optional
 757   feature `parallel-utf8`).
 758 * Borrow even with ISO-2022-JP when possible.
 759
 760 ### 0.3.2
 761
 762 * Fix moving pointers to alignment in ALU-based ASCII acceleration.
 763 * Fix errors in documentation and improve documentation.
 764
 765 ### 0.3.1
 766
 767 * Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
 768 * Make UTF-8 to UTF-8 decode SSE2-accelerated when feature `simd-accel` is used.
 769 * When decoding and encoding ASCII-only input from or to an ASCII-compatible
 770   encoding using the non-streaming API, return a borrow of the input.
 771 * Make encode from UTF-16 to UTF-8 faster.
 772
 773 ### 0.3
 774
 775 * Change the references to the instances of `Encoding` from `const` to `static`
 776   to make the referents unique across crates that use the references.
 777 * Introduce non-reference-typed `FOO_INIT` instances of `Encoding` to allow
 778   foreign crates to initialize `static` arrays with references to `Encoding`
 779   instances even under Rust's constraints that prohibit the initialization of
 780   `&'static Encoding`-typed array items with `&'static Encoding`-typed
 781   `statics`.
 782 * Document that the above two points will be reverted if Rust changes `const`
 783   to work so that cross-crate usage keeps the referents unique.
 784 * Return `Cow`s from Rust-only non-streaming methods for encode and decode.
 785 * `Encoding::for_bom()` returns the length of the BOM.
 786 * ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE,
 787   ISO-2022-JP and x-user-defined.
 788 * Add SSE2 acceleration behind the `simd-accel` feature flag. (Requires
 789   nightly Rust.)
 790 * Fix panic with long bogus labels.
 791 * Map [0xCA to U+05BA in windows-1255](https://github.com/whatwg/encoding/issues/73).
 792   (Spec change.)
 793 * Correct the [end of the Shift_JIS EUDC range](https://github.com/whatwg/encoding/issues/53).
 794   (Spec change.)
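
The `Encoding::for_bom()` item above says that the BOM length is returned along with the encoding. A std-only sketch of that idea is shown below. The function `sniff_bom` is hypothetical and returns an encoding name as a string to stay self-contained, whereas the crate's method works with `Encoding` objects.

```rust
/// Hypothetical sketch of BOM sniffing: returns the name of the encoding
/// indicated by a leading byte order mark together with the BOM length
/// in bytes, or `None` if the input does not start with a BOM.
pub fn sniff_bom(bytes: &[u8]) -> Option<(&'static str, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(("UTF-8", 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(("UTF-16LE", 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(("UTF-16BE", 2))
    } else {
        None
    }
}
```

Having the length available lets a caller skip the BOM with a plain slice such as `&input[len..]` before handing the rest of the buffer to a decoder.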
 795
 796 ### 0.2.4
 797
 798 * Polish FFI documentation.
 799
 800 ### 0.2.3
 801
 802 * Fix UTF-16 to UTF-8 encode.
 803
 804 ### 0.2.2
 805
 806 * Add `Encoder.encode_from_utf8_to_vec_without_replacement()`.
 807
 808 ### 0.2.1
 809
 810 * Add `Encoding.is_ascii_compatible()`.
 811
 812 * Add `Encoding::for_bom()`.
 813
 814 * Make `==` for `Encoding` use name comparison instead of pointer comparison,
 815   because uses of the encoding constants in different crates result in
 816   different addresses and the constants cannot be turned into statics without
 817   breaking other things.
 818
 819 ### 0.2.0
 820
 821 The initial release.