String Guide
============

Most of the Mozilla code uses a C++ class hierarchy to pass string data,
rather than using raw pointers. This guide documents the string classes which
are visible to code within the Mozilla codebase (code which is linked into
``libxul``).

Introduction
------------

The string classes are a library of C++ classes which are used to manage
buffers of wide (16-bit) and narrow (8-bit) character strings. The headers
and implementation are in the `xpcom/string
<https://searchfox.org/mozilla-central/source/xpcom/string>`_ directory. All
strings are stored as a single contiguous buffer of characters.

The 8-bit and 16-bit string classes have completely separate base classes,
but share the same APIs. As a result, you cannot assign an 8-bit string to a
16-bit string without some kind of conversion helper class or routine. For
the purpose of this document, we will refer to the 16-bit string classes in
class documentation. Every 16-bit class has an equivalent 8-bit class:

===================== ======================
Wide                  Narrow
===================== ======================
``nsAString``         ``nsACString``
``nsString``          ``nsCString``
``nsAutoString``      ``nsAutoCString``
``nsDependentString`` ``nsDependentCString``
===================== ======================

The string classes distinguish, as part of the type hierarchy, between
strings that must have a null-terminator at the end of their buffer
(``ns[C]String``) and strings that are not required to have a null-terminator
(``nsA[C]String``). ``nsA[C]String`` is the base of the string classes (since
it imposes fewer requirements) and ``ns[C]String`` is a class derived from it.
Functions taking strings as parameters should generally take one of these
four types.

In order to avoid unnecessary copying of string data (which can have
significant performance cost), the string classes support different ownership
models. All string classes support the following three ownership models
dynamically:

* reference-counted, copy-on-write buffers (the default)

* adopted buffers (a buffer that the string class owns, but is not reference
  counted, because it came from somewhere else)

* dependent buffers, that is, an underlying buffer that the string class does
  not own, but that the caller that constructed the string guarantees will
  outlive the string instance

Auto strings will prefer reference counting an existing reference-counted
buffer over their stack buffer, but will otherwise use their stack buffer for
anything that will fit in it.

There are a number of additional string classes:

* Classes which exist primarily as constructors for the other types,
  particularly ``nsDependent[C]String`` and ``nsDependent[C]Substring``. These
  types are really just convenient notation for constructing an
  ``nsA[C]String`` with a non-default ownership mode; they should not be
  thought of as different types.

* ``nsLiteral[C]String`` which should rarely be constructed explicitly but
  usually through the ``""_ns`` and ``u""_ns`` user-defined string literals.
  ``nsLiteral[C]String`` is trivially constructible and destructible, and
  therefore does not emit construction/destruction code when stored in static
  storage, unlike the other string classes.

The Major String Classes
------------------------

The list below describes the main base classes. Once you are familiar with
them, see the appendix describing What Class to Use When.

* **nsAString**/**nsACString**: the abstract base class for all strings. It
  provides an API for assignment, individual character access, basic
  manipulation of characters in the string, and string comparison. This class
  corresponds to the XPIDL ``AString`` or ``ACString`` parameter types.
  ``nsA[C]String`` is not necessarily null-terminated.

* **nsString**/**nsCString**: builds on ``nsA[C]String`` by guaranteeing a
  null-terminated storage. This allows for a method (``.get()``) to access the
  underlying character buffer.

The remainder of the string classes inherit from either ``nsA[C]String`` or
``ns[C]String``. Thus, every string class is compatible with ``nsA[C]String``.

.. note::

    In code which is generic over string width, ``nsA[C]String`` is sometimes
    known as ``nsTSubstring<CharT>``. ``nsAString`` is a type alias for
    ``nsTSubstring<char16_t>``, and ``nsACString`` is a type alias for
    ``nsTSubstring<char>``.

.. note::

    The type ``nsLiteral[C]String`` technically does not inherit from
    ``nsA[C]String``, but instead inherits from ``nsStringRepr<CharT>``. This
    allows the type to not generate destructors when stored in static
    storage.

    It can be implicitly coerced to ``const ns[C]String&`` (though can never
    be accessed mutably) and generally acts as-if it was a subclass of
    ``ns[C]String`` in most cases.

Since every string derives from ``nsAString`` (or ``nsACString``), they all
share a simple API. Common read-only methods include:

* ``.Length()`` - the number of code units (bytes for 8-bit string classes and ``char16_t`` for 16-bit string classes) in the string.
* ``.IsEmpty()`` - the fastest way of determining if the string has any value. Use this instead of testing ``string.Length() == 0``.
* ``.Equals(string)`` - ``true`` if the given string has the same value as the current string. Approximately the same as ``operator==``.

Common methods that modify the string:

* ``.Assign(string)`` - Assigns a new value to the string. Approximately the same as ``operator=``.
* ``.Append(string)`` - Appends a value to the string.
* ``.Insert(string, position)`` - Inserts the given string before the code unit at position.
* ``.Truncate(length)`` - shortens the string to the given length.

More complete documentation can be found in the `Class Reference`_.

As function parameters
~~~~~~~~~~~~~~~~~~~~~~

In general, use ``nsA[C]String`` references to pass strings across modules. For example:

.. code-block:: cpp

    // when passing a string to a method, use const nsAString&
    void nsFoo::PrintString(const nsAString& str);

    // when getting a string from a method, use nsAString&
    void nsFoo::GetString(nsAString& result);

The Concrete Classes - which classes to use when
------------------------------------------------

The concrete classes are for use in code that actually needs to store string
data. The most common uses of the concrete classes are as local variables,
and members in classes or structs.

.. digraph:: concreteclasses

    node [shape=rectangle]

    "nsA[C]String" -> "ns[C]String";
    "ns[C]String" -> "nsDependent[C]String";
    "nsA[C]String" -> "nsDependent[C]Substring";
    "nsA[C]String" -> "ns[C]SubstringTuple";
    "ns[C]String" -> "nsAuto[C]StringN";
    "ns[C]String" -> "nsLiteral[C]String" [style=dashed];
    "nsAuto[C]StringN" -> "nsPromiseFlat[C]String";
    "nsAuto[C]StringN" -> "nsPrintfCString";

The following is a list of the most common concrete classes. Once you are
familiar with them, see the appendix describing What Class to Use When.

* ``ns[C]String`` - a null-terminated string whose buffer is allocated on the
  heap. Destroys its buffer when the string object goes away.

* ``nsAuto[C]String`` - derived from ``ns[C]String``, a string which owns a 64
  code unit buffer in the same storage space as the string itself. If a string
  shorter than 64 code units is assigned to an ``nsAutoString``, then no extra
  storage will be allocated. For larger strings, a new buffer is allocated on
  the heap.

  If you want a number other than 64, use the templated types ``nsAutoStringN``
  / ``nsAutoCStringN``. (``nsAutoString`` and ``nsAutoCString`` are just
  typedefs for ``nsAutoStringN<64>`` and ``nsAutoCStringN<64>``, respectively.)

* ``nsDependent[C]String`` - derived from ``ns[C]String``, this string does not
  own its buffer. It is useful for converting a raw string pointer (``const
  char16_t*`` or ``const char*``) into a class of type ``nsAString``. Note that
  you must null-terminate buffers used by ``nsDependentString``. If you
  don't want to or can't null-terminate the buffer, use
  ``nsDependentSubstring``.

* ``nsPrintfCString`` - derived from ``nsCString``, this string behaves like an
  ``nsAutoCString``. The constructor takes parameters which allow it to
  construct an 8-bit string from a printf-style format string and parameter
  list.

There are also a number of concrete classes that are created as a side-effect
of helper routines, etc. You should avoid direct use of these classes. Let
the string library create the class for you.

* ``ns[C]SubstringTuple`` - created via string concatenation
* ``nsDependent[C]Substring`` - created through ``Substring()``
* ``nsPromiseFlat[C]String`` - created through ``PromiseFlatString()``
* ``nsLiteral[C]String`` - created through the ``""_ns`` and ``u""_ns`` user-defined literals

Of course, there are times when it is necessary to reference these string
classes in your code, but as a general rule they should be avoided.

Iterators
---------

Because Mozilla strings are always a single buffer, iteration over the
characters in the string is done using raw pointers:

.. code-block:: cpp

    /**
     * Find whether there is a tab character in `data`
     */
    bool HasTab(const nsAString& data) {
      const char16_t* cur = data.BeginReading();
      const char16_t* end = data.EndReading();

      for (; cur < end; ++cur) {
        if (char16_t('\t') == *cur) {
          return true;
        }
      }
      return false;
    }

Note that ``end`` points to the character after the end of the string buffer.
It should never be dereferenced.

Writing to a mutable string is also simple:

.. code-block:: cpp

    /**
     * Replace every tab character in `data` with a space.
     */
    void ReplaceTabs(nsAString& data) {
      char16_t* cur = data.BeginWriting();
      char16_t* end = data.EndWriting();

      for (; cur < end; ++cur) {
        if (char16_t('\t') == *cur) {
          *cur = char16_t(' ');
        }
      }
    }

You may change the length of a string via ``SetLength()``. Note that
iterators become invalid after changing the length of a string. If a string
buffer becomes smaller while writing it, use ``SetLength`` to inform the
string class of the new size:

.. code-block:: cpp

    /**
     * Remove every tab character from `data`
     */
    void RemoveTabs(nsAString& data) {
      // Length() returns an unsigned size_type, so do not store it in an int.
      nsAString::size_type len = data.Length();
      char16_t* cur = data.BeginWriting();
      char16_t* end = data.EndWriting();

      while (cur < end) {
        if (char16_t('\t') == *cur) {
          len -= 1;
          end -= 1;
          if (cur < end) {
            // Shift the remaining code units one position to the left.
            memmove(cur, cur + 1, (end - cur) * sizeof(char16_t));
          }
        } else {
          cur += 1;
        }
      }

      data.SetLength(len);
    }
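
The loop above calls ``memmove`` once per tab, so it can do quadratic work on
input that contains many tabs. A single-pass variant keeps separate read and
write positions instead. The following is a minimal standalone sketch, not
Gecko code: ``std::u16string`` stands in for ``nsAString``, ``resize()`` plays
the role of ``SetLength()``, and the name ``RemoveTabsSinglePass`` is
hypothetical.

.. code-block:: cpp

    #include <cstddef>
    #include <string>

    // Hypothetical standalone sketch: std::u16string stands in for nsAString.
    void RemoveTabsSinglePass(std::u16string& data) {
      std::size_t write = 0;  // index where the next kept code unit goes
      for (std::size_t read = 0; read < data.size(); ++read) {
        if (data[read] != u'\t') {
          data[write++] = data[read];
        }
      }
      // Inform the string of its new, possibly shorter, logical length.
      data.resize(write);
    }

As in the example above, the length is only adjusted once all writes are
done, and it only ever shrinks.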

Note that using ``BeginWriting()`` to make a string longer is not OK.
``BeginWriting()`` must not be used to write past the logical length of the
string indicated by ``EndWriting()`` or ``Length()``. Calling
``SetCapacity()`` before ``BeginWriting()`` does not affect what the previous
sentence says. To make the string longer, call ``SetLength()`` before
``BeginWriting()`` or use the ``BulkWrite()`` API described below.

Bulk Write
----------

``BulkWrite()`` allows capacity-aware cache-friendly low-level writes to the
string's buffer.

Capacity-aware means that the caller is made aware of how the
caller-requested buffer capacity was rounded up to mozjemalloc buckets. This
is useful when initially requesting best-case buffer size without yet knowing
the true size need. If the data that actually needs to be written is larger
than the best-case estimate but still fits within the rounded-up capacity,
there is no need to reallocate despite requesting the best-case capacity.

Cache-friendly means that the zero terminator for C compatibility is written
after the new content of the string has been written, so the result is a
forward-only linear write access pattern instead of a non-linear
back-and-forth sequence resulting from using ``SetLength()`` followed by
``BeginWriting()``.

Low-level means that writing via a raw pointer is possible as with
``BeginWriting()``.

``BulkWrite()`` takes three arguments: The new capacity (which may be rounded
up), the number of code units at the beginning of the string to preserve
(typically the old logical length), and a boolean indicating whether
reallocating a smaller buffer is OK if the requested capacity would fit in a
buffer that's smaller than the current one. It returns a ``mozilla::Result``
which contains either a usable ``mozilla::BulkWriteHandle<T>`` (where ``T`` is
the string's ``char_type``) or an ``nsresult`` explaining why none can be had
(presumably OOM).

The actual writes are performed through the returned
``mozilla::BulkWriteHandle<T>``. You must not access the string except via this
handle until you call ``Finish()`` on the handle in the success case or you let
the handle go out of scope without calling ``Finish()`` in the failure case, in
which case the destructor of the handle puts the string in a mostly harmless but
consistent state (containing a single REPLACEMENT CHARACTER if a capacity
greater than 0 was requested, or in the ``char`` case if the three-byte UTF-8
representation of the REPLACEMENT CHARACTER doesn't fit, an ASCII SUBSTITUTE).

``mozilla::BulkWriteHandle<T>`` autoconverts to a writable
``mozilla::Span<T>`` and also provides explicit access to itself as ``Span``
(``AsSpan()``) or via component accessors named consistently with those on
``Span``: ``Elements()`` and ``Length()``. (The latter is not the logical
length of the string but the writable length of the buffer.) The buffer
exposed via these methods includes the prefix that you may have requested to
be preserved. It's up to you to skip past it so as to not overwrite it.

If there's a need to request a different capacity before you are ready to
call ``Finish()``, you can call ``RestartBulkWrite()`` on the handle. It
takes three arguments that match the three arguments of ``BulkWrite()``. It
returns ``mozilla::Result<mozilla::Ok, nsresult>`` to indicate success or OOM.
Calling ``RestartBulkWrite()`` invalidates any previously obtained span, raw
pointer, or length.

Once you are done writing, call ``Finish()``. It takes two arguments: the new
logical length of the string (which must not exceed the capacity returned by
the ``Length()`` method of the handle) and a boolean indicating whether it's
OK to attempt to reallocate a smaller buffer in case a smaller mozjemalloc
bucket could accommodate the new logical length.

Helper Classes and Functions
----------------------------

Converting NSString strings
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use ``mozilla::CopyNSStringToXPCOMString()`` in
``mozilla/MacStringHelpers.h`` to convert NSString strings to XPCOM strings.

Searching strings - looking for substrings, characters, etc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``nsReadableUtils.h`` header provides helper methods for searching in
readable strings.

.. code-block:: cpp

    // `start` and `end` are in/out parameters, so they are taken by reference.
    bool FindInReadable(const nsAString& pattern,
                        nsAString::const_iterator& start,
                        nsAString::const_iterator& end,
                        const nsStringComparator& aComparator = nsDefaultStringComparator());

To use this, ``start`` and ``end`` should point to the beginning and end of a
string that you would like to search. If the search string is found,
``start`` and ``end`` will be adjusted to point to the beginning and end of
the found pattern. The return value is ``true`` or ``false``, indicating
whether or not the string was found.

An example:

.. code-block:: cpp

    const nsAString& str = GetSomeString();
    nsAString::const_iterator start, end;

    str.BeginReading(start);
    str.EndReading(end);

    constexpr auto valuePrefix = u"value="_ns;

    nsAString::const_iterator valueStart;
    if (FindInReadable(valuePrefix, start, end)) {
      // end now points to the character after the pattern
      valueStart = end;
    }
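
The in/out iterator contract can be tried outside of Gecko with the standard
library. The following is a sketch under stated assumptions, not the
implementation of ``FindInReadable()``: ``FindInRange`` is a hypothetical
name, ``std::u16string`` iterators stand in for ``nsAString::const_iterator``,
and leaving the iterators untouched on failure is a choice made for this
sketch only.

.. code-block:: cpp

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <string_view>

    using Iter = std::u16string::const_iterator;

    // Hypothetical stand-in for FindInReadable(): searches [start, end) for
    // `pattern` and, on success, narrows both iterators to the match.
    bool FindInRange(std::u16string_view pattern, Iter& start, Iter& end) {
      Iter found = std::search(start, end, pattern.begin(), pattern.end());
      if (found == end) {
        return false;  // this sketch leaves the caller's iterators untouched
      }
      start = found;
      end = found + static_cast<std::ptrdiff_t>(pattern.size());
      return true;
    }

After a successful call, everything from ``end`` up to the end of the
original string is the text that follows the pattern, which is how the
example above obtains ``valueStart``.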

Checking for Memory Allocation failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like other types in Gecko, the string classes use infallible memory
allocation by default, so you do not need to check for success when
allocating/resizing "normal" strings.

Most functions that modify strings (``Assign()``, ``SetLength()``, etc.) also
have an overload that takes a ``mozilla::fallible_t`` parameter. These
overloads return ``false`` instead of aborting if allocation fails. Use them
when creating/allocating strings which may be very large, and which the
program could recover from if the allocation fails.

Substrings (string fragments)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is very simple to refer to a substring of an existing string without
actually allocating new space and copying the characters into that substring.
``Substring()`` is the preferred method to create a reference to such a
string.

.. code-block:: cpp

    void ProcessString(const nsAString& str) {
      const nsAString& firstFive = Substring(str, 0, 5); // from index 0, length 5
      // firstFive is now a string representing the first 5 characters
    }
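
The standard library's ``std::u16string_view`` follows the same idea of a
non-owning fragment, which makes the consequences easy to see in isolation.
This is an analogy only, not how ``Substring()`` is implemented; the helper
name ``ViewOf`` is hypothetical.

.. code-block:: cpp

    #include <cstddef>
    #include <string>
    #include <string_view>

    // Hypothetical helper: a non-owning view of at most `count` code units
    // starting at `start`. Nothing is allocated and nothing is copied.
    std::u16string_view ViewOf(std::u16string_view str, std::size_t start,
                               std::size_t count) {
      if (start > str.size()) {
        start = str.size();  // clamp instead of throwing std::out_of_range
      }
      return str.substr(start, count);
    }

Because the fragment points into the original buffer, it must not outlive
that buffer, and it is not null-terminated at its own end. The same caveats
apply to dependent substrings of Mozilla strings.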

Unicode Conversion
------------------

Strings can be stored in two basic formats: 8-bit code unit (byte/``char``)
strings, or 16-bit code unit (``char16_t``) strings. Any string class with a
capital "C" in the classname contains 8-bit bytes. These classes include
``nsCString``, ``nsDependentCString``, and so forth. Any string class without
the "C" contains 16-bit code units.

An 8-bit string can be in one of many character encodings while a 16-bit
string is always in potentially-invalid UTF-16. (You can make a 16-bit string
guaranteed-valid UTF-16 by passing it to ``EnsureUTF16Validity()``.) The most
common encodings are:

* ASCII - 7-bit encoding for basic English-only strings. Each ASCII value
  is stored in exactly one byte in the array with the most-significant 8th bit
  set to zero.

* `UCS2 <http://www.unicode.org/glossary/#UCS_2>`_ - 16-bit encoding for a
  subset of Unicode, `BMP <http://www.unicode.org/glossary/#BMP>`_. The Unicode
  value of a character stored in UCS2 is stored in exactly one 16-bit
  ``char16_t`` in a string class.

* `UTF-8 <http://www.faqs.org/rfcs/rfc3629.html>`_ - 8-bit encoding for
  Unicode characters. Each Unicode character is stored in up to 4 bytes in a
  string class. UTF-8 is capable of representing the entire Unicode character
  repertoire, and it efficiently maps to `UTF-32
  <http://www.unicode.org/glossary/#UTF_32>`_. (Gtk and Rust natively use
  UTF-8.)

* `UTF-16 <http://www.unicode.org/glossary/#UTF_16>`_ - 16-bit encoding for
  Unicode storage, backwards compatible with UCS2. The Unicode value of a
  character stored in UTF-16 may require one or two 16-bit ``char16_t`` in a
  string class. The contents of ``nsAString`` always has to be regarded as in
  this encoding instead of UCS2. UTF-16 is capable of representing the entire
  Unicode character repertoire, and it efficiently maps to UTF-32. (Win32 W
  APIs and Mac OS X natively use UTF-16.)

* Latin1 - 8-bit encoding for the first 256 Unicode code points. Used for
  HTTP headers and for size-optimized storage in text node and SpiderMonkey
  strings. Latin1 converts to UTF-16 by zero-extending each byte to a 16-bit
  code unit. Note that this kind of "Latin1" is not available for encoding
  HTML, CSS, JS, etc. Specifying ``charset=latin1`` means the same as
  ``charset=windows-1252``. Windows-1252 is a similar but different encoding
  used for interchange.
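
The storage cost per character in the encodings above can be made concrete
with a few lines of standalone C++. This is an illustrative sketch, not a
Gecko API: the function names are hypothetical, and the input is assumed to
be a valid Unicode scalar value (at most U+10FFFF and not a surrogate).

.. code-block:: cpp

    #include <cstddef>

    // Number of 8-bit code units UTF-8 needs for one Unicode scalar value.
    constexpr std::size_t Utf8Units(char32_t c) {
      return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
    }

    // Number of 16-bit code units UTF-16 needs for one Unicode scalar value.
    constexpr std::size_t Utf16Units(char32_t c) {
      return c < 0x10000 ? 1 : 2;  // above the BMP: a surrogate pair
    }

    // Latin1, in the sense used here, covers only the first 256 code points.
    constexpr bool FitsLatin1(char32_t c) { return c <= 0xFF; }

For ASCII text, UTF-8 and Latin1 both use one byte per character, while
UTF-16 uses two. This is why ``Length()`` counts code units rather than
characters.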

In addition, there exist multiple other (legacy) encodings. The Web-relevant
ones are defined in the `Encoding Standard <https://encoding.spec.whatwg.org/>`_.
Conversions from these encodings to
UTF-8 and UTF-16 are provided by `mozilla::Encoding
<https://searchfox.org/mozilla-central/source/intl/Encoding.h#109>`_.
Additionally, on Windows there are some rare cases (e.g. drag&drop) where it's
necessary to call a system API with data encoded in the Windows
locale-dependent legacy encoding instead of UTF-16. In those rare cases, use
``MultiByteToWideChar``/``WideCharToMultiByte`` from kernel32.dll. Do not use
``iconv`` on \*nix. We only support UTF-8-encoded file paths on \*nix;
non-path Gtk strings are always UTF-8, and Cocoa and Java strings are always
UTF-16.

When working with existing code, it is important to examine the current usage
of the strings that you are manipulating, to determine the correct conversion
mechanism.

When writing new code, it can be confusing to know which storage class and
encoding is the most appropriate. There is no single answer to this question,
but the important points are:

* **Surprisingly many strings are very often just ASCII.** ASCII is a subset of
  UTF-8 and is, therefore, efficient to represent as UTF-8. Representing ASCII
  as UTF-16 is bad both for memory usage and cache locality.

* **Rust strongly prefers UTF-8.** If your C++ code is interacting with Rust
  code, using UTF-8 in ``nsACString`` and merely validating it when converting
  to Rust strings is more efficient than using ``nsAString`` on the C++ side.

* **Networking code prefers 8-bit strings.** Networking code tends to use 8-bit
  strings: either with UTF-8 or Latin1 (byte value is the Unicode scalar value)
  semantics.

* **JS and DOM prefer UTF-16.** Most Gecko code uses UTF-16 for compatibility
  with JS strings and DOM strings, which are potentially-invalid UTF-16.
  However, both DOM text nodes and JS strings store strings that only contain
  code points below U+0100 as Latin1 (byte value is the Unicode scalar value).

* **Windows and Cocoa use UTF-16.** Windows system APIs take UTF-16. Cocoa
  ``NSString`` is UTF-16.

* **Gtk uses UTF-8.** Gtk APIs take UTF-8 for non-file paths. In the Gecko
  case, we support only UTF-8 file paths outside Windows, so all Gtk strings
  are UTF-8 for our purposes though file paths received from Gtk may not be
  valid UTF-8.

To assist with ASCII, Latin1, UTF-8, and UTF-16 conversions, there are some
helper methods and classes. Some of these classes look like functions,
because they are most often used as temporary objects on the stack.

Short zero-terminated ASCII strings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have a short zero-terminated string that you are certain is always
ASCII, use these special-case methods instead of the conversions described in
the later sections.

* If you are assigning an ASCII literal to an ``nsACString``, use
  ``AssignLiteral()``.
* If you are assigning a literal to an ``nsAString``, use ``AssignLiteral()``
  and make the literal a ``u""`` literal. If the literal has to be a ``""``
  literal (as opposed to ``u""``) and is ASCII, still use ``AssignLiteral()``,
  but be aware that this involves a run-time inflation.
* If you are assigning a zero-terminated ASCII string that's not a literal from
  the compiler's point of view at the call site and you don't know the length
  of the string either (e.g. because it was looked up from an array of literals
  of varying lengths), use ``AssignASCII()``.

UTF-8 / UTF-16 conversion
~~~~~~~~~~~~~~~~~~~~~~~~~

.. cpp:function:: NS_ConvertUTF8toUTF16(const nsACString&)

    An ``nsAutoString`` subclass that converts a UTF-8 encoded ``nsACString``
    or ``const char*`` to a 16-bit UTF-16 string. If you need a ``const
    char16_t*`` buffer, you can use the ``.get()`` method. For example:

    .. code-block:: cpp

        /* signature: void HandleUnicodeString(const nsAString& str); */
        object->HandleUnicodeString(NS_ConvertUTF8toUTF16(utf8String));

        /* signature: void HandleUnicodeBuffer(const char16_t* str); */
        object->HandleUnicodeBuffer(NS_ConvertUTF8toUTF16(utf8String).get());

.. cpp:function:: NS_ConvertUTF16toUTF8(const nsAString&)

    An ``nsAutoCString`` which converts a 16-bit UTF-16 string (``nsAString``)
    to a UTF-8 encoded string. As above, you can use ``.get()`` to access a
    ``const char*`` buffer.

    .. code-block:: cpp

        /* signature: void HandleUTF8String(const nsACString& str); */
        object->HandleUTF8String(NS_ConvertUTF16toUTF8(utf16String));

        /* signature: void HandleUTF8Buffer(const char* str); */
        object->HandleUTF8Buffer(NS_ConvertUTF16toUTF8(utf16String).get());

.. cpp:function:: CopyUTF8toUTF16(const nsACString&, nsAString&)

    converts and copies:

    .. code-block:: cpp

        // return a UTF-16 value
        void Foo::GetUnicodeValue(nsAString& result) {
          CopyUTF8toUTF16(mLocalUTF8Value, result);
        }

.. cpp:function:: AppendUTF8toUTF16(const nsACString&, nsAString&)

    converts and appends:

    .. code-block:: cpp

        // return a UTF-16 value
        void Foo::GetUnicodeValue(nsAString& result) {
          // a u"" literal avoids run-time inflation into the 16-bit string
          result.AssignLiteral(u"prefix:");
          AppendUTF8toUTF16(mLocalUTF8Value, result);
        }

.. cpp:function:: CopyUTF16toUTF8(const nsAString&, nsACString&)

    converts and copies:

    .. code-block:: cpp

        // return a UTF-8 value
        void Foo::GetUTF8Value(nsACString& result) {
          CopyUTF16toUTF8(mLocalUTF16Value, result);
        }

.. cpp:function:: AppendUTF16toUTF8(const nsAString&, nsACString&)

    converts and appends:

    .. code-block:: cpp

        // return a UTF-8 value
        void Foo::GetUTF8Value(nsACString& result) {
          result.AssignLiteral("prefix:");
          AppendUTF16toUTF8(mLocalUTF16Value, result);
        }

Latin1 / UTF-16 Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~

The following should only be used when you can guarantee that the original
string is ASCII or Latin1 (in the sense that the byte value is the Unicode
scalar value; not in the windows-1252 sense). These helpers are very similar
to the UTF-8 / UTF-16 conversion helpers above.

UTF-16 to Latin1 converters
```````````````````````````

These converters are **very dangerous** because they **lose information**
during the conversion process. You should **avoid UTF-16 to Latin1
conversions** unless your strings are guaranteed to be Latin1 or ASCII. (In
the future, these conversions may start asserting in debug builds that their
input is in the permissible range.) If the input is actually in the Latin1
range, each 16-bit code unit is narrowed to an 8-bit byte by removing the
high half. Unicode code points above U+00FF result in garbage whose nature
must not be relied upon. (In the future the nature of the garbage will be CPU
architecture-dependent.) If you want to ``printf()`` something and don't care
what happens to non-ASCII, please convert to UTF-8 instead.

.. cpp:function:: NS_LossyConvertUTF16toASCII(const nsAString&)

    An ``nsAutoCString`` which holds a temporary buffer containing the Latin1
    value of the string.

.. cpp:function:: void LossyCopyUTF16toASCII(Span<const char16_t>, nsACString&)

    Does an in-place conversion from UTF-16 into a Latin1 string object.

.. cpp:function:: void LossyAppendUTF16toASCII(Span<const char16_t>, nsACString&)

    Appends a UTF-16 string to a Latin1 string.
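
To make the information loss concrete, the following standalone sketch
narrows UTF-16 to Latin1 but refuses input outside the permissible range
instead of producing garbage. It is not one of the Gecko converters listed
above: ``NarrowToLatin1`` is a hypothetical name, and reporting failure via
``std::optional`` is an assumption made for illustration.

.. code-block:: cpp

    #include <optional>
    #include <string>
    #include <string_view>

    // Hypothetical checked variant: returns std::nullopt instead of garbage
    // when a code unit is outside the Latin1 range.
    std::optional<std::string> NarrowToLatin1(std::u16string_view utf16) {
      std::string latin1;
      latin1.reserve(utf16.size());
      for (char16_t unit : utf16) {
        if (unit > 0xFF) {
          return std::nullopt;  // narrowing would lose information
        }
        // Drop the (all-zero) high half of the code unit.
        latin1.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
      }
      return latin1;
    }

A caller that cannot guarantee Latin1 input sees the failure here, whereas
the lossy converters give no such signal.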

Latin1 to UTF-16 converters
```````````````````````````

These converters are very dangerous because they will **turn non-ASCII UTF-8
or windows-1252 input into a meaningless UTF-16 string**.
You should **avoid ASCII to UTF-16 conversions** unless your strings are
guaranteed to be ASCII or Latin1 in the sense of the byte value being the
Unicode scalar value. Every byte is zero-extended into a 16-bit code unit.

It is correct to use these on most HTTP header values, but **it's always
wrong to use these on HTTP response bodies!** (Use ``mozilla::Encoding`` to
deal with response bodies.)

.. cpp:function:: NS_ConvertASCIItoUTF16(const nsACString&)

    An ``nsAutoString`` which holds a temporary buffer containing the value of
    the Latin1 to UTF-16 conversion.

.. cpp:function:: void CopyASCIItoUTF16(Span<const char>, nsAString&)

    does an in-place conversion from Latin1 to UTF-16.

.. cpp:function:: void AppendASCIItoUTF16(Span<const char>, nsAString&)

    appends a Latin1 string to a UTF-16 string.
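
Zero-extension is simple enough to show in a few lines of standalone C++.
This is a sketch of the semantics described above, not the Gecko
implementation; ``InflateLatin1`` is a hypothetical name.

.. code-block:: cpp

    #include <string>
    #include <string_view>

    // Hypothetical sketch of Latin1 -> UTF-16 inflation.
    std::u16string InflateLatin1(std::string_view latin1) {
      std::u16string utf16;
      utf16.reserve(latin1.size());
      for (char byte : latin1) {
        // Go through unsigned char so that bytes >= 0x80 are zero-extended
        // rather than sign-extended on platforms where char is signed.
        utf16.push_back(
            static_cast<char16_t>(static_cast<unsigned char>(byte)));
      }
      return utf16;
    }

Feeding UTF-8 into such a function shows why the warning matters: the two
UTF-8 bytes ``0xC3 0xA9`` that encode U+00E9 come out as the two separate
code units U+00C3 and U+00A9.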

Comparing ns*Strings with C strings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can compare ``ns*Strings`` with C strings by converting the ``ns*String``
to a C string, or by comparing directly against a C String.

.. cpp:function:: bool nsAString::EqualsASCII(const char*)

    Compares with an ASCII C string.

.. cpp:function:: bool nsAString::EqualsLiteral(...)

    Compares with a string literal.
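
Comparing directly avoids allocating a converted copy of either string. The
idea can be sketched in standalone C++ as follows. This illustrates the
concept only and is not the implementation of ``EqualsASCII()``;
``EqualsAsciiSketch`` is a hypothetical name and ``std::u16string_view``
stands in for ``nsAString``.

.. code-block:: cpp

    #include <cstddef>
    #include <string_view>

    // Hypothetical sketch: compares a UTF-16 string with a zero-terminated
    // ASCII C string without converting either side.
    bool EqualsAsciiSketch(std::u16string_view str, const char* ascii) {
      std::size_t i = 0;
      for (; i < str.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(ascii[i]);
        if (c == '\0' || str[i] != c) {
          return false;  // C string is shorter, or code units differ
        }
      }
      return ascii[i] == '\0';  // equal only if the C string ends here too
    }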

Common Patterns
---------------

Literal Strings
~~~~~~~~~~~~~~~

A literal string is a raw string value that is written in some C++ code. For
example, in the statement ``printf("Hello World\n");`` the value ``"Hello
World\n"`` is a literal string. It is often necessary to insert literal
string values when an ``nsAString`` or ``nsACString`` is required. Two
user-defined literals are provided that implicitly convert to ``const
nsString&`` and ``const nsCString&``, respectively:

* ``""_ns`` for 8-bit literals, converting implicitly to ``const nsCString&``
* ``u""_ns`` for 16-bit literals, converting implicitly to ``const nsString&``

The benefits of the user-defined literals may seem unclear, given that
``nsDependentCString`` will also wrap a string value in an ``nsCString``. The
advantage of the user-defined literals is twofold.

* The length of these strings is calculated at compile time, so the string does
  not need to be scanned at runtime to determine its length.

* Literal strings live for the lifetime of the binary, and can be moved between
  the ``ns[C]String`` classes without being copied or freed.
 708
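Both advantages follow from how C++ user-defined literals work, which can be
shown with the standard library alone. In this sketch the ``_lit`` suffix,
``kGreeting`` and ``DefaultTitle`` are hypothetical names. It is an analogy
for ``_ns``, not its definition.

.. code-block:: cpp

    #include <cstddef>
    #include <string_view>

    // Hypothetical literal suffix modeled on the idea behind ""_ns: the
    // compiler passes the literal's length to the operator, so nothing has to
    // scan for a terminator at runtime.
    constexpr std::u16string_view operator""_lit(const char16_t* aStr,
                                                  std::size_t aLen) {
      return std::u16string_view(aStr, aLen);
    }

    constexpr auto kGreeting = u"start value"_lit;

    // The length is a constant expression, checked here at compile time.
    static_assert(kGreeting.size() == 11, "length is known at compile time");

    // The characters live in static storage, so a view of them can be
    // returned and kept without copying or freeing anything.
    constexpr std::u16string_view DefaultTitle() { return u"untitled"_lit; }
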
Here are some examples of proper usage of the literals (both standard and
user-defined):

.. code-block:: cpp

    // call Init(const nsLiteralString&) - enforces that it's only called with literals
    Init(u"start value"_ns);

    // call Init(const nsAString&)
    Init(u"start value"_ns);

    // call Init(const nsACString&)
    Init("start value"_ns);

If a literal is defined via a macro, you can convert it to
``nsLiteralString`` or ``nsLiteralCString`` using their constructors.
Consider replacing the macro with a named ``constexpr`` constant instead.

In some cases, an 8-bit literal is defined via a macro, either within code or
from the environment, but it can't be changed or is used both as an 8-bit and
a 16-bit string. In these cases, you can use the
``NS_LITERAL_STRING_FROM_CSTRING`` macro to construct an ``nsLiteralString``
and do the conversion at compile time.

String Concatenation
~~~~~~~~~~~~~~~~~~~~

Strings can be concatenated together using the + operator. The resulting
string is a ``const nsSubstringTuple`` object. The resulting object can be
treated and referenced similarly to an ``nsAString`` object. Concatenation *does
not copy the substrings*. The strings are only copied when the concatenation
is assigned into another string object. The ``nsSubstringTuple`` object holds
pointers to the original strings. Therefore, the ``nsSubstringTuple`` object is
dependent on all of its substrings, meaning that their lifetime must be at
least as long as the ``nsSubstringTuple`` object.

For example, you can use the value of two strings and pass their
concatenation on to another function which takes a ``const nsAString&``:

.. code-block:: cpp

    void HandleTwoStrings(const nsAString& one, const nsAString& two) {
      // call HandleString(const nsAString&)
      HandleString(one + two);
    }

NOTE: The two strings are implicitly combined into a temporary ``nsString``
in this case, and the temporary string is passed into ``HandleString``. If
``HandleString`` assigns its input into another ``nsString``, then the string
buffer will be shared, negating the cost of the intermediate temporary.

You can concatenate N strings and store the result in a temporary variable:

.. code-block:: cpp

    constexpr auto start = u"start "_ns;
    constexpr auto middle = u"middle "_ns;
    constexpr auto end = u"end"_ns;
    // concatenate 3 dependent fragments - they are copied only once, when the
    // result is assigned into combinedString
    nsString combinedString = start + middle + end;

    // call void HandleString(const nsAString&);
    HandleString(combinedString);

It is safe to concatenate user-defined literals because the temporary
``nsLiteral[C]String`` objects will live as long as the temporary
concatenation object (of type ``nsSubstringTuple``).

.. code-block:: cpp

    // call HandlePage(const nsAString&);
    // safe because the concatenated-string will live as long as its substrings
    HandlePage(u"start "_ns + u"end"_ns);

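The lazy behavior of concatenation can be modeled with ``std::u16string_view``.
This is a sketch under stated assumptions: ``TupleModel``, ``Concat`` and
``Flatten`` are hypothetical names, and the real ``nsSubstringTuple`` differs
in its details. The point is only that the tuple borrows its operands and that
a single copy happens when the result is assigned into an owning string.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <string>
    #include <string_view>

    // Hypothetical model of a concatenation tuple: it stores views of its two
    // operands and copies nothing until it is flattened.
    struct TupleModel {
      std::u16string_view mHead;
      std::u16string_view mTail;

      std::size_t Length() const { return mHead.size() + mTail.size(); }

      // The single copy happens here, into a buffer that is sized once.
      std::u16string Flatten() const {
        std::u16string result;
        result.reserve(Length());
        result.append(mHead);
        result.append(mTail);
        return result;
      }
    };

    TupleModel Concat(std::u16string_view aHead, std::u16string_view aTail) {
      return TupleModel{aHead, aTail};
    }

    void DemoTuple() {
      std::u16string one = u"start ";
      std::u16string two = u"end";

      // No characters are copied yet: the tuple points into one and two, so
      // both must outlive it.
      TupleModel tuple = Concat(one, two);
      assert(tuple.mHead.data() == one.data());
      assert(tuple.Length() == 9);

      // Assigning into an owning string performs the copy.
      std::u16string combined = tuple.Flatten();
      assert(combined == u"start end");
    }
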
Local Variables
~~~~~~~~~~~~~~~

Local variables within a function are usually stored on the stack. The
``nsAutoString``/``nsAutoCString`` classes are subclasses of the
``nsString``/``nsCString`` classes. They own a 64-character buffer allocated
in the same storage space as the string itself. If the ``nsAutoString`` is
allocated on the stack, then it has at its disposal a 64-character stack
buffer. This allows the implementation to avoid allocating extra memory when
dealing with small strings. ``nsAutoStringN``/``nsAutoCStringN`` are more
general alternatives that let you choose the number of characters in the
inline buffer.

.. code-block:: cpp

    ...
    nsAutoString value;
    GetValue(value); // if the result is less than 64 code units,
                     // then this just saved us an allocation
    ...

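The inline-buffer idea can be illustrated with a small self-contained class.
``InlineStringModel`` is a hypothetical model written for this guide and makes
no claim about how ``nsAutoStringN`` is actually laid out. It shows only the
trade-off: short values need no heap allocation, while the object itself grows
with the size of the inline buffer.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <memory>
    #include <string>
    #include <string_view>

    // Hypothetical model of an auto string: N code units of inline storage,
    // with a heap allocation only when the value plus its terminator does
    // not fit.
    template <std::size_t N>
    class InlineStringModel {
     public:
      void Assign(std::u16string_view aValue) {
        char16_t* dest = mInline;
        if (aValue.size() + 1 > N) {
          mHeap = std::make_unique<char16_t[]>(aValue.size() + 1);
          dest = mHeap.get();
        } else {
          mHeap.reset();
        }
        aValue.copy(dest, aValue.size());
        dest[aValue.size()] = u'\0';
        mLength = aValue.size();
      }

      bool UsesInlineBuffer() const { return !mHeap; }
      std::size_t Length() const { return mLength; }

     private:
      char16_t mInline[N] = {};
      std::unique_ptr<char16_t[]> mHeap;
      std::size_t mLength = 0;
    };

    void DemoInlineString() {
      InlineStringModel<64> value;
      value.Assign(u"short value");
      assert(value.UsesInlineBuffer());   // fits: no allocation

      value.Assign(std::u16string(100, u'x'));
      assert(!value.UsesInlineBuffer());  // too long: spills to the heap

      // The inline buffer is part of the object itself.
      static_assert(sizeof(InlineStringModel<64>) >= 64 * sizeof(char16_t), "");
    }
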
Member Variables
~~~~~~~~~~~~~~~~

In general, you should use the concrete classes ``nsString`` and
``nsCString`` for member variables.

.. code-block:: cpp

    class Foo {
      ...
      // these store UTF-8 and UTF-16 values respectively
      nsCString mLocalName;
      nsString mTitle;
    };

A common incorrect pattern is to use ``nsAutoString``/``nsAutoCString``
for member variables. As described in `Local Variables`_, these classes have
a built-in buffer that makes them very large. This means that if you include
them in a class, they bloat the class by 64 bytes (``nsAutoCString``) or 128
bytes (``nsAutoString``).

Raw Character Pointers
~~~~~~~~~~~~~~~~~~~~~~

``PromiseFlatString()`` and ``PromiseFlatCString()`` can be used to obtain a
null-terminated buffer containing the same value as the source string. They
create a temporary buffer only if necessary. This is most often used in order
to pass an ``nsAString`` to an API which requires a null-terminated string.

In the following example, an ``nsAString`` is combined with a literal string,
and the result is passed to an API which requires a simple character buffer.

.. code-block:: cpp

    // Modify the URL and pass to AddPage(const char16_t* url)
    void AddModifiedPage(const nsAString& url) {
      constexpr auto httpPrefix = u"http://"_ns;
      const nsAString& modifiedURL = httpPrefix + url;

      // creates a temporary buffer
      AddPage(PromiseFlatString(modifiedURL).get());
    }

``PromiseFlatString()`` is smart when handed a string that is already
null-terminated. It avoids creating the temporary buffer in such cases.

.. code-block:: cpp

    // Modify the URL and pass to AddPage(const char16_t* url)
    void AddModifiedPage(const nsAString& url, bool addPrefix) {
      if (addPrefix) {
        // MUST create a temporary buffer - string is multi-fragmented
        constexpr auto httpPrefix = u"http://"_ns;
        AddPage(PromiseFlatString(httpPrefix + url).get());
      } else {
        // MIGHT create a temporary buffer, does a runtime check
        AddPage(PromiseFlatString(url).get());
      }
    }

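The runtime check can be pictured with a small model in standard C++. All
names here (``FragmentModel``, ``FlatModel``, ``DemoFlat``) are hypothetical,
and the explicit ``mIsTerminated`` flag is an assumption made to keep the
sketch short. It is not a description of the real string internals.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <string>

    // Hypothetical model of a string fragment that records whether a
    // terminator follows its last character.
    struct FragmentModel {
      const char16_t* mData;
      std::size_t mLength;
      bool mIsTerminated;
    };

    // Hypothetical model of the PromiseFlatString() idea: borrow the buffer
    // when it is already null-terminated, copy it into a temporary otherwise.
    class FlatModel {
     public:
      explicit FlatModel(const FragmentModel& aFragment) {
        if (aFragment.mIsTerminated) {
          mPtr = aFragment.mData;  // runtime check passed: no copy
        } else {
          mCopy.assign(aFragment.mData, aFragment.mLength);
          mPtr = mCopy.c_str();    // temporary, terminated copy
        }
      }
      FlatModel(const FlatModel&) = delete;
      FlatModel& operator=(const FlatModel&) = delete;

      // Valid only while this object and the source buffer are alive.
      const char16_t* get() const { return mPtr; }

     private:
      std::u16string mCopy;
      const char16_t* mPtr;
    };

    void DemoFlat() {
      const char16_t* url = u"http://example.org/";

      // The whole literal is terminated, so the buffer is borrowed.
      FragmentModel whole{url, 19, true};
      assert(FlatModel(whole).get() == url);

      // The first four code units are followed by ':' rather than a
      // terminator, so a copy is required.
      FragmentModel scheme{url, 4, false};
      FlatModel flat(scheme);
      assert(flat.get() != url);
      assert(std::u16string(flat.get()) == u"http");
    }

As in the examples above, the pointer returned by ``get()`` must not be kept
after the temporary object goes out of scope.
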
.. note::

    It is **not** possible to efficiently transfer ownership of a string
    class' internal buffer into an owned ``char*`` which can be safely
    freed by other components due to the COW optimization.

    If working with a legacy API which requires malloced ``char*`` buffers,
    prefer using ``ToNewUnicode``, ``ToNewCString`` or ``ToNewUTF8String``
    over ``strdup`` to create owned ``char*`` pointers.

``printf`` and a UTF-16 string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For debugging, it's useful to ``printf`` a UTF-16 string (``nsString``,
``nsAutoString``, etc). This usually requires converting it to an 8-bit
string, because that's what ``printf`` expects. Use:

.. code-block:: cpp

    printf("%s\n", NS_ConvertUTF16toUTF8(yourString).get());

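To make the conversion step concrete, here is a self-contained sketch of a
UTF-16 to UTF-8 encoder in standard C++. ``Utf16ToUtf8Model`` is a
hypothetical helper for illustration only. In Mozilla code, use
``NS_ConvertUTF16toUTF8`` as shown above. Replacing unpaired surrogates with
U+FFFD is an assumption of this sketch.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <string_view>

    // Hypothetical stand-in for a UTF-16 to UTF-8 conversion, so that the
    // result can be passed to printf("%s").
    std::string Utf16ToUtf8Model(std::u16string_view aInput) {
      std::string out;
      for (std::size_t i = 0; i < aInput.size(); ++i) {
        std::uint32_t cp = aInput[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < aInput.size() &&
            aInput[i + 1] >= 0xDC00 && aInput[i + 1] <= 0xDFFF) {
          // Combine a surrogate pair into one scalar value.
          cp = 0x10000 + ((cp - 0xD800) << 10) + (aInput[i + 1] - 0xDC00);
          ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
          cp = 0xFFFD;  // unpaired surrogate (assumption of this sketch)
        }
        if (cp < 0x80) {
          out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
          out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
          out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
          out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
          out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
          out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
          out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
          out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
          out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
          out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
      }
      return out;
    }

    void DemoPrintUtf16() {
      // U+00E9 encodes as the two bytes 0xC3 0xA9.
      assert(Utf16ToUtf8Model(u"caf\u00E9") == "caf\xC3\xA9");
      // U+1F600 is a surrogate pair in UTF-16 and four bytes in UTF-8.
      assert(Utf16ToUtf8Model(u"\U0001F600") == "\xF0\x9F\x98\x80");
      std::printf("%s\n", Utf16ToUtf8Model(u"caf\u00E9").c_str());
    }
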
Sequence of appends without reallocating
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``SetCapacity()`` allows you to give the string a hint of the future string
length caused by a sequence of appends (excluding appends that convert
between UTF-16 and UTF-8 in either direction) in order to avoid multiple
allocations during the sequence of appends. However, the other
allocation-avoidance features of XPCOM strings interact badly with
``SetCapacity()``, making it something of a footgun.

``SetCapacity()`` is appropriate to use before a sequence of multiple
operations from the following list (without operations that are not on the
list between the ``SetCapacity()`` call and operations from the list):

* ``Append()``
* ``AppendASCII()``
* ``AppendLiteral()``
* ``AppendPrintf()``
* ``AppendInt()``
* ``AppendFloat()``
* ``LossyAppendUTF16toASCII()``
* ``AppendASCIItoUTF16()``

**DO NOT** call ``SetCapacity()`` if the subsequent operations on the string
do not meet the criteria above. Operations that undo the benefits of
``SetCapacity()`` include but are not limited to:

* ``SetLength()``
* ``Truncate()``
* ``Assign()``
* ``AssignLiteral()``
* ``Adopt()``
* ``CopyASCIItoUTF16()``
* ``LossyCopyUTF16toASCII()``
* ``AppendUTF16toUTF8()``
* ``AppendUTF8toUTF16()``
* ``CopyUTF16toUTF8()``
* ``CopyUTF8toUTF16()``

If your string is an ``nsAuto[C]String`` and you are calling
``SetCapacity()`` with a constant ``N``, please instead declare the string as
``nsAuto[C]StringN<N+1>`` without calling ``SetCapacity()`` (while being
mindful of not using such a large ``N`` as to overflow the run-time stack).

There is no need to include room for the null terminator in the value passed
to ``SetCapacity()``: that is the job of the string class.

Note: Calling ``SetCapacity()`` does not give you permission to use the
pointer obtained from ``BeginWriting()`` to write past the current length (as
returned by ``Length()``) of the string. Please use either ``BulkWrite()`` or
``SetLength()`` instead.

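The reserve-then-append pattern has a close analogue in the standard library,
which makes it easy to try out in isolation. The sketch below uses
``std::u16string::reserve()`` purely as an analogy for ``SetCapacity()``.
``BuildGreeting`` and ``DemoReserve`` are hypothetical names, and the XPCOM
string classes have additional behaviors that this model does not capture.

.. code-block:: cpp

    #include <cassert>
    #include <string>

    // Analogy only: std::u16string::reserve() plays the role of SetCapacity().
    std::u16string BuildGreeting(const std::u16string& aName) {
      std::u16string result;
      // One up-front allocation sized for everything appended below.
      result.reserve(7 + aName.size() + 1);
      result.append(u"Hello, ");
      result.append(aName);
      result.push_back(u'!');
      return result;
    }

    void DemoReserve() {
      std::u16string s;
      s.reserve(100);
      const char16_t* before = s.data();
      for (int i = 0; i < 100; ++i) {
        s.push_back(u'x');
      }
      // Appends within the reserved capacity never reallocate.
      assert(s.data() == before);
      assert(s.capacity() >= 100);

      // Reserving does not change the length, and writing past size()
      // through data() would be undefined behavior. Change the length first.
      s.resize(120);
      assert(s.size() == 120);

      assert(BuildGreeting(u"world") == u"Hello, world!");
    }
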
.. _stringguide.xpidl:

XPIDL
-----

The string library is also available through IDL. By declaring attributes and
methods using the specially defined IDL types, string classes are used as
parameters to the corresponding methods.

XPIDL String types
~~~~~~~~~~~~~~~~~~

The C++ signatures follow the abstract-type convention described above, such
that all method parameters are based on the abstract classes. The following
table describes the purpose of each string type in IDL.

+-----------------+----------------+----------------------------------------------------------------------------------+
| XPIDL Type      | C++ Type       | Purpose                                                                          |
+=================+================+==================================================================================+
| ``string``      | ``char*``      | Raw character pointer to ASCII (7-bit) string, no string classes used.           |
|                 |                |                                                                                  |
|                 |                | High bit is not guaranteed across XPConnect boundaries.                          |
+-----------------+----------------+----------------------------------------------------------------------------------+
| ``wstring``     | ``char16_t*``  | Raw character pointer to UTF-16 string, no string classes used.                  |
+-----------------+----------------+----------------------------------------------------------------------------------+
| ``AString``     | ``nsAString``  | UTF-16 string.                                                                   |
+-----------------+----------------+----------------------------------------------------------------------------------+
| ``ACString``    | ``nsACString`` | 8-bit string. All bits are preserved across XPConnect boundaries.                |
+-----------------+----------------+----------------------------------------------------------------------------------+
| ``AUTF8String`` | ``nsACString`` | UTF-8 string.                                                                    |
|                 |                |                                                                                  |
|                 |                | Converted to UTF-16 as necessary when value is used across XPConnect boundaries. |
+-----------------+----------------+----------------------------------------------------------------------------------+

Callers should prefer using the string classes ``AString``, ``ACString`` and
``AUTF8String`` over the raw pointer types ``string`` and ``wstring`` in
almost all situations.

C++ Signatures
~~~~~~~~~~~~~~

In XPIDL, ``in`` parameters are read-only, and the C++ signatures for
``*String`` parameters follow the above guidelines by using ``const
nsAString&`` for these parameters. ``out`` and ``inout`` parameters are
defined simply as ``nsAString&`` so that the callee can write to them.

.. code-block:: cpp

    interface nsIFoo : nsISupports {
        attribute AString utf16String;
        AUTF8String getValue(in ACString key);
    };

.. code-block:: cpp

    class nsIFoo : public nsISupports {
     public:
      NS_IMETHOD GetUtf16String(nsAString& aResult) = 0;
      NS_IMETHOD SetUtf16String(const nsAString& aValue) = 0;
      NS_IMETHOD GetValue(const nsACString& aKey, nsACString& aResult) = 0;
    };

In the above example, ``utf16String`` is treated as a UTF-16 string. The
implementation of ``GetUtf16String()`` will use ``aResult.Assign`` to
"return" the value. In ``SetUtf16String()`` the value of the string can be
used through a variety of methods including `Iterators`_,
``PromiseFlatString``, and assignment to other strings.

In ``GetValue()``, the first parameter, ``aKey``, is treated as a raw
sequence of 8-bit values. Any non-ASCII characters in ``aKey`` will be
preserved when crossing XPConnect boundaries. The implementation of
``GetValue()`` will assign a UTF-8 encoded 8-bit string into ``aResult``. If
this method is called across XPConnect boundaries, such as from a script,
then the result will be decoded from UTF-8 into UTF-16 and used as a Unicode
value.

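The out-parameter convention can be sketched without the XPCOM headers. In
the following model every type is a stand-in: ``ResultModel``, ``kOk`` and
``FooModel`` are hypothetical, with ``std::u16string`` playing the role of the
string classes. It shows only the shape of a getter that "returns" through
``aResult`` and a setter that reads a ``const`` reference.

.. code-block:: cpp

    #include <cassert>
    #include <cstdint>
    #include <string>

    // Stand-ins so the sketch compiles on its own. They are NOT the real
    // XPCOM definitions.
    using ResultModel = std::uint32_t;
    constexpr ResultModel kOk = 0;

    class FooModel {
     public:
      // Getter: "returns" the value by assigning into the out-parameter.
      ResultModel GetUtf16String(std::u16string& aResult) const {
        aResult.assign(mValue);
        return kOk;
      }

      // Setter: the in-parameter is read-only.
      ResultModel SetUtf16String(const std::u16string& aValue) {
        mValue = aValue;
        return kOk;
      }

     private:
      std::u16string mValue;
    };

    void DemoFoo() {
      FooModel foo;
      ResultModel rv = foo.SetUtf16String(u"title");
      assert(rv == kOk);

      std::u16string out;
      rv = foo.GetUtf16String(out);
      assert(rv == kOk);
      assert(out == u"title");
    }
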
String Guidelines
-----------------

Follow these simple rules in your code to keep your fellow developers,
reviewers, and users happy.

* Use the most abstract string class that you can. Usually this is:

  * ``nsAString`` for function parameters
  * ``nsString`` for member variables
  * ``nsAutoString`` for local (stack-based) variables

* Use the ``""_ns`` and ``u""_ns`` user-defined literals to represent literal
  strings (e.g. ``"foo"_ns``) as ``nsAString``-compatible objects.
* Use string concatenation (i.e. the "+" operator) when combining strings.
* Use ``nsDependentString`` when you have a raw character pointer that you
  need to convert to an ``nsAString``-compatible string.
* Use ``Substring()`` to extract fragments of existing strings.
* Use `iterators`_ to parse and extract string fragments.

Class Reference
---------------

.. cpp:class:: template<typename T> nsTSubstring<T>

    .. note::

        The ``nsTSubstring<char_type>`` class is usually written as
        ``nsAString`` or ``nsACString``.

    .. cpp:function:: size_type Length() const

    .. cpp:function:: bool IsEmpty() const

    .. cpp:function:: bool IsVoid() const

    .. cpp:function:: const char_type* BeginReading() const

    .. cpp:function:: const char_type* EndReading() const

    .. cpp:function:: bool Equals(const self_type&, comparator_type = ...) const

    .. cpp:function:: char_type First() const

    .. cpp:function:: char_type Last() const

    .. cpp:function:: size_type CountChar(char_type) const

    .. cpp:function:: int32_t FindChar(char_type, index_type aOffset = 0) const

    .. cpp:function:: void Assign(const self_type&)

    .. cpp:function:: void Append(const self_type&)

    .. cpp:function:: void Insert(const self_type&, index_type aPos)

    .. cpp:function:: void Cut(index_type aCutStart, size_type aCutLength)

    .. cpp:function:: void Replace(index_type aCutStart, size_type aCutLength, const self_type& aStr)

    .. cpp:function:: void Truncate(size_type aLength)

    .. cpp:function:: void SetIsVoid(bool)

        Makes the string void (null). XPConnect and WebIDL will convert void
        nsAStrings to JavaScript ``null``.

    .. cpp:function:: char_type* BeginWriting()

    .. cpp:function:: char_type* EndWriting()

    .. cpp:function:: void SetCapacity(size_type)

        Informs the string of the buffer size needed before a sequence of
        calls to ``Append()`` or to appends that convert between UTF-16 and
        Latin1 in either direction. (Don't use it with appends that convert
        between UTF-16 and UTF-8 in either direction.) Calling this method
        does not give you permission to use ``BeginWriting()`` to write past
        the logical length of the string. Use ``SetLength()`` or
        ``BulkWrite()`` as appropriate.

    .. cpp:function:: void SetLength(size_type)

    .. cpp:function:: Result<BulkWriteHandle<char_type>, nsresult> BulkWrite(size_type aCapacity, size_type aPrefixToPreserve, bool aAllowShrinking)