docs/pdds/pdd28_strings.pod

   1 # Copyright (C) 2008-2009, Parrot Foundation.
   2 # $Id$
   3
   4 =head1 PDD 28: Strings
   5
   6 =head2 Abstract
   7
   8 This PDD describes the conventions for strings in Parrot,
   9 including but not limited to support for multiple character sets,
  10 encodings, and languages.
  11
  12 =head2 Version
  13
  14 $Revision$
  15
  16 =head2 Definitions
  17
  18 =head3 Character
  19
  20 A character is the abstract description of a symbol. It's the smallest
  21 chunk of text a computer knows how to deal with. Internally to
  22 the computer, a character (just like everything else) is a number, so
  23 a few further definitions are needed.
  24
  25 =head3 Character Set
  26
  27 The Unicode Standard prefers the concepts of I<character repertoire> (a
  28 collection of characters) and I<character code> (a mapping which tells you
  29 what number represents which character in the repertoire). Character set is
  30 commonly used to mean the standard which defines both a repertoire and a code.
  31
  32 =head3 Codepoint
  33
  34 A codepoint is the numeric representation of a character according to a
  35 given character set. So in ASCII, the character C<A> has codepoint 0x41.
  36
  37 =head3 Encoding
  38
  39 An encoding determines how a codepoint is represented inside a computer.
  40 Simple encodings like ASCII define that the codepoints 0-127 simply
  41 live as their numeric equivalents inside eight-bit bytes. Other
  42 fixed-width encodings like UCS-2 use more bytes to encode more
  43 codepoints. Variable-width encodings like UTF-8 use one byte for
  44 codepoints 0-127, two bytes for codepoints 128-2047, and so on.
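
As a rough illustration of the difference between fixed-width and
variable-width encodings, the following Python sketch reports how many bytes
one codepoint occupies. The helper names C<utf8_width> and C<ucs2_width> are
hypothetical and exist only for this example. Python's C<utf-16-be> codec
stands in for UCS-2, an assumption that only holds for codepoints below
0x10000 outside the surrogate range.

```python
def utf8_width(codepoint):
    """Number of bytes UTF-8 uses for a single codepoint."""
    return len(chr(codepoint).encode("utf-8"))


def ucs2_width(codepoint):
    """Number of bytes a 16-bit fixed-width encoding uses (BMP only)."""
    return len(chr(codepoint).encode("utf-16-be"))


# Walk across the boundaries mentioned in the text: 127/128 and 2047/2048.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800):
    print(hex(cp), utf8_width(cp), ucs2_width(cp))
```

The UTF-8 width changes at the 128 and 2048 boundaries, while the 16-bit
width stays constant for every codepoint shown.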
  45
  46 Character sets and encodings are related but separate concepts. An
  47 encoding is the lower-level representation of a string's data, whereas
  48 the character set determines higher-level semantics. Typically,
  49 character set functions will ask a string's encoding functions to
  50 retrieve data from the string, and then process the retrieved data.
  51
  52 =head3 Combining Character
  53
  54 A combining character is a Unicode concept. It is a character which
  55 modifies the preceding character. For instance, accents, lines, circles,
  56 boxes, etc. which are not to be displayed on their own, but to be
  57 composed with the preceding character.
  58
  59 =head3 Grapheme
  60
  61 In linguistics, a grapheme is a single symbol in a writing system (letter,
  62 number, punctuation mark, kanji, hiragana, Arabic glyph, Devanagari symbol,
  63 etc), including any modifiers (diacritics, etc).
  64
  65 The Unicode Standard defines a I<grapheme cluster> (commonly simplified to
  66 just I<grapheme>) as one or more characters forming a visible whole when
  67 displayed, in other words, a bundle of a character and all of its combining
  68 characters.  Because graphemes are the highest-level abstract idea of a
  69 "character", they're useful for converting between character sets.
  70
  71 =head3 Normalization Form
  72
  73 A normalization form standardizes the representation of a string by
  74 transforming a sequence of combining characters into a more complex character
  75 (composition), or by transforming a complex character into a sequence of
  76 combining characters (decomposition). The decomposition forms also define a
  77 standard order for the combining characters, to allow string comparisons. The
  78 Unicode Standard defines four normalization forms: NFC and NFKC are
  79 composition, NFD and NFKD are decomposition. See L<Unicode Normalization
  80 Forms|http://www.unicode.org/reports/tr15/> for more details.
  81
  82 =head3 Grapheme Normalization Form
  83
  84 Grapheme normalization form (NFG) is a normalization which allocates exactly
  85 one codepoint to each grapheme.
  86
  87 =head2 Description
  88
  89 =over 3
  90
  91 =item *
  92
  93 Parrot supports multiple string formats, and so users of Parrot strings must
  94 be aware at all times of string encoding issues and how these relate to the
  95 string interface.
  96
  97 =item *
  98
  99 Parrot provides an interface for interacting with strings and converting
 100 between character sets and encodings.
 101
 102 =item *
 103
 104 Operations that require understanding the semantics of a string must respect
 105 the character set of the string.
 106
 107 =item *
 108
 109 Operations that require understanding the layout of the string must respect
 110 the encoding of the string.
 111
 112 =item *
 113
 114 In addition to common string formats, Parrot provides an additional string
 115 format that is a sequence of 32-bit Unicode codepoints in NFG.
 116
 117 =back
 118
 119 =head2 Implementation
 120
 121 Parrot was designed from the outset to support multiple string formats:
 122 multiple character sets and multiple encodings. We don't standardize on
 123 Unicode internally, converting all strings to Unicode strings, because for the
 124 majority of use cases it's still far more efficient to deal with whatever
 125 input data the user sends us.
 126
 127 Consumers of Parrot strings need to be aware that there is a plurality of
 128 string encodings inside Parrot. (Producers of Parrot strings can do whatever
 129 is most efficient for them.) To put it in simple terms: if you find yourself
 130 writing C<*s++> or any other C string idioms, you need to stop and think if
 131 that's what you really mean. Not everything is byte-based anymore.
 132
 133 =head3 Grapheme Normalization Form
 134
 135 Unicode characters can be expressed in a number of different ways according to
 136 the Unicode Standard. This is partly to do with maintaining compatibility with
 137 existing character encodings. For instance, in Serbo-Croatian and Slovenian,
 138 there's a letter which looks like an C<i> without the dot but with two grave
 139 (C<`>) accents (E<0x209>). Unicode can represent this letter as a composed
 140 character C<0x209>, also known as C<LATIN SMALL LETTER I WITH DOUBLE GRAVE>,
 141 which does the job all in one go. It can also represent this letter as a
 142 decomposed sequence: C<LATIN SMALL LETTER I> (C<0x69>) followed by C<COMBINING
 143 DOUBLE GRAVE ACCENT> (C<0x30F>). We use the term I<grapheme> to refer to a
 144 "letter" whether it's represented by a single codepoint or multiple
 145 codepoints.
 146
 147 String operations on this kind of variable-byte encoding can be complex and
 148 expensive. Operations like comparison and traversal require a series of
 149 computations and lookaheads, because any given grapheme may be a sequence of
 150 combining characters. The Unicode Standard defines several "normalization
 151 forms" that help with this problem. Normalization Form C (NFC), for example,
 152 decomposes everything, then re-composes as much as possible. So if you see the
 153 integer stream C<0x69 0x30F>, it needs to be replaced by C<0x209>. However,
 154 Unicode's normalization forms don't go quite far enough to completely solve
 155 the problem. For example, Serbo-Croat is sometimes also written with Cyrillic
 156 letters rather than Latin letters. Unicode doesn't have a single composed
 157 character for the Cyrillic equivalent of the Serbo-Croat C<LATIN SMALL LETTER
 158 I WITH DOUBLE GRAVE>, so it is represented as a decomposed pair C<CYRILLIC
 159 SMALL LETTER I> (C<0x438>) with C<COMBINING DOUBLE GRAVE ACCENT> (C<0x30F>).
 160 This means that even in the most normalized Unicode form, string manipulation
 161 code must always assume a variable-byte encoding, and use expensive
 162 lookaheads. The cost is incurred on every operation, though the particular
 163 string operated on might not contain combining characters. It's particularly
 164 noticeable in parsing and regular expression matches, where backtracking
 165 operations may re-traverse the characters of a simple string hundreds of
 166 times.
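
The limits of NFC described above can be observed with Python's standard
C<unicodedata> module. This is only a sketch for illustration; it relies on
the Unicode data bundled with the Python interpreter and is not part of
Parrot.

```python
import unicodedata

# LATIN SMALL LETTER I + COMBINING DOUBLE GRAVE ACCENT
latin = "\u0069\u030f"
# CYRILLIC SMALL LETTER I + COMBINING DOUBLE GRAVE ACCENT
cyrillic = "\u0438\u030f"

latin_nfc = unicodedata.normalize("NFC", latin)
cyrillic_nfc = unicodedata.normalize("NFC", cyrillic)

# Show the codepoints that remain after composition.
print([hex(ord(ch)) for ch in latin_nfc])
print([hex(ord(ch)) for ch in cyrillic_nfc])
```

The Latin pair composes to the single codepoint C<0x209>, while the Cyrillic
pair is still two codepoints after normalization, so code working on NFC data
must still look ahead for combining characters.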
 167
 168 In order to reduce the cost of variable-byte operations and simplify some
 169 string manipulation tasks, Parrot defines an additional normalization:
 170 Normalization Form G (NFG). In NFG, every grapheme is guaranteed to be
 171 represented by a single codepoint. Graphemes that don't have a single
 172 codepoint representation in Unicode are given a dynamically generated
 173 codepoint unique to the NFG string.
 174
 175 An NFG string is a sequence of signed 32-bit Unicode codepoints. It's
 176 equivalent to UCS-4 except for the normalization form semantics. UCS-4
 177 specifies an encoding for Unicode codepoints from 0 to 0x7FFFFFFF. In other
 178 words, any codepoints with the first bit set are undefined. NFG interprets the
 179 unused bit as a sign bit, and reserves all negative codepoints as dynamic
 180 codepoints. A negative codepoint acts as an index into a lookup table, which
 181 maps between a dynamic codepoint and its associated decomposition.
 182
 183 In practice, this goes as follows: When our Russified Serbo-Croat string is
 184 converted to NFG, it is normalized to a single character having the codepoint
 185 C<0xFFFFFFFF> (in other words, -1 in 2's complement). At the same time,
 186 Parrot inserts an entry into the string's grapheme table at array index -1,
 187 containing the Unicode decomposition of the grapheme C<0x00000438
 188 0x0000030F>.
 189
 190 Parrot will provide both grapheme-aware and codepoint-aware string operations,
 191 such as iterators for string traversal and calculations of string length.
 192 Individual language implementations can choose between the two types of
 193 operations depending on whether their string semantics are character-based or
 194 codepoint-based. For languages that don't currently have Unicode support, the
 195 grapheme operations will allow them to safely manipulate Unicode data without
 196 changing their string semantics.
 197
 198 =head4 Advantages
 199
 200 Applications that don't care about graphemes can handle an NFG codepoint in a
 201 string as if it's any other character. Only applications that care about the
 202 specific properties of Unicode characters need to take the load of peeking
 203 inside the grapheme table and reading the decomposition.
 204
 205 Using negative numbers for dynamic codepoints allows Parrot to check if a
 206 particular codepoint is dynamic using a single sign-comparison operation. It
 207 also means that NFG can be used without conflict on encodings from 7-bit
 208 (signed 8-bit integers) to 63-bit (using signed 64-bit integers) and beyond.
 209
 210 Because any grapheme from any character set can be represented by a single NFG
 211 codepoint, NFG strings are useful as an intermediate representation for
 212 converting between string types.
 213
 214 =head4 Disadvantages
 215
 216 A 32-bit encoding is quite large, considering the fact that the Unicode
 217 codespace only requires up to C<0x10FFFF>. The Unicode Consortium's FAQ notes
 218 that most Unicode interfaces use UTF-16 instead of UTF-32, out of memory
 219 considerations. This means that although Parrot will use 32-bit NFG strings
 220 for optimizations within operations, for the most part individual users should
 221 use the native character set and encoding of their data, rather than using NFG
 222 strings directly.
 223
 224 The conceptual cost of adding a normalization form beyond those defined in the
 225 Unicode Standard has to be considered. However, to fully support Unicode,
 226 Parrot already needs to keep track of what normalization form a given string
 227 is in, and provide functions to convert between normalization forms. The
 228 conceptual cost of one additional normalization form is relatively small.
 229
 230 =head4 The grapheme table
 231
 232 When constructing strings in NFG, graphemes not expressible as a single
 233 character in Unicode are represented by a dynamic codepoint index into the
 234 string's grapheme table. When Parrot comes across a multi-codepoint grapheme,
 235 it must first determine whether or not the grapheme already has an entry in
 236 the grapheme table. Therefore the table cannot strictly be an array, as that
 237 would make lookup inefficient. The grapheme table is represented, then, as
 238 both an array and a hash structure. The array interface provides
 239 forward-lookup and the hash interface reverse lookup. Converting a
 240 multi-codepoint grapheme into a dynamic codepoint can be demonstrated with the
 241 following Perl 5 pseudocode, for the grapheme C<0x438 0x30F>:
 242
 243    $codepoint = ($grapheme_lookup->{0x438}{0x30F} ||= do {
 244                    push @grapheme_table, "\x{438}\x{30F}";
 245                    -1 - $#grapheme_table;  # index 0 yields -1, 1 yields -2
 246                 });
 247    push @string, $codepoint;
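
The same idea can be sketched end to end in Python. Everything here is
hypothetical and only illustrates the design: the C<GraphemeTable> class, its
method names, and C<to_nfg> are not Parrot APIs. The sketch also makes a
simplifying assumption about what a grapheme is, treating a base character
followed by characters with a non-zero canonical combining class as one
grapheme, which is narrower than the Unicode grapheme cluster rules.

```python
import unicodedata


class GraphemeTable:
    """Per-string table: a list for forward lookup, a dict for reverse lookup."""

    def __init__(self):
        self.entries = []    # table position -> tuple of codepoints
        self.index_of = {}   # tuple of codepoints -> dynamic codepoint

    def dynamic_codepoint(self, cluster):
        """Return the existing dynamic codepoint for cluster, or allocate one."""
        cp = self.index_of.get(cluster)
        if cp is None:
            self.entries.append(cluster)
            cp = -len(self.entries)   # first entry -> -1, second -> -2, ...
            self.index_of[cluster] = cp
        return cp

    def decompose(self, cp):
        """Map a negative (dynamic) codepoint back to its decomposition."""
        return self.entries[-cp - 1]


def to_nfg(text, table):
    """Return a list of signed integers, one per grapheme (simplified)."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1].append(ord(ch))   # attach mark to the previous base
        else:
            clusters.append([ord(ch)])     # start a new grapheme
    return [c[0] if len(c) == 1 else table.dynamic_codepoint(tuple(c))
            for c in clusters]


table = GraphemeTable()
nfg = to_nfg("a\u0438\u030fb\u0438\u030f", table)
print(nfg, [table.decompose(cp) for cp in nfg if cp < 0])
```

A single sign test (C<cp E<lt> 0>) is enough to tell dynamic codepoints from
ordinary ones, and the repeated grapheme reuses its table entry instead of
allocating a second one.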
 248
 249 =head3 String API
 250
 251 Strings in the Parrot core should use the Parrot C<STRING> structure. Parrot
 252 developers generally shouldn't deal with C<char *> or other string-like types
 253 outside of this abstraction. It's also best not to access members of the
 254 C<STRING> structure directly. The interpretation of the data inside the
 255 structure is determined by the data's encoding. Parrot's strings are
 256 encoding-aware so your functions don't need to be.
 257
 258 Parrot's internal strings (C<STRING>s) have the following structure:
 259
 260     struct parrot_string_t {
 261         Parrot_UInt             flags;
 262         void                   *_bufstart;
 263         size_t                  _buflen;
 264         char                   *strstart;
 265         UINTVAL                 bufused;
 266         UINTVAL                 strlen;
 267         UINTVAL                 hashval;
 268         const struct _encoding *encoding;
 269         const struct _charset  *charset;
 270     };
 271
 272 The fields are:
 273
 274 =over 4
 275
 276 =item _bufstart
 277
 278 A pointer to the buffer for the string data.
 279
 280 =item _buflen
 281
 282 The size of the buffer in bytes.
 283
 284 =item flags
 285
 286 Binary flags used for garbage collection, copy-on-write tracking, and other
 287 metadata.
 288
 289 =item bufused
 290
 291 The amount of the buffer currently in use, in bytes.
 292
 293 =item strlen
 294
 295 The length of the string, in bytes. {{NOTE, not in characters, as characters
 296 may be variably sized.}}
 297
 298 =item hashval
 299
 300 A cache of the hash value of the string, for rapid lookups when the string is
 301 used as a hash key.
 302
 303 =item encoding
 304
 305 How the data is encoded (e.g. fixed 8-bit characters, UTF-8, or UTF-32).  Note
 306 that this specifies encoding only -- it's valid to encode EBCDIC characters
 307 with the UTF-8 algorithm. Silly, but valid.
 308
 309 The encoding structure specifies the encoding (by index number and by name,
 310 for ease of lookup), the maximum number of bytes that a single character will
 311 occupy in that encoding, as well as functions for manipulating strings with
 312 that encoding.
 313
 314 =item charset
 315
 316 What sort of string data is in the buffer, for example ASCII, EBCDIC, or
 317 Unicode.
 318
 319 The charset structure specifies the character set (by index number and by
 320 name) and provides functions for transcoding to and from that character set.
 321
 322 =back
 323
 324 {{DEPRECATION NOTE: the enum C<parrot_string_representation_t> will be removed
 325 from the parrot string structure. It's been commented out for years.}}
 326
 327 {{DEPRECATION NOTE: the C<char *> pointer C<strstart> will be removed. It
 328 complicates the entire string subsystem for a tiny optimization on substring
 329 operations, and offset math is messy with encodings that aren't byte-based.}}
 330
 331 =head4 Conversions between normalization form, encoding, and charset
 332
 333 Conversion will be done with a function called C<Parrot_str_grapheme_copy>:
 334
 335     INTVAL Parrot_str_grapheme_copy(STRING *src, STRING *dst)
 336
 337 Converting a string from one format to another involves creating a new empty
 338 string with the required attributes, and passing the source string and the new
 339 string to C<Parrot_str_grapheme_copy>. This function iterates through the
 340 source string one grapheme at a time, using the character set function pointer
 341 C<get_grapheme> (which may read ahead multiple characters with strings that
 342 aren't in NFG). For each source grapheme, the function will call
 343 C<set_grapheme> on the destination string (which may append multiple
 344 characters in non-NFG strings). This conversion effectively uses an
 345 intermediate NFG representation.
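
The conversion loop can be pictured with the following Python sketch. It is
a toy model under stated assumptions, not the Parrot implementation:
C<ToyString>, its C<graphemes> iterator (standing in for C<get_grapheme>),
and its C<set_grapheme> method are all hypothetical, graphemes are
approximated as a base character plus trailing combining marks, and the
integer returned by C<grapheme_copy> is assumed to be the number of graphemes
copied, which this document does not specify.

```python
import unicodedata


class ToyString:
    """Stand-in for STRING: a byte buffer plus the name of its encoding."""

    def __init__(self, encoding, data=b""):
        self.encoding = encoding
        self.buf = bytearray(data)

    def graphemes(self):
        """Yield each grapheme, reading ahead over combining characters."""
        cluster = ""
        for ch in self.buf.decode(self.encoding):
            if cluster and not unicodedata.combining(ch):
                yield cluster
                cluster = ""
            cluster += ch
        if cluster:
            yield cluster

    def set_grapheme(self, grapheme):
        """Append one grapheme, encoded in this string's own encoding."""
        self.buf += grapheme.encode(self.encoding)


def grapheme_copy(src, dst):
    """Copy src into dst one grapheme at a time; return the grapheme count."""
    count = 0
    for grapheme in src.graphemes():
        dst.set_grapheme(grapheme)
        count += 1
    return count


src = ToyString("utf-8", "\u0438\u030fa".encode("utf-8"))
dst = ToyString("utf-16-le")
print(grapheme_copy(src, dst), dst.buf.decode("utf-16-le") == "\u0438\u030fa")
```

The destination's attributes alone decide the output format, which is why
the caller creates the empty destination string first.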
 346
 347
 348 =head3 String Interface Functions
 349
 350 The current string functions will be maintained, with some modifications for
 351 the addition of the NFG string format. Many string functions that are part of
 352 Parrot's external API will be renamed for the standard "Parrot_*" naming
 353 conventions.
 354
 355 =head4 Parrot_str_set (was string_set)
 356
 357 Set one string to a copy of the value of another string.
 358
 359 =head4 Parrot_str_new_COW (was Parrot_make_COW_reference)
 360
 361 Create a new copy-on-write string. Create a new string header, clone the
 362 struct members of the original string, and point to the same string buffer as
 363 the original string.
 364
 365 =head4 Parrot_str_reuse_COW (was Parrot_reuse_COW_reference)
 366
 367 Create a new copy-on-write string. Clone the struct members of the original
 368 string into a passed in string header, and point the reused string header to
 369 the same string buffer as the original string.
 370
 371 =head4 Parrot_str_write_COW (was Parrot_unmake_COW)
 372
 373 If the specified Parrot string is copy-on-write, copy the string's contents
 374 to a new string buffer and clear the copy-on-write flag.
 375
 376 =head4 Parrot_str_concat (was string_concat)
 377
 378 Concatenate two strings. Takes three arguments: two strings, and one integer
 379 value of flags. If both string arguments are null, return a new string created
 380 according to the integer flags.
 381
 382 =head4 Parrot_str_append (was string_append)
 383
 384 Append one string to another and return the result. In the default case, the
 385 return value is the same as the first string argument (modifying that argument
 386 in place). If the first argument is COW or read-only, then the return value is
 387 a new string.
 388
 389 =head4 Parrot_str_new (was string_from_cstring)
 390
 391 Return a new string with the default encoding and character set. Accepts two
 392 arguments, a C string (C<char *>) to initialize the value of the string, and
 393 an integer length of the string (number of characters). If the integer length
 394 isn't passed, the function will calculate the length.
 395
 396 {{NOTE: the integer length isn't really necessary, and is under consideration
 397 for deprecation.}}
 398
 399 =head4 Parrot_str_new_noinit (was string_make_empty)
 400
 401 Returns a new empty string with the default encoding and character set.
 402
 403 =head4 Parrot_str_new_init (was string_make_direct)
 404
 405 Returns a new string of the requested encoding, character set, and
 406 normalization form, initializing the string value to the value passed in.  The
 407 five arguments are a C string (C<char *>), an integer length of the string
 408 argument in bytes, and struct pointers for encoding, character set, and
 409 normalization form structs. If the C string (C<char *>) value is not passed,
 410 returns an empty string. If the encoding, character set, or normalization form
 411 are passed as null values, default values are used.
 412
 413 {{ NOTE: the crippled version of this function, C<string_make>, used to accept
 414 a string name for the character set. This behavior is no longer supported, but
 415 C<Parrot_find_encoding> and C<Parrot_find_charset> can look up the encoding or
 416 character set structs. }}
 417
 418 =head4 Parrot_str_new_constant (was const_string)
 419
 420 Creates and returns a new Parrot constant string. Takes one C string (a C<char
 421 *>) as an argument, the value of the constant string. The length of the C
 422 string is calculated internally.
 423
 424 =head4 Parrot_str_resize (was string_grow)
 425
 426 Resize the string buffer of the given string, adding the number of bytes passed
 427 in the integer argument. If the argument is negative, remove the given number
 428 of bytes. Throws an exception if shrinking the string buffer size will
 429 truncate the string (if C<strlen> will be longer than C<_buflen>).
 430
 431 =head4 Parrot_str_length (was string_compute_strlen)
 432
 433 Returns the number of characters in the string. Combining characters are each
 434 counted separately. Variable-width encodings may look ahead.
 435
 436 =head4 Parrot_str_grapheme_length
 437
 438 Returns the number of graphemes in the string. Groups of combining characters
 439 count as a single grapheme.
 440
 441 =head4 Parrot_str_byte_length (was string_length)
 442
 443 Returns the number of bytes in the string. The character width of
 444 variable-width encodings is ignored. Combining characters are not treated any
 445 differently than other characters. This is equivalent to accessing the
 446 C<strlen> member of the C<STRING> struct directly.
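
The three notions of length can disagree for the same data. The Python
sketch below uses hypothetical helper names that mirror the functions above
and approximates a grapheme as a base character plus its trailing combining
marks; it illustrates the distinction only and is not how Parrot computes
these values.

```python
import unicodedata


def byte_length(text, encoding):
    """Bytes occupied by the text in the given encoding."""
    return len(text.encode(encoding))


def char_length(text):
    """Codepoints in the text; combining characters count separately."""
    return len(text)


def grapheme_length(text):
    """Graphemes in the text; combining marks join the preceding character."""
    return sum(1 for i, ch in enumerate(text)
               if i == 0 or not unicodedata.combining(ch))


# CYRILLIC SMALL LETTER I followed by COMBINING DOUBLE GRAVE ACCENT
s = "\u0438\u030f"
print(byte_length(s, "utf-8"), char_length(s), grapheme_length(s))
```

For this string the UTF-8 byte length is 4, the character length is 2, and
the grapheme length is 1.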
 447
 448 =head4 Parrot_str_indexed (was string_index)
 449
 450 Returns the character at the specified index (the Nth character from the start
 451 of the string). Combining characters are counted separately. Variable-width
 452 encodings will look ahead to capture full character values.
 453
 454 =head4 Parrot_str_grapheme_indexed
 455
 456 Returns the grapheme at the given index (the Nth grapheme from the string's
 457 start). Groups of combining characters count as a single grapheme, so this
 458 function may return multiple characters.
 459
 460 =head4 Parrot_str_find_index (was string_str_index)
 461
 462 Search for a given substring within a string. If it's found, return an integer
 463 index to the substring's location (the Nth character from the start of the
 464 string). Combining characters are counted separately. Variable-width encodings
 465 will look ahead to capture full character values. Returns -1 if the
 466 substring is not found.
 467
 468 =head4 Parrot_str_copy (was string_copy)
 469
 470 Make a COW copy of a string (a new string header pointing to the same string
 471 buffer).
 472
 473 =head4 Parrot_str_grapheme_copy (new)
 474
 475 Accepts two string arguments: a source and a destination. Iterates through the
 476 source string one grapheme at a time, appending each to the destination string.
 477
 478 This function can be used to convert a string of one format to another format.
 479
 480 =head4 Parrot_str_repeat (was string_repeat)
 481
 482 Return a string containing the passed string argument, repeated the number of
 483 times in the integer argument.
 484
 485 =head4 Parrot_str_substr (was string_substr)
 486
 487 Return a substring starting at an integer offset with an integer length. The
 488 offset and length specify characters. Combining characters are counted
 489 separately. Variable-width encodings will look ahead to capture full character
 490 values.
 491
 492 =head4 Parrot_str_grapheme_substr
 493
 494 Return a substring starting at an integer offset with an integer length. The
 495 offset and length specify graphemes. Groups of combining characters count as a
 496 single grapheme.
 497
 498 =head4 Parrot_str_replace (was string_replace)
 499
 500 Replaces a substring within the first string argument with the second string
 501 argument. An integer offset and length, in characters, specify where the
 502 removed substring starts and how long it is.
 503
 504 =head4 Parrot_str_grapheme_replace
 505
 506 Replaces a substring within the first string argument with the second string
 507 argument. An integer offset and length in graphemes specify where the removed
 508 substring starts and how long it is.
 509
 510 =head4 Parrot_str_chopn (was string_chopn)
 511
 512 Chop the requested number of characters off the end of a string without
 513 modifying the original string.
 514
 515 =head4 Parrot_str_chopn_inplace (was string_chopn_inplace)
 516
 517 Chop the requested number of characters off the end of a string, modifying the
 518 original string.
 519
 520 =head4 Parrot_str_grapheme_chopn
 521
 522 Chop the requested number of graphemes off the end of a string without
 523 modifying the original string.
 524
 525 =head4 Parrot_str_compare (was string_compare)
 526
 527 Compare two strings to each other. Return 0 if they are equal, 1 if the first
 528 is greater and -1 if the second is greater. Uses character set collation order
 529 for the comparison. (Two strings that are logically equivalent in terms of
 530 display, but stored in different normalizations are not equal.)
 531
 532 =head4 Parrot_str_grapheme_compare
 533
 534 Compare two strings to each other. Return 0 if they are equal, 1 if the first
 535 is greater and -1 if the second is greater. Uses NFG normalization to compare
 536 the two strings.
 537
 538 =head4 Parrot_str_equal
 539
 540 Compare two strings, return 1 if they are equal, 0 if they are not equal.
 541
 542 =head4 Parrot_str_not_equal (was string_equal)
 543
 544 Compare two strings, return 0 if they are equal, 1 if they are not equal.
 545
 546 {{DEPRECATION NOTE: The return value of 'Parrot_str_equal' is reversed from
 547 the old logic, but 'Parrot_str_not_equal' is provided as a drop-in
 548 replacement for the old function.}}
 549
 550 =head4 Parrot_str_grapheme_equal
 551
 552 Compare two strings using NFG normalization, return 1 if they are equal, 0 if
 553 they are not equal.
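
The difference between the plain and grapheme-aware equality tests can be
sketched in Python. The function names below are hypothetical mirrors of the
Parrot functions, and NFC from the standard C<unicodedata> module is used as
a stand-in for NFG. That is an assumption of this example: it works because
both forms give canonically equivalent strings the same representation.

```python
import unicodedata


def str_equal(a, b):
    """Codepoint-by-codepoint equality: 1 if equal, 0 otherwise."""
    return 1 if a == b else 0


def grapheme_equal(a, b):
    """Equality after normalizing both strings: 1 if equal, 0 otherwise."""
    left = unicodedata.normalize("NFC", a)
    right = unicodedata.normalize("NFC", b)
    return 1 if left == right else 0


composed = "\u0209"         # LATIN SMALL LETTER I WITH DOUBLE GRAVE
decomposed = "\u0069\u030f"  # LATIN SMALL LETTER I + COMBINING DOUBLE GRAVE ACCENT
print(str_equal(composed, decomposed), grapheme_equal(composed, decomposed))
```

The two strings display identically but are stored differently, so only the
normalizing comparison reports them as equal.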
 554
 555 =head3 Internal String Functions
 556
 557 The following functions are used internally and are not part of the public
 558 interface.
 559
 560 =head4 Parrot_str_init (was string_init)
 561
 562 Initialize Parrot's string subsystem, including string allocation and garbage
 563 collection.
 564
 565 =head4 Parrot_str_finish (was string_deinit)
 566
 567 Terminate and clean up Parrot's string subsystem, including string allocation
 568 and garbage collection.
 569
 570 =head4 string_max_bytes
 571
 572 Calculate the number of bytes needed to hold a given number of characters in a
 573 particular encoding, multiplying the maximum possible width of a character in
 574 the encoding by the number of characters requested.
 575
 576 {{NOTE: pretty primitive and not very useful. May be deprecated.}}
 577
 578 =head3 Deprecated String Functions
 579
 580 The following string functions are slated to be deprecated.
 581
 582 =head4 string_primary_encoding_for_representation
 583
 584 Not useful; it only ever returned ASCII.
 585
 586 =head4 string_rep_compatible
 587
 588 Only useful on a very narrow set of string encodings/character sets.
 589
 590 =head4 string_make
 591
 592 A crippled version of a string initializer, now replaced with the full version
 593 C<Parrot_str_new_init>.
 594
 595 =head4 string_capacity
 596
 597 This was used to calculate the size of the buffer after the C<strstart>
 598 pointer. Deprecated with C<strstart>.
 599
 600 =head4 string_ord
 601
 602 Replaced by C<Parrot_str_indexed>.
 603
 604 =head4 string_chr
 605
 606 This is handled just fine by C<Parrot_str_new>; we don't need a special
 607 version for a single character.
 608
 609 =head4 make_writable
 610
 611 An archaic function that uses a method of describing strings that hasn't been
 612 allowed for years.
 613
 614 =head4 string_to_cstring_nullable
 615
 616 Just the implementation of string_to_cstring, no need for a separate function
 617 that specially allows returning a NULL string.
 618
 619 =head4 string_increment
 620
 621 Old Perl 5-style behavior where "aa" goes to "ab". Only useful for ASCII
 622 strings, and not terribly useful even there.
 623
 624 =head4 Parrot_string_cstring
 625
 626 Unsafe, and behavior handled by Parrot_str_to_cstring.
 627
 628
 629 =head4 Parrot_str_split
 630
 631 Splits the string C<str> at the delimiter C<delim>.
 632
 633 =head4 Parrot_str_free (was string_free)
 634
 635 Unsafe and not useful; let the garbage collector take care of it.
 636
 637 =head3 String PMC API
 638
 639 The String PMC provides a high-level object interface to the string
 640 functionality. It contains a standard Parrot string, holding the string data.
 641
 642 =head4 Vtable Functions
 643
 644 The String PMC implements the following vtable functions.
 645
 646 =over 4
 647
 648 =item init
 649
 650 Initialize a new String PMC.
 651
 652 =item clone
 653
 654 Clone a String PMC.
 655
 656 =item mark
 657
 658 Mark the string value of the String PMC as live.
 659
 660
 661 =item get_integer
 662
 663 Return the integer representation of the string.
 664
 665 =item get_number
 666
 667 Return the floating-point representation of the string.
 668
 669 =item get_string
 670
 671 Return the string value of the String PMC.
 672
 673 =item get_bool
 674
 675 Return the boolean value of the string.
 676
 677 =item set_integer_native
 678
 679 Set the string to an integer value, transforming the integer to its string
 680 equivalent.
 681
 682 =item set_bool
 683
 684 Set the string to a boolean (integer) value, transforming the boolean to its
 685 string equivalent.
 686
 687 =item set_number_native
 688
 689 Set the string to a floating-point value by transforming the number to its
 690 string equivalent.
 691
 692 =item set_string_native
 693
 694 Set the String PMC's stored string value to be the string argument. If the
 695 passed in string is a constant, store a copy.
 696
 697 =item assign_string_native
 698
 699 Set the String PMC's stored string value to a copy of the string argument.
 700
 701 =item set_string_same
 702
 703 Set the String PMC's stored string value to the same as another String PMC's
 704 stored string value. {{NOTE: uses direct access into the storage of the two
 705 PMCs, very ugly.}}
 706
 707 =item set_pmc
 708
 709 Set the String PMC's stored string value to the same as another PMC's string
 710 value, as returned by that PMC's C<get_string> vtable function.
 711
 712 =item *bitwise*
 713
 714 All the bitwise string vtable functions, for AND, OR, XOR, and NOT, both
 715 inplace and standard return.
 716
 717 =item is_equal
 718
 719 Compares the string values of two PMCs and returns true if they match exactly.
 720
 721 =item is_equal_num
 722
 723 Compares the numeric values of two PMCs (first transforming any strings to
 724 numbers) and returns true if they match exactly.
 725
 726 =item is_equal_string
 727
 728 Compares the string values of two PMCs and returns true if they match exactly.
 729 {{ NOTE: the documentation for the PMC says that it returns FALSE if they
 730 match.  This is not the desired behavior. }}
 731
 732 =item is_same
 733
 734 Compares two PMCs and returns true if they are the same PMC class and contain
 735 the same string (not an equivalent string value, but aliases to the same
 736 low-level string).
 737
 738 =item cmp
 739
 740 Compares two PMCs and returns 1 if SELF is shorter, 0 if they are equal length
 741 strings, and -1 if the passed in string argument is shorter.
 742
 743 =item cmp_num
 744
 745 Compares the numeric values of two PMCs (first changing those values to
 746 numbers) and returns 1 if SELF is smaller, 0 if they are equal, and -1 if the
 747 passed in string argument is smaller.
 748
 749 =item cmp_string
 750
 751 Compares two PMCs and returns 1 if SELF is shorter, 0 if they are equal length
 752 strings, and -1 if the passed in string argument is shorter.
 753
 754 =item substr
 755
 756 Extract a substring of a given length starting from a given offset (in
 757 graphemes) and store the result in the string argument.
 758
 759 =item substr_str
 760
 761 Extract a substring of a given length starting from a given offset (in
 762 graphemes) and return the string.
 763
 764 =item exists_keyed
 765
 766 Return true if the Nth grapheme in the string exists. Negative numbers count
 767 from the end.
 768
 769 =item get_string_keyed
 770
 771 Return the Nth grapheme in the string. Negative numbers count from the end.
 772
 773 =item set_string_keyed
 774
 775 Insert a string at the Nth grapheme position in the string. {{NOTE: this is
 776 different than the current implementation.}}
 777
 778 =item get_integer_keyed
 779
 780 Returns the integer value of the Nth C<char> in the string. {{DEPRECATE}}
 781
 782 =item set_integer_keyed
 783
 784 Replace the C<char> at the Nth character position in the string with the
 785 C<char> that corresponds to the passed integer value key. {{DEPRECATE}}
 786
 787 =back
 788
 789 =head4 Methods
 790
 791 The String PMC provides the following methods.
 792
 793 =over 4
 794
 795 =item replace
 796
 797 Replace every occurrence of one string with another.
 798
 799 =item to_int
 800
 801 Return the integer equivalent of a string.
 802
 803 =item lower
 804
 805 Change the string to all lowercase.
 806
 807 =item trans
 808
 809 Translate an ASCII string with entries from a translation table.
 810
 811 {{NOTE: likely to be deprecated.}}
 812
 813 =item reverse
 814
 815 Reverse a string, one grapheme at a time. {{ NOTE: Currently only works for
 816 ASCII strings, because it reverses one C<char> at a time. }}
 817
 818
 819 =item is_integer
 820
 821 Checks if the string is just an integer. {{ NOTE: Currently only works for
 822 ASCII strings, fix or deprecate. }}
 823
 824 =back
 825
 826
 827 =head2 References
 828
 829 http://sirviente.9grid.es/sources/plan9/sys/doc/utf.ps - Plan 9's Runes are
 830 not dissimilar to NFG strings, and this is a good introduction to the Unicode
 831 world.
 832
 833 http://www.unicode.org/reports/tr15/ - The Unicode Consortium's
 834 explanation of different normalization forms.
 835
 836 http://unicode.org/reports/tr29/ - "grapheme clusters" in the Unicode Standard
 837 Annex
 838
 839 "Unicode: A Primer", Tony Graham - Arguably the most readable book on
 840 how Unicode works.
 841
 842 "Advanced Perl Programming", Chapter 6, "Unicode"
 843
 844 =cut
 845
 846 __END__
 847 Local Variables:
 848   fill-column:78
 849 End: