# Copyright (C) 2001-2005, The Perl Foundation.
# $Id$

=head1 NAME

docs/strings.pod - Parrot Strings

=head1 ABSTRACT

This document describes how Parrot abstracts the programmer's interface to
string types.

=head1 OVERVIEW

For various reasons, some of which relate to the sequence-of-integer
abstraction, and some of which relate to "infinite" strings and arrays, Parrot
Strings are represented by a list of chunks, where each chunk is a sequence of
integers of the same size or representation, but different chunks can have
different integer sizes or representations.  The Parrot String API hides this
from any module that wishes to work at the abstract string level.  In
particular, it must hide this from the regex engine, which works on pure
sequences in the abstract.

So Parrot Strings are a wizzy internationalized equivalent of the old standard
C library's string.h functions.

=head1 The Parrot String API

All strings used in the Parrot core should use the Parrot C<STRING> structure;
Parrot programmers should not deal with C<char *> or other string-like types
outside of this abstraction without very good reason.

=head1 Interface functions on C<STRING>s

In fact, programmers should hardly ever even access members of the C<STRING>
structure directly. The reason for this is that the interpretation of the data
inside the structure will be a function of the data's encoding. The idea is
that Parrot's strings are encoding-aware so your functions don't need to be; if
you break the abstraction, you suddenly have to start worrying about what the
data actually means.

=head2 String Constructors

The most basic way of creating a string is through the function
C<string_make_direct>:

    STRING* string_make_direct(Interp*     interpreter,
                               const void* buffer,
                               UINTVAL     buflen,
                               ENCODING*   encoding,
                               CHARSET*    charset,
                               UINTVAL     flags)

Here you pass a pointer to a buffer in a given encoding, the number of bytes
in that buffer to examine, the encoding, the charset, and the initial value of
the C<flags>, which should usually be zero.  In return, you'll get a brand new
Parrot string. This string will have its own private copy of the buffer, so
you don't need to keep the original buffer.

Additionally, there are several convenience functions that wrap
C<string_make_direct>.  See F<src/string.c> for details.

=over 3

=item *

I<Hint>: Nothing stops you from doing

    string_make_direct(interpreter, NULL, 0, ...

=back

If you already have a string, you can make a copy of it by calling

    STRING* string_copy(Interp *, STRING* s)

This is itself implemented in terms of C<string_make>.

=head2 String Manipulation Functions

Unless otherwise stated, all lengths, offsets, and so on, are given in
characters; you are not allowed to care about the byte representation of a
string, so it doesn't make sense to give the values in bytes.

To find out the length of a string, use

    INTVAL string_length(const STRING *s)

You I<may> explicitly use C<< s->strlen >> for this since it is such a useful
operation.

To concatenate two strings - that is, to add the contents of string C<b> to the
end of string C<a>, use:

    STRING* string_concat(Interp *, STRING* a, STRING *b, INTVAL flag)

C<a> is updated, and is also returned as a convenience. If the flag is set to a
non-zero value, then C<b> will be transcoded to C<a>'s encoding before
concatenation if the strings are of different encodings. You almost certainly
don't want to stick, say, a UTF-32 string on the end of a Big-5 string.

To repeat a string (i.e., turn 'xyz' into 'xyzxyzxyz'), use:

    STRING* string_repeat(Interp *, const STRING* s, UINTVAL n, STRING** d)

This repeats string I<s> I<n> times and stores the result into I<d>, which it
also returns.  If I<d> or I<*d> is NULL, a new string will be allocated to
hold the result.  I<s> is not modified by this operation.  If I<d> is not of
the same type as I<s>, it will be upgraded appropriately.
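
The repeat semantics can be modelled on plain byte buffers. The sketch below
is not Parrot code: C<repeat_bytes> is a hypothetical helper that copies a
source buffer I<n> times into freshly allocated memory and leaves the source
untouched. It ignores encodings and size overflow.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the Parrot API: returns a newly
 * allocated buffer holding n copies of src (len bytes each), or NULL on
 * allocation failure. The caller frees the result; src is not modified.
 * This sketch does not guard against len * n overflowing size_t. */
static char *repeat_bytes(const char *src, size_t len, size_t n, size_t *out_len)
{
    assert(src != NULL && out_len != NULL);

    size_t total = len * n;
    /* Ask for at least one byte so an empty result is still a valid pointer. */
    char *dest = malloc(total ? total : 1);
    if (dest == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++)
        memcpy(dest + i * len, src, len);

    *out_len = total;
    return dest;
}
```

Calling C<repeat_bytes("xyz", 3, 3, &len)> yields the nine bytes
C<xyzxyzxyz>, matching the example in the text.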

Chopping C<n> characters off the end of a string is achieved with the
unlikely-sounding

    STRING* string_chopn(STRING* s, INTVAL n)

To retrieve a substring of the string, call

    STRING* string_substr(Interp*,
                          STRING*    src,
                          INTVAL     offset,
                          INTVAL     length,
                          STRING**   dest)

The result will be placed in C<dest>.  (Passing in C<dest> avoids allocating a
new string at runtime. If C<*dest> is a null pointer, a new string structure is
created with the same encoding as C<src>.)

To retrieve a single character of the string, call

    INTVAL string_ord(const STRING* s, INTVAL n)

The function returns the character's ordinal value. It checks for the existence
of C<s>, and tests for C<n> being out of range. Currently it applies the method
that Perl uses on arrays to handle negative indices. That is to say, negative
values count backwards from the end of the string. For example, index -1 is the
last character in the string, -2 is the next-to-last, and so on.

If C<s> is null or C<s> is zero-length, it throws an exception. If C<n> is out
of range, it also throws an exception.
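
The negative-index rule can be sketched in a few lines. C<normalize_index> and
the C<INTVAL> typedef below are stand-ins invented for this illustration, and
the sketch signals an out-of-range index with -1 where C<string_ord> would
throw an exception.

```c
#include <assert.h>

typedef long INTVAL;   /* stand-in for Parrot's INTVAL in this sketch */

/* Hypothetical helper: maps index n onto the range [0, len) using the
 * Perl-style rule that negative values count back from the end.
 * Returns -1 when n is out of range (where string_ord would throw). */
static INTVAL normalize_index(INTVAL n, INTVAL len)
{
    assert(len >= 0);

    if (n < 0)
        n += len;          /* -1 becomes len - 1, -2 becomes len - 2, ... */

    if (n < 0 || n >= len)
        return -1;         /* out of range; always the case when len is 0 */

    return n;
}
```

For a five-character string, index -1 maps to 4 and index -2 maps to 3, while
5 and -6 are both rejected.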

To compare two strings, use:

    INTVAL string_compare(Interp *, STRING* s1, STRING* s2)

The value returned will be less than, equal to, or greater than zero depending
on whether C<s1> is less than, equal to, or greater than C<s2>.

Strings whose encodings are not the same can be compared - in this case a
UTF-32 copy will be made of each string and these copies will be compared.

To test a string for truth, use:

    INTVAL string_bool(STRING* s);

A string is false if it

 o  is not yet allocated
 o  has zero length
 o  consists of one digit character whose numeric value (as
    decided by its character type) is zero.

Otherwise the string will be true.
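
As a rough model of the three rules above, the sketch below applies them to a
plain ASCII buffer. C<ascii_string_is_true> is a hypothetical name, and
treating C<'0'> as the only zero-valued digit is an assumption that holds for
ASCII only; other character types may have more digits whose value is zero.

```c
#include <assert.h>   /* only needed for checks such as the usage line below */
#include <stddef.h>

/* Hypothetical model of the truth rule for an ASCII buffer of len bytes.
 * A NULL buffer stands in for "not yet allocated". */
static int ascii_string_is_true(const char *buf, size_t len)
{
    if (buf == NULL)                  /* not yet allocated */
        return 0;
    if (len == 0)                     /* zero length */
        return 0;
    if (len == 1 && buf[0] == '0')    /* exactly one digit with value zero */
        return 0;
    return 1;                         /* everything else is true */
}

/* Usage: assert(!ascii_string_is_true("0", 1)); */
```

Note that under these rules the two-character string C<00> is true, because
only a single zero digit is false.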

To format output into a string, use

    STRING* string_nprintf(Interp*,
                           STRING*   dest,
                           INTVAL    len,
                           char*     format,
                           ...)

C<dest> may be a null pointer, in which case a new string will be created. If
C<len> is zero, the behaviour becomes more like C<sprintf> than C<snprintf>.

=head1 Notes for Implementers

=head2 Termination

The character buffer pointed to by C<strstart> is not expected to be terminated
by a NUL byte, and functions which provide the string API will not add one.  Any
functions which access the buffer directly and which require a terminating NUL
byte must place one there themselves and also be very careful about NUL bytes
within the used portion of the character buffer.  In particular, if
C<bufused == buflen>, more space must be allocated to hold a terminating byte.
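
One cautious way to meet both requirements is to copy the used bytes into a
fresh buffer with room for the terminator. The sketch below works on a raw
pointer and byte count rather than a real C<STRING>; C<to_c_string> is a
hypothetical helper, and refusing data with an embedded NUL byte is just one
possible policy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies the first `used` bytes of an unterminated
 * buffer into new memory with one extra byte for the terminator.
 * Returns NULL on allocation failure, or when the data contains an
 * embedded NUL byte that a C string could not represent faithfully.
 * The caller frees the result. */
static char *to_c_string(const void *start, size_t used)
{
    assert(start != NULL || used == 0);

    if (used > 0 && memchr(start, '\0', used) != NULL)
        return NULL;                 /* embedded NUL: refuse to truncate silently */

    char *copy = malloc(used + 1);   /* +1 for the terminating NUL */
    if (copy == NULL)
        return NULL;

    if (used > 0)
        memcpy(copy, start, used);
    copy[used] = '\0';
    return copy;
}
```

Copying sidesteps the C<bufused == buflen> case entirely, at the cost of an
extra allocation.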

=head1 Elements of the C<STRING> structure

Those implementing the C<STRING> API will obviously need to know how the
C<STRING> structure works. You can find the definition of this structure in
F<pobj.h>:

    struct parrot_string_t {
        pobj_t obj;
        UINTVAL bufused;
        void *strstart;
        UINTVAL strlen;
        const ENCODING *encoding;
        const CHARTYPE *type;
        INTVAL language;
    };

Let's look at each element of this structure in turn.

=head2 C<obj.u.b.bufstart>

This pointer points to the buffer which holds the string, encoded in whatever
is the string's specified encoding. Because of this, you should not make any
assumptions about what's in the buffer, and hence you shouldn't try to access
it directly.

=head2 C<obj.u.b.buflen>

This is used for memory allocation; it tells you the currently allocated size
of the buffer in bytes.

=head2 C<obj.flags>

This is a general holding area for string flags. The exact flags required have
not yet been determined.

=head2 C<bufused>

C<bufused>, on the other hand, contains the number of bytes out of the allocated
buffer which are actually in use. This, together with C<buflen>, is used by the
buffer growing algorithm to determine when and by how much to grow the
allocation buffer.
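
This document does not spell out the growing algorithm, so the following is
only an illustration of how the two sizes could drive such a decision. The
function name and the doubling policy are assumptions, not Parrot's actual
behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical growth policy, not Parrot's: given the allocated size
 * (buflen), the bytes in use (bufused), and the extra bytes wanted,
 * return the allocation size to use. Returns buflen unchanged when the
 * data already fits. */
static size_t grown_buflen(size_t buflen, size_t bufused, size_t extra)
{
    assert(bufused <= buflen);

    size_t needed = bufused + extra;
    if (needed <= buflen)
        return buflen;               /* enough slack: no growth needed */

    /* Double the allocation, or jump straight to `needed` if doubling
     * would still be too small. */
    size_t doubled = buflen * 2;
    return doubled > needed ? doubled : needed;
}
```

With a 16-byte buffer holding 10 bytes, asking for 4 more bytes needs no
growth, while asking for 10 more grows the allocation to 32 bytes.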

=head2 C<strstart>

This stores the actual start of the string. In the case of COW (copy-on-write)
strings holding references to portions of a larger string (for example, in
regex match variables), this points to the place inside the larger string's
buffer where the portion begins.
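
The relationship between the buffer start and C<strstart> for a shared
substring can be pictured with a cut-down structure. C<view_t> and
C<make_view> are invented for this sketch; they are not the real C<STRING>
layout, and offsets are in bytes, which only matches characters for a
single-byte encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the fields discussed here; not the real STRING. */
typedef struct {
    char  *bufstart;   /* start of the (possibly shared) allocation */
    char  *strstart;   /* where this string's own data begins */
    size_t bufused;    /* bytes in use, counted from strstart */
} view_t;

/* Hypothetical helper: builds a view of `length` bytes beginning `offset`
 * bytes into src. The view shares src's buffer; no bytes are copied. */
static view_t make_view(const view_t *src, size_t offset, size_t length)
{
    assert(src != NULL);
    assert(offset <= src->bufused && length <= src->bufused - offset);

    view_t v;
    v.bufstart = src->bufstart;            /* same allocation as the source */
    v.strstart = src->strstart + offset;   /* points into the middle of it */
    v.bufused  = length;
    return v;
}
```

A view of C<world> taken from C<hello world> keeps the original buffer start
and moves only C<strstart> forward by six bytes.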

=head2 C<strlen>

This is the length of the string in characters, as you would expect to find
from C<length $string> in Perl. Again, because string buffers may be in one of
a number of encodings, this must be computed by the appropriate encoding.
C<string_compute_strlen(STRING)> updates this value, calling the encoding's
C<characters()> function.
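
To see why the count depends on the encoding, here is how a character count
might be computed for UTF-8. C<utf8_characters> is a hypothetical function,
not the encoding's real C<characters()> implementation, and it assumes
well-formed input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts characters in a well-formed UTF-8 buffer.
 * Every character starts with exactly one byte that is not a
 * continuation byte (bit pattern 10xxxxxx), so counting those bytes
 * counts the characters. */
static size_t utf8_characters(const unsigned char *buf, size_t nbytes)
{
    assert(buf != NULL || nbytes == 0);

    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((buf[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    }
    return count;
}
```

For a single-byte encoding the same function would simply return the byte
count, which is why the work is delegated to the encoding.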

=head2 C<encoding>

This specifies the encoding used to encode the characters in the data. There
are currently four character encodings used in Parrot: singlebyte, UTF-8,
UTF-16 and UTF-32. UTF-16 and UTF-32 should use the native endianness of the
machine.

=head2 C<type>

This specifies the character set for the string. There are currently two
character sets in Parrot: US ASCII and Unicode. Each character set has a
default encoding. The default character set is US ASCII.

=head2 C<language>

This field is currently unused; however, it can be used to hold a pointer to
the correct vtable for foreign strings.

=head1 Non-user-visible String Manipulation Functions

If you've read this far, I hope you're a Parrot implementer. If you're not
helping construct the Parrot core itself, you probably want to look away now.

The first two functions to note are

    INTVAL string_compute_strlen(STRING* s)

and

    INTVAL string_max_bytes(STRING *s, INTVAL iv)

The first updates the contents of C<< s->strlen >> by contemplating the buffer
C<strstart> and working out how many characters it contains. The second is
given a number of characters which we assume are going to be added into the
string at some point; it returns the maximum number of bytes that need to be
allocated to admit that number of characters. For fixed-width encodings, this
is trivial - the singlebyte encoding, for instance, uses one byte per
character, so C<string_max_bytes()> simply returns the C<INTVAL> it is passed;
calling C<string_max_bytes()> on a UTF-8 string, on the other hand, returns
three times the value that it is passed because a UTF-8 character may occupy up
to three bytes.
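
The worst-case arithmetic can be written down as a small table. The enum and
function below are hypothetical and work on an encoding tag rather than a
C<STRING>. The UTF-8 factor of three is taken from the text above; it covers
characters up to U+FFFF, and code points beyond that need four bytes. The
UTF-16 factor assumes a surrogate pair as the worst case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for the four encodings named in this document. */
typedef enum { ENC_SINGLEBYTE, ENC_UTF8, ENC_UTF16, ENC_UTF32 } enc_t;

/* Hypothetical helper: upper bound on the bytes needed to hold nchars
 * characters in the given encoding. */
static size_t max_bytes_for(enc_t enc, size_t nchars)
{
    switch (enc) {
    case ENC_SINGLEBYTE: return nchars;       /* one byte per character */
    case ENC_UTF8:       return nchars * 3;   /* factor quoted in the text */
    case ENC_UTF16:      return nchars * 4;   /* assumption: surrogate pair */
    case ENC_UTF32:      return nchars * 4;   /* fixed four bytes */
    }
    return 0;   /* not reached for valid enc values */
}
```

Ten characters therefore reserve 10 bytes in the singlebyte encoding and 30
bytes in UTF-8 under these assumptions.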

To grow a string to a specified size, use

    void string_grow(Interp *, STRING *s, INTVAL newsize)

The size is given in characters; C<string_max_bytes()> is called to turn this
into a size in bytes, and then the buffer is grown to accommodate (at least)
that many bytes.

=head1 Transcoding

The fact that Parrot strings are encoding-abstracted really has to bottom out
at some point, and it's usually when two strings of different encodings
interact. When we try to append one type of string to another, we have the
option of turning the second string into a string that matches the first
string's encoding. This process, translating a string from one encoding into
another, is called "transcoding".

In Parrot, transcoding is implemented by C<Parrot_CharType_Transcode> functions,
which take two character sets (C<CHARTYPE>) and a character (C<Parrot_UInt>)
and return the character converted from the first to the second character set.
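
The shape of such a per-character function might look like the sketch below,
restricted to the two character sets this document mentions. All names here
are invented for illustration, and replacing unrepresentable characters with
C<?> is an arbitrary choice for the sketch, not documented Parrot behaviour.

```c
#include <assert.h>

typedef unsigned int uchar_t;                       /* stand-in for Parrot_UInt */
typedef enum { CS_ASCII, CS_UNICODE } charset_t;    /* hypothetical tags */

/* Hypothetical per-character transcoder between US ASCII and Unicode.
 * ASCII code points coincide with Unicode code points 0..127, so that
 * direction is the identity. Going the other way, anything above 127 has
 * no ASCII equivalent and is replaced with '?' in this sketch. */
static uchar_t transcode_char(charset_t from, charset_t to, uchar_t c)
{
    if (from == CS_ASCII)
        assert(c < 128);             /* input must be valid ASCII */

    if (from == to || from == CS_ASCII)
        return c;                    /* same set, or ASCII to Unicode */

    /* Unicode to ASCII */
    return c < 128 ? c : (uchar_t)'?';
}
```

Transcoding a whole string then amounts to applying a function like this to
each character in turn and re-encoding the results.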

Each C<CHARTYPE> has a number of transcoders associated with it, of which those
to and from Unicode are explicitly singled out because of their expected
frequent use. The C<transcoders> array is currently not used.

=head2 Foreign Encodings

Fill this in later; if anyone wants to implement new encodings at this stage
they must be mad.

=head1 SEE ALSO

F<src/string.c>, F<include/parrot/string.h>, F<include/parrot/string_funcs.h>.

=head1 HISTORY

=over

=item 4 October 2003

Revised to reflect changes since Buffer/PMC unification.

=back