draft-ietf-idn-punycode-03.txt

   1 INTERNET-DRAFT                                          Adam M. Costello
   2 draft-ietf-idn-punycode-03.txt                               2002-Oct-08
   3 Expires 2003-Apr-08
   4
   5           Punycode: A Bootstring encoding of Unicode for IDNA
   6
   7 Status of this Memo
   8
   9     This document is an Internet-Draft and is in full conformance with
  10     all provisions of Section 10 of RFC2026.
  11
  12     Internet-Drafts are working documents of the Internet Engineering
  13     Task Force (IETF), its areas, and its working groups.  Note
  14     that other groups may also distribute working documents as
  15     Internet-Drafts.
  16
  17     Internet-Drafts are draft documents valid for a maximum of six
  18     months and may be updated, replaced, or obsoleted by other documents
  19     at any time.  It is inappropriate to use Internet-Drafts as
  20     reference material or to cite them other than as "work in progress."
  21
  22     The list of current Internet-Drafts can be accessed at
  23     http://www.ietf.org/ietf/1id-abstracts.txt
  24
  25     The list of Internet-Draft Shadow Directories can be accessed at
  26     http://www.ietf.org/shadow.html
  27
  28     Distribution of this document is unlimited.
  29
  30 Abstract
  31
  32     Punycode is a simple and efficient transfer encoding syntax designed
  33     for use with Internationalized Domain Names in Applications [IDNA].
  34     It uniquely and reversibly transforms a Unicode string [UNICODE]
  35     into an ASCII string [ASCII].  ASCII characters in the Unicode
  36     string are represented literally, and non-ASCII characters are
  37     represented by ASCII characters that are allowed in host name labels
  38     (letters, digits, and hyphens).  This document defines a general
  39     algorithm called Bootstring that allows a string of basic code
  40     points to uniquely represent any string of code points drawn from
  41     a larger set.  Punycode is an instance of Bootstring that uses
  42     particular parameter values specified by this document, appropriate
  43     for IDNA.
  44 \f
  45 Contents
  46
  47      1. Introduction
  48          1.1 Features
  49          1.2 Interaction of protocol parts
  50      2. Terminology
  51      3. Bootstring description
  52          3.1 Basic code point segregation
  53          3.2 Insertion unsort coding
  54          3.3 Generalized variable-length integers
  55          3.4 Bias adaptation
  56      4. Bootstring parameters
  57      5. Parameter values for Punycode
  58      6. Bootstring algorithms
  59          6.1 Bias adaptation function
  60          6.2 Decoding procedure
  61          6.3 Encoding procedure
  62          6.4 Overflow handling
  63      7. Punycode examples
  64          7.1 Sample strings
  65          7.2 Decoding traces
  66          7.3 Encoding traces
  67      8. Security considerations
  68      9. References (non-normative)
  69      A. Author contact information
  70      B. Mixed-case annotation
  71      C. Disclaimer and license
  72      D. Punycode sample implementation
  73
  74 1. Introduction
  75
  76     [IDNA] describes an architecture for supporting internationalized
  77     domain names.  Labels containing non-ASCII characters can be
  78     represented by ACE labels, which begin with a special ACE prefix and
  79     contain only ASCII characters.  The remainder of the label after the
  80     prefix is a Punycode encoding of a Unicode string satisfying certain
  81     constraints.  For the details of the prefix and constraints, see
  82     [IDNA] and [NAMEPREP].
  83
  84     Punycode is an instance of a more general algorithm called
  85     Bootstring, which allows strings composed from a small set of
  86     "basic" code points to uniquely represent any string of code points
  87     drawn from a larger set.  Punycode is Bootstring with particular
  88     parameter values appropriate for IDNA.
  89
  90 1.1 Features
  91
  92     Bootstring has been designed to have the following features:
  93
  94       * Completeness:  Every extended string (sequence of arbitrary code
  95         points) can be represented by a basic string (sequence of basic
  96         code points).  Restrictions on what strings are allowed, and on
  97         length, can be imposed by higher layers.
  98
  99       * Uniqueness:  There is at most one basic string that represents a
 100         given extended string.
 101 \f
 102       * Reversibility:  Any extended string mapped to a basic string can
 103         be recovered from that basic string.
 104
 105       * Efficient encoding:  The ratio of basic string length to
 106         extended string length is small.  This is important in the
 107         context of domain names because RFC 1034 [RFC1034] restricts the
 108         length of a domain label to 63 characters.
 109
 110       * Simplicity:  The encoding and decoding algorithms are reasonably
 111         simple to implement.  The goals of efficiency and simplicity are
 112         at odds; Bootstring aims at a good balance between them.
 113
 114       * Readability:  Basic code points appearing in the extended string
 115         are represented as themselves in the basic string (although the
 116         main purpose is to improve efficiency, not readability).
 117
 118     Punycode can also support an additional feature that is not used
 119     by the ToASCII and ToUnicode operations of [IDNA].  When extended
 120     strings are case-folded prior to encoding, the basic string can
 121     use mixed case to tell how to convert the folded string into a
 122     mixed-case string.  See appendix B "Mixed-case annotation".
 123
 124 1.2 Interaction of protocol parts
 125
 126     Punycode is used by the IDNA protocol [IDNA] for converting domain
 127     labels into ASCII; it is not designed for any other purpose.  It is
 128     explicitly not designed for processing arbitrary free text.
 129
 130 2. Terminology
 131
 132     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
 133     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 134     document are to be interpreted as described in RFC 2119 [RFC2119].
 135
 136     A code point is an integral value associated with a character in a
 137     coded character set.
 138
 139     As in the Unicode Standard [UNICODE], Unicode code points are
 140     denoted by "U+" followed by four to six hexadecimal digits, while a
 141     range of code points is denoted by two hexadecimal numbers separated
 142     by "..", with no prefixes.
 143
 144     The operators div and mod perform integer division; (x div y) is the
 145     quotient of x divided by y, discarding the remainder, and (x mod y)
 146     is the remainder, so (x div y) * y + (x mod y) == x.  Bootstring
 147     uses these operators only with nonnegative operands, so the quotient
 148     and remainder are always nonnegative.
 149
 150     The break statement jumps out of the innermost loop (as in C).
 151
 152     An overflow is an attempt to compute a value that exceeds the
 153     maximum value of an integer variable.
 154 \f
 155 3. Bootstring description
 156
 157     Bootstring represents an arbitrary sequence of code points (the
 158     "extended string") as a sequence of basic code points (the "basic
 159     string").  This section describes the representation.  Section 6
 160     "Bootstring algorithms" presents the algorithms as pseudocode.
 161     Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
 162     algorithms for sample inputs.
 163
 164     The following sections describe the four techniques used in
 165     Bootstring.  "Basic code point segregation" is a very simple
 166     and efficient encoding for basic code points occurring in the
 167     extended string: they are simply copied all at once.  "Insertion
 168     unsort coding" encodes the non-basic code points as deltas, and
 169     processes the code points in numerical order rather than in order of
 170     appearance, which typically results in smaller deltas.  The deltas
 171     are represented as "generalized variable-length integers", which use
 172     basic code points to represent nonnegative integers.  The parameters
 173     of this integer representation are dynamically adjusted using "bias
 174     adaptation", to improve efficiency when consecutive deltas have
 175     similar magnitudes.
 176
 177 3.1 Basic code point segregation
 178
 179     All basic code points appearing in the extended string are
 180     represented literally at the beginning of the basic string, in their
 181     original order, followed by a delimiter if (and only if) the number
 182     of basic code points is nonzero.  The delimiter is a particular
 183     basic code point, which never appears in the remainder of the basic
 184     string.  The decoder can therefore find the end of the literal
 185     portion (if there is one) by scanning for the last delimiter.
 186
 187 3.2 Insertion unsort coding
 188
 189     The remainder of the basic string (after the last delimiter if there
 190     is one) represents a sequence of nonnegative integral deltas as
 191     generalized variable-length integers, described in section 3.3.  The
 192     meaning of the deltas is best understood in terms of the decoder.
 193
 194     The decoder builds the extended string incrementally.  Initially,
 195     the extended string is a copy of the literal portion of the basic
 196     string (excluding the last delimiter).  The decoder inserts
 197     non-basic code points, one for each delta, into the extended string,
 198     ultimately arriving at the final decoded string.
 199
 200     At the heart of this process is a state machine with two state
 201     variables: an index i and a counter n.  The index i refers to
 202     a position in the extended string; it ranges from 0 (the first
 203     position) to the current length of the extended string (which refers
 204     to a potential position beyond the current end).  If the current
 205     state is <n,i>, the next state is <n,i+1> if i is less than the
 206     length of the extended string, or <n+1,0> if i equals the length of
 207     the extended string.  In other words, each state change causes i to
 208     increment, wrapping around to zero if necessary, and n counts the
 209     number of wrap-arounds.
 210 \f
 211     Notice that the state always advances monotonically (there is no
 212     way for the decoder to return to an earlier state).  At each state,
 213     an insertion is either performed or not performed.  At most one
 214     insertion is performed in a given state.  An insertion inserts the
 215     value of n at position i in the extended string.  The deltas are
 216     a run-length encoding of this sequence of events: they are the
 217     lengths of the runs of non-insertion states preceeding the insertion
 218     states.  Hence, for each delta, the decoder performs delta state
 219     changes, then an insertion, and then one more state change.  (An
 220     implementation need not perform each state change individually, but
 221     can instead use division and remainder calculations to compute the
 222     next insertion state directly.)  It is an error if the inserted code
 223     point is a basic code point (because basic code points were supposed
 224     to be segregated as described in section 3.1).
 225
 226     The encoder's main task is to derive the sequence of deltas that
 227     will cause the decoder to construct the desired string.  It can do
 228     this by repeatedly scanning the extended string for the next code
 229     point that the decoder would need to insert, and counting the number
 230     of state changes the decoder would need to perform, mindful of the
 231     fact that the decoder's extended string will include only those
 232     code points that have already been inserted.  Section 6.3 "Encoding
 233     procedure" gives a precise algorithm.
 234
 235 3.3 Generalized variable-length integers
 236
 237     In a conventional integer representation the base is the number of
 238     distinct symbols for digits, whose values are 0 through base-1.  Let
 239     digit_0 denote the least significant digit, digit_1 the next least
 240     significant, and so on.  The value represented is the sum over j of
 241     digit_j * w(j), where w(j) = base^j is the weight (scale factor)
 242     for position j.  For example, in the base 8 integer 437, the digits
 243     are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is
 244     7 + 3*8 + 4*64 = 287.  This representation has two disadvantages:
 245     First, there are multiple encodings of each value (because there
 246     can be extra zeros in the most significant positions), which is
 247     inconvenient when unique encodings are needed.  Second, the integer
 248     is not self-delimiting, so if multiple integers are concatenated the
 249     boundaries between them are lost.
 250
 251     The generalized variable-length representation solves these two
 252     problems.  The digit values are still 0 through base-1, but now
 253     the integer is self-delimiting by means of thresholds t(j), each
 254     of which is in the range 0 through base-1.  Exactly one digit, the
 255     most significant, satisfies digit_j < t(j).  Therefore, if several
 256     integers are concatenated, it is easy to separate them, starting
 257     with the first if they are little-endian (least significant digit
 258     first), or starting with the last if they are big-endian (most
 259     significant digit first).  As before, the value is the sum over j of
 260     digit_j * w(j), but the weights are different:
 261
 262         w(0) = 1
 263         w(j) = w(j-1) * (base - t(j-1)) for j > 0
 264 \f
 265     For example, consider the little-endian sequence of base 8 digits
 266     734251...  Suppose the thresholds are 2, 3, 5, 5, 5, 5...  This
 267     implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5)
 268     = 90, 90*(8-5) = 270, and so on.  7 is not less than 2, and 3 is
 269     not less than 3, but 4 is less than 5, so 4 is the last digit.  The
 270     value of 734 is 7*1 + 3*6 + 4*30 = 145.  The next integer is 251,
 271     with value 2*1 + 5*6 + 1*30 = 62.  Decoding this representation
 272     is very similar to decoding a conventional integer:  Start with a
 273     current value of N = 0 and a weight w = 1.  Fetch the next digit d
 274     and increase N by d * w.  If d is less than the current threshold
 275     (t) then stop, otherwise increase w by a factor of (base - t),
 276     update t for the next position, and repeat.
 277
 278     Encoding this representation is similar to encoding a conventional
 279     integer:  If N < t then output one digit for N and stop, otherwise
 280     output the digit for t + ((N - t) mod (base - t)), then replace N
 281     with (N - t) div (base - t), update t for the next position, and
 282     repeat.
 283
 284     For any particular set of values of t(j), there is exactly one
 285     generalized variable-length representation of each nonnegative
 286     integral value.
 287
 288     Bootstring uses little-endian ordering so that the deltas can be
 289     separated starting with the first.  The t(j) values are defined in
 290     terms of the constants base, tmin, and tmax, and a state variable
 291     called bias:
 292
 293         t(j) = base * (j + 1) - bias,
 294         clamped to the range tmin through tmax
 295
 296     The clamping means that if the formula yields a value less than tmin
 297     or greater than tmax, then t(j) = tmin or tmax, respectively.  (In
 298     the pseudocode in section 6 "Bootstring algorithms", the expression
 299     base * (j + 1) is denoted by k for performance reasons.)  These
 300     t(j) values cause the representation to favor integers within a
 301     particular range determined by the bias.
 302
 303 3.4 Bias adaptation
 304
 305     After each delta is encoded or decoded, bias is set for the next
 306     delta as follows:
 307
 308      1. Delta is scaled in order to avoid overflow in the next step:
 309
 310             let delta = delta div 2
 311
 312         But when this is the very first delta, the divisor is not 2, but
 313         instead a constant called damp.  This compensates for the fact
 314         that the second delta is usually much smaller than the first.
 315
 316      2. Delta is increased to compensate for the fact that the next
 317         delta will be inserting into a longer string:
 318 \f
 319             let delta = delta + (delta div numpoints)
 320
 321         numpoints is the total number of code points encoded/decoded so
 322         far (including the one corresponding to this delta itself, and
 323         including the basic code points).
 324
 325      3. Delta is repeatedly divided until it falls within a threshold,
 326         to predict the minimum number of digits needed to represent the
 327         next delta:
 328
 329             while delta > ((base - tmin) * tmax) div 2
 330             do let delta = delta div (base - tmin)
 331
 332      4. The bias is set:
 333
 334             let bias =
 335               (base * the number of divisions performed in step 3) +
 336               (((base - tmin + 1) * delta) div (delta + skew))
 337
 338     The motivation for this procedure is that the current delta provides
 339     a hint about the likely size of the next delta, and so t(j) is
 340     set to tmax for the more significant digits starting with the one
 341     expected to be last, tmin for the less significant digits up through
 342     the one expected to be third-last, and somewhere between tmin and
 343     tmax for the digit expected to be second-last (balancing the hope of
 344     the expected-last digit being unnecessary against the danger of it
 345     being insufficient).
 346
 347 4. Bootstring parameters
 348
 349     Given a set of basic code points, one needs to be designated as
 350     the delimiter.  The base cannot be greater than the number of
 351     distinguishable basic code points remaining.  The digit-values in
 352     the range 0 through base-1 need to be associated with distinct
 353     non-delimiter basic code points.  In some cases multiple code points
 354     need to have the same digit-value; for example, uppercase and
 355     lowercase versions of the same letter need to be equivalent if basic
 356     strings are case-insensitive.
 357
 358     The initial value of n cannot be greater than the minimum non-basic
 359     code point that could appear in extended strings.
 360
 361     The remaining five parameters (tmin, tmax, skew, damp, and the
 362     initial value of bias) need to satisfy the following constraints:
 363
 364         0 <= tmin <= tmax <= base-1
 365         skew >= 1
 366         damp >= 2
 367         initial_bias mod base <= base - tmin
 368
 369     Provided the constraints are satisfied, these five parameters affect
 370     efficiency but not correctness.  They are best chosen empirically.
 371
 372     If support for mixed-case annotation is desired (see appendix B),
 373     make sure that the code points corresponding to 0 through tmax-1 all
 374     have both uppercase and lowercase forms.
 375 \f
 376 5. Parameter values for Punycode
 377
 378     Punycode uses the following Bootstring parameter values:
 379
 380         base         = 36
 381         tmin         = 1
 382         tmax         = 26
 383         skew         = 38
 384         damp         = 700
 385         initial_bias = 72
 386         initial_n    = 128 = 0x80
 387
 388     Although the only restriction Punycode imposes on the input integers
 389     is that they be nonnegative, these parameters are especially
 390     designed to work well with Unicode [UNICODE] code points, which
 391     are integers in the range 0..10FFFF (but not D800..DFFF, which are
 392     reserved for use by the UTF-16 encoding of Unicode).  The basic code
 393     points are the ASCII [ASCII] code points (0..7F), of which U+002D
 394     (-) is the delimiter, and some of the others have digit-values as
 395     follows:
 396
 397         code points    digit-values
 398         ------------   ----------------------
 399         41..5A (A-Z) =  0 to 25, respectively
 400         61..7A (a-z) =  0 to 25, respectively
 401         30..39 (0-9) = 26 to 35, respectively
 402
 403     Using hyphen-minus as the delimiter implies that the encoded string
 404     can end with a hyphen-minus only if the Unicode string consists
 405     entirely of basic code points, but IDNA forbids such strings from
 406     being encoded.  The encoded string can begin with a hyphen-minus,
 407     but IDNA prepends a prefix.  Therefore IDNA using Punycode conforms
 408     to the RFC 952 rule that host name labels neither begin nor end with
 409     a hyphen-minus [RFC952].
 410
 411     A decoder MUST recognize the letters in both uppercase and lowercase
 412     forms (including mixtures of both forms).  An encoder SHOULD output
 413     only uppercase forms or only lowercase forms, unless it uses
 414     mixed-case annotation (see appendix B).
 415
 416     Presumably most users will not manually write or type encoded
 417     strings (as opposed to cutting and pasting them), but those who do
 418     will need to be alert to the potential visual ambiguity between the
 419     following sets of characters:
 420
 421         G 6
 422         I l 1
 423         O 0
 424         S 5
 425         U V
 426         Z 2
 427
 428     Such ambiguities are usually resolved by context, but in a Punycode
 429     encoded string there is no context apparent to humans.
 430 \f
 431 6. Bootstring algorithms
 432
 433     Some parts of the pseudocode can be omitted if the parameters
 434     satisfy certain conditions (for which Punycode qualifies).  These
 435     parts are enclosed in {braces}, and notes immediately following the
 436     pseudocode explain the conditions under which they can be omitted.
 437
 438     Formally, code points are integers, and hence the pseudocode assumes
 439     that arithmetic operations can be performed directly on code points.
 440     In some programming languages, explicit conversion between code
 441     points and integers might be necessary.
 442
 443 6.1 Bias adaptation function
 444
 445     function adapt(delta,numpoints,firsttime):
 446       if firsttime then let delta = delta div damp
 447       else let delta = delta div 2
 448       let delta = delta + (delta div numpoints)
 449       let k = 0
 450       while delta > ((base - tmin) * tmax) div 2 do begin
 451         let delta = delta div (base - tmin)
 452         let k = k + base
 453       end
 454       return k + (((base - tmin + 1) * delta) div (delta + skew))
 455
 456     It does not matter whether the modifications to delta and k
 457     inside adapt() affect variables of the same name inside the
 458     encoding/decoding procedures, because after calling adapt() the
 459     caller does not read those variables before overwriting them.
 460 \f
 461 6.2 Decoding procedure
 462
 463     let n = initial_n
 464     let i = 0
 465     let bias = initial_bias
 466     let output = an empty string indexed from 0
 467     consume all code points before the last delimiter (if there is one)
 468       and copy them to output, fail on any non-basic code point
 469     if more than zero code points were consumed then consume one more
 470       (which will be the last delimiter)
 471     while the input is not exhausted do begin
 472       let oldi = i
 473       let w = 1
 474       for k = base to infinity in steps of base do begin
 475         consume a code point, or fail if there was none to consume
 476         let digit = the code point's digit-value, fail if it has none
 477         let i = i + digit * w, fail on overflow
 478         let t = tmin if k <= bias {+ tmin}, or
 479                 tmax if k >= bias + tmax, or k - bias otherwise
 480         if digit < t then break
 481         let w = w * (base - t), fail on overflow
 482       end
 483       let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
 484       let n = n + i div (length(output) + 1), fail on overflow
 485       let i = i mod (length(output) + 1)
 486       {if n is a basic code point then fail}
 487       insert n into output at position i
 488       increment i
 489     end
 490
 491     The full statement enclosed in braces (checking whether n is a basic
 492     code point) can be omitted if initial_n exceeds all basic code
 493     points (which is true for Punycode), because n is never less than
 494     initial_n.
 495
 496     In the assignment of t, where t is clamped to the range tmin through
 497     tmax, "+ tmin" can always be omitted.  This makes the clamping
 498     calculation incorrect when bias < k < bias + tmin, but that cannot
 499     happen because of the way bias is computed and because of the
 500     constraints on the parameters.
 501
 502     Because the decoder state can only advance monotonically, and there
 503     is only one representation of any delta, there is therefore only
 504     one encoded string that can represent a given sequence of integers.
 505     The only error conditions are invalid code points, unexpected
 506     end-of-input, overflow, and basic code points encoded using deltas
 507     instead of appearing literally.  If the decoder fails on these
 508     errors as shown above, then it cannot produce the same output for
 509     two distinct inputs.  Without this property it would have been
 510     necessary to re-encode the output and verify that it matches the
 511     input in order to guarantee the uniqueness of the encoding.
 512 \f
 513 6.3 Encoding procedure
 514
 515     let n = initial_n
 516     let delta = 0
 517     let bias = initial_bias
 518     let h = b = the number of basic code points in the input
 519     copy them to the output in order, followed by a delimiter if b > 0
 520     {if the input contains a non-basic code point < n then fail}
 521     while h < length(input) do begin
 522       let m = the minimum {non-basic} code point >= n in the input
 523       let delta = delta + (m - n) * (h + 1), fail on overflow
 524       let n = m
 525       for each code point c in the input (in order) do begin
 526         if c < n {or c is basic} then increment delta, fail on overflow
 527         if c == n then begin
 528           let q = delta
 529           for k = base to infinity in steps of base do begin
 530             let t = tmin if k <= bias {+ tmin}, or
 531                     tmax if k >= bias + tmax, or k - bias otherwise
 532             if q < t then break
 533             output the code point for digit t + ((q - t) mod (base - t))
 534             let q = (q - t) div (base - t)
 535           end
 536           output the code point for digit q
 537           let bias = adapt(delta, h + 1, test h equals b?)
 538           let delta = 0
 539           increment h
 540         end
 541       end
 542       increment delta and n
 543     end
 544
 545     The full statement enclosed in braces (checking whether the input
 546     contains a non-basic code point less than n) can be omitted if all
 547     code points less than initial_n are basic code points (which is true
 548     for Punycode if code points are unsigned).
 549
 550     The brace-enclosed conditions "non-basic" and "or c is basic" can be
 551     omitted if initial_n exceeds all basic code points (which is true
 552     for Punycode), because the code point being tested is never less
 553     than initial_n.
 554
 555     In the assignment of t, where t is clamped to the range tmin through
 556     tmax, "+ tmin" can always be omitted.  This makes the clamping
 557     calculation incorrect when bias < k < bias + tmin, but that cannot
 558     happen because of the way bias is computed and because of the
 559     constraints on the parameters.
 560
 561     The checks for overflow are necessary to avoid producing invalid
 562     output when the input contains very large values or is very long.
 563
 564     The increment of delta at the bottom of the outer loop cannot
 565     overflow because delta < length(input) before the increment, and
 566     length(input) is already assumed to be representable.  The increment
 567     of n could overflow, but only if h == length(input), in which case
 568     the procedure is finished anyway.
 569 \f
 570 6.4 Overflow handling
 571
 572     For IDNA, 26-bit unsigned integers are sufficient to handle all
 573     valid IDNA labels without overflow, because any string that
 574     needed a 27-bit delta would have to exceed either the code point
 575     limit (0..10FFFF) or the label length limit (63 characters).
 576     However, overflow handling is necessary because the inputs are not
 577     necessarily valid IDNA labels.
 578
 579     If the programming language does not provide overflow detection,
 580     the following technique can be used.  Suppose A, B, and C are
 581     representable nonnegative integers and C is nonzero.  Then A + B
 582     overflows if and only if B > maxint - A, and A + (B * C) overflows
 583     if and only if B > (maxint - A) div C, where maxint is the greatest
 584     integer for which maxint + 1 cannot be represented.  Refer to
 585     appendix D "Punycode sample implementation" for demonstrations of
 586     this technique in the C language.
 587
 588     The decoding and encoding algorithms shown in sections 6.2 and
 589     6.3 handle overflow by detecting it whenever it happens.  Another
 590     approach is to enforce limits on the inputs that prevent overflow
 591     from happening.  For example, if the encoder were to verify that
 592     no input code points exceed M and that the input length does not
 593     exceed L, then no delta could ever exceed (M - initial_n) * (L + 1),
 594     and hence no overflow could occur if integer variables were capable
 595     of representing values that large.  This prevention approach would
 596     impose more restrictions on the input than the detection approach
 597     does, but might be considered simpler in some programming languages.
 598
 599     In theory, the decoder could use an analogous approach, limiting the
 600     number of digits in a variable-length integer (that is, limiting the
 601     number of iterations in the innermost loop).  However, the number
 602     of digits that suffice to represent a given delta can sometimes
 603     represent much larger deltas (because of the adaptation), and hence
 604     this approach would probably need integers wider than 32 bits.
 605
 606     Yet another approach for the decoder is to allow overflow to occur,
 607     but to check the final output string by re-encoding it and comparing
 608     to the decoder input.  If and only if they do not match (using a
 609     case-insensitive ASCII comparison) overflow has occurred.  This
 610     delayed-detection approach would not impose any more restrictions on
 611     the input than the immediate-detection approach does, and might be
 612     considered simpler in some programming languages.
 613
 614     In fact, if the decoder is used only inside the IDNA ToUnicode
 615     operation [IDNA], then it need not check for overflow at all,
 616     because ToUnicode performs a higher level re-encoding and
 617     comparison, and a mismatch has the same consequence as if the
 618     Punycode decoder had failed.
 619
 620 7. Punycode examples
 621
 622 7.1 Sample strings
 623
 624     In the Punycode encodings below, the ACE prefix is not shown.
 625     Backslashes show where line breaks have been inserted in strings too
 626     long for one line.
 627 \f
 628     The first several examples are all translations of the sentence "Why
 629     can't they just speak in <language>?" (courtesy of Michael Kaplan's
 630     "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
 631     been removed, as is often done in domain names.
 632
 633     (A) Arabic (Egyptian):
 634         u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
 635         u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
 636         Punycode: egbpdaj6bu4bxfgehfvwxn
 637
 638     (B) Chinese (simplified):
 639         u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
 640         Punycode: ihqwcrb4cv8a8dqg056pqjye
 641
 642     (C) Chinese (traditional):
 643         u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
 644         Punycode: ihqwctvzc91f659drss3x8bo0yb
 645
 646     (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
 647         U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
 648         u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
 649         u+0065 u+0073 u+006B u+0079
 650         Punycode: Proprostnemluvesky-uyb24dma41a
 651
 652     (E) Hebrew:
 653         u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
 654         u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
 655         u+05D1 u+05E8 u+05D9 u+05EA
 656         Punycode: 4dbcagdahymbxekheh6e0a7fei0b
 657
 658     (F) Hindi (Devanagari):
 659         u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
 660         u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
 661         u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
 662         u+0939 u+0948 u+0902
 663         Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
 664
 665     (G) Japanese (kanji and hiragana):
 666         u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
 667         u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
 668         Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
 669
 670     (H) Korean (Hangul syllables):
 671         u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
 672         u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
 673         u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
 674         Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
 675                   psd879ccm6fea98c
 676
 677     (I) Russian (Cyrillic):
 678         U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
 679         u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
 680         u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
 681         u+0438
 682         Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
 683 \f
 684     (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
 685         U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
 686         u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
 687         u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
 688         u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
 689         u+0061 u+00F1 u+006F u+006C
 690         Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a
 691
 692     (K) Vietnamese:
 693         T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
 694         <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
 695         U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
 696         u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
 697         u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
 698         U+0056 u+0069 u+1EC7 u+0074
 699         Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g
 700
 701     The next several examples are all names of Japanese music artists,
 702     song titles, and TV programs, just because the author happens to
 703     have them handy (but Japanese is useful for providing examples
 704     of single-row text, two-row text, ideographic text, and various
 705     mixtures thereof).
 706
 707     (L) 3<nen>B<gumi><kinpachi><sensei>
 708         u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
 709         Punycode: 3B-ww4c5e180e575a65lsy2b
 710
 711     (M) <amuro><namie>-with-SUPER-MONKEYS
 712         u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
 713         u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
 714         U+004F U+004E U+004B U+0045 U+0059 U+0053
 715         Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n
 716
 717     (N) Hello-Another-Way-<sorezore><no><basho>
 718         U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
 719         u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
 720         u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
 721         Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b
 722
 723     (O) <hitotsu><yane><no><shita>2
 724         u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
 725         Punycode: 2-u9tlzr9756bt3uc0v
 726
 727     (P) Maji<de>Koi<suru>5<byou><mae>
 728         U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
 729         u+308B u+0035 u+79D2 u+524D
 730         Punycode: MajiKoi5-783gue6qz075azm5e
 731
 732     (Q) <pafii>de<runba>
 733         u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
 734         Punycode: de-jg4avhby1noc0d
 735
 736     (R) <sono><supiido><de>
 737         u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
 738         Punycode: d9juau41awczczp
 739 \f
 740     The last example is an ASCII string that breaks the existing rules
 741     for host name labels.  (It is not a realistic example for IDNA,
 742     because IDNA never encodes pure ASCII labels.)
 743
 744     (S) -> $1.00 <-
 745         u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
 746         u+003C u+002D
 747         Punycode: -> $1.00 <--
 748
 749 7.2 Decoding traces
 750
 751     In the following traces, the evolving state of the decoder is
 752     shown as a sequence of hexadecimal values, representing the code
 753     points in the extended string.  An asterisk appears just after the
 754     most recently inserted code point, indicating both n (the value
 755     preceeding the asterisk) and i (the position of the value just after
 756     the asterisk).  Other numerical values are decimal.
 757
 758     Decoding trace of example B from section 7.1:
 759
 760     n is 128, i is 0, bias is 72
 761     input is "ihqwcrb4cv8a8dqg056pqjye"
 762     there is no delimiter, so extended string starts empty
 763     delta "ihq" decodes to 19853
 764     bias becomes 21
 765     4E0D *
 766     delta "wc" decodes to 64
 767     bias becomes 20
 768     4E0D 4E2D *
 769     delta "rb" decodes to 37
 770     bias becomes 13
 771     4E3A * 4E0D 4E2D
 772     delta "4c" decodes to 56
 773     bias becomes 17
 774     4E3A 4E48 * 4E0D 4E2D
 775     delta "v8a" decodes to 599
 776     bias becomes 32
 777     4E3A 4EC0 * 4E48 4E0D 4E2D
 778     delta "8d" decodes to 130
 779     bias becomes 23
 780     4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
 781     delta "qg" decodes to 154
 782     bias becomes 25
 783     4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
 784     delta "056p" decodes to 46301
 785     bias becomes 84
 786     4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
 787     delta "qjye" decodes to 88531
 788     bias becomes 90
 789     4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
 790 \f
 791     Decoding trace of example L from section 7.1:
 792
 793     n is 128, i is 0, bias is 72
 794     input is "3B-ww4c5e180e575a65lsy2b"
 795     literal portion is "3B-", so extended string starts as:
 796     0033 0042
 797     delta "ww4c" decodes to 62042
 798     bias becomes 27
 799     0033 0042 5148 *
 800     delta "5e" decodes to 139
 801     bias becomes 24
 802     0033 0042 516B * 5148
 803     delta "180e" decodes to 16683
 804     bias becomes 67
 805     0033 5E74 * 0042 516B 5148
 806     delta "575a" decodes to 34821
 807     bias becomes 82
 808     0033 5E74 0042 516B 5148 751F *
 809     delta "65l" decodes to 14592
 810     bias becomes 67
 811     0033 5E74 0042 7D44 * 516B 5148 751F
 812     delta "sy2b" decodes to 42088
 813     bias becomes 84
 814     0033 5E74 0042 7D44 91D1 * 516B 5148 751F
 815
 816 7.3 Encoding traces
 817
 818     In the following traces, code point values are hexadecimal, while
 819     other numerical values are decimal.
 820 \f
 821     Encoding trace of example B from section 7.1:
 822
 823     bias is 72
 824     input is:
 825     4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
 826     there are no basic code points, so no literal portion
 827     next code point to insert is 4E0D
 828     needed delta is 19853, encodes as "ihq"
 829     bias becomes 21
 830     next code point to insert is 4E2D
 831     needed delta is 64, encodes as "wc"
 832     bias becomes 20
 833     next code point to insert is 4E3A
 834     needed delta is 37, encodes as "rb"
 835     bias becomes 13
 836     next code point to insert is 4E48
 837     needed delta is 56, encodes as "4c"
 838     bias becomes 17
 839     next code point to insert is 4EC0
 840     needed delta is 599, encodes as "v8a"
 841     bias becomes 32
 842     next code point to insert is 4ED6
 843     needed delta is 130, encodes as "8d"
 844     bias becomes 23
 845     next code point to insert is 4EEC
 846     needed delta is 154, encodes as "qg"
 847     bias becomes 25
 848     next code point to insert is 6587
 849     needed delta is 46301, encodes as "056p"
 850     bias becomes 84
 851     next code point to insert is 8BF4
 852     needed delta is 88531, encodes as "qjye"
 853     bias becomes 90
 854     output is "ihqwcrb4cv8a8dqg056pqjye"
 855 \f
 856     Encoding trace of example L from section 7.1:
 857
 858     bias is 72
 859     input is:
 860     0033 5E74 0042 7D44 91D1 516B 5148 751F
 861     basic code points (0033, 0042) are copied to literal portion: "3B-"
 862     next code point to insert is 5148
 863     needed delta is 62042, encodes as "ww4c"
 864     bias becomes 27
 865     next code point to insert is 516B
 866     needed delta is 139, encodes as "5e"
 867     bias becomes 24
 868     next code point to insert is 5E74
 869     needed delta is 16683, encodes as "180e"
 870     bias becomes 67
 871     next code point to insert is 751F
 872     needed delta is 34821, encodes as "575a"
 873     bias becomes 82
 874     next code point to insert is 7D44
 875     needed delta is 14592, encodes as "65l"
 876     bias becomes 67
 877     next code point to insert is 91D1
 878     needed delta is 42088, encodes as "sy2b"
 879     bias becomes 84
 880     output is "3B-ww4c5e180e575a65lsy2b"
 881
 882 8. Security considerations
 883
 884     Users expect each domain name in DNS to be controlled by a single
 885     authority.  If a Unicode string intended for use as a domain label
 886     could map to multiple ACE labels, then an internationalized domain
 887     name could map to multiple ASCII domain names, each controlled by
 888     a different authority, some of which could be spoofs that hijack
 889     service requests intended for another.  Therefore Punycode is
 890     designed so that each Unicode string has a unique encoding.
 891
 892     However, there can still be multiple Unicode representations of the
 893     "same" text, for various definitions of "same".  This problem is
 894     addressed to some extent by the Unicode standard under the topic of
 895     canonicalization, and this work is leveraged for domain names by
 896     Nameprep [NAMEPREP].
 897
 898 9. References (non-normative)
 899
 900     [ASCII] Vint Cerf, "ASCII format for Network Interchange",
 901     1969-Oct-16, RFC 20.
 902
 903     [IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello,
 904     "Internationalizing Domain Names In Applications (IDNA)",
 905     draft-ietf-idn-idna.
 906
 907     [NAMEPREP] Paul Hoffman, Marc Blanchet, "Nameprep: A
 908     Stringprep Profile for Internationalized Domain Names",
 909     draft-ietf-idn-nameprep.
 910
 911     [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
 912     http://www.trigeminal.com/samples/provincial.html.
 913 \f
 914     [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
 915     Table Specification", 1985-Oct, RFC 952.
 916
 917     [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
 918     1987-Nov, RFC 1034.
 919
 920     [UNICODE] The Unicode Consortium, "The Unicode Standard",
 921     http://www.unicode.org/unicode/standard/standard.html.
 922
 923 A. Author contact information
 924
 925     Adam M. Costello
 926     University of California, Berkeley
 927     http://www.nicemice.net/amc/
 928
 929 B. Mixed-case annotation
 930
 931     In order to use Punycode to represent case-insensitive strings,
 932     higher layers need to case-fold the strings prior to Punycode
 933     encoding.  The encoded string can use mixed case as an annotation
 934     telling how to convert the folded string into a mixed-case string
 935     for display purposes.  Note, however, that mixed-case annotation
 936     is not used by the ToASCII and ToUnicode operations specified in
 937     [IDNA], and therefore implementors of IDNA can disregard this
 938     appendix.
 939
 940     Basic code points can use mixed case directly, because the decoder
 941     copies them verbatim, leaving lowercase code points lowercase, and
 942     leaving uppercase code points uppercase.  Each non-basic code point
 943     is represented by a delta, which is represented by a sequence of
 944     basic code points, the last of which provides the annotation.  If it
 945     is uppercase, it is a suggestion to map the non-basic code point to
 946     uppercase (if possible); if it is lowercase, it is a suggestion to
 947     map the non-basic code point to lowercase (if possible).
 948
 949     These annotations do not alter the code points returned by decoders;
 950     the annotations are returned separately, for the caller to use or
 951     ignore.  Encoders can accept annotations in addition to code points,
 952     but the annotations do not alter the output, except to influence the
 953     uppercase/lowercase form of ASCII letters.
 954
 955     Punycode encoders and decoders need not support these annotations,
 956     and higher layers need not use them.
 957
 958 C. Disclaimer and license
 959
 960     Regarding this entire document or any portion of it (including
 961     the pseudocode and C code), the author makes no guarantees and
 962     is not responsible for any damage resulting from its use.  The
 963     author grants irrevocable permission to anyone to use, modify,
 964     and distribute it in any way that does not diminish the rights
 965     of anyone else to use, modify, and distribute it, provided that
 966     redistributed derivative works do not contain misleading author or
 967     version information.  Derivative works need not be licensed under
 968     similar terms.
 969
 970
 971 D. Punycode sample implementation
 972 \f
 973 /*
 974 punycode.c from draft-ietf-idn-punycode-03
 975 http://www.nicemice.net/idn/
 976 Adam M. Costello
 977 http://www.nicemice.net/amc/
 978
 979 This is ANSI C code (C89) implementing
 980 Punycode (draft-ietf-idn-punycode-03).
 981
 982 */
 983
 984
 985 /************************************************************/
 986 /* Public interface (would normally go in its own .h file): */
 987
 988 #include <limits.h>
 989
 990 enum punycode_status {
 991   punycode_success,
 992   punycode_bad_input,   /* Input is invalid.                       */
 993   punycode_big_output,  /* Output would exceed the space provided. */
 994   punycode_overflow     /* Input needs wider integers to process.  */
 995 };
 996
 997 #if UINT_MAX >= (1 << 26) - 1
 998 typedef unsigned int punycode_uint;
 999 #else
1000 typedef unsigned long punycode_uint;
1001 #endif
1002
1003 enum punycode_status punycode_encode(
1004   punycode_uint input_length,
1005   const punycode_uint input[],
1006   const unsigned char case_flags[],
1007   punycode_uint *output_length,
1008   char output[] );
1009 \f
1010     /* punycode_encode() converts Unicode to Punycode.  The input     */
1011     /* is represented as an array of Unicode code points (not code    */
1012     /* units; surrogate pairs are not allowed), and the output        */
1013     /* will be represented as an array of ASCII code points.  The     */
1014     /* output string is *not* null-terminated; it will contain        */
1015     /* zeros if and only if the input contains zeros.  (Of course     */
1016     /* the caller can leave room for a terminator and add one if      */
1017     /* needed.)  The input_length is the number of code points in     */
1018     /* the input.  The output_length is an in/out argument: the       */
1019     /* caller passes in the maximum number of code points that it     */
1020     /* can receive, and on successful return it will contain the      */
1021     /* number of code points actually output.  The case_flags array   */
1022     /* holds input_length boolean values, where nonzero suggests that */
1023     /* the corresponding Unicode character be forced to uppercase     */
1024     /* after being decoded (if possible), and zero suggests that      */
1025     /* it be forced to lowercase (if possible).  ASCII code points    */
1026     /* are encoded literally, except that ASCII letters are forced    */
1027     /* to uppercase or lowercase according to the corresponding       */
1028     /* uppercase flags.  If case_flags is a null pointer then ASCII   */
1029     /* letters are left as they are, and other code points are        */
1030     /* treated as if their uppercase flags were zero.  The return     */
1031     /* value can be any of the punycode_status values defined above   */
1032     /* except punycode_bad_input; if not punycode_success, then       */
1033     /* output_size and output might contain garbage.                  */
1034
1035 enum punycode_status punycode_decode(
1036   punycode_uint input_length,
1037   const char input[],
1038   punycode_uint *output_length,
1039   punycode_uint output[],
1040   unsigned char case_flags[] );
1041
1042     /* punycode_decode() converts Punycode to Unicode.  The input is  */
1043     /* represented as an array of ASCII code points, and the output   */
1044     /* will be represented as an array of Unicode code points.  The   */
1045     /* input_length is the number of code points in the input.  The   */
1046     /* output_length is an in/out argument: the caller passes in      */
1047     /* the maximum number of code points that it can receive, and     */
1048     /* on successful return it will contain the actual number of      */
1049     /* code points output.  The case_flags array needs room for at    */
1050     /* least output_length values, or it can be a null pointer if the */
1051     /* case information is not needed.  A nonzero flag suggests that  */
1052     /* the corresponding Unicode character be forced to uppercase     */
1053     /* by the caller (if possible), while zero suggests that it be    */
1054     /* forced to lowercase (if possible).  ASCII code points are      */
1055     /* output already in the proper case, but their flags will be set */
1056     /* appropriately so that applying the flags would be harmless.    */
1057     /* The return value can be any of the punycode_status values      */
1058     /* defined above; if not punycode_success, then output_length,    */
1059     /* output, and case_flags might contain garbage.  On success, the */
1060     /* decoder will never need to write an output_length greater than */
1061     /* input_length, because of how the encoding is defined.          */
1062 \f
1063 /**********************************************************/
1064 /* Implementation (would normally go in its own .c file): */
1065
1066 #include <string.h>
1067
1068 /*** Bootstring parameters for Punycode ***/
1069
1070 enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
1071        initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
1072
1073 /* basic(cp) tests whether cp is a basic code point: */
1074 #define basic(cp) ((punycode_uint)(cp) < 0x80)
1075
1076 /* delim(cp) tests whether cp is a delimiter: */
1077 #define delim(cp) ((cp) == delimiter)
1078
1079 /* decode_digit(cp) returns the numeric value of a basic code */
1080 /* point (for use in representing integers) in the range 0 to */
1081 /* base-1, or base if cp is does not represent a value.       */
1082
1083 static punycode_uint decode_digit(punycode_uint cp)
1084 {
1085   return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :
1086           cp - 97 < 26 ? cp - 97 :  base;
1087 }
1088
1089 /* encode_digit(d,flag) returns the basic code point whose value      */
1090 /* (when used for representing integers) is d, which needs to be in   */
1091 /* the range 0 to base-1.  The lowercase form is used unless flag is  */
1092 /* nonzero, in which case the uppercase form is used.  The behavior   */
1093 /* is undefined if flag is nonzero and digit d has no uppercase form. */
1094
1095 static char encode_digit(punycode_uint d, int flag)
1096 {
1097   return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
1098   /*  0..25 map to ASCII a..z or A..Z */
1099   /* 26..35 map to ASCII 0..9         */
1100 }
1101
1102 /* flagged(bcp) tests whether a basic code point is flagged */
1103 /* (uppercase).  The behavior is undefined if bcp is not a  */
1104 /* basic code point.                                        */
1105
1106 #define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
1107
1108 /* encode_basic(bcp,flag) forces a basic code point to lowercase */
1109 /* if flag is zero, uppercase if flag is nonzero, and returns    */
1110 /* the resulting code point.  The code point is unchanged if it  */
1111 /* is caseless.  The behavior is undefined if bcp is not a basic */
1112 /* code point.                                                   */
1113
1114 static char encode_basic(punycode_uint bcp, int flag)
1115 {
1116   bcp -= (bcp - 97 < 26) << 5;
1117   return bcp + ((!flag && (bcp - 65 < 26)) << 5);
1118 }
1119 \f
1120 /*** Platform-specific constants ***/
1121
1122 /* maxint is the maximum value of a punycode_uint variable: */
1123 static const punycode_uint maxint = -1;
1124 /* Because maxint is unsigned, -1 becomes the maximum value. */
1125
1126 /*** Bias adaptation function ***/
1127
1128 static punycode_uint adapt(
1129   punycode_uint delta, punycode_uint numpoints, int firsttime )
1130 {
1131   punycode_uint k;
1132
1133   delta = firsttime ? delta / damp : delta >> 1;
1134   /* delta >> 1 is a faster way of doing delta / 2 */
1135   delta += delta / numpoints;
1136
1137   for (k = 0;  delta > ((base - tmin) * tmax) / 2;  k += base) {
1138     delta /= base - tmin;
1139   }
1140
1141   return k + (base - tmin + 1) * delta / (delta + skew);
1142 }
1143
1144 /*** Main encode function ***/
1145
1146 enum punycode_status punycode_encode(
1147   punycode_uint input_length,
1148   const punycode_uint input[],
1149   const unsigned char case_flags[],
1150   punycode_uint *output_length,
1151   char output[] )
1152 {
1153   punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
1154
1155   /* Initialize the state: */
1156
1157   n = initial_n;
1158   delta = out = 0;
1159   max_out = *output_length;
1160   bias = initial_bias;
1161
1162   /* Handle the basic code points: */
1163
1164   for (j = 0;  j < input_length;  ++j) {
1165     if (basic(input[j])) {
1166       if (max_out - out < 2) return punycode_big_output;
1167       output[out++] =
1168         case_flags ?  encode_basic(input[j], case_flags[j]) : input[j];
1169     }
1170     /* else if (input[j] < n) return punycode_bad_input; */
1171     /* (not needed for Punycode with unsigned code points) */
1172   }
1173 \f
1174   h = b = out;
1175
1176   /* h is the number of code points that have been handled, b is the  */
1177   /* number of basic code points, and out is the number of characters */
1178   /* that have been output.                                           */
1179
1180   if (b > 0) output[out++] = delimiter;
1181
1182   /* Main encoding loop: */
1183
1184   while (h < input_length) {
1185     /* All non-basic code points < n have been     */
1186     /* handled already.  Find the next larger one: */
1187
1188     for (m = maxint, j = 0;  j < input_length;  ++j) {
1189       /* if (basic(input[j])) continue; */
1190       /* (not needed for Punycode) */
1191       if (input[j] >= n && input[j] < m) m = input[j];
1192     }
1193
1194     /* Increase delta enough to advance the decoder's    */
1195     /* <n,i> state to <m,0>, but guard against overflow: */
1196
1197     if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
1198     delta += (m - n) * (h + 1);
1199     n = m;
1200
1201     for (j = 0;  j < input_length;  ++j) {
1202       /* Punycode does not need to check whether input[j] is basic: */
1203       if (input[j] < n /* || basic(input[j]) */ ) {
1204         if (++delta == 0) return punycode_overflow;
1205       }
1206
1207       if (input[j] == n) {
1208         /* Represent delta as a generalized variable-length integer: */
1209
1210         for (q = delta, k = base;  ;  k += base) {
1211           if (out >= max_out) return punycode_big_output;
1212           t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
1213               k >= bias + tmax ? tmax : k - bias;
1214           if (q < t) break;
1215           output[out++] = encode_digit(t + (q - t) % (base - t), 0);
1216           q = (q - t) / (base - t);
1217         }
1218
1219         output[out++] = encode_digit(q, case_flags && case_flags[j]);
1220         bias = adapt(delta, h + 1, h == b);
1221         delta = 0;
1222         ++h;
1223       }
1224     }
1225
1226     ++delta, ++n;
1227   }
1228
1229   *output_length = out;
1230   return punycode_success;
1231 }
1232 \f
1233 /*** Main decode function ***/
1234
1235 enum punycode_status punycode_decode(
1236   punycode_uint input_length,
1237   const char input[],
1238   punycode_uint *output_length,
1239   punycode_uint output[],
1240   unsigned char case_flags[] )
1241 {
1242   punycode_uint n, out, i, max_out, bias,
1243                  b, j, in, oldi, w, k, digit, t;
1244
1245   /* Initialize the state: */
1246
1247   n = initial_n;
1248   out = i = 0;
1249   max_out = *output_length;
1250   bias = initial_bias;
1251
1252   /* Handle the basic code points:  Let b be the number of input code */
1253   /* points before the last delimiter, or 0 if there is none, then    */
1254   /* copy the first b code points to the output.                      */
1255
1256   for (b = j = 0;  j < input_length;  ++j) if (delim(input[j])) b = j;
1257   if (b > max_out) return punycode_big_output;
1258
1259   for (j = 0;  j < b;  ++j) {
1260     if (case_flags) case_flags[out] = flagged(input[j]);
1261     if (!basic(input[j])) return punycode_bad_input;
1262     output[out++] = input[j];
1263   }
1264
1265   /* Main decoding loop:  Start just after the last delimiter if any  */
1266   /* basic code points were copied; start at the beginning otherwise. */
1267
1268   for (in = b > 0 ? b + 1 : 0;  in < input_length;  ++out) {
1269
1270     /* in is the index of the next character to be consumed, and */
1271     /* out is the number of code points in the output array.     */
1272
1273     /* Decode a generalized variable-length integer into delta,  */
1274     /* which gets added to i.  The overflow checking is easier   */
1275     /* if we increase i as we go, then subtract off its starting */
1276     /* value at the end to obtain delta.                         */
1277
1278     for (oldi = i, w = 1, k = base;  ;  k += base) {
1279       if (in >= input_length) return punycode_bad_input;
1280       digit = decode_digit(input[in++]);
1281       if (digit >= base) return punycode_bad_input;
1282       if (digit > (maxint - i) / w) return punycode_overflow;
1283       i += digit * w;
1284       t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
1285           k >= bias + tmax ? tmax : k - bias;
1286       if (digit < t) break;
1287       if (w > maxint / (base - t)) return punycode_overflow;
1288       w *= (base - t);
1289     }
1290 \f
1291     bias = adapt(i - oldi, out + 1, oldi == 0);
1292
1293     /* i was supposed to wrap around from out+1 to 0,   */
1294     /* incrementing n each time, so we'll fix that now: */
1295
1296     if (i / (out + 1) > maxint - n) return punycode_overflow;
1297     n += i / (out + 1);
1298     i %= (out + 1);
1299
1300     /* Insert n at position i of the output: */
1301
1302     /* not needed for Punycode: */
1303     /* if (decode_digit(n) <= base) return punycode_invalid_input; */
1304     if (out >= max_out) return punycode_big_output;
1305
1306     if (case_flags) {
1307       memmove(case_flags + i + 1, case_flags + i, out - i);
1308       /* Case of last character determines uppercase flag: */
1309       case_flags[i] = flagged(input[in - 1]);
1310     }
1311
1312     memmove(output + i + 1, output + i, (out - i) * sizeof *output);
1313     output[i++] = n;
1314   }
1315
1316   *output_length = out;
1317   return punycode_success;
1318 }
1319
1320
1321 /******************************************************************/
1322 /* Wrapper for testing (would normally go in a separate .c file): */
1323
1324 #include <assert.h>
1325 #include <stdio.h>
1326 #include <stdlib.h>
1327 #include <string.h>
1328
1329 /* For testing, we'll just set some compile-time limits rather than */
1330 /* use malloc(), and set a compile-time option rather than using a  */
1331 /* command-line option.                                             */
1332
1333 enum {
1334   unicode_max_length = 256,
1335   ace_max_length = 256
1336 };
1337 \f
1338 static void usage(char **argv)
1339 {
1340   fprintf(stderr,
1341     "\n"
1342     "%s -e reads code points and writes a Punycode string.\n"
1343     "%s -d reads a Punycode string and writes code points.\n"
1344     "\n"
1345     "Input and output are plain text in the native character set.\n"
1346     "Code points are in the form u+hex separated by whitespace.\n"
1347     "Although the specification allows Punycode strings to contain\n"
1348     "any characters from the ASCII repertoire, this test code\n"
1349     "supports only the printable characters, and needs the Punycode\n"
1350     "string to be followed by a newline.\n"
1351     "The case of the u in u+hex is the force-to-uppercase flag.\n"
1352     , argv[0], argv[0]);
1353   exit(EXIT_FAILURE);
1354 }
1355
1356
1357 static void fail(const char *msg)
1358 {
1359   fputs(msg,stderr);
1360   exit(EXIT_FAILURE);
1361 }
1362
1363 static const char too_big[] =
1364   "input or output is too large, recompile with larger limits\n";
1365 static const char invalid_input[] = "invalid input\n";
1366 static const char overflow[] = "arithmetic overflow\n";
1367 static const char io_error[] = "I/O error\n";
1368
1369
1370 /* The following string is used to convert printable */
1371 /* characters between ASCII and the native charset:  */
1372
1373 static const char print_ascii[] =
1374   "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1375   "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1376   " !\"#$%&'()*+,-./"
1377   "0123456789:;<=>?"
1378   "@ABCDEFGHIJKLMNO"
1379   "PQRSTUVWXYZ[\\]^_"
1380   "`abcdefghijklmno"
1381   "pqrstuvwxyz{|}~\n";
1382
1383
1384 int main(int argc, char **argv)
1385 {
1386   enum punycode_status status;
1387   int r;
1388   unsigned int input_length, output_length, j;
1389   unsigned char case_flags[unicode_max_length];
1390
1391   if (argc != 2) usage(argv);
1392   if (argv[1][0] != '-') usage(argv);
1393   if (argv[1][2] != 0) usage(argv);
1394 \f
1395   if (argv[1][1] == 'e') {
1396     punycode_uint input[unicode_max_length];
1397     unsigned long codept;
1398     char output[ace_max_length+1], uplus[3];
1399     int c;
1400
1401     /* Read the input code points: */
1402
1403     input_length = 0;
1404
1405     for (;;) {
1406       r = scanf("%2s%lx", uplus, &codept);
1407       if (ferror(stdin)) fail(io_error);
1408       if (r == EOF || r == 0) break;
1409
1410       if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {
1411         fail(invalid_input);
1412       }
1413
1414       if (input_length == unicode_max_length) fail(too_big);
1415
1416       if (uplus[0] == 'u') case_flags[input_length] = 0;
1417       else if (uplus[0] == 'U') case_flags[input_length] = 1;
1418       else fail(invalid_input);
1419
1420       input[input_length++] = codept;
1421     }
1422
1423     /* Encode: */
1424
1425     output_length = ace_max_length;
1426     status = punycode_encode(input_length, input, case_flags,
1427                              &output_length, output);
1428     if (status == punycode_bad_input) fail(invalid_input);
1429     if (status == punycode_big_output) fail(too_big);
1430     if (status == punycode_overflow) fail(overflow);
1431     assert(status == punycode_success);
1432
1433     /* Convert to native charset and output: */
1434
1435     for (j = 0;  j < output_length;  ++j) {
1436       c = output[j];
1437       assert(c >= 0 && c <= 127);
1438       if (print_ascii[c] == 0) fail(invalid_input);
1439       output[j] = print_ascii[c];
1440     }
1441
1442     output[j] = 0;
1443     r = puts(output);
1444     if (r == EOF) fail(io_error);
1445     return EXIT_SUCCESS;
1446   }
1447
1448   if (argv[1][1] == 'd') {
1449     char input[ace_max_length+2], *p, *pp;
1450     punycode_uint output[unicode_max_length];
1451 \f
1452     /* Read the Punycode input string and convert to ASCII: */
1453
1454     fgets(input, ace_max_length+2, stdin);
1455     if (ferror(stdin)) fail(io_error);
1456     if (feof(stdin)) fail(invalid_input);
1457     input_length = strlen(input) - 1;
1458     if (input[input_length] != '\n') fail(too_big);
1459     input[input_length] = 0;
1460
1461     for (p = input;  *p != 0;  ++p) {
1462       pp = strchr(print_ascii, *p);
1463       if (pp == 0) fail(invalid_input);
1464       *p = pp - print_ascii;
1465     }
1466
1467     /* Decode: */
1468
1469     output_length = unicode_max_length;
1470     status = punycode_decode(input_length, input, &output_length,
1471                              output, case_flags);
1472     if (status == punycode_bad_input) fail(invalid_input);
1473     if (status == punycode_big_output) fail(too_big);
1474     if (status == punycode_overflow) fail(overflow);
1475     assert(status == punycode_success);
1476
1477     /* Output the result: */
1478
1479     for (j = 0;  j < output_length;  ++j) {
1480       r = printf("%s+%04lX\n",
1481                  case_flags[j] ? "U" : "u",
1482                  (unsigned long) output[j] );
1483       if (r < 0) fail(io_error);
1484     }
1485
1486     return EXIT_SUCCESS;
1487   }
1488
1489   usage(argv);
1490   return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
1491 }
1492
1493
1494
1495                    INTERNET-DRAFT expires 2003-Apr-08