1 INTERNET-DRAFT Adam M. Costello
2 draft-ietf-idn-punycode-03.txt 2002-Oct-08
5 Punycode: A Bootstring encoding of Unicode for IDNA
9 This document is an Internet-Draft and is in full conformance with
10 all provisions of Section 10 of RFC2026.
12 Internet-Drafts are working documents of the Internet Engineering
13 Task Force (IETF), its areas, and its working groups. Note
14 that other groups may also distribute working documents as
17 Internet-Drafts are draft documents valid for a maximum of six
18 months and may be updated, replaced, or obsoleted by other documents
19 at any time. It is inappropriate to use Internet-Drafts as
20 reference material or to cite them other than as "work in progress."
22 The list of current Internet-Drafts can be accessed at
23 http://www.ietf.org/ietf/1id-abstracts.txt
25 The list of Internet-Draft Shadow Directories can be accessed at
26 http://www.ietf.org/shadow.html
28 Distribution of this document is unlimited.
32 Punycode is a simple and efficient transfer encoding syntax designed
33 for use with Internationalized Domain Names in Applications [IDNA].
34 It uniquely and reversibly transforms a Unicode string [UNICODE]
35 into an ASCII string [ASCII]. ASCII characters in the Unicode
36 string are represented literally, and non-ASCII characters are
37 represented by ASCII characters that are allowed in host name labels
38 (letters, digits, and hyphens). This document defines a general
39 algorithm called Bootstring that allows a string of basic code
40 points to uniquely represent any string of code points drawn from
41 a larger set. Punycode is an instance of Bootstring that uses
42 particular parameter values specified by this document, appropriate
49 1.2 Interaction of protocol parts
51 3. Bootstring description
52 3.1 Basic code point segregation
53 3.2 Insertion unsort coding
54 3.3 Generalized variable-length integers
56 4. Bootstring parameters
57 5. Parameter values for Punycode
58 6. Bootstring algorithms
59 6.1 Bias adaptation function
60 6.2 Decoding procedure
61 6.3 Encoding procedure
67 8. Security considerations
68 9. References (non-normative)
69 A. Author contact information
70 B. Mixed-case annotation
71 C. Disclaimer and license
72 D. Punycode sample implementation
76 [IDNA] describes an architecture for supporting internationalized
77 domain names. Labels containing non-ASCII characters can be
78 represented by ACE labels, which begin with a special ACE prefix and
79 contain only ASCII characters. The remainder of the label after the
80 prefix is a Punycode encoding of a Unicode string satisfying certain
81 constraints. For the details of the prefix and constraints, see
82 [IDNA] and [NAMEPREP].
84 Punycode is an instance of a more general algorithm called
85 Bootstring, which allows strings composed from a small set of
86 "basic" code points to uniquely represent any string of code points
87 drawn from a larger set. Punycode is Bootstring with particular
88 parameter values appropriate for IDNA.
92 Bootstring has been designed to have the following features:
94 * Completeness: Every extended string (sequence of arbitrary code
95 points) can be represented by a basic string (sequence of basic
96 code points). Restrictions on what strings are allowed, and on
97 length, can be imposed by higher layers.
99 * Uniqueness: There is at most one basic string that represents a
100 given extended string.
102 * Reversibility: Any extended string mapped to a basic string can
103 be recovered from that basic string.
105 * Efficient encoding: The ratio of basic string length to
106 extended string length is small. This is important in the
107 context of domain names because RFC 1034 [RFC1034] restricts the
108 length of a domain label to 63 characters.
110 * Simplicity: The encoding and decoding algorithms are reasonably
111 simple to implement. The goals of efficiency and simplicity are
112 at odds; Bootstring aims at a good balance between them.
114 * Readability: Basic code points appearing in the extended string
115 are represented as themselves in the basic string (although the
116 main purpose is to improve efficiency, not readability).
118 Punycode can also support an additional feature that is not used
119 by the ToASCII and ToUnicode operations of [IDNA]. When extended
120 strings are case-folded prior to encoding, the basic string can
121 use mixed case to tell how to convert the folded string into a
122 mixed-case string. See appendix B "Mixed-case annotation".
124 1.2 Interaction of protocol parts
126 Punycode is used by the IDNA protocol [IDNA] for converting domain
127 labels into ASCII; it is not designed for any other purpose. It is
128 explicitly not designed for processing arbitrary free text.
132 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
133 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
134 document are to be interpreted as described in RFC 2119 [RFC2119].
136 A code point is an integral value associated with a character in a
139 As in the Unicode Standard [UNICODE], Unicode code points are
140 denoted by "U+" followed by four to six hexadecimal digits, while a
141 range of code points is denoted by two hexadecimal numbers separated
142 by "..", with no prefixes.
144 The operators div and mod perform integer division; (x div y) is the
145 quotient of x divided by y, discarding the remainder, and (x mod y)
146 is the remainder, so (x div y) * y + (x mod y) == x. Bootstring
147 uses these operators only with nonnegative operands, so the quotient
148 and remainder are always nonnegative.
150 The break statement jumps out of the innermost loop (as in C).
152 An overflow is an attempt to compute a value that exceeds the
153 maximum value of an integer variable.
155 3. Bootstring description
157 Bootstring represents an arbitrary sequence of code points (the
158 "extended string") as a sequence of basic code points (the "basic
159 string"). This section describes the representation. Section 6
160 "Bootstring algorithms" presents the algorithms as pseudocode.
161 Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
162 algorithms for sample inputs.
164 The following sections describe the four techniques used in
165 Bootstring. "Basic code point segregation" is a very simple
166 and efficient encoding for basic code points occurring in the
167 extended string: they are simply copied all at once. "Insertion
168 unsort coding" encodes the non-basic code points as deltas, and
169 processes the code points in numerical order rather than in order of
170 appearance, which typically results in smaller deltas. The deltas
171 are represented as "generalized variable-length integers", which use
172 basic code points to represent nonnegative integers. The parameters
173 of this integer representation are dynamically adjusted using "bias
174 adaptation", to improve efficiency when consecutive deltas have
177 3.1 Basic code point segregation
179 All basic code points appearing in the extended string are
180 represented literally at the beginning of the basic string, in their
181 original order, followed by a delimiter if (and only if) the number
182 of basic code points is nonzero. The delimiter is a particular
183 basic code point, which never appears in the remainder of the basic
184 string. The decoder can therefore find the end of the literal
185 portion (if there is one) by scanning for the last delimiter.
187 3.2 Insertion unsort coding
189 The remainder of the basic string (after the last delimiter if there
190 is one) represents a sequence of nonnegative integral deltas as
191 generalized variable-length integers, described in section 3.3. The
192 meaning of the deltas is best understood in terms of the decoder.
194 The decoder builds the extended string incrementally. Initially,
195 the extended string is a copy of the literal portion of the basic
196 string (excluding the last delimiter). The decoder inserts
197 non-basic code points, one for each delta, into the extended string,
198 ultimately arriving at the final decoded string.
200 At the heart of this process is a state machine with two state
201 variables: an index i and a counter n. The index i refers to
202 a position in the extended string; it ranges from 0 (the first
203 position) to the current length of the extended string (which refers
204 to a potential position beyond the current end). If the current
205 state is <n,i>, the next state is <n,i+1> if i is less than the
206 length of the extended string, or <n+1,0> if i equals the length of
207 the extended string. In other words, each state change causes i to
208 increment, wrapping around to zero if necessary, and n counts the
209 number of wrap-arounds.
211 Notice that the state always advances monotonically (there is no
212 way for the decoder to return to an earlier state). At each state,
213 an insertion is either performed or not performed. At most one
214 insertion is performed in a given state. An insertion inserts the
215 value of n at position i in the extended string. The deltas are
216 a run-length encoding of this sequence of events: they are the
217 lengths of the runs of non-insertion states preceeding the insertion
218 states. Hence, for each delta, the decoder performs delta state
219 changes, then an insertion, and then one more state change. (An
220 implementation need not perform each state change individually, but
221 can instead use division and remainder calculations to compute the
222 next insertion state directly.) It is an error if the inserted code
223 point is a basic code point (because basic code points were supposed
224 to be segregated as described in section 3.1).
226 The encoder's main task is to derive the sequence of deltas that
227 will cause the decoder to construct the desired string. It can do
228 this by repeatedly scanning the extended string for the next code
229 point that the decoder would need to insert, and counting the number
230 of state changes the decoder would need to perform, mindful of the
231 fact that the decoder's extended string will include only those
232 code points that have already been inserted. Section 6.3 "Encoding
233 procedure" gives a precise algorithm.
235 3.3 Generalized variable-length integers
237 In a conventional integer representation the base is the number of
238 distinct symbols for digits, whose values are 0 through base-1. Let
239 digit_0 denote the least significant digit, digit_1 the next least
240 significant, and so on. The value represented is the sum over j of
241 digit_j * w(j), where w(j) = base^j is the weight (scale factor)
242 for position j. For example, in the base 8 integer 437, the digits
243 are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is
244 7 + 3*8 + 4*64 = 287. This representation has two disadvantages:
245 First, there are multiple encodings of each value (because there
246 can be extra zeros in the most significant positions), which is
247 inconvenient when unique encodings are needed. Second, the integer
248 is not self-delimiting, so if multiple integers are concatenated the
249 boundaries between them are lost.
251 The generalized variable-length representation solves these two
252 problems. The digit values are still 0 through base-1, but now
253 the integer is self-delimiting by means of thresholds t(j), each
254 of which is in the range 0 through base-1. Exactly one digit, the
255 most significant, satisfies digit_j < t(j). Therefore, if several
256 integers are concatenated, it is easy to separate them, starting
257 with the first if they are little-endian (least significant digit
258 first), or starting with the last if they are big-endian (most
259 significant digit first). As before, the value is the sum over j of
260 digit_j * w(j), but the weights are different:
263 w(j) = w(j-1) * (base - t(j-1)) for j > 0
265 For example, consider the little-endian sequence of base 8 digits
266 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This
267 implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5)
268 = 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is
269 not less than 3, but 4 is less than 5, so 4 is the last digit. The
270 value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251,
271 with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation
272 is very similar to decoding a conventional integer: Start with a
273 current value of N = 0 and a weight w = 1. Fetch the next digit d
274 and increase N by d * w. If d is less than the current threshold
275 (t) then stop, otherwise increase w by a factor of (base - t),
276 update t for the next position, and repeat.
278 Encoding this representation is similar to encoding a conventional
279 integer: If N < t then output one digit for N and stop, otherwise
280 output the digit for t + ((N - t) mod (base - t)), then replace N
281 with (N - t) div (base - t), update t for the next position, and
284 For any particular set of values of t(j), there is exactly one
285 generalized variable-length representation of each nonnegative
288 Bootstring uses little-endian ordering so that the deltas can be
289 separated starting with the first. The t(j) values are defined in
290 terms of the constants base, tmin, and tmax, and a state variable
293 t(j) = base * (j + 1) - bias,
294 clamped to the range tmin through tmax
296 The clamping means that if the formula yields a value less than tmin
297 or greater than tmax, then t(j) = tmin or tmax, respectively. (In
298 the pseudocode in section 6 "Bootstring algorithms", the expression
299 base * (j + 1) is denoted by k for performance reasons.) These
300 t(j) values cause the representation to favor integers within a
301 particular range determined by the bias.
305 After each delta is encoded or decoded, bias is set for the next
308 1. Delta is scaled in order to avoid overflow in the next step:
310 let delta = delta div 2
312 But when this is the very first delta, the divisor is not 2, but
313 instead a constant called damp. This compensates for the fact
314 that the second delta is usually much smaller than the first.
316 2. Delta is increased to compensate for the fact that the next
317 delta will be inserting into a longer string:
319 let delta = delta + (delta div numpoints)
321 numpoints is the total number of code points encoded/decoded so
322 far (including the one corresponding to this delta itself, and
323 including the basic code points).
325 3. Delta is repeatedly divided until it falls within a threshold,
326 to predict the minimum number of digits needed to represent the
329 while delta > ((base - tmin) * tmax) div 2
330 do let delta = delta div (base - tmin)
335 (base * the number of divisions performed in step 3) +
336 (((base - tmin + 1) * delta) div (delta + skew))
338 The motivation for this procedure is that the current delta provides
339 a hint about the likely size of the next delta, and so t(j) is
340 set to tmax for the more significant digits starting with the one
341 expected to be last, tmin for the less significant digits up through
342 the one expected to be third-last, and somewhere between tmin and
343 tmax for the digit expected to be second-last (balancing the hope of
344 the expected-last digit being unnecessary against the danger of it
347 4. Bootstring parameters
349 Given a set of basic code points, one needs to be designated as
350 the delimiter. The base cannot be greater than the number of
351 distinguishable basic code points remaining. The digit-values in
352 the range 0 through base-1 need to be associated with distinct
353 non-delimiter basic code points. In some cases multiple code points
354 need to have the same digit-value; for example, uppercase and
355 lowercase versions of the same letter need to be equivalent if basic
356 strings are case-insensitive.
358 The initial value of n cannot be greater than the minimum non-basic
359 code point that could appear in extended strings.
361 The remaining five parameters (tmin, tmax, skew, damp, and the
362 initial value of bias) need to satisfy the following constraints:
364 0 <= tmin <= tmax <= base-1
367 initial_bias mod base <= base - tmin
369 Provided the constraints are satisfied, these five parameters affect
370 efficiency but not correctness. They are best chosen empirically.
372 If support for mixed-case annotation is desired (see appendix B),
373 make sure that the code points corresponding to 0 through tmax-1 all
374 have both uppercase and lowercase forms.
376 5. Parameter values for Punycode
378 Punycode uses the following Bootstring parameter values:
386 initial_n = 128 = 0x80
388 Although the only restriction Punycode imposes on the input integers
389 is that they be nonnegative, these parameters are especially
390 designed to work well with Unicode [UNICODE] code points, which
391 are integers in the range 0..10FFFF (but not D800..DFFF, which are
392 reserved for use by the UTF-16 encoding of Unicode). The basic code
393 points are the ASCII [ASCII] code points (0..7F), of which U+002D
394 (-) is the delimiter, and some of the others have digit-values as
397 code points digit-values
398 ------------ ----------------------
399 41..5A (A-Z) = 0 to 25, respectively
400 61..7A (a-z) = 0 to 25, respectively
401 30..39 (0-9) = 26 to 35, respectively
403 Using hyphen-minus as the delimiter implies that the encoded string
404 can end with a hyphen-minus only if the Unicode string consists
405 entirely of basic code points, but IDNA forbids such strings from
406 being encoded. The encoded string can begin with a hyphen-minus,
407 but IDNA prepends a prefix. Therefore IDNA using Punycode conforms
408 to the RFC 952 rule that host name labels neither begin nor end with
409 a hyphen-minus [RFC952].
411 A decoder MUST recognize the letters in both uppercase and lowercase
412 forms (including mixtures of both forms). An encoder SHOULD output
413 only uppercase forms or only lowercase forms, unless it uses
414 mixed-case annotation (see appendix B).
416 Presumably most users will not manually write or type encoded
417 strings (as opposed to cutting and pasting them), but those who do
418 will need to be alert to the potential visual ambiguity between the
419 following sets of characters:
428 Such ambiguities are usually resolved by context, but in a Punycode
429 encoded string there is no context apparent to humans.
431 6. Bootstring algorithms
433 Some parts of the pseudocode can be omitted if the parameters
434 satisfy certain conditions (for which Punycode qualifies). These
435 parts are enclosed in {braces}, and notes immediately following the
436 pseudocode explain the conditions under which they can be omitted.
438 Formally, code points are integers, and hence the pseudocode assumes
439 that arithmetic operations can be performed directly on code points.
440 In some programming languages, explicit conversion between code
441 points and integers might be necessary.
443 6.1 Bias adaptation function
445 function adapt(delta,numpoints,firsttime):
446 if firsttime then let delta = delta div damp
447 else let delta = delta div 2
448 let delta = delta + (delta div numpoints)
450 while delta > ((base - tmin) * tmax) div 2 do begin
451 let delta = delta div (base - tmin)
454 return k + (((base - tmin + 1) * delta) div (delta + skew))
456 It does not matter whether the modifications to delta and k
457 inside adapt() affect variables of the same name inside the
458 encoding/decoding procedures, because after calling adapt() the
459 caller does not read those variables before overwriting them.
461 6.2 Decoding procedure
465 let bias = initial_bias
466 let output = an empty string indexed from 0
467 consume all code points before the last delimiter (if there is one)
468 and copy them to output, fail on any non-basic code point
469 if more than zero code points were consumed then consume one more
470 (which will be the last delimiter)
471 while the input is not exhausted do begin
474 for k = base to infinity in steps of base do begin
475 consume a code point, or fail if there was none to consume
476 let digit = the code point's digit-value, fail if it has none
477 let i = i + digit * w, fail on overflow
478 let t = tmin if k <= bias {+ tmin}, or
479 tmax if k >= bias + tmax, or k - bias otherwise
480 if digit < t then break
481 let w = w * (base - t), fail on overflow
483 let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
484 let n = n + i div (length(output) + 1), fail on overflow
485 let i = i mod (length(output) + 1)
486 {if n is a basic code point then fail}
487 insert n into output at position i
491 The full statement enclosed in braces (checking whether n is a basic
492 code point) can be omitted if initial_n exceeds all basic code
493 points (which is true for Punycode), because n is never less than
496 In the assignment of t, where t is clamped to the range tmin through
497 tmax, "+ tmin" can always be omitted. This makes the clamping
498 calculation incorrect when bias < k < bias + tmin, but that cannot
499 happen because of the way bias is computed and because of the
500 constraints on the parameters.
502 Because the decoder state can only advance monotonically, and there
503 is only one representation of any delta, there is therefore only
504 one encoded string that can represent a given sequence of integers.
505 The only error conditions are invalid code points, unexpected
506 end-of-input, overflow, and basic code points encoded using deltas
507 instead of appearing literally. If the decoder fails on these
508 errors as shown above, then it cannot produce the same output for
509 two distinct inputs. Without this property it would have been
510 necessary to re-encode the output and verify that it matches the
511 input in order to guarantee the uniqueness of the encoding.
513 6.3 Encoding procedure
517 let bias = initial_bias
518 let h = b = the number of basic code points in the input
519 copy them to the output in order, followed by a delimiter if b > 0
520 {if the input contains a non-basic code point < n then fail}
521 while h < length(input) do begin
522 let m = the minimum {non-basic} code point >= n in the input
523 let delta = delta + (m - n) * (h + 1), fail on overflow
525 for each code point c in the input (in order) do begin
526 if c < n {or c is basic} then increment delta, fail on overflow
529 for k = base to infinity in steps of base do begin
530 let t = tmin if k <= bias {+ tmin}, or
531 tmax if k >= bias + tmax, or k - bias otherwise
533 output the code point for digit t + ((q - t) mod (base - t))
534 let q = (q - t) div (base - t)
536 output the code point for digit q
537 let bias = adapt(delta, h + 1, test h equals b?)
542 increment delta and n
545 The full statement enclosed in braces (checking whether the input
546 contains a non-basic code point less than n) can be omitted if all
547 code points less than initial_n are basic code points (which is true
548 for Punycode if code points are unsigned).
550 The brace-enclosed conditions "non-basic" and "or c is basic" can be
551 omitted if initial_n exceeds all basic code points (which is true
552 for Punycode), because the code point being tested is never less
555 In the assignment of t, where t is clamped to the range tmin through
556 tmax, "+ tmin" can always be omitted. This makes the clamping
557 calculation incorrect when bias < k < bias + tmin, but that cannot
558 happen because of the way bias is computed and because of the
559 constraints on the parameters.
561 The checks for overflow are necessary to avoid producing invalid
562 output when the input contains very large values or is very long.
564 The increment of delta at the bottom of the outer loop cannot
565 overflow because delta < length(input) before the increment, and
566 length(input) is already assumed to be representable. The increment
567 of n could overflow, but only if h == length(input), in which case
568 the procedure is finished anyway.
570 6.4 Overflow handling
572 For IDNA, 26-bit unsigned integers are sufficient to handle all
573 valid IDNA labels without overflow, because any string that
574 needed a 27-bit delta would have to exceed either the code point
575 limit (0..10FFFF) or the label length limit (63 characters).
576 However, overflow handling is necessary because the inputs are not
577 necessarily valid IDNA labels.
579 If the programming language does not provide overflow detection,
580 the following technique can be used. Suppose A, B, and C are
581 representable nonnegative integers and C is nonzero. Then A + B
582 overflows if and only if B > maxint - A, and A + (B * C) overflows
583 if and only if B > (maxint - A) div C, where maxint is the greatest
584 integer for which maxint + 1 cannot be represented. Refer to
585 appendix D "Punycode sample implementation" for demonstrations of
586 this technique in the C language.
588 The decoding and encoding algorithms shown in sections 6.2 and
589 6.3 handle overflow by detecting it whenever it happens. Another
590 approach is to enforce limits on the inputs that prevent overflow
591 from happening. For example, if the encoder were to verify that
592 no input code points exceed M and that the input length does not
593 exceed L, then no delta could ever exceed (M - initial_n) * (L + 1),
594 and hence no overflow could occur if integer variables were capable
595 of representing values that large. This prevention approach would
596 impose more restrictions on the input than the detection approach
597 does, but might be considered simpler in some programming languages.
599 In theory, the decoder could use an analogous approach, limiting the
600 number of digits in a variable-length integer (that is, limiting the
601 number of iterations in the innermost loop). However, the number
602 of digits that suffice to represent a given delta can sometimes
603 represent much larger deltas (because of the adaptation), and hence
604 this approach would probably need integers wider than 32 bits.
606 Yet another approach for the decoder is to allow overflow to occur,
607 but to check the final output string by re-encoding it and comparing
608 to the decoder input. If and only if they do not match (using a
609 case-insensitive ASCII comparison) overflow has occurred. This
610 delayed-detection approach would not impose any more restrictions on
611 the input than the immediate-detection approach does, and might be
612 considered simpler in some programming languages.
614 In fact, if the decoder is used only inside the IDNA ToUnicode
615 operation [IDNA], then it need not check for overflow at all,
616 because ToUnicode performs a higher level re-encoding and
617 comparison, and a mismatch has the same consequence as if the
618 Punycode decoder had failed.
624 In the Punycode encodings below, the ACE prefix is not shown.
625 Backslashes show where line breaks have been inserted in strings too
628 The first several examples are all translations of the sentence "Why
629 can't they just speak in <language>?" (courtesy of Michael Kaplan's
630 "provincial" page [PROVINCIAL]). Word breaks and punctuation have
631 been removed, as is often done in domain names.
633 (A) Arabic (Egyptian):
634 u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
635 u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
636 Punycode: egbpdaj6bu4bxfgehfvwxn
638 (B) Chinese (simplified):
639 u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
640 Punycode: ihqwcrb4cv8a8dqg056pqjye
642 (C) Chinese (traditional):
643 u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
644 Punycode: ihqwctvzc91f659drss3x8bo0yb
646 (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
647 U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
648 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
649 u+0065 u+0073 u+006B u+0079
650 Punycode: Proprostnemluvesky-uyb24dma41a
653 u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
654 u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
655 u+05D1 u+05E8 u+05D9 u+05EA
656 Punycode: 4dbcagdahymbxekheh6e0a7fei0b
658 (F) Hindi (Devanagari):
659 u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
660 u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
661 u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
663 Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
665 (G) Japanese (kanji and hiragana):
666 u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
667 u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
668 Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
670 (H) Korean (Hangul syllables):
671 u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
672 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
673 u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
674 Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
677 (I) Russian (Cyrillic):
678 U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
679 u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
680 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
682 Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
684 (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
685 U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
686 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
687 u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
688 u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
689 u+0061 u+00F1 u+006F u+006C
690 Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a
693 T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
694 <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
695 U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
696 u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
697 u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
698 U+0056 u+0069 u+1EC7 u+0074
699 Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g
701 The next several examples are all names of Japanese music artists,
702 song titles, and TV programs, just because the author happens to
703 have them handy (but Japanese is useful for providing examples
704 of single-row text, two-row text, ideographic text, and various
707 (L) 3<nen>B<gumi><kinpachi><sensei>
708 u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
709 Punycode: 3B-ww4c5e180e575a65lsy2b
711 (M) <amuro><namie>-with-SUPER-MONKEYS
712 u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
713 u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
714 U+004F U+004E U+004B U+0045 U+0059 U+0053
715 Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n
717 (N) Hello-Another-Way-<sorezore><no><basho>
718 U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
719 u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
720 u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
721 Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b
723 (O) <hitotsu><yane><no><shita>2
724 u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
725 Punycode: 2-u9tlzr9756bt3uc0v
727 (P) Maji<de>Koi<suru>5<byou><mae>
728 U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
729 u+308B u+0035 u+79D2 u+524D
730 Punycode: MajiKoi5-783gue6qz075azm5e
733 u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
734 Punycode: de-jg4avhby1noc0d
736 (R) <sono><supiido><de>
737 u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
738 Punycode: d9juau41awczczp
740 The last example is an ASCII string that breaks the existing rules
741 for host name labels. (It is not a realistic example for IDNA,
742 because IDNA never encodes pure ASCII labels.)
745 u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
747 Punycode: -> $1.00 <--
751 In the following traces, the evolving state of the decoder is
752 shown as a sequence of hexadecimal values, representing the code
753 points in the extended string. An asterisk appears just after the
754 most recently inserted code point, indicating both n (the value
755 preceeding the asterisk) and i (the position of the value just after
756 the asterisk). Other numerical values are decimal.
758 Decoding trace of example B from section 7.1:
760 n is 128, i is 0, bias is 72
761 input is "ihqwcrb4cv8a8dqg056pqjye"
762 there is no delimiter, so extended string starts empty
763 delta "ihq" decodes to 19853
766 delta "wc" decodes to 64
769 delta "rb" decodes to 37
772 delta "4c" decodes to 56
774 4E3A 4E48 * 4E0D 4E2D
775 delta "v8a" decodes to 599
777 4E3A 4EC0 * 4E48 4E0D 4E2D
778 delta "8d" decodes to 130
780 4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
781 delta "qg" decodes to 154
783 4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
784 delta "056p" decodes to 46301
786 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
787 delta "qjye" decodes to 88531
789 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
791 Decoding trace of example L from section 7.1:
793 n is 128, i is 0, bias is 72
794 input is "3B-ww4c5e180e575a65lsy2b"
795 literal portion is "3B-", so extended string starts as:
797 delta "ww4c" decodes to 62042
800 delta "5e" decodes to 139
802 0033 0042 516B * 5148
803 delta "180e" decodes to 16683
805 0033 5E74 * 0042 516B 5148
806 delta "575a" decodes to 34821
808 0033 5E74 0042 516B 5148 751F *
809 delta "65l" decodes to 14592
811 0033 5E74 0042 7D44 * 516B 5148 751F
812 delta "sy2b" decodes to 42088
814 0033 5E74 0042 7D44 91D1 * 516B 5148 751F
818 In the following traces, code point values are hexadecimal, while
819 other numerical values are decimal.
821 Encoding trace of example B from section 7.1:
825 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
826 there are no basic code points, so no literal portion
827 next code point to insert is 4E0D
828 needed delta is 19853, encodes as "ihq"
830 next code point to insert is 4E2D
831 needed delta is 64, encodes as "wc"
833 next code point to insert is 4E3A
834 needed delta is 37, encodes as "rb"
836 next code point to insert is 4E48
837 needed delta is 56, encodes as "4c"
839 next code point to insert is 4EC0
840 needed delta is 599, encodes as "v8a"
842 next code point to insert is 4ED6
843 needed delta is 130, encodes as "8d"
845 next code point to insert is 4EEC
846 needed delta is 154, encodes as "qg"
848 next code point to insert is 6587
849 needed delta is 46301, encodes as "056p"
851 next code point to insert is 8BF4
852 needed delta is 88531, encodes as "qjye"
854 output is "ihqwcrb4cv8a8dqg056pqjye"
856 Encoding trace of example L from section 7.1:
860 0033 5E74 0042 7D44 91D1 516B 5148 751F
861 basic code points (0033, 0042) are copied to literal portion: "3B-"
862 next code point to insert is 5148
863 needed delta is 62042, encodes as "ww4c"
865 next code point to insert is 516B
866 needed delta is 139, encodes as "5e"
868 next code point to insert is 5E74
869 needed delta is 16683, encodes as "180e"
871 next code point to insert is 751F
872 needed delta is 34821, encodes as "575a"
874 next code point to insert is 7D44
875 needed delta is 14592, encodes as "65l"
877 next code point to insert is 91D1
878 needed delta is 42088, encodes as "sy2b"
880 output is "3B-ww4c5e180e575a65lsy2b"
882 8. Security considerations
884 Users expect each domain name in DNS to be controlled by a single
885 authority. If a Unicode string intended for use as a domain label
886 could map to multiple ACE labels, then an internationalized domain
887 name could map to multiple ASCII domain names, each controlled by
888 a different authority, some of which could be spoofs that hijack
889 service requests intended for another. Therefore Punycode is
890 designed so that each Unicode string has a unique encoding.
892 However, there can still be multiple Unicode representations of the
893 "same" text, for various definitions of "same". This problem is
894 addressed to some extent by the Unicode standard under the topic of
895 canonicalization, and this work is leveraged for domain names by
898 9. References (non-normative)
900 [ASCII] Vint Cerf, "ASCII format for Network Interchange",
903 [IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello,
904 "Internationalizing Domain Names In Applications (IDNA)",
907 [NAMEPREP] Paul Hoffman, Marc Blanchet, "Nameprep: A
908 Stringprep Profile for Internationalized Domain Names",
909 draft-ietf-idn-nameprep.
911 [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
912 http://www.trigeminal.com/samples/provincial.html.
914 [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
915 Table Specification", 1985-Oct, RFC 952.
917 [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
920 [UNICODE] The Unicode Consortium, "The Unicode Standard",
921 http://www.unicode.org/unicode/standard/standard.html.
923 A. Author contact information
926 University of California, Berkeley
927 http://www.nicemice.net/amc/
929 B. Mixed-case annotation
931 In order to use Punycode to represent case-insensitive strings,
932 higher layers need to case-fold the strings prior to Punycode
933 encoding. The encoded string can use mixed case as an annotation
934 telling how to convert the folded string into a mixed-case string
935 for display purposes. Note, however, that mixed-case annotation
936 is not used by the ToASCII and ToUnicode operations specified in
937 [IDNA], and therefore implementors of IDNA can disregard this
940 Basic code points can use mixed case directly, because the decoder
941 copies them verbatim, leaving lowercase code points lowercase, and
942 leaving uppercase code points uppercase. Each non-basic code point
943 is represented by a delta, which is represented by a sequence of
944 basic code points, the last of which provides the annotation. If it
945 is uppercase, it is a suggestion to map the non-basic code point to
946 uppercase (if possible); if it is lowercase, it is a suggestion to
947 map the non-basic code point to lowercase (if possible).
949 These annotations do not alter the code points returned by decoders;
950 the annotations are returned separately, for the caller to use or
951 ignore. Encoders can accept annotations in addition to code points,
952 but the annotations do not alter the output, except to influence the
953 uppercase/lowercase form of ASCII letters.
955 Punycode encoders and decoders need not support these annotations,
956 and higher layers need not use them.
958 C. Disclaimer and license
960 Regarding this entire document or any portion of it (including
961 the pseudocode and C code), the author makes no guarantees and
962 is not responsible for any damage resulting from its use. The
963 author grants irrevocable permission to anyone to use, modify,
964 and distribute it in any way that does not diminish the rights
965 of anyone else to use, modify, and distribute it, provided that
966 redistributed derivative works do not contain misleading author or
967 version information. Derivative works need not be licensed under
971 D. Punycode sample implementation
974 punycode.c from draft-ietf-idn-punycode-03
975 http://www.nicemice.net/idn/
977 http://www.nicemice.net/amc/
979 This is ANSI C code (C89) implementing
980 Punycode (draft-ietf-idn-punycode-03).
985 /************************************************************/
986 /* Public interface (would normally go in its own .h file): */
990 enum punycode_status {
992 punycode_bad_input, /* Input is invalid. */
993 punycode_big_output, /* Output would exceed the space provided. */
994 punycode_overflow /* Input needs wider integers to process. */
997 #if UINT_MAX >= (1 << 26) - 1
998 typedef unsigned int punycode_uint;
1000 typedef unsigned long punycode_uint;
1003 enum punycode_status punycode_encode(
1004 punycode_uint input_length,
1005 const punycode_uint input[],
1006 const unsigned char case_flags[],
1007 punycode_uint *output_length,
1010 /* punycode_encode() converts Unicode to Punycode. The input */
1011 /* is represented as an array of Unicode code points (not code */
1012 /* units; surrogate pairs are not allowed), and the output */
1013 /* will be represented as an array of ASCII code points. The */
1014 /* output string is *not* null-terminated; it will contain */
1015 /* zeros if and only if the input contains zeros. (Of course */
1016 /* the caller can leave room for a terminator and add one if */
1017 /* needed.) The input_length is the number of code points in */
1018 /* the input. The output_length is an in/out argument: the */
1019 /* caller passes in the maximum number of code points that it */
1020 /* can receive, and on successful return it will contain the */
1021 /* number of code points actually output. The case_flags array */
1022 /* holds input_length boolean values, where nonzero suggests that */
1023 /* the corresponding Unicode character be forced to uppercase */
1024 /* after being decoded (if possible), and zero suggests that */
1025 /* it be forced to lowercase (if possible). ASCII code points */
1026 /* are encoded literally, except that ASCII letters are forced */
1027 /* to uppercase or lowercase according to the corresponding */
1028 /* uppercase flags. If case_flags is a null pointer then ASCII */
1029 /* letters are left as they are, and other code points are */
1030 /* treated as if their uppercase flags were zero. The return */
1031 /* value can be any of the punycode_status values defined above */
1032 /* except punycode_bad_input; if not punycode_success, then */
1033 /* output_size and output might contain garbage. */
1035 enum punycode_status punycode_decode(
1036 punycode_uint input_length,
1038 punycode_uint *output_length,
1039 punycode_uint output[],
1040 unsigned char case_flags[] );
1042 /* punycode_decode() converts Punycode to Unicode. The input is */
1043 /* represented as an array of ASCII code points, and the output */
1044 /* will be represented as an array of Unicode code points. The */
1045 /* input_length is the number of code points in the input. The */
1046 /* output_length is an in/out argument: the caller passes in */
1047 /* the maximum number of code points that it can receive, and */
1048 /* on successful return it will contain the actual number of */
1049 /* code points output. The case_flags array needs room for at */
1050 /* least output_length values, or it can be a null pointer if the */
1051 /* case information is not needed. A nonzero flag suggests that */
1052 /* the corresponding Unicode character be forced to uppercase */
1053 /* by the caller (if possible), while zero suggests that it be */
1054 /* forced to lowercase (if possible). ASCII code points are */
1055 /* output already in the proper case, but their flags will be set */
1056 /* appropriately so that applying the flags would be harmless. */
1057 /* The return value can be any of the punycode_status values */
1058 /* defined above; if not punycode_success, then output_length, */
1059 /* output, and case_flags might contain garbage. On success, the */
1060 /* decoder will never need to write an output_length greater than */
1061 /* input_length, because of how the encoding is defined. */
1063 /**********************************************************/
1064 /* Implementation (would normally go in its own .c file): */
1068 /*** Bootstring parameters for Punycode ***/
1070 enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
1071 initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
1073 /* basic(cp) tests whether cp is a basic code point: */
1074 #define basic(cp) ((punycode_uint)(cp) < 0x80)
1076 /* delim(cp) tests whether cp is a delimiter: */
1077 #define delim(cp) ((cp) == delimiter)
1079 /* decode_digit(cp) returns the numeric value of a basic code */
1080 /* point (for use in representing integers) in the range 0 to */
1081 /* base-1, or base if cp is does not represent a value. */
1083 static punycode_uint decode_digit(punycode_uint cp)
1085 return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
1086 cp - 97 < 26 ? cp - 97 : base;
1089 /* encode_digit(d,flag) returns the basic code point whose value */
1090 /* (when used for representing integers) is d, which needs to be in */
1091 /* the range 0 to base-1. The lowercase form is used unless flag is */
1092 /* nonzero, in which case the uppercase form is used. The behavior */
1093 /* is undefined if flag is nonzero and digit d has no uppercase form. */
1095 static char encode_digit(punycode_uint d, int flag)
1097 return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
1098 /* 0..25 map to ASCII a..z or A..Z */
1099 /* 26..35 map to ASCII 0..9 */
1102 /* flagged(bcp) tests whether a basic code point is flagged */
1103 /* (uppercase). The behavior is undefined if bcp is not a */
1104 /* basic code point. */
1106 #define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
1108 /* encode_basic(bcp,flag) forces a basic code point to lowercase */
1109 /* if flag is zero, uppercase if flag is nonzero, and returns */
1110 /* the resulting code point. The code point is unchanged if it */
1111 /* is caseless. The behavior is undefined if bcp is not a basic */
1114 static char encode_basic(punycode_uint bcp, int flag)
1116 bcp -= (bcp - 97 < 26) << 5;
1117 return bcp + ((!flag && (bcp - 65 < 26)) << 5);
1120 /*** Platform-specific constants ***/
1122 /* maxint is the maximum value of a punycode_uint variable: */
1123 static const punycode_uint maxint = -1;
1124 /* Because maxint is unsigned, -1 becomes the maximum value. */
1126 /*** Bias adaptation function ***/
1128 static punycode_uint adapt(
1129 punycode_uint delta, punycode_uint numpoints, int firsttime )
1133 delta = firsttime ? delta / damp : delta >> 1;
1134 /* delta >> 1 is a faster way of doing delta / 2 */
1135 delta += delta / numpoints;
1137 for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) {
1138 delta /= base - tmin;
1141 return k + (base - tmin + 1) * delta / (delta + skew);
1144 /*** Main encode function ***/
1146 enum punycode_status punycode_encode(
1147 punycode_uint input_length,
1148 const punycode_uint input[],
1149 const unsigned char case_flags[],
1150 punycode_uint *output_length,
1153 punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
1155 /* Initialize the state: */
1159 max_out = *output_length;
1160 bias = initial_bias;
1162 /* Handle the basic code points: */
1164 for (j = 0; j < input_length; ++j) {
1165 if (basic(input[j])) {
1166 if (max_out - out < 2) return punycode_big_output;
1168 case_flags ? encode_basic(input[j], case_flags[j]) : input[j];
1170 /* else if (input[j] < n) return punycode_bad_input; */
1171 /* (not needed for Punycode with unsigned code points) */
1176 /* h is the number of code points that have been handled, b is the */
1177 /* number of basic code points, and out is the number of characters */
1178 /* that have been output. */
1180 if (b > 0) output[out++] = delimiter;
1182 /* Main encoding loop: */
1184 while (h < input_length) {
1185 /* All non-basic code points < n have been */
1186 /* handled already. Find the next larger one: */
1188 for (m = maxint, j = 0; j < input_length; ++j) {
1189 /* if (basic(input[j])) continue; */
1190 /* (not needed for Punycode) */
1191 if (input[j] >= n && input[j] < m) m = input[j];
1194 /* Increase delta enough to advance the decoder's */
1195 /* <n,i> state to <m,0>, but guard against overflow: */
1197 if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
1198 delta += (m - n) * (h + 1);
1201 for (j = 0; j < input_length; ++j) {
1202 /* Punycode does not need to check whether input[j] is basic: */
1203 if (input[j] < n /* || basic(input[j]) */ ) {
1204 if (++delta == 0) return punycode_overflow;
1207 if (input[j] == n) {
1208 /* Represent delta as a generalized variable-length integer: */
1210 for (q = delta, k = base; ; k += base) {
1211 if (out >= max_out) return punycode_big_output;
1212 t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
1213 k >= bias + tmax ? tmax : k - bias;
1215 output[out++] = encode_digit(t + (q - t) % (base - t), 0);
1216 q = (q - t) / (base - t);
1219 output[out++] = encode_digit(q, case_flags && case_flags[j]);
1220 bias = adapt(delta, h + 1, h == b);
1229 *output_length = out;
1230 return punycode_success;
1233 /*** Main decode function ***/
1235 enum punycode_status punycode_decode(
1236 punycode_uint input_length,
1238 punycode_uint *output_length,
1239 punycode_uint output[],
1240 unsigned char case_flags[] )
1242 punycode_uint n, out, i, max_out, bias,
1243 b, j, in, oldi, w, k, digit, t;
1245 /* Initialize the state: */
1249 max_out = *output_length;
1250 bias = initial_bias;
1252 /* Handle the basic code points: Let b be the number of input code */
1253 /* points before the last delimiter, or 0 if there is none, then */
1254 /* copy the first b code points to the output. */
1256 for (b = j = 0; j < input_length; ++j) if (delim(input[j])) b = j;
1257 if (b > max_out) return punycode_big_output;
1259 for (j = 0; j < b; ++j) {
1260 if (case_flags) case_flags[out] = flagged(input[j]);
1261 if (!basic(input[j])) return punycode_bad_input;
1262 output[out++] = input[j];
1265 /* Main decoding loop: Start just after the last delimiter if any */
1266 /* basic code points were copied; start at the beginning otherwise. */
1268 for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) {
1270 /* in is the index of the next character to be consumed, and */
1271 /* out is the number of code points in the output array. */
1273 /* Decode a generalized variable-length integer into delta, */
1274 /* which gets added to i. The overflow checking is easier */
1275 /* if we increase i as we go, then subtract off its starting */
1276 /* value at the end to obtain delta. */
1278 for (oldi = i, w = 1, k = base; ; k += base) {
1279 if (in >= input_length) return punycode_bad_input;
1280 digit = decode_digit(input[in++]);
1281 if (digit >= base) return punycode_bad_input;
1282 if (digit > (maxint - i) / w) return punycode_overflow;
1284 t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
1285 k >= bias + tmax ? tmax : k - bias;
1286 if (digit < t) break;
1287 if (w > maxint / (base - t)) return punycode_overflow;
1291 bias = adapt(i - oldi, out + 1, oldi == 0);
1293 /* i was supposed to wrap around from out+1 to 0, */
1294 /* incrementing n each time, so we'll fix that now: */
1296 if (i / (out + 1) > maxint - n) return punycode_overflow;
1300 /* Insert n at position i of the output: */
1302 /* not needed for Punycode: */
1303 /* if (decode_digit(n) <= base) return punycode_invalid_input; */
1304 if (out >= max_out) return punycode_big_output;
1307 memmove(case_flags + i + 1, case_flags + i, out - i);
1308 /* Case of last character determines uppercase flag: */
1309 case_flags[i] = flagged(input[in - 1]);
1312 memmove(output + i + 1, output + i, (out - i) * sizeof *output);
1316 *output_length = out;
1317 return punycode_success;
1321 /******************************************************************/
1322 /* Wrapper for testing (would normally go in a separate .c file): */
1329 /* For testing, we'll just set some compile-time limits rather than */
1330 /* use malloc(), and set a compile-time option rather than using a */
1331 /* command-line option. */
1334 unicode_max_length = 256,
1335 ace_max_length = 256
1338 static void usage(char **argv)
1342 "%s -e reads code points and writes a Punycode string.\n"
1343 "%s -d reads a Punycode string and writes code points.\n"
1345 "Input and output are plain text in the native character set.\n"
1346 "Code points are in the form u+hex separated by whitespace.\n"
1347 "Although the specification allows Punycode strings to contain\n"
1348 "any characters from the ASCII repertoire, this test code\n"
1349 "supports only the printable characters, and needs the Punycode\n"
1350 "string to be followed by a newline.\n"
1351 "The case of the u in u+hex is the force-to-uppercase flag.\n"
1352 , argv[0], argv[0]);
1357 static void fail(const char *msg)
1363 static const char too_big[] =
1364 "input or output is too large, recompile with larger limits\n";
1365 static const char invalid_input[] = "invalid input\n";
1366 static const char overflow[] = "arithmetic overflow\n";
1367 static const char io_error[] = "I/O error\n";
1370 /* The following string is used to convert printable */
1371 /* characters between ASCII and the native charset: */
1373 static const char print_ascii[] =
1374 "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1375 "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1381 "pqrstuvwxyz{|}~\n";
1384 int main(int argc, char **argv)
1386 enum punycode_status status;
1388 unsigned int input_length, output_length, j;
1389 unsigned char case_flags[unicode_max_length];
1391 if (argc != 2) usage(argv);
1392 if (argv[1][0] != '-') usage(argv);
1393 if (argv[1][2] != 0) usage(argv);
1395 if (argv[1][1] == 'e') {
1396 punycode_uint input[unicode_max_length];
1397 unsigned long codept;
1398 char output[ace_max_length+1], uplus[3];
1401 /* Read the input code points: */
1406 r = scanf("%2s%lx", uplus, &codept);
1407 if (ferror(stdin)) fail(io_error);
1408 if (r == EOF || r == 0) break;
1410 if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {
1411 fail(invalid_input);
1414 if (input_length == unicode_max_length) fail(too_big);
1416 if (uplus[0] == 'u') case_flags[input_length] = 0;
1417 else if (uplus[0] == 'U') case_flags[input_length] = 1;
1418 else fail(invalid_input);
1420 input[input_length++] = codept;
1425 output_length = ace_max_length;
1426 status = punycode_encode(input_length, input, case_flags,
1427 &output_length, output);
1428 if (status == punycode_bad_input) fail(invalid_input);
1429 if (status == punycode_big_output) fail(too_big);
1430 if (status == punycode_overflow) fail(overflow);
1431 assert(status == punycode_success);
1433 /* Convert to native charset and output: */
1435 for (j = 0; j < output_length; ++j) {
1437 assert(c >= 0 && c <= 127);
1438 if (print_ascii[c] == 0) fail(invalid_input);
1439 output[j] = print_ascii[c];
1444 if (r == EOF) fail(io_error);
1445 return EXIT_SUCCESS;
1448 if (argv[1][1] == 'd') {
1449 char input[ace_max_length+2], *p, *pp;
1450 punycode_uint output[unicode_max_length];
1452 /* Read the Punycode input string and convert to ASCII: */
1454 fgets(input, ace_max_length+2, stdin);
1455 if (ferror(stdin)) fail(io_error);
1456 if (feof(stdin)) fail(invalid_input);
1457 input_length = strlen(input) - 1;
1458 if (input[input_length] != '\n') fail(too_big);
1459 input[input_length] = 0;
1461 for (p = input; *p != 0; ++p) {
1462 pp = strchr(print_ascii, *p);
1463 if (pp == 0) fail(invalid_input);
1464 *p = pp - print_ascii;
1469 output_length = unicode_max_length;
1470 status = punycode_decode(input_length, input, &output_length,
1471 output, case_flags);
1472 if (status == punycode_bad_input) fail(invalid_input);
1473 if (status == punycode_big_output) fail(too_big);
1474 if (status == punycode_overflow) fail(overflow);
1475 assert(status == punycode_success);
1477 /* Output the result: */
1479 for (j = 0; j < output_length; ++j) {
1480 r = printf("%s+%04lX\n",
1481 case_flags[j] ? "U" : "u",
1482 (unsigned long) output[j] );
1483 if (r < 0) fail(io_error);
1486 return EXIT_SUCCESS;
1490 return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
1495 INTERNET-DRAFT expires 2003-Apr-08