2 <!DOCTYPE rfc SYSTEM "rfc2629.dtd">
6 <rfc category="info" ipr="full2026"
7 docName="draft-josefsson-idn-test-vectors">
11 <title>Nameprep and IDNA Test Vectors</title>
13 <author initials="S." surname="Josefsson" fullname="Simon Josefsson">
14 <organization></organization>
16 <email>simon@josefsson.org</email>
20 <date month="February" year="2003"/>
24 <t>This document contains test vectors for Nameprep and IDNA.</t>
32 <section title="Introduction">
<t>The Nameprep and IDNA specifications lack thorough examples that
would have aided in implementing them. This document acts as a
complement to those specifications by providing such examples.</t>
<t>This document is not normative, and thus any errors in this
document should not be treated as gospel that defines Nameprep or
IDNA. When the output generated by conforming to the specifications
conflicts with the corresponding values in this document,
implementations should conform to the specifications.</t>
44 <t><vspace blankLines="10000" /></t>
48 <section title="Format of Nameprep Test Vectors">
50 <t>The tests follow a certain syntax, described here by showing one
51 complete example with comments intermixed. The comments are prefixed
52 with the '#' character.</t>
56 # First the (UTF-8) string is printed as a C octet string, with
57 # characters [A-Za-z .0-9] shown inline and other characters shown
58 # escaped with \xAB where AB is the hex sequence of that octet. The
# number of octets is also shown.
64 # The input is also printed as Unicode codepoints.
# After printing the input, the nameprep steps start. When the
70 # string is modified, the specific operation that caused it is printed
71 # along with the new string of Unicode code points.
73 # 1) Map -- For each character in the input, check if it has a mapping
74 # and, if so, replace it with its mapping. This is described in
77 Table B.2 maps U+1fb7 to U+03b1 U+0342 U+03b9.
80 # 2) Normalize -- Possibly normalize the result of step 1 using Unicode
81 # normalization. This is described in section 4.
83 Unicode normalization with form KC maps string into:
86 # 3) Prohibit -- Check for any characters that are not allowed in the
87 # output. If any are found, return an error. This is described in
90 # 4) Check bidi -- Possibly check for right-to-left characters, and if
91 # any are found, make sure that the whole string satisfies the
92 # requirements for bidirectional strings. If the string does not
93 # satisfy the requirements for bidirectional strings, return an
94 # error. This is described in section 6.
96 # 1) The characters in section 5.8 MUST be prohibited.
98 # 2) If a string contains any RandALCat character, the string MUST NOT
99 # contain any LCat character.
101 # 3) If a string contains any RandALCat character, a RandALCat
102 # character MUST be the first character of the string, and a
103 # RandALCat character MUST be the last character of the string.
105 # The output is printed as Unicode codepoints.
110 # And finally the output is printed as UTF-8
112 out (length 5 bytes):
119 <section title="Format of IDNA Test Vectors">
121 <t>The tests follow a certain syntax, described here by showing one
122 complete example with comments intermixed. The comments are prefixed
123 with the '#' character.</t>
127 # First the (UTF-8) string is printed as a C octet string, with
128 # characters [A-Za-z .0-9] shown inline and other characters shown
129 # escaped with \xAB where AB is the hex sequence of that octet. The
# number of octets is also shown.
132 in (length 39 bytes):
133 'Hello\x2DAnother\x2DWa'
134 'y\x2D\xE3\x81\x9D\xE3\x82\x8C\xE3\x81\x9E\xE3\x82\x8C\xE3\x81'
'\xAE\xE5\xA0\xB4\xE6\x89\x80'
137 # The input is also printed as Unicode codepoints.
140 U+0048 U+0065 U+006c U+006c U+006f U+002d U+0041 U+006e
141 U+006f U+0074 U+0068 U+0065 U+0072 U+002d U+0057 U+0061
142 U+0079 U+002d U+305d U+308c U+305e U+308c U+306e U+5834
145 # After printing the input, the IDNA ToASCII step starts. The output
146 # is printed as an ASCII string.
148 out: xn--hello-another-way--fc4qua05auwb3674vfr0b
153 <t><vspace blankLines="10000" /></t>
157 <section title="Nameprep Test Vectors">
159 <?rfc include="foo"?>
163 <section title="IDNA ToASCII Test Vectors">
165 <?rfc include="bar"?>
169 <section title="IDNA ToUnicode Test Vectors">
174 <section title="Auxiliary Test Vectors">
<t>These test vectors do not test Nameprep or IDNA proper; rather,
they test the UTF-8 handling of software. Instead of outputting the
indicated Unicode code point, implementations should raise an error
stating that the input was invalid.</t>
181 <section title="Incorrect UTF-8 encoding of U+00DF">
194 <section title="Incorrect UTF-8 encoding of U+01F0">
209 <section title="Security Considerations">
211 <t>The security considerations from Nameprep and IDNA are
214 <t>These test vectors are not believed to introduce new security
215 considerations nor disrupt the operation of the Internet, but may
216 expose security weaknesses in existing implementations. Any such
217 incident should not be regarded as a problem with this document,
218 though, but rather taken as evidence that this document served its
227 <note title="Acknowledgments">
228 <t>Some IDNA test vectors were borrowed from Punycode <xref
229 target="RFC3492" />.</t>
232 <section title="Nameprep test vectors in C syntax">
234 <t>In order to avoid having implementors type in the test vectors
235 above, a C structure with the data is provided.</t>
<t>The comment field holds the section title used in this document. The
in field contains a UTF-8 encoded string. The out field contains the
expected output, or NULL if the expected result is an error. The
profile field can be ignored. The only significant setting for the
flags field is STRINGPREP_NO_UNASSIGNED, which signals to the Nameprep
implementation that it should perform unassigned code point checking,
that is, that the "AllowUnassigned" flag is not set.
The rc field contains expected error
244 codes, where 0 indicates success and the other flags should be self
262 "foo\xC2\xAD\xCD\x8F\xE1\xA0\x86\xE1\xA0\x8B"
263 "bar""\xE2\x80\x8B\xE2\x81\xA0""baz\xEF\xB8\x80\xEF\xB8\x88"
264 "\xEF\xB8\x8F\xEF\xBB\xBF", "foobarbaz"
267 "Case folding ASCII U+0043 U+0041 U+0046 U+0045",
271 "Case folding 8bit U+00DF (german sharp s)",
275 "Case folding U+0130 (turkish capital I with dot)",
276 "\xC4\xB0", "i\xcc\x87"
279 "Case folding multibyte U+0143 U+037A",
280 "\xC5\x83\xCD\xBA", "\xC5\x84 \xCE\xB9"
283 "Case folding U+2121 U+33C6 U+1D7BB",
284 "\xE2\x84\xA1\xE3\x8F\x86\xF0\x9D\x9E\xBB",
285 "telc\xE2\x88\x95""kg\xCF\x83"
288 "Normalization of U+006a U+030c U+00A0 U+00AA",
289 "\x6A\xCC\x8C\xC2\xA0\xC2\xAA", "\xC7\xB0 a"
292 "Case folding U+1FB7 and normalization",
293 "\xE1\xBE\xB7", "\xE1\xBE\xB6\xCE\xB9"
296 "Self-reverting case folding U+01F0 and normalization",
"\xC7\xB0", "\xC7\xB0"
300 "Self-reverting case folding U+0390 and normalization",
301 "\xCE\x90", "\xCE\x90"
304 "Self-reverting case folding U+03B0 and normalization",
305 "\xCE\xB0", "\xCE\xB0"
308 "Self-reverting case folding U+1E96 and normalization",
309 "\xE1\xBA\x96", "\xE1\xBA\x96"
312 "Self-reverting case folding U+1F56 and normalization",
313 "\xE1\xBD\x96", "\xE1\xBD\x96"
316 "ASCII space character U+0020",
320 "Non-ASCII 8bit space character U+00A0",
324 "Non-ASCII multibyte space character U+1680",
325 "\xE1\x9A\x80", NULL, "Nameprep", 0,
326 STRINGPREP_CONTAINS_PROHIBITED
329 "Non-ASCII multibyte space character U+2000",
330 "\xE2\x80\x80", "\x20"
333 "Zero Width Space U+200b",
337 "Non-ASCII multibyte space character U+3000",
338 "\xE3\x80\x80", "\x20"
341 "ASCII control characters U+0010 U+007F",
342 "\x10\x7F", "\x10\x7F"
345 "Non-ASCII 8bit control character U+0085",
346 "\xC2\x85", NULL, "Nameprep", 0,
347 STRINGPREP_CONTAINS_PROHIBITED
350 "Non-ASCII multibyte control character U+180E",
351 "\xE1\xA0\x8E", NULL, "Nameprep", 0,
352 STRINGPREP_CONTAINS_PROHIBITED
355 "Zero Width No-Break Space U+FEFF",
359 "Non-ASCII control character U+1D175",
360 "\xF0\x9D\x85\xB5", NULL, "Nameprep", 0,
361 STRINGPREP_CONTAINS_PROHIBITED
364 "Plane 0 private use character U+F123",
365 "\xEF\x84\xA3", NULL, "Nameprep", 0,
366 STRINGPREP_CONTAINS_PROHIBITED
369 "Plane 15 private use character U+F1234",
370 "\xF3\xB1\x88\xB4", NULL, "Nameprep", 0,
371 STRINGPREP_CONTAINS_PROHIBITED
374 "Plane 16 private use character U+10F234",
375 "\xF4\x8F\x88\xB4", NULL, "Nameprep", 0,
376 STRINGPREP_CONTAINS_PROHIBITED
379 "Non-character code point U+8FFFE",
380 "\xF2\x8F\xBF\xBE", NULL, "Nameprep", 0,
381 STRINGPREP_CONTAINS_PROHIBITED
384 "Non-character code point U+10FFFF",
385 "\xF4\x8F\xBF\xBF", NULL, "Nameprep", 0,
386 STRINGPREP_CONTAINS_PROHIBITED
389 "Surrogate code U+DF42",
390 "\xED\xBD\x82", NULL, "Nameprep", 0,
391 STRINGPREP_CONTAINS_PROHIBITED
394 "Non-plain text character U+FFFD",
395 "\xEF\xBF\xBD", NULL, "Nameprep", 0,
396 STRINGPREP_CONTAINS_PROHIBITED
399 "Ideographic description character U+2FF5",
400 "\xE2\xBF\xB5", NULL, "Nameprep", 0,
401 STRINGPREP_CONTAINS_PROHIBITED
404 "Display property character U+0341",
405 "\xCD\x81", "\xCC\x81"
408 "Left-to-right mark U+200E",
"\xE2\x80\x8E", NULL, "Nameprep", 0,
STRINGPREP_CONTAINS_PROHIBITED
"\xE2\x80\xAA", NULL, "Nameprep", 0,
STRINGPREP_CONTAINS_PROHIBITED
"Language tagging character U+E0001",
"\xF3\xA0\x80\x81", NULL, "Nameprep", 0,
420 STRINGPREP_CONTAINS_PROHIBITED
423 "Language tagging character U+E0042",
424 "\xF3\xA0\x81\x82", NULL, "Nameprep", 0,
425 STRINGPREP_CONTAINS_PROHIBITED
428 "Bidi: RandALCat character U+05BE and LCat characters",
429 "foo\xD6\xBE""bar", NULL, "Nameprep", 0,
430 STRINGPREP_BIDI_BOTH_L_AND_RAL
433 "Bidi: RandALCat character U+FD50 and LCat characters",
434 "foo\xEF\xB5\x90""bar", NULL, "Nameprep", 0,
435 STRINGPREP_BIDI_BOTH_L_AND_RAL
"Bidi: RandALCat character U+FE76 and LCat characters",
439 "foo\xEF\xB9\xB6""bar", "foo \xd9\x8e""bar"
441 { "Bidi: RandALCat without trailing RandALCat U+0627 U+0031",
442 "\xD8\xA7\x31", NULL, "Nameprep", 0,
443 STRINGPREP_BIDI_LEADTRAIL_NOT_RAL}
446 "Bidi: RandALCat character U+0627 U+0031 U+0628",
447 "\xD8\xA7\x31\xD8\xA8", "\xD8\xA7\x31\xD8\xA8"
450 "Unassigned code point U+E0002",
451 "\xF3\xA0\x80\x82", NULL, "Nameprep", STRINGPREP_NO_UNASSIGNED,
452 STRINGPREP_CONTAINS_UNASSIGNED
455 "Larger test (shrinking)",
"X\xC2\xAD\xC3\x9F\xC4\xB0\xE2\x84\xA1\x6a\xcc\x8c\xc2\xa0\xc2"
457 "\xaa\xce\xb0\xe2\x80\x80", "xssi\xcc\x87""tel\xc7\xb0 a\xce\xb0 ",
461 "Larger test (expanding)",
"X\xC3\x9F\xe3\x8c\x96\xC4\xB0\xE2\x84\xA1\xE2\x92\x9F\xE3\x8c\x80",
463 "xss\xe3\x82\xad\xe3\x83\xad\xe3\x83\xa1\xe3\x83\xbc\xe3\x83\x88"
464 "\xe3\x83\xab""i\xcc\x87""tel\x28""d\x29\xe3\x82\xa2\xe3\x83\x91"
465 "\xe3\x83\xbc\xe3\x83\x88"
473 <section title="IDNA test vectors in C syntax">
475 <t>In order to avoid having implementors type in the IDNA test vectors
476 above, a C structure with the data is provided.</t>
<t>The name field holds the section title used in this document. The
inlen and in fields contain the number of Unicode code points and the
code points themselves. The out field contains the expected ToASCII
output. The allowunassigned and usestd3asciirules fields can be
ignored. The toasciirc and tounicoderc fields contain the expected
error codes, where 0 indicates success and the other values should be
self-explanatory.</t>
491 unsigned long in[100];
494 int usestd3asciirules;
500 "Arabic (Egyptian)", 17,
502 0x0644, 0x064A, 0x0647, 0x0645, 0x0627, 0x0628, 0x062A, 0x0643,
503 0x0644, 0x0645, 0x0648, 0x0634, 0x0639, 0x0631, 0x0628, 0x064A,
505 IDNA_ACE_PREFIX "egbpdaj6bu4bxfgehfvwxn", 0, 0, IDNA_SUCCESS,
508 "Chinese (simplified)", 9,
510 0x4ED6, 0x4EEC, 0x4E3A, 0x4EC0, 0x4E48, 0x4E0D, 0x8BF4, 0x4E2D, 0x6587},
511 IDNA_ACE_PREFIX "ihqwcrb4cv8a8dqg056pqjye", 0, 0, IDNA_SUCCESS,
514 "Chinese (traditional)", 9,
516 0x4ED6, 0x5011, 0x7232, 0x4EC0, 0x9EBD, 0x4E0D, 0x8AAA, 0x4E2D, 0x6587},
517 IDNA_ACE_PREFIX "ihqwctvzc91f659drss3x8bo0yb", 0, 0, IDNA_SUCCESS,
522 0x0050, 0x0072, 0x006F, 0x010D, 0x0070, 0x0072, 0x006F, 0x0073,
523 0x0074, 0x011B, 0x006E, 0x0065, 0x006D, 0x006C, 0x0075, 0x0076,
524 0x00ED, 0x010D, 0x0065, 0x0073, 0x006B, 0x0079},
525 IDNA_ACE_PREFIX "Proprostnemluvesky-uyb24dma41a", 0, 0, IDNA_SUCCESS,
530 0x05DC, 0x05DE, 0x05D4, 0x05D4, 0x05DD, 0x05E4, 0x05E9, 0x05D5,
531 0x05D8, 0x05DC, 0x05D0, 0x05DE, 0x05D3, 0x05D1, 0x05E8, 0x05D9,
532 0x05DD, 0x05E2, 0x05D1, 0x05E8, 0x05D9, 0x05EA},
533 IDNA_ACE_PREFIX "4dbcagdahymbxekheh6e0a7fei0b", 0, 0, IDNA_SUCCESS,
536 "Hindi (Devanagari)", 30,
538 0x092F, 0x0939, 0x0932, 0x094B, 0x0917, 0x0939, 0x093F, 0x0928,
539 0x094D, 0x0926, 0x0940, 0x0915, 0x094D, 0x092F, 0x094B, 0x0902,
540 0x0928, 0x0939, 0x0940, 0x0902, 0x092C, 0x094B, 0x0932, 0x0938,
541 0x0915, 0x0924, 0x0947, 0x0939, 0x0948, 0x0902},
542 IDNA_ACE_PREFIX "i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd", 0, 0,
545 "Japanese (kanji and hiragana)", 18,
547 0x306A, 0x305C, 0x307F, 0x3093, 0x306A, 0x65E5, 0x672C, 0x8A9E,
548 0x3092, 0x8A71, 0x3057, 0x3066, 0x304F, 0x308C, 0x306A, 0x3044,
550 IDNA_ACE_PREFIX "n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa", 0, 0,
553 "Russian (Cyrillic)", 28,
555 0x043F, 0x043E, 0x0447, 0x0435, 0x043C, 0x0443, 0x0436, 0x0435,
556 0x043E, 0x043D, 0x0438, 0x043D, 0x0435, 0x0433, 0x043E, 0x0432,
557 0x043E, 0x0440, 0x044F, 0x0442, 0x043F, 0x043E, 0x0440, 0x0443,
558 0x0441, 0x0441, 0x043A, 0x0438},
559 IDNA_ACE_PREFIX "b1abfaaepdrnnbgefbadotcwatmq2g4l", 0, 0,
560 IDNA_SUCCESS, IDNA_SUCCESS},
564 0x0050, 0x006F, 0x0072, 0x0071, 0x0075, 0x00E9, 0x006E, 0x006F,
565 0x0070, 0x0075, 0x0065, 0x0064, 0x0065, 0x006E, 0x0073, 0x0069,
566 0x006D, 0x0070, 0x006C, 0x0065, 0x006D, 0x0065, 0x006E, 0x0074,
567 0x0065, 0x0068, 0x0061, 0x0062, 0x006C, 0x0061, 0x0072, 0x0065,
568 0x006E, 0x0045, 0x0073, 0x0070, 0x0061, 0x00F1, 0x006F, 0x006C},
569 IDNA_ACE_PREFIX "PorqunopuedensimplementehablarenEspaol-fmd56a", 0, 0,
574 0x0054, 0x1EA1, 0x0069, 0x0073, 0x0061, 0x006F, 0x0068, 0x1ECD,
575 0x006B, 0x0068, 0x00F4, 0x006E, 0x0067, 0x0074, 0x0068, 0x1EC3,
576 0x0063, 0x0068, 0x1EC9, 0x006E, 0x00F3, 0x0069, 0x0074, 0x0069,
577 0x1EBF, 0x006E, 0x0067, 0x0056, 0x0069, 0x1EC7, 0x0074},
578 IDNA_ACE_PREFIX "TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g", 0, 0,
583 0x0033, 0x5E74, 0x0042, 0x7D44, 0x91D1, 0x516B, 0x5148, 0x751F},
584 IDNA_ACE_PREFIX "3B-ww4c5e180e575a65lsy2b", 0, 0, IDNA_SUCCESS,
589 0x5B89, 0x5BA4, 0x5948, 0x7F8E, 0x6075, 0x002D, 0x0077, 0x0069,
590 0x0074, 0x0068, 0x002D, 0x0053, 0x0055, 0x0050, 0x0045, 0x0052,
591 0x002D, 0x004D, 0x004F, 0x004E, 0x004B, 0x0045, 0x0059, 0x0053},
592 IDNA_ACE_PREFIX "-with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n", 0, 0,
597 0x0048, 0x0065, 0x006C, 0x006C, 0x006F, 0x002D, 0x0041, 0x006E,
598 0x006F, 0x0074, 0x0068, 0x0065, 0x0072, 0x002D, 0x0057, 0x0061,
599 0x0079, 0x002D, 0x305D, 0x308C, 0x305E, 0x308C, 0x306E, 0x5834,
601 IDNA_ACE_PREFIX "Hello-Another-Way--fc4qua05auwb3674vfr0b", 0, 0,
606 0x3072, 0x3068, 0x3064, 0x5C4B, 0x6839, 0x306E, 0x4E0B, 0x0032},
607 IDNA_ACE_PREFIX "2-u9tlzr9756bt3uc0v", 0, 0, IDNA_SUCCESS,
612 0x004D, 0x0061, 0x006A, 0x0069, 0x3067, 0x004B, 0x006F, 0x0069,
613 0x3059, 0x308B, 0x0035, 0x79D2, 0x524D},
614 IDNA_ACE_PREFIX "MajiKoi5-783gue6qz075azm5e", 0, 0, IDNA_SUCCESS,
619 0x30D1, 0x30D5, 0x30A3, 0x30FC, 0x0064, 0x0065, 0x30EB, 0x30F3, 0x30D0},
620 IDNA_ACE_PREFIX "de-jg4avhby1noc0d", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS},
624 0x305D, 0x306E, 0x30B9, 0x30D4, 0x30FC, 0x30C9, 0x3067},
625 IDNA_ACE_PREFIX "d9juau41awczczp", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS},
628 {0x03b5, 0x03bb, 0x03bb, 0x03b7, 0x03bd, 0x03b9, 0x03ba, 0x03ac},
629 IDNA_ACE_PREFIX "hxargifdar", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS},
631 "Maltese (Malti)", 10,
632 {0x0062, 0x006f, 0x006e, 0x0121, 0x0075, 0x0073, 0x0061, 0x0127,
634 IDNA_ACE_PREFIX "bonusaa-5bb1da", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS},
636 "Russian (Cyrillic)", 28,
637 {0x043f, 0x043e, 0x0447, 0x0435, 0x043c, 0x0443, 0x0436, 0x0435,
638 0x043e, 0x043d, 0x0438, 0x043d, 0x0435, 0x0433, 0x043e, 0x0432,
639 0x043e, 0x0440, 0x044f, 0x0442, 0x043f, 0x043e, 0x0440, 0x0443,
640 0x0441, 0x0441, 0x043a, 0x0438},
641 IDNA_ACE_PREFIX "b1abfaaepdrnnbgefbadotcwatmq2g4l", 0, 0,
642 IDNA_SUCCESS, IDNA_SUCCESS},
649 <references title="Normative References">
650 <?rfc include="reference.RFC.3491.xml"?>
651 <?rfc include="reference.RFC.3490.xml"?>
654 <references title="Informative References">
655 <?rfc include="reference.RFC.3492.xml"?>