2 <!DOCTYPE rfc SYSTEM "rfc2629.dtd">
6 <rfc category="info" ipr="full2026"
7 docName="draft-josefsson-idn-test-vectors">
11 <title>Nameprep and IDNA Test Vectors</title>
13 <author initials="S." surname="Josefsson" fullname="Simon Josefsson">
14 <organization></organization>
17 <street>Drottningholmsv. 70</street>
18 <city>Stockholm</city> <code>112 42</code>
19 <country>Sweden</country>
21 <email>simon@josefsson.org</email>
25 <date month="February" year="2003"/>
29 <t>This document contains test vectors for Nameprep and IDNA. Most
30 of the test vectors were constructed to cover various
31 corner cases in the specifications, but some data expected to be
32 typical of real-world use is also included.</t>
40 <section title="Introduction">
42 <t>The Nameprep and IDNA specifications lack thorough examples that
43 would aid in implementing them. This document can act as a complement
44 to those specifications.</t>
46 <t>Note that this document is not normative: any errors in it must
47 not be taken to define Nameprep or IDNA. If conforming to the
48 specifications would produce output that conflicts with the values
49 in this document, implementations should conform to the
50 specifications.</t>
52 <t><vspace blankLines="10000" /></t>
56 <section title="Format of Nameprep Test Vectors">
58 <t>The tests follow a certain syntax, described here by showing one
59 complete example with comments intermixed. The comments are prefixed
60 with the '#' character.</t>
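<t>The octet-string notation used for the input in the example of this
section can be reproduced with a small helper. The following is only a
minimal sketch: the names shown_inline and escape_octets are
hypothetical and are not part of Nameprep, IDNA, or any particular
library.</t>
<figure><artwork><![CDATA[
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: nonzero for octets that are shown inline,
   i.e. the set [A-Za-z .0-9]. */
static int shown_inline(unsigned char c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
    || (c >= '0' && c <= '9') || c == ' ' || c == '.';
}

/* Hypothetical helper: write the escaped form of the NUL-terminated
   string IN into BUF (BUFSIZE bytes).  Returns the number of
   characters written, or -1 if BUF is too small. */
static int escape_octets(const char *in, char *buf, size_t bufsize)
{
  size_t len, pos = 0, i;

  assert(in != NULL && buf != NULL);
  len = strlen(in);
  for (i = 0; i < len; i++) {
    unsigned char c = (unsigned char) in[i];
    size_t need = shown_inline(c) ? 1 : 4;

    /* Keep room for this octet and the terminating NUL. */
    if (pos + need + 1 > bufsize)
      return -1;
    if (need == 1)
      buf[pos] = (char) c;
    else
      sprintf(buf + pos, "\\x%02X", (unsigned int) c);
    pos += need;
  }
  if (pos + 1 > bufsize)
    return -1;
  buf[pos] = '\0';
  return (int) pos;
}
```
]]></artwork></figure>
<t>Under these assumptions, the seven octets of "Way-" followed by
U+305D in UTF-8 are rendered as Way\x2D\xE3\x81\x9D, since the hyphen is
outside the inline set.</t>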
64 # First the (UTF-8) string is printed as a C octet string, with
65 # characters [A-Za-z .0-9] shown inline and other characters shown
66 # escaped with \xAB where AB is the hex sequence of that octet. The
67 # number of octets is also shown.
72 # The input is also printed as Unicode codepoints.
77 # After printing the input, the Nameprep steps start. When the
78 # string is modified, the specific operation that caused it is printed
79 # along with the new string of Unicode code points.
81 # 1) Map -- For each character in the input, check if it has a mapping
82 # and, if so, replace it with its mapping. This is described in
85 Table B.2 maps U+1fb7 to U+03b1 U+0342 U+03b9.
88 # 2) Normalize -- Possibly normalize the result of step 1 using Unicode
89 # normalization. This is described in section 4.
91 Unicode normalization with form KC maps string into:
94 # 3) Prohibit -- Check for any characters that are not allowed in the
95 # output. If any are found, return an error. This is described in
98 # 4) Check bidi -- Possibly check for right-to-left characters, and if
99 # any are found, make sure that the whole string satisfies the
100 # requirements for bidirectional strings. If the string does not
101 # satisfy the requirements for bidirectional strings, return an
102 # error. This is described in section 6.
104 # 1) The characters in section 5.8 MUST be prohibited.
106 # 2) If a string contains any RandALCat character, the string MUST NOT
107 # contain any LCat character.
109 # 3) If a string contains any RandALCat character, a RandALCat
110 # character MUST be the first character of the string, and a
111 # RandALCat character MUST be the last character of the string.
113 # The output is printed as Unicode codepoints.
118 # And finally the output is printed as UTF-8
120 out (length 5 bytes):
127 <section title="Format of IDNA Test Vectors">
129 <t>The tests follow a certain syntax, described here by showing one
130 complete example with comments intermixed. The comments are prefixed
131 with the '#' character.</t>
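<t>The list of Unicode code points printed for the input can be
derived from the UTF-8 octets. The following is a minimal sketch with
hypothetical function names (utf8_decode, format_codepoints). As an
assumption made to keep it short, it rejects truncated or malformed
sequences but does not reject overlong forms or surrogates.</t>
<figure><artwork><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence of at most LEN octets starting at S.
   Stores the code point in *CP and returns the number of octets
   consumed, or 0 on malformed input. */
static size_t utf8_decode(const unsigned char *s, size_t len,
                          unsigned long *cp)
{
  size_t n, i;

  assert(s != NULL && cp != NULL);
  if (len == 0)
    return 0;
  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
  else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
  else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
  else return 0;
  if (len < n)
    return 0;
  for (i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80)   /* not a continuation octet */
      return 0;
    *cp = (*cp << 6) | (s[i] & 0x3F);
  }
  return n;
}

/* Format the code points of the NUL-terminated UTF-8 string IN as
   "U+0048 U+0065 ..." into BUF.  Returns the number of code points,
   or -1 on malformed input or if BUF is too small. */
static int format_codepoints(const char *in, char *buf, size_t bufsize)
{
  const unsigned char *s = (const unsigned char *) in;
  size_t len, pos = 0, off = 0;
  int count = 0;

  assert(in != NULL && buf != NULL && bufsize > 0);
  len = strlen(in);
  buf[0] = '\0';
  while (off < len) {
    unsigned long cp;
    size_t n = utf8_decode(s + off, len - off, &cp);

    if (n == 0)
      return -1;
    /* Longest item is " U+1fffff": 9 characters plus NUL. */
    if (bufsize - pos < 10)
      return -1;
    pos += (size_t) sprintf(buf + pos, "%sU+%04lx",
                            count ? " " : "", cp);
    off += n;
    count++;
  }
  return count;
}
```
]]></artwork></figure>
<t>For example, the octets \xE3\x81\x9D decode to U+305d and
\xE5\xA0\xB4 decode to U+5834, matching the code point listing in the
example of this section.</t>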
135 # First the (UTF-8) string is printed as a C octet string, with
136 # characters [A-Za-z .0-9] shown inline and other characters shown
137 # escaped with \xAB where AB is the hex sequence of that octet. The
138 # number of octets is also shown.
140 in (length 39 bytes):
141 'Hello\x2DAnother\x2DWa'
142 'y\x2D\xE3\x81\x9D\xE3\x82\x8C\xE3\x81\x9E\xE3\x82\x8C\xE3\x81'
143 '\xAE\xE5\xA0\xB4\xE6\x89\x80'
145 # The input is also printed as Unicode codepoints.
148 U+0048 U+0065 U+006c U+006c U+006f U+002d U+0041 U+006e
149 U+006f U+0074 U+0068 U+0065 U+0072 U+002d U+0057 U+0061
150 U+0079 U+002d U+305d U+308c U+305e U+308c U+306e U+5834
153 # After printing the input, the IDNA ToASCII step starts. The output
154 # is printed as an ASCII string.
156 out: xn--hello-another-way--fc4qua05auwb3674vfr0b
162 <t><vspace blankLines="10000" /></t>
166 <section title="Nameprep Test Vectors">
168 <?rfc include="foo"?>
172 <section title="IDNA Test Vectors">
174 <?rfc include="bar"?>
178 <section title="Security Considerations">
180 <t>The security considerations from Nameprep and IDNA are
183 <t>These test vectors are not believed to introduce new security
184 considerations nor disrupt the operation of the Internet, but may
185 expose security weaknesses in existing implementations. Any such
186 incident should not be regarded as a problem with this document,
187 though, but rather taken as evidence that this document served its
196 <note title="Acknowledgments">
197 <t>Some IDNA test vectors were borrowed from Punycode <xref
198 target="RFC3492" />.</t>
201 <section title="Nameprep test vectors in C syntax">
203 <t>In order to avoid having implementors type in the test vectors
204 above, a C structure with the data is provided.</t>
206 <t>The comment field contains the section titles used in this
207 document. The in field contains UTF-8 encoded strings. The out field
208 contains the expected output, or NULL if the expected result is an
209 error. The profile field can be ignored. The only significant setting
210 for the flags field is STRINGPREP_NO_UNASSIGNED, which tells the
211 Nameprep implementation to perform unassigned code point checking,
212 that is, to run with the "AllowUnassigned" flag unset. The rc field contains expected error
213 codes, where 0 indicates success and the other flags should be self
231 "foo\xC2\xAD\xCD\x8F\xE1\xA0\x86\xE1\xA0\x8B"
232 "bar""\xE2\x80\x8B\xE2\x81\xA0""baz\xEF\xB8\x80\xEF\xB8\x88"
233 "\xEF\xB8\x8F\xEF\xBB\xBF", "foobarbaz"
236 "Case folding ASCII U+0043 U+0041 U+0046 U+0045",
240 "Case folding 8bit U+00DF (german sharp s)",
244 "Case folding U+0130 (turkish capital I with dot)",
245 "\xC4\xB0", "i\xcc\x87"
248 "Case folding multibyte U+0143 U+037A",
249 "\xC5\x83\xCD\xBA", "\xC5\x84 \xCE\xB9"
252 "Case folding U+2121 U+33C6 U+1D7BB",
253 "\xE2\x84\xA1\xE3\x8F\x86\xF0\x9D\x9E\xBB",
254 "telc\xE2\x88\x95""kg\xCF\x83"
257 "Normalization of U+006a U+030c U+00A0 U+00AA",
258 "\x6A\xCC\x8C\xC2\xA0\xC2\xAA", "\xC7\xB0 a"
261 "Case folding U+1FB7 and normalization",
262 "\xE1\xBE\xB7", "\xE1\xBE\xB6\xCE\xB9"
265 "Self-reverting case folding U+01F0 and normalization",
266 "\xC7\xB0", "\xC7\xB0"
269 "Self-reverting case folding U+0390 and normalization",
270 "\xCE\x90", "\xCE\x90"
273 "Self-reverting case folding U+03B0 and normalization",
274 "\xCE\xB0", "\xCE\xB0"
277 "Self-reverting case folding U+1E96 and normalization",
278 "\xE1\xBA\x96", "\xE1\xBA\x96"
281 "Self-reverting case folding U+1F56 and normalization",
282 "\xE1\xBD\x96", "\xE1\xBD\x96"
285 "ASCII space character U+0020",
289 "Non-ASCII 8bit space character U+00A0",
293 "Non-ASCII multibyte space character U+1680",
294 "\xE1\x9A\x80", NULL, "Nameprep", 0,
295 STRINGPREP_CONTAINS_PROHIBITED
298 "Non-ASCII multibyte space character U+2000",
299 "\xE2\x80\x80", "\x20"
302 "Zero Width Space U+200b",
306 "Non-ASCII multibyte space character U+3000",
307 "\xE3\x80\x80", "\x20"
310 "ASCII control characters U+0010 U+007F",
311 "\x10\x7F", "\x10\x7F"
314 "Non-ASCII 8bit control character U+0085",
315 "\xC2\x85", NULL, "Nameprep", 0,
316 STRINGPREP_CONTAINS_PROHIBITED
319 "Non-ASCII multibyte control character U+180E",
320 "\xE1\xA0\x8E", NULL, "Nameprep", 0,
321 STRINGPREP_CONTAINS_PROHIBITED
324 "Zero Width No-Break Space U+FEFF",
328 "Non-ASCII control character U+1D175",
329 "\xF0\x9D\x85\xB5", NULL, "Nameprep", 0,
330 STRINGPREP_CONTAINS_PROHIBITED
333 "Plane 0 private use character U+F123",
334 "\xEF\x84\xA3", NULL, "Nameprep", 0,
335 STRINGPREP_CONTAINS_PROHIBITED
338 "Plane 15 private use character U+F1234",
339 "\xF3\xB1\x88\xB4", NULL, "Nameprep", 0,
340 STRINGPREP_CONTAINS_PROHIBITED
343 "Plane 16 private use character U+10F234",
344 "\xF4\x8F\x88\xB4", NULL, "Nameprep", 0,
345 STRINGPREP_CONTAINS_PROHIBITED
348 "Non-character code point U+8FFFE",
349 "\xF2\x8F\xBF\xBE", NULL, "Nameprep", 0,
350 STRINGPREP_CONTAINS_PROHIBITED
353 "Non-character code point U+10FFFF",
354 "\xF4\x8F\xBF\xBF", NULL, "Nameprep", 0,
355 STRINGPREP_CONTAINS_PROHIBITED
358 "Surrogate code U+DF42",
359 "\xED\xBD\x82", NULL, "Nameprep", 0,
360 STRINGPREP_CONTAINS_PROHIBITED
363 "Non-plain text character U+FFFD",
364 "\xEF\xBF\xBD", NULL, "Nameprep", 0,
365 STRINGPREP_CONTAINS_PROHIBITED
368 "Ideographic description character U+2FF5",
369 "\xE2\xBF\xB5", NULL, "Nameprep", 0,
370 STRINGPREP_CONTAINS_PROHIBITED
373 "Display property character U+0341",
374 "\xCD\x81", "\xCC\x81"
377 "Left-to-right mark U+200E",
378 "\xE2\x80\x8E", NULL, "Nameprep", 0,
379 STRINGPREP_CONTAINS_PROHIBITED
383 "\xE2\x80\xAA", NULL, "Nameprep", 0,
384 STRINGPREP_CONTAINS_PROHIBITED
387 "Language tagging character U+E0001",
388 "\xF3\xA0\x80\x81", NULL, "Nameprep", 0,
389 STRINGPREP_CONTAINS_PROHIBITED
392 "Language tagging character U+E0042",
393 "\xF3\xA0\x81\x82", NULL, "Nameprep", 0,
394 STRINGPREP_CONTAINS_PROHIBITED
397 "Bidi: RandALCat character U+05BE and LCat characters",
398 "foo\xD6\xBE""bar", NULL, "Nameprep", 0,
399 STRINGPREP_BIDI_BOTH_L_AND_RAL
402 "Bidi: RandALCat character U+FD50 and LCat characters",
403 "foo\xEF\xB5\x90""bar", NULL, "Nameprep", 0,
404 STRINGPREP_BIDI_BOTH_L_AND_RAL
407 "Bidi: RandALCat character U+FE76 and LCat characters",
408 "foo\xEF\xB9\xB6""bar", "foo \xd9\x8e""bar"
410 { "Bidi: RandALCat without trailing RandALCat U+0627 U+0031",
411 "\xD8\xA7\x31", NULL, "Nameprep", 0,
412 STRINGPREP_BIDI_LEADTRAIL_NOT_RAL}
415 "Bidi: RandALCat character U+0627 U+0031 U+0628",
416 "\xD8\xA7\x31\xD8\xA8", "\xD8\xA7\x31\xD8\xA8"
419 "Unassigned code point U+E0002",
420 "\xF3\xA0\x80\x82", NULL, "Nameprep", STRINGPREP_NO_UNASSIGNED,
421 STRINGPREP_CONTAINS_UNASSIGNED
424 "Larger test (shrinking)",
425 "X\xC2\xAD\xC3\x9F\xC4\xB0\xE2\x84\xA1\x6a\xcc\x8c\xc2\xa0\xc2"
426 "\xaa\xce\xb0\xe2\x80\x80", "xssi\xcc\x87""tel\xc7\xb0 a\xce\xb0 ",
430 "Larger test (expanding)",
431 "X\xC3\x9F\xe3\x8c\x96\xC4\xB0\xE2\x84\xA1\xE2\x92\x9F\xE3\x8c\x80",
432 "xss\xe3\x82\xad\xe3\x83\xad\xe3\x83\xa1\xe3\x83\xbc\xe3\x83\x88"
433 "\xe3\x83\xab""i\xcc\x87""tel\x28""d\x29\xe3\x82\xa2\xe3\x83\x91"
434 "\xe3\x83\xbc\xe3\x83\x88"
442 <section title="IDNA test vectors in C syntax">
444 <t>In order to avoid having implementors type in the IDNA test vectors
445 above, a C structure with the data is provided.</t>
447 <t>The name field contains the section titles used in this document.
448 The in field contains Unicode code points and the inlen field their
449 number. The out field contains the expected ToASCII output. The
450 allowunassigned and usestd3asciirules fields can be ignored. The
451 toasciirc and tounicoderc fields contain expected error codes, where
452 0 indicates success and the other values should be self-explanatory.</t>
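<t>A test driver for such a table might look like the following
sketch. The struct layout is an assumption reconstructed from the
field descriptions above. The toascii_fn callback type, the
stub_toascii stand-in, and the function names are hypothetical.
IDNA_SUCCESS is assumed to be 0, and labels are compared without
regard to ASCII case, which is also an assumption. Substitute the
ToASCII implementation under test for the stub.</t>
<figure><artwork><![CDATA[
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define IDNA_ACE_PREFIX "xn--"  /* ACE prefix from RFC 3490 */
#define IDNA_SUCCESS 0          /* assumed: 0 indicates success */

/* Assumed layout, following the field descriptions in the text. */
struct idna_vector {
  const char *name;
  size_t inlen;
  unsigned long in[100];
  const char *out;
  int allowunassigned;
  int usestd3asciirules;
  int toasciirc;
  int tounicoderc;
};

/* Hypothetical signature of the ToASCII implementation under test. */
typedef int (*toascii_fn)(const unsigned long *in, size_t inlen,
                          int allowunassigned, int usestd3asciirules,
                          char *out, size_t outsize);

/* One vector from the table; the name "sample" is a placeholder. */
static const struct idna_vector sample[] = {
  { "sample", 8,
    {0x03b5, 0x03bb, 0x03bb, 0x03b7, 0x03bd, 0x03b9, 0x03ba, 0x03ac},
    IDNA_ACE_PREFIX "hxargifdar", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS }
};

/* Stand-in that ignores its input and returns a canned label. */
static int stub_toascii(const unsigned long *in, size_t inlen,
                        int allowunassigned, int usestd3asciirules,
                        char *out, size_t outsize)
{
  static const char canned[] = "xn--hxargifdar";

  (void) in; (void) inlen;
  (void) allowunassigned; (void) usestd3asciirules;
  if (outsize < sizeof canned)
    return 1;                   /* any nonzero value is an error */
  memcpy(out, canned, sizeof canned);
  return IDNA_SUCCESS;
}

/* Compare two ASCII strings, ignoring case. */
static int ascii_casecmp(const char *a, const char *b)
{
  while (*a && tolower((unsigned char) *a)
               == tolower((unsigned char) *b)) {
    a++;
    b++;
  }
  return tolower((unsigned char) *a) - tolower((unsigned char) *b);
}

/* Run FN over N vectors and return the number of mismatches. */
static size_t run_toascii_vectors(const struct idna_vector *v,
                                  size_t n, toascii_fn fn)
{
  size_t i, failures = 0;

  assert(v != NULL && fn != NULL);
  for (i = 0; i < n; i++) {
    char label[64];             /* a DNS label is at most 63 octets */
    int rc;

    label[0] = '\0';
    rc = fn(v[i].in, v[i].inlen, v[i].allowunassigned,
            v[i].usestd3asciirules, label, sizeof label);
    if (rc != v[i].toasciirc)
      failures++;
    else if (rc == IDNA_SUCCESS && ascii_casecmp(label, v[i].out) != 0)
      failures++;
  }
  return failures;
}
```
]]></artwork></figure>
<t>Calling run_toascii_vectors(sample, 1, stub_toascii) returns 0
because the stub returns exactly the expected label. A real
implementation passes only if every vector yields both the expected
return code and the expected label.</t>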
460 unsigned long in[100];
463 int usestd3asciirules;
469 "Arabic (Egyptian)", 17,
471 0x0644, 0x064A, 0x0647, 0x0645, 0x0627, 0x0628, 0x062A, 0x0643,
472 0x0644, 0x0645, 0x0648, 0x0634, 0x0639, 0x0631, 0x0628, 0x064A,
474 IDNA_ACE_PREFIX "egbpdaj6bu4bxfgehfvwxn", 0, 0, IDNA_SUCCESS,
477 "Chinese (simplified)", 9,
479 0x4ED6, 0x4EEC, 0x4E3A, 0x4EC0, 0x4E48, 0x4E0D, 0x8BF4, 0x4E2D, 0x6587},
480 IDNA_ACE_PREFIX "ihqwcrb4cv8a8dqg056pqjye", 0, 0, IDNA_SUCCESS,
483 "Chinese (traditional)", 9,
485 0x4ED6, 0x5011, 0x7232, 0x4EC0, 0x9EBD, 0x4E0D, 0x8AAA, 0x4E2D, 0x6587},
486 IDNA_ACE_PREFIX "ihqwctvzc91f659drss3x8bo0yb", 0, 0, IDNA_SUCCESS,
491 0x0050, 0x0072, 0x006F, 0x010D, 0x0070, 0x0072, 0x006F, 0x0073,
492 0x0074, 0x011B, 0x006E, 0x0065, 0x006D, 0x006C, 0x0075, 0x0076,
493 0x00ED, 0x010D, 0x0065, 0x0073, 0x006B, 0x0079},
494 IDNA_ACE_PREFIX "Proprostnemluvesky-uyb24dma41a", 0, 0, IDNA_SUCCESS,
499 0x05DC, 0x05DE, 0x05D4, 0x05D4, 0x05DD, 0x05E4, 0x05E9, 0x05D5,
500 0x05D8, 0x05DC, 0x05D0, 0x05DE, 0x05D3, 0x05D1, 0x05E8, 0x05D9,
501 0x05DD, 0x05E2, 0x05D1, 0x05E8, 0x05D9, 0x05EA},
502 IDNA_ACE_PREFIX "4dbcagdahymbxekheh6e0a7fei0b", 0, 0, IDNA_SUCCESS,
505 "Hindi (Devanagari)", 30,
507 0x092F, 0x0939, 0x0932, 0x094B, 0x0917, 0x0939, 0x093F, 0x0928,
508 0x094D, 0x0926, 0x0940, 0x0915, 0x094D, 0x092F, 0x094B, 0x0902,
509 0x0928, 0x0939, 0x0940, 0x0902, 0x092C, 0x094B, 0x0932, 0x0938,
510 0x0915, 0x0924, 0x0947, 0x0939, 0x0948, 0x0902},
511 IDNA_ACE_PREFIX "i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd", 0, 0,
514 "Japanese (kanji and hiragana)", 18,
516 0x306A, 0x305C, 0x307F, 0x3093, 0x306A, 0x65E5, 0x672C, 0x8A9E,
517 0x3092, 0x8A71, 0x3057, 0x3066, 0x304F, 0x308C, 0x306A, 0x3044,
519 IDNA_ACE_PREFIX "n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa", 0, 0,
522 "Russian (Cyrillic)", 28,
524 0x043F, 0x043E, 0x0447, 0x0435, 0x043C, 0x0443, 0x0436, 0x0435,
525 0x043E, 0x043D, 0x0438, 0x043D, 0x0435, 0x0433, 0x043E, 0x0432,
526 0x043E, 0x0440, 0x044F, 0x0442, 0x043F, 0x043E, 0x0440, 0x0443,
527 0x0441, 0x0441, 0x043A, 0x0438},
528 IDNA_ACE_PREFIX "b1abfaaepdrnnbgefbadotcwatmq2g4l", 0, 0,
529 IDNA_SUCCESS, IDNA_SUCCESS},
533 0x0050, 0x006F, 0x0072, 0x0071, 0x0075, 0x00E9, 0x006E, 0x006F,
534 0x0070, 0x0075, 0x0065, 0x0064, 0x0065, 0x006E, 0x0073, 0x0069,
535 0x006D, 0x0070, 0x006C, 0x0065, 0x006D, 0x0065, 0x006E, 0x0074,
536 0x0065, 0x0068, 0x0061, 0x0062, 0x006C, 0x0061, 0x0072, 0x0065,
537 0x006E, 0x0045, 0x0073, 0x0070, 0x0061, 0x00F1, 0x006F, 0x006C},
538 IDNA_ACE_PREFIX "PorqunopuedensimplementehablarenEspaol-fmd56a", 0, 0,
543 0x0054, 0x1EA1, 0x0069, 0x0073, 0x0061, 0x006F, 0x0068, 0x1ECD,
544 0x006B, 0x0068, 0x00F4, 0x006E, 0x0067, 0x0074, 0x0068, 0x1EC3,
545 0x0063, 0x0068, 0x1EC9, 0x006E, 0x00F3, 0x0069, 0x0074, 0x0069,
546 0x1EBF, 0x006E, 0x0067, 0x0056, 0x0069, 0x1EC7, 0x0074},
547 IDNA_ACE_PREFIX "TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g", 0, 0,
552 0x0033, 0x5E74, 0x0042, 0x7D44, 0x91D1, 0x516B, 0x5148, 0x751F},
553 IDNA_ACE_PREFIX "3B-ww4c5e180e575a65lsy2b", 0, 0, IDNA_SUCCESS,
558 0x5B89, 0x5BA4, 0x5948, 0x7F8E, 0x6075, 0x002D, 0x0077, 0x0069,
559 0x0074, 0x0068, 0x002D, 0x0053, 0x0055, 0x0050, 0x0045, 0x0052,
560 0x002D, 0x004D, 0x004F, 0x004E, 0x004B, 0x0045, 0x0059, 0x0053},
561 IDNA_ACE_PREFIX "-with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n", 0, 0,
566 0x0048, 0x0065, 0x006C, 0x006C, 0x006F, 0x002D, 0x0041, 0x006E,
567 0x006F, 0x0074, 0x0068, 0x0065, 0x0072, 0x002D, 0x0057, 0x0061,
568 0x0079, 0x002D, 0x305D, 0x308C, 0x305E, 0x308C, 0x306E, 0x5834,
570 IDNA_ACE_PREFIX "Hello-Another-Way--fc4qua05auwb3674vfr0b", 0, 0,
575 0x3072, 0x3068, 0x3064, 0x5C4B, 0x6839, 0x306E, 0x4E0B, 0x0032},
576 IDNA_ACE_PREFIX "2-u9tlzr9756bt3uc0v", 0, 0, IDNA_SUCCESS,
581 0x004D, 0x0061, 0x006A, 0x0069, 0x3067, 0x004B, 0x006F, 0x0069,
582 0x3059, 0x308B, 0x0035, 0x79D2, 0x524D},
583 IDNA_ACE_PREFIX "MajiKoi5-783gue6qz075azm5e", 0, 0, IDNA_SUCCESS,
588 0x30D1, 0x30D5, 0x30A3, 0x30FC, 0x0064, 0x0065, 0x30EB, 0x30F3, 0x30D0},
589 IDNA_ACE_PREFIX "de-jg4avhby1noc0d", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS},
593 0x305D, 0x306E, 0x30B9, 0x30D4, 0x30FC, 0x30C9, 0x3067},
594 IDNA_ACE_PREFIX "d9juau41awczczp", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS},
597 {0x03b5, 0x03bb, 0x03bb, 0x03b7, 0x03bd, 0x03b9, 0x03ba, 0x03ac},
598 IDNA_ACE_PREFIX "hxargifdar", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS},
600 "Maltese (Malti)", 10,
601 {0x0062, 0x006f, 0x006e, 0x0121, 0x0075, 0x0073, 0x0061, 0x0127,
603 IDNA_ACE_PREFIX "bonusaa-5bb1da", 0, 0, IDNA_SUCCESS, IDNA_SUCCESS},
605 "Russian (Cyrillic)", 28,
606 {0x043f, 0x043e, 0x0447, 0x0435, 0x043c, 0x0443, 0x0436, 0x0435,
607 0x043e, 0x043d, 0x0438, 0x043d, 0x0435, 0x0433, 0x043e, 0x0432,
608 0x043e, 0x0440, 0x044f, 0x0442, 0x043f, 0x043e, 0x0440, 0x0443,
609 0x0441, 0x0441, 0x043a, 0x0438},
610 IDNA_ACE_PREFIX "b1abfaaepdrnnbgefbadotcwatmq2g4l", 0, 0,
611 IDNA_SUCCESS, IDNA_SUCCESS},
618 <references title="Normative References">
619 <?rfc include="reference.RFC.3491.xml"?>
620 <?rfc include="reference.RFC.3490.xml"?>
623 <references title="Informative References">
624 <?rfc include="reference.RFC.3492.xml"?>