draft-josefsson-idn-test-vectors.xml

   1 <?xml version="1.0"?>
   2 <!DOCTYPE rfc SYSTEM "rfc2629.dtd">
   3
   4 <?rfc compact="yes"?>
   5 <?rfc toc="yes"?>
   6
   7 <rfc category="info" ipr="full2026"
   8      maillist=""
   9      docName="draft-josefsson-idn-test-vectors">
  10
  11 <front>
  12
  13 <title>Nameprep and IDNA Test Vectors</title>
  14
  15 <author initials="S." surname="Josefsson" fullname="Simon Josefsson">
  16         <organization>Extundo</organization>
  17         <address>
  18                 <postal>
  19                         <street>Drottningholmsv. 70</street>
  20                         <city>Stockholm</city> <code>112 42</code>
  21                         <country>Sweden</country>
  22                 </postal>
  23                 <email>simon@josefsson.org</email>
  24         </address>
  25 </author>
  26
  27 <date month="February" year="2003"/>
  28
  29 <abstract>
  30
  31 <t>This document contains test vectors for Nameprep and IDNA.  Most
  32 of the test vectors were designed to cover corner cases in the
  33 specifications, but some data that is expected to be typical of
  34 real-world use is also included.  The aim is to promote
  35 interoperability and standards compliance in deployed
  36 implementations.</t>
  37
  38 </abstract>
  39
  40 </front>
  41
  42 <middle>
  43
  44 <section title="Introduction">
  45
  46 <t>The Nameprep and IDNA specifications lack thorough examples that
  47 would aid in implementing them.  This document is a complement to
  48 those specifications.</t>
  49
  50 <t>This document is not normative, and any errors in it must not be
  51 treated as defining Nameprep or IDNA.  When conforming to the
  52 specifications conflicts with generating the output given in this
  53 document, implementations must conform to the
  54 specifications.</t>
  55
  56 </section>
  57
  58 <section title="Format of Nameprep Test Vectors">
  59
  60 <t>Each test is presented in a fixed format, described here through
  61 one complete example with comments interspersed.  The comments are
  62 prefixed with the '#' character.</t>
  63
  64 <figure>
  65 <artwork>
  66 # First the (UTF-8) string is printed as a C octet string, with
  67 # characters [A-Za-z .0-9] shown inline and other characters shown
  68 # escaped with \xAB where AB is the hex sequence of that octet.  The
  69 # number of octets is also shown.
  70
  71 in: `foo\xC2\xADbar' (length 8)
  72
  73 # The input is also printed as Unicode codepoints.
  74
  75 input: U+0066 U+006F U+006F U+00AD U+0062 U+0061 U+0072
  76
  77 # After printing the input, the Nameprep steps start.  When the
  78 # string is modified, the specific operation that caused it is printed
  79 # along with the new string of Unicode code points.
  80
  81 # 1) Map -- For each character in the input, check if it has a mapping
  82 #    and, if so, replace it with its mapping.  This is described in
  83 #    section 3.
  84
  85 Table B.1 maps U+00AD to nothing.
  86 U+0066 U+006f U+006f U+0062 U+0061 U+0072
  87
  88 # 2) Normalize -- Possibly normalize the result of step 1 using Unicode
  89 #    normalization.  This is described in section 4.
  90
  91 # 3) Prohibit -- Check for any characters that are not allowed in the
  92 #    output.  If any are found, return an error.  This is described in
  93 #    section 5.
  94
  95 # 4) Check bidi -- Possibly check for right-to-left characters, and if
  96 #    any are found, make sure that the whole string satisfies the
  97 #    requirements for bidirectional strings.  If the string does not
  98 #    satisfy the requirements for bidirectional strings, return an
  99 #    error.  This is described in section 6.
 100 #
 101 #    1) The characters in section 5.8 MUST be prohibited.
 102
 103 #    2) If a string contains any RandALCat character, the string MUST NOT
 104 #       contain any LCat character.
 105
 106 #    3) If a string contains any RandALCat character, a RandALCat
 107 #       character MUST be the first character of the string, and a
 108 #       RandALCat character MUST be the last character of the string.
 109
 110 # The output is printed as Unicode codepoints.
 111
 112 output: U+0066 U+006f U+006f U+0062 U+0061 U+0072
 113
 114 # And finally the output is printed as UTF-8
 115
 116 out: `foobar' (length 6 bytes)
 117 </artwork>
 118 </figure>
 119
 120 </section>
 121
 122 <section title="Nameprep Test Vectors">
 123
 124 <?rfc include="foo"?>
 125
 126 </section>
 127
 128 <section title="Security Considerations">
 129
 130 <t>The security considerations of Nameprep and IDNA are discussed in
 131 those specifications.</t>
 132
 133 </section>
 134
 135 </middle>
 136
 137 <back>
 138
 139 <note title="Acknowledgments">
 140
 141 <t>TBA</t>
 142
 143 </note>
 144
 145 <section title="Test Vectors in C Syntax">
 146
 147 <t>To save implementers from typing in the test vectors above, a C
 148 structure containing the data is provided.  Although it is specific
 149 to one implementation, it should be easy to adapt to any package.</t>
 150
 151 <t>The comment field holds the section title used in this document.  The
 152 in field contains a UTF-8 encoded string.  The out field contains the
 153 expected output, or NULL if the expected result is an error.  The
 154 profile field can be ignored.  The only significant setting for the
 155 flags field is STRINGPREP_NO_UNASSIGNED, which tells the Nameprep
 156 implementation to reject unassigned code points, that is, to behave
 157 as if the "AllowUnassigned" flag were not set.  The rc field contains
 158 the expected return code, where 0 indicates success and the names of
 159 the other values are self-explanatory.</t>
 160
 161 <figure>
 162 <artwork>
 163 struct stringprep
 164 {
 165   char *comment;
 166   char *in;
 167   char *out;
 168   Stringprep_profile *profile;
 169   int flags;
 170   int rc;
 171 }
 172 strprep[] =
 173 {
 174   {
 175     "Map to nothing",
 176     "foo\xC2\xAD\xCD\x8F\xE1\xA0\x86\xE1\xA0\x8B"
 177     "bar""\xE2\x80\x8B\xE2\x81\xA0""baz\xEF\xB8\x80\xEF\xB8\x88"
 178     "\xEF\xB8\x8F\xEF\xBB\xBF", "foobarbaz"
 179   },
 180   {
 181     "Case folding ASCII U+0043 U+0041 U+0046 U+0045",
 182     "CAFE", "cafe"
 183   },
 184   {
 185     "Case folding 8bit U+00DF (german sharp s)",
 186     "\xC3\x9F", "ss"
 187   },
 188   {
 189     "Case folding U+0130 (turkish capital I with dot)",
 190     "\xC4\xB0", "i\xcc\x87"
 191   },
 192   {
 193     "Case folding multibyte U+0143 U+037A",
 194     "\xC5\x83\xCD\xBA", "\xC5\x84 \xCE\xB9"
 195   },
 196   {
 197     "Case folding U+2121 U+33C6 U+1D7BB",
 198     "\xE2\x84\xA1\xE3\x8F\x86\xF0\x9D\x9E\xBB",
 199     "telc\xE2\x88\x95""kg\xCF\x83"
 200   },
 201   {
 202     "Normalization of U+006a U+030c U+00A0 U+00AA",
 203     "\x6A\xCC\x8C\xC2\xA0\xC2\xAA", "\xC7\xB0 a"
 204   },
 205   {
 206     "Case folding U+1FB7 and normalization",
 207     "\xE1\xBE\xB7", "\xE1\xBE\xB6\xCE\xB9"
 208   },
 209   {
 210     "Self-reverting case folding U+01F0 and normalization",
 211     "\xC7\xB0", "\xC7\xB0"
 212   },
 213   {
 214     "Self-reverting case folding U+0390 and normalization",
 215     "\xCE\x90", "\xCE\x90"
 216   },
 217   {
 218     "Self-reverting case folding U+03B0 and normalization",
 219     "\xCE\xB0", "\xCE\xB0"
 220   },
 221   {
 222     "Self-reverting case folding U+1E96 and normalization",
 223     "\xE1\xBA\x96", "\xE1\xBA\x96"
 224   },
 225   {
 226     "Self-reverting case folding U+1F56 and normalization",
 227     "\xE1\xBD\x96", "\xE1\xBD\x96"
 228   },
 229   {
 230     "ASCII space character U+0020",
 231     "\x20", "\x20"
 232   },
 233   {
 234     "Non-ASCII 8bit space character U+00A0",
 235     "\xC2\xA0", "\x20"
 236   },
 237   {
 238     "Non-ASCII multibyte space character U+1680",
 239     "\xE1\x9A\x80", NULL, stringprep_nameprep, 0,
 240     STRINGPREP_CONTAINS_PROHIBITED
 241   },
 242   {
 243     "Non-ASCII multibyte space character U+2000",
 244     "\xE2\x80\x80", "\x20"
 245   },
 246   {
 247     "Zero Width Space U+200b",
 248     "\xE2\x80\x8b", ""
 249   },
 250   {
 251     "Non-ASCII multibyte space character U+3000",
 252     "\xE3\x80\x80", "\x20"
 253   },
 254   {
 255     "ASCII control characters U+0010 U+007F",
 256     "\x10\x7F", "\x10\x7F"
 257   },
 258   {
 259     "Non-ASCII 8bit control character U+0085",
 260     "\xC2\x85", NULL, stringprep_nameprep, 0,
 261     STRINGPREP_CONTAINS_PROHIBITED
 262   },
 263   {
 264     "Non-ASCII multibyte control character U+180E",
 265     "\xE1\xA0\x8E", NULL, stringprep_nameprep, 0,
 266     STRINGPREP_CONTAINS_PROHIBITED
 267   },
 268   {
 269     "Zero Width No-Break Space U+FEFF",
 270     "\xEF\xBB\xBF", ""
 271   },
 272   {
 273     "Non-ASCII control character U+1D175",
 274     "\xF0\x9D\x85\xB5", NULL, stringprep_nameprep, 0,
 275     STRINGPREP_CONTAINS_PROHIBITED
 276   },
 277   {
 278     "Plane 0 private use character U+F123",
 279     "\xEF\x84\xA3", NULL, stringprep_nameprep, 0,
 280     STRINGPREP_CONTAINS_PROHIBITED
 281   },
 282   {
 283     "Plane 15 private use character U+F1234",
 284     "\xF3\xB1\x88\xB4", NULL, stringprep_nameprep, 0,
 285     STRINGPREP_CONTAINS_PROHIBITED
 286   },
 287   {
 288     "Plane 16 private use character U+10F234",
 289     "\xF4\x8F\x88\xB4", NULL, stringprep_nameprep, 0,
 290     STRINGPREP_CONTAINS_PROHIBITED
 291   },
 292   {
 293     "Non-character code point U+8FFFE",
 294     "\xF2\x8F\xBF\xBE", NULL, stringprep_nameprep, 0,
 295     STRINGPREP_CONTAINS_PROHIBITED
 296   },
 297   {
 298     "Non-character code point U+10FFFF",
 299     "\xF4\x8F\xBF\xBF", NULL, stringprep_nameprep, 0,
 300     STRINGPREP_CONTAINS_PROHIBITED
 301   },
 302   {
 303     "Surrogate code U+DF42",
 304     "\xED\xBD\x82", NULL, stringprep_nameprep, 0,
 305     STRINGPREP_CONTAINS_PROHIBITED
 306   },
 307   {
 308     "Non-plain text character U+FFFD",
 309     "\xEF\xBF\xBD", NULL, stringprep_nameprep, 0,
 310     STRINGPREP_CONTAINS_PROHIBITED
 311   },
 312   {
 313     "Ideographic description character U+2FF5",
 314     "\xE2\xBF\xB5", NULL, stringprep_nameprep, 0,
 315     STRINGPREP_CONTAINS_PROHIBITED
 316   },
 317   {
 318     "Display property character U+0341",
 319     "\xCD\x81", "\xCC\x81"
 320   },
 321   {
 322     "Left-to-right mark U+200E",
 323     "\xE2\x80\x8E", NULL, stringprep_nameprep, 0,
 324     STRINGPREP_CONTAINS_PROHIBITED
 325   },
 326   {
 327     "Deprecated U+202A",
 328     "\xE2\x80\xAA", NULL, stringprep_nameprep, 0,
 329     STRINGPREP_CONTAINS_PROHIBITED
 330   },
 331   {
 332     "Language tagging character U+E0001",
 333     "\xF3\xA0\x80\x81", NULL, stringprep_nameprep, 0,
 334     STRINGPREP_CONTAINS_PROHIBITED
 335   },
 336   {
 337     "Language tagging character U+E0042",
 338     "\xF3\xA0\x81\x82", NULL, stringprep_nameprep, 0,
 339     STRINGPREP_CONTAINS_PROHIBITED
 340   },
 341   {
 342     "Bidi: RandALCat character U+05BE and LCat characters",
 343     "foo\xD6\xBE""bar", NULL, stringprep_nameprep, 0,
 344     STRINGPREP_BIDI_BOTH_L_AND_RAL
 345   },
 346   {
 347     "Bidi: RandALCat character U+FD50 and LCat characters",
 348     "foo\xEF\xB5\x90""bar", NULL, stringprep_nameprep, 0,
 349     STRINGPREP_BIDI_BOTH_L_AND_RAL
 350   },
 351   {
 352     "Bidi: RandALCat character U+FE76 and LCat characters",
 353     "foo\xEF\xB9\xB6""bar", "foo \xd9\x8e""bar"
 354   },
 355   { "Bidi: RandALCat without trailing RandALCat U+0627 U+0031",
 356     "\xD8\xA7\x31", NULL, stringprep_nameprep, 0,
 357     STRINGPREP_BIDI_LEADTRAIL_NOT_RAL
 358   },
 359   {
 360     "Bidi: RandALCat character U+0627 U+0031 U+0628",
 361     "\xD8\xA7\x31\xD8\xA8", "\xD8\xA7\x31\xD8\xA8"
 362   },
 363   {
 364     "Unassigned code point U+E0002",
 365     "\xF3\xA0\x80\x82", NULL, stringprep_nameprep, STRINGPREP_NO_UNASSIGNED,
 366     STRINGPREP_CONTAINS_UNASSIGNED
 367   },
 368   {
 369     "Larger test (shrinking)",
 370     "X\xC2\xAD\xC3\x9F\xC4\xB0\xE2\x84\xA1\x6a\xcc\x8c\xc2\xa0\xc2"
 371     "\xaa\xce\xb0\xe2\x80\x80", "xssi\xcc\x87""tel\xc7\xb0 a\xce\xb0 ",
 372     stringprep_nameprep
 373   },
 374   {
 375     "Larger test (expanding)",
 376     "X\xC3\x9F\xe3\x8c\x96\xC4\xB0\xE2\x84\xA1\xE2\x92\x9F\xE3\x8c\x80",
 377     "xss\xe3\x82\xad\xe3\x83\xad\xe3\x83\xa1\xe3\x83\xbc\xe3\x83\x88"
 378     "\xe3\x83\xab""i\xcc\x87""tel\x28""d\x29\xe3\x82\xa2\xe3\x83\x91"
 379     "\xe3\x83\xbc\xe3\x83\x88"
 380   },
 381 };
 382 </artwork>
 383 </figure>
 384
 385 </section>
 386
 387 <references title="Normative References">
 388    <?rfc include="reference.RFC.3454.xml"?>
 389 </references>
 390
 391 <references title="Informative References">
 392 </references>
 393
 394 </back>
 395
 396 </rfc>