<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<rfc category="info" ipr="full2026"
docName="draft-josefsson-idn-test-vectors">
<title>Nameprep and IDNA Test Vectors</title>
<author initials="S." surname="Josefsson" fullname="Simon Josefsson">
<organization>Extundo</organization>
<street>Drottningholmsv. 70</street>
<city>Stockholm</city> <code>112 42</code>
<country>Sweden</country>
<email>simon@josefsson.org</email>
<date month="February" year="2003"/>
<t>This document contains test vectors for Nameprep and IDNA. The
majority of test vectors are derived in order to cover various corner
cases in the specifications, but some anticipated typical data from
the real world are also included. The aim is to promote
interoperability and standards compliance in deployed
<section title="Introduction">
<t>The Nameprep and IDNA specifications lack thorough examples that
would aid in implementing them. This document is a complement to
those specifications.</t>
<t>Note that this document is not normative, so any errors here must
not be taken as defining Nameprep or IDNA. If conforming to the
specifications conflicts with producing the output values listed in
this document, implementations must conform to the
specifications.</t>
<section title="Format of Nameprep Test Vectors">
<t>The tests follow a certain syntax, described here by showing one
complete example with comments intermixed. The comments are prefixed
with the '#' character.</t>
# First the (UTF-8) string is printed as a C octet string, with
# characters [A-Za-z .0-9] shown inline and other characters shown
# escaped with \xAB where AB is the hexadecimal value of that octet.
# The number of octets is also shown.

in: `foo\xC2\xADbar' (length 8)

# The input is also printed as Unicode code points.

input: U+0066 U+006F U+006F U+00AD U+0062 U+0061 U+0072

# After printing the input, the nameprep steps start. When the
# string is modified, the specific operation that caused it is printed
# along with the new string of Unicode code points.

# 1) Map -- For each character in the input, check if it has a mapping
# and, if so, replace it with its mapping. This is described in

Table B.1 maps U+00AD to nothing.
U+0066 U+006f U+006f U+0062 U+0061 U+0072

# 2) Normalize -- Possibly normalize the result of step 1 using Unicode
# normalization. This is described in section 4.

# 3) Prohibit -- Check for any characters that are not allowed in the
# output. If any are found, return an error. This is described in

# 4) Check bidi -- Possibly check for right-to-left characters, and if
# any are found, make sure that the whole string satisfies the
# requirements for bidirectional strings. If the string does not
# satisfy the requirements for bidirectional strings, return an
# error. This is described in section 6.

# 1) The characters in section 5.8 MUST be prohibited.

# 2) If a string contains any RandALCat character, the string MUST NOT
# contain any LCat character.

# 3) If a string contains any RandALCat character, a RandALCat
# character MUST be the first character of the string, and a
# RandALCat character MUST be the last character of the string.

# The output is printed as Unicode code points.

output: U+0066 U+006f U+006f U+0062 U+0061 U+0072

# And finally the output is printed as UTF-8.

out: `foobar' (length 6 bytes)
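<t>The two notations used in the example can be reproduced with a few
lines of C. The following is a minimal sketch, not the code that
generated the vectors: escape_octets and format_codepoints are
hypothetical helper names, and the decoder performs only structural
UTF-8 checks (lead octet followed by the right number of continuation
octets), so it does not reject overlong forms or surrogates.</t>

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from any library): writes src in the notation of
   the "in:" and "out:" lines. Octets in [A-Za-z .0-9] are copied as-is and
   every other octet becomes \xAB. Returns the number of characters written,
   or -1 if dst is too small. */
static int escape_octets(const char *src, char *dst, size_t dstlen)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t used = 0;

    if (dstlen == 0)
        return -1;
    for (; *p != '\0'; p++) {
        int plain = (*p >= 'A' && *p <= 'Z') || (*p >= 'a' && *p <= 'z') ||
                    (*p >= '0' && *p <= '9') || *p == ' ' || *p == '.';
        size_t need = plain ? 1 : 4;

        if (used + need + 1 > dstlen)   /* keep room for the final NUL */
            return -1;
        if (plain)
            dst[used++] = (char)*p;
        else
            used += (size_t)sprintf(dst + used, "\\x%02X", (unsigned)*p);
    }
    dst[used] = '\0';
    return (int)used;
}

/* Hypothetical helper: decodes UTF-8 and writes the code points in the
   notation of the "input:" and "output:" lines ("U+0066 U+006F ...").
   Returns the number of characters written, or -1 on a structurally
   malformed sequence or if dst is too small. */
static int format_codepoints(const char *src, char *dst, size_t dstlen)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t used = 0;

    if (dstlen == 0)
        return -1;
    dst[0] = '\0';
    while (*p != '\0') {
        unsigned long cp;
        int extra, n;

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1;               /* stray continuation or bad lead */
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)  /* also catches a premature NUL */
                return -1;
            cp = (cp << 6) | (unsigned long)(*p & 0x3F);
            p++;
        }
        n = snprintf(dst + used, dstlen - used, "%sU+%04lX",
                     used ? " " : "", cp);
        if (n < 0 || (size_t)n >= dstlen - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

<t>Under these assumptions, passing the example input (written in C as
"foo\xC2\xAD" "bar", split so that the hexadecimal escape does not
absorb the letters that follow it) to escape_octets gives the text
shown on the "in:" line, and format_codepoints gives the seven code
points shown on the "input:" line.</t>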
<section title="Nameprep Test Vectors">
<?rfc include="foo"?>
<section title="Security Considerations">
<t>The security considerations of Nameprep and IDNA are discussed in
those specifications.</t>
<note title="Acknowledgments">
<section title="Test vector in C syntax">
<t>To spare implementers from typing in the test vectors above, a C
structure containing the data is provided. While it is specific to
one implementation, it should be trivial to adapt for any package.</t>
<t>The comment field holds the section title used in this document.
The in field contains the UTF-8 encoded input string. The out field
contains the expected output, or NULL if the expected result is an
error. The profile field can be ignored. The only significant setting
for the flags field is STRINGPREP_NO_UNASSIGNED, which signals to the
Nameprep implementation that it should perform unassigned code point
checking, that is, the behavior when the "AllowUnassigned" flag is
not set. The rc field contains the expected return code, where 0
indicates success; the other values should be self-explanatory.</t>
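<t>The way these fields are meant to be consumed can be sketched as a
small test loop. This is an illustration under stated assumptions
rather than the harness used to produce this document: struct vector
mirrors the fields described above but replaces the profile pointer
with a void pointer, prep_fn is a hypothetical adapter around whichever
Nameprep implementation is under test, and toy_prep is a stand-in that
only lowercases ASCII so that the sketch can run without any library.</t>

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the fields described in the text. */
struct vector {
    const char *comment;  /* section title of the test */
    const char *in;       /* UTF-8 input */
    const char *out;      /* expected output, NULL if an error is expected */
    const void *profile;  /* ignored in this sketch */
    int flags;            /* e.g. an "unassigned checking" flag */
    int rc;               /* expected return code, 0 means success */
};

/* Hypothetical adapter: runs the implementation under test on `in`,
   writes the result to buf and returns its return code. */
typedef int (*prep_fn)(const char *in, int flags, char *buf, size_t buflen);

/* Runs every vector and returns the number of failures. The output is
   compared only when success is both expected and reported. */
static int run_vectors(const struct vector *v, size_t n, prep_fn prep)
{
    int failures = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        char buf[256];
        int rc = prep(v[i].in, v[i].flags, buf, sizeof buf);
        int ok = (rc == v[i].rc);

        if (ok && rc == 0)
            ok = v[i].out != NULL && strcmp(buf, v[i].out) == 0;
        if (!ok) {
            printf("FAIL: %s\n", v[i].comment);
            failures++;
        }
    }
    return failures;
}

/* Stand-in only, NOT Nameprep: lowercases ASCII letters and rejects any
   non-ASCII octet with the made-up return code 1. */
static int toy_prep(const char *in, int flags, char *buf, size_t buflen)
{
    size_t i;

    (void)flags;
    if (buflen == 0)
        return 1;
    for (i = 0; in[i] != '\0'; i++) {
        unsigned char c = (unsigned char)in[i];

        if (c >= 0x80 || i + 1 >= buflen)
            return 1;
        buf[i] = (char)((c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c);
    }
    buf[i] = '\0';
    return 0;
}
```

<t>A real harness would pass an adapter around the actual
implementation instead of toy_prep and would compare against that
implementation's own return codes.</t>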
Stringprep_profile *profile;
"foo\xC2\xAD\xCD\x8F\xE1\xA0\x86\xE1\xA0\x8B"
"bar""\xE2\x80\x8B\xE2\x81\xA0""baz\xEF\xB8\x80\xEF\xB8\x88"
"\xEF\xB8\x8F\xEF\xBB\xBF", "foobarbaz"
"Case folding ASCII U+0043 U+0041 U+0046 U+0045",
"Case folding 8bit U+00DF (german sharp s)",
"Case folding U+0130 (turkish capital I with dot)",
"\xC4\xB0", "i\xcc\x87"
"Case folding multibyte U+0143 U+037A",
"\xC5\x83\xCD\xBA", "\xC5\x84 \xCE\xB9"
"Case folding U+2121 U+33C6 U+1D7BB",
"\xE2\x84\xA1\xE3\x8F\x86\xF0\x9D\x9E\xBB",
"telc\xE2\x88\x95""kg\xCF\x83"
"Normalization of U+006a U+030c U+00A0 U+00AA",
"\x6A\xCC\x8C\xC2\xA0\xC2\xAA", "\xC7\xB0 a"
"Case folding U+1FB7 and normalization",
"\xE1\xBE\xB7", "\xE1\xBE\xB6\xCE\xB9"
"Self-reverting case folding U+01F0 and normalization",
"\xC7\xB0", "\xC7\xB0"
"Self-reverting case folding U+0390 and normalization",
"\xCE\x90", "\xCE\x90"
"Self-reverting case folding U+03B0 and normalization",
"\xCE\xB0", "\xCE\xB0"
"Self-reverting case folding U+1E96 and normalization",
"\xE1\xBA\x96", "\xE1\xBA\x96"
"Self-reverting case folding U+1F56 and normalization",
"\xE1\xBD\x96", "\xE1\xBD\x96"
"ASCII space character U+0020",
"Non-ASCII 8bit space character U+00A0",
"Non-ASCII multibyte space character U+1680",
"\xE1\x9A\x80", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Non-ASCII multibyte space character U+2000",
"\xE2\x80\x80", "\x20"
"Zero Width Space U+200b",
"Non-ASCII multibyte space character U+3000",
"\xE3\x80\x80", "\x20"
"ASCII control characters U+0010 U+007F",
"\x10\x7F", "\x10\x7F"
"Non-ASCII 8bit control character U+0085",
"\xC2\x85", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Non-ASCII multibyte control character U+180E",
"\xE1\xA0\x8E", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Zero Width No-Break Space U+FEFF",
"Non-ASCII control character U+1D175",
"\xF0\x9D\x85\xB5", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Plane 0 private use character U+F123",
"\xEF\x84\xA3", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Plane 15 private use character U+F1234",
"\xF3\xB1\x88\xB4", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Plane 16 private use character U+10F234",
"\xF4\x8F\x88\xB4", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Non-character code point U+8FFFE",
"\xF2\x8F\xBF\xBE", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Non-character code point U+10FFFF",
"\xF4\x8F\xBF\xBF", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Surrogate code U+DF42",
"\xED\xBD\x82", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Non-plain text character U+FFFD",
"\xEF\xBF\xBD", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Ideographic description character U+2FF5",
"\xE2\xBF\xB5", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Display property character U+0341",
"\xCD\x81", "\xCC\x81"
"Left-to-right mark U+200E",
"\xE2\x80\x8E", "\xCC\x81", stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"\xE2\x80\xAA", "\xCC\x81", stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Language tagging character U+E0001",
"\xF3\xA0\x80\x81", "\xCC\x81", stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Language tagging character U+E0042",
"\xF3\xA0\x81\x82", NULL, stringprep_nameprep, 0,
STRINGPREP_CONTAINS_PROHIBITED
"Bidi: RandALCat character U+05BE and LCat characters",
"foo\xD6\xBE""bar", NULL, stringprep_nameprep, 0,
STRINGPREP_BIDI_BOTH_L_AND_RAL
"Bidi: RandALCat character U+FD50 and LCat characters",
"foo\xEF\xB5\x90""bar", NULL, stringprep_nameprep, 0,
STRINGPREP_BIDI_BOTH_L_AND_RAL
"Bidi: RandALCat character U+FB38 and LCat characters",
"foo\xEF\xB9\xB6""bar", "foo \xd9\x8e""bar"
{ "Bidi: RandALCat without trailing RandALCat U+0627 U+0031",
"\xD8\xA7\x31", NULL, stringprep_nameprep, 0,
STRINGPREP_BIDI_LEADTRAIL_NOT_RAL}
"Bidi: RandALCat character U+0627 U+0031 U+0628",
"\xD8\xA7\x31\xD8\xA8", "\xD8\xA7\x31\xD8\xA8"
"Unassigned code point U+E0002",
"\xF3\xA0\x80\x82", NULL, stringprep_nameprep, STRINGPREP_NO_UNASSIGNED,
STRINGPREP_CONTAINS_UNASSIGNED
"Larger test (shrinking)",
"X\xC2\xAD\xC3\x9F\xC4\xB0\xE2\x84\xA1\x6a\xcc\x8c\xc2\xa0\xc2"
"\xaa\xce\xb0\xe2\x80\x80", "xssi\xcc\x87""tel\xc7\xb0 a\xce\xb0 ",
"Larger test (expanding)",
"X\xC3\x9F\xe3\x8c\x96\xC4\xB0\xE2\x84\xA1\xE2\x92\x9F\xE3\x8c\x80",
"xss\xe3\x82\xad\xe3\x83\xad\xe3\x83\xa1\xe3\x83\xbc\xe3\x83\x88"
"\xe3\x83\xab""i\xcc\x87""tel\x28""d\x29\xe3\x82\xa2\xe3\x83\x91"
"\xe3\x83\xbc\xe3\x83\x88"
<references title="Normative References">
<?rfc include="reference.RFC.3454.xml"?>
<references title="Informative References">