2 draft-ietf-krb-wg-utf8-profile-01.txt
6 Preparation of Internationalized Strings Profile for Kerberos UTF-8 Strings
12 This document is an Internet-Draft and is in full conformance with all
13 provisions of Section 10 of RFC2026.
15 Internet-Drafts are working documents of the Internet Engineering Task
16 Force (IETF), its areas, and its working groups. Note that other groups
17 may also distribute working documents as Internet-Drafts.
19 Internet-Drafts are draft documents valid for a maximum of six months
20 and may be updated, replaced, or obsoleted by other documents at any
21 time. It is inappropriate to use Internet-Drafts as reference material
22 or to cite them other than as "work in progress."
24 To view the list of Internet-Draft Shadow Directories, see
25 http://www.ietf.org/shadow.html.
30 This document describes how to prepare UTF-8 strings for use with Kerberos
31 protocols in order to increase the likelihood that name input and name comparison
32 work in ways that make sense for typical users throughout the world. This
33 is a profile of "Preparation of Internationalized Strings" [RFC3454].
37 This document specifies processing rules that will allow users to enter
38 Kerberos Principal Names and input to cryptographic String to Key functions.
39 It is a profile of stringprep [RFC3454].
41 This profile defines the following, as required by [RFC3454]:
43 - The intended applicability of the profile: internationalized
46 - The character repertoire that is the input and output to stringprep:
49 - The list of unassigned code points for the repertoire: defined
52 - The mappings used: defined in Section 3.
54 - The Unicode normalization used: defined in Section 4.
56 - The characters that are prohibited as output: defined in Section 5.
61 The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
62 "MAY" in this document are to be interpreted as described in RFC 2119
65 Examples in this document use the notation for code points and names
66 from the Unicode Standard [Unicode3.1] and ISO/IEC 10646 [ISO10646]. For
67 example, the letter "a" may be represented as either "U+0061" or "LATIN
68 SMALL LETTER A". In the lists of prohibited characters, the "U+" is left
69 off to make the lists easier to read. The comments for character ranges
70 are shown in square brackets (such as "[SYMBOLS]") and do not come from
74 2. Character Repertoire
76 Unicode 3.2 [Unicode3.2] is the repertoire used in this profile.
77 The reason Unicode 3.2 was chosen instead of a version of
78 ISO/IEC 10646 is that Unicode 3.2 is the basis for [RFC3454].
83 This profile specifies stringprep mapping using the mapping table
84 in Appendix C. That table includes all the steps described in this section.
87 Note that the text in this section describes how Appendix C was formed. It is
88 there for people who want to understand more, but it should be ignored
89 by implementors. Implementations of this profile MUST map based on
90 Appendix C, not based on the descriptions in this section of how
91 Appendix C was created.
95 The following characters are simply deleted from the input (that is,
96 they are mapped to nothing) because their presence or absence should not
97 make two strings different.
99 Some characters are only useful in line-based text, and are otherwise
100 invisible and ignored.
103 1806; MONGOLIAN TODO SOFT HYPHEN
104 200B; ZERO WIDTH SPACE
106 FEFF; ZERO WIDTH NO-BREAK SPACE
108 Variation selectors and cursive connectors select different glyphs, but
109 do not bear semantics.
111 034F; COMBINING GRAPHEME JOINER
112 180B; MONGOLIAN FREE VARIATION SELECTOR ONE
113 180C; MONGOLIAN FREE VARIATION SELECTOR TWO
114 180D; MONGOLIAN FREE VARIATION SELECTOR THREE
115 200C; ZERO WIDTH NON-JOINER
116 200D; ZERO WIDTH JOINER
117 FE00; VARIATION SELECTOR-1
118 FE01; VARIATION SELECTOR-2
119 FE02; VARIATION SELECTOR-3
120 FE03; VARIATION SELECTOR-4
121 FE04; VARIATION SELECTOR-5
122 FE05; VARIATION SELECTOR-6
123 FE06; VARIATION SELECTOR-7
124 FE07; VARIATION SELECTOR-8
125 FE08; VARIATION SELECTOR-9
126 FE09; VARIATION SELECTOR-10
127 FE0A; VARIATION SELECTOR-11
128 FE0B; VARIATION SELECTOR-12
129 FE0C; VARIATION SELECTOR-13
130 FE0D; VARIATION SELECTOR-14
131 FE0E; VARIATION SELECTOR-15
132 FE0F; VARIATION SELECTOR-16
134 3.2 Space Character Conversions
136 Space characters can make accurate visual transcription of names
137 nearly impossible and could lead to user entry errors in many
138 ways. The following Unicode spaces are to be mapped to 0020; SPACE:
142 1680; OGHAM SPACE MARK
147 2004; THREE-PER-EM SPACE
148 2005; FOUR-PER-EM SPACE
149 2006; SIX-PER-EM SPACE
151 2008; PUNCTUATION SPACE
154 202F; NARROW NO-BREAK SPACE
155 205F; MEDIUM MATHEMATICAL SPACE
156 3000; IDEOGRAPHIC SPACE
160 This profile specifies using Unicode normalization form KC, as described
163 NOTE: There was some discussion on the mailing list that would suggest
164 that Unicode NFKC does not properly handle the composition of
165 normalized Hangul strings. Following the lead of the IDN working
166 group, the Kerberos working group will not attempt to second-guess the
167 authors of Unicode 3.1 Annex 15 (formerly Technical Report 15)
168 [UAX15], which specifies the normalization methods, or the Ideographic
169 Rapporteur Group (IRG), which is the formal subgroup of ISO/IEC
170 JTC1/SC2/WG2 charged with approving all CJKV elements of the Unicode
171 standards. Such issues are outside the working group's charter and
172 its area of expertise.
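
As an illustration only, the normalization step can be sketched in Python.
The sketch assumes the standard library's unicodedata.ucd_3_2_0 object,
which exposes the Unicode 3.2 database; the function name normalize_kc is
hypothetical and is not defined by this profile.

```python
from unicodedata import ucd_3_2_0  # Unicode 3.2 data, matching the repertoire


def normalize_kc(s):
    """Apply Unicode normalization form KC using the Unicode 3.2 database."""
    return ucd_3_2_0.normalize("NFKC", s)


# Compatibility characters fold to their plain equivalents, and combining
# sequences are composed.
ligature = normalize_kc("\ufb01")    # LATIN SMALL LIGATURE FI
fullwidth = normalize_kc("\uff21")   # FULLWIDTH LATIN CAPITAL LETTER A
composed = normalize_kc("e\u0301")   # "e" + COMBINING ACUTE ACCENT
```

Under these assumptions the three results are "fi", "A", and U+00E9
respectively, so two strings that differ only in such forms compare equal
after normalization.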
176 This profile specifies using the prohibition table in Appendix D.
178 Note that the subsections below describe how Appendix D was formed. They
179 are there for people who want to understand more, but they should be
180 ignored by implementors. Implementations of this profile MUST prohibit code points based
181 on Appendix D, not based on the descriptions in this section of how
182 Appendix D was created.
184 The collected lists of prohibited code points can be found in Appendix D
185 of this document. The lists in Appendix D MUST be used by implementations
186 of this specification. If there are any discrepancies between the lists
187 in Appendix D and the subsections below, the lists in Appendix D always take precedence.
190 Some code points listed in one section would also appear in other
191 sections. Each code point is only listed once in the tables in Appendix D.
195 5.1 Control characters
197 Control characters (or characters with control function) cannot be seen
198 and can cause unpredictable results when displayed.
200 0000-001F; [CONTROL CHARACTERS]
202 0080-009F; [CONTROL CHARACTERS]
203 06DD; ARABIC END OF AYAH
204 070F; SYRIAC ABBREVIATION MARK
205 180E; MONGOLIAN VOWEL SEPARATOR
206 200C; ZERO WIDTH NON-JOINER
207 200D; ZERO WIDTH JOINER
209 2029; PARAGRAPH SEPARATOR
211 2061; FUNCTION APPLICATION
212 2062; INVISIBLE TIMES
213 2063; INVISIBLE SEPARATOR
214 206A-206F; [CONTROL CHARACTERS]
215 FEFF; ZERO WIDTH NO-BREAK SPACE
216 FFF9-FFFC; [CONTROL CHARACTERS]
217 1D173-1D17A; [MUSICAL CONTROL CHARACTERS]
219 5.2 Private use and replacement characters
221 Because private-use characters do not have defined meanings, they are
222 prohibited. The private-use characters are:
224 E000-F8FF; [PRIVATE USE, PLANE 0]
225 F0000-FFFFD; [PRIVATE USE, PLANE 15]
226 100000-10FFFD; [PRIVATE USE, PLANE 16]
228 5.3 Non-character code points
230 Non-character code points are code points that have been allocated in
231 ISO/IEC 10646 but are not characters. Because they are already assigned,
232 they are guaranteed not to later change into characters.
234 FDD0-FDEF; [NONCHARACTER CODE POINTS]
235 FFFE-FFFF; [NONCHARACTER CODE POINTS]
236 1FFFE-1FFFF; [NONCHARACTER CODE POINTS]
237 2FFFE-2FFFF; [NONCHARACTER CODE POINTS]
238 3FFFE-3FFFF; [NONCHARACTER CODE POINTS]
239 4FFFE-4FFFF; [NONCHARACTER CODE POINTS]
240 5FFFE-5FFFF; [NONCHARACTER CODE POINTS]
241 6FFFE-6FFFF; [NONCHARACTER CODE POINTS]
242 7FFFE-7FFFF; [NONCHARACTER CODE POINTS]
243 8FFFE-8FFFF; [NONCHARACTER CODE POINTS]
244 9FFFE-9FFFF; [NONCHARACTER CODE POINTS]
245 AFFFE-AFFFF; [NONCHARACTER CODE POINTS]
246 BFFFE-BFFFF; [NONCHARACTER CODE POINTS]
247 CFFFE-CFFFF; [NONCHARACTER CODE POINTS]
248 DFFFE-DFFFF; [NONCHARACTER CODE POINTS]
249 EFFFE-EFFFF; [NONCHARACTER CODE POINTS]
250 FFFFE-FFFFF; [NONCHARACTER CODE POINTS]
251 10FFFE-10FFFF; [NONCHARACTER CODE POINTS]
253 The non-character code points are listed in the PropList.txt file from the
258 The following code points are permanently reserved for use as surrogate
259 code values in the UTF-16 encoding, will never be assigned to
260 characters, and are therefore prohibited:
262 D800-DFFF; [SURROGATE CODES]
264 5.5 Inappropriate for plain text
266 The following characters should not appear in regular text.
268 FFF9; INTERLINEAR ANNOTATION ANCHOR
269 FFFA; INTERLINEAR ANNOTATION SEPARATOR
270 FFFB; INTERLINEAR ANNOTATION TERMINATOR
271 FFFC; OBJECT REPLACEMENT CHARACTER
273 Although the replacement character (U+FFFD) might be used when a name is
274 displayed, it doesn't make sense for it to be part of the name itself.
275 It is often displayed by renderers to indicate "there would be
276 some character here, but it cannot be rendered". For example, on a
277 computer with no Asian fonts, a name with three ideographs might be
278 rendered with three replacement characters.
280 FFFD; REPLACEMENT CHARACTER
282 5.6 Inappropriate for canonical representation
284 The ideographic description characters allow different sequences of
285 characters to be rendered the same way, which makes them inappropriate
286 for host names that must have a single canonical representation.
288 2FF0-2FFB; [IDEOGRAPHIC DESCRIPTION CHARACTERS]
290 5.7 Change display properties
292 The following characters can cause changes in display or the order in
293 which characters appear when rendered, or are deprecated in Unicode.
295 0340; COMBINING GRAVE TONE MARK
296 0341; COMBINING ACUTE TONE MARK
297 200E; LEFT-TO-RIGHT MARK
298 200F; RIGHT-TO-LEFT MARK
299 202A; LEFT-TO-RIGHT EMBEDDING
300 202B; RIGHT-TO-LEFT EMBEDDING
301 202C; POP DIRECTIONAL FORMATTING
302 202D; LEFT-TO-RIGHT OVERRIDE
303 202E; RIGHT-TO-LEFT OVERRIDE
304 206A; INHIBIT SYMMETRIC SWAPPING
305 206B; ACTIVATE SYMMETRIC SWAPPING
306 206C; INHIBIT ARABIC FORM SHAPING
307 206D; ACTIVATE ARABIC FORM SHAPING
308 206E; NATIONAL DIGIT SHAPES
309 206F; NOMINAL DIGIT SHAPES
311 5.8 Tagging characters
313 The following characters are used for tagging text and are invisible.
316 E0020-E007F; [TAGGING CHARACTERS]
318 6. Bidirectional Characters
320 This profile specifies checking bidirectional strings as described
321 in [RFC3454] section 6.
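
A minimal sketch of the bidirectional check follows, assuming Python's
standard stringprep module, whose in_table_d1 and in_table_d2 functions
test membership in the RandALCat and LCat tables of [RFC3454]. The function
name check_bidi is hypothetical. The sketch covers only the two string-level
requirements (no mixing of RandALCat and LCat characters, and RandALCat
characters at both ends); the prohibition of the directional formatting
characters is assumed to be handled by the prohibited-output step.

```python
import stringprep


def check_bidi(s):
    """Return True if s passes the string-level bidirectional checks."""
    has_randal = any(stringprep.in_table_d1(ch) for ch in s)
    if not has_randal:
        return True  # no right-to-left characters, nothing to check
    # A string with RandALCat characters must not contain any LCat character.
    if any(stringprep.in_table_d2(ch) for ch in s):
        return False
    # Its first and last characters must both be RandALCat characters.
    return stringprep.in_table_d1(s[0]) and stringprep.in_table_d1(s[-1])
```

For example, a string of Hebrew letters passes, while the same letters with
a leading Latin letter or a trailing digit do not.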
324 7. Unassigned Code Points
326 This profile lists the unassigned code points for Unicode 3.2 in
327 Appendix E. The list in Appendix E MUST be used by implementations of
328 this specification. If there are any discrepancies between the list in
329 Appendix E and the Unicode 3.2 specification, the list in Appendix E always takes precedence.
333 8. Security Considerations
335 ISO/IEC 10646 has many characters that look similar. In many cases,
336 users of security protocols might do visual matching, such as when
337 comparing the names of trusted third parties. This profile does nothing
338 to map similar-looking characters together.
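
To make the point concrete, the following Python sketch compares two
strings that typically render alike but differ in one code point. The
sample strings and the helper name same_after_nfkc are illustrative
assumptions; the sketch applies only normalization form KC, which is enough
to show that look-alike letters from different scripts stay distinct.

```python
from unicodedata import ucd_3_2_0  # Unicode 3.2 data, as used by this profile

latin = "admin"        # all Latin letters
mixed = "\u0430dmin"   # begins with U+0430 CYRILLIC SMALL LETTER A


def same_after_nfkc(a, b):
    """Return True if a and b are identical after normalization form KC."""
    return ucd_3_2_0.normalize("NFKC", a) == ucd_3_2_0.normalize("NFKC", b)


# The two names remain different strings, so they would be treated as
# different names even though a user may not be able to tell them apart.
distinct = not same_after_nfkc(latin, mixed)
```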
340 Principal names and passwords are entered by users and used within the
341 Kerberos protocol. The
342 security of the Internet would be compromised if a user entering a
343 single internationalized string could be connected to different servers
344 or denied access based on different interpretations of
345 internationalized strings.
347 9. IANA Considerations
349 IANA is to register this profile as described in [RFC3454].
353 [CharModel] Unicode Technical Report #17, Character Encoding Model.
354 <http://www.unicode.org/unicode/reports/tr17/>.
356 [Glossary] Unicode Glossary, <http://www.unicode.org/glossary/>.
358 [ISO10646] ISO/IEC 10646-1:2000. International Standard -- Information
359 technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
360 1: Architecture and Basic Multilingual Plane.
362 [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
363 Requirement Levels", March 1997, RFC 2119.
365 [RFC3454] Paul Hoffman and Marc Blanchet, "Preparation of
366 Internationalized Strings ("stringprep")", draft-hoffman-stringprep,
369 [Unicode3.2] The Unicode Standard, Version 3.2.0: The Unicode
370 Consortium. The Unicode Standard, Version 3.0. Reading, MA,
371 Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5, as amended
372 by: Unicode Standard Annex #27: Unicode 3.1
373 <http://www.unicode.org/unicode/reports/tr27/>; and
374 by: Unicode Standard Annex #28: Unicode 3.2
375 <http://www.unicode.org/unicode/reports/tr28/>
377 [UAX9] The Unicode Consortium. Unicode Standard Annex #9, The
378 Bidirectional Algorithm, <http://www.unicode.org/unicode/reports/tr9/>.
380 [UAX15] Mark Davis and Martin Duerst. Unicode Standard Annex #15:
381 Unicode Normalization Forms, Version 3.1.0.
382 <http://www.unicode.org/unicode/reports/tr15/tr15-21.html>
387 This draft is based upon the work of the IETF IDN Working Group's
388 IDN Nameprep design team.
390 This profile is the work of the Kerberos Working Group. Significant
391 contributions were provided by Jeffrey Hutzelman, Sam Hartman, Tom Yu,
392 Ken Raeburn, and Jeffrey Altman.
394 B. Editor Contact Information
397 Internet Access Methods
401 e-mail: jaltman@iamx.com
406 The following is the mapping table from Section 3. The table has three
408 - the character that is mapped from
409 - the zero or more characters that it is mapped to
410 - the reason for the mapping
411 The columns are separated by semicolons. Note that the second column may
412 be empty, or it may have one character, or it may have more than one
413 character, with each character separated by a space.
415 ----- Start Mapping Table -----
416 00A0; 0020; NO-BREAK SPACE
417 00AD; ; Map to nothing
418 034F; ; Map to nothing
419 1680; 0020; OGHAM SPACE MARK
420 1806; ; Map to nothing
421 180B; ; Map to nothing
422 180C; ; Map to nothing
423 180D; ; Map to nothing
428 2004; 0020; THREE-PER-EM SPACE
429 2005; 0020; FOUR-PER-EM SPACE
430 2006; 0020; SIX-PER-EM SPACE
431 2007; 0020; FIGURE SPACE
432 2008; 0020; PUNCTUATION SPACE
433 2009; 0020; THIN SPACE
434 200A; 0020; HAIR SPACE
435 200B; ; Map to nothing
436 200C; ; Map to nothing
437 200D; ; Map to nothing
438 2060; ; Map to nothing
439 202F; 0020; NARROW NO-BREAK SPACE
440 205F; 0020; MEDIUM MATHEMATICAL SPACE
441 3000; 0020; IDEOGRAPHIC SPACE
442 FE00; ; Map to nothing
443 FE01; ; Map to nothing
444 FE02; ; Map to nothing
445 FE03; ; Map to nothing
446 FE04; ; Map to nothing
447 FE05; ; Map to nothing
448 FE06; ; Map to nothing
449 FE07; ; Map to nothing
450 FE08; ; Map to nothing
451 FE09; ; Map to nothing
452 FE0A; ; Map to nothing
453 FE0B; ; Map to nothing
454 FE0C; ; Map to nothing
455 FE0D; ; Map to nothing
456 FE0E; ; Map to nothing
457 FE0F; ; Map to nothing
458 FEFF; ; Map to nothing
459 ----- End Mapping Table -----
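
The table format above lends itself to mechanical processing. The following
Python sketch shows one way an implementation might parse rows in this
format and apply them. The names SAMPLE_ROWS, parse_mapping_table, and
apply_mapping are hypothetical, and only a few rows are embedded for
illustration; a real implementation would load the complete table.

```python
# Illustrative subset of the mapping table rows (not the complete table).
SAMPLE_ROWS = """\
00A0; 0020; NO-BREAK SPACE
00AD; ; Map to nothing
200B; ; Map to nothing
3000; 0020; IDEOGRAPHIC SPACE
"""


def parse_mapping_table(text):
    """Return {source code point: replacement string} for each table row."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-----"):
            continue  # skip blank lines and the Start/End markers
        source, targets, _reason = (f.strip() for f in line.split(";", 2))
        # The second column holds zero or more space-separated code points.
        replacement = "".join(chr(int(cp, 16)) for cp in targets.split())
        mapping[int(source, 16)] = replacement
    return mapping


def apply_mapping(s, mapping):
    """Map each character through the table; unlisted characters pass through."""
    return s.translate(mapping)
```

With the sample rows, apply_mapping("a\u00a0b\u00adc\u200b", mapping) yields
"a bc": the no-break space becomes U+0020 and the soft hyphen and zero
width space are removed.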
462 D. Prohibited Code Point List
464 ----- Start Prohibited Table -----
465 0000-001F; [CONTROL CHARACTERS]
467 0080-009F; [CONTROL CHARACTERS]
468 0340; COMBINING GRAVE TONE MARK
469 0341; COMBINING ACUTE TONE MARK
470 06DD; ARABIC END OF AYAH
471 070F; SYRIAC ABBREVIATION MARK
472 100000-10FFFD; [PRIVATE USE, PLANE 16]
473 10FFFE-10FFFF; [NONCHARACTER CODE POINTS]
474 180E; MONGOLIAN VOWEL SEPARATOR
475 1D173-1D17A; [MUSICAL CONTROL CHARACTERS]
476 1FFFE-1FFFF; [NONCHARACTER CODE POINTS]
477 200C; ZERO WIDTH NON-JOINER
478 200D; ZERO WIDTH JOINER
479 200E; LEFT-TO-RIGHT MARK
480 200F; RIGHT-TO-LEFT MARK
482 2029; PARAGRAPH SEPARATOR
483 202A; LEFT-TO-RIGHT EMBEDDING
484 202B; RIGHT-TO-LEFT EMBEDDING
485 202C; POP DIRECTIONAL FORMATTING
486 202D; LEFT-TO-RIGHT OVERRIDE
487 202E; RIGHT-TO-LEFT OVERRIDE
489 2061; FUNCTION APPLICATION
490 2062; INVISIBLE TIMES
491 2063; INVISIBLE SEPARATOR
492 206A-206F; [CONTROL CHARACTERS]
493 206A; INHIBIT SYMMETRIC SWAPPING
494 206B; ACTIVATE SYMMETRIC SWAPPING
495 206C; INHIBIT ARABIC FORM SHAPING
496 206D; ACTIVATE ARABIC FORM SHAPING
497 206E; NATIONAL DIGIT SHAPES
498 206F; NOMINAL DIGIT SHAPES
499 2FF0-2FFB; [IDEOGRAPHIC DESCRIPTION CHARACTERS]
500 2FFFE-2FFFF; [NONCHARACTER CODE POINTS]
501 3FFFE-3FFFF; [NONCHARACTER CODE POINTS]
502 4FFFE-4FFFF; [NONCHARACTER CODE POINTS]
503 5FFFE-5FFFF; [NONCHARACTER CODE POINTS]
504 6FFFE-6FFFF; [NONCHARACTER CODE POINTS]
505 7FFFE-7FFFF; [NONCHARACTER CODE POINTS]
506 8FFFE-8FFFF; [NONCHARACTER CODE POINTS]
507 9FFFE-9FFFF; [NONCHARACTER CODE POINTS]
508 AFFFE-AFFFF; [NONCHARACTER CODE POINTS]
509 BFFFE-BFFFF; [NONCHARACTER CODE POINTS]
510 CFFFE-CFFFF; [NONCHARACTER CODE POINTS]
511 D800-DFFF; [SURROGATE CODES]
512 DFFFE-DFFFF; [NONCHARACTER CODE POINTS]
513 E000-F8FF; [PRIVATE USE, PLANE 0]
515 E0020-E007F; [TAGGING CHARACTERS]
516 EFFFE-EFFFF; [NONCHARACTER CODE POINTS]
517 F0000-FFFFD; [PRIVATE USE, PLANE 15]
518 FDD0-FDEF; [NONCHARACTER CODE POINTS]
519 FEFF; ZERO WIDTH NO-BREAK SPACE
520 FFF9-FFFC; [CONTROL CHARACTERS]
521 FFF9; INTERLINEAR ANNOTATION ANCHOR
522 FFFA; INTERLINEAR ANNOTATION SEPARATOR
523 FFFB; INTERLINEAR ANNOTATION TERMINATOR
524 FFFC; OBJECT REPLACEMENT CHARACTER
525 FFFD; REPLACEMENT CHARACTER
526 FFFE-FFFF; [NONCHARACTER CODE POINTS]
527 FFFFE-FFFFF; [NONCHARACTER CODE POINTS]
528 ----- End Prohibited Table -----
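
As a sketch of how the table above might be consumed, the following Python
code parses rows that hold either a single code point or an inclusive range
and then scans a string for prohibited output. The names SAMPLE_ROWS,
parse_prohibited_table, and first_prohibited are hypothetical, and only a
few rows are embedded for illustration.

```python
# Illustrative subset of the prohibited table rows (not the complete table).
SAMPLE_ROWS = """\
0000-001F; [CONTROL CHARACTERS]
200E; LEFT-TO-RIGHT MARK
E000-F8FF; [PRIVATE USE, PLANE 0]
FFFD; REPLACEMENT CHARACTER
"""


def parse_prohibited_table(text):
    """Return a list of inclusive (low, high) code point ranges."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-----"):
            continue  # skip blank lines and the Start/End markers
        span = line.split(";", 1)[0].strip()
        low, _, high = span.partition("-")
        # A single code point is stored as a range of length one.
        ranges.append((int(low, 16), int(high or low, 16)))
    return ranges


def first_prohibited(s, ranges):
    """Return the first prohibited character in s, or None if there is none."""
    for ch in s:
        if any(low <= ord(ch) <= high for low, high in ranges):
            return ch
    return None
```

A caller would reject the string whenever first_prohibited returns a
character. A linear scan over the ranges is used here for clarity; a sorted
list with binary search would suit the full table better.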
530 NOTE WELL: Software that follows this specification that will be used to
531 check names before they are put in authoritative name servers MUST add
532 all unassigned code points to the list of characters that are prohibited.
533 See Section 7 of [RFC3454] for more details.
536 E. Unassigned Code Point List
538 ----- Start Unassigned Table -----
935 ----- End Unassigned Table -----
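
For experimentation, a check for unassigned code points can be sketched
with Python's standard stringprep module, whose in_table_a1 function tests
membership in the Unicode 3.2 unassigned code point table of [RFC3454].
Using that function in place of the table above is an assumption made for
brevity, and the function name has_unassigned is hypothetical; the list in
this appendix remains the authoritative one.

```python
import stringprep


def has_unassigned(s):
    """Return True if s contains a code point unassigned in Unicode 3.2."""
    return any(stringprep.in_table_a1(ch) for ch in s)


# U+0221 had no assignment in Unicode 3.2, so it is reported as unassigned
# even on systems whose current Unicode data defines it.
sample_flagged = has_unassigned("name\u0221")
sample_clean = has_unassigned("name")
```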