draft-ietf-krb-wg-utf8-profile-00.txt

   1 Internet Draft                                              Jeffrey Altman
   2 draft-ietf-krb-wg-utf8-profile-00.txt                  Columbia University
   3 February 12, 2002
   4 Expires in six months
   5
   6         Stringprep Profile for Kerberos UTF-8 Strings
   7
   8 Status of this memo
   9
  10 This document is an Internet-Draft and is in full conformance with all
  11 provisions of Section 10 of RFC2026.
  12
  13 Internet-Drafts are working documents of the Internet Engineering Task
  14 Force (IETF), its areas, and its working groups. Note that other groups
  15 may also distribute working documents as Internet-Drafts.
  16
  17 Internet-Drafts are draft documents valid for a maximum of six months
  18 and may be updated, replaced, or obsoleted by other documents at any
  19 time. It is inappropriate to use Internet-Drafts as reference material
  20 or to cite them other than as "work in progress."
  21
   22 To view the list of Internet-Draft Shadow Directories, see
  23 http://www.ietf.org/shadow.html.
  24
  25
  26 Abstract
  27
  28 This document describes how to prepare UTF-8 strings
  29 in order to increase the likelihood that name input and name comparison
  30 work in ways that make sense for typical users throughout the world. This
  31 is a profile of the stringprep protocol developed in the IDN working group.
  32
  33 1. Introduction
  34
   35 This document specifies processing rules that will allow users to enter
   36 Kerberos principal names and the input to cryptographic string-to-key
   37 functions. It is a profile of stringprep [STRINGPREP].
  38
   39 This profile defines the following, as required by [STRINGPREP]:
  40
   41 - The intended applicability of the profile: Kerberos principal names
   42 and the input to cryptographic string-to-key functions.
  43
   44 - The character repertoire that is the input and output to stringprep:
   45 defined in Section 2.
  46
  47 - The list of unassigned code points for the repertoire: defined
  48 in Appendix F.
  49
  50 - The mappings used: defined in Section 3.
  51
   52 - The Unicode normalization used: defined in Section 4.
   53
   54 - The characters that are prohibited as output: defined in Section 5.
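
As a rough illustration of how the pieces listed above fit together, the
following Python sketch applies them in the order they are listed:
mapping, then normalization, then the prohibited-output check. The
function name `prepare` and the small excerpt tables are assumptions made
for this example only. A conforming implementation would use the full
tables in Appendices D and E, and Python's `unicodedata` module tracks a
newer Unicode version than 3.1.

```python
import unicodedata

# Excerpts only; the normative tables are Appendix D and Appendix E.
MAP_TO_NOTHING = {0x00AD, 0x200B}                   # Section 3.1 (excerpt)
MAP_TO_SPACE = {0x00A0, 0x3000}                     # Section 3.2 (excerpt)
PROHIBITED = [(0x0000, 0x001F), (0x200E, 0x200F)]   # Section 5 (excerpt)


def prepare(text: str) -> str:
    """Hypothetical helper: map, normalize (NFKC), then reject prohibited output."""
    mapped = []
    for ch in text:
        cp = ord(ch)
        if cp in MAP_TO_NOTHING:
            continue  # mapped to nothing
        mapped.append(" " if cp in MAP_TO_SPACE else ch)
    normalized = unicodedata.normalize("NFKC", "".join(mapped))
    for ch in normalized:
        if any(lo <= ord(ch) <= hi for lo, hi in PROHIBITED):
            raise ValueError(f"prohibited code point U+{ord(ch):04X}")
    return normalized


print(prepare("Al\u00adice\u00a0Smith"))  # prints "Alice Smith"
```

Rejecting the whole string on the first prohibited code point, rather
than silently dropping it, is a design choice of this sketch.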
  55
  56
   57 1.1 Terminology
  58
  59 The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
  60 "MAY" in this document are to be interpreted as described in RFC 2119
  61 [RFC2119].
  62
  63 Examples in this document use the notation for code points and names
  64 from the Unicode Standard [Unicode3.1] and ISO/IEC 10646 [ISO10646]. For
  65 example, the letter "a" may be represented as either "U+0061" or "LATIN
  66 SMALL LETTER A". In the lists of prohibited characters, the "U+" is left
  67 off to make the lists easier to read. The comments for character ranges
  68 are shown in square brackets (such as "[SYMBOLS]") and do not come from
  69 the standards.
  70
  71
  72 2. Character Repertoire
  73
  74 Unicode 3.1 [Unicode3.1] is the repertoire used in this profile.
  75 The reason Unicode 3.1 was chosen instead of a version of
  76 ISO/IEC 10646 is that ISO/IEC 10646 is expected to be updated soon after
  77 this document becomes an RFC. Unicode 3.1 has the exact repertoire that
  78 is expected in the next version of ISO/IEC 10646, and is therefore used
  79 here.
  80
  81
  82 3. Mapping
  83
  84 This profile specifies stringprep mapping using the mapping table
  85 in Appendix D. That table includes all the steps described in this
  86 section.
  87
   88 Note that the text in this section describes how Appendix D was formed.
   89 It is there for people who want to understand more, but it should be
   90 ignored by implementors. Implementations of this profile MUST map based
   91 on Appendix D, not based on the descriptions in this section of how
   92 Appendix D was created.
  93
  94 3.1 Mapped out
  95
  96 The following characters are simply deleted from the input (that is,
  97 they are mapped to nothing) because their presence or absence should not
  98 make two strings different.
  99
 100 Some characters are only useful in line-based text, and are otherwise
 101 invisible and ignored.
 102
 103 00AD; SOFT HYPHEN
 104 1806; MONGOLIAN TODO SOFT HYPHEN
 105 200B; ZERO WIDTH SPACE
 106 FEFF; ZERO WIDTH NO-BREAK SPACE
 107
 108 Variation selectors and cursive connectors select different glyphs, but
 109 do not bear semantics.
 110
 111 180B; MONGOLIAN FREE VARIATION SELECTOR ONE
 112 180C; MONGOLIAN FREE VARIATION SELECTOR TWO
 113 180D; MONGOLIAN FREE VARIATION SELECTOR THREE
 114 200C; ZERO WIDTH NON-JOINER
 115 200D; ZERO WIDTH JOINER
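
The "mapped to nothing" step above can be sketched in Python as a simple
filter. The name `map_to_nothing` is hypothetical, and the set below is
copied from the two lists in this section rather than from Appendix D,
which remains the normative source.

```python
# Code points from Section 3.1 that are mapped to nothing.
MAPPED_OUT = frozenset({
    0x00AD, 0x1806, 0x200B, 0xFEFF,          # line-based text helpers
    0x180B, 0x180C, 0x180D, 0x200C, 0x200D,  # variation selectors, joiners
})


def map_to_nothing(text: str) -> str:
    """Delete every character listed in Section 3.1 (illustrative helper)."""
    return "".join(ch for ch in text if ord(ch) not in MAPPED_OUT)


print(map_to_nothing("ad\u00admin\u200b"))  # prints "admin"
```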
 116
 117 3.2 Space Character Conversions
 118
 119 The following Unicode spaces are to be mapped to 0020; SPACE:
 120
 121 00A0; NO-BREAK SPACE
 122 2000; EN QUAD
 123 2001; EM QUAD
 124 2002; EN SPACE
 125 2003; EM SPACE
 126 2004; THREE-PER-EM SPACE
 127 2005; FOUR-PER-EM SPACE
 128 2006; SIX-PER-EM SPACE
 129 2007; FIGURE SPACE
 130 2008; PUNCTUATION SPACE
 131 2009; THIN SPACE
 132 200A; HAIR SPACE
 133 202F; NARROW NO-BREAK SPACE
 134 3000; IDEOGRAPHIC SPACE
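
A minimal Python sketch of the space conversion, assuming the list above
is complete: each listed code point is replaced by U+0020 using
`str.translate`. The names `SPACE_TABLE` and `map_spaces` are
illustrative and do not come from this profile.

```python
# Space-like code points from Section 3.2; 2000-200A is a contiguous run.
SPACE_LIKE = [0x00A0, *range(0x2000, 0x200B), 0x202F, 0x3000]

# str.translate accepts a dict from code point to replacement string.
SPACE_TABLE = {cp: " " for cp in SPACE_LIKE}


def map_spaces(text: str) -> str:
    """Map every Section 3.2 space character to U+0020 SPACE."""
    return text.translate(SPACE_TABLE)


print(map_spaces("a\u00a0b\u3000c"))  # prints "a b c"
```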
 135
 136 4. Normalization
 137
 138 This profile specifies using Unicode normalization form KC, as described
 139 in [UAX15].
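
For readers unfamiliar with form KC, the following Python sketch shows
its two visible effects: compatibility characters are replaced by their
plain equivalents, and combining sequences are composed. It relies on the
standard `unicodedata` module, whose data is newer than Unicode 3.1, so
it should be read as an approximation of the behavior specified here.

```python
import unicodedata


def normalize_kc(text: str) -> str:
    """Apply Unicode normalization form KC (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# Compatibility mapping: U+FB01 LATIN SMALL LIGATURE FI becomes "fi".
print(normalize_kc("\ufb01"))
# Composition: "e" followed by U+0301 COMBINING ACUTE ACCENT becomes U+00E9.
print(normalize_kc("e\u0301") == "\u00e9")  # prints True
```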
 140
 141 NOTE: There was some discussion on the mailing list that would suggest
 142 that Unicode NFKC does not properly handle the composition of
 143 normalized Hangul strings.  Following the lead of the IDN working
  144 group, the Kerberos working group will not attempt to second-guess
  145 the authors of Unicode 3.1 Annex 15 (formerly Technical Report 15)
  146 [UAX15], which specifies the normalization methods, or the Ideographic
  147 Rapporteur Group (IRG), which is the formal subgroup of ISO/IEC
 148 JTC1/SC2/WG2 charged with approving all CJKV elements of the Unicode
 149 standards.  Such issues are outside the working group's charter and
 150 its area of expertise.
 151
 152
 153 5. Prohibited Output
 154
 155 This profile specifies using the prohibition table in Appendix E.
 156
  157 Note that the subsections below describe how Appendix E was formed. They
  158 are there for people who want to understand more, but they should be
  159 ignored by implementors. Implementations of this profile MUST prohibit
  160 code points based on Appendix E, not based on the descriptions in this
  161 section of how Appendix E was created.
 162
 163 The collected lists of prohibited code points can be found in Appendix E
 164 of this document. The lists in Appendix E MUST be used by implementations
  165 of this specification. If there are any discrepancies between the lists
  166 in Appendix E and the subsections below, the lists in Appendix E always
  167 take precedence.
 168
 169 Some code points listed in one section would also appear in other
 170 sections. Each code point is only listed once in the tables in Appendix
 171 E.
 172
 173
 174 5.1 Control characters
 175
 176 Control characters (or characters with control function) cannot be seen
 177 and can cause unpredictable results when displayed.
 178
 179 0000-001F; [CONTROL CHARACTERS]
 180 007F; DELETE
 181 0080-009F; [CONTROL CHARACTERS]
 182 070F; SYRIAC ABBREVIATION MARK
 183 180E; MONGOLIAN VOWEL SEPARATOR
 184 2028; LINE SEPARATOR
 185 2029; PARAGRAPH SEPARATOR
 186 206A-206F; [CONTROL CHARACTERS]
 187 FFF9-FFFC; [CONTROL CHARACTERS]
 188 1D173-1D17A; [MUSICAL CONTROL CHARACTERS]
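
Because the entries above mix single code points and ranges, a prohibition
check is naturally written as a range lookup. The Python sketch below
covers only the control characters of this subsection; `in_ranges` and
`first_control` are hypothetical helper names, and Appendix E remains the
normative list.

```python
# (first, last) pairs, inclusive, copied from Section 5.1.
CONTROL_RANGES = [
    (0x0000, 0x001F), (0x007F, 0x007F), (0x0080, 0x009F),
    (0x070F, 0x070F), (0x180E, 0x180E), (0x2028, 0x2029),
    (0x206A, 0x206F), (0xFFF9, 0xFFFC), (0x1D173, 0x1D17A),
]


def in_ranges(cp: int, ranges) -> bool:
    """True if code point `cp` falls inside any inclusive range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def first_control(text: str):
    """Return the index of the first Section 5.1 character, or None."""
    for index, ch in enumerate(text):
        if in_ranges(ord(ch), CONTROL_RANGES):
            return index
    return None


print(first_control("user\x07name"))  # prints 4
```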
 189
 190 5.2 Private use and replacement characters
 191
 192 Because private-use characters do not have defined meanings, they are
 193 prohibited. The private-use characters are:
 194
 195 E000-F8FF; [PRIVATE USE, PLANE 0]
 196 F0000-FFFFD; [PRIVATE USE, PLANE 15]
 197 100000-10FFFD; [PRIVATE USE, PLANE 16]
 198
 199 The replacement character (U+FFFD) has no known semantic definition in a
 200 name, and is often displayed by renderers to indicate "there would be
 201 some character here, but it cannot be rendered". For example, on a
 202 computer with no Asian fonts, a name with three ideographs might be
 203 rendered with three replacement characters.
 204
 205 FFFD; REPLACEMENT CHARACTER
 206
 207 5.3 Non-character code points
 208
 209 Non-character code points are code points that have been allocated in
 210 ISO/IEC 10646 but are not characters. Because they are already assigned,
 211 they are guaranteed not to later change into characters.
 212
 213 FDD0-FDEF; [NONCHARACTER CODE POINTS]
 214 FFFE-FFFF; [NONCHARACTER CODE POINTS]
 215 1FFFE-1FFFF; [NONCHARACTER CODE POINTS]
 216 2FFFE-2FFFF; [NONCHARACTER CODE POINTS]
 217 3FFFE-3FFFF; [NONCHARACTER CODE POINTS]
 218 4FFFE-4FFFF; [NONCHARACTER CODE POINTS]
 219 5FFFE-5FFFF; [NONCHARACTER CODE POINTS]
 220 6FFFE-6FFFF; [NONCHARACTER CODE POINTS]
 221 7FFFE-7FFFF; [NONCHARACTER CODE POINTS]
 222 8FFFE-8FFFF; [NONCHARACTER CODE POINTS]
 223 9FFFE-9FFFF; [NONCHARACTER CODE POINTS]
 224 AFFFE-AFFFF; [NONCHARACTER CODE POINTS]
 225 BFFFE-BFFFF; [NONCHARACTER CODE POINTS]
 226 CFFFE-CFFFF; [NONCHARACTER CODE POINTS]
 227 DFFFE-DFFFF; [NONCHARACTER CODE POINTS]
 228 EFFFE-EFFFF; [NONCHARACTER CODE POINTS]
 229 FFFFE-FFFFF; [NONCHARACTER CODE POINTS]
 230 10FFFE-10FFFF; [NONCHARACTER CODE POINTS]
 231
  232 The non-character code points are listed in the PropList.txt file from
  233 the Unicode database.
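
The list above follows a regular pattern: the block FDD0-FDEF, plus the
last two code points of each of the 17 planes. A short Python sketch can
regenerate it, which is a convenient cross-check when filling in
Appendix E. The function name `noncharacter_ranges` is illustrative.

```python
def noncharacter_ranges():
    """Rebuild the Section 5.3 list as inclusive (first, last) pairs."""
    ranges = [(0xFDD0, 0xFDEF)]
    for plane in range(0x11):          # planes 0 through 16
        base = plane << 16             # first code point of the plane
        ranges.append((base | 0xFFFE, base | 0xFFFF))
    return ranges


for first, last in noncharacter_ranges():
    print(f"{first:04X}-{last:04X}; [NONCHARACTER CODE POINTS]")
```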
 234
 235 5.4 Surrogate codes
 236
 237 The following code points are permanently reserved for use as surrogate
 238 code values in the UTF-16 encoding, will never be assigned to
 239 characters, and are therefore prohibited:
 240
 241 D800-DFFF; [SURROGATE CODES]
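
In Python, a string can hold an unpaired surrogate code point, but
encoding it as UTF-8 fails. The sketch below makes the check explicit
before encoding; `has_surrogate` and `to_utf8` are hypothetical names
used only for this illustration.

```python
def has_surrogate(text: str) -> bool:
    """True if any code point lies in the surrogate range D800-DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def to_utf8(text: str) -> bytes:
    """Encode as UTF-8, rejecting surrogate code points up front."""
    if has_surrogate(text):
        raise ValueError("surrogate code point in input")
    return text.encode("utf-8")


print(has_surrogate("a\ud800"))      # prints True
print(has_surrogate("a\U0001d11e"))  # prints False (one supplementary code point)
```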
 242
 243 5.5 Inappropriate for plain text
 244
 245 The following characters should not appear in regular text.
 246
 247 FFF9; INTERLINEAR ANNOTATION ANCHOR
 248 FFFA; INTERLINEAR ANNOTATION SEPARATOR
 249 FFFB; INTERLINEAR ANNOTATION TERMINATOR
 250 FFFC; OBJECT REPLACEMENT CHARACTER
 251
 252 5.6 Inappropriate for canonical representation
 253
  254 The ideographic description characters allow different sequences of
  255 characters to be rendered the same way, which makes them inappropriate
  256 for names that must have a single canonical representation.
 257
 258 2FF0-2FFB; [IDEOGRAPHIC DESCRIPTION CHARACTERS]
 259
 260 5.7 Change display properties
 261
 262 The following characters, some of which are deprecated in ISO/IEC 10646,
 263 can cause changes in display or the order in which characters appear
 264 when rendered.
 265
 266 200E; LEFT-TO-RIGHT MARK
 267 200F; RIGHT-TO-LEFT MARK
 268 202A; LEFT-TO-RIGHT EMBEDDING
 269 202B; RIGHT-TO-LEFT EMBEDDING
 270 202C; POP DIRECTIONAL FORMATTING
 271 202D; LEFT-TO-RIGHT OVERRIDE
 272 202E; RIGHT-TO-LEFT OVERRIDE
 273 206A; INHIBIT SYMMETRIC SWAPPING
 274 206B; ACTIVATE SYMMETRIC SWAPPING
 275 206C; INHIBIT ARABIC FORM SHAPING
 276 206D; ACTIVATE ARABIC FORM SHAPING
 277 206E; NATIONAL DIGIT SHAPES
 278 206F; NOMINAL DIGIT SHAPES
 279
 280 5.8 Tagging characters
 281
 282 The following characters are used for tagging text and are invisible.
 283
 284 E0001; LANGUAGE TAG
 285 E0020-E007F; [TAGGING CHARACTERS]
 286
 287
 288 6. Unassigned Code Points in Internationalized Host Names
 289
 290 This profile lists the unassigned code points for Unicode 3.1 in
 291 Appendix F. The list in Appendix F MUST be used by implementations of
  292 this specification. If there are any discrepancies between the list in
  293 Appendix F and the Unicode 3.1 specification, the list in Appendix F
  294 always takes precedence.
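
One reason to pin the unassigned list in this document is that a library
lookup gives different answers depending on the Unicode version it was
built with. The Python sketch below illustrates this with the standard
`unicodedata` module, which exposes both its current database and a
frozen Unicode 3.2.0 database (`ucd_3_2_0`). Neither is Unicode 3.1, so
this is an illustration of the version dependence, not a substitute for
Appendix F. The helper name `is_unassigned` is hypothetical.

```python
import unicodedata


def is_unassigned(ch: str, db=unicodedata.ucd_3_2_0) -> bool:
    """True if `db` reports general category Cn for `ch`.

    Cn also covers non-character code points, so this is broader than
    "not yet assigned".
    """
    return db.category(ch) == "Cn"


ch = "\u0221"  # assigned in a Unicode version later than 3.2
print(is_unassigned(ch))                  # Unicode 3.2.0 view
print(is_unassigned(ch, db=unicodedata))  # view of the running interpreter
```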
 295
 296
 297 7. Security Considerations
 298
 299 ISO/IEC 10646 has many characters that look similar. In many cases,
 300 users of security protocols might do visual matching, such as when
 301 comparing the names of trusted third parties. This profile does nothing
 302 to map similar-looking characters together.
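
A small Python example of the point above: normalization form KC does not
merge characters that merely look alike. The two strings below render
almost identically in many fonts but remain distinct after
normalization. The sample name "admin" is chosen only for illustration.

```python
import unicodedata

latin = "admin"
mixed = "\u0430dmin"  # first letter is U+0430 CYRILLIC SMALL LETTER A

same_after_nfkc = (
    unicodedata.normalize("NFKC", latin)
    == unicodedata.normalize("NFKC", mixed)
)
print(same_after_nfkc)  # prints False
```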
 303
  304 Principal names and passwords are entered by users and used within the
  305 Kerberos protocol. The security of the Internet would be compromised
  306 if a user entering a single internationalized string could be
  307 connected to different servers or denied access based on different
  308 interpretations of internationalized strings.
 310
 311 8. References
 312
  313 [CharModel] Unicode Technical Report #17, Character Encoding Model.
 314 <http://www.unicode.org/unicode/reports/tr17/>.
 315
 316 [Glossary] Unicode Glossary, <http://www.unicode.org/glossary/>.
 317
 318 [ISO10646] ISO/IEC 10646-1:2000. International Standard -- Information
 319 technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
 320 1: Architecture and Basic Multilingual Plane.
 321
 322 [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
 323 Requirement Levels", March 1997, RFC 2119.
 324
 325 [STRINGPREP] Paul Hoffman and Marc Blanchet, "Preparation of
 326 Internationalized Strings ("stringprep")", draft-hoffman-stringprep,
  327 work in progress.
 328
 329 [Unicode3.1] The Unicode Standard, Version 3.1.0: The Unicode
 330 Consortium. The Unicode Standard, Version 3.0. Reading, MA,
 331 Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5, as amended
 332 by: Unicode Standard Annex #27: Unicode 3.1
 333 <http://www.unicode.org/unicode/reports/tr27/tr27-4.html>.
 334
 335 [UAX15] Mark Davis and Martin Duerst. Unicode Standard Annex #15:
 336 Unicode Normalization Forms, Version 3.1.0.
 337 <http://www.unicode.org/unicode/reports/tr15/tr15-21.html>
 338
 339
 340 A. Acknowledgements
 341
 342 This draft is based upon the work of the IETF IDN Working Group's
 343 IDN Nameprep design team.
 344
 345 B. IANA Considerations
 346
 347 This is a profile of stringprep. When it becomes an RFC, it
 348 should be registered in the stringprep profile registry.
 349
 350 C. Author Contact Information
 351
 352 Jeffrey Altman
 353 jaltman@columbia.edu
 354 Columbia University
 355 612 West 115th Street
 356 New York NY 10025
 357
 358
 359 D. Mapping Tables
 360
 361 The following is the mapping table from Section 3. The table has three
 362 columns:
 363 - the character that is mapped from
 364 - the zero or more characters that it is mapped to
 365 - the reason for the mapping
 366 The columns are separated by semicolons. Note that the second column may
 367 be empty, or it may have one character, or it may have more than one
 368 character, with each character separated by a space.
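
The three-column format described above can be read with a few lines of
Python. Since the table itself is not yet filled in, the two sample rows
below are hypothetical and only follow the column layout described here;
`parse_mapping_line` is likewise an illustrative name.

```python
def parse_mapping_line(line: str):
    """Split one table row into (source, [targets], reason)."""
    # Split on the first two semicolons only, so the reason may contain ";".
    source, targets, reason = (field.strip() for field in line.split(";", 2))
    return int(source, 16), [int(t, 16) for t in targets.split()], reason


# Hypothetical rows in the described layout (not taken from Appendix D).
print(parse_mapping_line("00AD; ; Map to nothing"))
print(parse_mapping_line("00A0; 0020; Map to space"))
```

An empty second column yields an empty target list, which corresponds to
a character that is deleted from the input.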
 369
 370 ----- Start Mapping Table -----
 371 ... to be filled in ...
 372 ----- End Mapping Table -----
 373
 374
 375 E. Prohibited Code Point List
 376
 377 ----- Start Prohibited Table -----
 378 ... to be filled in ...
 379 ----- End Prohibited Table -----
 380
 381 NOTE WELL: Software that follows this specification that will be used to
 382 check names before they are put in authoritative name servers MUST add
  383 all unassigned code points to the list of characters that are prohibited.
 384 See Section 6 of [STRINGPREP] for more details.
 385
 386
 387 F. Unassigned Code Point List
 388
 389 ----- Start Unassigned Table -----
 390 ... to be filled in ...
 391 ----- End Unassigned Table -----
 392
 393
 394
 395