draft-ietf-idn-idna-14.txt

   1 Internet Draft                                     Patrik Faltstrom
   2 draft-ietf-idn-idna-14.txt                                    Cisco
   3 October 24, 2002                                       Paul Hoffman
   4 Expires in six months                                    IMC & VPNC
   5                                                    Adam M. Costello
   6                                                         UC Berkeley
   7
   8        Internationalizing Domain Names in Applications (IDNA)
   9
  10 Status of this Memo
  11
  12 This document is an Internet-Draft and is in full conformance with all
  13 provisions of Section 10 of RFC2026.
  14
  15 Internet-Drafts are working documents of the Internet Engineering Task
  16 Force (IETF), its areas, and its working groups. Note that other groups
  17 may also distribute working documents as Internet-Drafts.
  18
  19 Internet-Drafts are draft documents valid for a maximum of six months
  20 and may be updated, replaced, or obsoleted by other documents at any
  21 time. It is inappropriate to use Internet-Drafts as reference material
  22 or to cite them other than as "work in progress."
  23
  24 The list of current Internet-Drafts can be accessed at
  25 http://www.ietf.org/ietf/1id-abstracts.txt
  26
  27 The list of Internet-Draft Shadow Directories can be accessed at
  28 http://www.ietf.org/shadow.html.
  29
  30
  31 Abstract
  32
  33 Until now, there has been no standard method for domain names to use
  34 characters outside the ASCII repertoire. This document defines
  35 internationalized domain names (IDNs) and a mechanism called IDNA for
  36 handling them in a standard fashion. IDNs use characters drawn from a
  37 large repertoire (Unicode), but IDNA allows the non-ASCII characters to
  38 be represented using only the ASCII characters already allowed in
  39 so-called host names today.  This backward-compatible representation is
  40 required in existing protocols like DNS, so that IDNs can be introduced
  41 with no changes to the existing infrastructure. IDNA is only meant for
  42 processing domain names, not free text.
  43
  44
  45 1. Introduction
  46
  47 IDNA works by allowing applications to use certain ASCII name labels
  48 (beginning with a special prefix) to represent non-ASCII name labels.
  49 Lower-layer protocols need not be aware of this; therefore IDNA does not
  50 depend on changes to any infrastructure. In particular, IDNA does not
  51 depend on any changes to DNS servers, resolvers, or protocol elements,
  52 because the ASCII name service provided by the existing DNS is entirely
  53 sufficient for IDNA.
  54
  55 This document does not require any applications to conform to IDNA, but
  56 applications can elect to use IDNA in order to support IDN while
  57 maintaining interoperability with existing infrastructure. If an
  58 application wants to use non-ASCII characters in domain names, IDNA is
  59 the only currently-defined option. Adding IDNA support to an existing
  60 application entails changes to the application only, and leaves room for
  61 flexibility in the user interface.
  62
  63 A great deal of the discussion of IDN solutions has focused on
  64 transition issues and how IDN will work in a world where not all of the
  65 components have been updated. Proposals that were not chosen by the IDN
  66 Working Group would depend on user applications, resolvers, and DNS
  67 servers being updated in order for a user to use an internationalized
  68 domain name. Rather than rely on widespread updating of all components,
  69 IDNA depends on updates to user applications only; no changes are needed
   70 to the DNS protocol or any DNS servers or the resolvers on users'
  71 computers.
  72
  73 1.1 Problem Statement
  74
  75 The IDNA specification solves the problem of extending the repertoire of
  76 characters that can be used in domain names to include the Unicode
  77 repertoire (with some restrictions).
  78
  79 IDNA does not extend the service offered by DNS to the applications.
  80 Instead, the applications (and, by implication, the users) continue to
  81 see an exact-match lookup service. Either there is a single
  82 exactly-matching name or there is no match. This model has served the
  83 existing applications well, but it requires, with or without
  84 internationalized domain names, that users know the exact spelling of
  85 the domain names that the users type into applications such as web
  86 browsers and mail user agents. The introduction of the larger repertoire
  87 of characters potentially makes the set of misspellings larger,
  88 especially given that in some cases the same appearance, for example on
  89 a business card, might visually match several Unicode code points or
  90 several sequences of code points.
  91
  92 IDNA allows the graceful introduction of IDNs not only by avoiding
  93 upgrades to existing infrastructure (such as DNS servers and mail
  94 transport agents), but also by allowing some rudimentary use of IDNs in
  95 applications by using the ASCII representation of the non-ASCII name
  96 labels. While such names are very user-unfriendly to read and type, and
  97 hence are not suitable for user input, they allow (for instance)
  98 replying to email and clicking on URLs even though the domain name
  99 displayed is incomprehensible to the user. In order to allow
 100 user-friendly input and output of the IDNs, the applications need to be
 101 modified to conform to this specification.
 102
 103 IDNA uses the Unicode character repertoire, which avoids the significant
 104 delays that would be inherent in waiting for a different and specific
  105 character set to be defined for IDN purposes by some other standards
 106 developing organization.
 107
 108 1.2 Limitations of IDNA
 109
 110 The IDNA protocol does not solve all linguistic issues with users
 111 inputting names in different scripts. Many important language-based and
 112 script-based mappings are not covered in IDNA and need to be handled
 113 outside the protocol. For example, names that are entered in a mix of
 114 traditional and simplified Chinese characters will not be mapped to a
  115 single canonical name. As another example, Scandinavian names that are
 116 entered with U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS) will not be
 117 mapped to U+00F8 (LATIN SMALL LETTER O WITH STROKE).
 118
 119 An example of an important issue that is not considered in detail in
 120 IDNA is how to provide a high probability that a user who is entering a
 121 domain name based on visual information (such as from a business card or
 122 billboard) or aural information (such as from a telephone or radio)
 123 would correctly enter the IDN. Similar issues exist for ASCII domain
 124 names, for example the possible visual confusion between the letter 'O'
 125 and the digit zero, but the introduction of the larger repertoire of
  126 characters creates more opportunities for similar looking and similar
 127 sounding names. Note that this is a complex issue relating to languages,
 128 input methods on computers, and so on.  Furthermore, the kind of
 129 matching and searching necessary for a high probability of success would
 130 not fit the role of the DNS and its exact matching function.
 131
 132 1.3 Brief overview for application developers
 133
 134 Applications can use IDNA to support internationalized domain names
 135 anywhere that ASCII domain names are already supported, including DNS
 136 master files and resolver interfaces. (Applications can also define
 137 protocols and interfaces that support IDNs directly using non-ASCII
 138 representations. IDNA does not prescribe any particular representation
 139 for new protocols, but it still defines which names are valid and how
 140 they are compared.)
 141
 142 The IDNA protocol is contained completely within applications. It is not
 143 a client-server or peer-to-peer protocol: everything is done inside the
 144 application itself. When used with a DNS resolver library, IDNA is
 145 inserted as a "shim" between the application and the resolver library.
 146 When used for writing names into a DNS zone, IDNA is used just before
 147 the name is committed to the zone.
 148
 149 There are two operations described in section 4 of this document:
 150
 151 - The ToASCII operation is used before sending an IDN to something that
 152 expects ASCII names (such as a resolver) or writing an IDN into a place
 153 that expects ASCII names (such as a DNS master file).
 154
 155 - The ToUnicode operation is used when displaying names to users, for
 156 example names obtained from a DNS zone.
 157
 158 It is important to note that the ToASCII operation can fail. If it fails
 159 when processing a domain name, that domain name cannot be used as an
 160 internationalized domain name and the application has to have some
 161 method of dealing with this failure.
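
As an illustration only, the shim arrangement and the two operations
can be sketched in Python. The sketch relies on Python's standard
"idna" codec, which implements the later published form of this
mechanism and therefore uses the ACE prefix "xn--" rather than the
placeholder prefix in this draft. The function names lookup() and
display(), and the resolver callable passed to lookup(), are
hypothetical and are not defined by this document.

```python
def lookup(name, resolve):
    """Shim between the application and a resolver.

    `resolve` is a hypothetical callable standing in for a resolver
    library that expects ASCII names only.
    """
    try:
        # ToASCII is applied to every label before the name is handed on.
        ascii_name = name.encode("idna").decode("ascii")
    except UnicodeError:
        # ToASCII failed: the name cannot be used as an IDN. How to react
        # is left to the application; this sketch simply returns None.
        return None
    return resolve(ascii_name)


def display(ascii_name):
    """Apply ToUnicode to each label before showing a name to a user."""
    return ascii_name.encode("ascii").decode("idna")


if __name__ == "__main__":
    echo = lambda n: n  # stand-in resolver that returns what it was given
    print(lookup("b\u00fccher.example", echo))
    print(display("xn--bcher-kva.example"))
```

The resolver is passed in as a parameter so that the conversion stays
entirely inside the application, which is the point of the shim.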
 162
 163 IDNA requires that implementations process input strings with Nameprep
 164 [NAMEPREP], which is a profile of Stringprep [STRINGPREP], and then with
 165 Punycode [PUNYCODE]. Implementations of IDNA MUST fully implement
  166 Nameprep and Punycode; neither Nameprep nor Punycode is optional.
 167
 168
  169 2. Terminology
 170
 171 The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
 172 "MAY" in this document are to be interpreted as described in RFC 2119
 173 [RFC2119].
 174
 175 A code point is an integer value associated with a character in a coded
 176 character set.
 177
 178 Unicode [UNICODE] is a coded character set containing tens of thousands
 179 of characters. A single Unicode code point is denoted by "U+" followed
 180 by four to six hexadecimal digits, while a range of Unicode code points
 181 is denoted by two hexadecimal numbers separated by "..", with no
 182 prefixes.
 183
 184 ASCII means US-ASCII [USASCII], a coded character set containing 128
 185 characters associated with code points in the range 0..7F. Unicode is an
 186 extension of ASCII: it includes all the ASCII characters and associates
 187 them with the same code points.
 188
 189 The term "LDH code points" is defined in this document to mean the code
 190 points associated with ASCII letters, digits, and the hyphen-minus; that
 191 is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an abbreviation for
 192 "letters, digits, hyphen".
 193
 194 [STD13] talks about "domain names" and "host names", but many people use
 195 the terms interchangeably. Further, because [STD13] was not terribly
 196 clear, many people who are sure they know the exact definitions of each
 197 of these terms disagree on the definitions. In this document the term
 198 "domain name" is used in general. This document explicitly cites [STD3]
 199 whenever referring to the host name syntax restrictions defined therein.
 200
 201 A label is an individual part of a domain name. Labels are usually shown
 202 separated by dots; for example, the domain name "www.example.com" is
 203 composed of three labels: "www", "example", and "com". (The zero-length
 204 root label described in [STD13], which can be explicit as in
 205 "www.example.com." or implicit as in "www.example.com", is not
 206 considered a label in this specification.) IDNA extends the set of
 207 usable characters in labels that are text. For the rest of this
 208 document, the term "label" is shorthand for "text label", and "every
 209 label" means "every text label".
 210
 211 An "internationalized label" is a label to which the ToASCII operation
 212 (see section 4) can be applied without failing (with the
  213 UseSTD3ASCIIRules flag unset). This implies that every ASCII label that
 214 satisfies the [STD13] length restriction is an internationalized label.
 215 Therefore the term "internationalized label" is a generalization,
 216 embracing both old ASCII labels and new non-ASCII labels. Although most
 217 Unicode characters can appear in internationalized labels, ToASCII will
 218 fail for some input strings, and such strings are not valid
 219 internationalized labels.
 220
 221 An "internationalized domain name" (IDN) is a domain name in which every
 222 label is an internationalized label. This implies that every ASCII
 223 domain name is an IDN (which implies that it is possible for a name to
 224 be an IDN without it containing any non-ASCII characters). This document
 225 does not attempt to define an "internationalized host name". Just as has
 226 been the case with ASCII names, some DNS zone administrators may impose
 227 restrictions, beyond those imposed by DNS or IDNA, on the characters or
 228 strings that may be registered as labels in their zones. Such
 229 restrictions have no impact on the syntax or semantics of DNS protocol
 230 messages; a query for a name that matches no records will yield the same
 231 response regardless of the reason why it is not in the zone. Clients
 232 issuing queries or interpreting responses cannot be assumed to have any
 233 knowledge of zone-specific restrictions or conventions.
 234
 235 In IDNA, equivalence of labels is defined in terms of the ToASCII
 236 operation, which constructs an ASCII form for a given label, whether or
 237 not the label was already an ASCII label. Labels are defined to be
 238 equivalent if and only if their ASCII forms produced by ToASCII match
 239 using a case-insensitive ASCII comparison. ASCII labels already have a
 240 notion of equivalence: upper case and lower case are considered
 241 equivalent. The IDNA notion of equivalence is an extension of that older
 242 notion. Equivalent labels in IDNA are treated as alternate forms of the
 243 same label, just as "foo" and "Foo" are treated as alternate forms of
 244 the same label.
 245
 246 To allow internationalized labels to be handled by existing
 247 applications, IDNA uses an "ACE label" (ACE stands for ASCII Compatible
 248 Encoding). An ACE label is an internationalized label that can be
 249 rendered in ASCII and is equivalent to an internationalized label that
 250 cannot be rendered in ASCII. Given any internationalized label that
 251 cannot be rendered in ASCII, the ToASCII operation will convert it to an
 252 equivalent ACE label (whereas an ASCII label will be left unaltered by
 253 ToASCII). ACE labels are unsuitable for display to users.  The ToUnicode
 254 operation will convert any label to an equivalent non-ACE label. In
 255 fact, an ACE label is formally defined to be any label that the
 256 ToUnicode operation would alter (whereas non-ACE labels are left
 257 unaltered by ToUnicode). Every ACE label begins with the ACE prefix
 258 specified in section 5. The ToASCII and ToUnicode operations are
 259 specified in section 4.
 260
 261 The "ACE prefix" is defined in this document to be a string of ASCII
 262 characters that appears at the beginning of every ACE label. It is
 263 specified in section 5.
 264
 265 A "domain name slot" is defined in this document to be a protocol
 266 element or a function argument or a return value (and so on) explicitly
 267 designated for carrying a domain name. Examples of domain name slots
 268 include: the QNAME field of a DNS query; the name argument of the
 269 gethostbyname() library function; the part of an email address following
 270 the at-sign (@) in the From: field of an email message header; and the
 271 host portion of the URI in the src attribute of an HTML <IMG> tag.
 272 General text that just happens to contain a domain name is not a domain
 273 name slot; for example, a domain name appearing in the plain text body
 274 of an email message is not occupying a domain name slot.
 275
 276 An "IDN-aware domain name slot" is defined in this document to be a
 277 domain name slot explicitly designated for carrying an internationalized
 278 domain name as defined in this document. The designation may be static
 279 (for example, in the specification of the protocol or interface) or
 280 dynamic (for example, as a result of negotiation in an interactive
 281 session).
 282
 283 An "IDN-unaware domain name slot" is defined in this document to be any
 284 domain name slot that is not an IDN-aware domain name slot. Obviously,
 285 this includes any domain name slot whose specification predates IDNA.
 286
 287
 288 3. Requirements and applicability
 289
 290 3.1 Requirements
 291
 292 IDNA conformance means adherence to the following four requirements:
 293
 294 1) Whenever dots are used as label separators, the following characters
 295 MUST be recognized as dots: U+002E (full stop), U+3002 (ideographic full
 296 stop), U+FF0E (fullwidth full stop), U+FF61 (halfwidth ideographic full
 297 stop).
 298
 299 2) Whenever a domain name is put into an IDN-unaware domain name slot
 300 (see section 2), it MUST contain only ASCII characters.
 301 Given an internationalized domain name (IDN), an equivalent domain name
 302 satisfying this requirement can be obtained by applying the ToASCII
 303 operation (see section 4) to each label and, if dots are
 304 used as label separators, changing all the label separators to U+002E.
 305
 306 3) ACE labels obtained from domain name slots SHOULD be hidden from
 307 users when it is known that the environment can handle the non-ACE form,
 308 except when the ACE form is explicitly requested. When it is not known
 309 whether or not the environment can handle the non-ACE form, the
 310 application MAY use the non-ACE form (which might fail, such as by not
 311 being displayed properly), or it MAY use the ACE form (which will look
  312 unintelligible to the user). Given an internationalized domain name, an
 313 equivalent domain name containing no ACE labels can be obtained by
 314 applying the ToUnicode operation (see section 4) to each label.  When
 315 requirements 2 and 3 both apply, requirement 2 takes precedence.
 316
 317 4) Whenever two labels are compared, they MUST be considered to match if
 318 and only if they are equivalent, that is, their ASCII forms (obtained by
 319 applying ToASCII) match using a case-insensitive ASCII comparison.
 320 Whenever two names are compared, they MUST be considered to match if and
 321 only if their corresponding labels match, regardless of whether the
 322 names use the same forms of label separators.
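
Requirements 1 and 4 can be illustrated with a short Python sketch.
It is not a normative implementation. The names split_labels(),
labels_match(), and names_match() are hypothetical, and ToASCII is
approximated by applying Python's standard "idna" codec (ACE prefix
"xn--") to one label at a time. Treating an explicit trailing root
label as absent is a choice made by this sketch.

```python
import re

# Requirement 1: the four code points that must be recognized as dots.
LABEL_SEPARATORS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def split_labels(name):
    """Split a name on any recognized dot, dropping an explicit root label."""
    labels = LABEL_SEPARATORS.split(name)
    if len(labels) > 1 and labels[-1] == "":
        labels.pop()
    return labels


def labels_match(a, b):
    """Requirement 4: compare ASCII forms case-insensitively.

    A ToASCII failure surfaces here as UnicodeError.
    """
    return a.encode("idna").lower() == b.encode("idna").lower()


def names_match(first, second):
    """Names match when they have the same number of matching labels."""
    left, right = split_labels(first), split_labels(second)
    return len(left) == len(right) and all(
        labels_match(a, b) for a, b in zip(left, right)
    )


if __name__ == "__main__":
    # Different separators and different label forms, same name.
    print(names_match("B\u00fccher.example", "xn--bcher-kva\u3002EXAMPLE"))
```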
 323
 324 3.2 Applicability
 325
 326 IDNA is applicable to all domain names in all domain name slots except
 327 where it is explicitly excluded.
 328
 329 This implies that IDNA is applicable to many protocols that predate
 330 IDNA.  Note that IDNs occupying domain name slots in those protocols
 331 MUST be in ASCII form (see section 3.1, requirement 2).
 332
 333 3.2.1. DNS resource records
 334
 335 IDNA does not apply to domain names in the NAME and RDATA fields of DNS
 336 resource records whose CLASS is not IN.  This exclusion applies to every
 337 non-IN class, present and future, except where future standards override
 338 this exclusion by explicitly inviting the use of IDNA.
 339
 340 There are currently no other exclusions on the applicability of IDNA to
 341 DNS resource records; it depends entirely on the CLASS, and not on the
 342 TYPE.  This will remain true, even as new types are defined, unless
 343 there is a compelling reason for a new type to complicate matters by
 344 imposing type-specific rules.
 345
 346 3.2.2. Non-domain-name data types stored in domain names
 347
 348 Although IDNA enables the representation of non-ASCII characters in
 349 domain names, that does not imply that IDNA enables the representation
 350 of non-ASCII characters in other data types that are stored in domain
 351 names.  For example, an email address local part is sometimes stored in
 352 a domain label (hostmaster@example.com would be represented as
 353 hostmaster.example.com in the RDATA field of an SOA record). IDNA does
 354 not update the existing email standards, which allow only ASCII
 355 characters in local parts.  Therefore, unless the email standards are
 356 revised to invite the use of IDNA for local parts, a domain label that
 357 holds the local part of an email address SHOULD NOT begin with the ACE
 358 prefix, and even if it does, it is to be interpreted literally as a
 359 local part that happens to begin with the ACE prefix.
 360
 361
 362 4. Conversion operations
 363
  364 An application converts a domain name when it is put into an IDN-unaware
  365 slot or displayed to a user. This section specifies the steps to perform
  366 in the conversion, and the ToASCII and ToUnicode operations.
 367
 368 The input to ToASCII or ToUnicode is a single label that is a sequence
 369 of Unicode code points (remember that all ASCII code points are also
 370 Unicode code points). If a domain name is represented using a character
 371 set other than Unicode or US-ASCII, it will first need to be transcoded
 372 to Unicode.
 373
 374 Starting from a whole domain name, the steps that an application takes
 375 to do the conversions are:
 376
 377 1) Decide whether the domain name is a "stored string" or a "query
 378 string" as described in [STRINGPREP]. If this conversion follows the
 379 "queries" rule from [STRINGPREP], set the flag called "AllowUnassigned".
 380
 381 2) Split the domain name into individual labels as described in section
 382 3.1. The labels do not include the separator.
 383
 384 3) For each label, decide whether or not to enforce the restrictions on
 385 ASCII characters in host names [STD3]. (Applications already faced this
 386 choice before the introduction of IDNA, and can continue to make the
 387 decision the same way they always have; IDNA makes no new
 388 recommendations regarding this choice.) If the restrictions are to be
 389 enforced, set the flag called "UseSTD3ASCIIRules" for that label.
 390
 391 4) Process each label with either the ToASCII or the ToUnicode operation
 392 as appropriate. Typically, you use the ToASCII operation if you are
 393 about to put the name into an IDN-unaware slot, and you use the
 394 ToUnicode operation if you are displaying the name to a user; section
 395 3.1 gives greater detail on the applicable requirements.
 396
 397 5) If ToASCII was applied in step 4 and dots are used as label
 398 separators, change all the label separators to U+002E (full stop).
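
The steps above, in the ToASCII direction, might be sketched in Python
as follows. This is a simplified illustration under stated
assumptions: domain_to_ascii() is a hypothetical name, the
AllowUnassigned and UseSTD3ASCIIRules flags of steps 1 and 3 are not
modelled, and step 4 is delegated to Python's standard "idna" codec
(ACE prefix "xn--"), applied to one label at a time.

```python
import re

# Step 2 uses the label separators listed in section 3.1.
LABEL_SEPARATORS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def domain_to_ascii(name):
    """Convert a whole domain name for use in an IDN-unaware slot."""
    # Step 2: split into labels; the separators are not part of the labels.
    labels = LABEL_SEPARATORS.split(name)
    has_root = len(labels) > 1 and labels[-1] == ""
    if has_root:
        labels.pop()  # an explicit root label is not converted

    ascii_labels = []
    for label in labels:
        if not label:
            # ToASCII requires at least one code point per label.
            raise UnicodeError("empty label")
        # Step 4: ToASCII on each label (raises UnicodeError on failure).
        ascii_labels.append(label.encode("idna").decode("ascii"))

    # Step 5: every label separator becomes U+002E (full stop).
    result = ".".join(ascii_labels)
    return result + "." if has_root else result


if __name__ == "__main__":
    print(domain_to_ascii("www\u3002b\u00fccher\uff0eexample"))
```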
 399
 400 The following two subsections define the ToASCII and ToUnicode
 401 operations that are used in step 4.
 402
 403 This description of the protocol uses specific procedure names, names of
 404 flags, and so on, in order to facilitate the specification of the
 405 protocol. These names, as well as the actual steps of the procedures,
 406 are not required of an implementation. In fact, any implementation which
 407 has the same external behavior as specified in this document conforms to
 408 this specification.
 409
 410 4.1 ToASCII
 411
 412 The ToASCII operation takes a sequence of Unicode code points that make
 413 up one label and transforms it into a sequence of code points in the
 414 ASCII range (0..7F). If ToASCII succeeds, the original sequence and the
 415 resulting sequence are equivalent labels.
 416
 417 It is important to note that the ToASCII operation can fail. ToASCII
 418 fails if any step of it fails. If any step of the ToASCII operation
 419 fails on any label in a domain name, that domain name MUST NOT be used
  420 as an internationalized domain name. The method for dealing with this
 421 failure is application-specific.
 422
 423 The inputs to ToASCII are a sequence of code points, the AllowUnassigned
 424 flag, and the UseSTD3ASCIIRules flag. The output of ToASCII is either a
 425 sequence of ASCII code points or a failure condition.
 426
 427 ToASCII never alters a sequence of code points that are all in the ASCII
 428 range to begin with (although it could fail). Applying the ToASCII
 429 operation multiple times has exactly the same effect as applying it just
 430 once.
 431
 432 ToASCII consists of the following steps:
 433
 434     1. If all code points in the sequence are in the ASCII range (0..7F)
 435        then skip to step 3.
 436
 437     2. Perform the steps specified in [NAMEPREP] and fail if there is
 438        an error. The AllowUnassigned flag is used in [NAMEPREP].
 439
 440     3. If the UseSTD3ASCIIRules flag is set, then perform these checks:
 441
 442          (a) Verify the absence of non-LDH ASCII code points; that is,
 443              the absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.
 444
 445          (b) Verify the absence of leading and trailing hyphen-minus;
 446              that is, the absence of U+002D at the beginning and end of
 447              the sequence.
 448
 449     4. If all code points in the sequence are in the ASCII range
 450        (0..7F), then skip to step 8.
 451
 452     5. Verify that the sequence does NOT begin with the ACE prefix.
 453
 454     6. Encode the sequence using the encoding algorithm in [PUNYCODE]
 455        and fail if there is an error.
 456
 457     7. Prepend the ACE prefix.
 458
 459     8. Verify that the number of code points is in the range 1 to 63
 460        inclusive.
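
The eight steps can be followed one by one in a Python sketch. This
is a non-authoritative illustration. It assumes that the nameprep()
helper in CPython's encodings.idna module is an acceptable stand-in
for [NAMEPREP] and that the standard "punycode" codec is an acceptable
stand-in for [PUNYCODE]. The AllowUnassigned flag is not modelled. The
prefix value "xn--" is an assumption chosen to agree with Python's own
"idna" codec; this draft leaves the actual value to IANA. The names
to_ascii(), is_ldh(), and ToASCIIError are hypothetical.

```python
from encodings.idna import nameprep  # CPython's Nameprep helper

ACE_PREFIX = "xn--"  # assumption: not the placeholder used in this draft


class ToASCIIError(ValueError):
    """Raised when any step of the sketch fails."""


def is_ascii(label):
    return all(ord(ch) <= 0x7F for ch in label)


def is_ldh(ch):
    """True for U+002D, 30..39, 41..5A, and 61..7A."""
    return ch == "-" or "0" <= ch <= "9" or "A" <= ch <= "Z" or "a" <= ch <= "z"


def to_ascii(label, use_std3_ascii_rules=False):
    # Steps 1 and 2: Nameprep is applied only if a non-ASCII code point
    # is present.
    if not is_ascii(label):
        try:
            label = nameprep(label)
        except UnicodeError as exc:
            raise ToASCIIError("Nameprep failed") from exc

    # Step 3: optional host name checks.
    if use_std3_ascii_rules:
        if any(ord(ch) <= 0x7F and not is_ldh(ch) for ch in label):
            raise ToASCIIError("non-LDH ASCII code point")
        if label.startswith("-") or label.endswith("-"):
            raise ToASCIIError("leading or trailing hyphen-minus")

    # Steps 4 to 7: only labels that are still non-ASCII are encoded.
    if not is_ascii(label):
        if label.lower().startswith(ACE_PREFIX):
            raise ToASCIIError("label already begins with the ACE prefix")
        try:
            label = ACE_PREFIX + label.encode("punycode").decode("ascii")
        except UnicodeError as exc:
            raise ToASCIIError("Punycode encoding failed") from exc

    # Step 8: length check on the final sequence.
    if not 1 <= len(label) <= 63:
        raise ToASCIIError("label length outside 1..63")
    return label
```

In this sketch an all-ASCII label is returned unchanged unless a check
fails, and applying to_ascii() to its own output gives the same
result, matching the two properties stated before the step list.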
 461
 462 4.2 ToUnicode
 463
 464 The ToUnicode operation takes a sequence of Unicode code points that
 465 make up one label and returns a sequence of Unicode code points. If the
 466 input sequence is a label in ACE form, then the result is an equivalent
 467 internationalized label that is not in ACE form, otherwise the original
 468 sequence is returned unaltered.
 469
 470 ToUnicode never fails. If any step fails, then the original input
 471 sequence is returned immediately in that step.
 472
 473 The ToUnicode output never contains more code points than its input.
 474 Note that the number of octets needed to represent a sequence of code
 475 points depends on the particular character encoding used.
 476
 477 The inputs to ToUnicode are a sequence of code points, the
 478 AllowUnassigned flag, and the UseSTD3ASCIIRules flag. The output of
 479 ToUnicode is always a sequence of Unicode code points.
 480
 481     1. If all code points in the sequence are in the ASCII range (0..7F)
 482        then skip to step 3.
 483
 484     2. Perform the steps specified in [NAMEPREP] and fail if there is an
 485        error. (If step 3 of ToASCII is also performed here, it will not
 486        affect the overall behavior of ToUnicode, but it is not
 487        necessary.) The AllowUnassigned flag is used in [NAMEPREP].
 488
 489     3. Verify that the sequence begins with the ACE prefix, and save a
 490        copy of the sequence.
 491
 492     4. Remove the ACE prefix.
 493
 494     5. Decode the sequence using the decoding algorithm in [PUNYCODE]
 495        and fail if there is an error. Save a copy of the result of
 496        this step.
 497
 498     6. Apply ToASCII.
 499
 500     7. Verify that the result of step 6 matches the saved copy from
 501        step 3, using a case-insensitive ASCII comparison.
 502
 503     8. Return the saved copy from step 5.
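
A corresponding Python sketch of ToUnicode is given below, again as an
illustration rather than a reference implementation. It makes the same
assumptions as any Python rendering of ToASCII would: CPython's
encodings.idna.nameprep() stands in for [NAMEPREP], the "punycode"
codec stands in for [PUNYCODE], the prefix "xn--" is an assumed value,
and neither flag is modelled. The helper _to_ascii() is a condensed,
hypothetical version of the steps in section 4.1, included only so
that the block is self-contained.

```python
from encodings.idna import nameprep  # CPython's Nameprep helper

ACE_PREFIX = "xn--"  # assumption: not the placeholder used in this draft


def _is_ascii(label):
    return all(ord(ch) <= 0x7F for ch in label)


def _to_ascii(label):
    """Condensed ToASCII without the optional STD3 checks."""
    if not _is_ascii(label):
        label = nameprep(label)
    if not _is_ascii(label):
        if label.lower().startswith(ACE_PREFIX):
            raise UnicodeError("label begins with the ACE prefix")
        label = ACE_PREFIX + label.encode("punycode").decode("ascii")
    if not 1 <= len(label) <= 63:
        raise UnicodeError("label length outside 1..63")
    return label


def to_unicode(label):
    original = label
    try:
        # Steps 1 and 2: Nameprep only if a non-ASCII code point is present.
        if not _is_ascii(label):
            label = nameprep(label)
        # Step 3: the prefix is recognized case-insensitively.
        if not label.lower().startswith(ACE_PREFIX):
            return original
        saved_ace = label
        # Step 4: remove the ACE prefix.
        encoded = label[len(ACE_PREFIX):]
        # Step 5: decode, keeping the result for step 8.
        decoded = encoded.encode("ascii").decode("punycode")
        # Steps 6 and 7: re-encode and compare with the copy from step 3.
        if _to_ascii(decoded).lower() != saved_ace.lower():
            return original
        # Step 8.
        return decoded
    except ValueError:
        # UnicodeError is a subclass of ValueError. ToUnicode never fails:
        # any failing step returns the original input.
        return original
```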
 504
 505
 506 5. ACE prefix
 507
 508 [[ Note to the IESG and Internet Draft readers: The two uses of the
 509 string "IESG--" below are to be changed at time of publication to a
 510 prefix which fulfills the requirements in the first paragraph. IANA will
 511 assign this value. ]]
 512
 513 The ACE prefix, used in the conversion operations (section 4), is two
 514 alphanumeric ASCII characters followed by two hyphen-minuses. It cannot
 515 be any of the prefixes already used in earlier documents, which includes
 516 the following: "bl--", "bq--", "dq--", "lq--", "mq--", "ra--", "wq--"
 517 and "zq--". The ToASCII and ToUnicode operations MUST recognize the ACE
 518 prefix in a case-insensitive manner.
 519
 520 The ACE prefix for IDNA is "IESG--".
 521
 522 This means that an ACE label might be "IESG--de-jg4avhby1noc0d", where
 523 "de-jg4avhby1noc0d" is the part of the ACE label that is generated by
 524 the encoding steps in [PUNYCODE].
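
The relationship between the prefix and the Punycode part of the
example label can be checked with a few lines of Python. This is an
informal sketch: it uses Python's standard "punycode" codec as a
stand-in for [PUNYCODE] and keeps the placeholder prefix from this
draft. The function strip_ace_prefix() is a hypothetical helper.

```python
ACE_PREFIX = "IESG--"  # placeholder prefix from this draft
ACE_LABEL = "IESG--de-jg4avhby1noc0d"


def strip_ace_prefix(label):
    """Remove the ACE prefix, recognizing it case-insensitively."""
    head = label[:len(ACE_PREFIX)]
    if head.lower() != ACE_PREFIX.lower():
        raise ValueError("label does not begin with the ACE prefix")
    return label[len(ACE_PREFIX):]


# The part after the prefix is what the Punycode encoder produced.
encoded = strip_ace_prefix(ACE_LABEL)
decoded = encoded.encode("ascii").decode("punycode")

# Encoding the decoded label again and prepending the prefix
# reproduces the original ACE label.
reencoded = ACE_PREFIX + decoded.encode("punycode").decode("ascii")

if __name__ == "__main__":
    print(encoded)
    print(reencoded == ACE_LABEL)
```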
 525
 526 While all ACE labels begin with the ACE prefix, not all labels beginning
 527 with the ACE prefix are necessarily ACE labels.  Non-ACE labels that
 528 begin with the ACE prefix will confuse users and SHOULD NOT be allowed
 529 in DNS zones.
 530
 531
 532 6. Implications for typical applications using DNS
 533
 534 In IDNA, applications perform the processing needed to input
 535 internationalized domain names from users, display internationalized
 536 domain names to users, and process the inputs and outputs from DNS and
 537 other protocols that carry domain names.
 538
 539 The components and interfaces between them can be represented
 540 pictorially as:
 541
 542                      +------+
 543                      | User |
 544                      +------+
 545                         ^
 546                         | Input and display: local interface methods
 547                         | (pen, keyboard, glowing phosphorus, ...)
 548     +-------------------|-------------------------------+
 549     |                   v                               |
 550     |          +-----------------------------+          |
 551     |          |        Application          |          |
 552     |          |   (ToASCII and ToUnicode    |          |
 553     |          |      operations may be      |          |
 554     |          |        called here)         |          |
 555     |          +-----------------------------+          |
 556     |                   ^        ^                      | End system
 557     |                   |        |                      |
 558     | Call to resolver: |        | Application-specific |
 559     |              ACE  |        | protocol:            |
 560     |                   v        | ACE unless the       |
 561     |           +----------+     | protocol is updated  |
 562     |           | Resolver |     | to handle other      |
 563     |           +----------+     | encodings            |
 564     |                 ^          |                      |
 565     +-----------------|----------|----------------------+
 566         DNS protocol: |          |
 567                   ACE |          |
 568                       v          v
 569            +-------------+    +---------------------+
 570            | DNS servers |    | Application servers |
 571            +-------------+    +---------------------+
 572
 573 The box labeled "Application" is where the application splits a domain
 574 name into labels, sets the appropriate flags, and performs the ToASCII
 575 and ToUnicode operations. This is described in section 4.
 576
 577 6.1 Entry and display in applications
 578
 579 Applications can accept domain names using any character set or sets
 580 desired by the application developer, and can display domain names in
 581 any charset. That is, the IDNA protocol does not affect the interface
 582 between users and applications.
 583
 584 An IDNA-aware application can accept and display internationalized
 585 domain names in two formats: the internationalized character set(s)
 586 supported by the application, and as an ACE label. ACE labels that are
 587 displayed or input MUST always include the ACE prefix. Applications MAY
 588 allow input and display of ACE labels, but are not encouraged to do so
 589 except as an interface for special purposes, possibly for debugging, or
 590 to cope with display limitations as described in section 6.4.. ACE
 591 encoding is opaque and ugly, and should thus only be exposed to users
 592 who absolutely need it. Because name labels encoded as ACE name labels
 593 can be rendered either as the encoded ASCII characters or the proper
 594 decoded characters, the application MAY have an option for the user to
 595 select the preferred method of display; if it does, rendering the ACE
 596 SHOULD NOT be the default.
 597
 598 Domain names are often stored and transported in many places. For
 599 example, they are part of documents such as mail messages and web pages.
 600 They are transported in many parts of many protocols, such as both the
 601 control commands and the RFC 2822 body parts of SMTP, and the headers
 602 and the body content in HTTP. It is important to remember that domain
 603 names appear both in domain name slots and in the content that is passed
 604 over protocols.
 605
 606 In protocols and document formats that define how to handle
 607 specification or negotiation of charsets, labels can be encoded in any
 608 charset allowed by the protocol or document format. If a protocol or
 609 document format only allows one charset, the labels MUST be given in
 610 that charset.
 611
 612 In any place where a protocol or document format allows transmission of
 613 the characters in internationalized labels, internationalized labels
 614 SHOULD be transmitted using whatever character encoding and escape
 615 mechanism the protocol or document format uses at that place.
 616
 617 All protocols that use domain name slots already have the capacity for
 618 handling domain names in the ASCII charset. Thus, ACE labels
 619 (internationalized labels that have been processed with the ToASCII
 620 operation) can inherently be handled by those protocols.
 621
 622 6.2 Applications and resolver libraries
 623
 624 Applications normally use functions in the operating system when they
 625 resolve DNS queries. Those functions in the operating system are often
 626 called "the resolver library", and the applications communicate with the
 627 resolver libraries through a programming interface (API).
 628
 629 Because these resolver libraries today expect only domain names in
 630 ASCII, applications MUST prepare labels that are passed to the resolver
 631 library using the ToASCII operation. Labels received from the resolver
 632 library contain only ASCII characters; internationalized labels that
 633 cannot be represented directly in ASCII use the ACE form. ACE labels
 634 always include the ACE prefix.
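
The following non-normative Python sketch illustrates the paragraph
above: a name is converted label by label with ToASCII before it is
handed to an ASCII-only resolver API. It assumes Python's standard
"idna" codec, which implements ToASCII as published in RFC 3490 and
uses the ACE prefix "xn--"; this document itself leaves the prefix to
IANA. The function name prepare_for_resolver and the example name are
hypothetical.

```python
def prepare_for_resolver(name: str) -> str:
    """Apply ToASCII to every label of a domain name.

    Assumption: Python's built-in "idna" codec stands in for ToASCII.
    It splits the name on dots and converts each label; labels that
    are already ASCII pass through unchanged.
    """
    return name.encode("idna").decode("ascii")


ascii_name = prepare_for_resolver("bücher.example")
print(ascii_name)  # xn--bcher-kva.example

# The ASCII result can then be passed to the platform resolver, e.g.:
#   import socket
#   socket.getaddrinfo(ascii_name, 80)
```

The lookup itself is left commented out so that the sketch runs
without network access.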
 635
 636 An operating system might have a set of libraries for performing the
 637 ToASCII operation. The input to such a library might be in one or more
 638 charsets that are used in applications (UTF-8 and UTF-16 are likely
 639 candidates for almost any operating system, and script-specific charsets
 640 are likely for localized operating systems).
 641
 642 IDNA-aware applications MUST be able to work with both
 643 non-internationalized labels (those that conform to [STD13] and [STD3])
 644 and internationalized labels.
 645
 646 It is expected that new versions of the resolver libraries in the future
 647 will be able to accept domain names in other charsets than ASCII, and
 648 application developers might one day pass not only domain names in
 649 Unicode, but also in local script to a new API for the resolver
 650 libraries in the operating system. Thus the ToASCII and ToUnicode
 651 operations might be performed inside these new versions of the resolver
 652 libraries.
 653
 654 Domain names passed to resolvers or put into the question
 655 section of DNS requests follow the rules for "queries" from
 656 [STRINGPREP].
 657
 658 6.3 DNS servers
 659
 660 Domain names stored in zones follow the rules for "stored strings" from
 661 [STRINGPREP].
 662
 663 For internationalized labels that cannot be represented directly in
 664 ASCII, DNS servers MUST use the ACE form produced by the ToASCII
 665 operation. All IDNs served by DNS servers MUST contain only ASCII
 666 characters.
 667
 668 If a signaling system which makes negotiation possible between old and
 669 new DNS clients and servers is standardized in the future, the encoding
 670 of the query in the DNS protocol itself can be changed from ACE to
 671 something else, such as UTF-8. Whether such an encoding should be used
 672 is, however, a separate question and is not discussed in this
 673 memo.
 674
 675 6.4 Avoiding exposing users to the raw ACE encoding
 676
 677 Any application that might show the user a domain name obtained from a
 678 domain name slot, such as from gethostbyaddr or part of a mail header,
 679 will need to be updated if it is to prevent users from seeing the ACE.
 680
 681 If an application decodes an ACE name using ToUnicode but cannot show
 682 all of the characters in the decoded name, such as if the name contains
 683 characters that the output system cannot display, the application SHOULD
 684 show the name in ACE format (which always includes the ACE prefix)
 685 instead of displaying the name with the replacement character (U+FFFD).
 686 This is to make it easier for the user to transfer the name correctly to
 687 other programs. Programs that by default show the ACE form when they
 688 cannot show all the characters in a name label SHOULD also have a
 689 mechanism to show the name that is produced by the ToUnicode operation
 690 with as many characters as possible and replacement characters in the
 691 positions where characters cannot be displayed.
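
One possible reading of the display rule above is sketched below in
Python, purely as an illustration. The output system is modeled, as an
assumption, by a charset name: a character counts as displayable if it
can be encoded in that charset. Python's "idna" codec stands in for
ToUnicode, and display_form and its parameters are hypothetical names.

```python
def display_form(ace_label: str, output_charset: str,
                 best_effort: bool = False) -> str:
    """Pick the form of one label to show to the user."""
    try:
        # Stand-in for ToUnicode (assumption: Python's "idna" codec).
        decoded = ace_label.encode("ascii").decode("idna")
    except UnicodeError:
        return ace_label  # not a valid ACE label: show it unchanged

    try:
        decoded.encode(output_charset)
        return decoded  # every character can be displayed
    except UnicodeEncodeError:
        if not best_effort:
            return ace_label  # default: fall back to the ACE form

    # Optional mode: keep what can be shown and put U+FFFD elsewhere.
    shown = []
    for ch in decoded:
        try:
            ch.encode(output_charset)
            shown.append(ch)
        except UnicodeEncodeError:
            shown.append("\ufffd")
    return "".join(shown)


print(display_form("xn--bcher-kva", "latin-1"))  # decoded form
print(display_form("xn--bcher-kva", "ascii"))    # ACE form
```

The best_effort flag corresponds to the secondary mechanism described
above; it is off by default so that the user sees a form that can be
copied to other programs without loss.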
 692
 693 The ToUnicode operation does not alter labels that are not valid ACE
 694 labels, even if they begin with the ACE prefix. After ToUnicode has been
 695 applied, if a label still begins with the ACE prefix, then it is not a
 696 valid ACE label, and is not equivalent to any of the intermediate
 697 Unicode strings constructed by ToUnicode.
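
The check described above can be sketched as follows. This is a
non-normative illustration that makes two assumptions: the ACE prefix
is "xn--", as used by Python's "idna" codec, and the input is an ASCII
label taken from a domain name slot. Python's codec raises an error
where ToUnicode would return its input, so the hypothetical wrapper
to_unicode restores the "never fails" behavior.

```python
ACE_PREFIX = "xn--"  # assumption: not fixed by this document


def to_unicode(label: str) -> str:
    """ToUnicode for an ASCII label: return the input on any failure."""
    try:
        return label.encode("ascii").decode("idna")
    except UnicodeError:
        return label


def is_invalid_ace(label: str) -> bool:
    """True if the label still begins with the ACE prefix after ToUnicode."""
    return to_unicode(label).lower().startswith(ACE_PREFIX)


# "xn--ascii-" decodes to "ascii", which does not convert back to the
# same label, so it is left unaltered and is not a valid ACE label.
for label in ("xn--bcher-kva", "xn--ascii-", "example"):
    print(label, "->", to_unicode(label), is_invalid_ace(label))
```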
 698
 699 6.5 DNSSEC authentication of IDN domain names
 700
 701 DNS Security [DNSSEC] is a method for supplying cryptographic
 702 verification information along with DNS messages. Public Key
 703 Cryptography is used in conjunction with digital signatures to provide a
 704 means for a requester of domain information to authenticate the source
 705 of the data. This ensures that it can be traced back to a trusted
 706 source, either directly, or via a chain of trust linking the source of
 707 the information to the top of the DNS hierarchy.
 708
 709 IDNA specifies that all internationalized domain names served by DNS
 710 servers that cannot be represented directly in ASCII must use the ACE
 711 form produced by the ToASCII operation. This operation must be performed
 712 prior to a zone being signed by the private key for that zone. Because
 713 of this ordering, it is important to recognize that DNSSEC authenticates
 714 the ASCII domain name, not the Unicode form or the mapping between the
 715 Unicode form and the ASCII form. In the presence of DNSSEC, this is the
 716 name that MUST be signed in the zone and MUST be validated against.
 717
 718 One consequence of this for sites deploying IDNA in the presence of
 719 DNSSEC is that any special purpose proxies or forwarders used to
 720 transform user input into IDNs must be earlier in the resolution flow
 721 than DNSSEC authenticating nameservers for DNSSEC to work.
 722
 723
 724 7. Name server considerations
 725
 726 Existing DNS servers do not know the IDNA rules for handling non-ASCII
 727 forms of IDNs, and therefore need to be shielded from them.  All
 728 existing channels through which names can enter a DNS server database
 729 (for example, master files [STD13] and DNS update messages [RFC2136])
 730 are IDN-unaware because they predate IDNA, and therefore requirement 2
 731 of section 3.1 of this document provides the needed shielding, by ensuring
 732 that internationalized domain names entering DNS server databases
 733 through such channels have already been converted to their equivalent
 734 ASCII forms.
 735
 736 It is imperative that there be only one ASCII encoding for a particular
 737 domain name. Because of the design of the ToASCII and ToUnicode
 738 operations, there are no ACE labels that decode to ASCII labels, and
 739 therefore name servers cannot contain multiple ASCII encodings of the
 740 same domain name.
 741
 742 [RFC2181] explicitly allows domain labels to contain octets beyond the
 743 ASCII range (0..7F), and this document does not change that. Note,
 744 however, that there is no defined interpretation of octets 80..FF as
 745 characters. If labels containing these octets are returned to
 746 applications, unpredictable behavior could result. The ASCII form
 747 defined by ToASCII is the only standard representation for
 748 internationalized labels in the current DNS protocol.
 749
 750
 751 8. Root server considerations
 752
 753 IDNs are likely to be somewhat longer than current domain names, so the
 754 bandwidth needed by the root servers is likely to go up by a small amount.
 755 Also, queries and responses for IDNs will probably be somewhat longer
 756 than typical queries today, so more queries and responses may be forced
 757 to go to TCP instead of UDP.
 758
 759
 760 9. References
 761
 762 9.1 Normative references
 763
 764 [PUNYCODE] Adam Costello, "Punycode: An encoding of Unicode for use with
 765 IDNA", draft-ietf-idn-punycode.
 766
 767 [NAMEPREP] Paul Hoffman and Marc Blanchet, "Nameprep: A Stringprep
 768 Profile for Internationalized Domain Names", draft-ietf-idn-nameprep.
 769
 770 [STD3] Bob Braden, "Requirements for Internet Hosts -- Communication
 771 Layers" (RFC 1122) and "Requirements for Internet Hosts -- Application
 772 and Support" (RFC 1123), STD 3, October 1989.
 773
 774 [STD13] Paul Mockapetris, "Domain names - concepts and facilities" (RFC
 775 1034) and "Domain names - implementation and specification" (RFC 1035),
 776 STD 13, November 1987.
 777
 778 [STRINGPREP] Paul Hoffman and Marc Blanchet, "Preparation of
 779 Internationalized Strings ("stringprep")", draft-hoffman-stringprep,
 780 work in progress.
 781
 782 9.2 Informative references
 783
 784 [DNSSEC] Don Eastlake, "Domain Name System Security Extensions", RFC
 785 2535, March 1999.
 786
 787 [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
 788 Requirement Levels", March 1997, RFC 2119.
 789
 790 [RFC2181] Robert Elz and Randy Bush, "Clarifications to the DNS
 791 Specification", RFC 2181, July 1997.
 792
 793 [UAX9] Unicode Standard Annex #9, The Bidirectional Algorithm,
 794 <http://www.unicode.org/unicode/reports/tr9/>.
 795
 796 [UNICODE] The Unicode Consortium. The Unicode Standard, Version 3.2.0 is
 797 defined by The Unicode Standard, Version 3.0 (Reading, MA,
 798 Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode
 799 Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/)
 800 and by the Unicode Standard Annex #28: Unicode 3.2
 801 (http://www.unicode.org/reports/tr28/).
 802
 803 [USASCII] Vint Cerf, "ASCII format for Network Interchange", October
 804 1969, RFC 20.
 805
 806
 807 10. Security considerations
 808
 809 Security on the Internet partly relies on the DNS. Thus, any change to
 810 the characteristics of the DNS can change the security of much of the
 811 Internet.
 812
 813 This memo describes an algorithm which encodes characters that are not
 814 valid according to STD3 and STD13 into octet values that are valid. No
 815 security issues such as string length increases or new allowed values
 816 are introduced by the encoding process or the use of these encoded
 817 values, apart from those introduced by the ACE encoding itself.
 818
 819 Domain names are used by users to identify and connect to Internet
 820 servers.  The security of the Internet is compromised if a user entering
 821 a single internationalized name is connected to different servers based
 822 on different interpretations of the internationalized domain name.
 823
 824 When systems use local character sets other than ASCII and Unicode, this
 825 specification leaves the problem of transcoding between the local
 826 character set and Unicode up to the application. If different
 827 applications (or different versions of one application) implement
 828 different transcoding rules, they could interpret the same name
 829 differently and contact different servers. This problem is not solved by
 830 security protocols like TLS that do not take local character sets into
 831 account.
 832
 833 Because this document normatively refers to [NAMEPREP], [PUNYCODE], and
 834 [STRINGPREP], it includes the security considerations from those
 835 documents as well.
 836
 837 If or when this specification is updated to use a more recent Unicode
 838 normalization table, the new normalization table will need to be
 839 compared with the old to spot backwards incompatible changes.  If there
 840 are such changes, they will need to be handled somehow, or there will be
 841 security as well as operational implications.  Methods to handle the
 842 conflicts could include keeping the old normalization, or taking care of
 843 the conflicting characters by operational means, or some other method.
 844
 845 Implementations MUST NOT use more recent normalization tables than the
 846 one referenced from this document, even though more recent tables may be
 847 provided by operating systems.  If an application is unsure of which
 848 version of the normalization tables is in the operating system, the
 849 application needs to include the normalization tables itself.  Using
 850 normalization tables other than the one referenced from this
 851 specification could have security and operational implications.
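
As an illustration of carrying a fixed table version, the Python
standard library exposes a frozen Unicode 3.2.0 database alongside the
current one. The sketch below is non-normative and assumes that the
NFKC form from that database is the normalization intended here; the
name nfkc_pinned is hypothetical.

```python
import unicodedata

# Frozen Unicode 3.2.0 database shipped with Python, independent of
# whatever newer tables the platform provides.
pinned = unicodedata.ucd_3_2_0


def nfkc_pinned(text: str) -> str:
    """Normalize with the pinned 3.2.0 tables, never the newer ones."""
    return pinned.normalize("NFKC", text)


print(pinned.unidata_version)       # version of the pinned tables
print(unicodedata.unidata_version)  # version the platform uses by default
print(nfkc_pinned("\uff21"))        # FULLWIDTH LATIN CAPITAL LETTER A
```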
 852
 853 To help prevent confusion between characters that are visually similar,
 854 it is suggested that implementations provide visual indications where a
 855 domain name contains multiple scripts. Such mechanisms can also be used
 856 to show when a name contains a mixture of simplified and traditional
 857 Chinese characters, or to distinguish zero and one from O and l.
 858 DNS zone administrators may impose restrictions (subject to the
 859 limitations in section 2) that try to minimize homographs.
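
A user interface might decide when to show such a visual indication
along the lines of the Python sketch below. It is only a rough
heuristic under a stated assumption: Python's standard library has no
script property, so the first word of each letter's Unicode character
name (for example "LATIN" or "CYRILLIC") is used as a stand-in for the
script. The function names are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the set of scripts used by the letters of a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # digits and hyphens carry no script here
            name = unicodedata.name(ch, "")
            if name:
                # Heuristic: "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts


def needs_visual_indication(label: str) -> bool:
    """True if the label appears to mix more than one script."""
    return len(scripts_in_label(label)) > 1


mixed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(sorted(scripts_in_label(mixed)), needs_visual_indication(mixed))
```

This heuristic does not address the digit and letter case (zero and
one versus O and l) mentioned above; that would need a separate table
of confusable characters.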
 860
 861 Domain names (or portions of them) are sometimes compared against a set
 862 of privileged or anti-privileged domains. In such situations it is
 863 especially important that the comparisons be done properly, as specified
 864 in section 3.1 requirement 4. For labels already in ASCII form, the
 865 proper comparison reduces to the same case-insensitive ASCII comparison
 866 that has always been used for ASCII labels.
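
A comparison of this kind can be sketched in Python as follows. The
sketch is non-normative and assumes that the required comparison is a
case-insensitive ASCII match of the ToASCII results, with Python's
"idna" codec (ACE prefix "xn--") standing in for ToASCII. The names
names_match, is_blocked, and BLOCKED are hypothetical.

```python
def names_match(a: str, b: str) -> bool:
    """Compare two domain names by their ASCII forms, ignoring ASCII case."""
    ascii_a = a.encode("idna").decode("ascii")
    ascii_b = b.encode("idna").decode("ascii")
    return ascii_a.lower() == ascii_b.lower()


BLOCKED = ["bücher.example"]  # hypothetical anti-privileged list


def is_blocked(name: str) -> bool:
    """Check a name against the list using the comparison above."""
    return any(names_match(name, entry) for entry in BLOCKED)


print(names_match("BÜCHER.example", "xn--bcher-kva.EXAMPLE"))  # True
print(names_match("bucher.example", "bücher.example"))         # False
print(is_blocked("XN--BCHER-KVA.example"))                     # True
```

Comparing the raw strings instead would let the ACE form or a
differently cased form of a listed name slip past the check.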
 867
 868 The introduction of IDNA means that any existing labels that start with
 869 the ACE prefix and would be altered by ToUnicode will automatically be
 870 ACE labels, and will be considered equivalent to non-ASCII labels,
 871 whether or not that was the intent of the zone administrator or
 872 registrant.
 873
 874
 875 11. IANA considerations
 876
 877 IANA will assign the ACE prefix in consultation with the IESG.
 878
 879
 880 12. Authors' addresses
 881
 882 Patrik Faltstrom
 883 Cisco Systems
 884 Arstaangsvagen 31 J
 885 S-117 43 Stockholm  Sweden
 886 paf@cisco.com
 887
 888 Paul Hoffman
 889 Internet Mail Consortium and VPN Consortium
 890 127 Segre Place
 891 Santa Cruz, CA  95060  USA
 892 phoffman@imc.org
 893
 894 Adam M. Costello
 895 University of California, Berkeley
 896 idna-spec.amc @ nicemice.net
 897
 898
 899 A. Changes from -13 to -14
 900 [[ To be removed by the RFC Editor before publishing ]]
 901
 902 Made changes based on messages from the Area Directors. See
 903 <http://www.imc.org/idn/mail-archive/msg07261.html> and
 904 <http://www.imc.org/idn/mail-archive/msg07263.html> for more detail.