source4/heimdal/lib/wind/rfc3490.txt

   1
   2
   3
   4
   5
   6
   7 Network Working Group                                       P. Faltstrom
   8 Request for Comments: 3490                                         Cisco
   9 Category: Standards Track                                     P. Hoffman
  10                                                               IMC & VPNC
  11                                                              A. Costello
  12                                                              UC Berkeley
  13                                                               March 2003
  14
  15
  16          Internationalizing Domain Names in Applications (IDNA)
  17
  18 Status of this Memo
  19
  20    This document specifies an Internet standards track protocol for the
  21    Internet community, and requests discussion and suggestions for
  22    improvements.  Please refer to the current edition of the "Internet
  23    Official Protocol Standards" (STD 1) for the standardization state
  24    and status of this protocol.  Distribution of this memo is unlimited.
  25
  26 Copyright Notice
  27
  28    Copyright (C) The Internet Society (2003).  All Rights Reserved.
  29
  30 Abstract
  31
  32    Until now, there has been no standard method for domain names to use
  33    characters outside the ASCII repertoire.  This document defines
  34    internationalized domain names (IDNs) and a mechanism called
  35    Internationalizing Domain Names in Applications (IDNA) for handling
  36    them in a standard fashion.  IDNs use characters drawn from a large
  37    repertoire (Unicode), but IDNA allows the non-ASCII characters to be
  38    represented using only the ASCII characters already allowed in so-
  39    called host names today.  This backward-compatible representation is
  40    required in existing protocols like DNS, so that IDNs can be
  41    introduced with no changes to the existing infrastructure.  IDNA is
  42    only meant for processing domain names, not free text.
  43
  44 Table of Contents
  45
  46    1. Introduction..................................................  2
  47       1.1 Problem Statement.........................................  3
  48       1.2 Limitations of IDNA.......................................  3
  49       1.3 Brief overview for application developers.................  4
  50    2. Terminology...................................................  5
  51    3. Requirements and applicability................................  7
  52       3.1 Requirements..............................................  7
  53       3.2 Applicability.............................................  8
  54          3.2.1. DNS resource records................................  8
  55
  56
  57
  58 Faltstrom, et al.           Standards Track                     [Page 1]
  59 \f
  60 RFC 3490                          IDNA                        March 2003
  61
  62
  63          3.2.2. Non-domain-name data types stored in domain names...  9
  64    4. Conversion operations.........................................  9
  65       4.1 ToASCII................................................... 10
  66       4.2 ToUnicode................................................. 11
  67    5. ACE prefix.................................................... 12
  68    6. Implications for typical applications using DNS............... 13
  69       6.1 Entry and display in applications......................... 14
  70       6.2 Applications and resolver libraries....................... 15
  71       6.3 DNS servers............................................... 15
  72       6.4 Avoiding exposing users to the raw ACE encoding........... 16
  73       6.5  DNSSEC authentication of IDN domain names................ 16
  74    7. Name server considerations.................................... 17
  75    8. Root server considerations.................................... 17
  76    9. References.................................................... 18
  77       9.1 Normative References...................................... 18
  78       9.2 Informative References.................................... 18
  79    10. Security Considerations...................................... 19
  80    11. IANA Considerations.......................................... 20
  81    12. Authors' Addresses........................................... 21
  82    13. Full Copyright Statement..................................... 22
  83
  84 1. Introduction
  85
  86    IDNA works by allowing applications to use certain ASCII name labels
  87    (beginning with a special prefix) to represent non-ASCII name labels.
  88    Lower-layer protocols need not be aware of this; therefore IDNA does
  89    not depend on changes to any infrastructure.  In particular, IDNA
  90    does not depend on any changes to DNS servers, resolvers, or protocol
  91    elements, because the ASCII name service provided by the existing DNS
  92    is entirely sufficient for IDNA.
  93
  94    This document does not require any applications to conform to IDNA,
  95    but applications can elect to use IDNA in order to support IDN while
  96    maintaining interoperability with existing infrastructure.  If an
  97    application wants to use non-ASCII characters in domain names, IDNA
  98    is the only currently-defined option.  Adding IDNA support to an
  99    existing application entails changes to the application only, and
 100    leaves room for flexibility in the user interface.
 101
 102    A great deal of the discussion of IDN solutions has focused on
 103    transition issues and how IDN will work in a world where not all of
 104    the components have been updated.  Proposals that were not chosen by
 105    the IDN Working Group would depend on user applications, resolvers,
 106    and DNS servers being updated in order for a user to use an
 107    internationalized domain name.  Rather than rely on widespread
 108    updating of all components, IDNA depends on updates to user
 109    applications only; no changes are needed to the DNS protocol or any
  110    DNS servers or the resolvers on users' computers.
 111
 119 1.1 Problem Statement
 120
 121    The IDNA specification solves the problem of extending the repertoire
 122    of characters that can be used in domain names to include the Unicode
 123    repertoire (with some restrictions).
 124
 125    IDNA does not extend the service offered by DNS to the applications.
 126    Instead, the applications (and, by implication, the users) continue
 127    to see an exact-match lookup service.  Either there is a single
 128    exactly-matching name or there is no match.  This model has served
 129    the existing applications well, but it requires, with or without
 130    internationalized domain names, that users know the exact spelling of
 131    the domain names that the users type into applications such as web
 132    browsers and mail user agents.  The introduction of the larger
 133    repertoire of characters potentially makes the set of misspellings
 134    larger, especially given that in some cases the same appearance, for
 135    example on a business card, might visually match several Unicode code
 136    points or several sequences of code points.
 137
 138    IDNA allows the graceful introduction of IDNs not only by avoiding
 139    upgrades to existing infrastructure (such as DNS servers and mail
 140    transport agents), but also by allowing some rudimentary use of IDNs
 141    in applications by using the ASCII representation of the non-ASCII
 142    name labels.  While such names are very user-unfriendly to read and
 143    type, and hence are not suitable for user input, they allow (for
 144    instance) replying to email and clicking on URLs even though the
 145    domain name displayed is incomprehensible to the user.  In order to
 146    allow user-friendly input and output of the IDNs, the applications
 147    need to be modified to conform to this specification.
 148
 149    IDNA uses the Unicode character repertoire, which avoids the
 150    significant delays that would be inherent in waiting for a different
  151    and specific character set to be defined for IDN purposes by some other
 152    standards developing organization.
 153
 154 1.2 Limitations of IDNA
 155
 156    The IDNA protocol does not solve all linguistic issues with users
 157    inputting names in different scripts.  Many important language-based
 158    and script-based mappings are not covered in IDNA and need to be
 159    handled outside the protocol.  For example, names that are entered in
 160    a mix of traditional and simplified Chinese characters will not be
  161    mapped to a single canonical name.  As another example, Scandinavian
 162    names that are entered with U+00F6 (LATIN SMALL LETTER O WITH
 163    DIAERESIS) will not be mapped to U+00F8 (LATIN SMALL LETTER O WITH
 164    STROKE).
 165
 175    An example of an important issue that is not considered in detail in
 176    IDNA is how to provide a high probability that a user who is entering
 177    a domain name based on visual information (such as from a business
 178    card or billboard) or aural information (such as from a telephone or
 179    radio) would correctly enter the IDN.  Similar issues exist for ASCII
 180    domain names, for example the possible visual confusion between the
 181    letter 'O' and the digit zero, but the introduction of the larger
  182    repertoire of characters creates more opportunities for similar
 183    looking and similar sounding names.  Note that this is a complex
 184    issue relating to languages, input methods on computers, and so on.
 185    Furthermore, the kind of matching and searching necessary for a high
 186    probability of success would not fit the role of the DNS and its
 187    exact matching function.
 188
 189 1.3 Brief overview for application developers
 190
 191    Applications can use IDNA to support internationalized domain names
 192    anywhere that ASCII domain names are already supported, including DNS
 193    master files and resolver interfaces.  (Applications can also define
 194    protocols and interfaces that support IDNs directly using non-ASCII
 195    representations.  IDNA does not prescribe any particular
 196    representation for new protocols, but it still defines which names
 197    are valid and how they are compared.)
 198
 199    The IDNA protocol is contained completely within applications.  It is
 200    not a client-server or peer-to-peer protocol: everything is done
 201    inside the application itself.  When used with a DNS resolver
 202    library, IDNA is inserted as a "shim" between the application and the
 203    resolver library.  When used for writing names into a DNS zone, IDNA
 204    is used just before the name is committed to the zone.
 205
 206    There are two operations described in section 4 of this document:
 207
 208    -  The ToASCII operation is used before sending an IDN to something
 209       that expects ASCII names (such as a resolver) or writing an IDN
 210       into a place that expects ASCII names (such as a DNS master file).
 211
 212    -  The ToUnicode operation is used when displaying names to users,
 213       for example names obtained from a DNS zone.
 214
 215    It is important to note that the ToASCII operation can fail.  If it
 216    fails when processing a domain name, that domain name cannot be used
 217    as an internationalized domain name and the application has to have
 218    some method of dealing with this failure.
 219
 220    IDNA requires that implementations process input strings with
 221    Nameprep [NAMEPREP], which is a profile of Stringprep [STRINGPREP],
 222    and then with Punycode [PUNYCODE].  Implementations of IDNA MUST
 231    fully implement Nameprep and Punycode; neither Nameprep nor Punycode
  232    is optional.
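
The Nameprep-then-Punycode pipeline described above can be sketched in
Python as follows.  This is a non-normative illustration, not a full
implementation: it relies on CPython's standard "encodings.idna" module
and "punycode" codec, the helper name nameprep_then_punycode is
hypothetical, and it assumes the label contains at least one non-ASCII
character.  The complete ToASCII operation (section 4.1) performs
additional checks.

```python
import encodings.idna as rfc3490  # CPython's RFC 3490 helper module

ACE_PREFIX = b"xn--"  # the prefix specified in section 5


def nameprep_then_punycode(label: str) -> bytes:
    """Illustrative only: Nameprep, then Punycode, then the ACE prefix.

    Assumes the label holds at least one non-ASCII character; the full
    ToASCII operation in section 4.1 adds further checks.
    """
    prepared = rfc3490.nameprep(label)     # Nameprep, a Stringprep profile
    encoded = prepared.encode("punycode")  # Punycode, via the stdlib codec
    return ACE_PREFIX + encoded


manual = nameprep_then_punycode("Bücher")
builtin = "Bücher".encode("idna")          # the stdlib codec applies ToASCII
print(manual, builtin)
```

Both values should be the same ASCII label, which shows that the
built-in codec applies the two required stages in the stated order.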
 233
 234 2. Terminology
 235
 236    The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
 237    and "MAY" in this document are to be interpreted as described in BCP
 238    14, RFC 2119 [RFC2119].
 239
 240    A code point is an integer value associated with a character in a
 241    coded character set.
 242
 243    Unicode [UNICODE] is a coded character set containing tens of
 244    thousands of characters.  A single Unicode code point is denoted by
 245    "U+" followed by four to six hexadecimal digits, while a range of
 246    Unicode code points is denoted by two hexadecimal numbers separated
 247    by "..", with no prefixes.
 248
 249    ASCII means US-ASCII [USASCII], a coded character set containing 128
 250    characters associated with code points in the range 0..7F.  Unicode
 251    is an extension of ASCII: it includes all the ASCII characters and
 252    associates them with the same code points.
 253
 254    The term "LDH code points" is defined in this document to mean the
 255    code points associated with ASCII letters, digits, and the hyphen-
 256    minus; that is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an
 257    abbreviation for "letters, digits, hyphen".
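
As a small non-normative sketch, the LDH definition above can be written
as a predicate over code points.  The function names is_ldh and all_ldh
are hypothetical and exist only for illustration.

```python
def is_ldh(code_point: int) -> bool:
    """True for U+002D, 30..39, 41..5A, and 61..7A (hexadecimal)."""
    return (
        code_point == 0x2D                 # hyphen-minus
        or 0x30 <= code_point <= 0x39      # digits
        or 0x41 <= code_point <= 0x5A      # upper-case letters
        or 0x61 <= code_point <= 0x7A      # lower-case letters
    )


def all_ldh(label: str) -> bool:
    """True if every code point in the label is an LDH code point."""
    return all(is_ldh(ord(ch)) for ch in label)


print(all_ldh("www-1"), all_ldh("a_b"))
```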
 258
 259    [STD13] talks about "domain names" and "host names", but many people
 260    use the terms interchangeably.  Further, because [STD13] was not
 261    terribly clear, many people who are sure they know the exact
 262    definitions of each of these terms disagree on the definitions.  In
 263    this document the term "domain name" is used in general.  This
 264    document explicitly cites [STD3] whenever referring to the host name
 265    syntax restrictions defined therein.
 266
 267    A label is an individual part of a domain name.  Labels are usually
 268    shown separated by dots; for example, the domain name
 269    "www.example.com" is composed of three labels: "www", "example", and
 270    "com".  (The zero-length root label described in [STD13], which can
 271    be explicit as in "www.example.com." or implicit as in
 272    "www.example.com", is not considered a label in this specification.)
 273    IDNA extends the set of usable characters in labels that are text.
 274    For the rest of this document, the term "label" is shorthand for
 275    "text label", and "every label" means "every text label".
 276
 287    An "internationalized label" is a label to which the ToASCII
 288    operation (see section 4) can be applied without failing (with the
 289    UseSTD3ASCIIRules flag unset).  This implies that every ASCII label
 290    that satisfies the [STD13] length restriction is an internationalized
 291    label.  Therefore the term "internationalized label" is a
 292    generalization, embracing both old ASCII labels and new non-ASCII
 293    labels.  Although most Unicode characters can appear in
 294    internationalized labels, ToASCII will fail for some input strings,
 295    and such strings are not valid internationalized labels.
 296
 297    An "internationalized domain name" (IDN) is a domain name in which
 298    every label is an internationalized label.  This implies that every
 299    ASCII domain name is an IDN (which implies that it is possible for a
 300    name to be an IDN without it containing any non-ASCII characters).
 301    This document does not attempt to define an "internationalized host
 302    name".  Just as has been the case with ASCII names, some DNS zone
 303    administrators may impose restrictions, beyond those imposed by DNS
 304    or IDNA, on the characters or strings that may be registered as
 305    labels in their zones.  Such restrictions have no impact on the
 306    syntax or semantics of DNS protocol messages; a query for a name that
 307    matches no records will yield the same response regardless of the
 308    reason why it is not in the zone.  Clients issuing queries or
 309    interpreting responses cannot be assumed to have any knowledge of
 310    zone-specific restrictions or conventions.
 311
 312    In IDNA, equivalence of labels is defined in terms of the ToASCII
 313    operation, which constructs an ASCII form for a given label, whether
 314    or not the label was already an ASCII label.  Labels are defined to
 315    be equivalent if and only if their ASCII forms produced by ToASCII
 316    match using a case-insensitive ASCII comparison.  ASCII labels
 317    already have a notion of equivalence: upper case and lower case are
 318    considered equivalent.  The IDNA notion of equivalence is an
 319    extension of that older notion.  Equivalent labels in IDNA are
 320    treated as alternate forms of the same label, just as "foo" and "Foo"
 321    are treated as alternate forms of the same label.
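
The equivalence rule above can be illustrated with a short,
non-normative Python sketch.  It assumes CPython's
"encodings.idna.ToASCII" helper (which, per the Python documentation,
behaves as if UseSTD3ASCIIRules were unset); the function name
labels_equivalent is hypothetical.  Treating a ToASCII failure as
"not equivalent" is a choice made for this sketch.

```python
import encodings.idna as rfc3490  # CPython's RFC 3490 helper module


def labels_equivalent(a: str, b: str) -> bool:
    """Compare the ASCII forms of two labels, ignoring ASCII case."""
    try:
        ascii_a = rfc3490.ToASCII(a)
        ascii_b = rfc3490.ToASCII(b)
    except UnicodeError:
        # ToASCII failed, so at least one input is not a valid
        # internationalized label (sketch-specific handling).
        return False
    return ascii_a.lower() == ascii_b.lower()


print(labels_equivalent("foo", "Foo"))
print(labels_equivalent("Bücher", "xn--bcher-kva"))
```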
 322
 323    To allow internationalized labels to be handled by existing
 324    applications, IDNA uses an "ACE label" (ACE stands for ASCII
 325    Compatible Encoding).  An ACE label is an internationalized label
 326    that can be rendered in ASCII and is equivalent to an
 327    internationalized label that cannot be rendered in ASCII.  Given any
 328    internationalized label that cannot be rendered in ASCII, the ToASCII
 329    operation will convert it to an equivalent ACE label (whereas an
 330    ASCII label will be left unaltered by ToASCII).  ACE labels are
 331    unsuitable for display to users.  The ToUnicode operation will
 332    convert any label to an equivalent non-ACE label.  In fact, an ACE
 333    label is formally defined to be any label that the ToUnicode
 334    operation would alter (whereas non-ACE labels are left unaltered by
 343    ToUnicode).  Every ACE label begins with the ACE prefix specified in
 344    section 5.  The ToASCII and ToUnicode operations are specified in
 345    section 4.
 346
 347    The "ACE prefix" is defined in this document to be a string of ASCII
 348    characters that appears at the beginning of every ACE label.  It is
 349    specified in section 5.
 350
 351    A "domain name slot" is defined in this document to be a protocol
 352    element or a function argument or a return value (and so on)
 353    explicitly designated for carrying a domain name.  Examples of domain
 354    name slots include: the QNAME field of a DNS query; the name argument
 355    of the gethostbyname() library function; the part of an email address
 356    following the at-sign (@) in the From: field of an email message
 357    header; and the host portion of the URI in the src attribute of an
 358    HTML <IMG> tag.  General text that just happens to contain a domain
 359    name is not a domain name slot; for example, a domain name appearing
 360    in the plain text body of an email message is not occupying a domain
 361    name slot.
 362
 363    An "IDN-aware domain name slot" is defined in this document to be a
 364    domain name slot explicitly designated for carrying an
 365    internationalized domain name as defined in this document.  The
 366    designation may be static (for example, in the specification of the
 367    protocol or interface) or dynamic (for example, as a result of
 368    negotiation in an interactive session).
 369
 370    An "IDN-unaware domain name slot" is defined in this document to be
 371    any domain name slot that is not an IDN-aware domain name slot.
 372    Obviously, this includes any domain name slot whose specification
 373    predates IDNA.
 374
 375 3. Requirements and applicability
 376
 377 3.1 Requirements
 378
 379    IDNA conformance means adherence to the following four requirements:
 380
 381    1) Whenever dots are used as label separators, the following
 382       characters MUST be recognized as dots: U+002E (full stop), U+3002
 383       (ideographic full stop), U+FF0E (fullwidth full stop), U+FF61
 384       (halfwidth ideographic full stop).
 385
 386    2) Whenever a domain name is put into an IDN-unaware domain name slot
 387       (see section 2), it MUST contain only ASCII characters.  Given an
 388       internationalized domain name (IDN), an equivalent domain name
 389       satisfying this requirement can be obtained by applying the
 399       ToASCII operation (see section 4) to each label and, if dots are
 400       used as label separators, changing all the label separators to
 401       U+002E.
 402
 403    3) ACE labels obtained from domain name slots SHOULD be hidden from
 404       users when it is known that the environment can handle the non-ACE
 405       form, except when the ACE form is explicitly requested.  When it
 406       is not known whether or not the environment can handle the non-ACE
 407       form, the application MAY use the non-ACE form (which might fail,
 408       such as by not being displayed properly), or it MAY use the ACE
  409       form (which will look unintelligible to the user).  Given an
 410       internationalized domain name, an equivalent domain name
 411       containing no ACE labels can be obtained by applying the ToUnicode
 412       operation (see section 4) to each label.  When requirements 2 and
 413       3 both apply, requirement 2 takes precedence.
 414
 415    4) Whenever two labels are compared, they MUST be considered to match
 416       if and only if they are equivalent, that is, their ASCII forms
 417       (obtained by applying ToASCII) match using a case-insensitive
 418       ASCII comparison.  Whenever two names are compared, they MUST be
 419       considered to match if and only if their corresponding labels
 420       match, regardless of whether the names use the same forms of label
 421       separators.
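
Requirements 1 and 4 can be sketched together in Python.  This is a
non-normative illustration under stated assumptions: split_labels and
names_match are hypothetical names, CPython's "encodings.idna.ToASCII"
stands in for the ToASCII operation, and dropping a trailing empty
piece (an explicit root) before comparison is a choice made for this
sketch rather than something the requirements spell out.

```python
import re
import encodings.idna as rfc3490  # CPython's RFC 3490 helper module

# The four code points that requirement 1 says MUST be recognized as dots.
DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def split_labels(name: str) -> list:
    """Split a domain name into labels on any of the four dots."""
    labels = DOTS.split(name)
    if labels and labels[-1] == "":
        labels.pop()  # explicit root ("example.com."); sketch assumption
    return labels


def names_match(a: str, b: str) -> bool:
    """Requirement 4: names match iff corresponding labels are equivalent."""
    labels_a, labels_b = split_labels(a), split_labels(b)
    if len(labels_a) != len(labels_b):
        return False
    try:
        return all(
            rfc3490.ToASCII(x).lower() == rfc3490.ToASCII(y).lower()
            for x, y in zip(labels_a, labels_b)
        )
    except UnicodeError:
        return False  # a label for which ToASCII fails cannot match


print(split_labels("www\u3002example\uFF0Ecom"))
print(names_match("Bücher\u3002example", "xn--bcher-kva.EXAMPLE"))
```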
 422
 423 3.2 Applicability
 424
 425    IDNA is applicable to all domain names in all domain name slots
 426    except where it is explicitly excluded.
 427
 428    This implies that IDNA is applicable to many protocols that predate
 429    IDNA.  Note that IDNs occupying domain name slots in those protocols
 430    MUST be in ASCII form (see section 3.1, requirement 2).
 431
 432 3.2.1. DNS resource records
 433
 434    IDNA does not apply to domain names in the NAME and RDATA fields of
 435    DNS resource records whose CLASS is not IN.  This exclusion applies
 436    to every non-IN class, present and future, except where future
 437    standards override this exclusion by explicitly inviting the use of
 438    IDNA.
 439
 440    There are currently no other exclusions on the applicability of IDNA
 441    to DNS resource records; it depends entirely on the CLASS, and not on
 442    the TYPE.  This will remain true, even as new types are defined,
 443    unless there is a compelling reason for a new type to complicate
 444    matters by imposing type-specific rules.
 445
 455 3.2.2. Non-domain-name data types stored in domain names
 456
 457    Although IDNA enables the representation of non-ASCII characters in
 458    domain names, that does not imply that IDNA enables the
 459    representation of non-ASCII characters in other data types that are
 460    stored in domain names.  For example, an email address local part is
 461    sometimes stored in a domain label (hostmaster@example.com would be
 462    represented as hostmaster.example.com in the RDATA field of an SOA
 463    record).  IDNA does not update the existing email standards, which
 464    allow only ASCII characters in local parts.  Therefore, unless the
 465    email standards are revised to invite the use of IDNA for local
 466    parts, a domain label that holds the local part of an email address
 467    SHOULD NOT begin with the ACE prefix, and even if it does, it is to
 468    be interpreted literally as a local part that happens to begin with
 469    the ACE prefix.
 470
 471 4. Conversion operations
 472
  473    An application converts a domain name before it is put into an
  474    IDN-unaware slot or displayed to a user.  This section specifies the
  475    conversion steps and the ToASCII and ToUnicode operations.
 476
 477    The input to ToASCII or ToUnicode is a single label that is a
 478    sequence of Unicode code points (remember that all ASCII code points
 479    are also Unicode code points).  If a domain name is represented using
 480    a character set other than Unicode or US-ASCII, it will first need to
 481    be transcoded to Unicode.
 482
 483    Starting from a whole domain name, the steps that an application
 484    takes to do the conversions are:
 485
 486    1) Decide whether the domain name is a "stored string" or a "query
 487       string" as described in [STRINGPREP].  If this conversion follows
 488       the "queries" rule from [STRINGPREP], set the flag called
 489       "AllowUnassigned".
 490
 491    2) Split the domain name into individual labels as described in
 492       section 3.1.  The labels do not include the separator.
 493
 494    3) For each label, decide whether or not to enforce the restrictions
 495       on ASCII characters in host names [STD3].  (Applications already
 496       faced this choice before the introduction of IDNA, and can
 497       continue to make the decision the same way they always have; IDNA
 498       makes no new recommendations regarding this choice.)  If the
 499       restrictions are to be enforced, set the flag called
 500       "UseSTD3ASCIIRules" for that label.
 501
 511    4) Process each label with either the ToASCII or the ToUnicode
 512       operation as appropriate.  Typically, you use the ToASCII
 513       operation if you are about to put the name into an IDN-unaware
 514       slot, and you use the ToUnicode operation if you are displaying
 515       the name to a user; section 3.1 gives greater detail on the
 516       applicable requirements.
 517
 518    5) If ToASCII was applied in step 4 and dots are used as label
 519       separators, change all the label separators to U+002E (full stop).
 520
 521    The following two subsections define the ToASCII and ToUnicode
 522    operations that are used in step 4.
 523
 524    This description of the protocol uses specific procedure names, names
 525    of flags, and so on, in order to facilitate the specification of the
 526    protocol.  These names, as well as the actual steps of the
 527    procedures, are not required of an implementation.  In fact, any
 528    implementation which has the same external behavior as specified in
 529    this document conforms to this specification.
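
The whole-name procedure above (steps 2, 4, and 5) might be sketched as
follows.  This is non-normative and rests on several assumptions: the
name convert_domain_name and its for_display parameter are hypothetical;
CPython's "encodings.idna" helpers are used for the per-label
operations, and the Python documentation states that they assume
AllowUnassigned is true and UseSTD3ASCIIRules is false, so steps 1 and 3
are fixed rather than selectable here.  CPython's ToUnicode helper can
raise an exception, so the sketch catches it and returns the original
label, matching the "never fails" behavior specified in section 4.2.

```python
import re
import encodings.idna as rfc3490  # CPython's RFC 3490 helper module

# Capture group keeps the separators so they can be preserved for display.
SEPARATOR = re.compile("([\u002E\u3002\uFF0E\uFF61])")


def _to_unicode_label(label: str) -> str:
    try:
        return rfc3490.ToUnicode(label)
    except UnicodeError:
        return label  # ToUnicode never fails: fall back to the input


def convert_domain_name(name: str, for_display: bool) -> str:
    pieces = SEPARATOR.split(name)  # step 2: labels at even indices
    converted = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # Step 5: after ToASCII, every separator becomes U+002E.
            converted.append(piece if for_display else "\u002E")
        elif piece == "" and index == len(pieces) - 1:
            converted.append("")  # nothing after a trailing dot
        elif for_display:
            converted.append(_to_unicode_label(piece))  # step 4
        else:
            # Step 4; a UnicodeError here means ToASCII failed.
            converted.append(rfc3490.ToASCII(piece).decode("ascii"))
    return "".join(converted)


print(convert_domain_name("Bücher\u3002example.com", for_display=False))
print(convert_domain_name("xn--bcher-kva.example.com", for_display=True))
```

A caller of this sketch has to handle the exception raised when ToASCII
fails on any label, since such a name cannot be used as an IDN.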
 530
 531 4.1 ToASCII
 532
 533    The ToASCII operation takes a sequence of Unicode code points that
 534    make up one label and transforms it into a sequence of code points in
 535    the ASCII range (0..7F).  If ToASCII succeeds, the original sequence
 536    and the resulting sequence are equivalent labels.
 537
 538    It is important to note that the ToASCII operation can fail.  ToASCII
 539    fails if any step of it fails.  If any step of the ToASCII operation
 540    fails on any label in a domain name, that domain name MUST NOT be
 541    used as an internationalized domain name.  The method for dealing
 542    with this failure is application-specific.
 543
 544    The inputs to ToASCII are a sequence of code points, the
 545    AllowUnassigned flag, and the UseSTD3ASCIIRules flag.  The output of
 546    ToASCII is either a sequence of ASCII code points or a failure
 547    condition.
 548
 549    ToASCII never alters a sequence of code points that are all in the
 550    ASCII range to begin with (although it could fail).  Applying the
 551    ToASCII operation multiple times has exactly the same effect as
 552    applying it just once.
 553
 554    ToASCII consists of the following steps:
 555
 556    1. If the sequence contains any code points outside the ASCII range
 557       (0..7F) then proceed to step 2, otherwise skip to step 3.
 558
 567    2. Perform the steps specified in [NAMEPREP] and fail if there is an
 568       error.  The AllowUnassigned flag is used in [NAMEPREP].
 569
 570    3. If the UseSTD3ASCIIRules flag is set, then perform these checks:
 571
 572      (a) Verify the absence of non-LDH ASCII code points; that is, the
 573          absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.
 574
 575      (b) Verify the absence of leading and trailing hyphen-minus; that
 576          is, the absence of U+002D at the beginning and end of the
 577          sequence.
 578
 579    4. If the sequence contains any code points outside the ASCII range
 580       (0..7F) then proceed to step 5, otherwise skip to step 8.
 581
 582    5. Verify that the sequence does NOT begin with the ACE prefix.
 583
 584    6. Encode the sequence using the encoding algorithm in [PUNYCODE] and
 585       fail if there is an error.
 586
 587    7. Prepend the ACE prefix.
 588
 589    8. Verify that the number of code points is in the range 1 to 63
 590       inclusive.
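
The eight steps above can be followed one by one in a non-normative
Python sketch.  The names to_ascii, ToASCIIError, and
use_std3_ascii_rules are hypothetical.  Nameprep is delegated to
CPython's "encodings.idna.nameprep", which the Python documentation
describes as assuming AllowUnassigned is true, so that flag is not
selectable in this sketch; Punycode encoding uses the standard
"punycode" codec.

```python
import encodings.idna as rfc3490  # used only for nameprep()

ACE_PREFIX = "xn--"


class ToASCIIError(ValueError):
    """Raised when any step of the sketch fails."""


def _is_ascii(text: str) -> bool:
    return all(ord(ch) <= 0x7F for ch in text)


def _is_ldh(cp: int) -> bool:
    return (cp == 0x2D or 0x30 <= cp <= 0x39
            or 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A)


def to_ascii(label: str, use_std3_ascii_rules: bool = False) -> str:
    # Steps 1 and 2: Nameprep only if a non-ASCII code point is present.
    if not _is_ascii(label):
        try:
            label = rfc3490.nameprep(label)
        except UnicodeError as exc:
            raise ToASCIIError("Nameprep failed") from exc
    # Step 3: optional host name checks from [STD3].
    if use_std3_ascii_rules:
        if any(ord(ch) <= 0x7F and not _is_ldh(ord(ch)) for ch in label):
            raise ToASCIIError("non-LDH ASCII code point")
        if label.startswith("-") or label.endswith("-"):
            raise ToASCIIError("leading or trailing hyphen-minus")
    # Step 4: an all-ASCII sequence skips to step 8.
    if not _is_ascii(label):
        # Step 5: the sequence must not already begin with the ACE prefix.
        if label[:len(ACE_PREFIX)].lower() == ACE_PREFIX:
            raise ToASCIIError("label begins with the ACE prefix")
        # Step 6: Punycode encoding.
        try:
            encoded = label.encode("punycode").decode("ascii")
        except UnicodeError as exc:
            raise ToASCIIError("Punycode encoding failed") from exc
        # Step 7: prepend the ACE prefix.
        label = ACE_PREFIX + encoded
    # Step 8: length check.
    if not 1 <= len(label) <= 63:
        raise ToASCIIError("length outside 1..63")
    return label


print(to_ascii("Bücher"), to_ascii("Example"))
```

Applying to_ascii to its own output returns the same string, which
mirrors the statement that applying ToASCII multiple times has the same
effect as applying it once.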
 591
 592 4.2 ToUnicode
 593
 594    The ToUnicode operation takes a sequence of Unicode code points that
 595    make up one label and returns a sequence of Unicode code points.  If
 596    the input sequence is a label in ACE form, then the result is an
 597    equivalent internationalized label that is not in ACE form, otherwise
 598    the original sequence is returned unaltered.
 599
 600    ToUnicode never fails.  If any step fails, then the original input
 601    sequence is returned immediately in that step.
 602
 603    The ToUnicode output never contains more code points than its input.
 604    Note that the number of octets needed to represent a sequence of code
 605    points depends on the particular character encoding used.
 606
 607    The inputs to ToUnicode are a sequence of code points, the
 608    AllowUnassigned flag, and the UseSTD3ASCIIRules flag.  The output of
 609    ToUnicode is always a sequence of Unicode code points.
 610
 611    1. If all code points in the sequence are in the ASCII range (0..7F)
 612       then skip to step 3.
 613
 623    2. Perform the steps specified in [NAMEPREP] and fail if there is an
 624       error.  (If step 3 of ToASCII is also performed here, it will not
 625       affect the overall behavior of ToUnicode, but it is not
 626       necessary.)  The AllowUnassigned flag is used in [NAMEPREP].
 627
 628    3. Verify that the sequence begins with the ACE prefix, and save a
 629       copy of the sequence.
 630
 631    4. Remove the ACE prefix.
 632
 633    5. Decode the sequence using the decoding algorithm in [PUNYCODE] and
 634       fail if there is an error.  Save a copy of the result of this
 635       step.
 636
 637    6. Apply ToASCII.
 638
 639    7. Verify that the result of step 6 matches the saved copy from step
 640       3, using a case-insensitive ASCII comparison.
 641
 642    8. Return the saved copy from step 5.
 643
 644 5. ACE prefix
 645
 646    The ACE prefix, used in the conversion operations (section 4), is two
 647    alphanumeric ASCII characters followed by two hyphen-minuses.  It
 648    cannot be any of the prefixes already used in earlier documents,
 649    which includes the following: "bl--", "bq--", "dq--", "lq--", "mq--",
 650    "ra--", "wq--" and "zq--".  The ToASCII and ToUnicode operations MUST
 651    recognize the ACE prefix in a case-insensitive manner.
 652
 653    The ACE prefix for IDNA is "xn--" or any capitalization thereof.
 654
 655    This means that an ACE label might be "xn--de-jg4avhby1noc0d", where
 656    "de-jg4avhby1noc0d" is the part of the ACE label that is generated by
 657    the encoding steps in [PUNYCODE].
 658
 659    While all ACE labels begin with the ACE prefix, not all labels
 660    beginning with the ACE prefix are necessarily ACE labels.  Non-ACE
 661    labels that begin with the ACE prefix will confuse users and SHOULD
 662    NOT be allowed in DNS zones.
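
        As a small illustration of the case-insensitive recognition
        required above, a prefix test might look like the following
        Python sketch.  The helper name has_ace_prefix is hypothetical.
        A true result only means that the label begins with the prefix;
        as noted above, it does not by itself show that the label is a
        valid ACE label.

```python
ACE_PREFIX = "xn--"


def has_ace_prefix(label: str) -> bool:
    """Report whether a label begins with the ACE prefix, in any case."""
    return label[:len(ACE_PREFIX)].lower() == ACE_PREFIX


print(has_ace_prefix("xn--de-jg4avhby1noc0d"))  # True
print(has_ace_prefix("XN--de-jg4avhby1noc0d"))  # True
print(has_ace_prefix("bq--example"))            # False (an earlier prefix)
```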
 663
 679 6. Implications for typical applications using DNS
 680
 681    In IDNA, applications perform the processing needed to input
 682    internationalized domain names from users, display internationalized
 683    domain names to users, and process the inputs and outputs from DNS
 684    and other protocols that carry domain names.
 685
 686    The components and interfaces between them can be represented
 687    pictorially as:
 688
 689                     +------+
 690                     | User |
 691                     +------+
 692                        ^
 693                        | Input and display: local interface methods
 694                        | (pen, keyboard, glowing phosphorus, ...)
 695    +-------------------|-------------------------------+
 696    |                   v                               |
 697    |          +-----------------------------+          |
 698    |          |        Application          |          |
 699    |          |   (ToASCII and ToUnicode    |          |
 700    |          |      operations may be      |          |
 701    |          |        called here)         |          |
 702    |          +-----------------------------+          |
 703    |                   ^        ^                      | End system
 704    |                   |        |                      |
 705    | Call to resolver: |        | Application-specific |
 706    |              ACE  |        | protocol:            |
 707    |                   v        | ACE unless the       |
 708    |           +----------+     | protocol is updated  |
 709    |           | Resolver |     | to handle other      |
 710    |           +----------+     | encodings            |
 711    |                 ^          |                      |
 712    +-----------------|----------|----------------------+
 713        DNS protocol: |          |
 714                  ACE |          |
 715                      v          v
 716           +-------------+    +---------------------+
 717           | DNS servers |    | Application servers |
 718           +-------------+    +---------------------+
 719
 720    The box labeled "Application" is where the application splits a
 721    domain name into labels, sets the appropriate flags, and performs the
 722    ToASCII and ToUnicode operations.  This is described in section 4.
 723
 735 6.1 Entry and display in applications
 736
 737    Applications can accept domain names using any character set or sets
 738    desired by the application developer, and can display domain names in
 739    any charset.  That is, the IDNA protocol does not affect the
 740    interface between users and applications.
 741
 742    An IDNA-aware application can accept and display internationalized
 743    domain names in two formats: the internationalized character set(s)
 744    supported by the application, and as an ACE label.  ACE labels that
 745    are displayed or input MUST always include the ACE prefix.
 746    Applications MAY allow input and display of ACE labels, but are not
 747    encouraged to do so except as an interface for special purposes,
 748    possibly for debugging, or to cope with display limitations as
 749    described in section 6.4.  ACE encoding is opaque and ugly, and
 750    should thus only be exposed to users who absolutely need it.  Because
 751    name labels encoded as ACE name labels can be rendered either as the
 752    encoded ASCII characters or the proper decoded characters, the
 753    application MAY have an option for the user to select the preferred
 754    method of display; if it does, rendering the ACE SHOULD NOT be the
 755    default.
 756
 757    Domain names are often stored and transported in many places.  For
 758    example, they are part of documents such as mail messages and web
 759    pages.  They are transported in many parts of many protocols, such as
 760    both the control commands and the RFC 2822 body parts of SMTP, and
 761    the headers and the body content in HTTP.  It is important to
 762    remember that domain names appear both in domain name slots and in
 763    the content that is passed over protocols.
 764
 765    In protocols and document formats that define how to handle
 766    specification or negotiation of charsets, labels can be encoded in
 767    any charset allowed by the protocol or document format.  If a
 768    protocol or document format only allows one charset, the labels MUST
 769    be given in that charset.
 770
 771    In any place where a protocol or document format allows transmission
 772    of the characters in internationalized labels, internationalized
 773    labels SHOULD be transmitted using whatever character encoding and
 774    escape mechanism that the protocol or document format uses at that
 775    place.
 776
 777    All protocols that use domain name slots already have the capacity
 778    for handling domain names in the ASCII charset.  Thus, ACE labels
 779    (internationalized labels that have been processed with the ToASCII
 780    operation) can inherently be handled by those protocols.
 781
 791 6.2 Applications and resolver libraries
 792
 793    Applications normally use functions in the operating system when they
 794    resolve DNS queries.  Those functions in the operating system are
 795    often called "the resolver library", and the applications communicate
 796    with the resolver libraries through a programming interface (API).
 797
 798    Because these resolver libraries today expect only domain names in
 799    ASCII, applications MUST prepare labels that are passed to the
 800    resolver library using the ToASCII operation.  Labels received from
 801    the resolver library contain only ASCII characters; internationalized
 802    labels that cannot be represented directly in ASCII use the ACE form.
 803    ACE labels always include the ACE prefix.
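
        A minimal Python sketch of this preparation step follows.  It
        assumes the domain name is already a Unicode string whose labels
        are separated by U+002E (full stop) only, and it uses the ToASCII
        function from the standard library module encodings.idna; the
        function name prepare_for_resolver is hypothetical.

```python
from encodings.idna import ToASCII


def prepare_for_resolver(domain: str) -> str:
    """Convert each label with ToASCII before the name reaches a resolver."""
    labels = domain.split(".")
    # A single trailing dot denotes the root; keep it out of ToASCII,
    # which rejects empty labels.
    trailing_dot = len(labels) > 1 and labels[-1] == ""
    if trailing_dot:
        labels = labels[:-1]
    ascii_labels = [ToASCII(label).decode("ascii") for label in labels]
    return ".".join(ascii_labels) + ("." if trailing_dot else "")


print(prepare_for_resolver("bücher.example"))    # xn--bcher-kva.example
print(prepare_for_resolver("www.example.com."))  # www.example.com.
```

        The returned string contains only ASCII and could then be handed
        to an existing interface such as socket.getaddrinfo.  ToASCII
        raises UnicodeError for a label it cannot convert, so the caller
        decides how to report the failure.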
 804
 805    An operating system might have a set of libraries for performing the
 806    ToASCII operation.  The input to such a library might be in one or
 807    more charsets that are used in applications (UTF-8 and UTF-16 are
 808    likely candidates for almost any operating system, and script-
 809    specific charsets are likely for localized operating systems).
 810
 811    IDNA-aware applications MUST be able to work with both non-
 812    internationalized labels (those that conform to [STD13] and [STD3])
 813    and internationalized labels.
 814
 815    It is expected that new versions of the resolver libraries in the
 816    future will be able to accept domain names in other charsets than
 817    ASCII, and application developers might one day pass not only domain
 818    names in Unicode, but also in local script to a new API for the
 819    resolver libraries in the operating system.  Thus the ToASCII and
 820    ToUnicode operations might be performed inside these new versions of
 821    the resolver libraries.
 822
 823    Domain names passed to resolvers or put into the question section of
 824    DNS requests follow the rules for "queries" from [STRINGPREP].
 825
 826 6.3 DNS servers
 827
 828    Domain names stored in zones follow the rules for "stored strings"
 829    from [STRINGPREP].
 830
 831    For internationalized labels that cannot be represented directly in
 832    ASCII, DNS servers MUST use the ACE form produced by the ToASCII
 833    operation.  All IDNs served by DNS servers MUST contain only ASCII
 834    characters.
 835
 836    If a signaling system which makes negotiation possible between old
 837    and new DNS clients and servers is standardized in the future, the
 838    encoding of the query in the DNS protocol itself can be changed from
 847    ACE to something else, such as UTF-8.  The question whether or not
 848    this should be used is, however, a separate problem and is not
 849    discussed in this memo.
 850
 851 6.4 Avoiding exposing users to the raw ACE encoding
 852
 853    Any application that might show the user a domain name obtained from
 854    a domain name slot, such as from gethostbyaddr or part of a mail
 855    header, will need to be updated if it is to prevent users from seeing
 856    the ACE.
 857
 858    If an application decodes an ACE name using ToUnicode but cannot show
 859    all of the characters in the decoded name, such as if the name
 860    contains characters that the output system cannot display, the
 861    application SHOULD show the name in ACE format (which always includes
 862    the ACE prefix) instead of displaying the name with the replacement
 863    character (U+FFFD).  This is to make it easier for the user to
 864    transfer the name correctly to other programs.  Programs that by
 865    default show the ACE form when they cannot show all the characters in
 866    a name label SHOULD also have a mechanism to show the name that is
 867    produced by the ToUnicode operation with as many characters as
 868    possible and replacement characters in the positions where characters
 869    cannot be displayed.
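
        The fallback described above might be sketched in Python as
        follows.  The function name label_for_display and the
        "displayable" predicate are hypothetical; how an application
        decides whether its output system can render a character is
        outside the scope of this sketch.  ToUnicode from
        encodings.idna raises an error instead of returning its input
        on failure, so the sketch catches that error itself.

```python
from encodings.idna import ToUnicode


def label_for_display(label: str, displayable) -> str:
    """Show the decoded label only if every character can be displayed."""
    try:
        decoded = ToUnicode(label)
    except UnicodeError:
        # Not a valid ACE label: show it unaltered.
        return label
    if all(displayable(ch) for ch in decoded):
        return decoded
    # Fall back to the ACE form rather than using U+FFFD replacements.
    return label


def ascii_only(ch: str) -> bool:
    """Stand-in for an output system that can only render ASCII."""
    return ord(ch) < 0x80


print(label_for_display("xn--bcher-kva", lambda ch: True))  # bücher
print(label_for_display("xn--bcher-kva", ascii_only))       # xn--bcher-kva
```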
 870
 871    The ToUnicode operation does not alter labels that are not valid ACE
 872    labels, even if they begin with the ACE prefix.  After ToUnicode has
 873    been applied, if a label still begins with the ACE prefix, then it is
 874    not a valid ACE label, and is not equivalent to any of the
 875    intermediate Unicode strings constructed by ToUnicode.
 876
 877 6.5 DNSSEC authentication of IDN domain names
 878
 879    DNS Security [RFC2535] is a method for supplying cryptographic
 880    verification information along with DNS messages.  Public Key
 881    Cryptography is used in conjunction with digital signatures to
 882    provide a means for a requester of domain information to authenticate
 883    the source of the data.  This ensures that it can be traced back to a
 884    trusted source, either directly, or via a chain of trust linking the
 885    source of the information to the top of the DNS hierarchy.
 886
 887    IDNA specifies that all internationalized domain names served by DNS
 888    servers that cannot be represented directly in ASCII must use the ACE
 889    form produced by the ToASCII operation.  This operation must be
 890    performed prior to a zone being signed by the private key for that
 891    zone.  Because of this ordering, it is important to recognize that
 892    DNSSEC authenticates the ASCII domain name, not the Unicode form or
 903    the mapping between the Unicode form and the ASCII form.  In the
 904    presence of DNSSEC, this is the name that MUST be signed in the zone
 905    and MUST be validated against.
 906
 907    One consequence of this for sites deploying IDNA in the presence of
 908    DNSSEC is that any special purpose proxies or forwarders used to
 909    transform user input into IDNs must be earlier in the resolution flow
 910    than DNSSEC authenticating nameservers for DNSSEC to work.
 911
 912 7. Name server considerations
 913
 914    Existing DNS servers do not know the IDNA rules for handling non-
 915    ASCII forms of IDNs, and therefore need to be shielded from them.
 916    All existing channels through which names can enter a DNS server
 917    database (for example, master files [STD13] and DNS update messages
 918    [RFC2136]) are IDN-unaware because they predate IDNA, and therefore
 919    requirement 2 of section 3.1 of this document provides the needed
 920    shielding, by ensuring that internationalized domain names entering
 921    DNS server databases through such channels have already been
 922    converted to their equivalent ASCII forms.
 923
 924    It is imperative that there be only one ASCII encoding for a
 925    particular domain name.  Because of the design of the ToASCII and
 926    ToUnicode operations, there are no ACE labels that decode to ASCII
 927    labels, and therefore name servers cannot contain multiple ASCII
 928    encodings of the same domain name.
 929
 930    [RFC2181] explicitly allows domain labels to contain octets beyond
 931    the ASCII range (0..7F), and this document does not change that.
 932    Note, however, that there is no defined interpretation of octets
 933    80..FF as characters.  If labels containing these octets are returned
 934    to applications, unpredictable behavior could result.  The ASCII form
 935    defined by ToASCII is the only standard representation for
 936    internationalized labels in the current DNS protocol.
 937
 938 8. Root server considerations
 939
 940    IDNs are likely to be somewhat longer than current domain names, so
 941    the bandwidth needed by the root servers is likely to go up by a
 942    small amount.  Also, queries and responses for IDNs will probably be
 943    somewhat longer than typical queries today, so more queries and
 944    responses may be forced to go to TCP instead of UDP.
 945
 959 9. References
 960
 961 9.1 Normative References
 962
 963    [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
 964                 Requirement Levels", BCP 14, RFC 2119, March 1997.
 965
 966    [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of
 967                 Internationalized Strings ("stringprep")", RFC 3454,
 968                 December 2002.
 969
 970    [NAMEPREP]   Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
 971                 Profile for Internationalized Domain Names (IDN)", RFC
 972                 3491, March 2003.
 973
 974    [PUNYCODE]   Costello, A., "Punycode: A Bootstring encoding of
 975                 Unicode for use with Internationalized Domain Names in
 976                 Applications (IDNA)", RFC 3492, March 2003.
 977
 978    [STD3]       Braden, R., "Requirements for Internet Hosts --
 979                 Communication Layers", STD 3, RFC 1122, and
 980                 "Requirements for Internet Hosts -- Application and
 981                 Support", STD 3, RFC 1123, October 1989.
 982
 983    [STD13]      Mockapetris, P., "Domain names - concepts and
 984                 facilities", STD 13, RFC 1034 and "Domain names -
 985                 implementation and specification", STD 13, RFC 1035,
 986                 November 1987.
 987
 988 9.2 Informative References
 989
 990    [RFC2535]    Eastlake, D., "Domain Name System Security Extensions",
 991                 RFC 2535, March 1999.
 992
 993    [RFC2181]    Elz, R. and R. Bush, "Clarifications to the DNS
 994                 Specification", RFC 2181, July 1997.
 995
 996    [UAX9]       Unicode Standard Annex #9, The Bidirectional Algorithm,
 997                 <http://www.unicode.org/unicode/reports/tr9/>.
 998
 999    [UNICODE]    The Unicode Consortium. The Unicode Standard, Version
1000                 3.2.0 is defined by The Unicode Standard, Version 3.0
1001                 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5),
1002                 as amended by the Unicode Standard Annex #27: Unicode
1003                 3.1 (http://www.unicode.org/reports/tr27/) and by the
1004                 Unicode Standard Annex #28: Unicode 3.2
1005                 (http://www.unicode.org/reports/tr28/).
1006
1015    [USASCII]    Cerf, V., "ASCII format for Network Interchange", RFC
1016                 20, October 1969.
1017
1018 10. Security Considerations
1019
1020    Security on the Internet partly relies on the DNS.  Thus, any change
1021    to the characteristics of the DNS can change the security of much of
1022    the Internet.
1023
1024    This memo describes an algorithm which encodes characters that are
1025    not valid according to STD3 and STD13 into octet values that are
1026    valid.  No security issues such as string length increases or new
1027    allowed values are introduced by the encoding process or the use of
1028    these encoded values, apart from those introduced by the ACE encoding
1029    itself.
1030
1031    Domain names are used by users to identify and connect to Internet
1032    servers.  The security of the Internet is compromised if a user
1033    entering a single internationalized name is connected to different
1034    servers based on different interpretations of the internationalized
1035    domain name.
1036
1037    When systems use local character sets other than ASCII and Unicode,
1038    this specification leaves the problem of transcoding between the
1039    local character set and Unicode up to the application.  If different
1040    applications (or different versions of one application) implement
1041    different transcoding rules, they could interpret the same name
1042    differently and contact different servers.  This problem is not
1043    solved by security protocols like TLS that do not take local
1044    character sets into account.
1045
1046    Because this document normatively refers to [NAMEPREP], [PUNYCODE],
1047    and [STRINGPREP], it includes the security considerations from those
1048    documents as well.
1049
1050    If or when this specification is updated to use a more recent Unicode
1051    normalization table, the new normalization table will need to be
1052    compared with the old to spot backwards incompatible changes.  If
1053    there are such changes, they will need to be handled somehow, or
1054    there will be security as well as operational implications.  Methods
1055    to handle the conflicts could include keeping the old normalization,
1056    or taking care of the conflicting characters by operational means, or
1057    some other method.
1058
1059    Implementations MUST NOT use more recent normalization tables than
1060    the one referenced from this document, even though more recent tables
1061    may be provided by operating systems.  If an application is unsure of
1062    which version of the normalization tables is in the operating
1071    system, the application needs to include the normalization tables
1072    itself.  Using normalization tables other than the one referenced
1073    from this specification could have security and operational
1074    implications.
1075
1076    To help prevent confusion between characters that are visually
1077    similar, it is suggested that implementations provide visual
1078    indications where a domain name contains multiple scripts.  Such
1079    mechanisms can also be used to show when a name contains a mixture of
1080    simplified and traditional Chinese characters, or to distinguish zero
1081    and one from O and l.  DNS zone administrators may impose restrictions
1082    (subject to the limitations in section 2) that try to minimize
1083    homographs.
1084
1085    Domain names (or portions of them) are sometimes compared against a
1086    set of privileged or anti-privileged domains.  In such situations it
1087    is especially important that the comparisons be done properly, as
1088    specified in section 3.1 requirement 4.  For labels already in ASCII
1089    form, the proper comparison reduces to the same case-insensitive
1090    ASCII comparison that has always been used for ASCII labels.
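
        One way to perform such a comparison is to convert both labels
        with ToASCII and then compare the results without regard to
        ASCII case, as in the Python sketch below.  It uses ToASCII from
        the standard library module encodings.idna; the function name
        labels_equivalent is hypothetical.

```python
from encodings.idna import ToASCII


def labels_equivalent(a: str, b: str) -> bool:
    """Compare two labels by their ToASCII forms, ignoring ASCII case."""
    # ToASCII raises UnicodeError if either label cannot be converted.
    return ToASCII(a).lower() == ToASCII(b).lower()


print(labels_equivalent("Bücher", "xn--bcher-kva"))  # True
print(labels_equivalent("EXAMPLE", "example"))       # True
print(labels_equivalent("example", "examp1e"))       # False
```

        Comparing the raw Unicode strings instead would treat "Bücher"
        and "xn--bcher-kva" as different, which is the kind of mismatch
        a privileged-domain check needs to avoid.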
1091
1092    The introduction of IDNA means that any existing labels that start
1093    with the ACE prefix and would be altered by ToUnicode will
1094    automatically be ACE labels, and will be considered equivalent to
1095    non-ASCII labels, whether or not that was the intent of the zone
1096    administrator or registrant.
1097
1098 11. IANA Considerations
1099
1100    IANA has assigned the ACE prefix in consultation with the IESG.
1101
1127 12. Authors' Addresses
1128
1129    Patrik Faltstrom
1130    Cisco Systems
1131    Arstaangsvagen 31 J
1132    S-117 43 Stockholm  Sweden
1133
1134    EMail: paf@cisco.com
1135
1136
1137    Paul Hoffman
1138    Internet Mail Consortium and VPN Consortium
1139    127 Segre Place
1140    Santa Cruz, CA  95060  USA
1141
1142    EMail: phoffman@imc.org
1143
1144
1145    Adam M. Costello
1146    University of California, Berkeley
1147
1148    URL: http://www.nicemice.net/amc/
1149
1183 13. Full Copyright Statement
1184
1185    Copyright (C) The Internet Society (2003).  All Rights Reserved.
1186
1187    This document and translations of it may be copied and furnished to
1188    others, and derivative works that comment on or otherwise explain it
1189    or assist in its implementation may be prepared, copied, published
1190    and distributed, in whole or in part, without restriction of any
1191    kind, provided that the above copyright notice and this paragraph are
1192    included on all such copies and derivative works.  However, this
1193    document itself may not be modified in any way, such as by removing
1194    the copyright notice or references to the Internet Society or other
1195    Internet organizations, except as needed for the purpose of
1196    developing Internet standards in which case the procedures for
1197    copyrights defined in the Internet Standards process must be
1198    followed, or as required to translate it into languages other than
1199    English.
1200
1201    The limited permissions granted above are perpetual and will not be
1202    revoked by the Internet Society or its successors or assigns.
1203
1204    This document and the information contained herein is provided on an
1205    "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
1206    TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
1207    BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
1208    HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
1209    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1210
1211 Acknowledgement
1212
1213    Funding for the RFC Editor function is currently provided by the
1214    Internet Society.
1215