5 Network Working Group J. Klensin
7 Expires: October 14, 2006 P. Faltstrom
12 Review and Recommendations for Internationalized Domain Names (IDN)
13 draft-iab-idn-nextsteps-05.txt
17 By submitting this Internet-Draft, each author represents that any
18 applicable patent or other IPR claims of which he or she is aware
19 have been or will be disclosed, and any of which he or she becomes
20 aware will be disclosed, in accordance with Section 6 of BCP 79.
22 Internet-Drafts are working documents of the Internet Engineering
23 Task Force (IETF), its areas, and its working groups. Note that
24 other groups may also distribute working documents as Internet-
27 Internet-Drafts are draft documents valid for a maximum of six months
28 and may be updated, replaced, or obsoleted by other documents at any
29 time. It is inappropriate to use Internet-Drafts as reference
30 material or to cite them other than as "work in progress."
32 The list of current Internet-Drafts can be accessed at
33 http://www.ietf.org/ietf/1id-abstracts.txt.
35 The list of Internet-Draft Shadow Directories can be accessed at
36 http://www.ietf.org/shadow.html.
38 This Internet-Draft will expire on October 14, 2006.
42 Copyright (C) The Internet Society (2006).
This note describes issues raised by the deployment and use of
Internationalized Domain Names. It describes problems that arise
both when names are registered and when those names are used in the
DNS. It recommends that the IETF update the IDN-related RFCs,
proposes a framework to be followed in doing so, and summarizes and
identifies some work that is required outside the IETF. In
particular, it proposes that some changes be investigated for the
56 Klensin & Faltstrom Expires October 14, 2006 [Page 1]
58 Internet-Draft IAB -- IDN Next Steps April 2006
61 IDNA standard and its supporting tables, based on experience gained
62 since those standards were completed.
67 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
68 1.1. Status of this Document and its Recommendations . . . . . 4
69 1.2. The IDNA Standard . . . . . . . . . . . . . . . . . . . . 4
70 1.3. Unicode Documents . . . . . . . . . . . . . . . . . . . . 5
71 1.4. Definitions . . . . . . . . . . . . . . . . . . . . . . . 5
72 1.4.1. Language . . . . . . . . . . . . . . . . . . . . . . . 6
73 1.4.2. Script . . . . . . . . . . . . . . . . . . . . . . . . 6
74 1.4.3. Multilingual . . . . . . . . . . . . . . . . . . . . . 6
75 1.4.4. Localization . . . . . . . . . . . . . . . . . . . . . 6
76 1.4.5. Internationalization . . . . . . . . . . . . . . . . . 7
77 1.5. Statements and Guidelines . . . . . . . . . . . . . . . . 7
78 1.5.1. IESG Statement . . . . . . . . . . . . . . . . . . . . 7
79 1.5.2. ICANN statements . . . . . . . . . . . . . . . . . . . 8
80 2. Problems and Issues . . . . . . . . . . . . . . . . . . . . . 10
81 2.1. User conceptions, local character sets, and input
82 issues . . . . . . . . . . . . . . . . . . . . . . . . . . 11
83 2.2. Examples of issues . . . . . . . . . . . . . . . . . . . . 12
84 2.2.1. Language specific character matching . . . . . . . . . 12
85 2.2.2. Multiple scripts . . . . . . . . . . . . . . . . . . . 12
86 2.2.3. Normalization and Character Mappings . . . . . . . . . 13
87 2.2.4. URLs in Printed Form . . . . . . . . . . . . . . . . . 15
88 2.2.5. Bidirectional text . . . . . . . . . . . . . . . . . . 16
89 2.2.6. Confusable Character Issues . . . . . . . . . . . . . 16
90 2.2.7. The IESG Statement and IDNA issues . . . . . . . . . . 17
91 2.2.8. Versions of Unicode . . . . . . . . . . . . . . . . . 18
92 3. Framework for next steps in IDN development . . . . . . . . . 19
93 3.1. Issues within the scope of the IETF . . . . . . . . . . . 19
94 3.1.1. Review of IDNA . . . . . . . . . . . . . . . . . . . . 19
95 3.1.2. Non-DNS and Above-DNS Internationalization
96 Approaches . . . . . . . . . . . . . . . . . . . . . . 20
97 3.1.3. Security issues, certificates, etc. . . . . . . . . . 21
98 3.1.4. Non US-ASCII in local part of email addresses . . . . 22
99 3.1.5. Use of the Unicode Character Set in the IETF . . . . . 22
100 3.2. Issues that fall within the purview of ICANN . . . . . . . 23
101 3.2.1. Dispute resolution . . . . . . . . . . . . . . . . . . 23
102 3.2.2. Policy at registries . . . . . . . . . . . . . . . . . 23
103 3.2.3. IDN TLDs . . . . . . . . . . . . . . . . . . . . . . . 24
104 4. Specific Recommendations for Next Steps . . . . . . . . . . . 24
105 4.1. Reduction of permitted character list . . . . . . . . . . 24
106 4.1.1. Elimination of all non-language characters . . . . . . 25
107 4.1.2. Elimination of word-separation punctuation . . . . . . 25
108 4.2. Updating to new versions of Unicode . . . . . . . . . . . 25
117 4.3. Combining Characters and Character Components . . . . . . 26
118 4.4. Role and Uses of the DNS . . . . . . . . . . . . . . . . . 26
119 4.5. Databases of Registered Names . . . . . . . . . . . . . . 27
120 5. Security Considerations . . . . . . . . . . . . . . . . . . . 27
121 6. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 27
122 7. Change History . . . . . . . . . . . . . . . . . . . . . . . . 28
123 7.1. Changes for version -01 . . . . . . . . . . . . . . . . . 28
124 7.2. Changes for version -02 . . . . . . . . . . . . . . . . . 28
125 7.3. Changes for Version -03 . . . . . . . . . . . . . . . . . 29
126 7.4. Changes for version -04 . . . . . . . . . . . . . . . . . 29
127 7.5. Changes for version -05 . . . . . . . . . . . . . . . . . 29
128 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 29
129 8.1. Normative References . . . . . . . . . . . . . . . . . . . 29
130 8.2. Informative References . . . . . . . . . . . . . . . . . . 30
131 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 34
132 Intellectual Property and Copyright Statements . . . . . . . . . . 35
1. Introduction
175 1.1. Status of this Document and its Recommendations
177 This document reviews the IDN landscape from an IETF perspective and
178 presents the recommendations and conclusions of the IAB, based
179 partially on input from an ad hoc committee charged with reviewing
180 IDN issues and the path forward (See Section 6). Its recommendations
181 are recommendations to the IETF, or in a few cases to other bodies,
182 for topics to be examined and actions to be taken if those bodies,
183 after their examinations, consider those actions appropriate.
185 IMPORTANT: The IAB has not yet reached consensus that this document
186 is ready for final publication. While considerable input from the
187 members of the ad hoc committee went into the document, no claim is
188 made that it represents the consensus of that group. However, the
189 IAB concluded that it was appropriate to expose these versions, as
190 working drafts, for community comment and feedback. Such comments
191 should be sent to iab@iab.org.
193 1.2. The IDNA Standard
195 During 2002 IETF completed the following RFCs that, together, define
198 RFC 3454 Preparation of Internationalized Strings ("Stringprep")
200 Stringprep is a generic mechanism for taking a Unicode string and
201 converting it into a canonical format. Stringprep itself is just
202 a collection of rules, tables, and operations. Any protocol or
203 algorithm that uses it must define a "Stringprep profile", which
204 specifies which of those rules are applied, how, and with which
207 RFC 3490 Internationalizing Domain Names in Applications (IDNA)
209 IDNA is the base specification in this group. It specifies that
210 Nameprep is used as the Stringprep profile for domain names, and
211 that Punycode is the relevant encoding mechanism for use in
212 generating an ASCII-compatible ("ACE") form of the name. It also
213 applies some additional conversions and character filtering that
214 are not part of Nameprep.
216 RFC 3491 Nameprep: A Stringprep Profile for Internationalized Domain
217 Names (IDN) [RFC3491].
218 Nameprep is one such profile. It is designed to meet the specific
219 needs of IDNs and, in particular, to support case-folding for
220 scripts that support what are traditionally known as upper and
229 lower case forms of the same letters. The result of the Nameprep
230 algorithm is a string containing a subset of the Unicode Character
231 set, normalized and case folded so that case insensitive
232 comparison can be made.
234 RFC 3492 Punycode: A Bootstring encoding of Unicode for
235 Internationalized Domain Names in Applications (IDNA) [RFC3492].
236 Punycode is a mechanism for encoding a Unicode string in ASCII
characters. The characters used are the same subset of
238 characters that are allowed in the hostname definition of DNS,
239 i.e., the "letter, digit, and hyphen" characters, sometimes known
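As a rough illustration of how the four specifications above fit
together, the sketch below uses the "idna" and "punycode" codecs and
the encodings.idna.nameprep helper from the Python standard library,
which follow the 2003 RFCs. The sample label and the helper name
to_ace are assumptions chosen for illustration; they do not come from
this document.

```python
import encodings.idna

def to_ace(label: str) -> str:
    """Return the ASCII-compatible (ACE) form of one label (RFC 3490)."""
    # For a non-ASCII label the codec applies Nameprep, then Punycode,
    # and then prepends the "xn--" ACE prefix.
    return label.encode("idna").decode("ascii")

sample = "B\u00dcCHER"  # hypothetical label containing U+00DC

# Nameprep (RFC 3491): case folding and normalization of a single label.
prepared = encodings.idna.nameprep(sample)

# Punycode (RFC 3492): encode the prepared label using only letters,
# digits, and hyphen.
encoded = prepared.encode("punycode").decode("ascii")

# IDNA (RFC 3490): the full conversion, including the ACE prefix.
ace = to_ace(sample)
```

The ACE form is what is actually stored and looked up in the DNS;
the Unicode form is what applications are expected to show to users.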
242 1.3. Unicode Documents
244 Unicode is used as the base, and defining, character set for IDN.
245 Unicode is standardized by the Unicode Consortium, and synchronized
246 with ISO to create ISO/IEC 10646 [ISO10646]. At the time the RFCs
247 mentioned earlier were created, Unicode was at version 3.2. For
248 reasons explained later, it was necessary to pick a particular, then-
249 current, version of Unicode when IDNA was adopted. Consequently, the
250 RFCs are explicitly dependent on Unicode version 3.2 [Unicode32].
251 There is, at present, no established mechanism for modifying the IDNA
252 RFCs to use newer Unicode versions (see Section 2.2.8).
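The effect of pinning IDNA to Unicode 3.2 can be sketched with the
Python standard library, which ships a frozen copy of the 3.2
character database (unicodedata.ucd_3_2_0) for use by its stringprep
module. The helper name assigned_in_3_2 and the choice of U+1E9E as a
sample code point assigned only in a later Unicode version are
assumptions made for illustration.

```python
import stringprep
import unicodedata

# Character database frozen at the version the IDNA RFCs depend on.
ucd32 = unicodedata.ucd_3_2_0

def assigned_in_3_2(ch: str) -> bool:
    """True if the code point was already assigned in Unicode 3.2."""
    return ucd32.category(ch) != "Cn"   # "Cn" means "not assigned"

# LATIN CAPITAL LETTER SHARP S, assigned in a version later than 3.2.
later_char = "\u1e9e"

known_then = assigned_in_3_2(later_char)
known_now = unicodedata.category(later_char) != "Cn"

# Stringprep table A.1 lists the code points unassigned in Unicode 3.2.
listed_as_unassigned = stringprep.in_table_a1(later_char)
```

A character of this kind is a letter for current software but an
unassigned code point for the tables on which the IDNA RFCs rely.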
254 Unicode is a very large and complex character set. (The term
255 "character set" or "charset" is used in a way that is peculiar to the
256 IETF and may not be the same as the usage in other bodies and
257 contexts.) The Unicode Standard and related documents are created
258 and maintained by the Unicode Technical Committee (UTC), one of the
259 committees of the Unicode Consortium.
261 The Consortium first published The Unicode Standard [Unicode10] in
262 1991, and continues to develop standards based on that original work.
263 Unicode is developed in conjunction with the International
264 Organization for Standardization, and it shares its character
265 repertoire with ISO/IEC 10646. Unicode and ISO/IEC 10646 function
266 equivalently as character encodings, but The Unicode Standard
267 contains much more information for implementers, covering -- in depth
268 -- topics such as bitwise encoding, collation, and rendering. The
269 Unicode Standard enumerates a multitude of character properties,
270 including those needed for supporting bidirectional text. The
271 Unicode Consortium and ISO standards do use slightly different
1.4. Definitions
276 The following terms and their meanings are critical to understanding
285 the rest of this document and to discussions of IDNs more generally.
286 These terms are derived from [RFC3536], which contains additional
287 discussion of some of them.
1.4.1. Language
291 A language is a way that humans interact. The use of language occurs
292 in many forms, including speech, writing, and signing.
294 Some languages have a close relationship between the written and
295 spoken forms, while others have a looser relationship. RFC 3066
296 [RFC3066] discusses languages in more detail and provides identifiers
297 for languages for use in Internet protocols. Computer languages are
298 explicitly excluded from this definition. The most recent IETF work
299 in this area, and on script identification (see below), is documented
300 in [ltru-registry] and [ltru-initial].
1.4.2. Script
304 A script is a set of graphic characters used for the written form of
305 one or more languages. This definition is the one used in
308 Examples of scripts are Arabic, Cyrillic, Greek, Han (the so-called
309 ideographs used in writing Chinese, Japanese, and Korean), and Latin
(more properly "Roman", see below). Arabic, Greek, and Latin are, of
311 course, also names of languages. Some issues with script
312 identification and relationships with other standards are discussed
1.4.3. Multilingual
317 The term "multilingual" has many widely-varying definitions and thus
318 is not recommended for use in standards. Some of the definitions
319 relate to the ability to handle international characters; other
320 definitions relate to the ability to handle multiple charsets; and
321 still others relate to the ability to handle multiple languages.
323 While this term has been deprecated for IETF-related uses and does
324 not otherwise appear in this document, a discussion here seemed
325 appropriate since the term is still widely used in some discussions
1.4.4. Localization
330 Localization is the process of adapting an internationalized
331 application platform or application to a specific cultural
332 environment. In localization, the same semantics are preserved while
341 the syntax or presentation forms may be changed.
343 Localization is the act of tailoring an application for a different
344 language or script or culture. Some internationalized applications
345 can handle a wide variety of languages. Typical users only
346 understand a small number of languages, so the program must be
347 tailored to interact with users in just the languages they know.
349 Somewhat different definitions for localization and
350 internationalization (see below) are used by groups other than the
351 IETF. See [W3C-Localization] for one example.
353 1.4.5. Internationalization
355 In the IETF, the term "internationalization" is used to describe
356 adding or improving the handling of non-ASCII text in a protocol.
357 Other bodies use the term in other ways, often ones that are subtly
358 different from each other. The term "internationalization" is often
361 Many protocols that handle text only handle the characters associated
362 with one script (often, a subset of the characters used in writing
363 English text), or leave the question of what character set is used up
364 to local guesswork (which leads, of course, to interoperability
365 problems). Adding non-ASCII text to such a protocol allows the
366 protocol to handle more scripts, with the intention of being able to
367 include all of the scripts that are useful in the world. It should
368 be noted that many English words cannot be written in ASCII, various
369 mythologies notwithstanding.
371 1.5. Statements and Guidelines
373 When the IDN RFCs were published, IESG and ICANN made statements that
374 were intended to guide deployment and future work. In recent months,
375 ICANN has updated its statement and others have also made
376 contributions. It is worth noting that the quality of understanding
377 of internationalization issues as applied to the DNS has evolved
378 considerably over the last few years. Organizations that took
379 specific positions a year or more ago might not make exactly the same
382 1.5.1. IESG Statement
384 The IESG made a statement on IDNA [IESG-IDN]:
397 IDNA, through its requirement of Nameprep [RFC3491], uses
398 equivalence tables that are based only on the characters
399 themselves; no attention is paid to the intended language (if any)
400 for the domain name. However, for many domain names, the intended
401 language of one or more parts of the domain name actually does
404 Similarly, many names cannot be presented and used without
405 ambiguity unless the scripts to which their characters belong are
406 known. In both cases, this additional information should be of
407 concern to the registry.
409 The statement is longer than this, but these paragraphs are the
important ones. The rest of the statement consists of explanations and
413 1.5.2. ICANN statements
415 1.5.2.1. Initial ICANN Guidelines
417 Soon after the IDNA standard was adopted, ICANN produced an initial
418 version of its "IDN Guidelines" [ICANNv1]. This document was
419 intended to serve two purposes. The first was to provide a basis for
420 releasing the gTLD registries that had been established by ICANN from
421 a contractual restriction on the registration of labels containing
422 hyphens in the third and fourth positions. The second was to provide
423 a general framework for the development of registry policies for the
424 implementation of IDN.
426 One of the key components of this framework was prescribing strict
427 compliance with RFCs 3490, 3491, and 3492. These specifications
428 established the ACE (ASCII-Compatible Encoding) scheme for IDN use,
429 known as "Punycode", and the various rules for its use. The
430 specifications designated Punycode, supported by those rules, as the
431 sole such encoding to be used with the DNS.
433 Limitations on the characters available for inclusion in IDNs were
434 mandated by two devices. The first was by requiring an "inclusion-
435 based approach (meaning that code points that are not explicitly
436 permitted by the registry are prohibited) for identifying permissible
437 code points from among the full Unicode repertoire." The second
438 device required the association of every IDN with a specific
439 language, with additional policies also being language based:
441 "In implementing the IDN standards, top-level domain registries will
442 (a) associate each registered internationalized domain name with one
443 language or set of languages,
444 (b) employ language-specific registration and administration rules
453 that are documented and publicly available, such as the reservation
454 of all domain names with equivalent character variants in the
455 languages associated with the registered domain name, and,
456 (c) where the registry finds that the registration and administration
457 rules for a given language would benefit from a character variants
458 table, allow registrations in that language only when an appropriate
459 table is available. ... In implementing the IDN standards, top-level
460 domain registries should, at least initially, limit any given domain
461 label (such as a second-level domain name) to the characters
462 associated with one language or set of languages only."
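The inclusion-based approach described above can be sketched as a
simple table lookup. The table contents (a small set that a registry
serving Swedish might choose) and the function names are hypothetical;
they are not taken from the guidelines or from any registry's actual
policy.

```python
# Hypothetical registry table: the code points the registry permits.
# Anything not listed is prohibited (the inclusion-based approach).
PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {
    "\u00e5",  # a with ring above
    "\u00e4",  # a with diaeresis
    "\u00f6",  # o with diaeresis
}

def rejected_code_points(label: str, permitted: set = PERMITTED) -> list:
    """Return the code points in the label that the table does not list."""
    return [f"U+{ord(ch):04X}" for ch in label if ch not in permitted]

def is_registrable(label: str, permitted: set = PERMITTED) -> bool:
    """A label is acceptable only if every code point is listed."""
    return not rejected_code_points(label, permitted)
```

Under this assumed table, a label spelled with U+00F6 would be
accepted while the same label spelled with U+00F8 would be refused,
because the latter code point was never explicitly permitted.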
464 It was left to each TLD registry to define the character repertoire
465 it would associate with any given language. This led to significant
466 variation from registry to registry, with further heterogeneity in
467 the underlying language-based IDN policies. If the guidelines had
468 made provision for IDN policies also being based on script, a
469 substantial amount of the resulting ambiguity could have been
470 avoided. However, they did not, and the sequence of events leading
471 to the present review of IDNA was thus triggered.
473 1.5.2.2. ICANN Version 2 Guidelines
One of the responses of the TLD registries to what was widely
perceived as a crisis situation was to invoke the mechanism described
in the
477 initial guidelines: "As the deployment of IDNs proceeds, ICANN and
478 the IDN registries will review these Guidelines at regular intervals,
479 and revise them as necessary based on experience."
481 The pivotal requirement was the modification of the guidelines to
482 permit script-based IDN policies. Further concern was expressed
483 about the need for realistically implementable mechanisms for the
484 propagation of TLD registry policies into the lower levels of their
485 name trees. In addition to the anticipated increase of constraint on
486 the protocol level, one obvious additional approach would be to
487 replace the guidelines by an instrument which itself had clear status
488 in the IETF's normative framework. A BCP was therefore seen as the
489 appropriate focus for longer-term effort. The most pressing issues
490 would be dealt with in the interim by incremental modification to the
491 guidelines, but no need was seen for the detailed further development
of those guidelines once that incremental modification was complete.
494 The outcome of this action was a version 2.0 of the guidelines
495 [ICANNv2] which was endorsed by the ICANN Board on November 8, 2005
496 for a period of nine months. The Board stated further that it "tasks
497 the IDN working group to continue its important work and return to
498 the board with specific IDN improvement recommendations before the
499 ICANN Meeting in Morocco" and "supports the working group's continued
500 action to reframe the guidelines completely in a manner appropriate
509 for further development as a Best Current Practices (BCP) document,
510 to ensure that the Guideline directions will be used deeper into the
511 DNS hierarchy and within TLD's where ICANN has a lesser policy
514 Retaining the inclusion-based approach established in version 1.0,
515 the crucial addition to the policy framework is that:
517 "All code points in a single label will be taken from the same script
518 as determined by the Unicode Standard Annex #24: Script Names at
519 http://www.unicode.org/reports/tr24. Exception to this is
520 permissible for languages with established orthographies and
521 conventions that require the commingled use of multiple scripts. In
522 such cases, visually confusable characters from different scripts
523 will not be allowed to co-exist in a single set of permissible
524 codepoints unless a corresponding policy and character table is
529 "Permissible code points will not include: (a) line symbol-drawing
530 characters (as those in the Unicode Box Drawing block), (b) symbols
531 and icons that are neither alphanumeric nor ideographic language
532 characters, such as typographic and pictographic dingbats, (c)
533 characters with well-established functions as protocol elements, (d)
534 punctuation marks used solely to indicate the structure of
537 Attention has been called to several points that are not adequately
538 dealt with (if at all) in the version 2.0 guidelines but which ought
539 to be included in the policy framework without waiting for the
540 production and release of a document based on a "best practices"
541 model. The term "BCP" above does not necessarily refer to an IETF
542 consensus document. The recommendations to be put to the ICANN Board
543 prior to its meeting in Morocco (in late June 2006) will therefore be
544 collated incrementally and appear in interim version 2.n releases of
548 2. Problems and Issues
550 This section intentionally mixes problems and issues of several
551 types. Each subsection outlines something that is perceived to be a
552 problem or issue "with IDNs", therefore needing correction. Some of
553 these issues can be at least partially resolved by making changes to
554 elements of the IDNA protocol or tables. Others will exist as long
555 as people have expectations of IDNs that are inconsistent with the
556 basic DNS architecture. It is important to identify this entire
565 range of problems because users, registrants, and policy makers often
566 do not understand the protocol and other technical issues but only
567 the difference between what they believe happens or should happen and
568 what actually happens. As long as those differences exist, there
569 will be demands for functionality or policy changes for IDN. Of
570 course, some of these demands will be less realistic than others but
571 even the realistic ones should be understood in the same context as
574 2.1. User conceptions, local character sets, and input issues
People use "words" when they think of things and wish others to think
of them too. Examples are "orange", "tree", "restaurant", and "Acme
Inc". Words are normally in a specific language, such as English or
579 Swedish. The DNS, however, supports character-string labels, not
580 "words". While it is useful, especially for mnemonic value or to
581 identify objects, for actual words to be used as DNS labels, other
582 constraints on the DNS make it impossible to guarantee that it will
583 be possible to represent every word in every language as a DNS label,
584 internationalized or not.
586 When writing or typing the label (or word), a script must be selected
587 and a charset must be picked for use with that script. That choice
588 of charset is typically not under the control of the user on a per
589 word or per document basis, but may depend on local input devices,
590 keyboard or terminal drivers, or other decisions made by operating
591 system or even hardware designers and implementers.
593 If that charset, or the local charset being used by the relevant
594 operating system or application software, is not Unicode, a further
595 conversion must be performed to produce Unicode. How often this is
596 an issue depends on estimates of how widely Unicode is deployed as
597 the native character set for hardware, operating systems, and
598 applications. Those estimates differ widely, with some Unicode
599 advocates claiming that it is used in the vast majority of systems
600 and applications today. Others are more skeptical, pointing out
603 o ISO 8859 versions [ISO.8859.2003] and even national variations of
604 ISO 646 [ISO.646.1991] are still widely used in parts of Europe;
605 o code-table switching methods, typically based on the techniques of
606 ISO 2022 [ISO.2022.1986] are still in general use in many parts of
607 the world, especially in Japan with Shift-JIS and its variations;
o computing, systems, and communications in China tend to use
609 one or more of the national "GB" standards rather than native
623 Not all charsets define their characters in the same way and not all
624 pre-existing coding systems were incorporated into Unicode without
625 changes. Sometimes local distinctions were made that Unicode does
626 not make or vice versa. Consequently, conversion from other systems
627 to Unicode may potentially lose information.
629 The Unicode string that results from this processing --processing
630 that is trivial in a Unicode-native system but that may be
631 significant in others-- is then used as input to IDNA.
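The conversion step that precedes IDNA can be sketched as follows,
using codecs from the Python standard library. The single-octet
input and the function name to_unicode are assumptions for
illustration. The point is only that the Unicode string handed to
IDNA depends on which local charset the system believes is in use.

```python
def to_unicode(raw: bytes, local_charset: str) -> str:
    """Convert octets in a local charset to a Unicode string."""
    return raw.decode(local_charset)

# One octet as it might arrive from a keyboard driver or a file.
raw = b"\xe4"

as_latin = to_unicode(raw, "iso8859_1")     # read as ISO 8859-1
as_cyrillic = to_unicode(raw, "iso8859_5")  # read as ISO 8859-5
```

The same octet becomes a Latin letter under one assumption and a
Cyrillic letter under the other, so an incorrect guess about the
local charset produces a different label before IDNA is ever applied.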
633 2.2. Examples of issues
635 While much of the discussion below is stated in terms of Unicode
636 codings and associated rules, the IAB believes that some of the
637 issues are actually not about the Unicode Character set per se, but
638 about how distributed matching systems operate in reality, and about
639 what implications the distributed delayed search for stored data that
characterizes the DNS has on the mapping algorithms.
642 2.2.1. Language specific character matching
There are similar words that can be expressed in multiple languages.
One example is the name Torbjorn in Norwegian and Swedish. In Norwegian
646 it is spelled with the character U+00F8 (LATIN SMALL LETTER O WITH
647 STROKE) in the second syllable, while in Swedish it is spelled with
648 U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS). Those characters are
649 not treated as equivalent according to the Unicode consortium while
650 most people speaking Swedish, Danish and Norwegian probably think
It is neither possible nor desirable to make these characters
equivalent on a global basis. To do so would, for this example,
rationalize the situation in Sweden while causing considerable
confusion in Germany, because the U+00F8 character is never used in
the German language. But the "variant" model introduced in [RFC3743]
and [RFC4290] can be used by a registry to prevent the worst
consequence of the possible confusion, either by ensuring that both
names are registered to the same party in a given domain or that one
of them is
661 completely prohibited.
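A registry-side check of the kind just described might look like the
sketch below. It is not the procedure specified in [RFC3743] or
[RFC4290]; the one-entry variant table, the Registry class, and the
registrant names are all hypothetical and serve only to show the idea
of keeping variant labels with a single party.

```python
# Hypothetical variant table for a registry serving Swedish and
# Norwegian registrants: each key is treated as a variant of its value.
VARIANTS = {"\u00f8": "\u00f6"}  # o with stroke -> o with diaeresis

def variant_key(label: str) -> str:
    """Replace every character by its variant representative."""
    return "".join(VARIANTS.get(ch, ch) for ch in label)

class Registry:
    def __init__(self) -> None:
        # variant key -> registrant holding every label with that key
        self._holder_by_key = {}

    def register(self, label: str, registrant: str) -> bool:
        """Accept a label unless another party holds one of its variants."""
        key = variant_key(label)
        holder = self._holder_by_key.setdefault(key, registrant)
        return holder == registrant

registry = Registry()
first = registry.register("torbj\u00f6rn", "alice")      # accepted
conflict = registry.register("torbj\u00f8rn", "bob")     # refused
same_party = registry.register("torbj\u00f8rn", "alice") # accepted
```

Because the table is local to the registry, the two characters stay
distinct everywhere else, which is the property the text above asks
for.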
663 2.2.2. Multiple scripts
There are languages in the world that can be expressed using multiple
scripts. For example, some Eastern European and Central Asian
languages can be expressed in either Cyrillic or Roman characters,
while some African and Southeast Asian languages can be expressed in
either
Arabic or Roman characters. A few languages can even be written in
three different scripts. In other cases, the language is typically
written in a combination of scripts (e.g., Kanji and Kana for
Japanese, Hangul and Hanja for Korean). Because of this, the same
681 word, in the same language, can be expressed in different ways. For
682 some languages, only a single script is normally used to write a
683 single word; for others, mixed scripts are required; and, for still
684 others, special circumstances may dictate mixing scripts in labels
685 although that is not normally done for "words". For IDN purposes,
686 these variations make the definition of "script" extremely sensitive,
687 especially since ICANN is now recommending that it be used as the
688 primary basis for registry policies. However essential it may be to
689 prohibit mixed-script labels, additional policy nuance is required
690 for "languages with established orthographies and conventions that
691 require the commingled use of multiple scripts".
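A mixed-script check can be approximated as shown below. The Python
standard library does not expose the Unicode Script property, so this
sketch falls back on a heuristic, namely the first word of each
character's Unicode name. That heuristic, the function names, and the
sample labels are assumptions for illustration only; a real
implementation would use the script data referenced by the ICANN
guidelines.

```python
import unicodedata

def label_scripts(label: str) -> set:
    """Approximate the set of scripts used by the letters of a label.

    Heuristic only: uses the first word of the Unicode character name
    (for example "LATIN" or "CYRILLIC"), not the Script property.
    Digits and hyphens are ignored.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

def is_single_script(label: str) -> bool:
    """True if the letters of the label appear to share one script."""
    return len(label_scripts(label)) <= 1

latin_only = label_scripts("paypal")
mixed = label_scripts("p\u0430ypal")            # second letter is U+0430
japanese = label_scripts("\u65e5\u672c\u306e")  # Han plus Hiragana
```

A strict single-script rule flags the second label, which is the
intent, but it also flags the third, which is ordinary Japanese. That
is the additional policy nuance the text above refers to.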
693 2.2.3. Normalization and Character Mappings
695 Unicode contains several different models for representing
696 characters. The Chinese (Han)-derived characters of the "CJK"
697 languages are "unified", i.e., characters with common derivation and
698 similar appearances are assigned to the same code point. European
characters derived from a Greek-Roman base are divided into
700 separate code blocks for "Latin", Greek and Cyrillic even when
701 individual characters are identical in both form and semantics.
702 Separate code points based on font differences alone are generally
703 prohibited, but a large number of characters for "mathematical" use
704 have been assigned separate code points even though they differ from
705 base ASCII characters only by font attributes such as "script",
706 "bold", or "italic". Some characters that often appear together are
707 treated as typographical digraphs with specific code points assigned
708 to the combination, others require that the two-character sequences
709 be used, and still others are available in both forms. Some Roman-
710 based letters that were developed as decorated variations on the
711 basic Latin letter collection (e.g., by addition of diacritical
712 marks) are assigned code points as individual characters, others must
713 be built up as two (or more) character sequences using "composing
716 Many of these differences result from the desire to maintain backward
717 compatibility while the standard evolved historically, and are hence
718 understandable. However, the DNS requires precise knowledge of which
719 codes and code sequences represent the same character and which ones
720 do not. Limiting the potential difficulties with confusable
721 characters (see Section 2.2.6) requires even more knowledge of which
722 characters might look alike in some fonts but not in others. These
723 variations make it difficult or impossible to apply a single set of
724 rules to all of Unicode and, in doing so, satisfy everyone and their
733 perceived needs. Instead, more or less complex mapping tables,
734 defined on a character by character basis, are required to
735 "normalize" different representations of the same character to a
736 single form so that matching is possible.
738 Unless normalization rules, such as those that underlie Nameprep, are
739 applied, characters that are essentially identical will not match in
740 the DNS, creating many opportunities for problems. The most common
741 one is that, depending on how a word is entered and encoded before it
742 becomes a Unicode string, a single word can end up being expressed as
743 more than one distinct Unicode string.
745 IDNA attempts to compensate for some of these problems by using a
746 normalization algorithm defined by the Unicode Consortium. This
747 algorithm can change a sequence of one or more Unicode characters to
748 another sequence of characters. One example is that the base character
749 U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING
750 DIAERESIS) is changed to the single Unicode character U+00E4 (LATIN
751 SMALL LETTER A WITH DIAERESIS).
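The example above can be reproduced with a few lines of Python. This is only a sketch using the standard unicodedata module; it applies form NFKC directly rather than the full Nameprep profile.

```python
import unicodedata

# U+0061 followed by U+0308, as in the example in the text.
decomposed = "a\u0308"

# Normalization replaces the two-code-point sequence with one code point.
composed = unicodedata.normalize("NFKC", decomposed)

print(len(decomposed), len(composed))
print(f"U+{ord(composed):04X}", unicodedata.name(composed))
```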
753 This Unicode normalization process accounts only for simple character
754 equivalences, not equivalences that are language or script dependent.
755 For example, as mentioned above, the characters U+00F8 (LATIN SMALL
756 LETTER O WITH STROKE) and U+00F6 (LATIN SMALL LETTER O WITH
757 DIAERESIS) are considered to match in Swedish (and some other
758 languages), but not for all languages that use either of the
759 characters. Having these characters be treated as equivalent in some
760 contexts and not in others requires decisions and mechanisms that, in
761 turn, depend much more on context than either IDNA or the Unicode
762 character-based normalization tables can provide.
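The following Python sketch illustrates the point. The per-language table and the match_key function are hypothetical, introduced only for illustration; nothing like the language parameter is available to a DNS lookup, which is exactly the difficulty described above.

```python
import unicodedata

# Hypothetical per-language equivalence table (an assumption for illustration):
# in a Swedish context, treat U+00F8 as matching U+00F6.
LANGUAGE_EQUIVALENTS = {
    "sv": {"\u00F8": "\u00F6"},
}

def match_key(label, language=None):
    """Return a comparison key: NFKC, then optional language-specific folding."""
    text = unicodedata.normalize("NFKC", label)
    table = LANGUAGE_EQUIVALENTS.get(language, {})
    return "".join(table.get(ch, ch) for ch in text)

# Unicode normalization alone keeps the two letters distinct.
print(match_key("\u00F8l") == match_key("\u00F6l"))
# With the (hypothetical) Swedish table they match.
print(match_key("\u00F8l", "sv") == match_key("\u00F6l", "sv"))
```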
764 If we leave Roman-based scripts and examine those based on Chinese
765 characters, we see there is also an absence of specific, lexigraphic,
766 rules for transformations between Traditional and Simplified Chinese.
767 Even if there were such rules, unification of Japanese and Korean
768 characters with Chinese ones would make it impossible to normalize
769 Traditional Chinese into Simplified Chinese ones without causing
770 problems in Japanese and Korean use of the same characters.
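A small Python check, offered as a sketch, shows that none of the four Unicode normalization forms relates a Traditional Chinese character to its Simplified counterpart. The pair used (U+570B and U+56FD) is our choice of example.

```python
import unicodedata

TRADITIONAL = "\u570B"  # Traditional form of the character for "country"
SIMPLIFIED = "\u56FD"   # Simplified form (the same code point is used in Japanese)

def normalization_unifies(a, b):
    """True if any standard normalization form maps a and b to the same string."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(unicodedata.normalize(f, a) == unicodedata.normalize(f, b) for f in forms)

print(normalization_unifies(TRADITIONAL, SIMPLIFIED))
```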
772 More generally, while some mappings, such as those between
773 precomposed Roman-based characters and the equivalent multiple code
774 point composed character sequences, depend only on the characters
775 themselves, in many or most cases, such as the case with Swedish
776 above, the mapping is language or culturally dependent. There have
777 been discussions as to whether different canonicalization rules (in
778 addition to or instead of Unicode normalization) should be, or could
779 be, applied differently to different languages or scripts. The fact
780 that most scripts included in Unicode have been initially
789 incorporated by copying an existing standard more or less intact has
790 an impact on the optimization of these algorithms and on forward
791 compatibility. Even if the language is known and language-specific
792 rules can be defined, dependencies on the language do not disappear.
793 Canonicalization operations are not possible unless they either
794 depend only on short sequences of text or have significant context
795 available that is not obvious from the text itself. DNS lookups and
796 many other operations do not have a way to capture and utilize the
797 language or other information that would be needed to provide that context.
800 These variations in languages and in user perceptions of characters
801 make it difficult or impossible to provide uniform algorithms for
802 matching Unicode strings in a way that no end users are ever
803 surprised by the result. For closely-related scripts or characters,
804 surprises may even be frequent. However, because uniform algorithms
805 are required for mappings that are applied when names are looked up
806 in the DNS, the rules that are chosen will always represent an
807 approximation that will be more or less successful in minimizing
808 those user surprises. The current Nameprep and Stringprep algorithms
809 use mapping tables to "normalize" different representations of the
810 same text to a single form so that matching is possible.
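To make the effect of these mapping tables concrete, the sketch below uses the nameprep function from Python's encodings.idna module, which implements the Nameprep profile against Unicode 3.2 data. The helper same_after_nameprep is our own name, introduced for illustration.

```python
from encodings.idna import nameprep

def same_after_nameprep(a, b):
    """Compare two labels after Nameprep mapping and normalization."""
    return nameprep(a) == nameprep(b)

# Case differences and precomposed versus combining forms are folded together.
print(same_after_nameprep("B\u00FCcher", "bu\u0308cher"))
# A fullwidth letter is mapped to its ordinary counterpart.
print(same_after_nameprep("\uFF45xample", "example"))
# Language-dependent equivalences are not handled.
print(same_after_nameprep("\u00F6l", "\u00F8l"))
```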
812 More details on the creation of the normalization algorithms can be
813 found in the Unicode Specification and the associated Technical
814 Reports [UTR] and Annexes. Technical Report #36 [UTR36] and [UTR39]
815 are specifically related to the IDN discussion.
817 2.2.4. URLs in Printed Form
819 URLs and other identifiers appear not only in electronic forms from
820 which they can (at least in principle) be accurately copied and
821 "pasted", but also in printed forms from which the user must transcribe
822 them into the computer system. This is often known as the "side of
823 the bus problem" because a particularly problematic version of it
824 requires that the user be able to observe and accurately remember a
825 URL that is quickly-glimpsed in a transient form -- a billboard seen
826 while driving, a sign on the side of a passing vehicle, a television
827 advertisement that is not frequently repeated or on-screen for a long time.
830 The difficulty, in short, is that two Unicode strings that are
831 actually different might look exactly the same, especially when there
832 is no time to study them. This is because, for example, some glyphs
833 in Cyrillic, Greek and Latin do look the same, but have been assigned
834 different codepoints in Unicode. Worse, one needs to be reasonably
835 familiar with a script and how it is used to understand how much
836 characters can reasonably vary as the result of artistic fonts and
845 typography. For example, there are a few fonts for Latin characters
846 that are sufficiently highly ornamented that an observer might easily
847 confuse some of the characters with characters in Thai script.
849 2.2.5. Bidirectional text
851 Some scripts (and because of that some words in some languages) are
852 written not left to right, but right to left. And, to complicate
853 things, one might have something written in Arabic characters right
854 to left that includes some Latin-script elements, such as
855 European-style digits. The Latin character part is written left to
856 right, which implies some texts might have a mixed left to right AND
857 right to left order (even though in most implementations all texts
858 have a major direction, with the other as an exception). IDNA
859 prohibits these mixed-directional (or bidirectional) strings in IDN
860 labels, but the prohibition causes other problems such as the
861 rejection of some otherwise linguistically and culturally sensible
862 strings. As Unicode and conventions for handling so-called
863 bidirectional ("BIDI") strings evolve, the prohibition in IDNA should
864 be reviewed and reevaluated.
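The following Python sketch is loosely modelled on the bidirectional requirements of Stringprep (RFC 3454): a label containing right-to-left characters may not contain left-to-right characters, and must begin and end with a right-to-left character. It is an approximation only; it uses the directional classes of whatever Unicode version Python ships rather than the Stringprep tables, and the function name is ours.

```python
import unicodedata

RTL_CLASSES = ("R", "AL")  # right-to-left and Arabic-letter bidi classes

def bidi_ok(label):
    """Approximate check of the bidirectional restriction on a single label."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RTL_CLASSES for c in classes):
        return True                      # purely left-to-right labels are unaffected
    if "L" in classes:
        return False                     # mixing strong RTL and strong LTR characters
    # The first and last characters must both be right-to-left.
    return classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES

print(bidi_ok("\u0627\u0628\u062C"))     # Arabic letters only
print(bidi_ok("\u0627\u0628abc"))        # Arabic letters followed by Latin letters
print(bidi_ok("\u0627\u06281"))          # Arabic letters followed by a European digit
```

The last case shows the kind of rejection mentioned above: a label that ends in a European-style digit fails the check even though such a string may be perfectly sensible to its users.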
866 2.2.6. Confusable Character Issues
868 Similar-looking characters in identifiers can cause actual problems
869 on the Internet since they can result, deliberately or accidentally,
870 in people being directed to the wrong host or mailbox by believing
871 that they are typing, or clicking on, intended characters which are
872 different from those that actually appear in the domain name or
873 reference. See Section 3.1.3 for further discussion of this issue.
875 IDNs complicate these issues, not only by providing many additional
876 characters that look sufficiently alike to be potentially confused,
877 but by raising new policy questions. For example, if a language can
878 be written in two different scripts, is a label constructed from a
879 word written in one script equivalent to a label constructed from the
880 same word written in the other script? Is the answer the same for
881 words in two different languages that translate into each other?
883 It is now generally understood that, in addition to the collision
884 problems of possibly equivalent words and hence labels, it is
885 possible to utilize characters that look alike -- "confusable"
886 characters -- to spoof names in order to mislead or defraud users.
887 That issue, driven by particular attacks such as those known as
888 "phishing", has introduced stronger requirements for registry efforts
889 to prevent problems than were previously generally recognized as important.
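One simple heuristic for spotting some spoofing attempts is to look for labels that mix scripts. The Python sketch below approximates the script of a character by the first word of its Unicode character name; that shortcut is an assumption made for brevity, not a proper script property lookup, and both function names are ours.

```python
import unicodedata

def scripts_in(label):
    """Approximate the set of scripts used by the letters of a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed(label):
    """True if the label appears to draw letters from more than one script."""
    return len(scripts_in(label)) > 1

spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(spoof == "paypal", looks_mixed(spoof), looks_mixed("paypal"))
```

A check of this kind is far from sufficient. A label written entirely in Cyrillic letters that resemble Latin ones passes it, and legitimate labels in languages that routinely combine scripts would be flagged.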
892 One commonly-proposed approach is to have a registry establish
901 restrictions on the characters, and combinations of characters, it
902 will permit to be included in a string to be registered as a label.
903 Taking the Swedish top-level domain, .SE, as an example, a rule might
904 be adopted that the registry "only accepts registrations in Swedish,
905 using Roman script, and because of this, Unicode characters Latin-a,
906 -b, -c,...". But, because there is not a 1:1 mapping between country
907 and language, even a ccTLD like .SE might have to accept
908 registrations in other languages. For example, there may be a
909 requirement for Finnish (the second most-used language in Sweden).
910 What rules and codepoints are then defined for Finnish? Does it have
911 special mappings that collide with those that are defined for
912 Swedish? And what does one do in countries that use more than one
913 script? (Finnish and Swedish use the same script.) In all cases,
914 the dispute will ultimately be about whether two strings are the same
915 (or confusingly similar) or not. That, in turn, will generate a
916 discussion of how one defines "what is the same" and "what is similar
917 enough to be a problem".
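A registry restriction of the kind described above can be sketched as a table of permitted characters. The table below is hypothetical (ASCII letters, digits, hyphen, and three additional Swedish letters) and is not the actual policy of any registry; labels are assumed to have been normalized and lower-cased already.

```python
import string

# Hypothetical permitted-character table for a Swedish-language registry.
SWEDISH_LETTERS = "\u00E5\u00E4\u00F6"
PERMITTED = frozenset(string.ascii_lowercase + string.digits + "-" + SWEDISH_LETTERS)

def rejected_characters(label):
    """Return the characters of a label that are outside the permitted table."""
    return sorted(set(label) - PERMITTED)

def is_registrable(label):
    """A label is acceptable only if every character is explicitly permitted."""
    return bool(label) and not rejected_characters(label)

print(is_registrable("r\u00E4ksm\u00F6rg\u00E5s"))
print(rejected_characters("caf\u00E9"))
```

Accepting Finnish, or any other language, would mean extending the table, which is where the questions about colliding mappings raised above begin.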
919 These difficulties can never be completely eliminated by algorithmic
920 means. Some of the problem can be addressed by appropriate tuning of
921 the protocols and their tables, other parts by registry actions to
922 reduce confusion and conflicts, and still other parts can be
923 addressed by careful design of user interfaces in application
924 programs. But, ultimately, some responsibility to avoid being
925 tricked or harmfully confused will rest with the user.
927 Another registry technique that has been extensively explored
928 involves looking at confusable characters and confusion between
929 complete labels, restricting the labels that can be registered based
930 on relationships to what is registered already. Registries that
931 adopt this approach might establish special mapping rules such as:
933 1. If you register something with codepoint A, domain names with B
934 instead of A will be blocked from registration by others.
935 2. If you register something with codepoint A, you also get the domain
936 name with B instead of A.
938 These approaches are discussed in more detail for "CJK" characters in
939 RFC 3743 [RFC3743] and more generally in RFC 4290 [RFC4290].
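The two mapping rules above can be sketched as follows. The variant table, the Registry class, and its method names are all hypothetical and are kept deliberately small; the procedures described in RFC 3743 and RFC 4290 are considerably more elaborate.

```python
from itertools import product

# Hypothetical variant table: code points a registry treats as interchangeable.
VARIANTS = {"\u00F8": "\u00F6", "\u00F6": "\u00F8"}

def variant_labels(label):
    """All labels obtained by substituting variant code points, except the label itself."""
    choices = [(ch, VARIANTS[ch]) if ch in VARIANTS else (ch,) for ch in label]
    return {"".join(combo) for combo in product(*choices)} - {label}

class Registry:
    def __init__(self, bundle=False):
        self.bundle = bundle   # False: rule 1 (blocking); True: rule 2 (bundling)
        self.owner = {}        # label -> registrant
        self.blocked = {}      # label -> registrant whose registration blocks it

    def register(self, label, registrant):
        if label in self.owner:
            return False
        # A blocked label may be registered only by the party that caused the block.
        if self.blocked.get(label, registrant) != registrant:
            return False
        self.owner[label] = registrant
        for variant in variant_labels(label):
            if self.bundle:
                self.owner.setdefault(variant, registrant)
            else:
                self.blocked.setdefault(variant, registrant)
        return True
```

With blocking, a second party cannot register a variant of an existing label; with bundling, the variants are delegated to the original registrant at the same time.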
941 2.2.7. The IESG Statement and IDNA issues
943 The issues above, at least as they were understood at the time,
944 provided the background for the IESG statement included in
945 Section 1.5.1 (which, in turn, was part of the basis for the initial
946 ICANN Guidelines) that a registry should have a policy about the
947 scripts, languages, codepoints and text directions for which
948 registrations will be accepted. While "accept all" might be an
957 acceptable policy, it implies there is also a dispute resolution
958 process that takes the problems listed above into account. The
959 dispute resolution process must be designed so that all types of
960 potential disputes can be resolved: for example, issues
961 might arise between registrant and registry over a decision by the
962 registry on collisions with already registered domain names and
963 between registrant and trademark holder (that a domain name
964 infringes on a trademark). In both cases the parties disagreeing
965 have different views on whether two strings are "equivalent" or not.
966 They may believe that a string that is not allowed to be registered
967 is actually different from one that is already registered. Or they
968 might believe that two strings are the same, even though the rules
969 adopted by the registry to prevent confusion define them as two
970 different domain names.
972 2.2.8. Versions of Unicode
974 While opinions differ about how important the issues are in practice,
975 the use of Unicode and its supporting tables to support IDNs appears
976 to be far more sensitive to subtle changes than typical Unicode
977 applications. This may be, at least in part, because many other
978 applications are internally sensitive only to the appearance of
979 characters and not to their representation. Or those applications
980 may be able to take effective advantage of script, language, or
981 character class identification. The working group that developed
982 IDNA concluded that attempting to encode any ancillary character
983 information into the DNS label would be impractical and unwise, and
984 the IAB, based in part on the comments in the ad hoc committee, saw
985 no reason to review that decision.
987 This sensitivity to changes has made it quite difficult to migrate
988 IDNA from one version of Unicode to the next if any changes are made
989 that are not strictly additive. A change in a code point assignment
990 or definition may be extremely disruptive if DNS labels have been
991 defined using the earlier form. Unicode normalization tables, tables
992 of scripts or languages and characters that belong to them, and even
993 tables of confusable characters as an adjunct to security
994 recommendations may be very helpful in designing registry
995 restrictions on registrations and applications provisions for
996 avoiding or identifying suspicious names. Ironically, they also
997 extend the sensitivity of IDNA and its implementations to all forms
998 of change between one version of Unicode and the next. Consequently,
999 they make Unicode version migration more difficult.
1001 An example of the type of change that appears to be just a small
1002 correction from one perspective but may be problematic from another
1003 was the correction to the normalization definition in 2004 [Unicode-
1004 PR29]. There was community input that the change would cause
1013 problems for Stringprep, but UTC decided, on balance, that the change
1014 was worthwhile. Because of difficulties with consistency, some
1015 deployed implementations have decided to adopt the change and others
1016 have not, leading to subtle incompatibilities.
1018 This situation leads to a dilemma. On the one hand, it is completely
1019 unacceptable to freeze Unicode at a version level that excludes more
1020 recently-defined characters and scripts which are important to those
1021 who use them. On the other hand, it is equally unacceptable to
1022 migrate from one version of Unicode to the next if such migration
1023 might invalidate an existing registered DNS name or some of its
1024 registered properties or might make the string or representation of
1025 that name ambiguous. If IDNA is to be modified to accommodate new
1026 versions of Unicode, the IETF will need to work with the Unicode
1027 Consortium and other relevant bodies to find an appropriate balance
1028 in this area, but progress will be possible only if all relevant
1029 parties are able to fairly consider and discuss possible decisions
1030 that may be very difficult and unpalatable.
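The cost of freezing at one Unicode version can be illustrated with Python, whose unicodedata module exposes both its current character database and a frozen Unicode 3.2 database (ucd_3_2_0), the version that Stringprep references. The sample code point and the helper name are our choices for illustration.

```python
import unicodedata

old = unicodedata.ucd_3_2_0  # frozen Unicode 3.2 character database

def assigned_since_3_2(ch):
    """True if ch was unassigned in Unicode 3.2 but is assigned in the current tables."""
    return old.category(ch) == "Cn" and unicodedata.category(ch) != "Cn"

print(old.unidata_version, unicodedata.unidata_version)

# LATIN CAPITAL LETTER SHARP S was added to Unicode after version 3.2.
print(assigned_since_3_2("\u1E9E"))
```

Tables frozen at the older version cannot say anything useful about such a character, while tables that simply follow the newest version may treat existing labels differently from one release to the next.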
1033 3. Framework for next steps in IDN development
1035 3.1. Issues within the scope of the IETF
1037 3.1.1. Review of IDNA
1039 The IETF should consider reviewing RFCs 3454, 3490, 3491 and/or 3492,
1040 and update, replace or supplement them to meet the criteria of this
1041 paragraph (one or more of them may prove impractical after further
1042 study). Any new versions or additional specifications should be
1043 adapted to the version of Unicode that is current when they are
1044 created. Ideally, they should specify a path for adapting to future
1045 versions of Unicode (some suggestions below may facilitate this).
1046 The IETF should also consider whether there are significant
1047 advantages to mapping some groups of characters, such as code points
1048 assigned to font variations, into others or whether clarity and
1049 comprehensibility for the user would be better served by simply
1050 prohibiting those characters. More generally, it appears that it
1051 would be worthwhile for the IETF to review whether the Unicode
1052 normalization rules now invoked by the Stringprep profile in Nameprep
1053 are optimal for the DNS or whether more restrictive rules, or an even
1054 more restrictive set of permitted character combinations, would
1055 provide better support for DNS internationalization.
1057 The IAB has concluded that there is a consensus within the broader
1058 community that lists of codepoints should be specified by the use of
1059 an inclusion based mechanism (i.e., identifying the characters that
1060 are permitted), rather than by excluding a small number of characters
1069 from the total Unicode set as Stringprep and Nameprep do today. That
1070 conclusion should be reviewed by the IETF community and action taken as appropriate.
1073 We suggest that the individuals doing the review of the codepoints
1074 should work as a specialized design team. To the extent possible,
1075 that work should be done jointly by people with experience from the
1076 IETF and deep knowledge of the constraints of the DNS and application
1077 design, participants from the Unicode Consortium, and other people
1078 necessary to be able to reach a generally-accepted result. Because
1079 any work along these lines would be modifications and updates to
1080 standards-track documents, final review and approval of any proposals
1081 would necessarily follow normal IETF processes.
1083 It is worth noting that sufficiently extreme changes to IDNA would
1084 require a new Punycode prefix, probably with long-term support for
1085 both the old prefix and the new one in both registration arrangements
1086 and applications. An alternative, which is almost certainly
1087 impractical, would be some sort of "flag day", i.e., a date on which
1088 the old rules are simultaneously abandoned by everyone and the new
1089 ones adopted. However, preliminary analysis indicates that few, if
1090 any, of the changes recommended for consideration elsewhere in this
1091 document would require this type of version change. For example,
1092 additional restrictions on what can be registered may require policy
1093 decisions about actions to be taken with regard to labels that
1094 conformed to earlier rules but not to new ones, but not changes in
1095 the protocol or prefix.
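For readers unfamiliar with the prefix in question, the Python sketch below shows how an encoded label is formed from the Punycode encoding of the string plus the "xn--" prefix. It relies on Python's built-in "punycode" and "idna" codecs, the latter implementing the existing IDNA specification; the sample label is our choice.

```python
label = "b\u00FCcher"

raw = label.encode("punycode")   # Punycode encoding alone, without a prefix
ace = label.encode("idna")       # Nameprep, Punycode, and the prefix

print(raw, ace)

# The form placed in the DNS is the prefix followed by the Punycode output.
print(ace == b"xn--" + raw)
```

A new prefix would allow labels encoded under changed rules to be distinguished from existing ones, at the price of supporting both forms for a long time.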
1097 3.1.2. Non-DNS and Above-DNS Internationalization Approaches
1099 The IETF should once again examine the extent to which it is
1100 appropriate to try to solve internationalization problems via the DNS
1101 and what place the many varieties of so-called "keyword systems" or
1102 other Internet navigational techniques might have. Those techniques
1103 can be designed to impose fewer constraints, or at least different
1104 constraints, than IDNA and the DNS. As discussed elsewhere in this
1105 document, IDNA cannot support information about scripts, languages,
1106 or Unicode versions on lookup. As a consequence of the nature of DNS
1107 lookups, characters and labels either match or do not match; a near-
1108 match is simply not a possible concept in the DNS. By contrast,
1109 observation of near-matching is common in human communication and in
1110 matching operations performed by people, especially when they have a
1111 particular script or language context in mind. The DNS is further
1112 constrained by a fairly rigid internal aliasing system (via CNAME and
1113 DNAME resource records), while some applications of international
1114 naming may require more flexibility. Finally, the rigid hierarchy of
1115 the DNS --and the tendency in practice for it to become flat at
1116 levels nearest the root-- and the need for names to be unique are
1125 more suitable for some purposes than others and may not be a good
1126 match for some purposes for which people wish to use IDNs. Each of
1127 these constraints can be relaxed or changed by one or more systems
1128 that would provide alternatives to direct use of the DNS by users.
1129 Some of the issues involved are discussed further in Section 4.4 and
1130 various ideas have been discussed in detail in the IETF or IRTF.
1131 Many of those ideas have even been described in Internet Drafts or
1132 other documents. As experience with IDNs and with expectations for
1133 them accumulates, it will probably become appropriate for the IETF or
1134 IRTF to revisit the underlying questions and possibilities.
1136 3.1.3. Security issues, certificates, etc.
1138 Some characters look like others, often as the result of common
1139 origins. The problem with these "confusable" characters, often
1140 incorrectly called homographs, has always existed when characters are
1141 presented to humans who interpret what is displayed and then make
1142 decisions based on what they see. This is not a problem that
1143 exists only when working with internationalized domain names, but
1144 IDNs make the problem worse. The result of a survey that would explain
1145 what the problems are might be interesting. Many of these issues are
1146 mentioned in Unicode Technical Report #36 [UTR36].
1148 In this and other issues associated with IDNs, precise use of
1149 terminology is important lest even more confusion result. The
1150 definition of the term 'homograph' that normally appears in
1151 dictionaries and linguistic texts states that homographs are
1152 different words which are spelled identically (for example, the
1153 adjective 'brief' meaning short, the noun 'brief' meaning a document,
1154 and the verb 'brief' meaning to inform). By definition, letters in
1155 two different alphabets are not the same, regardless of similarities
1156 in appearance. This means that sequences of letters from two
1157 different scripts that appear to be identical on a computer display
1158 cannot be homographs in the accepted sense, even if they are both
1159 words in the dictionary of some language. Assuming that there is a
1160 language written with Cyrillic script in which "cap" is a word,
1161 regardless of what it might mean, it is not a homograph of the Latin-
1162 script English word "cap".
1164 When the security implications of visually confusable characters were
1165 brought to the forefront in 2005, the term homograph was used to
1166 designate any instance of graphic similarity, even when comparing
1167 individual characters. This usage is not only incorrect, but risks
1168 introducing even more confusion and hence should be avoided. The
1169 current preferred terminology is to describe these similar-looking
1170 characters as "confusable characters" or even "confusables".
1172 Many people have suggested that confusable characters are a problem
1181 that must be addressed, at least in part, as part of the user
1182 interfaces of application software. While it should almost certainly
1183 be part of a complete solution, that approach creates its own set of
1184 difficulties. For example, a user switching between systems, or even
1185 between applications on the same system, may be surprised by
1186 different types of behavior and different levels of protection. In
1187 addition, it is unclear how a secure setup for the end user should be
1188 designed. Today, in the web browser, a padlock is a traditional way
1189 of describing some level of security for the end user. Is this
1190 binary signaling enough? Should there be any connection between a
1191 risk for a displayed string including confusable characters and the
1192 padlock or similar signaling to the user?
1194 Many web browsers have adopted the convention, based on a "whitelist"
1195 or similar techniques, that IDNs within top-level domains that are
1196 deemed to follow safe practices about registration of confusable
1197 labels are displayed as native characters, while IDNs from other
1198 domains are displayed as Punycode. These techniques clearly are not
1199 sensitive to different policies between top-level domains and their
1200 subdomains and, while clearly helpful, may not be adequate. Are
1201 other methods of dealing with confusable characters possible? Would
1202 other methods of identifying and listing policies about avoiding
1203 confusing registrations be feasible and helpful?
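The browser convention described above might be sketched as follows. The list of trusted top-level domains is purely hypothetical, the function name is ours, and real browsers apply additional checks; the sketch relies on Python's built-in "idna" codec.

```python
# Hypothetical whitelist of TLDs considered to have adequate registration policies.
TRUSTED_TLDS = {"se"}

def display_form(domain, trusted=TRUSTED_TLDS):
    """Return the native-character form for trusted TLDs, the encoded form otherwise."""
    ascii_form = domain.encode("idna").decode("ascii")
    tld = ascii_form.rsplit(".", 1)[-1].lower()
    if tld in trusted:
        return ascii_form.encode("ascii").decode("idna")
    return ascii_form

print(display_form("b\u00FCcher.se"))
print(display_form("b\u00FCcher.com"))
```

Because the decision depends only on the top-level domain, a sketch like this shares the limitation noted above: it cannot reflect different policies in subdomains.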
1205 It would be interesting to see a more coordinated effort to develop
1206 user interface guidelines.
1208 3.1.4. Non US-ASCII in local part of email addresses
1210 Work is going on in the IETF related to the local part of email
1211 addresses. It should be noted that the local part of email addresses
1212 has a much different syntax and constraints than a domain name label,
1213 so IDNA cannot be applied directly to the local part.
1215 3.1.5. Use of the Unicode Character Set in the IETF
1217 Unicode, and the closely-related ISO 10646, are the only coded
1218 character sets that aspire to include all of the world's characters.
1219 As such, they permit use of international characters without having
1220 to identify particular character coding standards or tables. The
1221 requirement for a single character set is particularly important for
1222 use with the DNS since there is no place to put character set
1223 identification. The decision to use Unicode as the base for IETF
1224 protocols going forward is discussed in [RFC2277]. The IAB does not
1225 see any reason to revisit the decision to use Unicode in IETF protocols.
1237 3.2. Issues that fall within the purview of ICANN
1239 3.2.1. Dispute resolution
1241 IDN creates new types of collisions between trademarks and domain
1242 names as well as collisions between domain names. These have an impact
1243 on dispute resolution processes used by registries and otherwise. It
1244 is important that deployment of IDN evolve in parallel with review
1245 and updating of ICANN or registry-specific dispute resolution processes.
1248 3.2.2. Policy at registries
1250 The IAB recommends that registries use an inclusion based model when
1251 choosing what characters to allow at the time of registration. This
1252 list of characters is in turn to be a subset of what is allowed
1253 according to the updated IDNA standard. The IAB further recommends
1254 that registries develop their inclusion based models in parallel with
1255 the dispute resolution process at the registry itself.
1257 Most established policies for dealing with claimed or apparent
1258 confusion or conflicts of names are based on "dispute resolution".
1259 Decisions about legitimate use or registration of one or more names
1260 are resolved at or after the time of registration on a case-by-case
1261 basis and using policies that are specific to the particular DNS zone
1262 or jurisdiction involved. These policies have generally not been
1263 extended below the level of the DNS that is directly controlled by
1264 the top-level registry.
1266 Because of the much larger number of conflicts that can be generated
1267 by the larger number of available and confusable characters in
1268 Unicode, we recommend that registration-restriction and dispute
1269 resolution policies be developed to constrain IDN registrations by
1270 registries and zone administrators at all levels of the DNS tree. Of
1271 course, many of these policies will be less formal than others and
1272 there is no requirement for complete global consistency, but the
1273 arguments for reduction of confusable characters and other issues in
1274 TLDs should apply to all zones below that specific TLD.
1276 Consistency across all zones can obviously only be accomplished by
1277 changes to the protocols. Such changes should be considered by the
1278 IETF if particular restrictions are identified that are important and
1279 consistent enough to be applied globally.
1281 Policy changes that would not permit existing, registered, names to
1282 be registered under the newer rules should be considered carefully,
1283 balancing their importance against possible disruption and the issues
1284 of invalidating older names against the importance of consistency as seen by the user.
1297 The IAB has concluded that there is not one IDN TLD issue but at
1298 least three very separate ones:
1300 o Assuming there are to be IDN entries in the root zone at all, a
1301 decision must be made as to what TLDs are to be created and how
1302 they are to be named. This decision falls within the traditional
1303 IANA scope and is an ICANN issue today.
1304 o There has been discussion of permitting some or all existing TLDs
1305 to be referenced by multiple labels, with those labels presumably
1306 representing some understanding of the "name" of the TLD in
1307 different languages. If actual aliases of this type are desired
1308 for existing domains, the IETF may need to consider whether the
1309 use of DNAME records in the root is appropriate to meet that need,
1310 what constraints, if any, are needed, whether alternate
1311 approaches, such as those of [RFC4185], are appropriate or whether
1312 further alternatives should be investigated. But, to the extent
1313 to which aliases are considered desirable and feasible, decisions
1314 presumably must be made as to which, if any, root IDN labels
1315 should be associated with DNAME records and which ones should be
1316 handled by normal delegation records or other mechanisms. That
1317 decision is one of DNS root-level namespace policy and hence falls
1318 to ICANN although we would expect ICANN to pay careful attention
1319 to any technical, operational, or security recommendations that
1320 may be produced by other bodies.
1321 o Finally, if IDN labels are to be placed in the root zone, there
1322 are issues associated with how they are to be encoded and
1323 deployed. This area may have implications for work that has been
1324 done, or should be done, in the IETF.
1327 4. Specific Recommendations for Next Steps
1329 Consistent with the framework described above, the IAB offers these
1330 recommendations as steps for further consideration in the identified groups.
1333 4.1. Reduction of permitted character list
1335 Generalize from the original "hostname" rules to non-ASCII
1336 characters, permitting as few characters as possible to do that job.
1337 This would represent a restriction of the model of characters
1338 permitted in IDN labels, and it contrasts with the approach used to
1339 develop the original IDNA/Nameprep tables: that approach was to
1340 include all Unicode characters that there was not a clear reason to exclude.
1351 The specific recommendation here is to specify such internationalized
1352 hostnames. Such an activity would fall to the IETF, although the
1353 task of developing the appropriate list of permitted characters will
1354 require effort both in the IETF and elsewhere. The effort should be
1355 as linguistically and culturally sensitive as possible, but smooth
1356 and effective operation of the DNS, including minimizing
1357 complexity, should be primary goals. The following should be
1358 considered as possible mechanisms for achieving an appropriate
1359 minimum number of characters.
1361 4.1.1. Elimination of all non-language characters
1363 Unicode characters that are not needed to write words or numbers in
1364 any of the world's languages should be eliminated from the list of
1365 characters that are appropriate in DNS labels. In addition to such
1366 characters as those used for box-drawing and sentence punctuation,
1367 this should exclude punctuation for word structure and other
1368 delimiters: while DNS labels may conveniently be used to express
1369 words in many circumstances, the goal is not to express words (or
1370 sentences or phrases), but to permit the creation of unambiguous
1371 labels with good mnemonic value.
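
As an illustration only, the kind of filtering described above might be approximated with Unicode general categories. The sketch below assumes, as a stand-in policy that this document does not define, that "language characters" are letters (L*), combining marks (M*), and decimal digits (Nd). The function names are hypothetical, and a real permitted-character list would need far more careful review than a category test.

```python
import unicodedata

# Assumption (not from this document): "language characters" are
# approximated by the Unicode general categories for letters (L*),
# combining marks (M*), and decimal digits (Nd).
ALLOWED_CATEGORY_PREFIXES = ("L", "M")
ALLOWED_CATEGORIES = {"Nd"}


def is_language_character(ch):
    """Return True if ch falls in one of the assumed permitted categories."""
    category = unicodedata.category(ch)
    return (category.startswith(ALLOWED_CATEGORY_PREFIXES)
            or category in ALLOWED_CATEGORIES)


def rejected_characters(label):
    """List the characters of a candidate label that this sketch would exclude."""
    return [ch for ch in label if not is_language_character(ch)]


print(rejected_characters("example1"))       # letters and digits only
print(rejected_characters("box\u2500draw"))  # contains a box-drawing character
print(rejected_characters("a.b!"))           # contains sentence punctuation
```

Note that this category test also rejects the ASCII hyphen, which is a dash-punctuation character; a practical rule set would have to treat the hyphen as a separate case.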
1373 4.1.2. Elimination of word-separation punctuation
1375 The inclusion of the hyphen in the original hostname rules is a
1376 historical artifact from an older, flat, name space. The community
1377 should consider whether it is appropriate to treat it as a simple
1378 legacy property of ASCII names and not attempt to generalize it to
1379 other scripts. We might, for example, not permit claimed equivalents
1380 to the hyphen from other scripts to be used in IDNs. We might even
1381 consider banning use of the hyphen itself in non-ASCII strings or,
1382 less restrictively, in strings that contain non-Roman characters.
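
The two possibilities raised in this section can be sketched as a simple label check. The sketch assumes that "claimed equivalents to the hyphen" can be approximated by the Unicode dash-punctuation category (Pd); that is an assumption made for illustration, and it misses hyphen-like characters in other categories. The function names and the ban_hyphen_in_non_ascii flag are hypothetical.

```python
import unicodedata


def dash_like_characters(label):
    """Return non-ASCII characters in the dash punctuation (Pd) category."""
    return [ch for ch in label
            if ch != "-" and unicodedata.category(ch) == "Pd"]


def hyphen_policy_ok(label, ban_hyphen_in_non_ascii=False):
    """Apply the assumed policy to one candidate label."""
    # Possibility 1: never accept dash-like characters from other scripts.
    if dash_like_characters(label):
        return False
    # Possibility 2 (optional): reject '-' itself in non-ASCII labels.
    if ban_hyphen_in_non_ascii and "-" in label and not label.isascii():
        return False
    return True


print(hyphen_policy_ok("my-host"))
print(hyphen_policy_ok("my\u2010host"))  # U+2010 HYPHEN
print(hyphen_policy_ok("b\u00fc-cher", ban_hyphen_in_non_ascii=True))
```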
1384 4.2. Updating to new versions of Unicode
1386 As new scripts, to support new languages, continue to be added to
1387 Unicode, it is important that IDNA track updates. If it does not do
1388 so, but remains "stuck" at Unicode 3.2 or some single later version, it will
1389 not be possible to include labels in the DNS that are derived from
1390 words in languages that require characters that are available only in
1391 later versions. Making those upgrades is difficult, and will
1392 continue to be difficult, as long as new versions require, not just
1393 addition of characters, but changes to canonicalization conventions,
1394 normalization tables, or matching procedures (see Section 2.2.8).
1395 Anything that can be done to lower complexity and simplify forward
1396 transitions should be seriously considered.
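
The effect of remaining at Unicode 3.2 can be illustrated with Python's standard unicodedata module, which exposes a Unicode 3.2 view of its database as unicodedata.ucd_3_2_0. The sketch below reports which characters of a candidate label are unassigned in that version. The helper names are hypothetical, and the check covers only code point assignment, not the normalization or matching changes mentioned above.

```python
import unicodedata


def assigned_in_unicode_3_2(ch):
    """True if ch has an assigned code point in the Unicode 3.2 database."""
    # General category 'Cn' means "not assigned" in that version.
    return unicodedata.ucd_3_2_0.category(ch) != "Cn"


def characters_newer_than_3_2(label):
    """List characters that a 3.2-based IDNA table could not include."""
    return [ch for ch in label if not assigned_in_unicode_3_2(ch)]


print(unicodedata.unidata_version)             # version used by this Python
print(characters_newer_than_3_2("caf\u00e9"))  # Latin text, present in 3.2
print(characters_newer_than_3_2("\u07ca"))     # N'Ko letter, added after 3.2
```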
1405 4.3. Combining Characters and Character Components
1407 One thing that increases IDNA complexity and the need for
1408 normalization is that combining characters are permitted. Without
1409 them, complexity might be reduced enough to permit easier
1410 transitions to new versions. The community should consider whether
1411 combining characters should be prohibited entirely from IDNs. A
1412 consequence of this, of course, is that each new language or script,
1413 and several existing ones, would require that all of its characters
1414 have Unicode assignments to specific, precomposed, code points.
1416 Note that this is not currently permitted within Unicode for Roman-
1417 based scripts. For non-Roman scripts, some such code points have
1418 been defined. The decisions that govern the assignment of such code
1419 points are managed entirely within the Unicode Consortium. Were the
1420 IETF to choose to reduce IDNA complexity by excluding combining
1421 characters, no doubt there would be additional input to the Unicode
1422 Consortium from users and proponents of scripts requiring composing
1423 characters. The IAB and the IETF should examine whether it is
1424 appropriate to press the Unicode Consortium to revise these policies
1425 or otherwise to recommend actions that would reduce the need for
1426 normalization and the related complexities. However, we have been
1427 told that the Unicode Technical Committee does not believe it is reasonable
1428 or feasible to add all possible precomposed characters to Unicode.
1429 If Unicode cannot be modified to contain the precomposed characters
1430 necessary to support existing languages and scripts, much less new
1431 ones, this option for IDN restrictions will not be feasible.
1433 Retaining combining characters without further global restrictions
1434 may leave us "stuck" at Unicode 3.2, leading either to
1435 incompatibilities in applications that otherwise use a
1436 modern version of Unicode (while IDN remains at Unicode 3.2) or to
1437 painful transitions to new versions.
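
The dependence on precomposed code points discussed in this section can be seen in a small sketch that uses Python's standard unicodedata module. It applies NFC normalization and then tests whether any combining marks remain. The helper names are hypothetical, and "combining mark" is approximated here by general category M*, which is an assumption rather than a definition taken from IDNA.

```python
import unicodedata


def has_combining_marks(text):
    """True if any character is a combining mark (general category M*)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)


def precompose(text):
    """Apply NFC so sequences collapse to precomposed code points if defined."""
    return unicodedata.normalize("NFC", text)


e_acute = precompose("e\u0301")  # 'e' + COMBINING ACUTE ACCENT
q_acute = precompose("q\u0301")  # 'q' + COMBINING ACUTE ACCENT

print(len(e_acute), has_combining_marks(e_acute))
print(len(q_acute), has_combining_marks(q_acute))
```

The first sequence collapses to a single precomposed code point, while the second has no precomposed form and keeps its combining mark, so a rule prohibiting combining characters would exclude it.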
1439 4.4. Role and Uses of the DNS
1441 We wish to remind the community that there are boundaries to the
1442 appropriate uses of the DNS. It was designed and implemented to
1443 serve some specific purposes. There are additional things that it
1444 does well, other things that it does badly, and still other things it
1445 cannot do at all. No amount of protocol work on IDNs will solve
1446 problems with alternate spellings, near-matches, searching for
1447 appropriate names, and so on. Registration restrictions and
1448 carefully-designed user interfaces can be used to reduce the risk and
1449 pain when attempts to do some of these things go wrong, as well as
1450 the risks of various sorts of deliberate bad behavior, but,
1451 beyond a certain point, use of the DNS simply because it is available
1452 becomes a bad tradeoff. The tradeoff may be particularly unfortunate
1461 when the use of IDNs does not actually solve the proposed problem.
1462 For example, internationalization of DNS names does not eliminate the
1463 ASCII protocol identifiers and structure of URIs [RFC3986] and even
1464 IRIs [RFC3987]. Hence, DNS internationalization itself, at any or
1465 all levels of the DNS tree, is not a sufficient response to the
1466 desire of populations to use the Internet entirely in their own
1467 languages and the characters associated with those languages.
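
A short sketch may make the point about URIs concrete. It uses Python's standard urllib.parse module and the built-in "idna" codec, which implements the IDNA conversion of [RFC3490]. The URL and the helper name are hypothetical examples. Only the host component is affected by IDNA; the scheme name and the delimiters remain ASCII.

```python
from urllib.parse import urlsplit


def to_ascii_host(host):
    """Convert a host name to its ASCII-compatible form with the idna codec."""
    return host.encode("idna").decode("ascii")


# Hypothetical URL with a non-ASCII host label.
url = "http://b\u00fccher.example/katalog"
parts = urlsplit(url)
ascii_host = to_ascii_host(parts.hostname)

print(parts.scheme)  # the protocol identifier is still ASCII
print(ascii_host)    # the host as it would be looked up in the DNS
```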
1469 These issues are discussed at greater length, and alternatives
1470 presented, in [RFC2825], [RFC3467], [INDNS], and [DNS-Choices].
1472 4.5. Databases of Registered Names
1474 In addition to their presence in the DNS, IDNs introduce issues in
1475 other contexts in which domain names are used. In particular, the
1476 design and content of databases that bind registered names to
1477 information about the registrant (commonly described as "whois"
1478 databases) will require review and updating. For example, the whois
1479 protocol itself [RFC3912] is ASCII-only: with a conforming
1480 implementation of the whois protocol, one cannot search for, or
1481 report, either a DNS name or contact information that is not in ASCII
1482 characters. This may provide some additional impetus for a switch
1483 to IRIS [RFC3981] [RFC3982] but also raises a number of other
1484 questions about what information, and in what languages and scripts,
1485 should be included or permitted in such databases.
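
To illustrate the constraint, the sketch below builds the single query line that a whois client sends, which [RFC3912] terminates with CR LF. Converting a non-ASCII domain name to its ASCII-compatible form first is an assumption made for this example, not a practice this document recommends, and it does nothing for non-ASCII contact information. The function name is hypothetical.

```python
def whois_query_line(domain_name):
    """Build an ASCII-only whois query line for a domain name."""
    # The idna codec leaves ASCII names unchanged and converts
    # non-ASCII labels to their "xn--" form.
    ascii_name = domain_name.encode("idna")
    # The query is one line of text terminated by CR LF.
    return ascii_name + b"\r\n"


print(whois_query_line("example.com"))
print(whois_query_line("b\u00fccher.example"))
```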
1488 5. Security Considerations
1490 This document is simply a discussion of IDNs and IDN issues; it
1491 raises no new security concerns. However, if some of its
1492 recommendations to reduce IDNA complexity, the number of available
1493 characters, and various approaches to constraining the use of
1494 confusable characters, are followed and prove successful, the risks
1495 of name spoofing and other problems may be reduced.
1500 The contributions to this report from members of the IAB-IDN ad hoc
1501 committee are gratefully acknowledged. Of course, not all of the
1502 members of that group endorse every comment and suggestion of this
1503 report. In particular, this report does not claim to reflect the
1504 views of the Unicode Consortium as a whole or those of particular
1505 participants in the work of that Consortium. The members of the ad
1508 Rob Austein, Leslie Daigle, Tina Dam, Mark Davis, Patrik Faltstrom,
1517 Scott Hollenbeck, Cary Karp, John Klensin, Gervase Markham, David
1518 Meyer, Thomas Narten, Michel Suignard, Sam Weiler, Bert Wijnen, Kurt
1519 Zeilenga and Lixia Zhang.
1521 Special thanks are due to Cary Karp and Tina Dam for contributions of
1522 considerable specific text, to Marcos Sanz and Paul Hoffman for
1523 careful late-stage reading and extensive comments, and to Pete
1524 Resnick for many contributions and comments, both in conjunction with
1525 his former IAB service and subsequently.
1527 Members of the IAB at the time of approval of this document were:
1528 [[anchor39: To be supplied]]
1533 [[anchor41: RFC Editor: this section is to be removed before
1536 7.1. Changes for version -01
1538 1. Added discussion and reference to Unicode PR-29
1539 2. Replaced the discussion of the ICANN Guidelines (with thanks to
1540 Tina Dam and Cary Karp).
1541 3. Revised the Bidi text to make the potential recommendation more
1543 4. Removed any claims (actual or implied) of endorsement by the
1544 members of the ad hoc committee.
1545 5. Several small editorial changes, etc.
1547 7.2. Changes for version -02
1549 1. Added some additional references, e.g., to W3C
1550 internationalization work and to UTR39.
1551 2. Adjusted some terminology to correct errors and avoid unnecessary
1553 3. Extended the discussion of related characters in Swedish and
1554 Norwegian to clarify at least one of the possibilities.
1555 4. Introduced new Section 4.5 to discuss IDN issues in other than
1556 the DNS itself and point to IRIS.
1557 5. Rewrote the introduction to the "problem" section and its first
1559 6. Small changes made to the "definitions" section including
1560 explaining why "multilingual" is there and rewriting the "script"
1561 definition to clarify slightly and put the example script names
1562 into alphabetical order.
1573 7. Section 3.2.3 has been fairly extensively rewritten for clarity,
1574 and a large number of less extensive clarifications have been
1575 made, although no substantive changes have been (intentionally)
1578 7.3. Changes for version -03
1580 1. Made a number of further tuning changes to better reflect the
1581 role of the document and corrected several references.
1582 2. Removed the reference to Vietnamese.
1583 3. Added a discussion of IDNA versioning and new prefixes.
1585 7.4. Changes for version -04
1587 1. Corrected many small typographical and editorial errors.
1588 2. Clarified that elimination of non-language characters was not
1589 intended to eliminate digits.
1591 7.5. Changes for version -05
1593 1. Revised section 4.3 to further clarify the suggestion.
1594 2. Revised the Acknowledgments section.
1599 8.1. Normative References
1602 International Organization for Standardization,
1603 "Information Technology - Universal Multiple-Octet Coded
1604 Character Set (UCS) - Part 1: Architecture and Basic
1605 Multilingual Plane", ISO/IEC 10646-1:2000, October 2000.
1607 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
1608 Internationalized Strings ("stringprep")", RFC 3454,
1611 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
1612 "Internationalizing Domain Names in Applications (IDNA)",
1613 RFC 3490, March 2003.
1615 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1616 Profile for Internationalized Domain Names (IDN)",
1617 RFC 3491, March 2003.
1619 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
1620 for Internationalized Domain Names in Applications
1629 (IDNA)", RFC 3492, March 2003.
1632 The Unicode Consortium, "The Unicode Standard, Version
1635 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5).
1636 Version 3.2 consists of the definition in that book as
1637 amended by the Unicode Standard Annex #27: Unicode 3.1
1638 (http://www.unicode.org/reports/tr27/) and by the Unicode
1639 Standard Annex #28: Unicode 3.2
1640 (http://www.unicode.org/reports/tr28/).
1642 8.2. Informative References
1645 Faltstrom, P., "Design Choices When Expanding DNS",
1646 draft-iab-dns-choices-02 (work in progress), June 2005.
1648 [ICANNv1] ICANN, "Guidelines for the Implementation of
1649 Internationalized Domain Names, Version 1.0", March 2003,
1650 <http://www.icann.org/general/idn-guidelines-20jun03.htm>.
1652 [ICANNv2] ICANN, "Guidelines for the Implementation of
1653 Internationalized Domain Names, Version 2.0",
1655 <http://www.icann.org/general/idn-guidelines-20sep05.htm>.
1658 Internet Engineering Steering Group (IESG), "IESG
1659 Statement on IDN", IESG Statements IDN Statement,
1661 <http://www.ietf.org/IESG/STATEMENTS/IDNstatement.txt>.
1663 [INDNS] National Research Council, "Signposts in Cyberspace: The
1664 Domain Name System and Internet Navigation", National
1665 Academy Press ISBN 0309-09640-5 (Book) 0309-54979-5 (PDF),
1667 <http://www7.nationalacademies.org/cstb/pub_dns.html>.
1670 International Organization for Standardization,
1671 "Information Processing: ISO 7-bit and 8-bit coded
1672 character sets: Code extension techniques", ISO Standard
1676 International Organization for Standardization,
1685 "Information technology - ISO 7-bit coded character set
1686 for information interchange", ISO Standard 646, 1991.
1689 International Organization for Standardization,
1690 "Information processing - 8-bit single-byte coded graphic
1691 character sets - Part 1: Latin alphabet No. 1 (1998) -
1692 Part 2: Latin alphabet No. 2 (1999) - Part 3: Latin
1693 alphabet No. 3 (1999) - Part 4: Latin alphabet No. 4
1694 (1998) - Part 5: Latin/Cyrillic alphabet (1999) - Part 6:
1695 Latin/Arabic alphabet (1999) - Part 7: Latin/Greek
1696 alphabet (2003) - Part 8: Latin/Hebrew alphabet (1999) -
1697 Part 9: Latin alphabet No. 5 (1999) - Part 10: Latin
1698 alphabet No. 6 (1998) - Part 11: Latin/Thai alphabet
1699 (2001) - Part 13: Latin alphabet No. 7 (1998) - Part 14:
1700 Latin alphabet No. 8 (Celtic) (1998) - Part 15: Latin
1701 alphabet No. 9 (1999) - Part 16: Latin alphabet
1702 No. 10 (2001)", ISO Standard 8859, 2003.
1704 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
1705 Languages", BCP 18, RFC 2277, January 1998.
1707 [RFC2825] IAB and L. Daigle, "A Tangled Web: Issues of I18N, Domain
1708 Names, and the Other Internet protocols", RFC 2825,
1711 [RFC3066] Alvestrand, H., "Tags for the Identification of
1712 Languages", BCP 47, RFC 3066, January 2001.
1714 [RFC3467] Klensin, J., "Role of the Domain Name System (DNS)",
1715 RFC 3467, February 2003.
1717 [RFC3536] Hoffman, P., "Terminology Used in Internationalization in
1718 the IETF", RFC 3536, May 2003.
1720 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
1721 Engineering Team (JET) Guidelines for Internationalized
1722 Domain Names (IDN) Registration and Administration for
1723 Chinese, Japanese, and Korean", RFC 3743, April 2004.
1725 [RFC3912] Daigle, L., "WHOIS Protocol Specification", RFC 3912,
1728 [RFC3981] Newton, A. and M. Sanz, "IRIS: The Internet Registry
1729 Information Service (IRIS) Core Protocol", RFC 3981,
1732 [RFC3982] Newton, A. and M. Sanz, "IRIS: A Domain Registry (dreg)
1741 Type for the Internet Registry Information Service
1742 (IRIS)", RFC 3982, January 2005.
1744 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1745 Resource Identifier (URI): Generic Syntax", STD 66,
1746 RFC 3986, January 2005.
1748 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
1749 Identifiers (IRIs)", RFC 3987, January 2005.
1751 [RFC4185] Klensin, J., "National and Local Characters for DNS Top
1752 Level Domain (TLD) Names", RFC 4185, October 2005.
1754 [RFC4290] Klensin, J., "Suggested Practices for Registration of
1755 Internationalized Domain Names (IDN)", RFC 4290,
1758 [UTR] Unicode Consortium, "Unicode Technical Reports",
1759 <http://www.unicode.org/reports/>.
1761 [UTR36] Davis, M. and M. Suignard, "Unicode Technical Report #36:
1762 Unicode Security Considerations", November 2005,
1763 <http://www.unicode.org/draft/reports/tr36/tr36.html>.
1765 Working Draft for Proposed Update
1767 [UTR39] Davis, M. and M. Suignard, "Unicode Technical Standard #39
1768 (proposed): Unicode Security Considerations", July 2005,
1769 <http://www.unicode.org/draft/reports/tr39/tr39.html>.
1771 Working Draft for Proposed Draft
1774 The Unicode Consortium, "Public Review Issue #29:
1775 Normalization Issue", Unicode PR 29, February 2004.
1778 The Unicode Consortium, "The Unicode Standard, Version
1782 Ishida, R. and S. Miller, "Localization vs.
1783 Internationalization", W3C International/questions/
1784 qa-i18n.txt, December 2005.
1787 Ewell, D., Ed., "Initial Language Subtag Registry",
1788 draft-ietf-ltru-initial-06 (work in progress),
1799 This document is awaiting publication as an Informational
1803 Phillips, A., Ed. and M. Davis, Ed., "Tags for Identifying
1804 Languages", draft-ietf-ltru-registry-14 (work in
1805 progress), October 2004.
1807 This document has been approved as a Proposed Standard and
1808 is awaiting publication as an RFC.
1856 1770 Massachusetts Ave, #322
1860 Phone: +1 617 491 5735
1861 Email: john-ietf@jck.com
1867 Email: paf@cisco.com
1909 Intellectual Property Statement
1911 The IETF takes no position regarding the validity or scope of any
1912 Intellectual Property Rights or other rights that might be claimed to
1913 pertain to the implementation or use of the technology described in
1914 this document or the extent to which any license under such rights
1915 might or might not be available; nor does it represent that it has
1916 made any independent effort to identify any such rights. Information
1917 on the procedures with respect to rights in RFC documents can be
1918 found in BCP 78 and BCP 79.
1920 Copies of IPR disclosures made to the IETF Secretariat and any
1921 assurances of licenses to be made available, or the result of an
1922 attempt made to obtain a general license or permission for the use of
1923 such proprietary rights by implementers or users of this
1924 specification can be obtained from the IETF on-line IPR repository at
1925 http://www.ietf.org/ipr.
1927 The IETF invites any interested party to bring to its attention any
1928 copyrights, patents or patent applications, or other proprietary
1929 rights that may cover technology that may be required to implement
1930 this standard. Please address the information to the IETF at
1934 Disclaimer of Validity
1936 This document and the information contained herein are provided on an
1937 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1938 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1939 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1940 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1941 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1942 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1947 Copyright (C) The Internet Society (2006). This document is subject
1948 to the rights, licenses and restrictions contained in BCP 78, and
1949 except as set forth therein, the authors retain all their rights.
1954 Funding for the RFC Editor function is currently provided by the