1 Network Working Group J. Klensin
2 Request for Comments: 4690 P. Faltstrom
3 Category: Informational Cisco Systems
4 C. Karp
5 Swedish Museum of Natural History
6 IAB
7 September 2006
8
9
10 Review and Recommendations for Internationalized Domain Names (IDNs)
11
12 Status of This Memo
13
14 This memo provides information for the Internet community. It does
15 not specify an Internet standard of any kind. Distribution of this
16 memo is unlimited.
17
18 Copyright Notice
19
20 Copyright (C) The Internet Society (2006).
21
22 Abstract
23
24 This note describes issues raised by the deployment and use of
25 Internationalized Domain Names. It describes problems both at the
26 time of registration and for use of those names in the DNS. It
recommends that the IETF update the RFCs relating to IDNs, proposes a
framework to be followed in doing so, and summarizes and identifies
some work that is required outside the IETF. In
30 particular, it proposes that some changes be investigated for the
31 Internationalizing Domain Names in Applications (IDNA) standard and
32 its supporting tables, based on experience gained since those
33 standards were completed.
34
35 Table of Contents
36
37 1. Introduction ....................................................3
38 1.1. The Role of IDNs and This Document .........................3
39 1.2. Status of This Document and Its Recommendations ............4
40 1.3. The IDNA Standard ..........................................4
41 1.4. Unicode Documents ..........................................5
42 1.5. Definitions ................................................5
43 1.5.1. Language ............................................6
44 1.5.2. Script ..............................................6
45 1.5.3. Multilingual ........................................6
46 1.5.4. Localization ........................................7
47 1.5.5. Internationalization ................................7
48
49
50
51
52 Klensin, et al. Informational [Page 1]
53 RFC 4690 IAB -- IDN Next Steps September 2006
54
55
56 1.6. Statements and Guidelines ..................................7
57 1.6.1. IESG Statement ......................................8
58 1.6.2. ICANN Statements ....................................8
59 2. General Problems and Issues ....................................11
60 2.1. User Conceptions, Local Character Sets, and Input issues ..11
61 2.2. Examples of Issues ........................................13
62 2.2.1. Language-Specific Character Matching ...............13
63 2.2.2. Multiple Scripts ...................................13
64 2.2.3. Normalization and Character Mappings ...............14
65 2.2.4. URLs in Printed Form ...............................16
66 2.2.5. Bidirectional Text .................................17
67 2.2.6. Confusable Character Issues ........................17
68 2.2.7. The IESG Statement and IDNA issues .................19
69 3. Migrating to New Versions of Unicode ...........................20
70 3.1. Versions of Unicode .......................................20
71 3.2. Version Changes and Normalization Issues ..................21
72 3.2.1. Unnormalized Combining Sequences ...................21
73 3.2.2. Combining Characters and Character Components ......22
74 3.2.3. When does normalization occur? .....................23
75 4. Framework for Next Steps in IDN Development ....................24
76 4.1. Issues within the Scope of the IETF .......................24
77 4.1.1. Review of IDNA .....................................24
78 4.1.2. Non-DNS and Above-DNS Internationalization
79 Approaches .........................................25
80 4.1.3. Security Issues, Certificates, etc. ................25
81 4.1.4. Protocol Changes and Policy Implications ...........27
82 4.1.5. Non-US-ASCII in Local Part of Email Addresses ......27
83 4.1.6. Use of the Unicode Character Set in the IETF .......27
84 4.2. Issues That Fall within the Purview of ICANN ..............28
85 4.2.1. Dispute Resolution .................................28
86 4.2.2. Policy at Registries ...............................28
87 4.2.3. IDNs at the Top Level of the DNS ...................29
88 5. Specific Recommendations for Next Steps ........................29
89 5.1. Reduction of Permitted Character List .....................29
90 5.1.1. Elimination of All Non-Language Characters .........30
91 5.1.2. Elimination of Word-Separation Punctuation .........30
92 5.2. Updating to New Versions of Unicode .......................30
93 5.3. Role and Uses of the DNS ..................................31
94 5.4. Databases of Registered Names .............................31
95 6. Security Considerations ........................................31
96 7. Acknowledgements ...............................................32
97 8. References .....................................................32
98 8.1. Normative References ......................................32
99 8.2. Informative References ....................................33
100
109
110
111 1. Introduction
112
113 1.1. The Role of IDNs and This Document
114
115 While IDNs have been advocated as the solution to a wide range of
116 problems, this document is written from the perspective that they are
117 no more and no less than DNS names, reflecting the same requirements
118 for use, stability, and accuracy as traditional "hostnames", but
119 using a much larger collection of permitted characters. In
120 particular, while IDNs represent a step toward an Internet that is
121 equally accessible from all languages and scripts, they, at best,
122 address only a small part of that very broad objective. There has
123 been controversy since IDNs were first suggested about how important
124 they will actually turn out to be; that controversy will probably
125 continue. Accessibility from all languages is an important
126 objective, hence it is important that our standards and definitions
127 for IDNs be smoothly adaptable to additional scripts as they are
128 added to the Unicode character set.
129
130 The utility of IDNs must be evaluated in terms of their application
131 by users and in protocols: the ability to simply put a name into the
132 DNS and retrieve it is not, in and of itself, important. From this
133 point of view, IDNs will be useful and effective if they provide
134 stable and predictable references -- references that are no less
135 stable and predictable, and no less secure, than their ASCII
136 counterparts.
137
138 This combination of objectives and criteria has proven very difficult
139 to satisfy. Experience in developing the IDNA standard and during
140 the initial years of its implementation and deployment suggests that
141 it may be impossible to fully satisfy all of them and that
142 engineering compromises are needed to yield a result that is
143 workable, even if not completely satisfactory. Based on that
144 experience and issues that have been raised, it is now appropriate to
145 review some of the implications of IDNs, the decisions made in
146 defining them, and the foundation on which they rest and determine
147 whether changes are needed and, if so, which ones.
148
149 The design of the DNS itself imposes some additional constraints. If
150 the DNS is to remain globally interoperable, there are specific
151 characteristics that no implementation of IDNs, or the DNS more
152 generally, can change. For example, because the DNS is a global
hierarchical administrative namespace with only a single name at any
154 given node, there is one and only one owner of each domain name.
155 Also, when strings are looked up in the DNS, positive responses can
156 only reflect exact matches: if there is no exact match, then one gets
157 an error reply, not a list of near matches or other supplemental
158 information. Searches and approximate matchings are not possible.
166 Finally, because the DNS is a distributed system where any server
167 might cache responses, and later use those cached responses to
168 attempt to satisfy queries before a global lookup is done, every
169 server must use the same matching criteria.
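
The exact-match behavior described above can be illustrated with a
toy model. This is only a sketch: the ZONE table, its single entry,
and the lookup() helper are hypothetical and stand in for real zone
data and a real resolver.

```python
# Toy model of one zone's data: owner name -> address (hypothetical data).
ZONE = {"www.example.com": "192.0.2.1"}


def lookup(name: str) -> str:
    """Return the record for an exact match; otherwise signal an error."""
    # The DNS compares ASCII letters case-insensitively, so fold only A-Z.
    key = "".join(c.lower() if "A" <= c <= "Z" else c for c in name)
    if key not in ZONE:
        # No list of near matches is offered, only a failure.
        raise KeyError("NXDOMAIN: " + name)
    return ZONE[key]
```

A query for "WWW.Example.COM" succeeds, while "ww.example.com" fails
outright. If two servers folded or compared names differently, a
cached answer from one could contradict the other, which is why the
matching criteria must be identical everywhere.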
170
171 1.2. Status of This Document and Its Recommendations
172
173 This document reviews the IDN landscape from an IETF perspective and
174 presents the recommendations and conclusions of the IAB, based
175 partially on input from an ad hoc committee charged with reviewing
176 IDN issues and the path forward (see Section 7). Its recommendations
177 are advice to the IETF, or in a few cases to other bodies, for topics
178 to be investigated and actions to be taken if those bodies, after
179 their examinations, consider those actions appropriate.
180
181 1.3. The IDNA Standard
182
183 During 2002, the IETF completed the following RFCs that, together,
184 define IDNs:
185
186 RFC 3454 Preparation of Internationalized Strings ("Stringprep")
187 [RFC3454].
188 Stringprep is a generic mechanism for taking a Unicode string and
189 converting it into a canonical format. Stringprep itself is just
190 a collection of rules, tables, and operations. Any protocol or
191 algorithm that uses it must define a "Stringprep profile", which
192 specifies which of those rules are applied, how, and with which
193 characteristics.
194
195 RFC 3490 Internationalizing Domain Names in Applications (IDNA)
196 [RFC3490].
197 IDNA is the base specification in this group. It specifies that
198 Nameprep is used as the Stringprep profile for domain names, and
199 that Punycode is the relevant encoding mechanism for use in
200 generating an ASCII-compatible ("ACE") form of the name. It also
201 applies some additional conversions and character filtering that
202 are not part of Nameprep.
203
204 RFC 3491 Nameprep: A Stringprep Profile for Internationalized Domain
205 Names (IDN) [RFC3491].
206 Nameprep is designed to meet the specific needs of IDNs and, in
207 particular, to support case-folding for scripts that support what
208 are traditionally known as upper- and lowercase forms of the same
209 letters. The result of the Nameprep algorithm is a string
210 containing a subset of the Unicode Character set, normalized and
211 case-folded so that case-insensitive comparison can be made.
212
213
221 RFC 3492 Punycode: A Bootstring encoding of Unicode for
222 Internationalized Domain Names in Applications (IDNA) [RFC3492].
223 Punycode is a mechanism for encoding a Unicode string in ASCII
characters. The characters used are the same as the subset of
225 characters that are allowed in the hostname definition of DNS,
226 i.e., the "letter, digit, and hyphen" characters, sometimes known
227 as "LDH".
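
How the four specifications fit together can be seen with the codecs
in the Python standard library, whose "idna" codec implements the
2003 IDNA RFCs listed above. This is an illustrative sketch only;
the helper names to_ace() and is_ldh() and the sample name are
assumptions made for this example.

```python
import re

# "Letter, digit, hyphen" check for one label (no leading/trailing hyphen).
LDH_LABEL = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")


def to_ace(name: str) -> str:
    """Apply Nameprep, Punycode, and the ACE prefix, label by label."""
    return name.encode("idna").decode("ascii")


def is_ldh(label: str) -> bool:
    """True if the label uses only LDH characters."""
    return LDH_LABEL.match(label) is not None


# Punycode alone: the ASCII characters, a delimiter, then the encoded rest.
raw = "b\u00fccher".encode("punycode").decode("ascii")   # "bcher-kva"

# Full conversion: case-folded by Nameprep, then encoded and prefixed.
ace = to_ace("B\u00fccher.example")                       # "xn--bcher-kva.example"
```

The uppercase "B" disappears because Nameprep case-folds the label
before Punycode is applied, and every label of the result passes the
LDH check, so it can be stored in the DNS like any other hostname.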
228
229 1.4. Unicode Documents
230
231 Unicode is used as the base, and defining, character set for IDNs.
232 Unicode is standardized by the Unicode Consortium, and synchronized
233 with ISO to create ISO/IEC 10646 [ISO10646]. At the time the RFCs
234 mentioned earlier were created, Unicode was at Version 3.2. For
235 reasons explained later, it was necessary to pick a particular,
236 then-current, version of Unicode when IDNA was adopted.
237 Consequently, the RFCs are explicitly dependent on Unicode Version
238 3.2 [Unicode32]. There is, at present, no established mechanism for
239 modifying the IDNA RFCs to use newer Unicode versions (see
240 Section 3.1).
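
The effect of pinning IDNA to Unicode 3.2 can be observed in Python,
whose standard library keeps a frozen copy of the 3.2 character
database for exactly this purpose. The sketch below assumes a Python
3 interpreter; the choice of an N'Ko letter (a script encoded in
Unicode 5.0) as the sample character is ours.

```python
import stringprep
import unicodedata

NKO_LETTER_A = "\u07ca"   # N'Ko was added to Unicode after Version 3.2

# General category according to the interpreter's current Unicode tables.
current = unicodedata.category(NKO_LETTER_A)              # "Lo" (a letter)

# General category according to the frozen Unicode 3.2 tables.
frozen = unicodedata.ucd_3_2_0.category(NKO_LETTER_A)     # "Cn" (unassigned)

# Stringprep table A.1 lists code points unassigned in Unicode 3.2.
unassigned_in_32 = stringprep.in_table_a1(NKO_LETTER_A)   # True
```

A letter that is perfectly ordinary in a current Unicode version is
therefore an unassigned code point as far as the 2003 IDNA tables are
concerned.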
241
242 Unicode is a very large and complex character set. (The term
243 "character set" or "charset" is used in a way that is peculiar to the
244 IETF and may not be the same as the usage in other bodies and
245 contexts.) The Unicode Standard and related documents are created
246 and maintained by the Unicode Technical Committee (UTC), one of the
247 committees of the Unicode Consortium.
248
249 The Consortium first published The Unicode Standard [Unicode10] in
250 1991, and continues to develop standards based on that original work.
251 Unicode is developed in conjunction with the International
252 Organization for Standardization, and it shares its character
253 repertoire with ISO/IEC 10646. Unicode and ISO/IEC 10646 function
254 equivalently as character encodings, but The Unicode Standard
255 contains much more information for implementers, covering -- in depth
256 -- topics such as bitwise encoding, collation, and rendering. The
257 Unicode Standard enumerates a multitude of character properties,
258 including those needed for supporting bidirectional text. The
259 Unicode Consortium and ISO standards do use slightly different
260 terminology.
261
262 1.5. Definitions
263
264 The following terms and their meanings are critical to understanding
265 the rest of this document and to discussions of IDNs more generally.
266 These terms are derived from [RFC3536], which contains additional
267 discussion of some of them.
268
276 1.5.1. Language
277
278 A language is a way that humans interact. The use of language occurs
279 in many forms, including speech, writing, and signing.
280
281 Some languages have a close relationship between the written and
282 spoken forms, while others have a looser relationship. RFC 3066
283 [RFC3066] discusses languages in more detail and provides identifiers
284 for languages for use in Internet protocols. Computer languages are
285 explicitly excluded from this definition. The most recent IETF work
286 in this area, and on script identification (see below), is documented
287 in [RFC4645] and [RFC4646].
288
289 1.5.2. Script
290
291 A script is a set of graphic characters used for the written form of
292 one or more languages. This definition is the one used in
293 [ISO10646].
294
295 Examples of scripts are Arabic, Cyrillic, Greek, Han (the so-called
296 ideographs used in writing Chinese, Japanese, and Korean), and
297 "Latin". Arabic, Greek, and Latin are, of course, also names of
298 languages.
299
300 Historically, the script that is known as "Latin" in Unicode and most
301 contexts associated with information technology standards is known in
302 the linguistic community as "Roman" or "Roman-derived". The latter
303 terminology distinguishes between the Latin language and the
304 characters used to write it, especially in Republican times, from the
305 much richer and more decorated script derived and adapted from those
306 characters. Since IDNA is defined using Unicode and that standard
307 used the term "LATIN" in its character names and descriptions, that
308 terminology will be used in this document as well except when
309 "Roman-derived" is needed for clarity. However, readers approaching
310 this document from a cultural or linguistic standpoint should be
311 aware that the use of, or references to, "Latin script" in this
312 document refers to the entire collection of Roman-derived characters,
313 not just the characters used to write the Latin language. Some other
314 issues with script identification and relationships with other
315 standards are discussed in [RFC4646].
316
317 1.5.3. Multilingual
318
319 The term "multilingual" has many widely-varying definitions and thus
320 is not recommended for use in standards. Some of the definitions
321 relate to the ability to handle international characters; other
322 definitions relate to the ability to handle multiple charsets; and
323 still others relate to the ability to handle multiple languages.
324
331 While this term has been deprecated for IETF-related uses and does
332 not otherwise appear in this document, a discussion here seemed
333 appropriate since the term is still widely used in some discussions
334 of IDNs.
335
336 1.5.4. Localization
337
338 Localization is the process of adapting an internationalized
339 application platform or application to a specific cultural
340 environment. In localization, the same semantics are preserved while
341 the syntax or presentation forms may be changed.
342
343 Localization is the act of tailoring an application for a different
344 language or script or culture. Some internationalized applications
345 can handle a wide variety of languages. Typical users understand
346 only a small number of languages, so the program must be tailored to
347 interact with users in just the languages they know.
348
349 Somewhat different definitions for localization and
350 internationalization (see below) are used by groups other than the
351 IETF. See [W3C-Localization] for one example.
352
353 1.5.5. Internationalization
354
355 In the IETF, the term "internationalization" is used to describe
356 adding or improving the handling of non-ASCII text in a protocol.
357 Other bodies use the term in other ways, often with subtle variation
358 in meaning. The term "internationalization" is often abbreviated
359 "i18n" (and localization as "l10n").
360
361 Many protocols that handle text only handle the characters associated
362 with one script (often, a subset of the characters used in writing
363 English text), or leave the question of what character set is used up
364 to local guesswork (which leads to interoperability problems).
365 Adding non-ASCII text to such a protocol allows the protocol to
366 handle more scripts, with the intention of being able to include all
367 of the scripts that are useful in the world. It is naive (sic) to
368 believe that all English words can be written in ASCII, various
369 mythologies notwithstanding.
370
371 1.6. Statements and Guidelines
372
373 When the IDNA RFCs were published, the IESG and ICANN made statements
374 that were intended to guide deployment and future work. In recent
375 months, ICANN has updated its statement and others have also made
376 contributions. It is worth noting that the quality of understanding
377 of internationalization issues as applied to the DNS has evolved
386 considerably over the last few years. Organizations that took
387 specific positions a year or more ago might not make exactly the same
388 statements today.
389
390 1.6.1. IESG Statement
391
392 The IESG made a statement on IDNA [IESG-IDN]:
393
394 IDNA, through its requirement of Nameprep [RFC3491], uses
395 equivalence tables that are based only on the characters
396 themselves; no attention is paid to the intended language (if any)
397 for the domain name. However, for many domain names, the intended
398 language of one or more parts of the domain name actually does
399 matter to the users.
400
401 Similarly, many names cannot be presented and used without
402 ambiguity unless the scripts to which their characters belong are
403 known. In both cases, this additional information should be of
404 concern to the registry.
405
406 The statement is longer than this, but these paragraphs are the
407 important ones. The rest of the statement consists of explanations
408 and examples.
409
410 1.6.2. ICANN Statements
411
412 1.6.2.1. Initial ICANN Guidelines
413
414 Soon after the IDNA standards were adopted, ICANN produced an initial
415 version of its "IDN Guidelines" [ICANNv1]. This document was
416 intended to serve two purposes. The first was to provide a basis for
417 releasing the Generic Top Level Domain (gTLD) registries that had
418 been established by ICANN from a contractual restriction on the
419 registration of labels containing hyphens in the third and fourth
420 positions. The second was to provide a general framework for the
421 development of registry policies for the implementation of IDNs.
422
423 One of the key components of this framework prescribed strict
424 compliance with RFCs 3490, 3491, and 3492. With the framework, ICANN
425 specified that IDNA was to be the sole mechanism to be used in the
426 DNS to represent IDNs.
427
428 Limitations on the characters available for inclusion in IDNs were
429 mandated by two mechanisms. The first was by requiring an
430 "inclusion-based approach (meaning that code points that are not
431 explicitly permitted by the registry are prohibited) for identifying
432 permissible
441 code points from among the full Unicode repertoire." The second
442 mechanism required the association of every IDN with a specific
443 language, with additional policies also being language based:
444
445 "In implementing the IDN standards, top-level domain registries will
446 (a) associate each registered internationalized domain name with one
447 language or set of languages,
448 (b) employ language-specific registration and administration rules
449 that are documented and publicly available, such as the reservation
450 of all domain names with equivalent character variants in the
451 languages associated with the registered domain name, and,
452 (c) where the registry finds that the registration and administration
453 rules for a given language would benefit from a character variants
454 table, allow registrations in that language only when an appropriate
455 table is available. ... In implementing the IDN standards, top-level
456 domain registries should, at least initially, limit any given domain
457 label (such as a second-level domain name) to the characters
458 associated with one language or set of languages only."
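
An inclusion-based, language-tagged check of the kind quoted above
might look like the following sketch. The two repertoires are
hypothetical and deliberately small; under the guidelines each
registry defined its own tables, which is the source of the
variation discussed next.

```python
# Hypothetical per-language repertoires; real tables are registry policy.
ASCII_LDH = "abcdefghijklmnopqrstuvwxyz0123456789-"
PERMITTED = {
    "sv": set(ASCII_LDH + "\u00e5\u00e4\u00f6"),   # a-ring, a-diaeresis, o-diaeresis
    "no": set(ASCII_LDH + "\u00e5\u00e6\u00f8"),   # a-ring, ae, o-stroke
}


def rejected_code_points(label: str, language: str) -> list[str]:
    """List code points in label not explicitly permitted for language."""
    allowed = PERMITTED[language]
    # Inclusion-based: anything absent from the table is prohibited.
    return [f"U+{ord(ch):04X}" for ch in label if ch not in allowed]
```

With these tables, a label containing U+00F8 is acceptable when
tagged as Norwegian but is rejected when tagged as Swedish.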
459
460 It was left to each TLD registry to define the character repertoire
461 it would associate with any given language. This led to significant
462 variation from registry to registry, with further heterogeneity in
463 the underlying language-based IDN policies. If the guidelines had
464 made provision for IDN policies also being based on script, a
465 substantial amount of the resulting ambiguity could have been
466 avoided. However, they did not, and the sequence of events leading
467 to the present review of IDNA was thus triggered.
468
469 1.6.2.2. ICANN Version 2 Guidelines
470
471 One of the responses of the TLD registries to what was widely
472 perceived as a crisis situation was to invoke the mechanism described
473 in the initial guidelines: "As the deployment of IDNs proceeds, ICANN
474 and the IDN registries will review these Guidelines at regular
475 intervals, and revise them as necessary based on experience."
476
477 The pivotal requirement was the modification of the guidelines to
478 permit script-based policies for IDNs. Further concern was expressed
479 about the need for realistically implementable mechanisms for the
480 propagation of TLD registry policies into the lower levels of their
481 name trees. In addition to the anticipated increase of constraint on
482 the protocol level, one obvious additional approach would be to
483 replace the guidelines by an instrument that itself had clear status
484 in the IETF's normative framework. A BCP was therefore seen as the
485 appropriate focus for longer-term effort. The most pressing issues
486 would be dealt with in the interim by incremental modification to the
487 guidelines, but no need was seen for the detailed further development
488 of those guidelines once that incremental modification was complete.
489
496 The outcome of this action was a version 2.0 of the guidelines
497 [ICANNv2], which was endorsed by the ICANN Board on November 8, 2005
498 for a period of nine months. The Board stated further that it "tasks
499 the IDN working group to continue its important work and return to
500 the board with specific IDN improvement recommendations before the
501 ICANN Meeting in Morocco" and "supports the working group's continued
502 action to reframe the guidelines completely in a manner appropriate
503 for further development as a Best Current Practices (BCP) document,
504 to ensure that the Guideline directions will be used deeper into the
505 DNS hierarchy and within TLD's where ICANN has a lesser policy
506 relationship."
507
508 Retaining the inclusion-based approach established in version 1.0,
509 the crucial addition to the policy framework is that:
510
511 "All code points in a single label will be taken from the same script
512 as determined by the Unicode Standard Annex #24: Script Names at
513 http://www.unicode.org/reports/tr24. Exception to this is
514 permissible for languages with established orthographies and
515 conventions that require the commingled use of multiple scripts. In
516 such cases, visually confusable characters from different scripts
517 will not be allowed to coexist in a single set of permissible
518 codepoints unless a corresponding policy and character table is
519 clearly defined."
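
The single-script rule quoted above can be approximated in a few
lines. Note the assumption: the authoritative script assignments are
in the data files of Unicode Standard Annex #24, which the Python
standard library does not expose, so this sketch falls back on the
first word of each character's Unicode name as a rough stand-in. The
function names are ours.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Approximate the set of scripts used by the letters of a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Heuristic only: "LATIN SMALL LETTER A" -> "LATIN".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_single_script(label: str) -> bool:
    """True if all letters of the label appear to share one script."""
    return len(scripts_in_label(label)) <= 1
```

A label such as "paypal" passes, while the same string with U+0430
(CYRILLIC SMALL LETTER A) substituted for the first "a" is reported
as mixing Latin and Cyrillic. Digits and the hyphen are ignored
because they are shared across scripts.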
520
521 Additionally:
522
523 "Permissible code points will not include: (a) line symbol-drawing
524 characters (as those in the Unicode Box Drawing block), (b) symbols
525 and icons that are neither alphanumeric nor ideographic language
526 characters, such as typographic and pictographic dingbats, (c)
527 characters with well-established functions as protocol elements, (d)
528 punctuation marks used solely to indicate the structure of
529 sentences."
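
Exclusions of this kind can be roughly approximated with Unicode
general categories. The mapping used below (treating every symbol
and punctuation category as excluded, with the hyphen kept) is an
assumption made for illustration; it is broader than the quoted
text and is not a rule taken from the guidelines themselves.

```python
import unicodedata


def excluded_code_points(label: str) -> list[str]:
    """List symbol and punctuation code points found in a label."""
    found = []
    for ch in label:
        category = unicodedata.category(ch)
        # S* = symbols (box drawing, dingbats); P* = punctuation.
        # The hyphen is punctuation too, but hostnames permit it.
        if category[0] in ("S", "P") and ch != "-":
            found.append(f"U+{ord(ch):04X} ({category})")
    return found
```

Letters, combining marks, and digits pass untouched, while U+2500
(a box drawing character) and "!" are reported.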
530
531 Attention has been called to several points that are not adequately
532 dealt with (if at all) in the version 2.0 guidelines but that ought
533 to be included in the policy framework without waiting for the
534 production and release of a document based on a "best practices"
535 model. The term "BCP" above does not necessarily refer to an IETF
536 consensus document.
537
538 The intention in November 2005 was for the recommended major revision
539 to be put to the ICANN Board prior to its meeting in Morocco (in late
540 June 2006), but for the changes to be collated incrementally and
541 appear in interim version 2.n releases of the guidelines. The IAB's
542 understanding is that, while there has been some progress with this,
551 other issues relating to IDNs subsequently diverted much of the
552 energy that was intended to be devoted to the more extensive
553 treatment of the guidelines.
554
555 2. General Problems and Issues
556
557 This section interweaves problems and issues of several types. Each
558 subsection outlines something that is perceived to be a problem or
559 issue "with IDNs", therefore needing correction. Some of these
560 issues can be at least partially resolved by making changes to
561 elements of the IDNA protocol or tables. Others will exist as long
562 as people have expectations of IDNs that are inconsistent with the
563 basic DNS architecture. It is important to identify this entire
564 range of problems because users, registrants, and policy makers often
565 do not understand the protocol and other technical issues but only
566 the difference between what they believe happens or should happen and
567 what actually happens. As long as those differences exist, there
568 will be demands for functionality or policy changes for IDNs. Of
569 course, some of these demands will be less realistic than others, but
570 even the realistic ones should be understood in the same context as
571 the others.
572
573 Most of the issues that have been raised, and that are discussed in
574 this document, exist whether IDNA remains tied to Unicode 3.2 or
575 whether migration to new Unicode versions is contemplated. A
576 migration path is necessary to accommodate newly-coded scripts and to
577 permit the maximum number of languages and scripts to be represented
578 in domain names. However, the migration issues are largely separate
579 from those involving a single Unicode version or Version 3.2 in
580 particular, so they have been separated into this section and
581 Section 3.
582
583 2.1. User Conceptions, Local Character Sets, and Input issues
584
585 The labels of the DNS are just strings of characters that are not
586 inherently tied to a particular language. As mentioned briefly in
587 the Introduction, DNS labels that could not lexically be words in any
588 language are possible and indeed common. There appears to be no
589 reason to impose protocol restrictions on IDNs that would restrict
590 them more than all-ASCII hostname labels have been restricted. For
591 that reason, even describing DNS labels or strings of them as "names"
592 is something of a misnomer, one that has probably added to user
593 confusion about what to expect.
594
595 Ordinarily, people use "words" when they think of things and wish
596 others to think of them too, for example, "orange", "tree",
597 "restaurant" or "Acme Inc". Words are normally in a specific
598 language, such as English or Swedish. The character-string labels
606 supported by the DNS are, as suggested above, not inherently "words".
607 While it is useful, especially for mnemonic value or to identify
608 objects, for actual words to be used as DNS labels, other constraints
609 on the DNS make it impossible to guarantee that it will be possible
610 to represent every word in every language as a DNS label,
611 internationalized or not.
612
613 When writing or typing the label (or word), a script must be selected
614 and a charset must be picked for use with that script. The choice of
615 charset is typically not under the control of the user on a per-word
616 or per-document basis, but may depend on local input devices,
617 keyboard or terminal drivers, or other decisions made by operating
618 system or even hardware designers and implementers.
619
620 If that charset, or the local charset being used by the relevant
621 operating system or application software, is not Unicode, a further
622 conversion must be performed to produce Unicode. How often this is
623 an issue depends on estimates of how widely Unicode is deployed as
624 the native character set for hardware, operating systems, and
625 applications. Those estimates differ widely, but it should be noted
626 that, among other difficulties:
627
o ISO 8859 versions [ISO.8859.2003], and even national variations of
ISO 646 [ISO.646.1991], are still widely used in parts of Europe;
630
o code-table switching methods, typically based on the techniques of
ISO 2022 [ISO.2022.1986], are still in general use in many parts of
633 the world, especially in Japan with Shift-JIS and its variations;
634 and
635
636 o computing, systems, and communications in China tend to use one or
637 more of the national "GB" standards rather than native Unicode.
638
639 Additionally, not all charsets define their characters in the same
640 way and not all preexisting coding systems were incorporated into
641 Unicode without changes. Sometimes local distinctions were made that
642 Unicode does not make or vice versa. Consequently, conversion from
643 other systems to Unicode may potentially lose information.
644
645 The Unicode string that results from this processing -- processing
646 that is trivial in a Unicode-native system but that may be
647 significant in others -- is then used as input to IDNA.
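
The dependence of the IDNA input on the local charset can be shown
with a short sketch. The sample bytes and the to_unicode() helper
are hypothetical; the point is only that the same octets yield
different Unicode strings, and therefore different names, depending
on which charset the local system assumes.

```python
def to_unicode(raw: bytes, charset: str) -> str:
    """Convert locally encoded octets to the Unicode string IDNA needs."""
    return raw.decode(charset)


raw = b"caf\xe9"   # four octets as produced by some local input method

as_latin1 = to_unicode(raw, "iso-8859-1")   # 0xE9 read as e with acute
as_greek = to_unicode(raw, "iso-8859-7")    # 0xE9 read as a Greek letter

# Each interpretation leads to a different ASCII-compatible name.
ace_latin1 = as_latin1.encode("idna").decode("ascii")   # "xn--caf-dma"
ace_greek = as_greek.encode("idna").decode("ascii")
```

Nothing in the octets themselves identifies the charset, so a wrong
guess silently produces a lookup for a different domain name.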
648
661 2.2. Examples of Issues
662
663 While much of the discussion below is stated in terms of Unicode
664 codings and associated rules, the IAB believes that some of the
665 issues are actually not about the Unicode character set per se, but
666 about how distributed matching systems operate in reality, and about
667 what implications the distributed delayed search for stored data that
668 characterizes the DNS has on the mapping algorithms.
669
670 2.2.1. Language-Specific Character Matching
671
672 There are similar words that can be expressed in multiple languages.
673 Consider, for example, the name Torbjorn in Norwegian and Swedish.
674 In Norwegian it is spelled with the character U+00F8 (LATIN SMALL
675 LETTER O WITH STROKE) in the second syllable, while in Swedish it is
676 spelled with U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS). Those
677 characters are not treated as equivalent according to the Unicode
678 Standard and its Annexes, although most people speaking Swedish,
679 Danish, or Norwegian probably think they are equivalent.
680
681 It is neither possible nor desirable to make these characters
682 equivalent on a global basis. To do so would, for this example,
683 rationalize the situation in Sweden while causing considerable
684 confusion in Germany because the U+00F8 character is never used in
685 the German language. But the "variant" model introduced in [RFC3743]
686 and [RFC4290] can be used by a registry to prevent the worst
687 consequence of the possible confusion, by ensuring either that both
688 names are registered to the same party in a given domain or that one
689 of them is completely prohibited.
690
691 2.2.2. Multiple Scripts
692
693 There are languages in the world that can be expressed using multiple
694 scripts. For example, some Eastern European and Central Asian
695 languages can be expressed in either Cyrillic or Latin (see
696 Section 1.5.2) characters, or some African and Southeast Asian
697 languages can be expressed in either Arabic or Latin characters. A
698 few languages can even be written in three different scripts. In
699 other cases, the language is typically written in a combination of
700 scripts (e.g., Kanji, Kana, and Romaji for Japanese; Hangul and Hanja
701 for Korean). Because of this, the same word, in the same language,
702 can be expressed in different ways. For some languages, only a
703 single script is normally used to write a single word; for others,
704 mixed scripts are required; and, for still others, special
705 circumstances may dictate mixing scripts in labels although that is
706 not normally done for "words". For IDN purposes, these variations
707 make the definition of "script" extremely sensitive, especially since
708 ICANN is now recommending that it be used as the primary basis for
716 registry policies. However essential it may be to prohibit mixed-
717 script labels, additional policy nuance is required for "languages
718 with established orthographies and conventions that require the
719 commingled use of multiple scripts".
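
The tension between a mixed-script prohibition and orthographies that
require several scripts can be illustrated with a rough sketch. The
Python standard library does not expose the Unicode script property,
so the fragment below assumes that the first word of a character's
Unicode name is an adequate stand-in. The function names are
hypothetical, and a real registry would use proper script data.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the scripts used by the letters in a label.

    Assumption: the first word of the Unicode character name ("LATIN",
    "CYRILLIC", "HIRAGANA", "CJK", ...) approximates the script.
    """
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):  # letters only
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True if the letters of the label come from more than one script."""
    return len(scripts_in_label(label)) > 1
```

Under this approximation a Latin label containing one Cyrillic letter
is reported as mixed, but so is an ordinary Japanese string that
combines Kanji and Kana, which is exactly the case that calls for
additional policy nuance.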
720
721 2.2.3. Normalization and Character Mappings
722
723 Unicode contains several different models for representing
724 characters. The Chinese (Han)-derived characters of the "CJK"
725 (Chinese, Japanese, and Korean) languages are "unified", i.e.,
726 characters with common derivation and similar appearances are
727 assigned to the same code point. European characters derived from a
728 Greek-Latin base are separated into separate code blocks for Latin,
729 Greek, and Cyrillic even when individual characters are identical in
730 both form and semantics. Separate code points based on font
731 differences alone are generally prohibited, but a large number of
732 characters for "mathematical" use have been assigned separate code
733 points even though they differ from base ASCII characters only by
734 font attributes such as "script", "bold", or "italic". Some
735 characters that often appear together are treated as typographical
736 digraphs with specific code points assigned to the combination,
737 others require that the two-character sequences be used, and still
738 others are available in both forms. Some Roman-derived letters that
739 were developed as decorated variations on the basic Latin letter
740 collection (e.g., by addition of diacritical marks) are assigned code
741 points as individual characters, others must be built up as two (or
742 more) character sequences using "combining characters".
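
Several of these models are visible in the decomposition data that
Unicode publishes for each character. The following sketch uses
Python's unicodedata module to print that data for one character of
each kind; the sample characters are chosen for illustration only.

```python
import unicodedata

samples = {
    "font variant": "\U0001d41a",  # MATHEMATICAL BOLD SMALL A
    "ligature": "\ufb01",          # LATIN SMALL LIGATURE FI
    "precomposed": "\u00e9",       # LATIN SMALL LETTER E WITH ACUTE
    "Han ideograph": "\u56fd",     # CJK UNIFIED IDEOGRAPH-56FD
}

for kind, ch in samples.items():
    # decomposition() returns "" when the character has no mapping.
    mapping = unicodedata.decomposition(ch)
    print(f"{kind:14} U+{ord(ch):04X} -> {mapping!r}")
```

The font variant and the ligature carry tagged compatibility mappings,
the precomposed letter has a canonical mapping to a base character
plus a combining mark, and the unified Han character has no mapping
at all.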
743
744 Many of these differences result from the desire to maintain backward
745 compatibility while the standard evolved historically, and are hence
746 understandable. However, the DNS requires precise knowledge of which
747 codes and code sequences represent the same character and which ones
748 do not. Limiting the potential difficulties with confusable
749 characters (see Section 2.2.6) requires even more knowledge of which
750 characters might look alike in some fonts but not in others. These
751 variations make it difficult or impossible to apply a single set of
752 rules to all of Unicode and, in doing so, satisfy everyone and their
753 perceived needs. Instead, more or less complex mapping tables,
754 defined on a character-by-character basis, are required to
755 "normalize" different representations of the same character to a
756 single form so that matching is possible.
757
758 Unless normalization rules, such as those that underlie Nameprep, are
759 applied, characters that are essentially identical will not match in
760 the DNS, creating many opportunities for problems. The most common
761 of these problems is that, due to the processing applied (and
762 discussed above) before a word is represented as a Unicode string, a
763 single word can end up being expressed as several different Unicode
771 strings. Even if normalization rules are applied, some strings that
772 are considered identical by users will not compare equal. That
773 problem is discussed in more detail elsewhere in this document,
774 particularly in Section 3.2.1.
775
776 IDNA attempts to compensate for these problems by using a
777 normalization algorithm defined by the Unicode Consortium. This
778 algorithm can change a sequence of one or more Unicode characters to
779 another set of characters. One example is that the base character
780 U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING
781 DIAERESIS) is changed to the single Unicode character U+00E4 (LATIN
782 SMALL LETTER A WITH DIAERESIS).
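
The example can be reproduced with a short sketch. It assumes Python's
unicodedata module, which applies NFKC using the Unicode version
bundled with the interpreter rather than the Unicode 3.2 tables that
IDNA specifies; for this pair of strings the result is the same. The
function name is hypothetical.

```python
import unicodedata


def equal_after_nfkc(a: str, b: str) -> bool:
    """Compare two strings after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


decomposed = "a\u0308"   # U+0061 followed by U+0308
precomposed = "\u00e4"   # U+00E4

# The raw code point sequences differ, but the normalized forms match.
raw_equal = decomposed == precomposed
normalized_equal = equal_after_nfkc(decomposed, precomposed)
```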
783
784 This Unicode normalization process accounts only for simple character
785 equivalences, not equivalences that are language or script dependent.
786 For example, as mentioned above, the characters U+00F8 (LATIN SMALL
787 LETTER O WITH STROKE) and U+00F6 (LATIN SMALL LETTER O WITH
788 DIAERESIS) are considered to match in Swedish (and some other
789 languages), but not for all languages that use either of the
790 characters. Having these characters be treated as equivalent in some
791 contexts and not in others requires decisions and mechanisms that, in
792 turn, depend much more on context than either IDNA or the Unicode
793 character-based normalization tables can provide.
794
795 Additional complications occur if the sequences are more complicated
796 or if an attacker is making a deliberate effort to confuse the
797 normalization process. For example, if the sequence U+0069 U+0308
798 (LATIN SMALL LETTER I followed by COMBINING DIAERESIS) appears, the
799 Unicode Normalization Method known as NFKC maps it into U+00EF (LATIN
800 SMALL LETTER I WITH DIAERESIS), which is what one would predict. But
801 consider U+0131 U+0308 (LATIN SMALL LETTER DOTLESS I and COMBINING
802 DIAERESIS): is that the same character? Is U+0131 U+0307 U+0307
803 (dotless i and two combining dot-above characters) equivalent to
804 U+00EF or U+0069, or neither? NFKC does not appear to tell us, nor
805 does the definition of U+0307 appear to tell us what happens when it
806 is combined with other "symbol above" arrangements (unlike some of
807 the "accent above" combining characters, which more or less specify
808 kerning). Similar issues arise when U+00EF is combined with various
809 dot-above combining characters. Each of these questions provides
810 some opportunities for spoofing if different display implementations
811 interpret the rules in different ways.
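
What one implementation does with such sequences can be observed
directly. The sketch below assumes Python's unicodedata module and
reports NFKC results for LATIN SMALL LETTER I followed by COMBINING
DIAERESIS and for two of the dotless-i sequences discussed above. The
results reflect the Unicode database bundled with the interpreter,
and the helper name is hypothetical.

```python
import unicodedata


def nfkc_code_points(s: str) -> list:
    """Return the NFKC form of `s` as a list of 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", s)]


cases = {
    "i + diaeresis": "\u0069\u0308",
    "dotless i + diaeresis": "\u0131\u0308",
    "dotless i + two dots above": "\u0131\u0307\u0307",
}

results = {label: nfkc_code_points(s) for label, s in cases.items()}
```

Only the first sequence is composed into U+00EF. The dotless-i
sequences pass through unchanged, so normalization itself neither
equates them with U+00EF or U+0069 nor rejects them; any resemblance
on screen is left to the display implementation.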
812
813 If we leave Latin scripts and examine those based on Chinese
814 characters, we see there is also an absence of specific, lexigraphic,
815 rules for transformations between Traditional and Simplified Chinese.
816 Even if there were such rules, unification of Japanese and Korean
826 characters with Chinese ones would make it impossible to normalize
827 Traditional Chinese characters into Simplified Chinese ones without
828 causing problems in Japanese and Korean use of the same characters.
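
A brief sketch, assuming Python's unicodedata module, shows that none
of the Unicode normalization forms relates a Traditional character to
its Simplified counterpart. The pair used here (U+570B and U+56FD) is
only an example, and the function name is hypothetical.

```python
import unicodedata

traditional = "\u570b"  # Traditional form of the character for "country"
simplified = "\u56fd"   # Simplified form, also the usual modern Japanese form


def unchanged_by_normalization(ch: str) -> bool:
    """True if no Unicode normalization form alters `ch`."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(f, ch) == ch for f in forms)


traditional_is_stable = unchanged_by_normalization(traditional)
simplified_is_stable = unchanged_by_normalization(simplified)
```

Because both code points are stable and distinct, any equivalence
between them has to come from tables outside Unicode normalization,
and a global mapping would also affect Japanese text that
distinguishes the two forms.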
829
830 More generally, while some mappings, such as those between
831 precomposed Latin script characters and the equivalent multiple code
832 point composed character sequences, depend only on the characters
833 themselves, in many or most cases, such as the case with Swedish
834 above, the mapping is language or culturally dependent. There have
835 been discussions as to whether different canonicalization rules (in
836 addition to or instead of Unicode normalization) should be, or could
837 be, applied differently to different languages or scripts. The fact
838 that most scripts included in Unicode have been initially
839 incorporated by copying an existing standard more or less intact has
840 impact on the optimization of these algorithms and on forward
841 compatibility. Even if the language is known and language-specific
842 rules can be defined, dependencies on the language do not disappear.
843 Canonicalization operations are not possible unless they either
844 depend only on short sequences of text or have significant context
845 available that is not obvious from the text itself. DNS lookups and
846 many other operations do not have a way to capture and utilize the
847 language or other information that would be needed to provide that
848 context.
849
850 These variations in languages and in user perceptions of characters
851 make it difficult or impossible to provide uniform algorithms for
852 matching Unicode strings in a way that no end users are ever
853 surprised by the result. For closely-related scripts or characters,
854 surprises may even be frequent. However, because uniform algorithms
855 are required for mappings that are applied when names are looked up
856 in the DNS, the rules that are chosen will always represent an
857 approximation that will be more or less successful in minimizing
858 those user surprises. The current Nameprep and Stringprep algorithms
859 use mapping tables to "normalize" different representations of the
860 same text to a single form so that matching is possible.
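
The effect of those mapping tables can be sketched with the IDNA2003
support in the Python standard library: encodings.idna.nameprep
applies the Nameprep profile, and the "idna" codec performs the full
conversion to the ASCII-compatible form. The sample label is an
assumption chosen for illustration.

```python
import encodings.idna

# Three different Unicode strings that a user would regard as one name.
variants = [
    "B\u00fccher",   # precomposed u with diaeresis, mixed case
    "bu\u0308cher",  # base letter plus combining diaeresis
    "BU\u0308CHER",  # uppercase, combining form
]

# Nameprep folds case and normalizes, so all variants collapse to one string.
prepared = {encodings.idna.nameprep(v) for v in variants}

# The "idna" codec applies Nameprep and then Punycode to each label.
ace = {v.encode("idna") for v in variants}
```

All three inputs yield the same prepared string and the same
ASCII-compatible label, which is what makes matching in the DNS
possible for this class of variation.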
861
862 More details on the creation of the normalization algorithms can be
863 found in the Unicode Specification and the associated Technical
864 Reports [UTR] and Annexes. Technical Report #36 [UTR36] and [UTR39]
865 are specifically related to the IDN discussion.
866
867 2.2.4. URLs in Printed Form
868
869 URLs and other identifiers appear, not only in electronic forms from
870 which they can (at least in principle) be accurately copied and
871 "pasted" but in printed forms from which the user must transcribe
872 them into the computer system. This is often known as the "side-of-
873 the-bus problem" because a particularly problematic version of it
881 requires that the user be able to observe and accurately remember a
882 URL that is quickly glimpsed in a transient form -- a billboard seen
883 while driving, a sign on the side of a passing vehicle, a television
884 advertisement that is not frequently repeated or on-screen for a long
885 time, and so on.
886
887 The difficulty, in short, is that two Unicode strings that are
888 actually different might look exactly the same, especially when there
889 is no time to study them. This is because, for example, some glyphs
890 in Cyrillic, Greek, and Latin do look the same, but have been
891 assigned different code points in Unicode. Worse, one needs to be
892 reasonably familiar with a script and how it is used to understand
893 how much characters can reasonably vary as the result of artistic
894 fonts and typography. For example, there are a few fonts for Latin
895 characters that are sufficiently highly ornamented that an observer
896 might easily confuse some of the characters with characters in Thai
897 script. Uppercase ITC Blackadder (a registered trademark of
898 International Typeface Corporation) and Curlz MT are two fairly
899 obvious examples; these fonts use loops at the end of serifs,
900 creating a resemblance to Thai (in some fonts) for some characters.
901
902 2.2.5. Bidirectional Text
903
904 Some scripts (and because of that some words in some languages) are
905 written not left to right, but right to left. And, to complicate
906 things, one might have something written in Arabic script right to
907 left that includes some characters that are read from left to right,
908 such as European-style digits. This implies that some texts might
909 have a mixed left-to-right AND right-to-left order (even though in
910 most implementations, and in IDNA, all texts have a major direction,
911 with the other as an exception).
912
913 IDNA permits the inclusion of European digits in a label that is
914 otherwise a sequence of right-to-left characters, but prohibits most
915 other mixed-directional (or bidirectional) strings. This prohibition
916 can cause other problems such as the rejection of some otherwise
917 linguistically and culturally sensible strings. As Unicode and
918 conventions for handling so-called bidirectional ("BIDI") strings
919 evolve, the prohibition in IDNA should be reviewed and reevaluated.
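
The distinction drawn here can be sketched using the bidirectional
class that Unicode assigns to each character. The fragment below
assumes Python's unicodedata module and is only a rough check with
hypothetical function names; it does not implement the complete
bidirectional requirements of Stringprep.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # right-to-left letters (e.g., Hebrew, Arabic)


def bidi_classes(label: str) -> set:
    """Return the set of Unicode bidirectional classes used in a label."""
    return {unicodedata.bidirectional(ch) for ch in label}


def mixes_directions(label: str) -> bool:
    """True if right-to-left letters appear together with characters of
    class "L". European digits (class "EN") are not counted as "L".
    """
    classes = bidi_classes(label)
    return bool(classes & RTL_CLASSES) and "L" in classes


arabic_with_digit = "\u0645\u0635\u06311"    # Arabic letters plus "1"
arabic_with_latin = "\u0645\u0635\u0631abc"  # Arabic letters plus Latin
```

In this sketch the Arabic label with a European digit is accepted,
while the label that combines Arabic and Latin letters is flagged.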
920
921 2.2.6. Confusable Character Issues
922
923 Similar-looking characters in identifiers can cause actual problems
924 on the Internet since they can result, deliberately or accidentally,
925 in people being directed to the wrong host or mailbox by believing
926 that they are typing, or clicking on, intended characters that are
927 different from those that actually appear in the domain name or
928 reference. See Section 4.1.3 for further discussion of this issue.
929
936 IDNs complicate these issues, not only by providing many additional
937 characters that look sufficiently alike to be potentially confused,
938 but also by raising new policy questions. For example, if a language
939 can be written in two different scripts, is a label constructed from
940 a word written in one script equivalent to a label constructed from
941 the same word written in the other script? Is the answer the same
942 for words in two different languages that translate into each other?
943
944 It is now generally understood that, in addition to the collision
945 problems of possibly equivalent words and hence labels, it is
946 possible to utilize characters that look alike -- "confusable"
947 characters -- to spoof names in order to mislead or defraud users.
948 That issue, driven by particular attacks such as those known as
949 "phishing", has introduced stronger requirements for registry efforts
950 to prevent problems than were previously generally recognized as
951 important.
952
953 One commonly-proposed approach is to have a registry establish
954 restrictions on the characters, and combinations of characters, it
955 will permit to be included in a string to be registered as a label.
956 Taking the Swedish top-level domain, .SE, as an example, a rule might
957 be adopted that the registry "only accepts registrations in Swedish,
958 using Latin script, and because of this, Unicode characters Latin-a,
959 -b, -c,...". But, because there is not a 1:1 mapping between country
960 and language, even a Country Code Top Level Domain (ccTLD) like .SE
961 might have to accept registrations in other languages. For example,
962 there may be a requirement for Finnish (the second most-used language
963 in Sweden). What rules and code points are then defined for Finnish?
964 Does it have special mappings that collide with those that are
965 defined for Swedish? And what does one do in countries that use more
966 than one script? (Finnish and Swedish use the same script.) In all
967 cases, the dispute will ultimately be about whether two strings are
968 the same (or confusingly similar) or not. That, in turn, will
969 generate a discussion of how one defines "what is the same" and "what
970 is similar enough to be a problem".
971
972 Another example arose recently that further illustrates the problem.
973 If one were to use Cyrillic characters to represent the country code
974 for Russia in a localized equivalent to the ccTLD label, the
975 characters themselves would be indistinguishable from the Latin
976 characters "P" and "Y" (in either lower- or uppercase) in most fonts.
977 We presume this might cause some consternation in Paraguay.
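
The collision can be made concrete with a sketch. The table below is
a hypothetical and deliberately tiny stand-in for real confusable
data such as that discussed in [UTR39]; the function name skeleton is
likewise an assumption made for illustration.

```python
import unicodedata

# Hypothetical look-alike table: Cyrillic letter -> Latin representative.
CONFUSABLE = {
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def skeleton(label: str) -> str:
    """Fold each character to its look-alike representative, if any."""
    return "".join(CONFUSABLE.get(ch, ch) for ch in label.lower())


cyrillic_label = "\u0440\u0443"  # Cyrillic ER followed by Cyrillic U
latin_label = "py"

names = [unicodedata.name(ch) for ch in cyrillic_label]
collide = skeleton(cyrillic_label) == skeleton(latin_label)
```

The two labels are different code point sequences and would be
different names in the DNS, yet they fold to the same skeleton, which
is the condition a registry or user interface would need to detect.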
978
979 These difficulties can never be completely eliminated by algorithmic
980 means. Some of the problem can be addressed by appropriate tuning of
981 the protocols and their tables, other parts by registry actions to
982 reduce confusion and conflicts, and still other parts can be
991 addressed by careful design of user interfaces in application
992 programs. But, ultimately, some responsibility to avoid being
993 tricked or harmfully confused will rest with the user.
994
995 Another registry technique that has been extensively explored
996 involves looking at confusable characters and confusion between
997 complete labels, restricting the labels that can be registered based
998 on relationships to what is registered already. Registries that
999 adopt this approach might establish special mapping rules such as:
1000
1001 1. If you register something with code point A, domain names with B
1002 instead of A will be blocked from registration by others (where B
1003 is a character at a separate code point that has a confusingly
1004 similar appearance to A).
1005
1006 2. If you register something with code point A, you also get the
1007 domain name with B instead of A.
1008
1009 These approaches are discussed in more detail for "CJK" characters in
1010 RFC 3743 [RFC3743] and more generally in RFC 4290 [RFC4290].
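
The two rules above can be sketched as a small registration model.
Everything in it is hypothetical: the variant table (which folds
U+00F8 to U+00F6, following the Swedish example in Section 2.2.1),
the class and method names, and the policy switch. The procedures
actually specified in [RFC3743] and [RFC4290] are considerably more
detailed.

```python
# Hypothetical variant table: code point B -> code point A.
VARIANTS = {"\u00f8": "\u00f6"}


def variant_key(label: str) -> str:
    """Fold a label so that all of its variants share one key."""
    return "".join(VARIANTS.get(ch, ch) for ch in label.lower())


class Registry:
    def __init__(self, bundle: bool):
        self.bundle = bundle  # False: rule 1 (block), True: rule 2 (bundle)
        self.by_key = {}      # variant key -> (registered label, registrant)

    def register(self, label: str, registrant: str) -> bool:
        """Register a label unless it or one of its variants is taken."""
        key = variant_key(label)
        if key in self.by_key:
            return False
        self.by_key[key] = (label.lower(), registrant)
        return True

    def holder(self, label: str):
        """Return the registrant that controls `label`, or None."""
        entry = self.by_key.get(variant_key(label))
        if entry is None:
            return None
        registered_label, registrant = entry
        if label.lower() == registered_label or self.bundle:
            return registrant  # rule 2 gives every variant to the same party
        return None            # rule 1: the variant is blocked for everyone
```

Under either rule a second party cannot obtain the variant; the rules
differ only in whether the variant is delegated to the original
registrant or to no one.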
1011
1012 2.2.7. The IESG Statement and IDNA issues
1013
1014 The issues above, at least as they were understood at the time,
1015 provided the background for the IESG statement included in
1016 Section 1.6.1 (which, in turn, was part of the basis for the initial
1017 ICANN Guidelines) that a registry should have a policy about the
1018 scripts, languages, code points and text directions for which
1019 registrations will be accepted. While "accept all" might be an
1020 acceptable policy, it implies there is also a dispute resolution
1021 process that takes the problems listed above into account. This
1022 process must be designed for dealing with all types of potential
1023 disputes. For example, issues might arise between registrant and
1024 registry over a decision by the registry on collisions with already
1025 registered domain names and between registrant and trademark holder
1026 (that a domain name infringes on a trademark). In both cases, the
1027 parties disagreeing have different views on whether two strings are
1028 "equivalent" or not. They may believe that a string that is not
1029 allowed to be registered is actually different from one that is
1030 already registered. Or they might believe that two strings are the
1031 same, even though the rules adopted by the registry to prevent
1032 confusion define them as two different domain names.
1033
1046 3. Migrating to New Versions of Unicode
1047
1048 3.1. Versions of Unicode
1049
1050 While opinions differ about how important the issues are in practice,
1051 the use of Unicode and its supporting tables for IDNA appears to be
1052 far more sensitive to subtle changes than it is in typical Unicode
1053 applications. This may be, at least in part, because many other
1054 applications are internally sensitive only to the appearance of
1055 characters and not to their representation. Or those applications
1056 may be able to take effective advantage of script, language, or
1057 character class identification. The working group that developed
1058 IDNA concluded that attempting to encode any ancillary character
1059 information into the DNS label would be impractical and unwise, and
1060 the IAB, based in part on the comments in the ad hoc committee, saw
1061 no reason to review that decision.
1062
1063 The Unicode Consortium has sometimes used the likelihood of a
1064 combination of characters actually appearing in a natural language as
1065 a criterion for the safety of a possible change. However, as
1066 discussed above, DNS names are often fabrications -- abbreviations,
1067 strings deliberately formed to be unusual, members of a series
1068 sequenced by numbers or other characters, and so on. Consequently, a
1069 criterion that considers a change to be safe if it would not be
1070 visible in properly-constructed running text is not helpful for DNS
1071 purposes: a change that would be safe under that criterion could
1072 still be quite problematic for the DNS.
1073
1074 This sensitivity to changes has made it quite difficult to migrate
1075 IDNA from one version of Unicode to the next if any changes are made
1076 that are not strictly additive. A change in a code point assignment
1077 or definition may be extremely disruptive if a DNS label has been
1078 defined using the earlier form and any of its previous components has
1079 been moved from one table position or normalization rule to another.
1080 Unicode normalization tables, tables of scripts or languages and
1081 characters that belong to them, and even tables of confusable
1082 characters as an adjunct to security recommendations may be very
1083 helpful in designing registry restrictions on registrations and
1084 applications provisions for avoiding or identifying suspicious names.
1085 Ironically, they also extend the sensitivity of IDNA and its
1086 implementations to all forms of change between one version of Unicode
1087 and the next. Consequently, they make Unicode version migration more
1088 difficult.
1089
1090 An example of the type of change that appears to be just a small
1091 correction from one perspective but may be problematic from another
1092 was the correction to the normalization definition in 2004
1093 [Unicode-PR29]. Community input suggested that the change would
1101 cause problems for Stringprep, but the Unicode Technical Committee
1102 decided, on balance, that the change was worthwhile. Because of
1103 difficulties with consistency, some deployed implementations have
1104 decided to adopt the change and others have not, leading to subtle
1105 incompatibilities.
1106
1107 This situation leads to a dilemma. On the one hand, it is completely
1108 unacceptable to freeze IDNA at a Unicode version level that excludes
1109 more recently-defined characters and scripts that are important to
1110 those who use them. On the other hand, it is equally unacceptable to
1111 migrate from one version of Unicode to the next if such migration
1112 might invalidate an existing registered DNS name or some of its
1113 registered properties or might make the string or representation of
1114 that name ambiguous. If IDNA is to be modified to accommodate new
1115 versions of Unicode, the IETF will need to work with the Unicode
1116 Consortium and other bodies to find an appropriate balance in this
1117 area, but progress will be possible only if all relevant parties are
1118 able to fairly consider and discuss possible decisions that may be
1119 very difficult and unpalatable.
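
The first horn of the dilemma can be illustrated with a sketch that
compares two versions of the Unicode database. It assumes Python's
unicodedata module, which keeps a Unicode 3.2 database available as
unicodedata.ucd_3_2_0 for IDNA use alongside the version bundled with
the interpreter. The sample characters and the function name are
chosen for illustration.

```python
import unicodedata

old = unicodedata.ucd_3_2_0  # Unicode 3.2 database
new = unicodedata            # database bundled with this interpreter


def assigned_since_3_2(ch: str) -> bool:
    """True if `ch` is unassigned ("Cn") in Unicode 3.2 but assigned now."""
    return old.category(ch) == "Cn" and new.category(ch) != "Cn"


capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S
nko_letter_a = "\u07ca"     # NKO LETTER A

newer = [ch for ch in (capital_sharp_s, nko_letter_a, "a")
         if assigned_since_3_2(ch)]
```

Characters in the resulting list exist for applications that use a
current version of Unicode but are unassigned code points from the
point of view of an IDNA implementation frozen at Unicode 3.2.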
1120
1121 It would also prove useful if, during the course of that dialog, the
1122 need for Unicode Consortium concern with security issues in
1123 applications of the Unicode character set could be clarified. It
1124 would be unfortunate from almost every perspective considered here,
1125 if such matters slowed the inclusion of as yet unencoded scripts.
1126
1127 3.2. Version Changes and Normalization Issues
1128
1129 3.2.1. Unnormalized Combining Sequences
1130
1131 One of the advantages of the Unicode model of combining characters,
1132 as with previous systems that use character overstriking to
1133 accomplish similar purposes, is that it is possible to use sequences
1134 of code points to generate characters that are not explicitly
1135 provided for in the character set. However, unless sequences that
1136 are not explicitly provided for are prohibited by some mechanism
1137 (such as the normalization tables), such combining sequences can
1138 permit two related dangers.
1139
1140 o The first is another risk of character confusion, especially if
1141 the relationship of the combining character with the characters it
1142 combines with is not precisely defined or unexpected combinations
1143 of combining characters are used. That issue is discussed in more
1144 detail, with an example, in Section 2.2.3.
1145
1146 o These same issues also inherently impact the stability of the
1147 normalization tables. Suppose that, somewhere in the world, there
1148 is a character that looks like a Roman-derived lowercase "i", but
1156 with three (not one or two) dots above it. And suppose that the
1157 users of that character agree to represent it by combining a
1158 traditional "i" (U+0069) with a combining diaeresis (U+0308). So
1159 far, no problem. But, later, a broader need for this character is
1160 discovered and it is coded into Unicode either as a single
1161 precomposed character or, more likely under existing rules, by
1162 introducing a three-dot-above combining character. In either
1163 case, that version of Unicode should include a rule in NFKC that
1164 maps the "i"-plus-diaeresis sequence into the new, approved, one.
1165 If one does not do so, then there is arguably a normalization that
1166 should occur that does not. If one does so, then strings that
1167 were valid and normalized (although unanticipated) under the
1168 previous versions of Unicode become unnormalized under the new
1169 version. That, in turn, would impact IDNA comparisons because,
1170 effectively, it would introduce a change in the matching rules.
1171
1172 It would be useful to consider rules that would avoid or minimize
1173 these problems with the understanding that, for reasons given
1174 elsewhere, simply minimizing them may not be good enough for IDNA. One
1175 partial solution might be to ban any combination of a base character
1176 and a combining character that does not appear in a hypothetical
1177 "anticipated combinations" table from being used in a domain name
1178 label. The next subsection discusses a more radical, if impractical,
1179 view of the problem and its solutions.
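
One way to picture such a restriction is the following sketch. It
assumes, purely for illustration, that the set of precomposed
characters in the Unicode database bundled with Python stands in for
the hypothetical "anticipated combinations" table: a label is flagged
if any combining mark remains after canonical composition. The
function name is hypothetical.

```python
import unicodedata


def has_unanticipated_combination(label: str) -> bool:
    """True if a combining mark survives NFC composition.

    Assumption: base-plus-mark pairs that compose to a single
    precomposed character are the "anticipated" ones.
    """
    composed = unicodedata.normalize("NFC", label)
    return any(unicodedata.category(ch).startswith("M") for ch in composed)


anticipated = "i\u0308"         # composes to U+00EF
unanticipated = "\u0131\u0308"  # dotless i plus diaeresis: no composite
```

A rule this blunt would also reject scripts whose ordinary
orthography depends on combining marks, which is one reason the text
describes the idea as only a partial solution.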
1180
1181 3.2.2. Combining Characters and Character Components
1182
1183 For several reasons, including those discussed above, one thing that
1184 increases IDNA complexity and the need for normalization is that
1185 combining characters are permitted. Without them, complexity might
1186 be reduced enough to permit easier transitions to new versions. The
1187 community should consider the impact of entirely prohibiting
1188 combining characters from IDNs. While it is almost certainly
1189 unfeasible to introduce this change into Unicode as it is now defined
1190 and doing so would be extremely disruptive even if it were feasible,
1191 the thought experiment can be helpful in understanding both the
1192 issues and the implications of the paths not taken. For example, one
1193 consequence of this, of course, is that each new language or script,
1194 and several existing ones, would require that all of its characters
1195 have Unicode assignments to specific, precomposed, code points.
1196
1197 Note that this is not currently permitted within Unicode for Latin
1198 scripts. For non-Latin scripts, some such code points have been
1199 defined. The decisions that govern the assignment of such code
1200 points are managed entirely within the Unicode Consortium. Were the
1201 IETF to choose to reduce IDNA complexity by excluding combining
1202 characters, no doubt there would be additional input to the Unicode
1203 Consortium from users and proponents of scripts that precomposed
1211 characters be required. The IAB and the IETF should examine whether
1212 it is appropriate to press the Unicode Consortium to revise these
1213 policies or otherwise to recommend actions that would reduce the need
1214 for normalization and the related complexities. However, we have
1215 been told that the Technical Committee does not believe it is
1216 reasonable or feasible to add all possible precomposed characters to
1217 Unicode. If Unicode cannot be modified to contain the precomposed
1218 characters necessary to support existing languages and scripts, much
1219 less new ones, this option for IDN restrictions will not be feasible.
1220
1221 3.2.3. When does normalization occur?
1222
1223 In many Unicode applications, the preferred solution is to pick a
1224 style of normalization and require that all text that is stored or
1225 transmitted be normalized to that form. (This is the approach taken
1226 in ongoing work in the IETF on a standard Unicode text form
1227 [net-utf8].) IDNA does not impose this requirement. Text is
1228 normalized and case-reduced at registration time, and only the
1229 normalized version is placed in the DNS. However, there is no
1230 requirement that applications show only the native (and lower-case
1231 where appropriate) characters associated with the normalized form in
1232 discussions or references such as URLs. If conventions used for
1233 all-ASCII DNS labels are to be extended to internationalized forms,
1234 such a requirement would be unreasonable, since it would prohibit the
1235 use of mixed-case references for clarity or market identification.
1236 It might even be culturally inappropriate. However, without that
1237 restriction, the comparison that will ultimately be made in the DNS
1238 will be between strings normalized at different times and under
1239 different versions of Unicode. The assertion that a string in
1240 normalized form under one version of Unicode will still be in
1241 normalized form under all future versions is not sufficient.
1242 Normalization at different times also requires that a given source
1243 string always normalizes to the same target string, regardless of the
1244 version under which it is normalized. That criterion is much more
1245 difficult to fulfill. The discussion above suggests that it may even
1246 be impossible.
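
The stronger criterion can be stated as a test over candidate
strings. The sketch below assumes Python's unicodedata module, using
unicodedata.ucd_3_2_0 as the Unicode 3.2 normalizer and the module
itself as the newer one. The function names are hypothetical, and a
passing result for a sample of strings says nothing about strings
that were not tested.

```python
import unicodedata


def stable_across_versions(source: str, form: str = "NFKC") -> bool:
    """True if `source` normalizes to the same target under Unicode 3.2
    and under the database bundled with this interpreter."""
    old_target = unicodedata.ucd_3_2_0.normalize(form, source)
    new_target = unicodedata.normalize(form, source)
    return old_target == new_target


def unstable_labels(labels):
    """Return the labels whose normalized form depends on the version."""
    return [label for label in labels if not stable_across_versions(label)]


sample = ["b\u00fccher", "bu\u0308cher", "example"]
problems = unstable_labels(sample)
```

A registry could run a check of this kind over its zone before
adopting a new Unicode version, treating any label it reports as one
whose matching behavior would change.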
1247
1248 Ignoring these issues with combining characters entirely, as IDNA
1249 effectively does today, may leave us "stuck" at Unicode 3.2, leading
1250 either to incompatibilities in applications that otherwise
1251 use a modern version of Unicode (while IDN remains at Unicode 3.2) or
1252 to painful transitions to new versions. If decisions are made
1253 quickly, it may still be possible to make a one-time version upgrade
1254 to Version 4.1 or Version 5 of Unicode. However, unless we can
1255 impose sufficient global restrictions to permit smooth transitions,
1256 upgrading to versions beyond that one are likely to be painful (e.g.,
1257 potentially requiring changing strings already in the DNS or even a
1258 new Punycode prefix) or impossible.
1259
1266 4. Framework for Next Steps in IDN Development
1267
1268 4.1. Issues within the Scope of the IETF
1269
1270 4.1.1. Review of IDNA
1271
1272 The IETF should consider reviewing RFCs 3454, 3490, 3491, and/or
1273 3492, and update, replace, or supplement them to meet the criteria of
1274 this paragraph (one or more of them may prove impractical after
1275 further study). Any new versions or additional specifications should
1276 be adapted to the version of Unicode that is current when they are
1277 created. Ideally, they should specify a path for adapting to future
1278 versions of Unicode (some suggestions below may facilitate this).
1279 The IETF should also consider whether there are significant
1280 advantages to mapping some groups of characters, such as code points
1281 assigned to font variations, into others or whether clarity and
1282 comprehensibility for the user would be better served by simply
1283 prohibiting those characters. More generally, it appears that it
1284 would be worthwhile for the IETF to review whether the Unicode
1285 normalization rules now invoked by the Stringprep profile in Nameprep
1286 are optimal for the DNS or whether more restrictive rules, or an even
1287 more restrictive set of permitted character combinations, would
1288 provide better support for DNS internationalization.
1289
1290 The IAB has concluded that there is a consensus within the broader
1291 community that lists of code points should be specified by the use of
1292 an inclusion-based mechanism (i.e., identifying the characters that
1293 are permitted), rather than by excluding a small number of characters
1294 from the total Unicode set as Stringprep and Nameprep do today. That
1295 conclusion should be reviewed by the IETF community and action taken
1296 as appropriate.
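
   The difference between the two mechanisms can be sketched in a few
   lines of Python.  Both character lists below are hypothetical and
   chosen only for illustration; they are not the Stringprep or
   Nameprep tables, and the function names are assumptions of this
   sketch.

```python
import string

# Hypothetical inclusion list: the ASCII "hostname" characters plus a
# few additional Latin letters.
PERMITTED = set(string.ascii_lowercase + string.digits + "-") | set(
    "\u00e5\u00e4\u00f6\u00fc"
)

# Hypothetical exclusion list in the style of a prohibited-character table.
PROHIBITED = {" ", "\u00a0", "\u200b"}


def allowed_by_inclusion(label: str) -> bool:
    """Accept only labels made up entirely of explicitly listed characters."""
    return all(ch in PERMITTED for ch in label)


def allowed_by_exclusion(label: str) -> bool:
    """Accept any label that merely avoids the listed characters."""
    return not any(ch in PROHIBITED for ch in label)
```

   Under the exclusion model, a character that nobody thought to list
   (a box-drawing character, for example) is accepted by default;
   under the inclusion model, it is rejected by default.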
1297
1298 We suggest that the individuals doing the review of the code points
1299 should work as a specialized design team. To the extent possible,
1300 that work should be done jointly by people with experience from the
1301 IETF and deep knowledge of the constraints of the DNS and application
1302 design, participants from the Unicode Consortium, and other people
1303 necessary to be able to reach a generally-accepted result. Because
1304 any work along these lines would be modifications and updates to
1305 standards-track documents, final review and approval of any proposals
1306 would necessarily follow normal IETF processes.
1307
1308 It is worth noting that sufficiently extreme changes to IDNA would
1309 require a new Punycode prefix, probably with long-term support for
1310 both the old prefix and the new one in both registration arrangements
1311 and applications. An alternative, which is almost certainly
1312 impractical, would be some sort of "flag day", i.e., a date on which
1313 the old rules are simultaneously abandoned by everyone and the new
1321 ones adopted. However, preliminary analysis indicates that few, if
1322 any, of the changes recommended for consideration elsewhere in this
1323 document would require this type of version change. For example,
1324 suppose additional restrictions, such as those implied above, are
1325 imposed on what can be registered. Those restrictions might require
1326 policy decisions about how labels are to be disposed of if they
1327 conformed to the earlier rules but not to the new ones. But they
1328 would not inherently require changes in the protocol or prefix.
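
   To make the cost of a prefix change concrete, the following Python
   sketch encodes a label under the current "xn--" prefix and under a
   second prefix.  The second prefix ("zq--") is invented for this
   example and is not assigned to anything.  The sketch also assumes,
   as a simplification, that the encoding after the prefix stays the
   same, and it applies no Nameprep processing.

```python
CURRENT_PREFIX = "xn--"       # ACE prefix used by IDNA today
HYPOTHETICAL_PREFIX = "zq--"  # invented for illustration only


def to_ace(label: str, prefix: str = CURRENT_PREFIX) -> str:
    """Punycode-encode one label and prepend a prefix.

    The label is assumed to be already normalized and to contain at
    least one non-ASCII character.
    """
    return prefix + label.encode("punycode").decode("ascii")


def candidate_names(label: str) -> list:
    """Forms that would need support during a transition: old, then new."""
    return [to_ace(label, CURRENT_PREFIX), to_ace(label, HYPOTHETICAL_PREFIX)]
```

   Every registration arrangement and application would have to handle
   both forms for as long as the transition lasts.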
1329
1330 4.1.2. Non-DNS and Above-DNS Internationalization Approaches
1331
1332 The IETF should once again examine the extent to which it is
1333 appropriate to try to solve internationalization problems via the DNS
1334 and what place the many varieties of so-called "keyword systems" or
1335 other Internet navigational techniques might have. Those techniques
1336 can be designed to impose fewer constraints, or at least different
1337 constraints, than IDNA and the DNS. As discussed elsewhere in this
1338 document, IDNA cannot support information about scripts, languages,
1339 or Unicode versions on lookup. As a consequence of the nature of DNS
1340 lookups, characters and labels either match or do not match; a near-
1341 match is simply not a possible concept in the DNS. By contrast,
1342 observation of near-matching is common in human communication and in
1343 matching operations performed by people, especially when they have a
1344 particular script or language context in mind. The DNS is further
1345 constrained by a fairly rigid internal aliasing system (via CNAME and
1346 DNAME resource records), while some applications of international
1347 naming may require more flexibility. Finally, the rigid hierarchy of
1348 the DNS --and the tendency in practice for it to become flat at
1349 levels nearest the root-- and the need for names to be unique are
1350 more suitable for some purposes than others and may not be a good
1351 match for some purposes for which people wish to use IDNs. Each of
1352 these constraints can be relaxed or changed by one or more systems
1353 that would provide alternatives to direct use of the DNS by users.
1354 Some of the issues involved are discussed further in Section 5.3 and
1355 various ideas have been discussed in detail in the IETF or IRTF.
1356 Many of those ideas have even been described in Internet Drafts or
1357 other documents. As experience with IDNs and with expectations for
1358 them accumulates, it will probably become appropriate for the IETF or
1359 IRTF to revisit the underlying questions and possibilities.
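
   The contrast between exact matching in the DNS and the near-matching
   that a system above the DNS could offer can be sketched as follows.
   The set of registered labels is hypothetical, and the use of
   Python's difflib module is an assumption of this sketch rather than
   a technique proposed by any of the documents mentioned above.

```python
import difflib

# Hypothetical set of registered labels.
ZONE = {"example", "exemple", "sample"}


def dns_style_lookup(label: str) -> bool:
    """DNS semantics: a label either matches exactly or does not match."""
    return label in ZONE


def near_matches(label: str) -> list:
    """A layer above the DNS could instead offer ranked close candidates."""
    return difflib.get_close_matches(label, sorted(ZONE), n=3, cutoff=0.6)
```

   A mistyped label fails the exact lookup outright, while the
   near-match function can still return plausible candidates for a
   person to choose from.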
1360
1361 4.1.3. Security Issues, Certificates, etc.
1362
1363 Some characters look like others, often as the result of common
1364 origins. The problem with these "confusable" characters, often
1365 incorrectly called homographs, has always existed when characters are
1366 presented to humans who interpret what is displayed and then make
   decisions based on what is seen.  This problem is not unique to
   internationalized domain names, but they make it worse.  A survey
   that documents what the problems actually are might be useful.  Many
   of these issues are mentioned in Unicode Technical Report #36
   [UTR36].
1379
1380 In this and other issues associated with IDNs, precise use of
1381 terminology is important lest even more confusion result. The
1382 definition of the term 'homograph' that normally appears in
1383 dictionaries and linguistic texts states that homographs are
1384 different words that are spelled identically (for example, the
1385 adjective 'brief' meaning short, the noun 'brief' meaning a document,
1386 and the verb 'brief' meaning to inform). By definition, letters in
1387 two different alphabets are not the same, regardless of similarities
1388 in appearance. This means that sequences of letters from two
1389 different scripts that appear to be identical on a computer display
1390 cannot be homographs in the accepted sense, even if they are both
1391 words in the dictionary of some language. Assuming that there is a
1392 language written with Cyrillic script in which "cap" is a word,
1393 regardless of what it might mean, it is not a homograph of the
1394 Latin-script English word "cap".
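
   The "cap" example can be examined directly.  The Python sketch below
   uses the standard-library unicodedata module to show that the two
   strings consist of different characters even though they may look
   alike on a display.  Guessing the script from the first word of the
   character name is a heuristic assumed here for brevity; the Unicode
   Script property is the authoritative source.

```python
import unicodedata

LATIN_CAP = "cap"                    # U+0063 U+0061 U+0070
CYRILLIC_CAP = "\u0441\u0430\u0440"  # similar-looking Cyrillic letters


def describe(label: str) -> list:
    """Return the Unicode character name of each code point in a label."""
    return [unicodedata.name(ch) for ch in label]


def scripts_used(label: str) -> set:
    """Rough script guess: the first word of each character name."""
    return {unicodedata.name(ch).split()[0] for ch in label}
```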
1395
1396 When the security implications of visually confusable characters were
1397 brought to the forefront in 2005, the term homograph was used to
1398 designate any instance of graphic similarity, even when comparing
1399 individual characters. This usage is not only incorrect, but risks
1400 introducing even more confusion and hence should be avoided. The
1401 current preferred terminology is to describe these similar-looking
1402 characters as "confusable characters" or even "confusables".
1403
1404 Many people have suggested that confusable characters are a problem
1405 that must be addressed, at least in part, directly in the user
1406 interfaces of application software. While it should almost certainly
   be part of a complete solution, that approach creates its own set of
1408 difficulties. For example, a user switching between systems, or even
1409 between applications on the same system, may be surprised by
1410 different types of behavior and different levels of protection. In
1411 addition, it is unclear how a secure setup for the end user should be
1412 designed. Today, in the web browser, a padlock is a traditional way
1413 of describing some level of security for the end user. Is this
1414 binary signaling enough? Should there be any connection between a
1415 risk for a displayed string including confusable characters and the
1416 padlock or similar signaling to the user?
1417
1418 Many web browsers have adopted a convention, based on a "whitelist"
1419 or similar technique, of restricting the display of native characters
1420 to subdomains of top-level domains that are deemed to have safe
1421 practices for the registration of potentially confusable labels.
1422 IDNs in other domains are displayed as Punycode. These techniques
1423 may not be sufficiently sensitive to differences in policies among
1431 top-level domains and their subdomains and so, while they are clearly
1432 helpful, they may not be adequate. Are other methods of dealing with
1433 confusable characters possible? Would other methods of identifying
1434 and listing policies about avoiding confusing registrations be
1435 feasible and helpful?
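
   The whitelist convention described above can be sketched in Python
   as follows.  The list of trusted top-level domains is hypothetical,
   and the function is an illustration of the idea rather than the
   behavior of any particular browser.  The standard-library "idna"
   codec is used to obtain the Punycode form.

```python
# Hypothetical whitelist of TLDs assumed to have safe registration
# practices; real browsers maintain their own, differing lists.
TRUSTED_TLDS = {"example"}


def display_form(domain: str) -> str:
    """Return the string a browser following this convention might show."""
    ace = domain.encode("idna").decode("ascii")  # ASCII-compatible form
    tld = ace.rsplit(".", 1)[-1]
    # Native characters are shown only under a trusted TLD.
    return domain if tld in TRUSTED_TLDS else ace
```

   Because the decision depends only on the top-level domain, the
   sketch also shows the limitation noted above: it cannot reflect
   differing policies in subdomains.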
1436
   It would be interesting to see a more coordinated effort in
   establishing guidelines for user interfaces.  If nothing else, the
   current whitelists are browser-specific and can, and do, differ
   between implementations.
1441
1442 4.1.4. Protocol Changes and Policy Implications
1443
1444 Some potential protocol or table changes raise important policy
1445 issues about what to do with existing, registered, names. Should
1446 such changes be needed, their impact must be carefully evaluated in
   the IETF, ICANN, and possibly other forums.  In particular, protocol
   or policy changes under which existing names could no longer be
   registered should be considered carefully, weighing the importance
   of each change, and of consistency as seen by the user, against the
   disruption caused by invalidating older names.
1453
1454 4.1.5. Non-US-ASCII in Local Part of Email Addresses
1455
   Work is going on in the IETF related to the local part of email
   addresses.  It should be noted that the local part of an email
   address has a syntax and constraints quite different from those of a
   domain name label, so IDNA cannot be applied directly to the local
   part.
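
   As a small illustration of that boundary, the Python sketch below
   applies IDNA only to the domain of an address and passes the local
   part through untouched.  The function name is hypothetical, and the
   sketch says nothing about how a non-ASCII local part should be
   handled.

```python
def to_ascii_domain_only(address: str) -> str:
    """Apply IDNA to the domain of an address; leave the local part alone."""
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("not an address of the form local@domain")
    return local + at + domain.encode("idna").decode("ascii")
```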
1460
1461 4.1.6. Use of the Unicode Character Set in the IETF
1462
1463 Unicode and the closely-related ISO 10646 are the only coded
1464 character sets that aspire to include all of the world's characters.
1465 As such, they permit use of international characters without having
1466 to identify particular character coding standards or tables. The
1467 requirement for a single character set is particularly important for
1468 use with the DNS since there is no place to put character set
1469 identification. The decision to use Unicode as the base for IETF
1470 protocols going forward is discussed in [RFC2277]. The IAB does not
1471 see any reason to revisit the decision to use Unicode in IETF
1472 protocols.
1473
1486 4.2. Issues That Fall within the Purview of ICANN
1487
1488 4.2.1. Dispute Resolution
1489
1490 IDNs create new types of collisions between trademarks and domain
1491 names as well as collisions between domain names. These have impact
1492 on dispute resolution processes used by registries and otherwise. It
1493 is important that deployment of IDNs evolve in parallel with review
1494 and updating of ICANN or registry-specific dispute resolution
1495 processes.
1496
1497 4.2.2. Policy at Registries
1498
1499 The IAB recommends that registries use an inclusion-based model when
1500 choosing what characters to allow at the time of registration. This
1501 list of characters is in turn to be a subset of what is allowed
1502 according to the updated IDNA standard. The IAB further recommends
   that registries develop their inclusion-based models in parallel with
   the dispute resolution process at the registry itself.
1505
1506 Most established policies for dealing with claimed or apparent
1507 confusion or conflicts of names are based on dispute resolution.
1508 Decisions about legitimate use or registration of one or more names
1509 are resolved at or after the time of registration on a case-by-case
1510 basis and using policies that are specific to the particular DNS zone
1511 or jurisdiction involved. These policies have generally not been
1512 extended below the level of the DNS that is directly controlled by
1513 the top-level registry.
1514
1515 Because of the number of conflicts that can be generated by the
1516 larger number of available and confusable characters in Unicode, we
1517 recommend that registration-restriction and dispute resolution
1518 policies be developed to constrain registration of IDNs and zone
1519 administrators at all levels of the DNS tree. Of course, many of
1520 these policies will be less formal than others and there is no
1521 requirement for complete global consistency, but the arguments for
1522 reduction of confusable characters and other issues in TLDs should
1523 apply to all zones below that specific TLD.
1524
1525 Consistency across all zones can obviously only be accomplished by
1526 changes to the protocols. Such changes should be considered by the
1527 IETF if particular restrictions are identified that are important and
1528 consistent enough to be applied globally.
1529
1530 Some potential protocol changes or changes to character-mapping
1531 tables might, if adopted, have profound registry policy implications.
1532 See Section 4.1.4.
1533
1541 4.2.3. IDNs at the Top Level of the DNS
1542
1543 The IAB has concluded that there is not one issue with IDNs at the
1544 top level of the DNS (IDN TLDs) but at least three very separate
1545 ones:
1546
1547 o If IDNs are to be entered in the root zone, decisions must first
1548 be made about how these TLDs are to be named and delegated. These
1549 decisions fall within the traditional IANA scope and are ICANN
1550 issues today.
1551
1552 o There has been discussion of permitting some or all existing TLDs
1553 to be referenced by multiple labels, with those labels presumably
1554 representing some understanding of the "name" of the TLD in
1555 different languages. If actual aliases of this type are desired
1556 for existing domains, the IETF may need to consider whether the
1557 use of DNAME records in the root is appropriate to meet that need,
1558 what constraints, if any, are needed, whether alternate
1559 approaches, such as those of [RFC4185], are appropriate or whether
1560 further alternatives should be investigated. But, to the extent
1561 to which aliases are considered desirable and feasible, decisions
1562 presumably must be made as to which, if any, root IDN labels
1563 should be associated with DNAME records and which ones should be
1564 handled by normal delegation records or other mechanisms. That
1565 decision is one of DNS root-level namespace policy and hence falls
1566 to ICANN although we would expect ICANN to pay careful attention
1567 to any technical, operational, or security recommendations that
1568 may be produced by other bodies.
1569
1570 o Finally, if IDN labels are to be placed in the root zone, there
1571 are issues associated with how they are to be encoded and
1572 deployed. This area may have implications for work that has been
1573 done, or should be done, in the IETF.
1574
1575 5. Specific Recommendations for Next Steps
1576
1577 Consistent with the framework described above, the IAB offers these
1578 recommendations as steps for further consideration in the identified
1579 groups.
1580
1581 5.1. Reduction of Permitted Character List
1582
1583 Generalize from the original "hostname" rules to non-ASCII
1584 characters, permitting as few characters as possible to do that job.
1585 This would involve a restrictive model for characters permitted in
1586 IDN labels, thus contrasting with the approach used to develop the
1587 original IDNA/Nameprep tables. That approach was to include all
1588 Unicode characters that there was not a clear reason to exclude.
1589
1596 The specific recommendation here is to specify such internationalized
1597 hostnames. Such an activity would fall to the IETF, although the
1598 task of developing the appropriate list of permitted characters will
1599 require effort both in the IETF and elsewhere. The effort should be
1600 as linguistically and culturally sensitive as possible, but smooth
1601 and effective operation of the DNS, including minimizing of
1602 complexity, should be primary goals. The following should be
1603 considered as possible mechanisms for achieving an appropriate
1604 minimum number of characters.
1605
1606 5.1.1. Elimination of All Non-Language Characters
1607
1608 Unicode characters that are not needed to write words or numbers in
1609 any of the world's languages should be eliminated from the list of
1610 characters that are appropriate in DNS labels. In addition to such
1611 characters as those used for box-drawing and sentence punctuation,
1612 this should exclude punctuation for word structure and other
1613 delimiters. While DNS labels may conveniently be used to express
1614 words in many circumstances, the goal is not to express words (or
1615 sentences or phrases), but to permit the creation of unambiguous
1616 labels with good mnemonic value.
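
   One way to approximate this rule is with Unicode general categories.
   The Python sketch below treats letters (L*), combining marks (M*),
   and decimal digits (Nd) as the characters needed to write words or
   numbers.  That mapping is an assumption made for this example; an
   actual list of permitted characters would need the careful review
   described above.

```python
import unicodedata


def is_language_character(ch: str) -> bool:
    """Assumed rule: keep letters, combining marks, and decimal digits."""
    category = unicodedata.category(ch)
    return category[0] in ("L", "M") or category == "Nd"


def non_language_characters(label: str) -> list:
    """List the characters in a label that the assumed rule would exclude."""
    return [ch for ch in label if not is_language_character(ch)]
```

   Note that this rule also excludes the ASCII hyphen, which would
   therefore need separate treatment.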
1617
1618 5.1.2. Elimination of Word-Separation Punctuation
1619
1620 The inclusion of the hyphen in the original hostname rules is a
1621 historical artifact from an older, flat, namespace. The community
1622 should consider whether it is appropriate to treat it as a simple
1623 legacy property of ASCII names and not attempt to generalize it to
1624 other scripts. We might, for example, not permit claimed equivalents
1625 to the hyphen from other scripts to be used in IDNs. We might even
1626 consider banning use of the hyphen itself in non-ASCII strings or,
1627 less restrictively, strings that contained non-Latin characters.
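
   The stricter of these options can be sketched in Python as follows.
   The sketch assumes that the Unicode "Pd" (dash punctuation) general
   category is an adequate stand-in for "claimed equivalents to the
   hyphen"; that assumption, and the function names, are illustrative
   only.

```python
import unicodedata


def dash_like(label: str) -> list:
    """Characters in the Unicode 'Pd' (dash punctuation) category."""
    return [ch for ch in label if unicodedata.category(ch) == "Pd"]


def hyphen_policy_ok(label: str) -> bool:
    """Allow U+002D only in all-ASCII labels; reject every other dash."""
    dashes = dash_like(label)
    if any(ch != "-" for ch in dashes):
        return False  # a hyphen equivalent from outside ASCII
    return not dashes or label.isascii()
```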
1628
1629 5.2. Updating to New Versions of Unicode
1630
1631 As new scripts, to support new languages, continue to be added to
1632 Unicode, it is important that IDNA track updates. If it does not do
1633 so, but remains "stuck" at 3.2 or some single later version, it will
1634 not be possible to include labels in the DNS that are derived from
1635 words in languages that require characters that are available only in
1636 later versions. Making those upgrades is difficult, and will
1637 continue to be difficult, as long as new versions require, not just
1638 addition of characters, but changes to canonicalization conventions,
1639 normalization tables, or matching procedures (see Section 3.1).
1640 Anything that can be done to lower complexity and simplify forward
1641 transitions should be seriously considered.
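
   The effect of remaining at Unicode 3.2 can be observed with the
   Python sketch below.  It assumes the standard-library unicodedata
   module, whose ucd_3_2_0 object reports unassigned code points with
   the general category "Cn".  The function names are hypothetical.

```python
import unicodedata


def assigned_in_unicode_3_2(ch: str) -> bool:
    """True if the code point had been assigned as of Unicode 3.2."""
    return unicodedata.ucd_3_2_0.category(ch) != "Cn"


def needs_newer_unicode(label: str) -> list:
    """Characters in a label that are unassigned in Unicode 3.2."""
    return [ch for ch in label if not assigned_in_unicode_3_2(ch)]
```

   A label for which the second function returns a non-empty list
   cannot be handled by an IDNA that is tied to Unicode 3.2.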
1642
1651 5.3. Role and Uses of the DNS
1652
1653 We wish to remind the community that there are boundaries to the
1654 appropriate uses of the DNS. It was designed and implemented to
1655 serve some specific purposes. There are additional things that it
1656 does well, other things that it does badly, and still other things it
1657 cannot do at all. No amount of protocol work on IDNs will solve
1658 problems with alternate spellings, near-matches, searching for
   appropriate names, and so on.  Registration restrictions and
   carefully-designed user interfaces can reduce the risk and pain when
   attempts to do some of these things go wrong, and can reduce the
   risks of various sorts of deliberate bad behavior, but, beyond a
   certain point, use of the DNS simply because it is available
   becomes a bad tradeoff.  The tradeoff may be particularly unfortunate
1665 when the use of IDNs does not actually solve the proposed problem.
1666 For example, internationalization of DNS names does not eliminate the
1667 ASCII protocol identifiers and structure of URIs [RFC3986] and even
1668 IRIs [RFC3987]. Hence, DNS internationalization itself, at any or
1669 all levels of the DNS tree, is not a sufficient response to the
1670 desire of populations to use the Internet entirely in their own
1671 languages and the characters associated with those languages.
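
   The point about protocol identifiers can be seen in a short Python
   sketch that converts only the host of an identifier to its
   ASCII-compatible form.  The function name is hypothetical, and port
   and userinfo components are ignored for brevity.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_uri(iri: str) -> str:
    """Convert only the host of an IRI-like string to its ACE form.

    The scheme and the delimiters pass through unchanged because they
    are ASCII to begin with.
    """
    parts = urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii")
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

   Whatever script the host is written in, the "http" scheme name and
   the "://" and "/" delimiters remain ASCII.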
1672
1673 These issues are discussed at more length, and alternatives
1674 presented, in [RFC2825], [RFC3467], [INDNS], and [DNS-Choices].
1675
1676 5.4. Databases of Registered Names
1677
1678 In addition to their presence in the DNS, IDNs introduce issues in
1679 other contexts in which domain names are used. In particular, the
1680 design and content of databases that bind registered names to
1681 information about the registrant (commonly described as "whois"
1682 databases) will require review and updating. For example, the whois
1683 protocol itself [RFC3912] has no standard capability for handling
1684 non-ASCII text: one cannot search consistently for, or report, either
1685 a DNS name or contact information that is not in ASCII characters.
1686 This may provide some additional impetus for a switch to IRIS
1687 [RFC3981] [RFC3982] but also raises a number of other questions about
1688 what information, and in what languages and scripts, should be
1689 included or permitted in such databases.
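
   The limitation can be made concrete with a sketch of how a client
   might build a whois request for an IDN.  The request is a single
   line terminated by CR LF, with no field for naming a character
   encoding, so this Python example sends the ASCII-compatible form of
   the name.  The function name is hypothetical, and whether any given
   server accepts or understands that form is outside what the
   protocol defines.

```python
def build_whois_query(name: str) -> bytes:
    """Build a one-line whois request using the ACE form of the name."""
    return name.encode("idna") + b"\r\n"
```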
1690
1691 6. Security Considerations
1692
1693 This document is simply a discussion of IDNs and IDNA issues; it
   raises no new security concerns.  However, if some of its
   recommendations (to reduce IDNA complexity, to reduce the number of
   available characters, and to constrain the use of confusable
   characters) are followed and prove successful, the risks of name
   spoofing and other problems may be reduced.
1699
1706 7. Acknowledgements
1707
1708 The contributions to this report from members of the IAB-IDN ad hoc
1709 committee are gratefully acknowledged. Of course, not all of the
1710 members of that group endorse every comment and suggestion of this
1711 report. In particular, this report does not claim to reflect the
1712 views of the Unicode Consortium as a whole or those of particular
1713 participants in the work of that Consortium.
1714
1715 The members of the ad hoc committee were: Rob Austein, Leslie Daigle,
1716 Tina Dam, Mark Davis, Patrik Faltstrom, Scott Hollenbeck, Cary Karp,
1717 John Klensin, Gervase Markham, David Meyer, Thomas Narten, Michael
1718 Suignard, Sam Weiler, Bert Wijnen, Kurt Zeilenga, and Lixia Zhang.
1719
1720 Thanks are due to Tina Dam and others associated with the ICANN IDN
1721 Working Group for contributions of considerable specific text, to
1722 Marcos Sanz and Paul Hoffman for careful late-stage reading and
1723 extensive comments, and to Pete Resnick for many contributions and
1724 comments, both in conjunction with his former IAB service and
1725 subsequently. Olaf M. Kolkman took over IAB leadership for this
1726 document after Patrik Faltstrom and Pete Resnick stepped down in
1727 March 2006.
1728
1729 Members of the IAB at the time of approval of this document were:
1730 Bernard Aboba, Loa Andersson, Brian Carpenter, Leslie Daigle, Patrik
1731 Faltstrom, Bob Hinden, Kurtis Lindqvist, David Meyer, Pekka Nikander,
1732 Eric Rescorla, Pete Resnick, Jonathan Rosenberg and Lixia Zhang.
1733
1734 8. References
1735
1736 8.1. Normative References
1737
1738 [ISO10646] International Organization for Standardization,
1739 "Information Technology - Universal Multiple-
1740 Octet Coded Character Set (UCS) - Part 1:
                 Architecture and Basic Multilingual Plane",
1742 ISO/IEC 10646-1:2000, October 2000.
1743
1744 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
1745 Internationalized Strings ("stringprep")",
1746 RFC 3454, December 2002.
1747
1748 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
1749 "Internationalizing Domain Names in Applications
1750 (IDNA)", RFC 3490, March 2003.
1751
1761 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A
1762 Stringprep Profile for Internationalized Domain
1763 Names (IDN)", RFC 3491, March 2003.
1764
1765 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of
1766 Unicode for Internationalized Domain Names in
1767 Applications (IDNA)", RFC 3492, March 2003.
1768
1769 [Unicode32] The Unicode Consortium, "The Unicode Standard,
1770 Version 3.0", 2000.
1771 (Reading, MA, Addison-Wesley, 2000. ISBN
1772 0-201-61633-5). Version 3.2 consists of the
1773 definition in that book as amended by the Unicode
1774 Standard Annex #27: Unicode 3.1
1775 (http://www.unicode.org/reports/tr27/) and by the
1776 Unicode Standard Annex #28: Unicode 3.2
1777 (http://www.unicode.org/reports/tr28/).
1778
1779 8.2. Informative References
1780
1781 [DNS-Choices] Faltstrom, P., "Design Choices When Expanding
1782 DNS", Work in Progress, June 2005.
1783
1784 [ICANNv1] ICANN, "Guidelines for the Implementation of
1785 Internationalized Domain Names, Version 1.0",
1786 March 2003, <http://www.icann.org/general/
1787 idn-guidelines-20jun03.htm>.
1788
1789 [ICANNv2] ICANN, "Guidelines for the Implementation of
1790 Internationalized Domain Names, Version 2.0",
1791 November 2005, <http://www.icann.org/general/
1792 idn-guidelines-20sep05.htm>.
1793
1794 [IESG-IDN] Internet Engineering Steering Group (IESG), "IESG
1795 Statement on IDN", IESG Statements IDN Statement,
1796 February 2003, <http://www.ietf.org/IESG/
1797 STATEMENTS/IDNstatement.txt>.
1798
1799 [INDNS] National Research Council, "Signposts in
1800 Cyberspace: The Domain Name System and Internet
1801 Navigation", National Academy Press ISBN 0309-
1802 09640-5 (Book) 0309-54979-5 (PDF), 2005, <http://
1803 www7.nationalacademies.org/cstb/pub_dns.html>.
1804
1805 [ISO.2022.1986] International Organization for Standardization,
1806 "Information Processing: ISO 7-bit and 8-bit
1807 coded character sets: Code extension techniques",
1808 ISO Standard 2022, 1986.
1809
1816 [ISO.646.1991] International Organization for Standardization,
1817 "Information technology - ISO 7-bit coded
1818 character set for information interchange",
1819 ISO Standard 646, 1991.
1820
1821 [ISO.8859.2003] International Organization for Standardization,
1822 "Information processing - 8-bit single-byte coded
1823 graphic character sets - Part 1: Latin alphabet
1824 No. 1 (1998) - Part 2: Latin alphabet No. 2
1825 (1999) - Part 3: Latin alphabet No. 3 (1999) -
1826 Part 4: Latin alphabet No. 4 (1998) - Part 5:
1827 Latin/Cyrillic alphabet (1999) - Part 6: Latin/
1828 Arabic alphabet (1999) - Part 7: Latin/Greek
1829 alphabet (2003) - Part 8: Latin/Hebrew alphabet
1830 (1999) - Part 9: Latin alphabet No. 5 (1999) -
1831 Part 10: Latin alphabet No. 6 (1998) - Part 11:
1832 Latin/Thai alphabet (2001) - Part 13: Latin
1833 alphabet No. 7 (1998) - Part 14: Latin alphabet
1834 No. 8 (Celtic) (1998) - Part 15: Latin alphabet
                 No. 9 (1999) - Part 16: Latin alphabet
1836 No. 10 (2001)", ISO Standard 8859, 2003.
1837
1838 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets
1839 and Languages", BCP 18, RFC 2277, January 1998.
1840
1841 [RFC2825] IAB and L. Daigle, "A Tangled Web: Issues of
1842 I18N, Domain Names, and the Other Internet
1843 protocols", RFC 2825, May 2000.
1844
1845 [RFC3066] Alvestrand, H., "Tags for the Identification of
1846 Languages", BCP 47, RFC 3066, January 2001.
1847
1848 [RFC3467] Klensin, J., "Role of the Domain Name System
1849 (DNS)", RFC 3467, February 2003.
1850
1851 [RFC3536] Hoffman, P., "Terminology Used in
1852 Internationalization in the IETF", RFC 3536,
1853 May 2003.
1854
1855 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko,
1856 "Joint Engineering Team (JET) Guidelines for
1857 Internationalized Domain Names (IDN) Registration
1858 and Administration for Chinese, Japanese, and
1859 Korean", RFC 3743, April 2004.
1860
1861 [RFC3912] Daigle, L., "WHOIS Protocol Specification",
1862 RFC 3912, September 2004.
1863
1871 [RFC3981] Newton, A. and M. Sanz, "IRIS: The Internet
1872 Registry Information Service (IRIS) Core
1873 Protocol", RFC 3981, January 2005.
1874
1875 [RFC3982] Newton, A. and M. Sanz, "IRIS: A Domain Registry
1876 (dreg) Type for the Internet Registry Information
1877 Service (IRIS)", RFC 3982, January 2005.
1878
1879 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter,
1880 "Uniform Resource Identifier (URI): Generic
1881 Syntax", STD 66, RFC 3986, January 2005.
1882
1883 [RFC3987] Duerst, M. and M. Suignard, "Internationalized
1884 Resource Identifiers (IRIs)", RFC 3987,
1885 January 2005.
1886
1887 [RFC4185] Klensin, J., "National and Local Characters for
1888 DNS Top Level Domain (TLD) Names", RFC 4185,
1889 October 2005.
1890
1891 [RFC4290] Klensin, J., "Suggested Practices for
1892 Registration of Internationalized Domain Names
1893 (IDN)", RFC 4290, December 2005.
1894
1895 [RFC4645] Ewell, D., "Initial Language Subtag Registry",
1896 RFC 4645, September 2006.
1897
1898 [RFC4646] Phillips, A. and M. Davis, "Tags for Identifying
1899 Languages", BCP 47, RFC 4646, September 2006.
1900
1901 [UTR] Unicode Consortium, "Unicode Technical Reports",
1902 <http://www.unicode.org/reports/>.
1903
1904 [UTR36] Davis, M. and M. Suignard, "Unicode Technical
1905 Report #36: Unicode Security Considerations",
1906 November 2005, <http://www.unicode.org/draft/
1907 reports/tr36/tr36.html>.
1908
1909 [UTR39] Davis, M. and M. Suignard, "Unicode Technical
1910 Standard #39 (proposed): Unicode Security
1911 Considerations", July 2005, <http://
1912 www.unicode.org/draft/reports/tr39/tr39.html>.
1913
1914 [Unicode-PR29] The Unicode Consortium, "Public Review Issue #29:
1915 Normalization Issue", Unicode PR 29,
1916 February 2004.
1917
1918 [Unicode10] The Unicode Consortium, "The Unicode Standard,
1926 Version 1.0", 1991.
1927
1928 [W3C-Localization] Ishida, R. and S. Miller, "Localization vs.
1929 Internationalization", W3C International/
1930 questions/qa-i18n.txt, December 2005.
1931
1932 [net-utf8] Klensin, J. and M. Padlipsky, "Unicode Format for
1933 Network Interchange", Work in Progress,
1934 April 2006.
1935
1936 Authors' Addresses
1937
1938 John C Klensin
1939 1770 Massachusetts Ave, #322
1940 Cambridge, MA 02140
1941 USA
1942
1943 Phone: +1 617 491 5735
1944 EMail: john-ietf@jck.com
1945
1946
1947 Patrik Faltstrom
1948 Cisco Systems
1949
1950 EMail: paf@cisco.com
1951
1952
1953 Cary Karp
1954 Swedish Museum of Natural History
1955 Box 50007
1956 Stockholm SE-10405
1957 Sweden
1958
1959 Phone: +46 8 5195 4055
1960 EMail: ck@nrm.museum
1961
1962
1963 IAB
1964
1965 EMail: iab@iab.org
1966
1981 Full Copyright Statement
1982
1983 Copyright (C) The Internet Society (2006).
1984
1985 This document is subject to the rights, licenses and restrictions
1986 contained in BCP 78, and except as set forth therein, the authors
1987 retain all their rights.
1988
1989 This document and the information contained herein are provided on an
1990 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1991 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1992 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1993 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1994 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1995 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1996
1997 Intellectual Property
1998
1999 The IETF takes no position regarding the validity or scope of any
2000 Intellectual Property Rights or other rights that might be claimed to
2001 pertain to the implementation or use of the technology described in
2002 this document or the extent to which any license under such rights
2003 might or might not be available; nor does it represent that it has
2004 made any independent effort to identify any such rights. Information
2005 on the procedures with respect to rights in RFC documents can be
2006 found in BCP 78 and BCP 79.
2007
2008 Copies of IPR disclosures made to the IETF Secretariat and any
2009 assurances of licenses to be made available, or the result of an
2010 attempt made to obtain a general license or permission for the use of
2011 such proprietary rights by implementers or users of this
2012 specification can be obtained from the IETF on-line IPR repository at
2013 http://www.ietf.org/ipr.
2014
2015 The IETF invites any interested party to bring to its attention any
2016 copyrights, patents or patent applications, or other proprietary
2017 rights that may cover technology that may be required to implement
2018 this standard. Please address the information to the IETF at
2019 ietf-ipr@ietf.org.
2020
2021 Acknowledgement
2022
2023 Funding for the RFC Editor function is provided by the IETF
2024 Administrative Support Activity (IASA).