Network Working Group                                         J. Klensin
Request for Comments: 4690                                  P. Faltstrom
Category: Informational                                    Cisco Systems
                                                                 C. Karp
                                       Swedish Museum of Natural History
                                                                     IAB
                                                          September 2006

  Review and Recommendations for Internationalized Domain Names (IDNs)
   12 Status of This Memo                                                        
   14    This memo provides information for the Internet community.  It does     
   15    not specify an Internet standard of any kind.  Distribution of this     
   16    memo is unlimited.                                                      
   18 Copyright Notice                                                           
   20    Copyright (C) The Internet Society (2006).                              
   22 Abstract                                                                   
   This note describes issues raised by the deployment and use of
   Internationalized Domain Names (IDNs).  It describes problems both
   at the time of registration and in the use of those names in the
   DNS.  It recommends that the IETF update the RFCs relating to IDNs,
   proposes a framework to be followed in doing so, and summarizes
   and identifies some work that is required outside the IETF.  In
   particular, it proposes that some changes be investigated for the
   Internationalizing Domain Names in Applications (IDNA) standard and
   its supporting tables, based on experience gained since those
   standards were completed.
   35 Table of Contents                                                          
   37    1. Introduction ....................................................3   
   38       1.1. The Role of IDNs and This Document .........................3   
   39       1.2. Status of This Document and Its Recommendations ............4   
   40       1.3. The IDNA Standard ..........................................4   
   41       1.4. Unicode Documents ..........................................5   
   42       1.5. Definitions ................................................5   
   43            1.5.1. Language ............................................6   
   44            1.5.2. Script ..............................................6   
   45            1.5.3. Multilingual ........................................6   
   46            1.5.4. Localization ........................................7   
   47            1.5.5. Internationalization ................................7   
   52 Klensin, et al.              Informational                      [Page 1]   

   53 RFC 4690                 IAB -- IDN Next Steps            September 2006   
   56       1.6. Statements and Guidelines ..................................7   
   57            1.6.1. IESG Statement ......................................8   
   58            1.6.2. ICANN Statements ....................................8   
   59    2. General Problems and Issues ....................................11   
   60       2.1. User Conceptions, Local Character Sets, and Input issues ..11   
   61       2.2. Examples of Issues ........................................13   
   62            2.2.1. Language-Specific Character Matching ...............13   
   63            2.2.2. Multiple Scripts ...................................13   
   64            2.2.3. Normalization and Character Mappings ...............14   
   65            2.2.4. URLs in Printed Form ...............................16   
   66            2.2.5. Bidirectional Text .................................17   
   67            2.2.6. Confusable Character Issues ........................17   
   68            2.2.7. The IESG Statement and IDNA issues .................19   
   69    3. Migrating to New Versions of Unicode ...........................20   
   70       3.1. Versions of Unicode .......................................20   
   71       3.2. Version Changes and Normalization Issues ..................21   
   72            3.2.1. Unnormalized Combining Sequences ...................21   
   73            3.2.2. Combining Characters and Character Components ......22   
   74            3.2.3. When does normalization occur? .....................23   
   75    4. Framework for Next Steps in IDN Development ....................24   
   76       4.1. Issues within the Scope of the IETF .......................24   
   77            4.1.1. Review of IDNA .....................................24   
   78            4.1.2. Non-DNS and Above-DNS Internationalization               
   79                   Approaches .........................................25   
   80            4.1.3. Security Issues, Certificates, etc. ................25   
   81            4.1.4. Protocol Changes and Policy Implications ...........27   
   82            4.1.5. Non-US-ASCII in Local Part of Email Addresses ......27   
   83            4.1.6. Use of the Unicode Character Set in the IETF .......27   
   84       4.2. Issues That Fall within the Purview of ICANN ..............28   
   85            4.2.1. Dispute Resolution .................................28   
   86            4.2.2. Policy at Registries ...............................28   
   87            4.2.3. IDNs at the Top Level of the DNS ...................29   
   88    5. Specific Recommendations for Next Steps ........................29   
   89       5.1. Reduction of Permitted Character List .....................29   
   90            5.1.1. Elimination of All Non-Language Characters .........30   
   91            5.1.2. Elimination of Word-Separation Punctuation .........30   
   92       5.2. Updating to New Versions of Unicode .......................30   
   93       5.3. Role and Uses of the DNS ..................................31   
   94       5.4. Databases of Registered Names .............................31   
   95    6. Security Considerations ........................................31   
   96    7. Acknowledgements ...............................................32   
   97    8. References .....................................................32   
   98       8.1. Normative References ......................................32   
   99       8.2. Informative References ....................................33   
  111 1.  Introduction                                                           
  113 1.1.  The Role of IDNs and This Document                                   
  115    While IDNs have been advocated as the solution to a wide range of       
  116    problems, this document is written from the perspective that they are   
  117    no more and no less than DNS names, reflecting the same requirements    
  118    for use, stability, and accuracy as traditional "hostnames", but        
  119    using a much larger collection of permitted characters.  In             
  120    particular, while IDNs represent a step toward an Internet that is      
  121    equally accessible from all languages and scripts, they, at best,       
  122    address only a small part of that very broad objective.  There has      
  123    been controversy since IDNs were first suggested about how important    
  124    they will actually turn out to be; that controversy will probably       
   continue.  Accessibility from all languages is an important
   objective; hence, it is important that our standards and
   definitions for IDNs be smoothly adaptable to additional scripts as
   they are added to the Unicode character set.
  130    The utility of IDNs must be evaluated in terms of their application     
  131    by users and in protocols: the ability to simply put a name into the    
  132    DNS and retrieve it is not, in and of itself, important.  From this     
  133    point of view, IDNs will be useful and effective if they provide        
  134    stable and predictable references -- references that are no less        
  135    stable and predictable, and no less secure, than their ASCII            
  136    counterparts.                                                           
  138    This combination of objectives and criteria has proven very difficult   
  139    to satisfy.  Experience in developing the IDNA standard and during      
  140    the initial years of its implementation and deployment suggests that    
  141    it may be impossible to fully satisfy all of them and that              
  142    engineering compromises are needed to yield a result that is            
  143    workable, even if not completely satisfactory.  Based on that           
  144    experience and issues that have been raised, it is now appropriate to   
  145    review some of the implications of IDNs, the decisions made in          
  146    defining them, and the foundation on which they rest and determine      
  147    whether changes are needed and, if so, which ones.                      
  149    The design of the DNS itself imposes some additional constraints.  If   
  150    the DNS is to remain globally interoperable, there are specific         
  151    characteristics that no implementation of IDNs, or the DNS more         
   generally, can change.  For example, because the DNS is a global
   hierarchical administrative namespace with only a single name at any
   given node, there is one and only one owner of each domain name.
  155    Also, when strings are looked up in the DNS, positive responses can     
  156    only reflect exact matches: if there is no exact match, then one gets   
  157    an error reply, not a list of near matches or other supplemental        
  158    information.  Searches and approximate matchings are not possible.      
  166    Finally, because the DNS is a distributed system where any server       
  167    might cache responses, and later use those cached responses to          
  168    attempt to satisfy queries before a global lookup is done, every        
  169    server must use the same matching criteria.                             
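   The exact-match behavior described above can be illustrated with a
   minimal sketch.  The zone contents, the function names, and the
   "NXDOMAIN" return string are assumptions made for illustration; this
   is not a DNS implementation.  The point is only that a lookup either
   matches the canonical form of a stored name or fails, and that the
   canonicalization rule has to be the same wherever a response may be
   cached.

```python
# Hypothetical zone data: owner names are stored in one canonical form.
ZONE = {
    "example.com.": "192.0.2.1",
    "www.example.com.": "192.0.2.2",
}


def canonical(name: str) -> str:
    """Apply the single matching rule that every server must share:
    fold ASCII letters to lowercase and add the trailing root dot."""
    folded = "".join(c.lower() if c.isascii() else c for c in name)
    return folded if folded.endswith(".") else folded + "."


def lookup(name: str) -> str:
    """Return the stored address on an exact match, else an error.
    Near matches and suggestions are never returned."""
    return ZONE.get(canonical(name), "NXDOMAIN")
```

   In this sketch, lookup("WWW.Example.COM") succeeds because the
   comparison is case-insensitive for ASCII, while a near miss such as
   lookup("ww.example.com") yields only the error value.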
  171 1.2.  Status of This Document and Its Recommendations                      
  173    This document reviews the IDN landscape from an IETF perspective and    
  174    presents the recommendations and conclusions of the IAB, based          
  175    partially on input from an ad hoc committee charged with reviewing      
  176    IDN issues and the path forward (see Section 7).  Its recommendations   
  177    are advice to the IETF, or in a few cases to other bodies, for topics   
  178    to be investigated and actions to be taken if those bodies, after       
  179    their examinations, consider those actions appropriate.                 
  181 1.3.  The IDNA Standard                                                    
  183    During 2002, the IETF completed the following RFCs that, together,      
  184    define IDNs:                                                            
  186    RFC 3454  Preparation of Internationalized Strings ("Stringprep")       
  187       [RFC3454].                                                           
  188       Stringprep is a generic mechanism for taking a Unicode string and    
  189       converting it into a canonical format.  Stringprep itself is just    
  190       a collection of rules, tables, and operations.  Any protocol or      
  191       algorithm that uses it must define a "Stringprep profile", which     
  192       specifies which of those rules are applied, how, and with which      
  193       characteristics.                                                     
  195    RFC 3490  Internationalizing Domain Names in Applications (IDNA)        
  196       [RFC3490].                                                           
  197       IDNA is the base specification in this group.  It specifies that     
  198       Nameprep is used as the Stringprep profile for domain names, and     
      that Punycode is the relevant encoding mechanism for use in
      generating an ASCII-Compatible Encoding (ACE) form of the name.
      It also applies some additional conversions and character
      filtering that are not part of Nameprep.
  204    RFC 3491  Nameprep: A Stringprep Profile for Internationalized Domain   
  205       Names (IDN) [RFC3491].                                               
  206       Nameprep is designed to meet the specific needs of IDNs and, in      
  207       particular, to support case-folding for scripts that support what    
  208       are traditionally known as upper- and lowercase forms of the same    
  209       letters.  The result of the Nameprep algorithm is a string           
  210       containing a subset of the Unicode Character set, normalized and     
  211       case-folded so that case-insensitive comparison can be made.         
  221    RFC 3492  Punycode: A Bootstring encoding of Unicode for                
  222       Internationalized Domain Names in Applications (IDNA) [RFC3492].     
      Punycode is a mechanism for encoding a Unicode string in ASCII
      characters.  The characters used are the same subset of
      characters that are allowed in the hostname definition of DNS,
      i.e., the "letter, digit, and hyphen" characters, sometimes known
      as "LDH".
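   The way the four specifications fit together can be seen in a short
   sketch.  It assumes the Python standard library, whose "idna" and
   "punycode" codecs and encodings.idna.nameprep() function follow the
   2002/2003 RFCs listed above.  The helper names to_ace() and
   from_ace() are illustrative and are not part of any specification.

```python
import encodings.idna  # standard-library module following RFC 3490/3491


def to_ace(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible form."""
    return domain.encode("idna").decode("ascii")


def from_ace(domain: str) -> str:
    """Convert an ACE domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


# Nameprep alone: case-folding and normalization of a single label.
prepared = encodings.idna.nameprep("B\u00dcCHER")  # input is "BÜCHER"

# Punycode alone: the Bootstring encoding, without the "xn--" ACE prefix.
encoded = prepared.encode("punycode").decode("ascii")

# IDNA as a whole: Nameprep, then Punycode, then the ACE prefix,
# applied label by label.
ace = to_ace("B\u00fccher.example")
```

   Here the uppercase label is folded to "bücher" by Nameprep, Punycode
   turns that into the LDH-only string "bcher-kva", and the complete
   name becomes "xn--bcher-kva.example".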
  229 1.4.  Unicode Documents                                                    
  231    Unicode is used as the base, and defining, character set for IDNs.      
  232    Unicode is standardized by the Unicode Consortium, and synchronized     
  233    with ISO to create ISO/IEC 10646 [ISO10646].  At the time the RFCs      
  234    mentioned earlier were created, Unicode was at Version 3.2.  For        
  235    reasons explained later, it was necessary to pick a particular,         
  236    then-current, version of Unicode when IDNA was adopted.                 
  237    Consequently, the RFCs are explicitly dependent on Unicode Version      
  238    3.2 [Unicode32].  There is, at present, no established mechanism for    
  239    modifying the IDNA RFCs to use newer Unicode versions (see              
  240    Section 3.1).                                                           
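   The practical effect of this version dependency can be sketched as
   follows.  The example assumes Python's unicodedata module, which
   keeps a frozen Unicode 3.2 database (ucd_3_2_0) alongside its
   current one; the function names are illustrative.  A code point
   added to Unicode after Version 3.2 is simply unassigned from the
   point of view of the IDNA tables.

```python
import unicodedata
from unicodedata import ucd_3_2_0  # frozen Unicode 3.2 database


def assigned_in_3_2(ch: str) -> bool:
    """True if the code point was assigned in Unicode 3.2.
    General Category "Cn" means "not assigned"."""
    return ucd_3_2_0.category(ch) != "Cn"


def assigned_now(ch: str) -> bool:
    """True if the code point is assigned in the interpreter's
    current Unicode database."""
    return unicodedata.category(ch) != "Cn"
```

   For example, U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) is assigned
   in both databases, whereas U+1E9E (LATIN CAPITAL LETTER SHARP S),
   which was added in a later version of Unicode, is assigned only in
   the current one.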
  242    Unicode is a very large and complex character set.  (The term           
  243    "character set" or "charset" is used in a way that is peculiar to the   
  244    IETF and may not be the same as the usage in other bodies and           
  245    contexts.)  The Unicode Standard and related documents are created      
  246    and maintained by the Unicode Technical Committee (UTC), one of the     
  247    committees of the Unicode Consortium.                                   
  249    The Consortium first published The Unicode Standard [Unicode10] in      
  250    1991, and continues to develop standards based on that original work.   
  251    Unicode is developed in conjunction with the International              
  252    Organization for Standardization, and it shares its character           
  253    repertoire with ISO/IEC 10646.  Unicode and ISO/IEC 10646 function      
  254    equivalently as character encodings, but The Unicode Standard           
  255    contains much more information for implementers, covering -- in depth   
  256    -- topics such as bitwise encoding, collation, and rendering.  The      
  257    Unicode Standard enumerates a multitude of character properties,        
  258    including those needed for supporting bidirectional text.  The          
  259    Unicode Consortium and ISO standards do use slightly different          
  260    terminology.                                                            
  262 1.5.  Definitions                                                          
  264    The following terms and their meanings are critical to understanding    
  265    the rest of this document and to discussions of IDNs more generally.    
  266    These terms are derived from [RFC3536], which contains additional       
  267    discussion of some of them.                                             
  276 1.5.1.  Language                                                           
  278    A language is a way that humans interact.  The use of language occurs   
  279    in many forms, including speech, writing, and signing.                  
  281    Some languages have a close relationship between the written and        
  282    spoken forms, while others have a looser relationship.  RFC 3066        
  283    [RFC3066] discusses languages in more detail and provides identifiers   
  284    for languages for use in Internet protocols.  Computer languages are    
  285    explicitly excluded from this definition.  The most recent IETF work    
  286    in this area, and on script identification (see below), is documented   
  287    in [RFC4645] and [RFC4646].                                             
  289 1.5.2.  Script                                                             
  291    A script is a set of graphic characters used for the written form of    
  292    one or more languages.  This definition is the one used in              
  293    [ISO10646].                                                             
  295    Examples of scripts are Arabic, Cyrillic, Greek, Han (the so-called     
  296    ideographs used in writing Chinese, Japanese, and Korean), and          
  297    "Latin".  Arabic, Greek, and Latin are, of course, also names of        
  298    languages.                                                              
  300    Historically, the script that is known as "Latin" in Unicode and most   
  301    contexts associated with information technology standards is known in   
  302    the linguistic community as "Roman" or "Roman-derived".  The latter     
  303    terminology distinguishes between the Latin language and the            
  304    characters used to write it, especially in Republican times, from the   
  305    much richer and more decorated script derived and adapted from those    
  306    characters.  Since IDNA is defined using Unicode and that standard      
  307    used the term "LATIN" in its character names and descriptions, that     
  308    terminology will be used in this document as well except when           
  309    "Roman-derived" is needed for clarity.  However, readers approaching    
  310    this document from a cultural or linguistic standpoint should be        
  311    aware that the use of, or references to, "Latin script" in this         
  312    document refers to the entire collection of Roman-derived characters,   
  313    not just the characters used to write the Latin language.  Some other   
  314    issues with script identification and relationships with other          
  315    standards are discussed in [RFC4646].                                   
  317 1.5.3.  Multilingual                                                       
   The term "multilingual" has many widely varying definitions and thus
   is not recommended for use in standards.  Some of the definitions
  321    relate to the ability to handle international characters; other         
  322    definitions relate to the ability to handle multiple charsets; and      
  323    still others relate to the ability to handle multiple languages.        
  331    While this term has been deprecated for IETF-related uses and does      
  332    not otherwise appear in this document, a discussion here seemed         
  333    appropriate since the term is still widely used in some discussions     
  334    of IDNs.                                                                
  336 1.5.4.  Localization                                                       
  338    Localization is the process of adapting an internationalized            
  339    application platform or application to a specific cultural              
  340    environment.  In localization, the same semantics are preserved while   
  341    the syntax or presentation forms may be changed.                        
  343    Localization is the act of tailoring an application for a different     
  344    language or script or culture.  Some internationalized applications     
  345    can handle a wide variety of languages.  Typical users understand       
  346    only a small number of languages, so the program must be tailored to    
  347    interact with users in just the languages they know.                    
  349    Somewhat different definitions for localization and                     
  350    internationalization (see below) are used by groups other than the      
  351    IETF.  See [W3C-Localization] for one example.                          
  353 1.5.5.  Internationalization                                               
  355    In the IETF, the term "internationalization" is used to describe        
  356    adding or improving the handling of non-ASCII text in a protocol.       
  357    Other bodies use the term in other ways, often with subtle variation    
  358    in meaning.  The term "internationalization" is often abbreviated       
  359    "i18n" (and localization as "l10n").                                    
  361    Many protocols that handle text only handle the characters associated   
  362    with one script (often, a subset of the characters used in writing      
  363    English text), or leave the question of what character set is used up   
  364    to local guesswork (which leads to interoperability problems).          
  365    Adding non-ASCII text to such a protocol allows the protocol to         
  366    handle more scripts, with the intention of being able to include all    
  367    of the scripts that are useful in the world.  It is naive (sic) to      
  368    believe that all English words can be written in ASCII, various         
  369    mythologies notwithstanding.                                            
  371 1.6.  Statements and Guidelines                                            
  373    When the IDNA RFCs were published, the IESG and ICANN made statements   
  374    that were intended to guide deployment and future work.  In recent      
  375    months, ICANN has updated its statement and others have also made       
  376    contributions.  It is worth noting that the quality of understanding    
  377    of internationalization issues as applied to the DNS has evolved        
  386    considerably over the last few years.  Organizations that took          
  387    specific positions a year or more ago might not make exactly the same   
  388    statements today.                                                       
  390 1.6.1.  IESG Statement                                                     
  392    The IESG made a statement on IDNA [IESG-IDN]:                           
  394       IDNA, through its requirement of Nameprep [RFC3491], uses            
  395       equivalence tables that are based only on the characters             
  396       themselves; no attention is paid to the intended language (if any)   
  397       for the domain name.  However, for many domain names, the intended   
  398       language of one or more parts of the domain name actually does       
  399       matter to the users.                                                 
  401       Similarly, many names cannot be presented and used without           
  402       ambiguity unless the scripts to which their characters belong are    
  403       known.  In both cases, this additional information should be of      
  404       concern to the registry.                                             
  406    The statement is longer than this, but these paragraphs are the         
  407    important ones.  The rest of the statement consists of explanations     
  408    and examples.                                                           
  410 1.6.2.  ICANN Statements                                                   
1.6.2.1.  Initial ICANN Guidelines
  414    Soon after the IDNA standards were adopted, ICANN produced an initial   
  415    version of its "IDN Guidelines" [ICANNv1].  This document was           
  416    intended to serve two purposes.  The first was to provide a basis for   
  417    releasing the Generic Top Level Domain (gTLD) registries that had       
  418    been established by ICANN from a contractual restriction on the         
  419    registration of labels containing hyphens in the third and fourth       
  420    positions.  The second was to provide a general framework for the       
  421    development of registry policies for the implementation of IDNs.        
  423    One of the key components of this framework prescribed strict           
  424    compliance with RFCs 3490, 3491, and 3492.  With the framework, ICANN   
  425    specified that IDNA was to be the sole mechanism to be used in the      
  426    DNS to represent IDNs.                                                  
   Limitations on the characters available for inclusion in IDNs were
   mandated by two mechanisms.  The first was by requiring an
   "inclusion-based approach (meaning that code points that are not
   explicitly permitted by the registry are prohibited) for identifying
   permissible code points from among the full Unicode repertoire."
   The second mechanism required the association of every IDN with a
   specific language, with additional policies also being language
   based:
  445    "In implementing the IDN standards, top-level domain registries will    
  446    (a) associate each registered internationalized domain name with one    
  447    language or set of languages,                                           
  448    (b) employ language-specific registration and administration rules      
  449    that are documented and publicly available, such as the reservation     
  450    of all domain names with equivalent character variants in the           
  451    languages associated with the registered domain name, and,              
  452    (c) where the registry finds that the registration and administration   
  453    rules for a given language would benefit from a character variants      
  454    table, allow registrations in that language only when an appropriate    
  455    table is available. ...  In implementing the IDN standards, top-level   
  456    domain registries should, at least initially, limit any given domain    
  457    label (such as a second-level domain name) to the characters            
  458    associated with one language or set of languages only."                 
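   The inclusion-based approach combined with a per-language table can
   be sketched as a simple allow-list check.  The table below is a
   hypothetical example invented for illustration (it is not any
   registry's published table), and the function name is likewise an
   assumption.

```python
import string

# Hypothetical registry table for one language (illustrative only;
# real tables are defined and published by each registry).
PERMITTED_SV = frozenset(
    string.ascii_lowercase + string.digits + "-" + "\u00e5\u00e4\u00f6"
)


def rejected_code_points(label: str, permitted: frozenset) -> list:
    """Return the code points in `label` that the table does not
    explicitly permit.  An empty list means the label is acceptable;
    anything not listed is prohibited by default."""
    return [f"U+{ord(ch):04X}" for ch in label if ch not in permitted]
```

   With this table, the label "malmö" is accepted, while "café" is
   rejected because U+00E9 is not listed.  A registry that defined a
   different repertoire for the same language would reach a different
   result for the same label.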
  460    It was left to each TLD registry to define the character repertoire     
  461    it would associate with any given language.  This led to significant    
  462    variation from registry to registry, with further heterogeneity in      
  463    the underlying language-based IDN policies.  If the guidelines had      
  464    made provision for IDN policies also being based on script, a           
  465    substantial amount of the resulting ambiguity could have been           
  466    avoided.  However, they did not, and the sequence of events leading     
  467    to the present review of IDNA was thus triggered.                       
1.6.2.2.  ICANN Version 2 Guidelines
  471    One of the responses of the TLD registries to what was widely           
  472    perceived as a crisis situation was to invoke the mechanism described   
  473    in the initial guidelines: "As the deployment of IDNs proceeds, ICANN   
  474    and the IDN registries will review these Guidelines at regular          
  475    intervals, and revise them as necessary based on experience."           
  477    The pivotal requirement was the modification of the guidelines to       
  478    permit script-based policies for IDNs.  Further concern was expressed   
  479    about the need for realistically implementable mechanisms for the       
  480    propagation of TLD registry policies into the lower levels of their     
   name trees.  In addition to the anticipated tightening of
   constraints at the protocol level, one obvious additional approach
   would be to replace the guidelines with an instrument that itself
   had clear status in the IETF's normative framework.  A BCP was
   therefore seen as the
  485    appropriate focus for longer-term effort.  The most pressing issues     
  486    would be dealt with in the interim by incremental modification to the   
  487    guidelines, but no need was seen for the detailed further development   
  488    of those guidelines once that incremental modification was complete.    
  496    The outcome of this action was a version 2.0 of the guidelines          
  497    [ICANNv2], which was endorsed by the ICANN Board on November 8, 2005    
  498    for a period of nine months.  The Board stated further that it "tasks   
  499    the IDN working group to continue its important work and return to      
  500    the board with specific IDN improvement recommendations before the      
  501    ICANN Meeting in Morocco" and "supports the working group's continued   
  502    action to reframe the guidelines completely in a manner appropriate     
  503    for further development as a Best Current Practices (BCP) document,     
  504    to ensure that the Guideline directions will be used deeper into the    
  505    DNS hierarchy and within TLD's where ICANN has a lesser policy          
  506    relationship."                                                          
  508    Retaining the inclusion-based approach established in version 1.0,      
  509    the crucial addition to the policy framework is that:                   
  511    "All code points in a single label will be taken from the same script   
  512    as determined by the Unicode Standard Annex #24: Script Names at        
  513    http://www.unicode.org/reports/tr24.  Exception to this is              
  514    permissible for languages with established orthographies and            
  515    conventions that require the commingled use of multiple scripts.  In    
  516    such cases, visually confusable characters from different scripts       
  517    will not be allowed to coexist in a single set of permissible           
  518    codepoints unless a corresponding policy and character table is         
  519    clearly defined."                                                       
  521    Additionally:                                                           
  523    "Permissible code points will not include: (a) line symbol-drawing      
  524    characters (as those in the Unicode Box Drawing block), (b) symbols     
  525    and icons that are neither alphanumeric nor ideographic language        
  526    characters, such as typographic and pictographic dingbats, (c)          
  527    characters with well-established functions as protocol elements, (d)    
  528    punctuation marks used solely to indicate the structure of              
  529    sentences."                                                             
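   As a rough illustration only, the exclusions quoted above can be
   approximated with Unicode general categories.  The sketch below is
   not the ICANN rule set: the category choices, the explicit hyphen
   exception, and the function names are assumptions made for this
   example.

```python
import unicodedata

# Assumption: letters (L*), combining marks (M*), and decimal digits (Nd)
# are candidates for inclusion; symbols (S*) and punctuation (P*) are not.
ALLOWED_MAJOR_CATEGORIES = {"L", "M"}


def is_candidate_code_point(ch: str) -> bool:
    """Return True if ch might appear in an inclusion-based table."""
    if ch == "-":  # the hyphen is a long-standing hostname character
        return True
    category = unicodedata.category(ch)
    return category[0] in ALLOWED_MAJOR_CATEGORIES or category == "Nd"


def rejected_code_points(label: str) -> list:
    """List the code points in label that this sketch would exclude."""
    return [f"U+{ord(ch):04X}" for ch in label if not is_candidate_code_point(ch)]
```

   A registry table would normally be narrower than this, since the
   guidelines call for inclusion lists rather than category tests.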
  531    Attention has been called to several points that are not adequately     
  532    dealt with (if at all) in the version 2.0 guidelines but that ought     
  533    to be included in the policy framework without waiting for the          
  534    production and release of a document based on a "best practices"        
  535    model.  The term "BCP" above does not necessarily refer to an IETF      
  536    consensus document.                                                     
  538    The intention in November 2005 was for the recommended major revision   
  539    to be put to the ICANN Board prior to its meeting in Morocco (in late   
  540    June 2006), but for the changes to be collated incrementally and        
  541    appear in interim version 2.n releases of the guidelines.  The IAB's    
  542    understanding is that, while there has been some progress with this,    
  551    other issues relating to IDNs subsequently diverted much of the         
  552    energy that was intended to be devoted to the more extensive            
  553    treatment of the guidelines.                                            
  555 2.  General Problems and Issues                                            
  557    This section interweaves problems and issues of several types.  Each    
  558    subsection outlines something that is perceived to be a problem or      
  559    issue "with IDNs", therefore needing correction.  Some of these         
  560    issues can be at least partially resolved by making changes to          
  561    elements of the IDNA protocol or tables.  Others will exist as long     
  562    as people have expectations of IDNs that are inconsistent with the      
  563    basic DNS architecture.  It is important to identify this entire        
  564    range of problems because users, registrants, and policy makers often   
  565    do not understand the protocol and other technical issues but only      
  566    the difference between what they believe happens or should happen and   
  567    what actually happens.  As long as those differences exist, there       
  568    will be demands for functionality or policy changes for IDNs.  Of       
  569    course, some of these demands will be less realistic than others, but   
  570    even the realistic ones should be understood in the same context as     
  571    the others.                                                             
  573    Most of the issues that have been raised, and that are discussed in     
  574    this document, exist whether IDNA remains tied to Unicode 3.2 or        
  575    whether migration to new Unicode versions is contemplated.  A           
  576    migration path is necessary to accommodate newly-coded scripts and to   
  577    permit the maximum number of languages and scripts to be represented    
  578    in domain names.  However, the migration issues are largely separate    
  579    from those involving a single Unicode version or Version 3.2 in         
  580    particular, so they have been separated into this section and           
  581    Section 3.                                                              
  583 2.1.  User Conceptions, Local Character Sets, and Input Issues
  585    The labels of the DNS are just strings of characters that are not       
  586    inherently tied to a particular language.  As mentioned briefly in      
  587    the Introduction, DNS labels that could not lexically be words in any   
  588    language are possible and indeed common.  There appears to be no        
  589    reason to impose protocol restrictions on IDNs that would restrict      
  590    them more than all-ASCII hostname labels have been restricted.  For     
  591    that reason, even describing DNS labels or strings of them as "names"   
  592    is something of a misnomer, one that has probably added to user         
  593    confusion about what to expect.                                         
  595    Ordinarily, people use "words" when they think of things and wish       
  596    others to think of them too, for example, "orange", "tree",             
  597    "restaurant" or "Acme Inc".  Words are normally in a specific           
  598    language, such as English or Swedish.  The character-string labels      
  606    supported by the DNS are, as suggested above, not inherently "words".   
  607    While it is useful, especially for mnemonic value or to identify        
  608    objects, for actual words to be used as DNS labels, other constraints   
  609    on the DNS make it impossible to guarantee that it will be possible     
  610    to represent every word in every language as a DNS label,               
  611    internationalized or not.                                               
  613    When writing or typing the label (or word), a script must be selected   
  614    and a charset must be picked for use with that script.  The choice of   
  615    charset is typically not under the control of the user on a per-word    
  616    or per-document basis, but may depend on local input devices,           
  617    keyboard or terminal drivers, or other decisions made by operating      
  618    system or even hardware designers and implementers.                     
  620    If that charset, or the local charset being used by the relevant        
  621    operating system or application software, is not Unicode, a further     
  622    conversion must be performed to produce Unicode.  How often this is     
  623    an issue depends on estimates of how widely Unicode is deployed as      
  624    the native character set for hardware, operating systems, and           
  625    applications.  Those estimates differ widely, but it should be noted    
  626    that, among other difficulties:                                         
  628    o  ISO 8859 versions [ISO.8859.2003], and even national variations of
  629       ISO 646 [ISO.646.1991], are still widely used in parts of Europe;
  631    o  code-table switching methods, typically based on the techniques of
  632       ISO 2022 [ISO.2022.1986], are still in general use in many parts of
  633       the world, especially in Japan with Shift-JIS and its variations;    
  634       and                                                                  
  636    o  computing, systems, and communications in China tend to use one or   
  637       more of the national "GB" standards rather than native Unicode.      
  639    Additionally, not all charsets define their characters in the same      
  640    way and not all preexisting coding systems were incorporated into       
  641    Unicode without changes.  Sometimes local distinctions were made that   
  642    Unicode does not make or vice versa.  Consequently, conversion from     
  643    other systems to Unicode may potentially lose information.              
  645    The Unicode string that results from this processing -- processing      
  646    that is trivial in a Unicode-native system but that may be              
  647    significant in others -- is then used as input to IDNA.                 
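   The dependence on the local charset can be seen in a small sketch.
   It uses Python codec names ("latin-1", "iso8859-5", "ascii"), which
   are an implementation choice made here and not anything IDNA
   specifies.  The same octet becomes a different Unicode character
   depending on which charset the system assumes.

```python
# One octet as it might arrive from a keyboard driver or a legacy file.
raw = b"\xe4"

# Read as ISO 8859-1 it is U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS).
as_latin1 = raw.decode("latin-1")

# Read as ISO 8859-5 the same octet is U+0444 (CYRILLIC SMALL LETTER EF).
as_cyrillic = raw.decode("iso8859-5")


def to_unicode(octets: bytes, charset: str) -> str:
    """Convert local-charset octets to the Unicode string given to IDNA."""
    return octets.decode(charset)  # raises UnicodeDecodeError on bad input
```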
  661 2.2.  Examples of Issues                                                   
  663    While much of the discussion below is stated in terms of Unicode        
  664    codings and associated rules, the IAB believes that some of the         
  665    issues are actually not about the Unicode character set per se, but     
  666    about how distributed matching systems operate in reality, and about    
  667    what implications the distributed delayed search for stored data that   
  668    characterizes the DNS has on the mapping algorithms.                    
  670 2.2.1.  Language-Specific Character Matching                               
  672    There are similar words that can be expressed in multiple languages.    
  673    Consider, for example, the name Torbjorn in Norwegian and Swedish.      
  674    In Norwegian it is spelled with the character U+00F8 (LATIN SMALL       
  675    LETTER O WITH STROKE) in the second syllable, while in Swedish it is    
  676    spelled with U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS).  Those       
  677    characters are not treated as equivalent according to the Unicode       
  678    Standard and its Annexes, while most people speaking Swedish, Danish,
  679    or Norwegian probably think they are equivalent.                        
  681    It is neither possible nor desirable to make these characters           
  682    equivalent on a global basis.  To do so would, for this example,        
  683    rationalize the situation in Sweden while causing considerable          
  684    confusion in Germany because the U+00F8 character is never used in      
  685    the German language.  But the "variant" model introduced in [RFC3743]   
  686    and [RFC4290] can be used by a registry to prevent the worst            
  687    consequence of the possible confusion, by ensuring either that both     
  688    names are registered to the same party in a given domain or that one    
  689    of them is completely prohibited.                                       
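   The non-equivalence is easy to observe.  The following sketch uses
   Python's built-in "idna" codec, which implements the original IDNA
   processing; the variable names are illustrative and the two
   spellings are written with escapes.

```python
import unicodedata

norwegian = "torbj\u00f8rn"  # U+00F8 LATIN SMALL LETTER O WITH STROKE
swedish = "torbj\u00f6rn"    # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS

# Unicode normalization does not unify the two letters.
same_after_nfkc = (
    unicodedata.normalize("NFKC", norwegian)
    == unicodedata.normalize("NFKC", swedish)
)

# Each spelling therefore maps to its own ASCII-compatible label.
ace_norwegian = norwegian.encode("idna")
ace_swedish = swedish.encode("idna")
```

   Because the two labels differ in the DNS, any equivalence between
   them has to be supplied by registry policy.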
  691 2.2.2.  Multiple Scripts                                                   
  693    There are languages in the world that can be expressed using multiple   
  694    scripts.  For example, some Eastern European and Central Asian          
  695    languages can be expressed in either Cyrillic or Latin (see             
  696    Section 1.5.2) characters, and some African and Southeast Asian
  697    languages can be expressed in either Arabic or Latin characters.  A     
  698    few languages can even be written in three different scripts.  In       
  699    other cases, the language is typically written in a combination of      
  700    scripts (e.g., Kanji, Kana, and Romaji for Japanese; Hangul and Hanja
  701    for Korean).  Because of this, the same word, in the same language,     
  702    can be expressed in different ways.  For some languages, only a         
  703    single script is normally used to write a single word; for others,      
  704    mixed scripts are required; and, for still others, special              
  705    circumstances may dictate mixing scripts in labels although that is     
  706    not normally done for "words".  For IDN purposes, these variations      
  707    make the definition of "script" extremely sensitive, especially since   
  708    ICANN is now recommending that it be used as the primary basis for      
  716    registry policies.  However essential it may be to prohibit mixed-      
  717    script labels, additional policy nuance is required for "languages      
  718    with established orthographies and conventions that require the         
  719    commingled use of multiple scripts".                                    
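   A single-script test of the kind discussed here might be sketched
   as follows.  Python's standard library does not expose the Unicode
   Script property, so this example assumes that the first word of a
   character's Unicode name can stand in for its script.  That is an
   approximation chosen for brevity, not the classification defined by
   Unicode Standard Annex #24, and the function names are hypothetical.

```python
import unicodedata


def approximate_scripts(label: str) -> set:
    """Return approximate script names for the letters in label.

    Assumption: the first word of the Unicode character name ("LATIN",
    "CYRILLIC", "HIRAGANA", ...) approximates the script.  Digits,
    hyphens, and combining marks are ignored.
    """
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):  # letters only
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_single_script(label: str) -> bool:
    """True if all letters in label appear to come from one script."""
    return len(approximate_scripts(label)) <= 1
```

   A label that mixes Latin and Cyrillic letters fails this test, but
   so does an ordinary Japanese string combining Kanji and Kana, which
   is the case where the additional policy nuance is needed.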
  721 2.2.3.  Normalization and Character Mappings                               
  723    Unicode contains several different models for representing              
  724    characters.  The Chinese (Han)-derived characters of the "CJK"          
  725    (Chinese, Japanese, and Korean) languages are "unified", i.e.,          
  726    characters with common derivation and similar appearances are           
  727    assigned to the same code point.  European characters derived from a    
  728    Greek-Latin base are divided among separate code blocks for Latin,
  729    Greek, and Cyrillic even when individual characters are identical in    
  730    both form and semantics.  Separate code points based on font            
  731    differences alone are generally prohibited, but a large number of       
  732    characters for "mathematical" use have been assigned separate code      
  733    points even though they differ from base ASCII characters only by       
  734    font attributes such as "script", "bold", or "italic".  Some            
  735    characters that often appear together are treated as typographical      
  736    digraphs with specific code points assigned to the combination,         
  737    others require that the two-character sequences be used, and still      
  738    others are available in both forms.  Some Roman-derived letters that    
  739    were developed as decorated variations on the basic Latin letter        
  740    collection (e.g., by addition of diacritical marks) are assigned code   
  741    points as individual characters; others must be built up as two (or
  742    more) character sequences using "combining characters".                 
  744    Many of these differences result from the desire to maintain backward   
  745    compatibility while the standard evolved historically, and are hence    
  746    understandable.  However, the DNS requires precise knowledge of which   
  747    codes and code sequences represent the same character and which ones    
  748    do not.  Limiting the potential difficulties with confusable            
  749    characters (see Section 2.2.6) requires even more knowledge of which    
  750    characters might look alike in some fonts but not in others.  These     
  751    variations make it difficult or impossible to apply a single set of     
  752    rules to all of Unicode and, in doing so, satisfy everyone and their    
  753    perceived needs.  Instead, more or less complex mapping tables,         
  754    defined on a character-by-character basis, are required to              
  755    "normalize" different representations of the same character to a        
  756    single form so that matching is possible.                               
  758    Unless normalization rules, such as those that underlie Nameprep, are   
  759    applied, characters that are essentially identical will not match in    
  760    the DNS, creating many opportunities for problems.  The most common     
  761    of these problems is that, due to the processing applied (and           
  762    discussed above) before a word is represented as a Unicode string, a    
  763    single word can end up being expressed as several different Unicode     
  771    strings.  Even if normalization rules are applied, some strings that    
  772    are considered identical by users will not compare equal.  That         
  773    problem is discussed in more detail elsewhere in this document,         
  774    particularly in Section 3.2.1.                                          
  776    IDNA attempts to compensate for these problems by using a               
  777    normalization algorithm defined by the Unicode Consortium.  This        
  778    algorithm can change a sequence of one or more Unicode characters to    
  779    another sequence of characters.  One example is that the base character
  780    U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING             
  781    DIAERESIS) is changed to the single Unicode character U+00E4 (LATIN     
  782    SMALL LETTER A WITH DIAERESIS).                                         
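   This composition can be reproduced with any NFKC implementation.
   The sketch below uses Python's unicodedata module, which is an
   assumption of this example (it tracks a newer Unicode version than
   the 3.2 tables used by IDNA, although this particular mapping is
   the same).

```python
import unicodedata

decomposed = "a\u0308"  # U+0061 followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00e4"  # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

# The raw strings differ, but NFKC maps the sequence to the single
# precomposed character.
normalized = unicodedata.normalize("NFKC", decomposed)
```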
  784    This Unicode normalization process accounts only for simple character   
  785    equivalences, not equivalences that are language or script dependent.   
  786    For example, as mentioned above, the characters U+00F8 (LATIN SMALL     
  787    LETTER O WITH STROKE) and U+00F6 (LATIN SMALL LETTER O WITH             
  788    DIAERESIS) are considered to match in Swedish (and some other           
  789    languages), but not for all languages that use either of the            
  790    characters.  Having these characters be treated as equivalent in some   
  791    contexts and not in others requires decisions and mechanisms that, in   
  792    turn, depend much more on context than either IDNA or the Unicode       
  793    character-based normalization tables can provide.                       
  795    Additional complications occur if the sequences are more complicated    
  796    or if an attacker is making a deliberate effort to confuse the          
  797    normalization process.  For example, if the sequence U+0069 U+0308
  798    (LATIN SMALL LETTER I followed by COMBINING DIAERESIS) appears, the
  799    Unicode Normalization Method known as NFKC maps it into U+00EF (LATIN   
  800    SMALL LETTER I WITH DIAERESIS), which is what one would predict.  But   
  801    consider U+0131 U+0308 (LATIN SMALL LETTER DOTLESS I and COMBINING      
  802    DIAERESIS):  is that the same character?  Is U+0131 U+0307 U+0307       
  803    (dotless i and two combining dot-above characters) equivalent to        
  804    U+00EF or U+0069, or neither?  NFKC does not appear to tell us, nor     
  805    does the definition of U+0307 appear to tell us what happens when it    
  806    is combined with other "symbol above" arrangements (unlike some of      
  807    the "accent above" combining characters, which more or less specify     
  808    kerning).  Similar issues arise when U+00EF is combined with various    
  809    dot-above combining characters.  Each of these questions provides       
  810    some opportunities for spoofing if different display implementations    
  811    interpret the rules in different ways.                                  
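   What one particular NFKC implementation does with such sequences
   can be inspected directly.  The sketch below assumes Python's
   unicodedata module; the helper name is hypothetical.  In that
   implementation, "i" followed by a combining diaeresis composes to
   U+00EF, while the dotless-i sequences are returned unchanged, so at
   the code point level they match neither U+00EF nor U+0069.  How a
   display renders them is a separate matter that normalization does
   not settle.

```python
import unicodedata


def nfkc_code_points(text: str) -> list:
    """Return the NFKC form of text as a list of U+XXXX strings."""
    normalized = unicodedata.normalize("NFKC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


# U+0069 U+0308: dotted i followed by COMBINING DIAERESIS
dotted_with_diaeresis = nfkc_code_points("i\u0308")

# U+0131 U+0308: dotless i followed by COMBINING DIAERESIS
dotless_with_diaeresis = nfkc_code_points("\u0131\u0308")

# U+0131 U+0307 U+0307: dotless i followed by two COMBINING DOT ABOVE
dotless_with_two_dots = nfkc_code_points("\u0131\u0307\u0307")
```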
  813    If we leave Latin scripts and examine those based on Chinese            
  814    characters, we see there is also an absence of specific, lexigraphic,   
  815    rules for transformations between Traditional and Simplified Chinese.   
  816    Even if there were such rules, unification of Japanese and Korean       
  826    characters with Chinese ones would make it impossible to normalize      
  827    Traditional Chinese characters into Simplified Chinese ones without causing
  828    problems in Japanese and Korean use of the same characters.             
  830    More generally, while some mappings, such as those between              
  831    precomposed Latin script characters and the equivalent multiple code    
  832    point composed character sequences, depend only on the characters       
  833    themselves, in many or most cases, such as the case with Swedish        
  834    above, the mapping is language or culturally dependent.  There have     
  835    been discussions as to whether different canonicalization rules (in     
  836    addition to or instead of Unicode normalization) should be, or could    
  837    be, applied differently to different languages or scripts.  The fact    
  838    that most scripts included in Unicode have been initially               
  839    incorporated by copying an existing standard more or less intact has    
  840    impact on the optimization of these algorithms and on forward           
  841    compatibility.  Even if the language is known and language-specific     
  842    rules can be defined, dependencies on the language do not disappear.    
  843    Canonicalization operations are not possible unless they either         
  844    depend only on short sequences of text or have significant context      
  845    available that is not obvious from the text itself.  DNS lookups and    
  846    many other operations do not have a way to capture and utilize the      
  847    language or other information that would be needed to provide that      
  848    context.                                                                
  850    These variations in languages and in user perceptions of characters     
  851    make it difficult or impossible to provide uniform algorithms for       
  852    matching Unicode strings in a way that no end users are ever            
  853    surprised by the result.  For closely-related scripts or characters,    
  854    surprises may even be frequent.  However, because uniform algorithms    
  855    are required for mappings that are applied when names are looked up     
  856    in the DNS, the rules that are chosen will always represent an          
  857    approximation that will be more or less successful in minimizing        
  858    those user surprises.  The current Nameprep and Stringprep algorithms   
  859    use mapping tables to "normalize" different representations of the      
  860    same text to a single form so that matching is possible.                
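   The effect of these mapping tables can be seen in a short sketch.
   It assumes Python's built-in "idna" codec as a stand-in for an
   IDNA implementation that applies Nameprep before encoding; the
   sample label and variable names are illustrative.

```python
# Three ways a user or system might present the same label.
lowercase = "b\u00fccher"    # precomposed U+00FC
uppercase = "B\u00dcCHER"    # uppercase, precomposed U+00DC
decomposed = "bu\u0308cher"  # "u" followed by U+0308 COMBINING DIAERESIS

# Nameprep case-maps and normalizes each input before it is encoded,
# so all three produce one ASCII-compatible label.
ace_labels = {s.encode("idna") for s in (lowercase, uppercase, decomposed)}
```

   The set contains a single label, which is what allows lookups of
   differently typed forms to match the same stored name.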
  862    More details on the creation of the normalization algorithms can be     
  863    found in the Unicode Specification and the associated Technical         
  864    Reports [UTR] and Annexes.  Technical Report #36 [UTR36] and [UTR39]    
  865    are specifically related to the IDN discussion.                         
  867 2.2.4.  URLs in Printed Form                                               
  869    URLs and other identifiers appear not only in electronic forms from
  870    which they can (at least in principle) be accurately copied and
  871    "pasted", but also in printed forms from which the user must transcribe
  872    them into the computer system.  This is often known as the "side-of-    
  873    the-bus problem" because a particularly problematic version of it       
  881    requires that the user be able to observe and accurately remember a     
  882    URL that is quickly glimpsed in a transient form -- a billboard seen    
  883    while driving, a sign on the side of a passing vehicle, a television    
  884    advertisement that is not frequently repeated or on-screen for a long   
  885    time, and so on.                                                        
  887    The difficulty, in short, is that two Unicode strings that are          
  888    actually different might look exactly the same, especially when there   
  889    is no time to study them.  This is because, for example, some glyphs    
  890    in Cyrillic, Greek, and Latin do look the same, but have been           
  891    assigned different code points in Unicode.  Worse, one needs to be      
  892    reasonably familiar with a script and how it is used to understand      
  893    how much characters can reasonably vary as the result of artistic       
  894    fonts and typography.  For example, there are a few fonts for Latin     
  895    characters that are sufficiently highly ornamented that an observer     
  896    might easily confuse some of the characters with characters in Thai     
  897    script.  Uppercase ITC Blackadder (a registered trademark of            
  898    International Typeface Corporation) and Curlz MT are two fairly         
  899    obvious examples; these fonts use loops at the end of serifs,           
  900    creating a resemblance to Thai (in some fonts) for some characters.     
  902 2.2.5.  Bidirectional Text                                                 
  904    Some scripts (and because of that some words in some languages) are     
  905    written not left to right, but right to left.  And, to complicate       
  906    things, one might have something written in Arabic script right to      
  907    left that includes some characters that are read from left to right,    
  908    such as European-style digits.  This implies that some texts might      
  909    have a mixed left-to-right AND right-to-left order (even though in      
  910    most implementations, and in IDNA, all texts have a major direction,    
  911    with the other as an exception).                                        
  913    IDNA permits the inclusion of European digits in a label that is        
  914    otherwise a sequence of right-to-left characters, but prohibits most    
  915    other mixed-directional (or bidirectional) strings.  This prohibition   
  916    can cause other problems such as the rejection of some otherwise        
  917    linguistically and culturally sensible strings.  As Unicode and         
  918    conventions for handling so-called bidirectional ("BIDI") strings       
  919    evolve, the prohibition in IDNA should be reviewed and reevaluated.     
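   The direction requirements that IDNA inherits from Stringprep
   (RFC 3454, Section 6) can be sketched with the tables in Python's
   stringprep module.  The function name is hypothetical, and the
   sketch covers only the direction rules, not the separate list of
   prohibited characters.

```python
import stringprep


def satisfies_bidi_rule(label: str) -> bool:
    """Check the Stringprep direction requirements for one label.

    If the label contains any right-to-left character (table D.1), it
    must contain no left-to-right character (table D.2), and its first
    and last characters must both be right-to-left.
    """
    has_rtl = any(stringprep.in_table_d1(ch) for ch in label)
    if not has_rtl:
        return True  # purely left-to-right or neutral: nothing to check
    has_ltr = any(stringprep.in_table_d2(ch) for ch in label)
    ends_rtl = (stringprep.in_table_d1(label[0])
                and stringprep.in_table_d1(label[-1]))
    return not has_ltr and ends_rtl
```

   Under these rules a European digit may appear inside an Arabic
   label, but a label that ends with the digit is rejected, which is
   one way an otherwise sensible string can fail.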
  921 2.2.6.  Confusable Character Issues                                        
  923    Similar-looking characters in identifiers can cause actual problems     
  924    on the Internet since they can result, deliberately or accidentally,    
  925    in people being directed to the wrong host or mailbox by believing      
  926    that they are typing, or clicking on, intended characters that are      
  927    different from those that actually appear in the domain name or         
  928    reference.  See Section 4.1.3 for further discussion of this issue.     
  936    IDNs complicate these issues, not only by providing many additional     
  937    characters that look sufficiently alike to be potentially confused,     
  938    but also by raising new policy questions.  For example, if a language   
  939    can be written in two different scripts, is a label constructed from    
  940    a word written in one script equivalent to a label constructed from     
  941    the same word written in the other script?  Is the answer the same      
  942    for words in two different languages that translate into each other?    
  944    It is now generally understood that, in addition to the collision       
  945    problems of possibly equivalent words and hence labels, it is           
  946    possible to utilize characters that look alike -- "confusable"          
  947    characters -- to spoof names in order to mislead or defraud users.      
  948    That issue, driven by particular attacks such as those known as         
  949    "phishing", has introduced stronger requirements for registry efforts   
  950    to prevent problems than were previously generally recognized as        
  951    important.                                                              
  953    One commonly-proposed approach is to have a registry establish          
  954    restrictions on the characters, and combinations of characters, it      
  955    will permit to be included in a string to be registered as a label.     
  956    Taking the Swedish top-level domain, .SE, as an example, a rule might   
  957    be adopted that the registry "only accepts registrations in Swedish,    
  958    using Latin script, and because of this, Unicode characters Latin-a,    
  959    -b, -c,...".  But, because there is not a 1:1 mapping between country   
  960    and language, even a Country Code Top Level Domain (ccTLD) like .SE     
  961    might have to accept registrations in other languages.  For example,    
  962    there may be a requirement for Finnish (the second most-used language   
  963    in Sweden).  What rules and code points are then defined for Finnish?   
  964    Does it have special mappings that collide with those that are          
  965    defined for Swedish?  And what does one do in countries that use more   
  966    than one script?  (Finnish and Swedish use the same script.)  In all    
  967    cases, the dispute will ultimately be about whether two strings are     
  968    the same (or confusingly similar) or not.  That, in turn, will          
  969    generate a discussion of how one defines "what is the same" and "what   
  970    is similar enough to be a problem".                                     
  972    Another example arose recently that further illustrates the problem.    
  973    If one were to use Cyrillic characters to represent the country code    
  974    for Russia in a localized equivalent to the ccTLD label, the            
  975    characters themselves would be indistinguishable from the Latin         
  976    characters "P" and "Y" (in either lower- or uppercase) in most fonts.   
  977    We presume this might cause some consternation in Paraguay.             
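   The collision can be made concrete with a small sketch.  The
   look-alike table below is a hand-made assumption covering a few
   letters only; it is not the confusables data published by the
   Unicode Consortium, and the function name is hypothetical.

```python
import unicodedata

# Assumption: a tiny illustrative table mapping a few Cyrillic letters
# to the Latin letters they resemble in many fonts.
LOOKALIKES = {
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def skeleton(label: str) -> str:
    """Replace each character by its look-alike, if the table has one."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in label)


cyrillic_ru = "\u0440\u0443"  # Cyrillic letters ER and U
latin_py = "py"               # Latin letters P and Y

cyrillic_names = [unicodedata.name(ch) for ch in cyrillic_ru]
```

   The two strings are different code point sequences, yet they reduce
   to the same skeleton, which is the condition a registry or user
   interface would need to detect.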
  979    These difficulties can never be completely eliminated by algorithmic    
  980    means.  Some of the problem can be addressed by appropriate tuning of   
  981    the protocols and their tables, other parts by registry actions to      
  982    reduce confusion and conflicts, and still other parts can be            
  991    addressed by careful design of user interfaces in application           
  992    programs.  But, ultimately, some responsibility to avoid being          
  993    tricked or harmfully confused will rest with the user.                  
  995    Another registry technique that has been extensively explored           
  996    involves looking at confusable characters and confusion between         
  997    complete labels, restricting the labels that can be registered based    
  998    on relationships to what is registered already.  Registries that        
  999    adopt this approach might establish special mapping rules such as:      
 1001    1.  If you register something with code point A, domain names with B    
 1002        instead of A will be blocked from registration by others (where B   
 1003        is a character at a separate code point that has a confusingly      
 1004        similar appearance to A).                                           
   2.  If you register something with code point A, you also get the
       domain name with B instead of A.
 1009    These approaches are discussed in more detail for "CJK" characters in   
 1010    RFC 3743 [RFC3743] and more generally in RFC 4290 [RFC4290].            
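   The two mapping rules above can be sketched as follows.  This is an
   illustration only: the CONFUSABLE table, the Registry class, and
   the registrant names are hypothetical, and real variant tables
   (such as those described in RFC 3743) are considerably richer.

```python
from itertools import product

# Hypothetical confusable table (illustrative only): Latin <-> Cyrillic pairs.
CONFUSABLE = {
    "a": "\u0430", "\u0430": "a",
    "p": "\u0440", "\u0440": "p",
    "y": "\u0443", "\u0443": "y",
}

def variants(label):
    """All labels obtained by swapping confusable characters in label."""
    choices = [(ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,)
               for ch in label]
    return {"".join(combo) for combo in product(*choices)} - {label}

class Registry:
    def __init__(self, bundle=False):
        self.bundle = bundle   # False: rule 1 (block); True: rule 2 (bundle)
        self.owner = {}        # registered label -> registrant
        self.blocked = {}      # blocked label -> registrant who caused the block

    def register(self, label, registrant):
        """Register label; return False if it is taken or blocked by another."""
        if label in self.owner:
            return False
        if self.blocked.get(label, registrant) != registrant:
            return False
        self.owner[label] = registrant
        self.blocked.pop(label, None)
        for v in variants(label):
            if v in self.owner:
                continue
            if self.bundle:
                self.owner[v] = registrant      # rule 2: variant is also granted
            else:
                self.blocked[v] = registrant    # rule 1: variant is withheld
        return True

blocking = Registry()
print(blocking.register("pay", "alice"))       # True
print(blocking.register("\u0440ay", "bob"))    # False: blocked variant of "pay"
```

   Note that the number of variants grows exponentially with the
   number of confusable positions in a label, which is one practical
   cost of either rule.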
 1012 2.2.7.  The IESG Statement and IDNA issues                                 
 1014    The issues above, at least as they were understood at the time,         
 1015    provided the background for the IESG statement included in              
 1016    Section 1.6.1 (which, in turn, was part of the basis for the initial    
 1017    ICANN Guidelines) that a registry should have a policy about the        
 1018    scripts, languages, code points and text directions for which           
 1019    registrations will be accepted.  While "accept all" might be an         
 1020    acceptable policy, it implies there is also a dispute resolution        
 1021    process that takes the problems listed above into account.  This        
 1022    process must be designed for dealing with all types of potential        
 1023    disputes.  For example, issues might arise between registrant and       
 1024    registry over a decision by the registry on collisions with already     
 1025    registered domain names and between registrant and trademark holder     
 1026    (that a domain name infringes on a trademark).  In both cases, the      
 1027    parties disagreeing have different views on whether two strings are     
 1028    "equivalent" or not.  They may believe that a string that is not        
 1029    allowed to be registered is actually different from one that is         
 1030    already registered.  Or they might believe that two strings are the     
 1031    same, even though the rules adopted by the registry to prevent          
 1032    confusion define them as two different domain names.                    
 1046 3.  Migrating to New Versions of Unicode                                   
 1048 3.1.  Versions of Unicode                                                  
 1050    While opinions differ about how important the issues are in practice,   
 1051    the use of Unicode and its supporting tables for IDNA appears to be     
 1052    far more sensitive to subtle changes than it is in typical Unicode      
 1053    applications.  This may be, at least in part, because many other        
 1054    applications are internally sensitive only to the appearance of         
 1055    characters and not to their representation.  Or those applications      
 1056    may be able to take effective advantage of script, language, or         
 1057    character class identification.  The working group that developed       
 1058    IDNA concluded that attempting to encode any ancillary character        
 1059    information into the DNS label would be impractical and unwise, and     
 1060    the IAB, based in part on the comments in the ad hoc committee, saw     
 1061    no reason to review that decision.                                      
 1063    The Unicode Consortium has sometimes used the likelihood of a           
 1064    combination of characters actually appearing in a natural language as   
 1065    a criterion for the safety of a possible change.  However, as           
 1066    discussed above, DNS names are often fabrications -- abbreviations,     
 1067    strings deliberately formed to be unusual, members of a series          
 1068    sequenced by numbers or other characters, and so on.  Consequently, a   
 1069    criterion that considers a change to be safe if it would not be         
 1070    visible in properly-constructed running text is not helpful for DNS     
 1071    purposes: a change that would be safe under that criterion could        
 1072    still be quite problematic for the DNS.                                 
 1074    This sensitivity to changes has made it quite difficult to migrate      
 1075    IDNA from one version of Unicode to the next if any changes are made    
 1076    that are not strictly additive.  A change in a code point assignment    
 1077    or definition may be extremely disruptive if a DNS label has been       
 1078    defined using the earlier form and any of its previous components has   
 1079    been moved from one table position or normalization rule to another.    
 1080    Unicode normalization tables, tables of scripts or languages and        
 1081    characters that belong to them, and even tables of confusable           
 1082    characters as an adjunct to security recommendations may be very        
 1083    helpful in designing registry restrictions on registrations and         
 1084    applications provisions for avoiding or identifying suspicious names.   
 1085    Ironically, they also extend the sensitivity of IDNA and its            
 1086    implementations to all forms of change between one version of Unicode   
 1087    and the next.  Consequently, they make Unicode version migration more   
 1088    difficult.                                                              
 1090    An example of the type of change that appears to be just a small        
 1091    correction from one perspective but may be problematic from another     
 1092    was the correction to the normalization definition in 2004              
 1093    [Unicode-PR29].  Community input suggested that the change would        
 1101    cause problems for Stringprep, but the Unicode Technical Committee      
 1102    decided, on balance, that the change was worthwhile.  Because of        
 1103    difficulties with consistency, some deployed implementations have       
 1104    decided to adopt the change and others have not, leading to subtle      
 1105    incompatibilities.                                                      
 1107    This situation leads to a dilemma.  On the one hand, it is completely   
 1108    unacceptable to freeze IDNA at a Unicode version level that excludes    
 1109    more recently-defined characters and scripts that are important to      
 1110    those who use them.  On the other hand, it is equally unacceptable to   
 1111    migrate from one version of Unicode to the next if such migration       
 1112    might invalidate an existing registered DNS name or some of its         
 1113    registered properties or might make the string or representation of     
 1114    that name ambiguous.  If IDNA is to be modified to accommodate new      
 1115    versions of Unicode, the IETF will need to work with the Unicode        
 1116    Consortium and other bodies to find an appropriate balance in this      
 1117    area, but progress will be possible only if all relevant parties are    
 1118    able to fairly consider and discuss possible decisions that may be      
 1119    very difficult and unpalatable.                                         
 1121    It would also prove useful if, during the course of that dialog, the    
 1122    need for Unicode Consortium concern with security issues in             
 1123    applications of the Unicode character set could be clarified.  It       
 1124    would be unfortunate from almost every perspective considered here,     
 1125    if such matters slowed the inclusion of as yet unencoded scripts.       
 1127 3.2.  Version Changes and Normalization Issues                             
 1129 3.2.1.  Unnormalized Combining Sequences                                   
 1131    One of the advantages of the Unicode model of combining characters,     
 1132    as with previous systems that use character overstriking to             
 1133    accomplish similar purposes, is that it is possible to use sequences    
 1134    of code points to generate characters that are not explicitly           
 1135    provided for in the character set.  However, unless sequences that      
 1136    are not explicitly provided for are prohibited by some mechanism        
 1137    (such as the normalization tables), such combining sequences can        
 1138    permit two related dangers.                                             
   o  The first is another risk of character confusion, especially if
      the relationship between a combining character and the characters
      it combines with is not precisely defined or if unexpected
      combinations of combining characters are used.  That issue is
      discussed in more detail, with an example, in Section 2.2.3.
 1146    o  These same issues also inherently impact the stability of the        
 1147       normalization tables.  Suppose that, somewhere in the world, there   
 1148       is a character that looks like a Roman-derived lowercase "i", but    
 1156       with three (not one or two) dots above it.  And suppose that the     
 1157       users of that character agree to represent it by combining a         
 1158       traditional "i" (U+0069) with a combining diaeresis (U+0308).  So    
 1159       far, no problem.  But, later, a broader need for this character is   
 1160       discovered and it is coded into Unicode either as a single           
 1161       precomposed character or, more likely under existing rules, by       
 1162       introducing a three-dot-above combining character.  In either        
 1163       case, that version of Unicode should include a rule in NFKC that     
 1164       maps the "i"-plus-diaeresis sequence into the new, approved, one.    
 1165       If one does not do so, then there is arguably a normalization that   
 1166       should occur that does not.  If one does so, then strings that       
 1167       were valid and normalized (although unanticipated) under the         
 1168       previous versions of Unicode become unnormalized under the new       
 1169       version.  That, in turn, would impact IDNA comparisons because,      
 1170       effectively, it would introduce a change in the matching rules.      
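   The effect described in the second item can be sketched with toy
   tables.  Everything here is hypothetical: OLD_RULES and NEW_RULES
   are not real Unicode data, and a private-use code point stands in
   for the imagined three-dot character.  (In real Unicode, U+0069
   U+0308 already composes to U+00EF, so the sketch models only the
   thought experiment.)

```python
# Private-use code point standing in for the hypothetical new character.
NEW_CHAR = "\uE000"

OLD_RULES = {}                         # older version: sequence left alone
NEW_RULES = {"i\u0308": NEW_CHAR}      # newer version: sequence is remapped

def normalize(label, rules):
    """Apply each sequence -> replacement rule of a toy normalization table."""
    for sequence, replacement in rules.items():
        label = label.replace(sequence, replacement)
    return label

registered = normalize("i\u0308x", OLD_RULES)   # stored at registration time
looked_up = normalize("i\u0308x", NEW_RULES)    # normalized later, at lookup

print(registered == looked_up)                         # False
print(normalize(registered, NEW_RULES) == registered)  # False
```

   The stored label is no longer in normalized form under the newer
   table, and a lookup normalized under that table no longer matches
   it.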
 1172    It would be useful to consider rules that would avoid or minimize       
 1173    these problems with the understanding that, for reasons given           
 1174    elsewhere, simply minimizing it may not be good enough for IDNA.  One   
 1175    partial solution might be to ban any combination of a base character    
 1176    and a combining character that does not appear in a hypothetical        
 1177    "anticipated combinations" table from being used in a domain name       
 1178    label.  The next subsection discusses a more radical, if impractical,   
 1179    view of the problem and its solutions.                                  
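   A check against such a table might look like the sketch below.
   The ANTICIPATED table is hypothetical, the label is assumed to be
   supplied in decomposed form, and a nonzero canonical combining
   class is used as a rough test for "combining character" (which
   misses marks whose combining class is zero).

```python
import unicodedata

# Hypothetical "anticipated combinations" table: (base, combining mark) pairs.
ANTICIPATED = {("e", "\u0301"), ("i", "\u0308")}

def unanticipated_pairs(label):
    """Return (base, mark) pairs in label that are missing from ANTICIPATED."""
    bad, base = [], None
    for ch in label:
        if unicodedata.combining(ch):      # nonzero combining class: a mark
            if (base, ch) not in ANTICIPATED:
                bad.append((base, ch))
        else:
            base = ch                      # remember the most recent base
    return bad

print(unanticipated_pairs("cafe\u0301"))   # []
print(unanticipated_pairs("q\u0308"))      # [('q', '\u0308')]
```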
 1181 3.2.2.  Combining Characters and Character Components                      
 1183    For several reasons, including those discussed above, one thing that    
 1184    increases IDNA complexity and the need for normalization is that        
 1185    combining characters are permitted.  Without them, complexity might     
 1186    be reduced enough to permit easier transitions to new versions.  The    
 1187    community should consider the impact of entirely prohibiting            
 1188    combining characters from IDNs.  While it is almost certainly           
 1189    unfeasible to introduce this change into Unicode as it is now defined   
 1190    and doing so would be extremely disruptive even if it were feasible,    
 1191    the thought experiment can be helpful in understanding both the         
 1192    issues and the implications of the paths not taken.  For example, one   
 1193    consequence of this, of course, is that each new language or script,    
 1194    and several existing ones, would require that all of its characters     
 1195    have Unicode assignments to specific, precomposed, code points.         
 1197    Note that this is not currently permitted within Unicode for Latin      
 1198    scripts.  For non-Latin scripts, some such code points have been        
 1199    defined.  The decisions that govern the assignment of such code         
 1200    points are managed entirely within the Unicode Consortium.  Were the    
 1201    IETF to choose to reduce IDNA complexity by excluding combining         
 1202    characters, no doubt there would be additional input to the Unicode     
 1203    Consortium from users and proponents of scripts that precomposed        
 1211    characters be required.  The IAB and the IETF should examine whether    
 1212    it is appropriate to press the Unicode Consortium to revise these       
 1213    policies or otherwise to recommend actions that would reduce the need   
 1214    for normalization and the related complexities.  However, we have       
 1215    been told that the Technical Committee does not believe it is           
 1216    reasonable or feasible to add all possible precomposed characters to    
 1217    Unicode.  If Unicode cannot be modified to contain the precomposed      
 1218    characters necessary to support existing languages and scripts, much    
 1219    less new ones, this option for IDN restrictions will not be feasible.   
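   The dependency on precomposed code points can be seen with a short
   sketch (Python, standard unicodedata module; the helper name is
   ours).  It assumes the Unicode tables shipped with the Python
   runtime in use.  A letter with a precomposed form survives a ban on
   combining characters after NFC normalization; one without such a
   form, such as "g" with a tilde, does not.

```python
import unicodedata

def has_combining_mark(s):
    """True if any character in s has a mark category (Mn, Mc, or Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in s)

e_acute = unicodedata.normalize("NFC", "e\u0301")   # composes to U+00E9
g_tilde = unicodedata.normalize("NFC", "g\u0303")   # no precomposed form

print(has_combining_mark(e_acute))   # False
print(has_combining_mark(g_tilde))   # True
```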
 1221 3.2.3.  When does normalization occur?                                     
 1223    In many Unicode applications, the preferred solution is to pick a       
 1224    style of normalization and require that all text that is stored or      
 1225    transmitted be normalized to that form.  (This is the approach taken    
 1226    in ongoing work in the IETF on a standard Unicode text form             
 1227    [net-utf8]).  IDNA does not impose this requirement.  Text is           
 1228    normalized and case-reduced at registration time, and only the          
 1229    normalized version is placed in the DNS.  However, there is no          
 1230    requirement that applications show only the native (and lower-case      
 1231    where appropriate) characters associated with the normalized form in    
 1232    discussions or references such as URLs.  If conventions used for        
 1233    all-ASCII DNS labels are to be extended to internationalized forms,     
 1234    such a requirement would be unreasonable, since it would prohibit the   
 1235    use of mixed-case references for clarity or market identification.      
 1236    It might even be culturally inappropriate.  However, without that       
 1237    restriction, the comparison that will ultimately be made in the DNS     
 1238    will be between strings normalized at different times and under         
 1239    different versions of Unicode.  The assertion that a string in          
 1240    normalized form under one version of Unicode will still be in           
 1241    normalized form under all future versions is not sufficient.            
 1242    Normalization at different times also requires that a given source      
 1243    string always normalizes to the same target string, regardless of the   
 1244    version under which it is normalized.  That criterion is much more      
 1245    difficult to fulfill.  The discussion above suggests that it may even   
 1246    be impossible.                                                          
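   One way to probe the stronger criterion for a particular label is
   to normalize it under two versions of the tables and compare the
   results.  The sketch below relies on the fact that Python's
   unicodedata module carries a Unicode 3.2 database object alongside
   its current one; the function name is ours, and whether any given
   label differs depends on the Unicode version of the Python build
   in use.

```python
import unicodedata

def normalizes_identically(label, form="NFKC"):
    """Compare normalization under Unicode 3.2 and the runtime's tables."""
    old = unicodedata.ucd_3_2_0.normalize(form, label)
    new = unicodedata.normalize(form, label)
    return old == new

print(unicodedata.ucd_3_2_0.unidata_version)   # 3.2.0
print(unicodedata.unidata_version)             # depends on the Python build
print(normalizes_identically("example"))       # True
```

   A test of this kind covers only the labels it is run on; it cannot
   establish that the criterion holds for every possible source
   string.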
   Ignoring these issues with combining characters entirely, as IDNA
   effectively does today, may leave us "stuck" at Unicode 3.2, leading
   either to incompatibilities with applications that otherwise use a
   modern version of Unicode (while IDN remains at Unicode 3.2) or to
   painful transitions to new versions.  If decisions are made quickly,
   it may still be possible to make a one-time version upgrade to
   Version 4.1 or Version 5 of Unicode.  However, unless we can impose
   sufficient global restrictions to permit smooth transitions,
   upgrading to versions beyond that one is likely to be painful (e.g.,
   potentially requiring changing strings already in the DNS or even a
   new Punycode prefix) or impossible.
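   For readers who want to see the current behavior concretely, the
   sketch below uses Python's built-in "idna" codec, which implements
   the IDNA specification of RFC 3490 with Nameprep tables based on
   Unicode 3.2.  It shows the case reduction applied before encoding
   and the prefix carried by the encoded label.

```python
label = "B\u00fccher"              # mixed-case, non-ASCII label
ace = label.encode("idna")         # Nameprep, then Punycode with prefix

print(ace)                         # b'xn--bcher-kva'
print(ace.decode("idna"))          # the lowercase form, not the original
print(ace.startswith(b"xn--"))     # True
```

   Because the prefix identifies the encoding rules in force, a change
   to those rules that altered the meaning of existing encoded labels
   is the kind of change that would call for a different prefix.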
 1266 4.  Framework for Next Steps in IDN Development                            
 1268 4.1.  Issues within the Scope of the IETF                                  
 1270 4.1.1.  Review of IDNA                                                     
 1272    The IETF should consider reviewing RFCs 3454, 3490, 3491, and/or        
 1273    3492, and update, replace, or supplement them to meet the criteria of   
 1274    this paragraph (one or more of them may prove impractical after         
 1275    further study).  Any new versions or additional specifications should   
 1276    be adapted to the version of Unicode that is current when they are      
 1277    created.  Ideally, they should specify a path for adapting to future    
 1278    versions of Unicode (some suggestions below may facilitate this).       
 1279    The IETF should also consider whether there are significant             
 1280    advantages to mapping some groups of characters, such as code points    
 1281    assigned to font variations, into others or whether clarity and         
 1282    comprehensibility for the user would be better served by simply         
 1283    prohibiting those characters.  More generally, it appears that it       
 1284    would be worthwhile for the IETF to review whether the Unicode          
 1285    normalization rules now invoked by the Stringprep profile in Nameprep   
 1286    are optimal for the DNS or whether more restrictive rules, or an even   
 1287    more restrictive set of permitted character combinations, would         
 1288    provide better support for DNS internationalization.                    
 1290    The IAB has concluded that there is a consensus within the broader      
 1291    community that lists of code points should be specified by the use of   
 1292    an inclusion-based mechanism (i.e., identifying the characters that     
 1293    are permitted), rather than by excluding a small number of characters   
 1294    from the total Unicode set as Stringprep and Nameprep do today.  That   
 1295    conclusion should be reviewed by the IETF community and action taken    
 1296    as appropriate.                                                         
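   An inclusion-based check is simple to express.  In the sketch
   below, the PERMITTED repertoire is purely hypothetical (ASCII
   letters, digits, hyphen, and three additional letters) and the
   function names are ours; the point is only that anything not
   listed is rejected, including code points that did not exist when
   the list was drawn up.

```python
# Hypothetical permitted repertoire: ASCII LDH plus three extra letters.
PERMITTED = (set("abcdefghijklmnopqrstuvwxyz0123456789-")
             | set("\u00e5\u00e4\u00f6"))

def rejected_code_points(label):
    """Return the code points in label that are not on the inclusion list."""
    return sorted({f"U+{ord(ch):04X}" for ch in label if ch not in PERMITTED})

def is_registrable(label):
    """A label is acceptable only if every character is explicitly listed."""
    return not rejected_code_points(label)

print(is_registrable("r\u00e4ksm\u00f6rg\u00e5s"))   # True
print(rejected_code_points("p\u0430y"))              # ['U+0430']
```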
 1298    We suggest that the individuals doing the review of the code points     
 1299    should work as a specialized design team.  To the extent possible,      
 1300    that work should be done jointly by people with experience from the     
 1301    IETF and deep knowledge of the constraints of the DNS and application   
 1302    design, participants from the Unicode Consortium, and other people      
 1303    necessary to be able to reach a generally-accepted result.  Because     
 1304    any work along these lines would be modifications and updates to        
 1305    standards-track documents, final review and approval of any proposals   
 1306    would necessarily follow normal IETF processes.                         
 1308    It is worth noting that sufficiently extreme changes to IDNA would      
 1309    require a new Punycode prefix, probably with long-term support for      
 1310    both the old prefix and the new one in both registration arrangements   
 1311    and applications.  An alternative, which is almost certainly            
 1312    impractical, would be some sort of "flag day", i.e., a date on which    
 1313    the old rules are simultaneously abandoned by everyone and the new      
 1321    ones adopted.  However, preliminary analysis indicates that few, if     
 1322    any, of the changes recommended for consideration elsewhere in this     
 1323    document would require this type of version change.  For example,       
 1324    suppose additional restrictions, such as those implied above, are       
 1325    imposed on what can be registered.  Those restrictions might require    
 1326    policy decisions about how labels are to be disposed of if they         
 1327    conformed to the earlier rules but not to the new ones.  But they       
 1328    would not inherently require changes in the protocol or prefix.         
 1330 4.1.2.  Non-DNS and Above-DNS Internationalization Approaches              
 1332    The IETF should once again examine the extent to which it is            
 1333    appropriate to try to solve internationalization problems via the DNS   
 1334    and what place the many varieties of so-called "keyword systems" or     
 1335    other Internet navigational techniques might have.  Those techniques    
 1336    can be designed to impose fewer constraints, or at least different      
 1337    constraints, than IDNA and the DNS.  As discussed elsewhere in this     
 1338    document, IDNA cannot support information about scripts, languages,     
 1339    or Unicode versions on lookup.  As a consequence of the nature of DNS   
 1340    lookups, characters and labels either match or do not match; a near-    
 1341    match is simply not a possible concept in the DNS.  By contrast,        
 1342    observation of near-matching is common in human communication and in    
 1343    matching operations performed by people, especially when they have a    
 1344    particular script or language context in mind.  The DNS is further      
 1345    constrained by a fairly rigid internal aliasing system (via CNAME and   
 1346    DNAME resource records), while some applications of international       
 1347    naming may require more flexibility.  Finally, the rigid hierarchy of   
 1348    the DNS --and the tendency in practice for it to become flat at         
 1349    levels nearest the root-- and the need for names to be unique are       
 1350    more suitable for some purposes than others and may not be a good       
 1351    match for some purposes for which people wish to use IDNs.  Each of     
 1352    these constraints can be relaxed or changed by one or more systems      
 1353    that would provide alternatives to direct use of the DNS by users.      
 1354    Some of the issues involved are discussed further in Section 5.3 and    
 1355    various ideas have been discussed in detail in the IETF or IRTF.        
 1356    Many of those ideas have even been described in Internet Drafts or      
 1357    other documents.  As experience with IDNs and with expectations for     
 1358    them accumulates, it will probably become appropriate for the IETF or   
 1359    IRTF to revisit the underlying questions and possibilities.             
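   The contrast between exact matching and near matching can be
   sketched as follows.  The zone contents, function names, and
   similarity cutoff are hypothetical, and the similarity measure
   (from Python's difflib module) merely stands in for whatever a
   system layered above the DNS might use.

```python
import difflib

# Hypothetical zone data using documentation addresses.
zone = {"example": "192.0.2.1", "exemplar": "192.0.2.2"}

def dns_style_lookup(name):
    """Exact match or nothing, as in the DNS."""
    return zone.get(name)

def keyword_style_lookup(name, cutoff=0.8):
    """Return candidate names whose similarity ratio meets the cutoff."""
    return difflib.get_close_matches(name, list(zone), n=3, cutoff=cutoff)

print(dns_style_lookup("exampel"))       # None
print(keyword_style_lookup("exampel"))   # candidates for a user to choose from
```

   The second function returns candidates rather than an answer, which
   is the essential difference: a near match needs a person or a
   policy to resolve it.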
 1361 4.1.3.  Security Issues, Certificates, etc.                                
 1363    Some characters look like others, often as the result of common         
 1364    origins.  The problem with these "confusable" characters, often         
 1365    incorrectly called homographs, has always existed when characters are   
 1366    presented to humans who interpret what is displayed and then make       
 1367    decisions based on what is seen.  This is not a problem that exists     
 1368    only when working with internationalized domain names, but they make    
   the problem worse.  A survey that documented what the problems are
   might be interesting.  Many of these issues are mentioned in Unicode
   Technical Report #36 [UTR36].
 1380    In this and other issues associated with IDNs, precise use of           
 1381    terminology is important lest even more confusion result.  The          
 1382    definition of the term 'homograph' that normally appears in             
 1383    dictionaries and linguistic texts states that homographs are            
 1384    different words that are spelled identically (for example, the          
 1385    adjective 'brief' meaning short, the noun 'brief' meaning a document,   
 1386    and the verb 'brief' meaning to inform).  By definition, letters in     
 1387    two different alphabets are not the same, regardless of similarities    
 1388    in appearance.  This means that sequences of letters from two           
 1389    different scripts that appear to be identical on a computer display     
 1390    cannot be homographs in the accepted sense, even if they are both       
 1391    words in the dictionary of some language.  Assuming that there is a     
 1392    language written with Cyrillic script in which "cap" is a word,         
 1393    regardless of what it might mean, it is not a homograph of the          
 1394    Latin-script English word "cap".                                        
 1396    When the security implications of visually confusable characters were   
 1397    brought to the forefront in 2005, the term homograph was used to        
 1398    designate any instance of graphic similarity, even when comparing       
 1399    individual characters.  This usage is not only incorrect, but risks     
 1400    introducing even more confusion and hence should be avoided.  The       
 1401    current preferred terminology is to describe these similar-looking      
 1402    characters as "confusable characters" or even "confusables".            
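   The "cap" example above can be made concrete with a short sketch.
   The TO_LATIN map is a hypothetical, three-entry stand-in for a real
   confusables table, and the script guess based on character names
   is a rough heuristic, since Python's standard library does not
   expose the Unicode script property.

```python
import unicodedata

# Hypothetical, tiny confusables map (Cyrillic -> Latin look-alike).
TO_LATIN = {"\u0441": "c", "\u0430": "a", "\u0440": "p"}

def skeleton(label):
    """Replace each character with its look-alike from TO_LATIN, if any."""
    return "".join(TO_LATIN.get(ch, ch) for ch in label)

def scripts(label):
    """Rough script guess: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label}

latin_cap = "cap"
cyrillic_cap = "\u0441\u0430\u0440"

print(latin_cap == cyrillic_cap)                       # False
print(skeleton(latin_cap) == skeleton(cyrillic_cap))   # True
print(scripts(cyrillic_cap))                           # {'CYRILLIC'}
print(sorted(scripts("c\u0430p")))                     # ['CYRILLIC', 'LATIN']
```

   The two strings are distinct as code point sequences but collapse
   to the same skeleton, which is what makes them confusable rather
   than homographic.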
 1404    Many people have suggested that confusable characters are a problem     
 1405    that must be addressed, at least in part, directly in the user          
 1406    interfaces of application software.  While it should almost certainly   
   be part of a complete solution, that approach creates its own set of
 1408    difficulties.  For example, a user switching between systems, or even   
 1409    between applications on the same system, may be surprised by            
 1410    different types of behavior and different levels of protection.  In     
 1411    addition, it is unclear how a secure setup for the end user should be   
 1412    designed.  Today, in the web browser, a padlock is a traditional way    
 1413    of describing some level of security for the end user.  Is this         
 1414    binary signaling enough?  Should there be any connection between a      
 1415    risk for a displayed string including confusable characters and the     
 1416    padlock or similar signaling to the user?                               
 1418    Many web browsers have adopted a convention, based on a "whitelist"     
 1419    or similar technique, of restricting the display of native characters   
 1420    to subdomains of top-level domains that are deemed to have safe         
 1421    practices for the registration of potentially confusable labels.        
 1422    IDNs in other domains are displayed as Punycode.  These techniques      
 1423    may not be sufficiently sensitive to differences in policies among      
 1431    top-level domains and their subdomains and so, while they are clearly   
 1432    helpful, they may not be adequate.  Are other methods of dealing with   
 1433    confusable characters possible?  Would other methods of identifying     
 1434    and listing policies about avoiding confusing registrations be          
 1435    feasible and helpful?                                                   
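   The display convention described above can be sketched as follows.
   The TRUSTED_TLDS set and the function name are hypothetical and do
   not reflect any particular browser's list or logic; the encoding
   step uses Python's built-in "idna" codec.

```python
# Hypothetical whitelist of top-level domains deemed to have safe practices.
TRUSTED_TLDS = {"example"}

def display_form(domain):
    """Return the native form for whitelisted TLDs, otherwise the ACE form."""
    tld = domain.rsplit(".", 1)[-1].lower()
    if tld in TRUSTED_TLDS:
        return domain
    return domain.encode("idna").decode("ascii")

print(display_form("b\u00fccher.example"))   # native characters shown
print(display_form("b\u00fccher.test"))      # xn--bcher-kva.test
```

   The decision turns only on the top-level domain, which is why such
   a scheme cannot reflect differences in policy among the zones
   below it.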
   It would be interesting to see a more coordinated effort in
   establishing guidelines for user interfaces.  If nothing else, the
   current whitelists are browser-specific and can, and do, differ
   between implementations.
 1442 4.1.4.  Protocol Changes and Policy Implications                           
 1444    Some potential protocol or table changes raise important policy         
 1445    issues about what to do with existing, registered, names.  Should       
 1446    such changes be needed, their impact must be carefully evaluated in     
 1447    the IETF, ICANN, and possibly other forums.  In particular, protocol    
 1448    or policy changes that would not permit existing names to be            
 1449    registered under the newer rules should be considered carefully,        
 1450    balancing their importance against possible disruption and the issues   
 1451    of invalidating older names against the importance of consistency as    
 1452    seen by the user.                                                       
 1454 4.1.5.  Non-US-ASCII in Local Part of Email Addresses                      
   Work is going on in the IETF related to the local part of email
   addresses.  It should be noted that the local part of an email
   address has a much different syntax and different constraints than
   a domain name label, so it is not possible to apply IDNA directly
   to the local part.
 1461 4.1.6.  Use of the Unicode Character Set in the IETF                       
 1463    Unicode and the closely-related ISO 10646 are the only coded            
 1464    character sets that aspire to include all of the world's characters.    
 1465    As such, they permit use of international characters without having     
 1466    to identify particular character coding standards or tables.  The       
 1467    requirement for a single character set is particularly important for    
 1468    use with the DNS since there is no place to put character set           
 1469    identification.  The decision to use Unicode as the base for IETF       
 1470    protocols going forward is discussed in [RFC2277].  The IAB does not    
 1471    see any reason to revisit the decision to use Unicode in IETF           
 1472    protocols.                                                              
 1486 4.2.  Issues That Fall within the Purview of ICANN                         
 1488 4.2.1.  Dispute Resolution                                                 
 1490    IDNs create new types of collisions between trademarks and domain       
 1491    names as well as collisions between domain names.  These have impact    
 1492    on dispute resolution processes used by registries and otherwise.  It   
 1493    is important that deployment of IDNs evolve in parallel with review     
 1494    and updating of ICANN or registry-specific dispute resolution           
 1495    processes.                                                              
 1497 4.2.2.  Policy at Registries                                               
 1499    The IAB recommends that registries use an inclusion-based model when    
 1500    choosing what characters to allow at the time of registration.  This    
 1501    list of characters is in turn to be a subset of what is allowed         
 1502    according to the updated IDNA standard.  The IAB further recommends     
 1503    that registries develop their inclusion-based models in parallel with   
 1504    the dispute resolution process at the registry itself.
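
   A minimal sketch of what an inclusion-based check might look like at
   registration time follows.  The permitted set, the function names,
   and the sample labels are all hypothetical; a real registry would
   derive its list from its own policy and would also verify that the
   list is a subset of what the IDNA standard allows, which this sketch
   does not attempt.

```python
# Hypothetical registry policy: the ONLY code points this zone accepts.
# This set is an illustrative assumption, not a recommended table.
PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | set("\u00e5\u00e4\u00f6")


def label_permitted(label: str) -> bool:
    """Return True only if every character is on the inclusion list."""
    return len(label) > 0 and all(ch in PERMITTED for ch in label)


def rejected_characters(label: str) -> list:
    """List the characters that caused a label to be refused."""
    return sorted({ch for ch in label if ch not in PERMITTED})


# U+00F6 is on the list; U+0430 (CYRILLIC SMALL LETTER A) is not, so the
# second label is refused even though it looks like an ASCII string.
print(label_permitted("malm\u00f6"))
print(label_permitted("p\u0430ypal"), rejected_characters("p\u0430ypal"))
```

   The point of the inclusion model is that anything not explicitly
   listed is refused, so newly added or unexpected characters are
   excluded by default rather than admitted by default.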
 1506    Most established policies for dealing with claimed or apparent          
 1507    confusion or conflicts of names are based on dispute resolution.        
 1508    Decisions about legitimate use or registration of one or more names     
 1509    are resolved at or after the time of registration on a case-by-case     
 1510    basis and using policies that are specific to the particular DNS zone   
 1511    or jurisdiction involved.  These policies have generally not been       
 1512    extended below the level of the DNS that is directly controlled by      
 1513    the top-level registry.                                                 
 1515    Because of the number of conflicts that can be generated by the         
 1516    larger number of available and confusable characters in Unicode, we     
 1517    recommend that registration-restriction and dispute resolution          
 1518    policies be developed to constrain registration of IDNs and zone        
 1519    administrators at all levels of the DNS tree.  Of course, many of       
 1520    these policies will be less formal than others and there is no          
 1521    requirement for complete global consistency, but the arguments for      
 1522    reduction of confusable characters and other issues in TLDs should      
 1523    apply to all zones below that specific TLD.                             
 1525    Consistency across all zones can obviously only be accomplished by      
 1526    changes to the protocols.  Such changes should be considered by the     
 1527    IETF if particular restrictions are identified that are important and   
 1528    consistent enough to be applied globally.                               
 1530    Some potential protocol changes or changes to character-mapping         
 1531    tables might, if adopted, have profound registry policy implications.   
 1532    See Section 4.1.4.                                                      
 1541 4.2.3.  IDNs at the Top Level of the DNS                                   
 1543    The IAB has concluded that there is not one issue with IDNs at the      
 1544    top level of the DNS (IDN TLDs) but at least three very separate        
 1545    ones:                                                                   
 1547    o  If IDNs are to be entered in the root zone, decisions must first     
 1548       be made about how these TLDs are to be named and delegated.  These   
 1549       decisions fall within the traditional IANA scope and are ICANN       
 1550       issues today.                                                        
 1552    o  There has been discussion of permitting some or all existing TLDs    
 1553       to be referenced by multiple labels, with those labels presumably    
 1554       representing some understanding of the "name" of the TLD in          
 1555       different languages.  If actual aliases of this type are desired     
 1556       for existing domains, the IETF may need to consider whether the      
 1557       use of DNAME records in the root is appropriate to meet that need,   
 1558       what constraints, if any, are needed, whether alternate              
 1559       approaches, such as those of [RFC4185], are appropriate or whether   
 1560       further alternatives should be investigated.  But, to the extent     
 1561       to which aliases are considered desirable and feasible, decisions    
 1562       presumably must be made as to which, if any, root IDN labels         
 1563       should be associated with DNAME records and which ones should be     
 1564       handled by normal delegation records or other mechanisms.  That      
 1565       decision is one of DNS root-level namespace policy and hence falls   
 1566       to ICANN although we would expect ICANN to pay careful attention     
 1567       to any technical, operational, or security recommendations that      
 1568       may be produced by other bodies.                                     
 1570    o  Finally, if IDN labels are to be placed in the root zone, there      
 1571       are issues associated with how they are to be encoded and            
 1572       deployed.  This area may have implications for work that has been    
 1573       done, or should be done, in the IETF.                                
 1575 5.  Specific Recommendations for Next Steps                                
 1577    Consistent with the framework described above, the IAB offers these     
 1578    recommendations as steps for further consideration in the identified    
 1579    groups.                                                                 
 1581 5.1.  Reduction of Permitted Character List                                
 1583    Generalize from the original "hostname" rules to non-ASCII              
 1584    characters, permitting as few characters as possible to do that job.    
 1585    This would involve a restrictive model for characters permitted in      
 1586    IDN labels, thus contrasting with the approach used to develop the      
 1587    original IDNA/Nameprep tables.  That approach was to include every
 1588    Unicode character for which there was no clear reason for exclusion.
 1596    The specific recommendation here is to specify such internationalized   
 1597    hostnames.  Such an activity would fall to the IETF, although the       
 1598    task of developing the appropriate list of permitted characters will    
 1599    require effort both in the IETF and elsewhere.  The effort should be    
 1600    as linguistically and culturally sensitive as possible, but smooth      
 1601    and effective operation of the DNS, including minimization of
 1602    complexity, should be primary goals.  The following should be           
 1603    considered as possible mechanisms for achieving an appropriate          
 1604    minimum number of characters.                                           
 1606 5.1.1.  Elimination of All Non-Language Characters                         
 1608    Unicode characters that are not needed to write words or numbers in     
 1609    any of the world's languages should be eliminated from the list of      
 1610    characters that are appropriate in DNS labels.  In addition to such     
 1611    characters as those used for box-drawing and sentence punctuation,      
 1612    this should exclude punctuation for word structure and other            
 1613    delimiters.  While DNS labels may conveniently be used to express       
 1614    words in many circumstances, the goal is not to express words (or       
 1615    sentences or phrases), but to permit the creation of unambiguous        
 1616    labels with good mnemonic value.                                        
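
   One possible way to approximate this filter is by Unicode general
   category, as in the sketch below.  Treating letters (L*), combining
   marks (M*), and decimal digits (Nd) as the characters "needed to
   write words or numbers" is an assumption of this example, not a rule
   stated in this document; a real table would need script-by-script
   review.

```python
import unicodedata


def is_language_character(ch: str) -> bool:
    """Assumed approximation: letters, combining marks, decimal digits."""
    category = unicodedata.category(ch)
    return category[0] in ("L", "M") or category == "Nd"


def non_language_characters(label: str) -> list:
    """Return the characters of a label that the filter would eliminate."""
    return [ch for ch in label if not is_language_character(ch)]


# U+2500 is a box-drawing character (category So) and "!" is sentence
# punctuation (category Po); both are reported as non-language.
print(non_language_characters("caf\u00e9\u2500!"))
```

   Note that under this assumed rule the ASCII hyphen (category Pd) is
   also reported, which is the question Section 5.1.2 takes up.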
 1618 5.1.2.  Elimination of Word-Separation Punctuation                         
 1620    The inclusion of the hyphen in the original hostname rules is a         
 1621    historical artifact from an older, flat, namespace.  The community      
 1622    should consider whether it is appropriate to treat it as a simple       
 1623    legacy property of ASCII names and not attempt to generalize it to      
 1624    other scripts.  We might, for example, not permit claimed equivalents   
 1625    to the hyphen from other scripts to be used in IDNs.  We might even     
 1626    consider banning use of the hyphen itself in non-ASCII strings or,      
 1627    less restrictively, strings that contained non-Latin characters.        
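
   The two hyphen restrictions floated above could be sketched as
   follows.  The function names and the "strict" flag are hypothetical,
   and the test for "Latin" characters (a Unicode character name that
   starts with "LATIN") is a rough heuristic assumed for this example
   only.

```python
import unicodedata


def is_ascii_or_latin(ch: str) -> bool:
    """Rough heuristic (assumption): ASCII, or a name starting 'LATIN'."""
    return ch.isascii() or unicodedata.name(ch, "").startswith("LATIN")


def hyphen_allowed(label: str, strict: bool = True) -> bool:
    """Decide whether a label's use of U+002D would be acceptable.

    strict=True  : hyphen banned in any label containing non-ASCII.
    strict=False : hyphen banned only if a non-Latin character appears.
    """
    if "-" not in label:
        return True
    if strict:
        return label.isascii()
    return all(is_ascii_or_latin(ch) for ch in label)


print(hyphen_allowed("my-host"))                      # all-ASCII label
print(hyphen_allowed("b\u00fc-cher"))                 # non-ASCII, strict rule
print(hyphen_allowed("b\u00fc-cher", strict=False))   # Latin-only, looser rule
print(hyphen_allowed("\u03b1-\u03b2", strict=False))  # Greek letters
```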
 1629 5.2.  Updating to New Versions of Unicode                                  
 1631    As new scripts, to support new languages, continue to be added to       
 1632    Unicode, it is important that IDNA track updates.  If it does not do    
 1633    so, but remains "stuck" at 3.2 or some single later version, it will    
 1634    not be possible to include labels in the DNS that are derived from      
 1635    words in languages that require characters that are available only in   
 1636    later versions.  Making those upgrades is difficult, and will           
 1637    continue to be difficult, as long as new versions require, not just     
 1638    addition of characters, but changes to canonicalization conventions,    
 1639    normalization tables, or matching procedures (see Section 3.1).         
 1640    Anything that can be done to lower complexity and simplify forward      
 1641    transitions should be seriously considered.                             
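
   The effect of remaining at Unicode 3.2 can be seen with Python's
   standard unicodedata module, which exposes a Unicode 3.2.0 database
   as unicodedata.ucd_3_2_0 alongside the current one.  The helper
   names below are hypothetical, and the N'Ko letter is merely one
   convenient example of a character added to Unicode after version
   3.2.

```python
import unicodedata


def assigned_in_unicode_3_2(ch: str) -> bool:
    """True if the code point was an assigned character in Unicode 3.2."""
    return unicodedata.ucd_3_2_0.category(ch) != "Cn"


def needs_newer_unicode(label: str) -> list:
    """Characters assigned in the current database but not in 3.2."""
    return [
        ch
        for ch in label
        if not assigned_in_unicode_3_2(ch) and unicodedata.category(ch) != "Cn"
    ]


# U+07CA NKO LETTER A belongs to a script added after Unicode 3.2, so a
# system frozen at 3.2 treats it as unassigned.
print(unicodedata.unidata_version)
print([f"U+{ord(ch):04X}" for ch in needs_newer_unicode("a\u07ca")])
```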
 1651 5.3.  Role and Uses of the DNS                                             
 1653    We wish to remind the community that there are boundaries to the        
 1654    appropriate uses of the DNS.  It was designed and implemented to        
 1655    serve some specific purposes.  There are additional things that it      
 1656    does well, other things that it does badly, and still other things it   
 1657    cannot do at all.  No amount of protocol work on IDNs will solve        
 1658    problems with alternate spellings, near-matches, searching for          
 1659    appropriate names, and so on.  Registration restrictions and            
 1660    carefully-designed user interfaces can be used to reduce the risk and   
 1661    pain of attempts to do some of these things gone wrong, as well as      
 1662    reducing the risks of various sorts of deliberate bad behavior, but,
 1663    beyond a certain point, use of the DNS simply because it is available   
 1664    becomes a bad tradeoff.  The tradeoff may be particularly unfortunate   
 1665    when the use of IDNs does not actually solve the proposed problem.      
 1666    For example, internationalization of DNS names does not eliminate the   
 1667    ASCII protocol identifiers and structure of URIs [RFC3986] and even     
 1668    IRIs [RFC3987].  Hence, DNS internationalization itself, at any or      
 1669    all levels of the DNS tree, is not a sufficient response to the         
 1670    desire of populations to use the Internet entirely in their own         
 1671    languages and the characters associated with those languages.           
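
   The point about ASCII protocol identifiers can be made concrete with
   a short sketch.  The identifier below is a made-up example under the
   reserved "example" TLD, and the conversion uses Python's built-in
   "idna" codec, which implements the IDNA of [RFC3490].

```python
from urllib.parse import urlsplit

# Hypothetical identifier with a non-ASCII host label (U+00FC).
iri = "http://b\u00fccher.example/"
parts = urlsplit(iri)

scheme = parts.scheme   # the protocol identifier is still ASCII
host = parts.hostname   # the internationalized host name

# ToASCII per RFC 3490: this is the form that is actually looked up.
ace_host = host.encode("idna").decode("ascii")

print(scheme, ace_host)
```

   Even with an internationalized host name, the scheme and the "://"
   and "/" delimiters remain ASCII, so the identifier as a whole is not
   written entirely in the user's own script.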
 1673    These issues are discussed at more length, and alternatives             
 1674    presented, in [RFC2825], [RFC3467], [INDNS], and [DNS-Choices].         
 1676 5.4.  Databases of Registered Names                                        
 1678    In addition to their presence in the DNS, IDNs introduce issues in      
 1679    other contexts in which domain names are used.  In particular, the      
 1680    design and content of databases that bind registered names to           
 1681    information about the registrant (commonly described as "whois"         
 1682    databases) will require review and updating.  For example, the whois    
 1683    protocol itself [RFC3912] has no standard capability for handling       
 1684    non-ASCII text: one cannot search consistently for, or report, either   
 1685    a DNS name or contact information that is not in ASCII characters.      
 1686    This may provide some additional impetus for a switch to IRIS           
 1687    [RFC3981] [RFC3982] but also raises a number of other questions about   
 1688    what information, and in what languages and scripts, should be          
 1689    included or permitted in such databases.                                
 1691 6.  Security Considerations                                                
 1693    This document is simply a discussion of IDNs and IDNA issues; it        
 1694    raises no new security concerns.  However, if some of its               
 1695    recommendations to reduce IDNA complexity, the number of available      
 1696    characters, and various approaches to constraining the use of           
 1697    confusable characters, are followed and prove successful, the risks     
 1698    of name spoofing and other problems may be reduced.                     
 1706 7.  Acknowledgements                                                       
 1708    The contributions to this report from members of the IAB-IDN ad hoc     
 1709    committee are gratefully acknowledged.  Of course, not all of the       
 1710    members of that group endorse every comment and suggestion of this      
 1711    report.  In particular, this report does not claim to reflect the       
 1712    views of the Unicode Consortium as a whole or those of particular       
 1713    participants in the work of that Consortium.                            
 1715    The members of the ad hoc committee were: Rob Austein, Leslie Daigle,   
 1716    Tina Dam, Mark Davis, Patrik Faltstrom, Scott Hollenbeck, Cary Karp,    
 1717    John Klensin, Gervase Markham, David Meyer, Thomas Narten, Michael      
 1718    Suignard, Sam Weiler, Bert Wijnen, Kurt Zeilenga, and Lixia Zhang.      
 1720    Thanks are due to Tina Dam and others associated with the ICANN IDN     
 1721    Working Group for contributions of considerable specific text, to       
 1722    Marcos Sanz and Paul Hoffman for careful late-stage reading and         
 1723    extensive comments, and to Pete Resnick for many contributions and      
 1724    comments, both in conjunction with his former IAB service and           
 1725    subsequently.  Olaf M. Kolkman took over IAB leadership for this        
 1726    document after Patrik Faltstrom and Pete Resnick stepped down in        
 1727    March 2006.                                                             
 1729    Members of the IAB at the time of approval of this document were:       
 1730    Bernard Aboba, Loa Andersson, Brian Carpenter, Leslie Daigle, Patrik    
 1731    Faltstrom, Bob Hinden, Kurtis Lindqvist, David Meyer, Pekka Nikander,   
 1732    Eric Rescorla, Pete Resnick, Jonathan Rosenberg and Lixia Zhang.        
 1734 8.  References                                                             
 1736 8.1.  Normative References                                                 
 1738    [ISO10646]          International Organization for Standardization,     
 1739                        "Information Technology - Universal Multiple-       
 1740                        Octet Coded Character Set (UCS) - Part 1:           
 1741                        Architecture and Basic Multilingual Plane",
 1742                        ISO/IEC 10646-1:2000, October 2000.                 
 1744    [RFC3454]           Hoffman, P. and M. Blanchet, "Preparation of        
 1745                        Internationalized Strings ("stringprep")",          
 1746                        RFC 3454, December 2002.                            
 1748    [RFC3490]           Faltstrom, P., Hoffman, P., and A. Costello,        
 1749                        "Internationalizing Domain Names in Applications    
 1750                        (IDNA)", RFC 3490, March 2003.                      
 1761    [RFC3491]           Hoffman, P. and M. Blanchet, "Nameprep: A           
 1762                        Stringprep Profile for Internationalized Domain     
 1763                        Names (IDN)", RFC 3491, March 2003.                 
 1765    [RFC3492]           Costello, A., "Punycode: A Bootstring encoding of   
 1766                        Unicode for Internationalized Domain Names in       
 1767                        Applications (IDNA)", RFC 3492, March 2003.         
 1769    [Unicode32]         The Unicode Consortium, "The Unicode Standard,      
 1770                        Version 3.0", 2000.                                 
 1771                        (Reading, MA, Addison-Wesley, 2000.  ISBN           
 1772                        0-201-61633-5).  Version 3.2 consists of the        
 1773                        definition in that book as amended by the Unicode   
 1774                        Standard Annex #27: Unicode 3.1                     
 1775                        (http://www.unicode.org/reports/tr27/) and by the   
 1776                        Unicode Standard Annex #28: Unicode 3.2             
 1777                        (http://www.unicode.org/reports/tr28/).             

 1779 8.2.  Informative References                                               
 1781    [DNS-Choices]       Faltstrom, P., "Design Choices When Expanding       
 1782                        DNS", Work in Progress, June 2005.                  
 1784    [ICANNv1]           ICANN, "Guidelines for the Implementation of        
 1785                        Internationalized Domain Names, Version 1.0",       
 1786                        March 2003, <http://www.icann.org/general/          
 1787                        idn-guidelines-20jun03.htm>.                        
 1789    [ICANNv2]           ICANN, "Guidelines for the Implementation of        
 1790                        Internationalized Domain Names, Version 2.0",       
 1791                        November 2005, <http://www.icann.org/general/       
 1792                        idn-guidelines-20sep05.htm>.                        
 1794    [IESG-IDN]          Internet Engineering Steering Group (IESG), "IESG   
 1795                        Statement on IDN", IESG Statements IDN Statement,   
 1796                        February 2003, <http://www.ietf.org/IESG/           
 1797                        STATEMENTS/IDNstatement.txt>.                       
 1799    [INDNS]             National Research Council, "Signposts in            
 1800                        Cyberspace: The Domain Name System and Internet     
 1801                        Navigation", National Academy Press ISBN 0309-      
 1802                        09640-5 (Book) 0309-54979-5 (PDF), 2005, <http://   
 1803                        www7.nationalacademies.org/cstb/pub_dns.html>.      
 1805    [ISO.2022.1986]     International Organization for Standardization,     
 1806                        "Information Processing: ISO 7-bit and 8-bit        
 1807                        coded character sets: Code extension techniques",   
 1808                        ISO Standard 2022, 1986.                            
 1816    [ISO.646.1991]      International Organization for Standardization,     
 1817                        "Information technology - ISO 7-bit coded           
 1818                        character set for information interchange",         
 1819                        ISO Standard 646, 1991.                             
 1821    [ISO.8859.2003]     International Organization for Standardization,     
 1822                        "Information processing - 8-bit single-byte coded   
 1823                        graphic character sets - Part 1: Latin alphabet     
 1824                        No. 1 (1998) - Part 2: Latin alphabet No. 2         
 1825                        (1999) - Part 3: Latin alphabet No. 3 (1999) -      
 1826                        Part 4: Latin alphabet No. 4 (1998) - Part 5:       
 1827                        Latin/Cyrillic alphabet (1999) - Part 6: Latin/     
 1828                        Arabic alphabet (1999) - Part 7: Latin/Greek        
 1829                        alphabet (2003) - Part 8: Latin/Hebrew alphabet     
 1830                        (1999) - Part 9: Latin alphabet No. 5 (1999) -      
 1831                        Part 10: Latin alphabet No. 6 (1998) - Part 11:     
 1832                        Latin/Thai alphabet (2001) - Part 13: Latin         
 1833                        alphabet No. 7 (1998) - Part 14: Latin alphabet     
 1834                        No. 8 (Celtic) (1998) - Part 15: Latin alphabet     
 1835                        No. 9 (1999) - Part 16: Latin alphabet
 1836                        No. 10 (2001)", ISO Standard 8859, 2003.            
 1838    [RFC2277]           Alvestrand, H., "IETF Policy on Character Sets      
 1839                        and Languages", BCP 18, RFC 2277, January 1998.     
 1841    [RFC2825]           IAB and L. Daigle, "A Tangled Web: Issues of        
 1842                        I18N, Domain Names, and the Other Internet          
 1843                        protocols", RFC 2825, May 2000.                     
 1845    [RFC3066]           Alvestrand, H., "Tags for the Identification of     
 1846                        Languages", BCP 47, RFC 3066, January 2001.         
 1848    [RFC3467]           Klensin, J., "Role of the Domain Name System        
 1849                        (DNS)", RFC 3467, February 2003.                    
 1851    [RFC3536]           Hoffman, P., "Terminology Used in                   
 1852                        Internationalization in the IETF", RFC 3536,        
 1853                        May 2003.                                           
 1855    [RFC3743]           Konishi, K., Huang, K., Qian, H., and Y. Ko,        
 1856                        "Joint Engineering Team (JET) Guidelines for        
 1857                        Internationalized Domain Names (IDN) Registration   
 1858                        and Administration for Chinese, Japanese, and       
 1859                        Korean", RFC 3743, April 2004.                      
 1861    [RFC3912]           Daigle, L., "WHOIS Protocol Specification",         
 1862                        RFC 3912, September 2004.                           
 1871    [RFC3981]           Newton, A. and M. Sanz, "IRIS: The Internet         
 1872                        Registry Information Service (IRIS) Core            
 1873                        Protocol", RFC 3981, January 2005.                  
 1875    [RFC3982]           Newton, A. and M. Sanz, "IRIS: A Domain Registry    
 1876                        (dreg) Type for the Internet Registry Information   
 1877                        Service (IRIS)", RFC 3982, January 2005.            
 1879    [RFC3986]           Berners-Lee, T., Fielding, R., and L. Masinter,     
 1880                        "Uniform Resource Identifier (URI): Generic         
 1881                        Syntax", STD 66, RFC 3986, January 2005.            
 1883    [RFC3987]           Duerst, M. and M. Suignard, "Internationalized      
 1884                        Resource Identifiers (IRIs)", RFC 3987,             
 1885                        January 2005.                                       
 1887    [RFC4185]           Klensin, J., "National and Local Characters for     
 1888                        DNS Top Level Domain (TLD) Names", RFC 4185,        
 1889                        October 2005.                                       
 1891    [RFC4290]           Klensin, J., "Suggested Practices for               
 1892                        Registration of Internationalized Domain Names      
 1893                        (IDN)", RFC 4290, December 2005.                    
 1895    [RFC4645]           Ewell, D., "Initial Language Subtag Registry",      
 1896                        RFC 4645, September 2006.                           
 1898    [RFC4646]           Phillips, A. and M. Davis, "Tags for Identifying    
 1899                        Languages", BCP 47, RFC 4646, September 2006.       
 1901    [UTR]               Unicode Consortium, "Unicode Technical Reports",    
 1902                        <http://www.unicode.org/reports/>.                  
 1904    [UTR36]             Davis, M. and M. Suignard, "Unicode Technical       
 1905                        Report #36: Unicode Security Considerations",       
 1906                        November 2005, <http://www.unicode.org/draft/       
 1907                        reports/tr36/tr36.html>.                            
 1909    [UTR39]             Davis, M. and M. Suignard, "Unicode Technical       
 1910                        Standard #39 (proposed): Unicode Security           
 1911                        Considerations", July 2005, <http://                
 1912                        www.unicode.org/draft/reports/tr39/tr39.html>.      
 1914    [Unicode-PR29]      The Unicode Consortium, "Public Review Issue #29:   
 1915                        Normalization Issue", Unicode PR 29,                
 1916                        February 2004.                                      
 1918    [Unicode10]         The Unicode Consortium, "The Unicode Standard,      
 1926                        Version 1.0", 1991.                                 
 1928    [W3C-Localization]  Ishida, R. and S. Miller, "Localization vs.         
 1929                        Internationalization", W3C International/           
 1930                        questions/qa-i18n.txt, December 2005.               
 1932    [net-utf8]          Klensin, J. and M. Padlipsky, "Unicode Format for   
 1933                        Network Interchange", Work in Progress,             
 1934                        April 2006.                                         
 1936 Authors' Addresses                                                         
 1938    John C Klensin                                                          
 1939    1770 Massachusetts Ave, #322                                            
 1940    Cambridge, MA  02140                                                    
 1941    USA                                                                     
 1943    Phone: +1 617 491 5735                                                  
 1944    EMail: john-ietf@jck.com                                                
 1947    Patrik Faltstrom                                                        
 1948    Cisco Systems                                                           
 1950    EMail: paf@cisco.com                                                    
 1953    Cary Karp                                                               
 1954    Swedish Museum of Natural History                                       
 1955    Box 50007                                                               
 1956    Stockholm  SE-10405                                                     
 1957    Sweden                                                                  
 1959    Phone: +46 8 5195 4055                                                  
 1960    EMail: ck@nrm.museum                                                    
 1963    IAB                                                                     
 1965    EMail: iab@iab.org                                                      
 1981 Full Copyright Statement                                                   
 1983    Copyright (C) The Internet Society (2006).                              
 1985    This document is subject to the rights, licenses and restrictions       
 1986    contained in BCP 78, and except as set forth therein, the authors       
 1987    retain all their rights.                                                
 1989    This document and the information contained herein are provided on an   
 1997 Intellectual Property                                                      
 1999    The IETF takes no position regarding the validity or scope of any       
 2000    Intellectual Property Rights or other rights that might be claimed to   
 2001    pertain to the implementation or use of the technology described in     
 2002    this document or the extent to which any license under such rights      
 2003    might or might not be available; nor does it represent that it has      
 2004    made any independent effort to identify any such rights.  Information   
 2005    on the procedures with respect to rights in RFC documents can be        
 2006    found in BCP 78 and BCP 79.                                             
 2008    Copies of IPR disclosures made to the IETF Secretariat and any          
 2009    assurances of licenses to be made available, or the result of an        
 2010    attempt made to obtain a general license or permission for the use of   
 2011    such proprietary rights by implementers or users of this                
 2012    specification can be obtained from the IETF on-line IPR repository at   
 2013    http://www.ietf.org/ipr.                                                
 2015    The IETF invites any interested party to bring to its attention any     
 2016    copyrights, patents or patent applications, or other proprietary        
 2017    rights that may cover technology that may be required to implement      
 2018    this standard.  Please address the information to the IETF at           
 2019    ietf-ipr@ietf.org.                                                      
 2021 Acknowledgement                                                            
 2023    Funding for the RFC Editor function is provided by the IETF             
 2024    Administrative Support Activity (IASA).                                 

Section 8.2: Hugo Salgado (Editorial Erratum #4896) [Rejected]
Based on outdated version.

[IESG-IDN]          Internet Engineering Steering Group (IESG), "IESG
                    Statement on IDN", IESG Statements IDN Statement,
                    February 2003, <http://www.ietf.org/IESG/

It should say:
[IESG-IDN]          Internet Engineering Steering Group (IESG), "IESG
                    Statement on IDN", IESG Statements IDN Statement,
                    February 2003, <https://www.ietf.org/iesg/statement/

URL of resource has changed. Original gives 'Not found'.
The right thing to do here is make sure the original URL redirects to
the right place, which is now happening.