RFC 5894

    1 Internet Engineering Task Force (IETF)                        J. Klensin   
    2 Request for Comments: 5894                                   August 2010   
    3 Category: Informational                                                    
    4 ISSN: 2070-1721                                                            
    5                                                                            
    6                                                                            
    7         Internationalized Domain Names for Applications (IDNA):            
    8                  Background, Explanation, and Rationale                    
    9                                                                            
   10 Abstract                                                                   
   11                                                                            
   12    Several years have passed since the original protocol for               
   13    Internationalized Domain Names (IDNs) was completed and deployed.       
   14    During that time, a number of issues have arisen, including the need    
   15    to update the system to deal with newer versions of Unicode.  Some of   
   16    these issues require tuning of the existing protocols and the tables    
   17    on which they depend.  This document provides an overview of a          
   18    revised system and provides explanatory material for its components.    
   19                                                                            
   20 Status of This Memo                                                        
   21                                                                            
   22    This document is not an Internet Standards Track specification; it is   
   23    published for informational purposes.                                   
   24                                                                            
   25    This document is a product of the Internet Engineering Task Force       
   26    (IETF).  It represents the consensus of the IETF community.  It has     
   27    received public review and has been approved for publication by the     
   28    Internet Engineering Steering Group (IESG).  Not all documents          
   29    approved by the IESG are a candidate for any level of Internet          
   30    Standard; see Section 2 of RFC 5741.                                    
   31                                                                            
   32    Information about the current status of this document, any errata,      
   33    and how to provide feedback on it may be obtained at                    
   34    http://www.rfc-editor.org/info/rfc5894.                                 
   35                                                                            
   52 Klensin                       Informational                     [Page 1]   

   53 RFC 5894                     IDNA Rationale                  August 2010   
   54                                                                            
   55                                                                            
   56 Copyright Notice                                                           
   57                                                                            
   58    Copyright (c) 2010 IETF Trust and the persons identified as the         
   59    document authors.  All rights reserved.                                 
   60                                                                            
   61    This document is subject to BCP 78 and the IETF Trust's Legal           
   62    Provisions Relating to IETF Documents                                   
   63    (http://trustee.ietf.org/license-info) in effect on the date of         
   64    publication of this document.  Please review these documents            
   65    carefully, as they describe your rights and restrictions with respect   
   66    to this document.  Code Components extracted from this document must    
   67    include Simplified BSD License text as described in Section 4.e of      
   68    the Trust Legal Provisions and are provided without warranty as         
   69    described in the Simplified BSD License.                                
   70                                                                            
   71    This document may contain material from IETF Documents or IETF          
   72    Contributions published or made publicly available before November      
   73    10, 2008.  The person(s) controlling the copyright in some of this      
   74    material may not have granted the IETF Trust the right to allow         
   75    modifications of such material outside the IETF Standards Process.      
   76    Without obtaining an adequate license from the person(s) controlling    
   77    the copyright in such materials, this document may not be modified      
   78    outside the IETF Standards Process, and derivative works of it may      
   79    not be created outside the IETF Standards Process, except to format     
   80    it for publication as an RFC or to translate it into languages other    
   81    than English.                                                           
   82                                                                            
   83 Table of Contents                                                          
   84                                                                            
   85    1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4   
   86      1.1.  Context and Overview . . . . . . . . . . . . . . . . . . .  4   
   87      1.2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . .  5   
   88        1.2.1.  DNS "Name" Terminology . . . . . . . . . . . . . . . .  5   
   89        1.2.2.  New Terminology and Restrictions . . . . . . . . . . .  6   
   90      1.3.  Objectives . . . . . . . . . . . . . . . . . . . . . . . .  6   
   91      1.4.  Applicability and Function of IDNA . . . . . . . . . . . .  7   
   92      1.5.  Comprehensibility of IDNA Mechanisms and Processing  . . .  8   
   93    2.  Processing in IDNA2008 . . . . . . . . . . . . . . . . . . . .  9   
   94    3.  Permitted Characters: An Inclusion List  . . . . . . . . . . .  9   
   95      3.1.  A Tiered Model of Permitted Characters and Labels  . . . . 10   
   96        3.1.1.  PROTOCOL-VALID . . . . . . . . . . . . . . . . . . . . 10   
   97        3.1.2.  CONTEXTUAL RULE REQUIRED . . . . . . . . . . . . . . . 11   
   98          3.1.2.1.  Contextual Restrictions  . . . . . . . . . . . . . 11   
   99          3.1.2.2.  Rules and Their Application  . . . . . . . . . . . 12   
  100        3.1.3.  DISALLOWED . . . . . . . . . . . . . . . . . . . . . . 12   
  101        3.1.4.  UNASSIGNED . . . . . . . . . . . . . . . . . . . . . . 13   
  102      3.2.  Registration Policy  . . . . . . . . . . . . . . . . . . . 14   
  111      3.3.  Layered Restrictions: Tables, Context, Registration, and        
  112            Applications . . . . . . . . . . . . . . . . . . . . . . . 15   
  113    4.  Application-Related Issues . . . . . . . . . . . . . . . . . . 15   
  114      4.1.  Display and Network Order  . . . . . . . . . . . . . . . . 15   
  115      4.2.  Entry and Display in Applications  . . . . . . . . . . . . 16   
  116      4.3.  Linguistic Expectations: Ligatures, Digraphs, and               
  117            Alternate Character Forms  . . . . . . . . . . . . . . . . 19   
  118      4.4.  Case Mapping and Related Issues  . . . . . . . . . . . . . 20   
  119      4.5.  Right-to-Left Text . . . . . . . . . . . . . . . . . . . . 21   
  120    5.  IDNs and the Robustness Principle  . . . . . . . . . . . . . . 22   
  121    6.  Front-end and User Interface Processing for Lookup . . . . . . 22   
  122    7.  Migration from IDNA2003 and Unicode Version Synchronization  . 25   
  123      7.1.  Design Criteria  . . . . . . . . . . . . . . . . . . . . . 25   
  124        7.1.1.  Summary and Discussion of IDNA Validity Criteria . . . 25   
  125        7.1.2.  Labels in Registration . . . . . . . . . . . . . . . . 26   
  126        7.1.3.  Labels in Lookup . . . . . . . . . . . . . . . . . . . 27   
  127      7.2.  Changes in Character Interpretations . . . . . . . . . . . 28   
  128        7.2.1.  Character Changes: Eszett and Final Sigma  . . . . . . 28   
  129        7.2.2.  Character Changes: Zero Width Joiner and Zero               
  130                Width Non-Joiner . . . . . . . . . . . . . . . . . . . 29   
  131        7.2.3.  Character Changes and the Need for Transition  . . . . 29   
  132        7.2.4.  Transition Strategies  . . . . . . . . . . . . . . . . 30   
  133      7.3.  Elimination of Character Mapping . . . . . . . . . . . . . 31   
  134      7.4.  The Question of Prefix Changes . . . . . . . . . . . . . . 31   
  135        7.4.1.  Conditions Requiring a Prefix Change . . . . . . . . . 31   
  136        7.4.2.  Conditions Not Requiring a Prefix Change . . . . . . . 32   
  137        7.4.3.  Implications of Prefix Changes . . . . . . . . . . . . 32   
  138      7.5.  Stringprep Changes and Compatibility . . . . . . . . . . . 33   
  139      7.6.  The Symbol Question  . . . . . . . . . . . . . . . . . . . 33   
  140      7.7.  Migration between Unicode Versions: Unassigned Code             
  141            Points . . . . . . . . . . . . . . . . . . . . . . . . . . 35   
  142      7.8.  Other Compatibility Issues . . . . . . . . . . . . . . . . 36   
  143    8.  Name Server Considerations . . . . . . . . . . . . . . . . . . 37   
  144      8.1.  Processing Non-ASCII Strings . . . . . . . . . . . . . . . 37   
  145      8.2.  Root and Other DNS Server Considerations . . . . . . . . . 37   
  146    9.  Internationalization Considerations  . . . . . . . . . . . . . 38   
  147    10. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 38   
  148      10.1. IDNA Character Registry  . . . . . . . . . . . . . . . . . 38   
  149      10.2. IDNA Context Registry  . . . . . . . . . . . . . . . . . . 39   
  150      10.3. IANA Repository of IDN Practices of TLDs . . . . . . . . . 39   
  151    11. Security Considerations  . . . . . . . . . . . . . . . . . . . 39   
  152      11.1. General Security Issues with IDNA  . . . . . . . . . . . . 39   
  153    12. Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 39   
  154    13. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 40   
  155    14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40   
  156      14.1. Normative References . . . . . . . . . . . . . . . . . . . 40   
  157      14.2. Informative References . . . . . . . . . . . . . . . . . . 41   
  158                                                                            
  166 1.  Introduction                                                           
  167                                                                            
  168 1.1.  Context and Overview                                                 
  169                                                                            
  170    Internationalized Domain Names in Applications (IDNA) is a collection   
  171    of standards that allow client applications to convert some mnemonic    
  172    strings expressed in Unicode to an ASCII-compatible encoding form       
  173    ("ACE") that is a valid DNS label containing only LDH syntax (see the   
  174    Definitions document [RFC5890]).  The specific form of ACE label used   
  175    by IDNA is called an "A-label".  A client can look up an exact          
  176    A-label in the existing DNS, so A-labels do not require any             
  177    extensions to DNS, upgrades of DNS servers, or updates to low-level     
  178    client libraries.  An A-label is recognizable from the prefix "xn--"    
  179    before the characters produced by the Punycode algorithm [RFC3492];     
  180    thus, a user application can identify an A-label and convert it into    
  181    Unicode (or some local coded character set) for display.                
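
The prefix-plus-Punycode structure described above can be sketched in
Python.  This is a minimal illustration, not a conforming IDNA2008
implementation: it uses the standard library's "punycode" codec and
performs none of the validity checks required by the Protocol and
Tables documents.  The function names to_a_label and to_u_label are
hypothetical and chosen only for this example.

```python
ACE_PREFIX = "xn--"

def to_a_label(u_label: str) -> str:
    """Encode a Unicode label in ACE form (no IDNA2008 validation)."""
    if u_label.isascii():
        return u_label  # nothing to encode for an all-ASCII label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(label: str) -> str:
    """Decode a label for display if it carries the ACE prefix."""
    if label[:4].lower() != ACE_PREFIX:
        return label  # not an ACE label; display as-is
    return label[4:].encode("ascii").decode("punycode")

# "b\u00fccher" is the German word for "books" with U+00FC.
print(to_a_label("b\u00fccher"))
print(to_u_label("xn--bcher-kva"))
```

Because the result contains only LDH characters, it can be passed to
an unmodified resolver library like any other hostname label.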
  182                                                                            
  183    On the registry side, IDNA allows a registry to offer                   
  184    Internationalized Domain Names (IDNs) for registration as A-labels.     
  185    A registry may offer any subset of valid IDNs, and may apply any        
  186    restrictions or bundling (grouping of similar labels together in one    
  187    registration) appropriate for the context of that registry.             
  188    Registration of labels is sometimes discussed separately from lookup,   
  189    and it is subject to a few specific requirements that do not apply to   
  190    lookup.                                                                 
  191                                                                            
  192    DNS clients and registries are subject to some differences in           
  193    requirements for handling IDNs.  In particular, registries are urged    
  194    to register only exact, valid A-labels, while clients might do some     
  195    mapping to get from otherwise-invalid user input to a valid A-label.    
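
As an illustration of that difference, the sketch below shows one
mapping a client might choose to apply before lookup, together with
the stricter check a registry might apply.  The choice of mapping
(lowercasing followed by NFC normalization) is an assumption made for
this example; IDNA2008 does not mandate it, and the function names are
hypothetical.

```python
import unicodedata

def map_user_input(label: str) -> str:
    """Client side: one possible local mapping applied before lookup."""
    return unicodedata.normalize("NFC", label.lower())

def is_exact_form(label: str) -> bool:
    """Registry side: accept only labels that need no mapping at all."""
    return map_user_input(label) == label

# A client may accept uppercase or decomposed input from the user ...
print(map_user_input("B\u00dcCHER") == "b\u00fccher")
# ... whereas a registry would decline the unmapped form.
print(is_exact_form("B\u00dcCHER"))
```

Passing this check is necessary but not sufficient for registration;
the full validity rules are given in the other IDNA2008 documents.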
  196                                                                            
  197    The first version of IDNA was published in 2003 and is referred to      
  198    here as IDNA2003 to contrast it with the current version, which is      
  199    known as IDNA2008 (after the year in which IETF work started on it).    
  200    IDNA2003 consists of four documents: the IDNA base specification        
  201    [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep       
  202    [RFC3454].  The current set of documents, IDNA2008, is not dependent    
  203    on any of the IDNA2003 specifications other than the one for Punycode   
  204    encoding.  References to "IDNA2008", "these specifications", or         
  205    "these documents" are to the entire IDNA2008 set listed in a separate   
  206    Definitions document [RFC5890].  The characters that are valid in       
  207    A-labels are identified from rules listed in the Tables document        
  208    [RFC5892], but validity can be derived from the Unicode properties of   
  209    those characters with very few exceptions.
  210                                                                            
  211    Traditionally, DNS labels are matched case-insensitively (as            
  212    described in the DNS specifications [RFC1034][RFC1035]).  That          
  213    convention was preserved in IDNA2003 by a case-folding operation that   
  221    generally maps capital letters into lowercase ones.  However, if case   
  222    rules are enforced from one language, another language sometimes        
  223    loses the ability to treat two characters separately.  Case-            
  224    insensitivity is treated slightly differently in IDNA2008.              
  225                                                                            
  226    IDNA2003 used Unicode version 3.2 only.  In order to keep up with new   
  227    characters added in new versions of Unicode, IDNA2008 decouples its     
  228    rules from any particular version of Unicode.  Instead, the             
  229    attributes of new characters in Unicode, supplemented by a small        
  230    number of exception cases, determine how and whether the characters     
  231    can be used in IDNA labels.                                             
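
To show what deriving a decision from character attributes looks
like, the sketch below tests one of the building-block categories
used in the Tables document [RFC5892], the set of letters, digits,
and marks selected by Unicode General_Category.  It is only a
fragment: the full derivation adds further categories, exceptions,
and contextual rules, and the answer for a given code point depends
on the Unicode version shipped with the Python interpreter in use.
The function name is hypothetical.

```python
import unicodedata

# General_Category values that make up the letter/digit/mark set.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def is_letter_digit(ch: str) -> bool:
    """True if the character's General_Category is a letter, digit, or mark."""
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES

for ch in ("a", "5", "\u00fc", "$", "-"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter_digit(ch))
```

Because the test consults the property database rather than a fixed
list of code points, a character added in a later Unicode version is
classified without any change to the code.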
  232                                                                            
  233    This document provides informational context for IDNA2008, including    
  234    terminology, background, and policy discussions.  It contains no        
  235    normative material; specifications for conformance to the IDNA2008      
  236    protocols appear entirely in the other documents in the series.
  237                                                                            
  238 1.2.  Terminology                                                          
  239                                                                            
  240    Terminology for IDNA2008 appears in the Definitions document            
  241    [RFC5890].  That document also contains a road map to the IDNA2008      
  242    document collection.  No attempt should be made to understand this      
  243    document without the definitions and concepts that appear there.        
  244                                                                            
  245 1.2.1.  DNS "Name" Terminology                                             
  246                                                                            
  247    In the context of IDNs, the DNS term "name" has introduced some         
  248    confusion as people speak of DNS labels in terms of the words or        
  249    phrases of various natural languages.  Historically, many of the        
  250    "names" in the DNS have been mnemonics to identify some particular      
  251    concept, object, or organization.  They are typically rooted in some    
  252    language because most people think in language-based ways.  But,        
  253    because they are mnemonics, they need not obey the orthographic         
  254    conventions of any language: it is not a requirement that it be         
  255    possible for them to be "words".                                        
  256                                                                            
  257    This distinction is important because the reasonable goal of an IDN     
  258    effort is not to be able to write the great Klingon (or language of     
  259    one's choice) novel in DNS labels but to be able to form a usefully     
  260    broad range of mnemonics in ways that are as natural as possible in a   
  261    very broad range of scripts.                                            
  262                                                                            
  276 1.2.2.  New Terminology and Restrictions                                   
  277                                                                            
  278    IDNA2008 introduces new terminology.  Precise definitions are           
  279    provided in the Definitions document for the terms U-label, A-label,
  280    LDH label (to which all valid pre-IDNA hostnames conformed), Reserved   
  281    LDH label (R-LDH label), XN-label, Fake A-label, and Non-Reserved LDH   
  282    label (NR-LDH label).                                                   
  283                                                                            
  284    In addition, the term "putative label" has been adopted to refer to a   
  285    label that may appear to meet certain definitional constraints but      
  286    has not yet been sufficiently tested for validity.                      
  287                                                                            
  288    These definitions are also illustrated in Figure 1 of the Definitions   
  289    document.  R-LDH labels contain "--" in the third and fourth            
  290    character positions from the beginning of the label.  In IDNA-aware     
  291    applications, only a subset of these reserved labels is permitted to    
  292    be used, namely the A-label subset.  A-labels are a subset of the       
  293    R-LDH labels that begin with the case-insensitive string "xn--".        
  294    Labels that bear this prefix but that are not otherwise valid fall      
  295    into the "Fake A-label" category.  The Non-Reserved labels (NR-LDH      
  296    labels) are implicitly valid since they do not bear any resemblance     
  297    to the labels specified by IDNA.                                        
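
The syntactic part of this classification can be sketched as follows.
The sketch assumes the usual LDH constraints (ASCII letters, digits,
and hyphen, no leading or trailing hyphen, at most 63 octets) and
stops at the XN-label level: telling an A-label from a Fake A-label
requires the full validity tests of IDNA2008, which are not attempted
here.  The function name and the returned strings are hypothetical.

```python
import re

_LDH = re.compile(r"[A-Za-z0-9-]{1,63}")

def classify_ldh(label: str) -> str:
    """Classify an ASCII label by its syntax alone."""
    if (not _LDH.fullmatch(label)
            or label.startswith("-") or label.endswith("-")):
        return "not an LDH label"
    if label[2:4] != "--":
        return "NR-LDH label"
    if label[:4].lower() == "xn--":
        # Further validation decides between A-label and Fake A-label.
        return "XN-label"
    return "R-LDH label (reserved, not an XN-label)"

for candidate in ("example", "xn--bcher-kva", "ab--cd", "-abc"):
    print(candidate, "->", classify_ldh(candidate))
```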
  298                                                                            
  299    The creation of the Reserved-LDH category is required for three         
  300    reasons:                                                                
  301                                                                            
  302    o  to prevent confusion with pre-IDNA coding forms;                     
  303                                                                            
  304    o  to permit future extensions that would require changing the          
  305       prefix, no matter how unlikely those might be (see Section 7.4);     
  306       and                                                                  
  307                                                                            
  308    o  to reduce the opportunities for attacks via the Punycode encoding    
  309       algorithm itself.                                                    
  310                                                                            
  311    As with other documents in the IDNA2008 set, this document uses the     
  312    term "registry" to describe any zone in the DNS.  That term, and the    
  313    terms "zone" or "zone administration", are interchangeable.             
  314                                                                            
  315 1.3.  Objectives                                                           
  316                                                                            
  317    These are the main objectives in revising IDNA.                         
  318                                                                            
  319    o  Use a more recent version of Unicode and allow IDNA to be            
  320       independent of Unicode versions, so that IDNA2008 need not be        
  321       updated for implementations to adopt code points from new Unicode    
  322       versions.                                                            
  323                                                                            
  331    o  Fix a very small number of code point categorizations that have      
  332       turned out to cause problems in the communities that use those       
  333       code points.                                                         
  334                                                                            
  335    o  Reduce the dependency on mapping, in favor of valid A-labels.        
  336       This will result in pre-mapped forms that are not valid IDNA         
  337       labels appearing less often in various contexts.                     
  338                                                                            
  339    o  Fix some details in the bidirectional code point handling            
  340       algorithms.                                                          
  341                                                                            
  342 1.4.  Applicability and Function of IDNA                                   
  343                                                                            
  344    The IDNA specification solves the problem of extending the repertoire   
  345    of characters that can be used in domain names to include a large       
  346    subset of the Unicode repertoire.                                       
  347                                                                            
  348    IDNA does not extend DNS.  Instead, the applications (and, by           
  349    implication, the users) continue to see an exact-match lookup           
  350    service.  Either there is a single name that matches exactly (subject   
  351    to the base DNS requirement of case-insensitive ASCII matching) or      
  352    there is no match.  This model has served the existing applications     
  353    well, but it requires, with or without internationalized domain         
  354    names, that users know the exact spelling of the domain names that      
  355    are to be typed into applications such as web browsers and mail user    
  356    agents.  The introduction of the larger repertoire of characters        
  357    potentially makes the set of misspellings larger, especially given      
  358    that in some cases the same appearance, for example on a business       
  359    card, might visually match several Unicode code points or several       
  360    sequences of code points.                                               
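
The exact-match model can be made concrete with a short sketch.  It
compares two labels the way the base DNS does, folding only the ASCII
letters A-Z to lowercase and otherwise requiring identical
characters.  The function name is hypothetical, and the sketch
operates on labels as they would appear in a query, that is, on LDH
labels and A-labels.

```python
# Fold only ASCII uppercase letters; every other character is compared as-is.
_ASCII_FOLD = {ord(c): ord(c.lower()) for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}

def dns_labels_match(a: str, b: str) -> bool:
    """Exact match, subject to case-insensitive matching of ASCII letters."""
    return a.translate(_ASCII_FOLD) == b.translate(_ASCII_FOLD)

print(dns_labels_match("Example", "eXAMPLE"))
print(dns_labels_match("xn--bcher-kva", "XN--BCHER-KVA"))
print(dns_labels_match("example", "examp1e"))
```

There is no notion of a near match: a label that differs in a single
character, however similar it may look, is simply a different label.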
  361                                                                            
  362    The IDNA standard does not require any applications to conform to it,   
  363    nor does it retroactively change those applications.  An application    
  364    can elect to use IDNA in order to support IDNs while maintaining        
  365    interoperability with existing infrastructure.  For applications that   
  366    want to use non-ASCII characters in public DNS domain names, IDNA is    
  367    the only option that is defined at the time this specification is       
  368    published.  Adding IDNA support to an existing application entails      
  369    changes to the application only, and leaves room for flexibility in     
  370    front-end processing and more specifically in the user interface (see   
  371    Section 6).                                                             
  372                                                                            
  373    A great deal of the discussion of IDN solutions has focused on          
  374    transition issues and how IDNs will work in a world where not all of    
  375    the components have been updated.  Proposals that were not chosen by    
  376    the original IDN Working Group would have depended on updating user     
  377    applications, DNS resolvers, and DNS servers in order for a user to     
  378    apply an internationalized domain name in any form or coding            
  386    acceptable under that method.  While processing must be performed       
  387    prior to or after access to the DNS, IDNA requires no changes to the    
  388    DNS protocol, any DNS servers, or the resolvers on users' computers.    
  389                                                                            
  390    IDNA allows the graceful introduction of IDNs not only by avoiding      
  391    upgrades to existing infrastructure (such as DNS servers and mail       
  392    transport agents), but also by allowing some limited use of IDNs in     
  393    applications by using the ASCII-encoded representation of the labels    
  394    containing non-ASCII characters.  While such names are user-            
  395    unfriendly to read and type, and hence not optimal for user input,      
  396    they can be used as a last resort to allow rudimentary IDN usage.       
  397    For example, they might be the best choice for display if it were       
  398    known that relevant fonts were not available on the user's computer.    
  399    In order to allow user-friendly input and output of the IDNs and        
  400    acceptance of some characters as equivalent to those to be processed    
  401    according to the protocol, the applications need to be modified to      
  402    conform to this specification.                                          
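
   As an illustration of the ASCII-encoded representation mentioned
   above, the following minimal Python sketch converts a single label
   to and from its "xn--" form using the standard library's Punycode
   codec.  The function names are hypothetical, and the sketch
   deliberately performs none of the validity checks that the Protocol
   document requires, so its output is not guaranteed to be a valid
   A-label.

```python
ACE_PREFIX = "xn--"  # prefix that marks an ASCII-encoded IDN label


def to_a_label(u_label: str) -> str:
    """Return the ASCII-encoded form of a label (no IDNA validity checks)."""
    if u_label.isascii():
        return u_label  # plain ASCII labels are passed through unchanged
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Return the Unicode form of an "xn--" label (no IDNA validity checks)."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label  # not an ASCII-encoded IDN label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


print(to_a_label("bücher"))         # xn--bcher-kva
print(to_u_label("xn--bcher-kva"))  # bücher
```

   An application that cannot render the native characters could fall
   back to displaying the string produced by to_a_label().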
  403                                                                            
  404    This version of IDNA uses the Unicode character repertoire for          
  405    continuity with the original version of IDNA.                           
  406                                                                            
  407 1.5.  Comprehensibility of IDNA Mechanisms and Processing                  
  408                                                                            
  409    One goal of IDNA2008, which is aided by the main goal of reducing the   
  410    dependency on mapping, is to improve the general understanding of how   
  411    IDNA works and what characters are permitted and what happens to        
  412    them.  Comprehensibility and predictability to users and registrants    
  413    are important design goals for this effort.  End-user applications      
  414    have an important role to play in increasing this comprehensibility.    
  415                                                                            
  416    Any system that tries to handle international characters encounters     
  417    some common problems.  For example, a User Interface (UI) cannot        
  418    display a character if no font containing that character is             
  419    available.  In some cases, internationalization enables effective       
  420    localization while maintaining some global uniformity but losing some   
  421    universality.                                                           
  422                                                                            
  423    It is difficult to even make suggestions as to how end-user             
  424    applications should cope when characters and fonts are not available.   
  425    Because display functions are rarely controlled by the types of         
  426    applications that would call upon IDNA, such suggestions will rarely    
  427    be very effective.                                                      
  428                                                                            
  429    Conversion between local character sets and normalized Unicode, if      
  430    needed, is part of this set of user interface issues.  Those            
  431    conversions introduce complexity in a system that does not use          
  432    Unicode as its primary (or only) internal character coding system.      
  433    If a label is converted to a local character set that does not have     
  441    all the needed characters, or that uses different character-coding      
  442    principles, the user interface program may have to add special logic    
  443    to avoid or reduce loss of information.                                 
  444                                                                            
  445    The major difficulty may lie in accurately identifying the incoming     
  446    character set and applying the correct conversion routine.  Even more   
  447    difficult, the local character coding system could be based on          
  448    conceptually different assumptions than those used by Unicode (e.g.,    
  449    choice of font encodings used for publications in some Indic            
  450    scripts).  Those differences may not easily yield unambiguous           
  451    conversions or interpretations even if each coding system is            
  452    internally consistent and adequate to represent the local language      
  453    and script.                                                             
  454                                                                            
  455    IDNA2008 shifts responsibility for character mapping and other          
  456    adjustments from the protocol (where it was located in IDNA2003) to     
  457    pre-processing before invoking IDNA itself.  The intent is that this    
  458    change will lead to greater usage of fully-valid A-labels or U-labels
  459    in display, transit, and storage, which should aid comprehensibility    
  460    and predictability.  A careful look at pre-processing raises issues     
  461    about what that pre-processing should do and at what point              
  462    pre-processing becomes harmful; how universally consistent              
  463    pre-processing algorithms can be; and how to be compatible with         
  464    labels prepared in an IDNA2003 context.  Those issues are discussed     
  465    in Section 6 and in the Mapping document [IDNA2008-Mapping].            
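
   To make the idea of pre-processing concrete, the sketch below shows
   one possible local mapping step applied to user input before IDNA
   itself is invoked: lowercasing followed by Normalization Form C.
   The function name and the choice of steps are assumptions made for
   illustration; IDNA2008 does not require this particular mapping,
   and an application may reasonably choose a different one.

```python
import unicodedata


def premap(user_input: str) -> str:
    """Hypothetical pre-processing: lowercase, then normalize to NFC."""
    lowered = user_input.lower()
    return unicodedata.normalize("NFC", lowered)


# Both spellings of the input end up as the same precomposed string.
print(premap("BÜCHER") == premap("Bu\u0308cher"))  # True
```

   The result of such a step is only a candidate label; it still has
   to pass the checks defined by the protocol.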
  466                                                                            
  467 2.  Processing in IDNA2008                                                 
  468                                                                            
  469    IDNA2008 separates Domain Name Registration and Lookup in the           
  470    protocol specification (RFC 5891, Sections 4 and 5 [RFC5891]).          
  471    Although most steps in the two processes are similar, the separation    
  472    reflects current practice in which per-registry (DNS zone)              
  473    restrictions and special processing are applied at registration time    
  474    but not during lookup.  Another significant benefit is that             
  475    separation facilitates incremental addition of permitted character      
  476    groups to avoid freezing on one particular version of Unicode.          
  477                                                                            
  478    The actual registration and lookup protocols for IDNA2008 are           
  479    specified in the Protocol document.                                     
  480                                                                            
  481 3.  Permitted Characters: An Inclusion List                                
  482                                                                            
  483    IDNA2008 adopts the inclusion model.  A code point is assumed to be     
  484    invalid for IDN use unless it is included as part of a Unicode          
  485    property-based rule or, in rare cases, included individually by an      
  486    exception.  When an implementation moves to a new version of Unicode,   
  487    the rules may indicate new valid code points.                           
  488                                                                            
  496    This section provides an overview of the model used to establish the    
  497    algorithm and character lists of the Tables document [RFC5892] and      
  498    describes the names and applicability of the categories used there.     
  499    Note that the inclusion of a character in the PROTOCOL-VALID category   
  500    group (Section 3.1.1) does not imply that it can be used                
  501    indiscriminately; some characters are associated with contextual        
  502    rules that must be applied as well.                                     
  503                                                                            
  504    The information given in this section is provided to make the rules,    
  505    tables, and protocol easier to understand.  The normative generating    
  506    rules that correspond to this informal discussion appear in the         
  507    Tables document, and the rules that actually determine what labels      
  508    can be registered or looked up are in the Protocol document.            
  509                                                                            
  510 3.1.  A Tiered Model of Permitted Characters and Labels                    
  511                                                                            
  512    Moving to an inclusion model involves a new specification for the       
  513    list of characters that are permitted in IDNs.  In IDNA2003,            
  514    character validity is independent of context and fixed forever (or      
  515    until the standard is replaced).  However, globally context-            
  516    independent rules have proved to be impractical because some            
  517    characters, especially those that are called "Join_Controls" in         
  518    Unicode, are needed to make reasonable use of some scripts but have     
  519    no visible effect in others.  IDNA2003 prohibited those types of        
  520    characters entirely by discarding them.  We now have a consensus that   
  521    under some conditions, these "joiner" characters are legitimately       
  522    needed to allow useful mnemonics for some languages and scripts.  In    
  523    general, context-dependent rules help deal with characters (generally   
  524    characters that would otherwise be prohibited entirely) that are used   
  525    differently or perceived differently across different scripts, and      
  526    allow the standard to be applied more appropriately in cases where a    
  527    string is not universally handled the same way.                         
  528                                                                            
  529    IDNA2008 divides all possible Unicode code points into four             
  530    categories: PROTOCOL-VALID, CONTEXTUAL RULE REQUIRED, DISALLOWED, and   
  531    UNASSIGNED.                                                             
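
   The following Python sketch gives a rough, non-normative feel for
   how a code point could be sorted into these categories from Unicode
   properties.  It is a simplification: it omits the exception list
   and several rules of the Tables document (so, for example, a
   character that the tables include individually is misclassified
   here), and its results depend on the Unicode version shipped with
   the Python interpreter.  All names in it are hypothetical.

```python
import unicodedata
from enum import Enum


class Category(Enum):
    PVALID = "PROTOCOL-VALID"
    CONTEXT = "CONTEXTUAL RULE REQUIRED"
    DISALLOWED = "DISALLOWED"
    UNASSIGNED = "UNASSIGNED"


LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")
JOIN_CONTROLS = {0x200C, 0x200D}  # ZWNJ and ZWJ
LETTER_DIGIT_MARK = {"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"}


def is_noncharacter(cp: int) -> bool:
    """Noncharacters are assigned a meaning, so they are not UNASSIGNED."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unstable(ch: str) -> bool:
    """True if NFKC normalization plus case folding changes the character."""
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded) != ch


def classify(ch: str) -> Category:
    """Simplified classification of a single code point (assumptions above)."""
    cp = ord(ch)
    if unicodedata.category(ch) == "Cn" and not is_noncharacter(cp):
        return Category.UNASSIGNED
    if ch in LDH:
        return Category.PVALID
    if cp in JOIN_CONTROLS:
        return Category.CONTEXT
    if is_unstable(ch):
        return Category.DISALLOWED
    if unicodedata.category(ch) in LETTER_DIGIT_MARK:
        return Category.PVALID
    return Category.DISALLOWED  # inclusion model: anything else is excluded


for ch in ("a", "A", "\u200d", "\u2044"):
    print(f"U+{ord(ch):04X}", classify(ch).value)
```

   Note that the final "return" is what makes this an inclusion model:
   a code point is excluded unless some earlier rule admits it.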
  532                                                                            
  533 3.1.1.  PROTOCOL-VALID                                                     
  534                                                                            
  535    Characters identified as PROTOCOL-VALID (often abbreviated PVALID)      
  536    are permitted in IDNs.  Their use may be restricted by rules about      
  537    the context in which they appear or by other rules that apply to the    
  538    entire label in which they are to be embedded.  For example, any        
  539    label that contains a character in this category that has a             
  540    "right-to-left" property must be used in context with the Bidi rules    
  541    [RFC5893].  The term PROTOCOL-VALID is used to stress the fact that     
  542    the presence of a character in this category does not imply that a      
  543    given registry need accept registrations containing any of the          
  551    characters in the category.  Registries are still expected to apply     
  552    judgment about labels they will accept and to maintain rules            
  553    consistent with those judgments (see the Protocol document [RFC5891]    
  554    and Section 3.3).                                                       
  555                                                                            
  556    Characters that are placed in the PROTOCOL-VALID category are           
  557    expected to never be removed from it or reclassified.  While            
  558    theoretically characters could be removed from Unicode, such removal    
  559    would be inconsistent with the Unicode stability principles (see        
  560    UTR 39: Unicode Security Mechanisms [Unicode52], Appendix F) and        
  561    hence should never occur.                                               
  562                                                                            
  563 3.1.2.  CONTEXTUAL RULE REQUIRED                                           
  564                                                                            
  565    Some characters may be unsuitable for general use in IDNs but           
  566    necessary for the plausible support of some scripts.  The two most      
  567    commonly cited examples are the ZERO WIDTH JOINER and ZERO WIDTH        
  568    NON-JOINER characters (ZWJ, U+200D and ZWNJ, U+200C), but other         
  569    characters may require special treatment because they would otherwise   
  570    be DISALLOWED (typically because Unicode considers them punctuation     
  571    or special symbols) but need to be permitted in limited contexts.       
  572    Other characters are given this special treatment because they pose     
  573    exceptional danger of being used to produce misleading labels or to     
  574    cause unacceptable ambiguity in label matching and interpretation.      
  575                                                                            
  576 3.1.2.1.  Contextual Restrictions                                          
  577                                                                            
  578    Characters with contextual restrictions are identified as CONTEXTUAL    
  579    RULE REQUIRED and are associated with a rule.  The rule defines         
  580    whether the character is valid in a particular string, and also         
  581    whether the rule itself is to be applied on lookup as well as           
  582    registration.                                                           
  583                                                                            
  584    A distinction is made between characters that indicate or prohibit      
  585    joining and ones similar to them (known as CONTEXT-JOINER or            
  586    CONTEXTJ) and other characters requiring contextual treatment           
  587    (CONTEXT-OTHER or CONTEXTO).  Only the former require full testing at   
  588    lookup time.                                                            
  589                                                                            
  590    It is important to note that these contextual rules cannot prevent      
  591    all uses of the relevant characters that might be confusing or          
  592    problematic.  What they are expected to do is to confine                
  593    applicability of the characters to scripts (and narrower contexts)      
  594    where zone administrators are knowledgeable enough about the use of     
  595    those characters to be prepared to deal with them appropriately.        
  596                                                                            
  606    For example, a registry dealing with an Indic script that requires      
  607    ZWJ and/or ZWNJ as part of the writing system is expected to            
  608    understand where the characters have visible effect and where they do   
  609    not and to make registration rules accordingly.  By contrast, a         
  610    registry dealing primarily with Latin or Cyrillic script might not be   
  611    actively aware that the characters exist, much less of the
  612    consequences of embedding them in labels drawn from those scripts and   
  613    therefore should avoid accepting registrations containing those         
  614    characters, at least in labels using characters from the Latin or       
  615    Cyrillic scripts.                                                       
  616                                                                            
  617 3.1.2.2.  Rules and Their Application                                      
  618                                                                            
  619    Rules have descriptions such as "Must follow a character from Script    
  620    XYZ", "Must occur only if the entire label is in Script ABC", or        
  621    "Must occur only if the previous and subsequent characters have the     
  622    DFG property".  The actual rules may be DEFINED or NULL.  If present,   
  623    they may have values of "True" (character may be used in any position   
  624    in any label), "False" (character may not be used in any label), or     
  625    may be a set of procedural rules that specify the context in which      
  626    the character is permitted.                                             
  627                                                                            
  628    Because it is easier to identify these characters than to know that     
  629    they are actually needed in IDNs or how to establish exactly the        
  630    right rules for each one, a rule may have a null value in a given       
  631    version of the tables.  Characters associated with null rules are not   
  632    permitted to appear in putative labels for either registration or       
  633    lookup.  Of course, a later version of the tables might contain a       
  634    non-null rule.                                                          
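
   A minimal sketch of how an implementation might represent and apply
   such rules is shown below.  The rule for ZERO WIDTH JOINER is
   modeled on the description in the Tables document (the preceding
   character must be a virama); the table layout, the function names,
   and the code point given a null rule are assumptions made purely
   for illustration.

```python
import unicodedata

VIRAMA_CCC = 9  # Canonical_Combining_Class value that denotes a virama


def zwj_rule(label: str, i: int) -> bool:
    """Procedural rule: the code point before position i must be a virama."""
    return i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC


# Placeholder code point used only to illustrate a null rule; it is not
# taken from the actual tables.
EXAMPLE_NULL_RULE_CP = 0xE000

# Code point -> rule.  None stands for a null rule.
CONTEXT_RULES = {
    0x200D: zwj_rule,
    EXAMPLE_NULL_RULE_CP: None,
}


def context_ok(label: str) -> bool:
    """Check every character that has a contextual rule in CONTEXT_RULES."""
    for i, ch in enumerate(label):
        cp = ord(ch)
        if cp not in CONTEXT_RULES:
            continue  # no contextual rule applies to this character
        rule = CONTEXT_RULES[cp]
        if rule is None or not rule(label, i):
            return False  # null rule, or the context test failed
    return True


print(context_ok("\u0915\u094d\u200d\u0937"))  # True: ZWJ follows a virama
print(context_ok("a\u200db"))                  # False: no virama before ZWJ
```

   A character with a null rule is rejected in every position, which
   matches the statement above that such characters are not permitted
   in putative labels.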
  635                                                                            
  636    The actual rules and their descriptions are in Sections 2 and 3 of      
  637    the Tables document [RFC5892].  That document also specifies the        
  638    creation of a registry for future rules.                                
  639                                                                            
  640 3.1.3.  DISALLOWED                                                         
  641                                                                            
  642    Some characters are inappropriate for use in IDNs and are thus          
  643    excluded for both registration and lookup (i.e., IDNA-conforming        
  644    applications performing name lookup should verify that these            
  645    characters are absent; if they are present, the label strings should    
  646    be rejected rather than converted to A-labels and looked up).  Some of
  647    these characters are problematic for use in IDNs (such as the           
  648    FRACTION SLASH character, U+2044), while some of them (such as the      
  649    various HEART symbols, e.g., U+2665, U+2661, and U+2765, see            
  650    Section 7.6) simply fall outside the conventions for typical            
  651    identifiers (basically letters and numbers).                            
  652                                                                            
  661    Of course, this category would include code points that had been        
  662    removed entirely from Unicode should such removals ever occur.          
  663                                                                            
  664    Characters that are placed in the DISALLOWED category are expected to   
  665    never be removed from it or reclassified.  If a character is            
  666    classified as DISALLOWED in error and the error is sufficiently         
  667    problematic, the only recourse would be either to introduce a new       
  668    code point into Unicode and classify it as PROTOCOL-VALID or for the    
  669    IETF to accept the considerable costs of an incompatible change and     
  670    replace the relevant RFC with one containing appropriate exceptions.    
  671                                                                            
  672    There is provision for exception cases but, in general, characters      
  673    are placed into DISALLOWED if they fall into one or more of the         
  674    following groups:                                                       
  675                                                                            
  676    o  The character is a compatibility equivalent for another character.   
  677       In slightly more precise Unicode terms, application of               
  678       Normalization Form KC (NFKC) to the character yields some other      
  679       character.                                                           
  680                                                                            
  681    o  The character is an uppercase form or some other form that is        
  682       mapped to another character by Unicode case folding.                 
  683                                                                            
  684    o  The character is a symbol or punctuation form or, more generally,    
  685       something that is not a letter, digit, or a mark that is used to     
  686       form a letter or digit.                                              
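
   The three groups above can be approximated with tests on Unicode
   properties.  The sketch below is informal and hypothetical (the
   function name and the reason strings are not from any
   specification), ignores the exception cases, and reports which of
   the groups a single character appears to fall into according to the
   Unicode data shipped with the Python interpreter.

```python
import unicodedata

LETTER_DIGIT_MARK = {"Ll", "Lu", "Lt", "Lo", "Lm", "Mn", "Mc", "Nd"}


def disallowed_reasons(ch: str) -> list:
    """Return the groups (as text) that would place ch in DISALLOWED."""
    reasons = []
    if unicodedata.normalize("NFKC", ch) != ch:
        reasons.append("compatibility equivalent (changed by NFKC)")
    if ch.casefold() != ch:
        reasons.append("changed by case folding")
    if unicodedata.category(ch) not in LETTER_DIGIT_MARK:
        reasons.append("not a letter, digit, or mark")
    return reasons


for ch in ("a", "A", "\uff41", "\u2665"):
    print(f"U+{ord(ch):04X}", disallowed_reasons(ch))
```

   An empty list means only that none of these three tests excludes
   the character; other rules in the Tables document may still apply.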
  687                                                                            
  688 3.1.4.  UNASSIGNED                                                         
  689                                                                            
  690    For convenience in processing and table-building, code points that do   
  691    not have assigned values in a given version of Unicode are treated as   
  692    belonging to a special UNASSIGNED category.  Such code points are       
  693    prohibited in labels to be registered or looked up.  The category       
  694    differs from DISALLOWED in that code points are moved out of it by      
  695    the simple expedient of being assigned in a later version of Unicode    
  696    (at which point, they are classified into one of the other categories   
  697    as appropriate).                                                        
  698                                                                            
  699    The rationale for restricting the processing of UNASSIGNED characters   
  700    is simply that the properties of such code points cannot be             
  701    completely known until actual characters are assigned to them.  For     
  702    example, assume that an UNASSIGNED code point were included in a        
  703    label to be looked up.  Assume that the code point was later assigned   
  704    to a character that required some set of contextual rules.  With that   
  705    combination, un-updated instances of IDNA-aware software might permit   
  706    lookup of labels containing the previously unassigned characters        
  707    while updated versions of the software might restrict use of the same   
  716    label in lookup, depending on the contextual rules.  It should be       
  717    clear that under no circumstance should an UNASSIGNED character be      
  718    permitted in a label to be registered as part of a domain name.         
  719                                                                            
  720 3.2.  Registration Policy                                                  
  721                                                                            
  722    While these recommendations cannot and should not define registry       
  723    policies, registries should develop and apply additional restrictions   
  724    as needed to reduce confusion and other problems.  For example, it is   
  725    generally believed that labels containing characters from more than     
  726    one script are a bad practice although there may be some important      
  727    exceptions to that principle.  Some registries may choose to restrict   
  728    registrations to characters drawn from a very small number of           
  729    scripts.  For many scripts, the use of variant techniques such as       
  730    those described in the JET specification for the CJK script
  731    [RFC3743] and its generalization [RFC4290], and illustrated for         
  732    Chinese by the tables provided by the Chinese Domain Name Consortium    
  733    [RFC4713] may be helpful in reducing problems that might be perceived   
  734    by users.                                                               
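
   As one hypothetical example of such an additional restriction, the
   sketch below rejects labels whose letters appear to come from more
   than one script.  The Python standard library does not expose the
   Unicode Script property, so the first word of each character name
   is used as a rough stand-in; that shortcut, like the function
   names, is an assumption of this sketch and not a recommended way to
   determine scripts.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the set of scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            # First word of the character name, e.g. "LATIN" or "CYRILLIC".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def passes_single_script_policy(label: str) -> bool:
    """Hypothetical registry policy: letters from at most one script."""
    return len(scripts_in(label)) <= 1


print(passes_single_script_policy("paypal"))       # True
print(passes_single_script_policy("p\u0430ypal"))  # False: Cyrillic "a"
```

   A policy this blunt would also reject legitimate labels in writing
   systems that routinely combine scripts, which is one reason the
   text above allows for exceptions.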
  735                                                                            
  736    In general, users will benefit if registries only permit characters     
  737    from scripts that are well-understood by the registry or its            
  738    advisers.  If a registry decides to reduce opportunities for            
  739    confusion by constructing policies that disallow characters used in     
  740    historic writing systems or characters whose use is restricted to       
  741    specialized, highly technical contexts, some relevant information may   
  742    be found in Section 2.4 (Specific Character Adjustments) of Unicode     
  743    Identifier and Pattern Syntax [Unicode-UAX31], especially Table 4       
  744    (Candidate Characters for Exclusion from Identifiers), and Section      
  745    3.1 (General Security Profile for Identifiers) in Unicode Security      
  746    Mechanisms [Unicode-UTS39].                                             
  747                                                                            
  748    The requirement (in Section 4.1 of the Protocol document [RFC5891])     
  749    that registration procedures use only U-labels and/or A-labels is       
  750    intended to ensure that registrants are fully aware of exactly what     
  751    is being registered as well as encouraging use of those canonical       
  752    forms.  That provision should not be interpreted as requiring
  753    registrants to provide characters in a particular code sequence.
  754    Registrant input conventions and management are part of registrant-     
  755    registrar interactions and relationships between registries and         
  756    registrars and are outside the scope of these standards.                
  757                                                                            
  758    It is worth stressing that these principles of policy development and   
  759    application apply at all levels of the DNS, not only, e.g., top level   
  760    domain (TLD) or second level domain (SLD) registrations.  Even a        
  761    trivial, "anything is permitted that is valid under the protocol"       
  762    policy is helpful in that it helps users and application developers     
  763    know what to expect.                                                    
  764                                                                            
  771 3.3.  Layered Restrictions: Tables, Context, Registration, and             
  772       Applications                                                         
  773                                                                            
  774    The character rules in IDNA2008 are based on the realization that       
  775    there is no single magic bullet for any of the security,                
  776    confusability, or other issues associated with IDNs.  Instead, the      
  777    specifications define a variety of approaches.  The character tables    
  778    are the first mechanism, protocol rules about how those characters      
  779    are applied or restricted in context are the second, and those two in   
  780    combination constitute the limits of what can be done in the            
  781    protocol.  As discussed in the previous section (Section 3.2),          
  782    registries are expected to restrict what they permit to be              
  783    registered, devising and using rules that are designed to optimize      
  784    the balance between confusion and risk on the one hand and maximum      
  785    expressiveness in mnemonics on the other.                               
  786                                                                            
  787    In addition, there is an important role for user interface programs     
  788    in warning against label forms that appear problematic given their      
  789    knowledge of local contexts and conventions.  Of course, no approach    
  790    based on naming or identifiers alone can protect against all threats.   
  791                                                                            
  792 4.  Application-Related Issues                                             
  793                                                                            
  794 4.1.  Display and Network Order                                            
  795                                                                            
  796    Domain names are always transmitted in network order (the order in      
  797    which the code points are sent in protocols), but they may have a       
  798    different display order (the order in which the code points are         
  799    displayed on a screen or paper).  When a domain name contains           
  800    characters that are normally written right to left, display order may   
  801    be affected although network order is not.  It gets even more           
  802    complicated if left-to-right and right-to-left labels are adjacent to   
  803    each other within a domain name.  The decision about the display        
  804    order is ultimately under the control of user agents -- including Web   
  805    browsers, mail clients, hosted Web applications and many more --        
  806    which may be highly localized.  Should a domain name abc.def, in        
  807    which both labels are represented in scripts that are written right     
  808    to left, be displayed as fed.cba or cba.fed?  Applications that are     
  809    in deployment today are already diverse, and one can find examples of   
  810    either choice.                                                          
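
         The two renderings can be made concrete with a minimal Python
         sketch.  It is illustrative only: the function name is
         hypothetical, ASCII letters stand in for right-to-left code
         points as they do in the text above, and no real bidirectional
         algorithm is applied.

```python
def display_orders(network_name: str) -> dict:
    """Return two plausible visual orders for an all-RTL domain name.

    `network_name` is in network order; each ASCII letter is a stand-in
    for one right-to-left code point (an assumption of this sketch).
    """
    labels = network_name.split(".")
    # Choice 1: reverse the characters inside each label, keep label order.
    per_label = ".".join(label[::-1] for label in labels)
    # Choice 2: reverse the whole string, which also reverses label order.
    whole_name = network_name[::-1]
    return {"per_label": per_label, "whole_name": whole_name}
```

         For "abc.def", per-label reversal yields "cba.fed" and
         whole-name reversal yields "fed.cba"; the network-order string
         is the same in both cases.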
  811                                                                            
  812    The picture changes once again when an IDN appears in an                
  813    Internationalized Resource Identifier (IRI) [RFC3987].  An IRI or       
  814    internationalized email address contains elements other than the        
  815    domain name.  For example, IRIs contain protocol identifiers and        
  816    field delimiter syntax such as "http://" or "mailto:" while email       
  817    addresses contain the "@" to separate local parts from domain names.    
  818                                                                            
  826    An IRI in network order begins with "http://" followed by domain        
  827    labels in network order, thus "http://abc.def".                         
  828                                                                            
  829    User interface programs are not required to display and allow input     
  830    of IRIs directly but often do so.  Implementers have to choose          
  831    whether the overall direction of these strings will always be left to   
  832    right (or right to left) for an IRI or email address.  The natural      
  833    order for a user typing a domain name on a right-to-left system is      
  834    fed.cba.  Should the right-to-left (RTL) user interface reverse the     
  835    entire domain name each time a domain name is typed?  Does this         
  836    change if the user types "http://" right before typing a domain name,   
  837    thus implying that the user is beginning at the beginning of the        
  838    network-order IRI?  Experience in the 1980s and 1990s with mixing       
  839    systems in which domain name labels were read in network order (left    
  840    to right) and those in which those labels were read right to left       
  841    would predict a great deal of confusion.                                
  842                                                                            
  843    If each implementation of each application makes its own decisions on   
  844    these issues, users will develop heuristics that will sometimes fail    
  845    when switching applications.  However, while some display order         
  846    conventions, voluntarily adopted, would be desirable to reduce          
  847    confusion, such suggestions are beyond the scope of these               
  848    specifications.                                                         
  849                                                                            
  850 4.2.  Entry and Display in Applications                                    
  851                                                                            
  852    Applications can accept and display domain names using any character    
  853    set or character coding system.  The IDNA protocol does not             
  854    necessarily affect the interface between users and applications.  An    
  855    IDNA-aware application can accept and display internationalized         
  856    domain names in two formats: as the internationalized character         
  857    set(s) supported by the application (i.e., an appropriate local         
  858    representation of a U-label) and as an A-label.  Applications may       
  859    allow the display of A-labels, but are encouraged not to do so except   
  860    as an interface for special purposes, possibly for debugging, or to     
  861    cope with display limitations.  In general, they should allow, but      
  862    not encourage, user input of A-labels.  A-labels are opaque and ugly,   
  863    and malicious variations on them are not easily detected by users.      
  864    Where possible, they should thus only be exposed when they are          
  865    absolutely needed.  Because IDN labels can be rendered either as        
  866    A-labels or U-labels, the application may reasonably have an option     
  867    for the user to select the preferred method of display.  Rendering      
  868    the U-label should normally be the default.                             
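
         As a hedged illustration of the two display forms, the
         following Python sketch converts a single label between a
         U-label and an A-label.  It assumes Python's built-in
         "punycode" codec, which performs only the Punycode encoding
         step; it does not apply any IDNA2008 validity checks.  The
         function names are hypothetical.

```python
ACE_PREFIX = "xn--"  # prefix that marks an A-label


def to_a_label(u_label: str) -> str:
    """Punycode-encode a label; all-ASCII labels are returned unchanged."""
    if u_label.isascii():
        return u_label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an A-label; labels without the prefix are returned unchanged."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def render(a_label: str, prefer_u_label: bool = True) -> str:
    """Pick the display form; the U-label is the default, as suggested above."""
    return to_u_label(a_label) if prefer_u_label else a_label
```

         With these definitions, the label "bücher" (written with
         U+00FC) corresponds to the A-label "xn--bcher-kva", and
         render() shows the U-label unless the caller asks otherwise.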
  869                                                                            
  870    Domain names are often stored and transported in many places.  For      
  871    example, they are part of documents such as mail messages and web       
  872    pages.  They are transported in many parts of many protocols, such as   
  873    both the control commands of SMTP and associated message body parts,    
  881    and in the headers and the body content in HTTP.  It is important to    
  882    remember that domain names appear both in domain name slots and in      
  883    the content that is passed over protocols, and it would be helpful if   
  884    protocols explicitly define what their domain name slots are.           
  885                                                                            
  886    In protocols and document formats that define how to handle             
  887    specification or negotiation of charsets, labels can be encoded in      
  888    any charset allowed by the protocol or document format.  If a           
  889    protocol or document format only allows one charset, the labels must    
  890    be given in that charset.  Of course, not all charsets can properly     
  891    represent all labels.  If a U-label cannot be displayed in its          
  892    entirety, the only choice (without loss of information) may be to       
  893    display the A-label.                                                    
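
         The fallback described above might be sketched as follows.
         This is an assumption-laden example, not a prescribed
         procedure: the function name is hypothetical, the charset is
         identified by a Python codec name, and the A-label is produced
         with Python's "punycode" codec without any IDNA2008 validation.

```python
def label_for_charset(u_label: str, charset: str) -> str:
    """Return the U-label if `charset` can represent it, else an A-label."""
    try:
        u_label.encode(charset)  # raises if any character is unrepresentable
        return u_label
    except UnicodeEncodeError:
        # Lossless fallback: the ASCII-compatible form of the same label.
        return "xn--" + u_label.encode("punycode").decode("ascii")
```

         For example, "bücher" is returned unchanged for "latin-1" but
         becomes "xn--bcher-kva" when the only charset available is
         "ascii".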
  894                                                                            
  895    Where a protocol or document format allows IDNs, labels should be in    
  896    whatever character encoding and escape mechanism the protocol or        
  897    document format uses in the local environment.  This provision is       
  898    intended to prevent situations in which, e.g., UTF-8 domain names       
  899    appear embedded in text that is otherwise in some other character       
  900    coding.                                                                 
  901                                                                            
  902    All protocols that use domain name slots (see Section 2.3.2.6 in the    
  903    Definitions document [RFC5890]) already have the capacity for           
  904    handling domain names in the ASCII charset.  Thus, A-labels can         
  905    inherently be handled by those protocols.                               
  906                                                                            
  907    IDNA2008 does not specify required mappings between one character or    
  908    code point and others.  An extended discussion of mapping issues        
  909    appears in Section 6 and specific recommendations appear in the         
  910    Mapping document [IDNA2008-Mapping].  In general, IDNA2008 prohibits    
  911    characters that would be mapped to others by normalization or other     
  912    rules.  As examples, while mathematical characters based on Latin       
  913    ones are accepted as input to IDNA2003, they are prohibited in          
  914    IDNA2008.  Similarly, uppercase characters, double-width characters,    
  915    and other variations are prohibited as IDNA input although mapping      
  916    them as needed in user interfaces is strongly encouraged.               
  917                                                                            
  918    Since the rules in the Tables document [RFC5892] have the effect that   
  919    only strings that are not transformed by NFKC are valid, if an          
  920    application chooses to perform NFKC normalization before lookup, that   
  921    operation is safe since this will never make the application unable     
  922    to look up any valid string.  However, as discussed above, the          
  923    application cannot guarantee that any other application will perform    
  924    that mapping, so it should be used only with caution and for informed   
  925    users.                                                                  
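
         The property relied on here, that a valid string is left
         unchanged by NFKC, can be checked with Python's standard
         "unicodedata" module.  The sketch below tests only that one
         necessary condition; it is not a validity test against the
         Tables document, and the function name is hypothetical.

```python
import unicodedata


def is_nfkc_stable(label: str) -> bool:
    """True if NFKC normalization leaves the label unchanged."""
    return unicodedata.normalize("NFKC", label) == label


def nfkc(label: str) -> str:
    """Apply NFKC, as a user interface might do before lookup."""
    return unicodedata.normalize("NFKC", label)
```

         A label such as "bücher" (with precomposed U+00FC) is stable,
         while a label containing FULLWIDTH LATIN SMALL LETTER A
         (U+FF41) or MATHEMATICAL BOLD SMALL A (U+1D41A) is not; NFKC
         maps both of those characters to "a".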
  926                                                                            
  936    In many cases, these prohibitions should have no effect on what the     
  937    user can type as input to the lookup process.  It is perfectly          
  938    reasonable for systems that support user interfaces to perform some     
  939    character mapping that is appropriate to the local environment.  This   
  940    would normally be done prior to actual invocation of IDNA.  At least    
  941    conceptually, the mapping would be part of the Unicode conversions      
  942    discussed above and in the Protocol document [RFC5891].  However,       
  943    those changes will be local ones only -- local to environments in       
  944    which users will clearly understand that the character forms are        
  945    equivalent.  For use in interchanges among systems, it appears to be    
  946    much more important that U-labels and A-labels can be mapped back and   
  947    forth without loss of information.                                      
  948                                                                            
  949    One specific, and very important, instance of this strategy arises      
  950    with case folding.  In the ASCII-only DNS, names are looked up and      
  951    matched in a case-independent way, but no actual case folding occurs.   
  952    Names can be placed in the DNS in either uppercase or lowercase form    
  953    (or any mixture of them) and that form is preserved, returned in        
  954    queries, and so on.  IDNA2003 approximated that behavior for            
  955    non-ASCII strings by performing case folding at registration time       
  956    (resulting in only lowercase IDNs in the DNS) and when names were       
  957    looked up.                                                              
  958                                                                            
  959    As suggested earlier in this section, it appears to be desirable to
  960    do as little character mapping as possible, consistent with Unicode
  961    working correctly, in order to make the mapping between A-labels and
  962    U-labels idempotent.  (Normalization Form C (NFC) mapping to resolve
  963    different codings for the same character is still necessary, although
  964    the specifications require that it be performed prior to invoking the
  965    protocol.)  Case mapping is not an exception to this principle.  If
  966    only lowercase characters can be registered in the DNS (i.e., be        
  967    present in a U-label), then IDNA2008 should prohibit uppercase          
  968    characters as input even though user interfaces to applications         
  969    should probably map those characters.  Some other considerations        
  970    reinforce this conclusion.  For example, in ASCII case mapping for      
  971    individual characters, uppercase(character) is always equal to          
  972    uppercase(lowercase(character)).  That may not be true with IDNs.  In   
  973    some scripts that use case distinctions, there are a few characters     
  974    that do not have counterparts in one case or the other.  The            
  975    relationship between uppercase and lowercase may even be language-      
  976    dependent, with different languages (or even the same language in       
  977    different areas) expecting different mappings.  User interface          
  978    programs can meet the expectations of users who are accustomed to the   
  979    case-insensitive DNS environment by performing case folding prior to    
  980    IDNA processing, but the IDNA procedures themselves should neither      
  981    require such mappings nor expect them when they are not natural to the
  982    localized environment.                                                  
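
         The case-mapping identity mentioned above can be tested
         directly.  The following sketch uses Python's built-in,
         locale-independent str.upper() and str.lower(), which follow
         the Unicode default case mappings; language-specific
         tailorings are outside its scope, and the function name is
         hypothetical.

```python
import string


def upper_roundtrips(ch: str) -> bool:
    """True if uppercase(ch) equals uppercase(lowercase(ch))."""
    return ch.upper() == ch.lower().upper()


# The identity holds for every ASCII letter.
ASCII_OK = all(upper_roundtrips(c) for c in string.ascii_letters)

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE is one counterexample:
# it lowercases to "i" + U+0307, which uppercases to "I" + U+0307.
DOTTED_I_OK = upper_roundtrips("\u0130")
```

         Under these mappings ASCII_OK is true and DOTTED_I_OK is
         false, which is one instance of the behavior that the
         paragraph above warns about.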
  983                                                                            
  991 4.3.  Linguistic Expectations: Ligatures, Digraphs, and Alternate          
  992       Character Forms                                                      
  993                                                                            
  994    Users have expectations about character matching or equivalence that    
  995    are based on their own languages and the orthography of those           
  996    languages.  These expectations may not always be met in a global        
  997    system, especially if multiple languages are written using the same     
  998    script but using different conventions.  Some examples:                 
  999                                                                            
 1000    o  A Norwegian user might expect a label with the ae-ligature to be     
 1001       treated as the same label as one using the Swedish spelling with     
 1002       a-diaeresis even though applying that mapping to English would be    
 1003       astonishing to users.                                                
 1004                                                                            
 1005    o  A German user might expect a label with an o-umlaut and a label      
 1006       that had "oe" substituted, but was otherwise the same, to be         
 1007       treated as equivalent even though that substitution would be a       
 1008       clear error in Swedish.                                              
 1009                                                                            
 1010    o  A Chinese user might expect automatic matching of Simplified and     
 1011       Traditional Chinese characters, but applying that matching for       
 1012       Korean or Japanese text would create considerable confusion.         
 1013                                                                            
 1014    o  An English user might expect "theater" and "theatre" to match.       
 1015                                                                            
 1016    A number of languages use alphabetic scripts in which single phonemes   
 1017    are written using two characters, termed a "digraph", for example,      
 1018    the "ph" in "pharmacy" and "telephone".  (Such characters can also      
 1019    appear consecutively without forming a digraph, as in "tophat".)        
 1020    Certain digraphs may be indicated typographically by setting the two    
 1021    characters closer together than they would be if used consecutively     
 1022    to represent different phonemes.  Some digraphs are fully joined as     
 1023    ligatures.  For example, the word "encyclopaedia" is sometimes set      
 1024    with a U+00E6 LATIN SMALL LIGATURE AE.  When ligature and digraph       
 1025    forms have the same interpretation across all languages that use a      
 1026    given script, application of Unicode normalization generally resolves   
 1027    the differences and causes them to match.  When they have different     
 1028    interpretations, matching must utilize other methods, presumably        
 1029    chosen at the registry level, or users must be educated to understand   
 1030    that matching will not occur.                                           
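
         What Unicode normalization does and does not resolve can be
         seen with Python's standard "unicodedata" module.  The helper
         below is a hypothetical name introduced for illustration; it
         compares two strings after applying a chosen normalization
         form.

```python
import unicodedata


def same_after(form: str, a: str, b: str) -> bool:
    """True if `a` and `b` are identical after normalization `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Two codings of a-diaeresis (combining sequence vs. U+00E4) match under NFC.
CODINGS_MATCH = same_after("NFC", "a\u0308", "\u00e4")

# The typographic ligature U+FB01 (fi) is resolved by NFKC.
FI_MATCHES = same_after("NFKC", "\ufb01n", "fin")

# U+00E6 is a letter in its own right; no normalization form maps it to "ae".
AE_MATCHES = same_after("NFKC", "encyclop\u00e6dia", "encyclopaedia")
```

         CODINGS_MATCH and FI_MATCHES are true, while AE_MATCHES is
         false: matching "ae" with U+00E6 would require a
         language-aware method of the kind discussed above.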
 1031                                                                            
 1032    The nature of the problem can be illustrated by many words in the       
 1033    Norwegian language, where the "ae" ligature is the 27th letter of a     
 1034    29-letter extended Latin alphabet.  It is equivalent to the 28th        
 1035    letter of the Swedish alphabet (also containing 29 letters),            
 1036    U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, for which an "ae" cannot    
 1037    be substituted according to current orthographic standards.  That       
 1038    character (U+00E4) is also part of the German alphabet where, unlike    
 1046    in the Nordic languages, the two-character sequence "ae" is usually     
 1047    treated as a fully acceptable alternate orthography for the "umlauted   
 1048    a" character.  The inverse, however, is not true, and those two
 1049    characters cannot necessarily be combined into an "umlauted a".  This   
 1050    also applies to another German character, the "umlauted o"              
 1051    (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) which, for example,        
 1052    cannot be used for writing the name of the author "Goethe".  It is      
 1053    also a letter in the Swedish alphabet where, like the "a with           
 1054    diaeresis", it cannot be correctly represented as "oe" and in the       
 1055    Norwegian alphabet, where it is represented, not as "o with             
 1056    diaeresis", but as "slashed o", U+00F8.                                 
 1057                                                                            
 1058    Some of the ligatures that have explicit code points in Unicode were    
 1059    given special handling in IDNA2003 and now pose additional problems     
 1060    in transition.  See Section 7.2.                                        
 1061                                                                            
 1062    Additional cases with alphabets written right to left are described     
 1063    in Section 4.5.                                                         
 1064                                                                            
 1065    Matching and comparison algorithm selection often requires              
 1066    information about the language being used, context, or both --          
 1067    information that is not available to IDNA or the DNS.  Consequently,    
 1068    IDNA2008 makes no attempt to treat combined characters in any special   
 1069    way.  A registry that is aware of the language context in which         
 1070    labels are to be registered, and where that language sometimes (or      
 1071    always) treats the two-character sequences as equivalent to the         
 1072    combined form, should give serious consideration to applying a          
 1073    "variant" model [RFC3743][RFC4290] or to prohibiting registration of    
 1074    one of the forms entirely, to reduce the opportunities for user         
 1075    confusion and fraud that would result from the related strings being    
 1076    registered to different parties.                                        
 1077                                                                            
 1078 4.4.  Case Mapping and Related Issues                                      
 1079                                                                            
 1080    In the DNS, ASCII letters are stored with their case preserved.         
 1081    Matching during the query process is case-independent, but none of      
 1082    the information that might be represented by choices of case has been   
 1083    lost.  That model has been accidentally helpful because, as people      
 1084    have created DNS labels by catenating words (or parts of words) to      
 1085    form labels, case has often been used to distinguish among components   
 1086    and make the labels more memorable.                                     
 1087                                                                            
 1088    Since DNS servers do not get involved in parsing IDNs, they cannot do   
 1089    case-independent matching.  Thus, keeping the cases separate in         
 1090    lookup or registration, and doing matching at the server, is not        
 1091    feasible with IDNA or any similar approach.  Matching of characters     
 1092    that are considered to differ only by case must be done, if desired,    
 1093    by programs invoking IDNA lookup even though it wasn't done by ASCII-   
 1101    only DNS clients.  That situation was recognized in IDNA2003 and        
 1102    nothing in IDNA2008 fundamentally changes it or could do so.  In        
 1103    IDNA2003, all characters are case folded and mapped by clients in a     
 1104    standardized step.                                                      
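
         The reason that server-side matching cannot help can be
         sketched as follows.  The example assumes that a server
         compares labels by ignoring ASCII case only, and it produces
         A-labels with Python's "punycode" codec without any mapping or
         validation.  The function names are hypothetical.

```python
def ace(label: str) -> str:
    """Punycode-encode a label with no prior case folding or mapping."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def dns_match(a: str, b: str) -> bool:
    """ASCII case-independent comparison, as done for traditional labels."""
    return a.lower() == b.lower()


# ASCII labels that differ only in case match at the server.
ASCII_MATCH = dns_match("Example", "eXAMPLE")

# Labels that differ only in the case of a non-ASCII letter do not,
# because their encoded forms are different ASCII strings.
IDN_MATCH = dns_match(ace("\u00c9cole"), ace("\u00e9cole"))
```

         ASCII_MATCH is true and IDN_MATCH is false, so any case
         folding that is wanted has to happen before the label is
         encoded for lookup.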
 1105                                                                            
 1106    Even in scripts that generally support case distinctions, some          
 1107    characters do not have uppercase forms.  For example, the Unicode       
 1108    case-folding operation maps Greek Final Form Sigma (U+03C2) to the      
 1109    medial form (U+03C3) and maps Eszett (German Sharp S, U+00DF) to        
 1110    "ss".  Neither of these mappings is reversible because the uppercase    
 1111    of U+03C3 is the uppercase Sigma (U+03A3) and "ss" is an ASCII          
 1112    string.  IDNA2008 permits, at the risk of some incompatibility,         
 1113    slightly more flexibility in this area by avoiding case folding and     
 1114    treating these characters as themselves.  Approaches to handling one-   
 1115    way mappings are discussed in Section 7.2.                              
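
         The one-way nature of these mappings can be observed with
         Python's str.casefold(), which applies Unicode full case
         folding.  This is only an illustration of the folding
         behavior, not of IDNA2003's complete mapping step, and the
         function name is hypothetical.

```python
FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
MEDIAL_SIGMA = "\u03c3"  # GREEK SMALL LETTER SIGMA
ESZETT = "\u00df"        # LATIN SMALL LETTER SHARP S


def fold_collides(a: str, b: str) -> bool:
    """True if two distinct strings become identical after case folding."""
    return a != b and a.casefold() == b.casefold()


SIGMA_COLLIDES = fold_collides(FINAL_SIGMA, MEDIAL_SIGMA)
ESZETT_COLLIDES = fold_collides("stra" + ESZETT + "e", "strasse")
```

         Both values are true: once folded, the original Final Sigma or
         Eszett cannot be recovered from the result.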
 1116                                                                            
 1117    Because IDNA2003 maps Final Sigma and Eszett to other characters, and   
 1118    the reverse mapping is never possible, neither Final Sigma nor Eszett   
 1119    can be represented in the ACE form of an IDNA2003 IDN or in the native
 1120    character (U-label) form derived from it.  With IDNA2008, both          
 1121    characters can be used in an IDN and so the A-label used for lookup     
 1122    for any U-label containing those characters is now different.  See      
 1123    Section 7.1 for a discussion of what kinds of changes might require     
 1124    the IDNA prefix to change; after extended discussions, the IDNABIS      
 1125    Working Group came to consensus that the change for these characters    
 1126    did not justify a prefix change.                                        
 1127                                                                            
 1128 4.5.  Right-to-Left Text                                                   
 1129                                                                            
 1130    In order to be sure that the directionality of right-to-left text is    
 1131    unambiguous, IDNA2003 required that any label in which right-to-left    
 1132    characters appear both starts and ends with them and that it does not   
 1133    include any characters with strong left-to-right properties (that       
 1134    excludes other alphabetic characters but permits European digits).      
 1135    Any other string that contains a right-to-left character and does not   
 1136    meet those requirements is rejected.  This is one of the few places     
 1137    where the IDNA algorithms (both in IDNA2003 and in IDNA2008) examine    
 1138    an entire label, not just individual characters.  The algorithmic       
 1139    model used in IDNA2003 rejects the label when the final character in    
 1140    a right-to-left string requires a combining mark in order to be         
 1141    correctly represented.                                                  
 1142                                                                            
 1143    That prohibition is not acceptable for writing systems for languages    
 1144    written with consonantal alphabets to which diacritical vocalic         
 1145    systems are applied, and for languages with orthographies derived       
 1146    from them where the combining marks may have different functionality.   
 1147    In both cases, the combining marks can be essential components of the   
 1148    orthography.  Examples of this are Yiddish, written with an extended    
 1156    Hebrew script, and Dhivehi (the official language of Maldives), which   
 1157    is written in the Thaana script (which is, in turn, derived from the    
 1158    Arabic script).  IDNA2008 removes the restriction on final combining    
 1159    characters with a new set of rules for right-to-left scripts and        
 1160    their characters.  Those new rules are specified in the Bidi document   
 1161    [RFC5893].                                                              
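
         The IDNA2003 restriction described at the start of this
         section can be approximated in a few lines of Python.  The
         sketch assumes that the Bidi classes reported by the standard
         "unicodedata" module ("R" and "AL" for right-to-left
         characters, "L" for strong left-to-right ones) are an adequate
         stand-in for the tables IDNA2003 actually used, which were
         tied to an older Unicode version.  It does not implement the
         IDNA2008 rules of the Bidi document, and the function name is
         hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # assumed stand-in for right-to-left characters


def passes_idna2003_bidi(label: str) -> bool:
    """Approximate the IDNA2003 whole-label check for right-to-left text."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RTL_CLASSES for c in classes):
        return True   # the rule applies only when RTL characters appear
    if "L" in classes:
        return False  # no strong left-to-right characters allowed
    # The label must both start and end with a right-to-left character.
    return classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
```

         Under this approximation, a Hebrew label that ends in a
         letter passes, while one that ends in a combining point (for
         example, U+05D0 followed by U+05B7) is rejected, as is a
         Thaana label that ends in a vowel sign.  That rejection is the
         case that the IDNA2008 rules were written to remove.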
 1162                                                                            
 1163 5.  IDNs and the Robustness Principle                                      
 1164                                                                            
 1165    The "Robustness Principle" is often stated as "Be conservative about    
 1166    what you send and liberal in what you accept" (see, e.g., Section       
 1167    1.2.2 of the applications-layer Host Requirements specification         
 1168    [RFC1123]).  This principle applies to IDNA.  In applying the           
 1169    principle to registries as the source ("sender") of all registered      
 1170    and useful IDNs, registries are responsible for being conservative      
 1171    about what they register and put out in the Internet.  For IDNs to      
 1172    work well, zone administrators (registries) must have and require       
 1173    sensible policies about what is registered -- conservative policies     
 1174    -- and implement and enforce them.                                      
 1175                                                                            
 1176    Conversely, lookup applications are expected to reject labels that      
 1177    clearly violate global (protocol) rules (no one has ever seriously      
 1178    claimed that being liberal in what is accepted requires being           
 1179    stupid).  However, once one gets past such global rules and deals       
 1180    with anything sensitive to script or locale, it is necessary to         
 1181    assume that garbage has not been placed into the DNS, i.e., one must    
 1182    be liberal about what one is willing to look up in the DNS rather       
 1183    than guessing about whether it should have been permitted to be         
 1184    registered.                                                             
 1185                                                                            
 1186    If a string cannot be successfully found in the DNS after the lookup    
 1187    processing described here, it makes no difference whether it simply     
 1188    wasn't registered or was prohibited by some rule at the registry.       
 1189    Application implementers should be aware that where DNS wildcards are   
 1190    used, the ability to successfully resolve a name does not guarantee     
 1191    that it was actually registered.                                        
 1192                                                                            
 1193 6.  Front-end and User Interface Processing for Lookup                     
 1194                                                                            
 1195    Domain names may be identified and processed in many contexts.  They    
 1196    may be typed in by users themselves or embedded in an identifier such   
 1197    as an email address, URI, or IRI.  They may occur in running text or    
 1198    be processed by one system after being provided in another.  Systems    
 1199    may try to normalize URLs to determine (or guess) whether a reference   
 1200    is valid or if two references point to the same object without          
 1201    actually looking the objects up (comparison without lookup is           
 1202    necessary for URI types that are not intended to be resolved).  Some    
 1203    of these goals may be more easily and reliably satisfied than others.   
 1204                                                                            
 1211    While there are strong arguments for any domain name that is placed     
 1212    "on the wire" -- transmitted between systems -- to be in the zero-      
 1213    ambiguity forms of A-labels, it is inevitable that programs that        
 1214    process domain names will encounter U-labels or variant forms.          
 1215                                                                            
 1216    An application that implements the IDNA protocol [RFC5891] will         
 1217    always take any user input and convert it to a set of Unicode code      
 1218    points.  That user input may be acquired by any of several different    
 1219    input methods, all with differing conversion processes to be taken      
 1220    into consideration (e.g., typed on a keyboard, written by hand onto     
 1221    some sort of digitizer, spoken into a microphone and interpreted by a   
 1222    speech-to-text engine, etc.).  The process of taking any particular     
 1223    user input and mapping it into a Unicode code point may be a simple     
 1224    one: if a user strikes the "A" key on a US English keyboard, without    
 1225    any modifiers such as the "Shift" key held down, in order to draw a     
   Latin small letter A ("a"), many (perhaps most) modern operating
   system input methods will deliver the code point U+0061 to the
   calling application, encoded in a single octet in ASCII-compatible
   encodings such as UTF-8.
 1229                                                                            
 1230    Sometimes the process is somewhat more complicated: a user might        
 1231    strike a particular set of keys to represent a combining macron         
 1232    followed by striking the "A" key in order to draw a Latin small         
 1233    letter A with a macron above it.  Depending on the operating system,    
 1234    the input method chosen by the user, and even the parameters with       
 1235    which the application communicates with the input method, the result    
 1236    might be the code point U+0101 (encoded as two octets in UTF-8 or       
 1237    UTF-16, four octets in UTF-32, etc.), the code point U+0061 followed    
 1238    by the code point U+0304 (again, encoded in three or more octets,       
 1239    depending upon the encoding used) or even the code point U+FF41         
 1240    followed by the code point U+0304 (and encoded in some form).  These    
 1241    examples leave aside the issue of operating systems and input methods   
 1242    that do not use Unicode code points for their character set.            
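
   The three outcomes described above can be compared directly.  The
   following Python sketch is illustrative only; the names CANDIDATES
   and describe are assumptions made for this example and come from no
   specification.  It lists the code points and encoded lengths of each
   form and shows which Unicode normalization forms bring them together.

```python
import unicodedata

# Three plausible results of typing "a with macron", as described above.
CANDIDATES = {
    "precomposed": "\u0101",      # U+0101
    "combining": "a\u0304",       # U+0061 U+0304
    "fullwidth": "\uff41\u0304",  # U+FF41 U+0304
}


def describe(text):
    """Return the code points and the encoded length in octets."""
    return {
        "code_points": [f"U+{ord(ch):04X}" for ch in text],
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # "-be" avoids a BOM
        "utf-32": len(text.encode("utf-32-be")),
    }


for name, text in CANDIDATES.items():
    print(name, describe(text))

# NFC merges only the canonically equivalent pair. NFKC also folds the
# fullwidth letter, so all three forms end up as U+0101.
nfc = {n: unicodedata.normalize("NFC", t) for n, t in CANDIDATES.items()}
nfkc = {n: unicodedata.normalize("NFKC", t) for n, t in CANDIDATES.items()}
```

   Which of these forms, if any, an application should treat as the
   user's intent is exactly the mapping question discussed in this
   section.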
 1243                                                                            
 1244    In every case, applications (with the help of the operating systems     
 1245    on which they run and the input methods used) need to perform a         
 1246    mapping from user input into Unicode code points.                       
 1247                                                                            
 1248    IDNA2003 used a model whereby input was taken from the user, mapped     
 1249    (via whatever input method mechanisms were used) to a set of Unicode    
 1250    code points, and then further mapped to a set of Unicode code points    
 1251    using the Nameprep profile [RFC3491].  In this procedure, there are     
 1252    two separate mapping steps: first, a mapping done by the input method   
 1253    (which might be controlled by the operating system, the application,    
 1254    or some combination) and then a second mapping performed by the         
 1255    Nameprep portion of the IDNA protocol.  The mapping done in Nameprep    
 1256    includes a particular mapping table to re-map some characters to        
 1257    other characters, a particular normalization, and a set of prohibited   
 1258    characters.                                                             
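
   The three Nameprep components named above (a mapping table, a
   normalization, and a set of prohibited characters) can be sketched
   with Python's standard "stringprep" module, which exposes the
   Stringprep tables.  This is a simplified illustration and not a
   complete Nameprep implementation: the function name nameprep_sketch
   is hypothetical, only two of the prohibited-character tables are
   consulted, and the bidirectional checks are omitted.

```python
import stringprep
import unicodedata


def nameprep_sketch(label):
    """Approximate the IDNA2003 Nameprep steps for a single label."""
    # Step 1: mapping. Table B.1 characters are mapped to nothing;
    # table B.2 applies the case-folding map.
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in label
        if not stringprep.in_table_b1(ch)
    )
    # Step 2: normalization. Nameprep is tied to Unicode 3.2, so the
    # frozen 3.2 database is used rather than the current one.
    normalized = unicodedata.ucd_3_2_0.normalize("NFKC", mapped)
    # Step 3: prohibition (non-ASCII spaces and non-ASCII controls only,
    # for brevity; the real profile checks more tables).
    for ch in normalized:
        if stringprep.in_table_c12(ch) or stringprep.in_table_c22(ch):
            raise ValueError(f"prohibited code point U+{ord(ch):04X}")
    return normalized
```

   Under these assumptions, nameprep_sketch("Stra\u00dfe") yields
   "strasse" and nameprep_sketch("a\u200db") yields "ab": the second
   mapping step changes or removes characters that the input method
   delivered.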
 1259                                                                            
   Note that, in the two-step mapping process, the mapping chosen by
   the operating system or application in the first step might differ
   significantly from the mapping supplied by the Nameprep profile in
   the second step.  This has advantages and disadvantages.  On the one
   hand, the second mapping regularizes what gets looked up in the DNS,
   making for better interoperability between implementations that use
   the Nameprep mapping.  On the other hand, the application or
   operating system may choose mappings in its input methods that, when
   passed through the second (Nameprep) mapping, result in characters
   that are "surprising" to the end user.
 1276                                                                            
 1277    The other important feature of IDNA2003 is that, with very few          
 1278    exceptions, it assumes that any set of Unicode code points provided     
 1279    to the Nameprep mapping can be mapped into a string of Unicode code     
 1280    points that are "sensible", even if that means mapping some code        
 1281    points to nothing (that is, removing the code points from the           
 1282    string).  This allowed maximum flexibility in input strings.            
 1283                                                                            
 1284    The present version of IDNA (IDNA2008) differs significantly in         
 1285    approach from the original version.  First and foremost, it does not    
 1286    provide explicit mapping instructions.  Instead, it assumes that the    
 1287    application (perhaps via an operating system input method) will do      
 1288    whatever mapping it requires to convert input into Unicode code         
 1289    points.  This has the advantage of giving flexibility to the            
 1290    application to choose a mapping that is suitable for its user given     
 1291    specific user requirements, and avoids the two-step mapping of the      
 1292    original protocol.  Instead of a mapping, IDNA2008 provides a set of    
 1293    categories that can be used to specify the valid code points allowed    
 1294    in a domain name.                                                       
 1295                                                                            
 1296    In principle, an application ought to take user input of a domain       
 1297    name and convert it to the set of Unicode code points that represent    
 1298    the domain name the user intends.  As a practical matter, of course,    
 1299    determining user intent is a tricky business, so an application needs   
 1300    to choose a reasonable mapping from user input.  That may differ        
 1301    based on the particular circumstances of a user, depending on locale,   
 1302    language, type of input method, etc.  It is up to the application to    
 1303    make a reasonable choice.                                               
 1304                                                                            
 1321 7.  Migration from IDNA2003 and Unicode Version Synchronization            
 1322                                                                            
 1323 7.1.  Design Criteria                                                      
 1324                                                                            
 1325    As mentioned above and in the IAB review and recommendations for IDNs   
 1326    [RFC4690], two key goals of the IDNA2008 design are:                    
 1327                                                                            
 1328    o  to enable applications to be agnostic about whether they are being   
 1329       run in environments supporting any Unicode version from 3.2          
 1330       onward.                                                              
 1331                                                                            
 1332    o  to permit incrementally adding new characters, character groups,     
 1333       scripts, and other character collections as they are incorporated    
 1334       into Unicode, doing so without disruption and, in the long term,     
 1335       without "heavy" processes (an IETF consensus process is required     
 1336       by the IDNA2008 specifications and is expected to be required and    
 1337       used until significant experience accumulates with IDNA operations   
 1338       and new versions of Unicode).                                        
 1339                                                                            
 1340 7.1.1.  Summary and Discussion of IDNA Validity Criteria                   
 1341                                                                            
 1342    The general criteria for a label to be considered valid under IDNA      
 1343    are (the actual rules are rigorously defined in the Protocol            
 1344    [RFC5891] and Tables [RFC5892] documents):                              
 1345                                                                            
 1346    o  The characters are "letters", marks needed to form letters,          
 1347       numerals, or other code points used to write words in some           
 1348       language.  Symbols, drawing characters, and various notational       
 1349       characters are intended to be permanently excluded.  There is no     
 1350       evidence that they are important enough to Internet operations or    
 1351       internationalization to justify expansion of domain names beyond     
 1352       the general principle of "letters, digits, and hyphen".              
 1353       (Additional discussion and rationale for the symbol decision         
 1354       appears in Section 7.6.)                                             
 1355                                                                            
 1356    o  Other than in very exceptional cases, e.g., where they are needed    
 1357       to write substantially any word of a given language, punctuation     
 1358       characters are excluded.  The fact that a word exists is not proof   
 1359       that it should be usable in a DNS label, and DNS labels are not      
 1360       expected to be usable for multiple-word phrases (although they are   
 1361       certainly not prohibited if the conventions and orthography of a     
 1362       particular language cause that to be possible).                      
 1363                                                                            
 1364    o  Characters that are unassigned (have no character assignment at      
 1365       all) in the version of Unicode being used by the registry or         
 1366       application are not permitted, even on lookup.  The issues           
 1367       involved in this decision are discussed in Section 7.7.              
 1368                                                                            
 1376    o  Any character that is mapped to another character by a current       
 1377       version of NFKC is prohibited as input to IDNA (for either           
 1378       registration or lookup).  With a few exceptions, this principle      
 1379       excludes any character mapped to another by Nameprep [RFC3491].      
 1380                                                                            
 1381    The principles above drive the design of rules that are specified       
 1382    exactly in the Tables document.  Those rules identify the characters    
 1383    that are valid under IDNA.  The rules themselves are normative, and     
 1384    the tables are derived from them, rather than vice versa.               
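
   As a rough illustration of how such criteria translate into tests on
   individual code points, the following Python sketch checks three of
   them against the Unicode version bundled with the interpreter.  It
   is an approximation under stated assumptions: the function name and
   the WORD_CATEGORIES tuple are hypothetical, General_Category is used
   as a coarse stand-in for "used to write words", and case handling,
   contextual rules, and the exception lists of the Tables document are
   not reproduced.

```python
import unicodedata

# General categories treated here as "used to write words": letters,
# combining marks, and decimal digits. A coarse approximation only.
WORD_CATEGORIES = ("L", "M", "Nd")


def rough_validity_problem(ch):
    """Return a reason string if ch fails a criterion, else None."""
    category = unicodedata.category(ch)
    if category == "Cn":
        return "unassigned in this Unicode version"
    if unicodedata.normalize("NFKC", ch) != ch:
        return "changed by NFKC"
    if ch != "-" and not category.startswith(WORD_CATEGORIES):
        return "not a letter, mark, or digit"
    return None
```

   For example, the fullwidth letter U+FF41 is reported as changed by
   NFKC and the symbol U+2665 as not a letter, mark, or digit, while
   "a" and U+00DF pass all three checks.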
 1385                                                                            
 1386 7.1.2.  Labels in Registration                                             
 1387                                                                            
 1388    Any label registered in a DNS zone must be validated -- i.e., the       
 1389    criteria for that label must be met -- in order for applications to     
 1390    work as intended.  This principle is not new.  For example, since the   
 1391    DNS was first deployed, zone administrators have been expected to       
 1392    verify that names meet "hostname" requirements [RFC0952] where those    
   requirements are imposed by the expected applications.  Other
   application contexts, such as the later addition of special service
   location formats [RFC2782], imposed new requirements on zone
   administrators.  For zones that will contain IDNs, support for
 1397    Unicode version-independence requires restrictions on all strings       
 1398    placed in the zone.  In particular, for such zones (the exact rules     
 1399    appear in Section 4 of the Protocol document [RFC5891]):                
 1400                                                                            
   o  Any label that appears to be an A-label, i.e., any label that
      starts with "xn--", must be valid under IDNA; that is, it must be
      a valid A-label, as discussed in Section 2 above.
 1404                                                                            
   o  The Unicode tables (i.e., tables of code points, character
      classes, and properties) and IDNA tables (i.e., tables of
      contextual rules such as those that appear in the Tables
      document) must be consistent on the systems performing or
 1409       validating labels to be registered.  Note that this does not         
 1410       require that tables reflect the latest version of Unicode, only      
 1411       that all tables used on a given system are consistent with each      
 1412       other.                                                               
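
   A registry-side check of the first requirement might begin as shown
   in the following Python sketch.  It assumes Python's built-in
   "punycode" codec, and its function names are hypothetical.  It tests
   only that a label with the "xn--" prefix decodes to a string that
   contains non-ASCII characters and re-encodes to the same label.
   That is a necessary condition for a valid A-label but not a
   sufficient one; the code point and contextual checks of the Protocol
   document are omitted.

```python
ACE_PREFIX = "xn--"


def appears_to_be_a_label(label):
    """True if the label starts with the ACE prefix (case-insensitive)."""
    return label.lower().startswith(ACE_PREFIX)


def survives_round_trip(label):
    """Decode the Punycode part and check that re-encoding reproduces it."""
    lowered = label.lower()
    if not appears_to_be_a_label(lowered):
        return False
    try:
        payload = lowered[len(ACE_PREFIX):].encode("ascii")
        decoded = payload.decode("punycode")
        if decoded.isascii():
            # An A-label always carries at least one non-ASCII character.
            return False
        encoded = decoded.encode("punycode").decode("ascii")
    except UnicodeError:
        return False
    return ACE_PREFIX + encoded == lowered
```

   A label such as "xn--abc-" decodes to plain ASCII and is therefore
   rejected by this sketch even though it carries the prefix.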
 1413                                                                            
 1414    Under this model, registry tables will need to be updated (both the     
 1415    Unicode-associated tables and the tables of permitted IDN characters)   
 1416    to enable a new script or other set of new characters.  The registry    
 1417    will not be affected by newer versions of Unicode, or newly             
 1418    authorized characters, until and unless it wishes to support them.      
 1419    The zone administrator is responsible for verifying validity for IDNA   
 1420    as well as its local policies -- a more extensive set of checks than    
 1421    are required for looking up the labels.  Systems looking up or          
 1431    resolving DNS labels, especially IDN DNS labels, must be able to        
 1432    assume that applicable registration rules were followed for names       
 1433    entered into the DNS.                                                   
 1434                                                                            
 1435 7.1.3.  Labels in Lookup                                                   
 1436                                                                            
 1437    Any application processing a label through IDNA so it can be looked     
 1438    up in a DNS zone is required to (the exact rules appear in Section 5    
 1439    of the Protocol document [RFC5891]):                                    
 1440                                                                            
 1441    o  Maintain IDNA and Unicode tables that are consistent with regard     
 1442       to versions, i.e., unless the application actually executes the      
 1443       classification rules in the Tables document [RFC5892], its IDNA      
 1444       tables must be derived from the version of Unicode that is           
 1445       supported more generally on the system.  As with registration, the   
 1446       tables need not reflect the latest version of Unicode, but they      
 1447       must be consistent.                                                  
 1448                                                                            
 1449    o  Validate the characters in labels to be looked up only to the        
 1450       extent of determining that the U-label does not contain              
 1451       "DISALLOWED" code points or code points that are unassigned in its   
 1452       version of Unicode.                                                  
 1453                                                                            
 1454    o  Validate the label itself for conformance with a small number of     
 1455       whole-label rules.  In particular, it must verify that:              
 1456                                                                            
 1457       *  there are no leading combining marks,                             
 1458                                                                            
 1459       *  the Bidi conditions are met if right-to-left characters appear,   
 1460                                                                            
 1461       *  any required contextual rules are available, and                  
 1462                                                                            
 1463       *  any contextual rules that are associated with joiner characters   
 1464          (and CONTEXTJ characters more generally) are tested.              
 1465                                                                            
 1466    o  Do not reject labels based on other contextual rules about           
 1467       characters, including mixed-script label prohibitions.  Such rules   
 1468       may be used to influence presentation decisions in the user          
 1469       interface, but not to avoid looking up domain names.                 
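
   The character-level and leading-mark checks in the list above can be
   sketched as follows.  This Python fragment is a partial illustration
   under stated assumptions: the names LookupProblem and
   precheck_for_lookup are hypothetical, the "disallowed" argument is a
   caller-supplied stand-in for tables derived from the Tables document
   [RFC5892], and the Bidi and contextual-rule tests are not
   implemented.

```python
import unicodedata
from enum import Enum


class LookupProblem(Enum):
    LEADING_COMBINING_MARK = "label begins with a combining mark"
    UNASSIGNED = "label contains an unassigned code point"
    DISALLOWED = "label contains a DISALLOWED code point"


def precheck_for_lookup(u_label, disallowed):
    """Return the first LookupProblem found in u_label, or None.

    `disallowed` is a hypothetical set of DISALLOWED characters; a real
    implementation would derive it from the Tables document.
    """
    if u_label and unicodedata.category(u_label[0]).startswith("M"):
        return LookupProblem.LEADING_COMBINING_MARK
    for ch in u_label:
        if unicodedata.category(ch) == "Cn":
            return LookupProblem.UNASSIGNED
        if ch in disallowed:
            return LookupProblem.DISALLOWED
    return None
```

   Returning a distinct value for the unassigned case lets the caller
   report it separately from other failures.  Consistent with the last
   item in the list, the sketch contains no mixed-script test: a label
   that mixes Latin and Cyrillic letters passes unchanged.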
 1470                                                                            
 1471    To further clarify the rules about handling characters that require     
 1472    contextual rules, note that one can have a context-required character   
 1473    (i.e., one that requires a rule), but no rule.  In that case, the       
 1474    character is treated the same way DISALLOWED characters are treated,    
 1475    until and unless a rule is supplied.  That state is more or less        
 1476    equivalent to "the idea of permitting this character is accepted in     
 1477    principle, but it won't be permitted in practice until consensus is     
 1478    reached on a safe way to use it".                                       
 1479                                                                            
 1486    The ability to add a rule more or less exempts these characters from    
 1487    the prohibition against reclassifying characters from DISALLOWED to     
 1488    PVALID.                                                                 
 1489                                                                            
 1490    And, obviously, "no rule" is different from "have a rule, but the       
 1491    test either succeeds or fails".                                         
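
   The three states just described ("no rule", "rule satisfied", and
   "rule not satisfied") can be made explicit in code.  In the Python
   sketch below, all names are hypothetical, and the rule registry
   deliberately holds a rule for ZERO WIDTH JOINER only, so that the
   "no rule" state can be shown with ZERO WIDTH NON-JOINER.  The rule
   itself is a simplified test modeled on the joiner rules of the
   Tables document, not a copy of them.

```python
import unicodedata
from enum import Enum

VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"


class ContextResult(Enum):
    NO_RULE = "no rule available; treat like DISALLOWED"
    RULE_PASSED = "rule available and satisfied"
    RULE_FAILED = "rule available and not satisfied"


def preceded_by_virama(label, index):
    """Simplified joiner test: the previous character is a virama."""
    if index == 0:
        return False
    return unicodedata.combining(label[index - 1]) == VIRAMA_CCC


# Hypothetical rule registry keyed by character. Only ZWJ has a rule.
CONTEXT_RULES = {"\u200d": preceded_by_virama}


def check_context(label, index):
    """Evaluate the contextual rule, if any, for label[index]."""
    rule = CONTEXT_RULES.get(label[index])
    if rule is None:
        return ContextResult.NO_RULE
    if rule(label, index):
        return ContextResult.RULE_PASSED
    return ContextResult.RULE_FAILED
```

   A caller would reject the label for both NO_RULE and RULE_FAILED,
   but only NO_RULE can change to acceptance later without the label
   itself changing, namely when a rule is supplied.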
 1492                                                                            
 1493    Lookup applications that follow these rules, rather than having their   
 1494    own criteria for rejecting lookup attempts, are not sensitive to        
 1495    version incompatibilities with the particular zone registry             
 1496    associated with the domain name except for labels containing            
 1497    characters recently added to Unicode.                                   
 1498                                                                            
 1499    An application or client that processes names according to this         
 1500    protocol and then resolves them in the DNS will be able to locate any   
 1501    name that is registered, as long as those registrations are valid       
 1502    under IDNA and its version of the IDNA tables is sufficiently up to     
 1503    date to interpret all of the characters in the label.  Messages to      
 1504    users should distinguish between "label contains an unallocated code    
 1505    point" and other types of lookup failures.  A failure on the basis of   
 1506    an old version of Unicode may lead the user to a desire to upgrade to   
 1507    a newer version, but will have no other ill effects (this is            
 1508    consistent with behavior in the transition to the DNS when some hosts   
 1509    could not yet handle some forms of names or record types).              
 1510                                                                            
 1511 7.2.  Changes in Character Interpretations                                 
 1512                                                                            
 1513    As a consequence of the elimination of mapping, the current version     
 1514    of IDNA changes the interpretation of a few characters relative to      
 1515    its predecessors.  This subsection outlines the issues and discusses    
 1516    possible transition strategies.                                         
 1517                                                                            
 1518 7.2.1.  Character Changes: Eszett and Final Sigma                          
 1519                                                                            
 1520    In those scripts that make case distinctions, there are a few           
 1521    characters for which an obvious and unique uppercase character has      
 1522    not historically been available to match a lowercase one, or vice       
 1523    versa.  For those characters, the mappings used in constructing the     
 1524    Stringprep tables for IDNA2003, performed using the Unicode             
 1525    toCaseFold operation (see Section 5.18 of the Unicode Standard          
 1526    [Unicode52]), generate different characters or sets of characters.      
 1527    Those operations are not reversible and lose even more information      
 1528    than traditional uppercase or lowercase transformations, but are more   
 1529    useful than those transformations for comparison purposes.  Two         
 1530    notable characters of this type are the German character Eszett         
 1531    (Sharp S, U+00DF) and the Greek Final Form Sigma (U+03C2).  The         
 1532    former is case folded to the ASCII string "ss", the latter to a         
 1533    medial (lowercase) Sigma (U+03C3).                                      
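
   The effect of case folding on these two characters can be observed
   with Python's str.casefold() method, which applies Unicode case
   folding.  The constant names below are chosen for this example only.

```python
ESZETT = "\u00df"        # LATIN SMALL LETTER SHARP S
FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
MEDIAL_SIGMA = "\u03c3"  # GREEK SMALL LETTER SIGMA

# Ordinary lowercasing leaves both characters alone ...
lowered = (ESZETT.lower(), FINAL_SIGMA.lower())

# ... but case folding replaces them.
folded_eszett = ESZETT.casefold()
folded_sigma = FINAL_SIGMA.casefold()

# Folding is lossy: two different spellings become indistinguishable,
# and nothing in the folded string says which one was entered.
same_after_folding = "stra\u00dfe".casefold() == "strasse".casefold()
```

   Because the folded result no longer records which spelling was
   entered, the original characters cannot be recovered from it.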
 1534                                                                            
 1541 7.2.2.  Character Changes: Zero Width Joiner and Zero Width Non-Joiner     
 1542                                                                            
 1543    IDNA2003 mapped both ZERO WIDTH JOINER (ZWJ, U+200D) and ZERO WIDTH     
 1544    NON-JOINER (ZWNJ, U+200C) to nothing, effectively dropping these        
 1545    characters from any label in which they appeared and treating strings   
 1546    containing them as identical to strings that did not.  As discussed     
 1547    in Section 3.1.2 above, those characters are essential for writing      
 1548    many reasonable mnemonics for certain scripts.  However, treating       
 1549    them as valid in IDNA2008, even with contextual restrictions, raises    
 1550    approximately the same problem as exists with Eszett and Final Sigma:   
 1551    strings that were valid under IDNA2003 have different interpretations   
 1552    as labels, and different A-labels, than the same strings under this     
 1553    newer version.                                                          
 1554                                                                            
 1555 7.2.3.  Character Changes and the Need for Transition                      
 1556                                                                            
 1557    The decision to eliminate mandatory and standardized mappings,          
 1558    including case folding, from the IDNA2008 protocol in order to make     
 1559    A-labels and U-labels idempotent made these characters problematic.     
 1560    If they were to be disallowed, important words and mnemonics could      
 1561    not be written in orthographically reasonable ways.  If they were to    
 1562    be permitted as distinct characters, there would be no information      
 1563    loss and registries would have more flexibility, but IDNA2003 and       
 1564    IDNA2008 lookups might result in different A-labels.                    
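
   The difference in resulting labels can be observed with a short
   Python sketch.  It relies on Python's built-in "idna" codec, which
   implements the older IDNA2003 processing, and on the built-in
   "punycode" codec.  The function names are hypothetical, and
   unmapped_a_label performs none of the IDNA2008 validity checks; it
   only shows what happens when the characters are kept rather than
   mapped.

```python
ACE_PREFIX = "xn--"


def idna2003_form(label):
    """IDNA2003-style conversion via Python's built-in "idna" codec."""
    return label.encode("idna").decode("ascii")


def unmapped_a_label(label):
    """Punycode-encode the label as given, with no mapping applied."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


for text in ("stra\u00dfe", "\u03c2", "a\u200cb"):
    print(repr(text), idna2003_form(text), unmapped_a_label(text))
```

   For "stra\u00dfe", the first function returns the plain ASCII label
   "strasse", while the second returns a label with the "xn--" prefix.
   The two forms name different nodes in the DNS.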
 1565                                                                            
 1566    With the understanding that there would be incompatibility either way   
 1567    but a judgment that the incompatibility was not significant enough to   
 1568    justify a prefix change, the Working Group concluded that Eszett and    
 1569    Final Form Sigma should be treated as distinct and Protocol-Valid       
 1570    characters.                                                             
 1571                                                                            
 1572    Since these characters are interpreted in different ways under the      
 1573    older and newer versions of IDNA, transition strategies and policies    
 1574    will be necessary.  Some actions can reasonably be taken by             
 1575    applications' client programs (those that perform lookup operations     
 1576    or cause them to be performed), but because of the diversity of         
 1577    situations and uses of the DNS, much of the responsibility will need    
 1578    to fall on registries.                                                  
 1579                                                                            
 1580    Registries, especially those maintaining zones for third parties,       
 1581    must decide how to introduce a new service in a way that does not       
 1582    create confusion or significantly weaken or invalidate existing         
 1583    identifiers.  This is not a new problem; registries were faced with     
 1584    similar issues when IDNs were introduced (potentially, and especially   
 1585    for Latin-based scripts, in conflict with existing labels that had      
 1586    been rendered in ASCII characters by applying more or less              
 1587    standardized conventions) and when other new forms of strings have      
 1588    been permitted as labels.                                               
 1589                                                                            
 1596 7.2.4.  Transition Strategies                                              
 1597                                                                            
 1598    There are several approaches to the introduction of new characters or   
 1599    changes in interpretation of existing characters from their mapped      
 1600    forms in the earlier version of IDNA.  The transition issue is          
 1601    complicated because the forms of these labels after the                 
 1602    ToUnicode(ToASCII()) translation in IDNA2003 not only remain valid      
 1603    but do not provide strong indications of what the registrant            
 1604    intended: a string containing "ss" could have simply been intended to   
 1605    be that string or could have been intended to contain an Eszett; a      
 1606    string containing lowercase Sigma could have been intended to contain   
 1607    Final Sigma (one might make heuristic guesses based on position in a    
 1608    string, but the long tradition of forming labels by concatenating       
 1609    words makes such heuristics unreliable), and strings that do not        
 1610    contain ZWJ or ZWNJ might have been intended to contain them.           
 1611    Without any preference or claim to completeness, some of these
 1612    approaches, all of which have been used by registries in the past
 1613    for similar transitions, are:
 1614                                                                            
 1615    1.  Do not permit use of the newly available character at the           
 1616        registry level.  This might cause lookup failures if a domain       
 1617        name were to be written with the expectation of the IDNA2003        
 1618        mapping behavior, but would eliminate any possibility of false      
 1619        matches.                                                            
 1620                                                                            
 1621    2.  Hold a "sunrise"-like arrangement in which holders of labels
 1622        containing "ss" (the Eszett case) or lowercase Sigma (the Final
 1623        Sigma case), or labels that might have contained ZWJ or ZWNJ in
 1624        context, are given priority (and perhaps other benefits) for
 1625        registering the corresponding string containing Eszett, Final
 1626        Sigma, or the appropriate zero-width character, respectively.
 1627                                                                            
 1628    3.  Adopt some sort of "variant" approach in which registrants obtain   
 1629        labels with both character forms.                                   
 1630                                                                            
 1631    4.  Adopt a different form of "variant" approach in which               
 1632        registration of additional strings that would produce the same      
 1633        A-label if interpreted according to IDNA2003 is either not          
 1634        permitted at all or permitted only by the registrant who already    
 1635        has one of the names.                                               
 1636                                                                            
 1637    5.  Ignore the issue and assume that the marketplace or other           
 1638        mechanisms will sort things out.                                    
 1639                                                                            
 1640    In any event, a registry (at any level of the DNS tree) that chooses
 1641    to permit registration of labels that contain these characters, or
 1642    considers doing so, will have to address the relationship with
 1643    existing, possibly conflicting, labels in some way, just as
 1651    registries that already had a considerable number of labels did when    
 1652    IDNs were first introduced.                                             
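
         One way to picture the fourth approach is a registry that keys
         registrations by the A-label the string would have produced
         under IDNA2003.  The sketch below is hypothetical: the
         ToyRegistry class and its policy are assumptions made for
         illustration, and only ToASCII() from Python's encodings.idna
         module (an IDNA2003 implementation) is an existing function.

```python
from encodings.idna import ToASCII


class ToyRegistry:
    """Hypothetical zone registry illustrating one 'variant' policy."""

    def __init__(self) -> None:
        # IDNA2003 A-label (bytes) -> registrant holding that bundle
        self._owner_by_old_alabel = {}

    @staticmethod
    def idna2003_key(label: str) -> bytes:
        # ToASCII applies the Nameprep mappings, so a label with Eszett
        # and the same label with "ss" produce the same key.
        return ToASCII(label)

    def register(self, label: str, registrant: str) -> bool:
        """Accept the label unless another registrant holds its bundle."""
        key = self.idna2003_key(label)
        owner = self._owner_by_old_alabel.setdefault(key, registrant)
        return owner == registrant


registry = ToyRegistry()
print(registry.register("strasse", "alice"))        # first holder
print(registry.register("stra\u00dfe", "bob"))      # conflicting variant
print(registry.register("stra\u00dfe", "alice"))    # same holder
```

         A stricter policy (no additional strings at all) would simply
         refuse every registration whose key is already present.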
 1653                                                                            
 1654 7.3.  Elimination of Character Mapping                                     
 1655                                                                            
 1656    As discussed at length in Section 6, IDNA2003, via Nameprep (see        
 1657    Section 7.5), mapped many characters into related ones.  Those          
 1658    mappings no longer exist as requirements in IDNA2008.  These            
 1659    specifications strongly prefer that only A-labels or U-labels be used   
 1660    in protocol contexts and as much as practical more generally.           
 1661    IDNA2008 does anticipate situations in which some mapping at the time   
 1662    of user input into lookup applications is appropriate and desirable.    
 1663    The issues are discussed in Section 6 and specific recommendations      
 1664    are made in the Mapping document [IDNA2008-Mapping].                    
 1665                                                                            
 1666 7.4.  The Question of Prefix Changes                                       
 1667                                                                            
 1668    The conditions that would have required a change in the IDNA ACE        
 1669    prefix ("xn--", used in IDNA2003) were of great concern to the          
 1670    community.  A prefix change would have clearly been necessary if the    
 1671    algorithms were modified in a manner that would have created serious    
 1672    ambiguities during subsequent transition in registrations.  This        
 1673    section summarizes the working group's conclusions about the            
 1674    conditions under which a change in the prefix would have been           
 1675    necessary and the implications of such a change.                        
 1676                                                                            
 1677 7.4.1.  Conditions Requiring a Prefix Change                               
 1678                                                                            
 1679    An IDN prefix change would have been needed if a given string would     
 1680    be looked up or otherwise interpreted differently depending on the      
 1681    version of the protocol or tables being used.  This IDNA upgrade        
 1682    would have required a prefix change if, and only if, one of the         
 1683    following four conditions were met:                                     
 1684                                                                            
 1685    1.  The conversion of an A-label to Unicode (i.e., a U-label) would     
 1686        have yielded one string under IDNA2003 and a different string       
 1687        under IDNA2008.                                                     
 1688                                                                            
 1689    2.  In a significant number of cases, an input string that was valid    
 1690        under IDNA2003 and also valid under IDNA2008 would have yielded     
 1691        two different A-labels with the different versions.  This           
 1692        condition is believed to be essentially equivalent to the one       
 1693        above except for a very small number of edge cases that were not    
 1694        found to justify a prefix change (see Section 7.2).                 
 1695                                                                            
 1696        Note that if the input string was valid under one version and not   
 1697        valid under the other, this condition would not apply.  See the     
 1698        first item in Section 7.4.2, below.                                 
 1699                                                                            
 1706    3.  A fundamental change was made to the semantics of the string that   
 1707        would be inserted in the DNS, e.g., if a decision were made to      
 1708        try to include language or script information in the encoding in    
 1709        addition to the string itself.                                      
 1710                                                                            
 1711    4.  A sufficiently large number of characters were added to Unicode
 1712        so that the Punycode mechanism for block offsets could no longer
 1713        reference the higher-numbered planes and blocks.  This condition
 1714        is unlikely even in the long term and certain not to arise in the
 1715        next several years.
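
         The second condition, together with its note, can be sketched
         as a comparison between two conversion functions.  In the
         Python below, idna2003_alabel() wraps ToASCII() from the
         standard encodings.idna module; toy_idna2008_alabel() is a
         deliberately simplified, hypothetical stand-in that performs
         no mapping and rejects uppercase input, and is not an IDNA2008
         implementation.

```python
from encodings.idna import ToASCII
from typing import Callable, Optional

# A converter returns an A-label (or ASCII label), or None if invalid.
Converter = Callable[[str], Optional[str]]


def idna2003_alabel(label: str) -> Optional[str]:
    try:
        return ToASCII(label).decode("ascii")
    except UnicodeError:
        return None


def toy_idna2008_alabel(label: str) -> Optional[str]:
    """Simplified stand-in: no mapping; uppercase input is invalid."""
    if label != label.lower():
        return None
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def raises_condition_2(label: str, old: Converter, new: Converter) -> bool:
    """True if the label is valid under both versions but encodes differently."""
    a_old, a_new = old(label), new(label)
    if a_old is None or a_new is None:
        return False   # valid under only one version: condition not met
    return a_old != a_new


for s in ("b\u00fccher", "B\u00fccher", "stra\u00dfe"):
    print(repr(s), raises_condition_2(s, idna2003_alabel, toy_idna2008_alabel))
```

         Only the Eszett label is flagged; it is one of the edge cases
         that were judged too few to justify a new prefix.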
 1716                                                                            
 1717 7.4.2.  Conditions Not Requiring a Prefix Change                           
 1718                                                                            
 1719    As a result of the principles described above, none of the following    
 1720    changes required a new prefix:                                          
 1721                                                                            
 1722    1.  Prohibition of some characters as input to IDNA.  Such a            
 1723        prohibition might make names that were previously registered        
 1724        inaccessible, but did not change those names.                       
 1725                                                                            
 1726    2.  Adjustments in IDNA tables or actions, including normalization      
 1727        definitions, that affected characters that were already invalid     
 1728        under IDNA2003.                                                     
 1729                                                                            
 1730    3.  Changes in the style of the IDNA definition that did not alter      
 1731        the actions performed by IDNA.                                      
 1732                                                                            
 1733 7.4.3.  Implications of Prefix Changes                                     
 1734                                                                            
 1735    While it might have been possible to make a prefix change, the costs
 1736    of such a change would have been considerable.  Registries could not
 1737    have converted all IDNA2003 ("xn--") registrations to a new form at
 1738    once and synchronized that change with applications supporting
 1739    lookup.  Unless all existing registrations were simply to be declared   
 1740    invalid (and perhaps even then), systems that needed to support both    
 1741    labels with old prefixes and labels with new ones would be required     
 1742    to first process a putative label under the IDNA2008 rules and try to   
 1743    look it up and then, if it were not found, would be required to         
 1744    process the label under IDNA2003 rules and look it up again.  That      
 1745    process would probably have significantly slowed down all processing    
 1746    that involved IDNs in the DNS, especially since a fully-qualified       
 1747    name might contain a mixture of labels that were registered with the    
 1748    old and new prefixes.  That would have made DNS caching very            
 1749    difficult.  In addition, looking up the same input string as two        
 1750    separate A-labels would have created some potential for confusion and   
 1751    attacks, since the labels could map to different targets and then       
 1752    resolve to different entries in the DNS.                                
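
         The two-step procedure described above can be sketched as
         follows.  Everything here is hypothetical: the "zq--" prefix
         was never assigned, the zone is a plain dictionary rather than
         the DNS, and the two converters differ only in the prefix they
         attach.

```python
from typing import Callable, Dict, Optional, Tuple

OLD_PREFIX = "xn--"
NEW_PREFIX = "zq--"   # hypothetical: no such prefix was ever assigned


def encode_with(prefix: str) -> Callable[[str], str]:
    """Build a converter that attaches `prefix` to the Punycode form."""
    return lambda label: prefix + label.encode("punycode").decode("ascii")


def lookup_with_fallback(
    label: str,
    zone: Dict[str, str],
    to_new: Callable[[str], str],
    to_old: Callable[[str], str],
) -> Tuple[Optional[str], int]:
    """Try the new form first, then the old one; count the queries made."""
    queries = 0
    for convert in (to_new, to_old):
        queries += 1
        target = zone.get(convert(label))
        if target is not None:
            return target, queries
    return None, queries


to_old, to_new = encode_with(OLD_PREFIX), encode_with(NEW_PREFIX)
zone = {to_old("b\u00fccher"): "192.0.2.1"}   # registered under the old prefix
print(lookup_with_fallback("b\u00fccher", zone, to_new, to_old))
print(lookup_with_fallback("caf\u00e9", zone, to_new, to_old))
```

         Every label that exists only in the old form, and every label
         that does not exist at all, costs two queries instead of one.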
 1753                                                                            
 1761    Consequently, a prefix change should have been, and was, avoided if     
 1762    at all possible, even if it meant accepting some IDNA2003 decisions
 1763    about character distinctions as irreversible and/or giving special      
 1764    treatment to edge cases.                                                
 1765                                                                            
 1766 7.5.  Stringprep Changes and Compatibility                                 
 1767                                                                            
 1768    The Nameprep specification [RFC3491], a key part of IDNA2003, is a      
 1769    profile of Stringprep [RFC3454].  While Nameprep is a Stringprep        
 1770    profile specific to IDNA, Stringprep is used by a number of other       
 1771    protocols.  Were Stringprep to have been modified by IDNA2008, those    
 1772    changes to improve the handling of IDNs could have caused problems
 1773    for non-DNS uses, most notably if they affected identification and
 1774    authentication protocols.  Several elements of IDNA2008 give            
 1775    interpretations to strings prohibited under IDNA2003 or prohibit        
 1776    strings that IDNA2003 permitted.  Those elements include the new        
 1777    inclusion information in the Tables document [RFC5892], the reduction   
 1778    in the number of characters permitted as input for registration or      
 1779    lookup (Section 3), and even the changes in handling of right-to-left   
 1780    strings as described in the Bidi document [RFC5893].  IDNA2008 does     
 1781    not use Nameprep or Stringprep at all, so there are no side-effect      
 1782    changes to other protocols.                                             
 1783                                                                            
 1784    It is particularly important to keep IDNA processing separate from      
 1785    processing for various security protocols because some of the           
 1786    constraints that are necessary for smooth and comprehensible use of     
 1787    IDNs may be unwanted or undesirable in other contexts.  For example,    
 1788    the criteria for good passwords or passphrases are very different       
 1789    from those for desirable IDNs: passwords should be hard to guess,       
 1790    while domain names should normally be easily memorable.  Similarly,     
 1791    internationalized Small Computer System Interface (SCSI) identifiers    
 1792    and other protocol components are likely to have different              
 1793    requirements than IDNs.                                                 
 1794                                                                            
 1795 7.6.  The Symbol Question                                                  
 1796                                                                            
 1797    One of the major differences between this specification and the         
 1798    original version of IDNA is that IDNA2003 permitted non-letter          
 1799    symbols of various sorts, including punctuation and line-drawing        
 1800    symbols, in the protocol.  They were always discouraged in practice.    
 1801    In particular, both the "IESG Statement" about IDNA and all versions    
 1802    of the ICANN Guidelines specify that only language characters be used   
 1803    in labels.  This specification disallows symbols entirely.  There are   
 1804    several reasons for this, which include:                                
 1805                                                                            
 1806    1.  As discussed elsewhere, the original IDNA specification assumed     
 1807        that as many Unicode characters as possible should be permitted,    
 1808        directly or via mapping to other characters, in IDNs.  This         
 1816        specification operates on an inclusion model, extrapolating from    
 1817        the original "hostname" rules (LDH, see the Definitions document    
 1818        [RFC5890]) -- which have served the Internet very well -- to a      
 1819        Unicode base rather than an ASCII base.                             
 1820                                                                            
 1821    2.  Symbol names are more problematic than letters because there may    
 1822        be no general agreement on whether a particular glyph matches a     
 1823        symbol; there are no uniform conventions for naming; variations     
 1824        such as outline, solid, and shaded forms may or may not exist;      
 1825        and so on.  As just one example, consider a "heart" symbol as it    
 1826        might appear in a logo that might be read as "I love...".  While    
 1827        the user might read such a logo as "I love..." or "I heart...",     
 1828        considerable knowledge of the coding distinctions made in Unicode   
 1829        is needed to know that there is more than one "heart" character     
 1830        (e.g., U+2665, U+2661, and U+2765) and how to describe it.  These   
 1831        issues are of particular importance if strings are expected to be   
 1832        understood or transcribed by the listener after being read out      
 1833        loud.                                                               
 1834                                                                            
 1835    3.  Design of a screen reader used by blind Internet users who must     
 1836        listen to renderings of IDN domain names and possibly reproduce     
 1837        them on the keyboard becomes considerably more complicated when     
 1838        the names of characters are not obvious and intuitive to anyone     
 1839        familiar with the language in question.                             
 1840                                                                            
 1841    4.  As a simplified example of this, assume one wanted to use a         
 1842        "heart" or "star" symbol in a label.  This is problematic because   
 1843        those names are ambiguous in the Unicode system of naming (the      
 1844        actual Unicode names require far more qualification).  A user or    
 1845        would-be registrant has no way to know -- absent careful study of   
 1846        the code tables -- whether it is ambiguous (e.g., where there are   
 1847        multiple "heart" characters) or not.  Conversely, the user seeing   
 1848        the hypothetical label doesn't know whether to read it -- try to    
 1849        transmit it to a colleague by voice -- as "heart", as "love", as    
 1850        "black heart", or as any of the other examples below.               
 1851                                                                            
 1852    5.  The actual situation is even worse than this.  There is no
 1853        possible way for a normal, casual user to tell the difference
 1854        between the hearts of U+2665 and U+2765 and the stars of U+2606
 1855        and U+2729 without somehow knowing to look for a distinction.  We
 1856        have a white heart (U+2661) and a few black ones.  Consequently,
 1857        describing a label as containing a heart is hopelessly ambiguous:   
 1858        we can only know that it contains one of several characters that    
 1859        look like hearts or have "heart" in their names.  In cities where   
 1860        "Square" is a popular part of a location name, one might well       
 1861        want to use a square symbol in a label as well and there are far    
 1862        more squares of various flavors in Unicode than there are hearts    
 1863        or stars.                                                           
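
         The ambiguity is easy to observe with the Unicode character
         database.  The following Python sketch prints the formal names
         and general categories of the code points mentioned above and
         counts names containing a given word; the helper names and the
         search range are choices made for this example, and the counts
         depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Code points named in the text above.
CANDIDATES = ["\u2665", "\u2661", "\u2765", "\u2606", "\u2729"]


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX  <category>  <name>'."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {name}"


def named_like(word: str, start: int = 0x2000, stop: int = 0x2C00) -> list:
    """Code points in [start, stop) whose character name contains `word`."""
    return [cp for cp in range(start, stop)
            if word in unicodedata.name(chr(cp), "")]


for ch in CANDIDATES:
    print(describe(ch))

print(len(named_like("HEART")), "names containing HEART in U+2000..U+2BFF")
print(len(named_like("SQUARE")), "names containing SQUARE in U+2000..U+2BFF")
```

         None of the formal names is simply "heart" or "star", which is
         the difficulty a user faces when such a label is read aloud.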
 1864                                                                            
 1871    The consequence of these ambiguities is that symbols are a very poor    
 1872    basis for reliable communication.  Consistent with this conclusion,     
 1873    the Unicode standard recommends that strings used in identifiers not    
 1874    contain symbols or punctuation [Unicode-UAX31].  Of course, these       
 1875    difficulties with symbols do not arise with actual pictographic         
 1876    languages and scripts which would be treated like any other language    
 1877    characters; the two should not be confused.                             
 1878                                                                            
 1879 7.7.  Migration between Unicode Versions: Unassigned Code Points           
 1880                                                                            
 1881    In IDNA2003, labels containing unassigned code points are looked up     
 1882    on the assumption that, if they appear in labels and can be mapped      
 1883    and then resolved, the relevant standards must have changed and the     
 1884    registry has properly allocated only assigned values.                   
 1885                                                                            
 1886    In the IDNA2008 protocol, strings containing unassigned code points     
 1887    must not be either looked up or registered.  In summary, the status     
 1888    of an unassigned character with regard to the DISALLOWED,               
 1889    PROTOCOL-VALID, and CONTEXTUAL RULE REQUIRED categories cannot be       
 1890    evaluated until a character is actually assigned and known.  There      
 1891    are several reasons for this, with the most important ones being:       
 1892                                                                            
 1893    o  Tests involving the context of characters (e.g., some characters     
 1894       being permitted only adjacent to others of specific types) and       
 1895       integrity tests on complete labels are needed.  Unassigned code      
 1896       points cannot be permitted because one cannot determine whether      
 1897       particular code points will require contextual rules (and what       
 1898       those rules should be) before characters are assigned to them and    
 1899       the properties of those characters fully understood.                 
 1900                                                                            
 1901    o  It cannot be known in advance, and with sufficient reliability,      
 1902       whether a newly assigned code point will be associated with a        
 1903       character that would be disallowed by the rules in the Tables        
 1904       document [RFC5892] (such as a compatibility character).  In          
 1905       IDNA2003, since there is no direct dependency on NFKC (many of the   
 1906       entries in Stringprep's tables are based on NFKC, but IDNA2003       
 1907       depends only on Stringprep), allocation of a compatibility           
 1908       character might produce some odd situations, but it would not be a   
 1909       problem.  In IDNA2008, where compatibility characters are            
 1910       DISALLOWED unless character-specific exceptions are made,            
 1911       permitting strings containing unassigned characters to be looked     
 1912       up would violate the principle that characters in DISALLOWED are     
 1913       not looked up.                                                       
 1914                                                                            
 1915    o  The Unicode Standard specifies that an unassigned code point         
 1916       normalizes (and, where relevant, case folds) to itself.  If the      
 1917       code point is later assigned to a character, and particularly if     
 1918       the newly assigned code point has a combining class that             
 1926       determines its placement relative to other combining characters,     
 1927       it could normalize to some other code point or sequence.             
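
         A lookup application could enforce the rule with a check along
         the following lines.  This is a sketch only: it uses Python's
         unicodedata module, treats general category Cn (which also
         covers noncharacters) as "unassigned", and therefore reflects
         whatever Unicode version the interpreter bundles.  The
         function names are hypothetical, and U+0378 is assumed to be
         unassigned in that version.

```python
import unicodedata


def has_unassigned(label: str) -> bool:
    """True if any code point has general category Cn in this Unicode version."""
    return any(unicodedata.category(ch) == "Cn" for ch in label)


def may_look_up(label: str) -> bool:
    """Refuse lookup (and registration) if an unassigned code point is present."""
    return not has_unassigned(label)


print("Unicode data version:", unicodedata.unidata_version)
sample = "abc\u0378"   # U+0378: assumed unassigned in this Unicode version
print(may_look_up("abc"), may_look_up(sample))

# An unassigned code point is left unchanged by normalization today;
# that may no longer hold once a character is assigned to it.
print(unicodedata.normalize("NFKC", "\u0378") == "\u0378")
```

         Because the answer depends on the tables in use, two clients
         with different Unicode versions can disagree about the same
         label, which is one reason the check is tied to assignment
         rather than guessed in advance.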
 1928                                                                            
 1929    It is possible to argue that the issues above are not important and     
 1930    that, as a consequence, it is better to retain the principle of         
 1931    looking up labels even if they contain unassigned characters because    
 1932    all of the important scripts and characters have been coded as of       
 1933    Unicode 5.2 (or even earlier), and hence unassigned code points will    
 1934    be assigned only to obscure characters or archaic scripts.              
 1935    Unfortunately, that does not appear to be a safe assumption for at      
 1936    least two reasons.  First, much the same claim of completeness has      
 1937    been made for earlier versions of Unicode.  The reality is that a       
 1938    script that is obscure to much of the world may still be very           
 1939    important to those who use it.  Cultural and linguistic preservation    
 1940    principles make it inappropriate to declare the script of no            
 1941    importance in IDNs.  Second, we already have counterexamples, e.g.,     
 1942    in the relationships associated with new Han characters being added     
 1943    (whether in the BMP or in Unicode Plane 2).                             
 1944                                                                            
 1945    Independent of the technical transition issues identified above, it     
 1946    can be observed that any addition of characters to an existing script   
 1947    to make it easier to use or to better accommodate particular            
 1948    languages may lead to transition issues.  Such additions may change     
 1949    the preferred form for writing a particular string, changes that may    
 1950    be reflected, e.g., in keyboard transition modules that would           
 1951    necessarily be different from those for earlier versions of Unicode     
 1952    where the newer characters may not exist.  This creates an inherent     
 1953    transition problem because attempts to access labels may use either     
 1954    the old or the new conventions, requiring registry action whether or    
 1955    not the older conventions were used in labels.  The need to consider    
 1956    transition mechanisms is inherent to evolution of Unicode to better     
 1957    accommodate writing systems and is independent of how IDNs are          
 1958    represented in the DNS or how transitions among versions of those       
 1959    mechanisms occur.  The requirement for transitions of this type is      
 1960    illustrated by the addition of Malayalam Chillu in Unicode 5.1.0.       
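
   As an illustration only (it is not part of the IDNA specifications),
   the version dependence described above can be observed with Python's
   standard unicodedata module, which exposes both its current Unicode
   database and a frozen Unicode 3.2.0 database.  The helper name
   unassigned_code_points and the sample string are hypothetical.
   General category "Cn" also covers noncharacters, so this is only a
   rough approximation of the UNASSIGNED category.

```python
import unicodedata


def unassigned_code_points(label, ucd=unicodedata):
    """Return the code points in `label` whose general category is Cn
    (unassigned or noncharacter) according to the database `ucd`."""
    return [f"U+{ord(ch):04X}" for ch in label if ucd.category(ch) == "Cn"]


# Arbitrary test string: MALAYALAM LETTER NA followed by U+0D7A
# MALAYALAM LETTER CHILLU NN, one of the Chillu characters added in
# Unicode 5.1.0.
sample = "\u0d28\u0d7a"

# View of software built on the Unicode 3.2.0 tables.
old_view = unassigned_code_points(sample, unicodedata.ucd_3_2_0)
# View of software built on the tables shipped with this Python.
new_view = unassigned_code_points(sample)

print("Unicode 3.2.0:", old_view)
print(f"Unicode {unicodedata.unidata_version}:", new_view)
```

   Software working from the older tables sees an unassigned code point
   in a string that newer software treats as ordinary text, which is
   the kind of disagreement that calls for a transition mechanism.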
 1961                                                                            
 1962 7.8.  Other Compatibility Issues                                           
 1963                                                                            
 1964    The 2003 IDNA model includes several odd artifacts of the context in    
 1965    which it was developed.  Many, if not all, of these are potential       
 1966    avenues for exploits, especially if the registration process permits    
 1967    "source" names (names that have not been processed through IDNA and     
 1968    Nameprep) to be registered.  As one example, since the character        
 1969    Eszett, used in German, is mapped by IDNA2003 into the sequence "ss"    
 1970    rather than being retained as itself or prohibited, a string            
 1971    containing that character, but that is otherwise in ASCII, is not       
 1972    really an IDN (in the U-label sense defined above).  After Nameprep     
 1973    maps out the Eszett, the result is an ASCII string and so it does not   
 1981    get an xn-- prefix, but the string that can be displayed to a user      
 1982    appears to be an IDN.  IDNA2008 eliminates this artifact.  A            
 1983    character is either permitted as itself or it is prohibited; special    
 1984    cases that make sense only in a particular linguistic or cultural       
 1985    context can be dealt with as localization matters where appropriate.    
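
   The Eszett artifact can be reproduced with Python's built-in "idna"
   codec, which implements the 2003 protocol (RFC 3490 with Nameprep).
   The second form below is only a sketch: it applies raw Punycode to
   the unmapped label and adds the prefix by hand, skipping all of the
   validation that the Protocol document [RFC5891] requires.  It is
   shown only to make visible what retaining the character looks like.

```python
label = "stra\u00dfe"  # "strasse" written with Eszett (U+00DF)

# IDNA2003: Nameprep maps U+00DF to "ss", so the result is plain ASCII
# and carries no "xn--" prefix.
idna2003_form = label.encode("idna")

# Character retained as itself: raw Punycode of the unmapped label with
# the ACE prefix added by hand (assumption: no IDNA2008 validation here).
retained_form = b"xn--" + label.encode("punycode")

print(idna2003_form)
print(retained_form)
```

   The first result is indistinguishable from a label that was
   registered in ASCII, while the second decodes back to the original
   string.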
 1986                                                                            
 1987 8.  Name Server Considerations                                             
 1988                                                                            
 1989 8.1.  Processing Non-ASCII Strings                                         
 1990                                                                            
 1991    Existing DNS servers do not know the IDNA rules for handling            
 1992    non-ASCII forms of IDNs, and therefore need to be shielded from them.   
 1993    All existing channels through which names can enter a DNS server        
 1994    database (for example, master files (as described in RFC 1034) and      
 1995    DNS update messages [RFC2136]) could not be IDNA-aware because they     
 1996    predate IDNA.  Other sections of this document provide the needed       
 1997    shielding by ensuring that internationalized domain names entering      
 1998    DNS server databases through such channels have already been            
 1999    converted to their equivalent ASCII A-label forms.                      
 2000                                                                            
 2001    Because of the distinction made between the algorithms for              
 2002    Registration and Lookup in Sections 4 and 5 (respectively) of the       
 2003    Protocol document [RFC5891] (a domain name containing only ASCII code   
 2004    points cannot be converted to an A-label), there cannot be more than    
 2005    one A-label form for any given U-label.                                 
 2006                                                                            
 2007    As specified in clarifications to the DNS specification [RFC2181],      
 2008    the DNS protocol explicitly allows domain labels to contain octets      
 2009    beyond the ASCII range (0000..007F), and this document does not         
 2010    change that.  However, although the interpretation of octets            
 2011    0080..00FF is well-defined in the DNS, many application protocols       
 2012    support only ASCII labels and there is no defined interpretation of     
 2013    these non-ASCII octets as characters and, in particular, no             
 2014    interpretation of case-independent matching for them (e.g., see the     
 2015    clarification on DNS case insensitivity [RFC4343]).  If labels          
 2016    containing these octets are returned to applications, unpredictable     
 2017    behavior could result.  The A-label form, which cannot contain those    
 2018    characters, is the only standard representation for internationalized   
 2019    labels in the DNS protocol.                                             
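
   The point about case-independent matching can be sketched as
   follows.  The function name dns_labels_match is hypothetical, and
   the comparison is a simplified model of the ASCII-only folding
   described in [RFC4343], not the behavior of any particular server.

```python
def dns_labels_match(a: bytes, b: bytes) -> bool:
    """Compare two labels octet by octet, folding only ASCII letters.

    bytes.lower() changes only the octets 0x41..0x5A, so octets in the
    range 0x80..0xFF are compared exactly as they are.
    """
    return a.lower() == b.lower()


# ASCII labels, including A-labels, match regardless of case.
print(dns_labels_match(b"EXAMPLE", b"example"))
print(dns_labels_match(b"XN--BCHER-KVA", b"xn--bcher-kva"))

# Raw UTF-8 octets for U+00C9 and U+00E9 (upper- and lowercase e with
# acute) are just different octets: no case folding is defined.
upper = "\u00c9".encode("utf-8")
lower = "\u00e9".encode("utf-8")
print(dns_labels_match(upper, lower))
```

   Because A-labels contain only ASCII, they fall entirely within the
   part of the octet range for which matching behavior is defined.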
 2020                                                                            
 2021 8.2.  Root and Other DNS Server Considerations                             
 2022                                                                            
 2023    IDNs in A-label form will generally be somewhat longer than current     
 2024    domain names, so the bandwidth needed by the root servers is likely     
 2025    to go up by a small amount.  Also, queries and responses for IDNs       
 2026    will probably be somewhat longer than typical queries historically,     
 2036    so Extension Mechanisms for DNS (EDNS0) [RFC2671] support may be more   
 2037    important (otherwise, queries and responses may be forced to go to      
 2038    TCP instead of UDP).                                                    
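
   A rough feel for the size difference can be obtained by comparing a
   label's length in characters with the length of its A-label in
   octets.  The helper a_label below is a hypothetical shortcut that
   applies raw Punycode and the prefix without the checks required by
   the Protocol document [RFC5891]; the sample labels are arbitrary.

```python
def a_label(u_label: str) -> bytes:
    """Sketch only: Punycode-encode a label and add the ACE prefix."""
    return b"xn--" + u_label.encode("punycode")


samples = ["b\u00fccher", "\u4f8b\u3048"]

for u in samples:
    a = a_label(u)
    # Characters as the user sees them versus octets sent in the query.
    print(f"{len(u)} characters -> {len(a)} octets: {a.decode('ascii')}")
```

   The growth per label is small, but it accumulates across the labels
   of a name and across the records in a response.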
 2039                                                                            
 2040 9.  Internationalization Considerations                                    
 2041                                                                            
 2042    DNS labels and fully-qualified domain names provide mnemonics that      
 2043    assist in identifying and referring to resources on the Internet.       
 2044    IDNs expand the range of those mnemonics to include those based on      
 2045    languages and character sets other than Western European and Roman-     
 2046    derived ones.  But domain "names" are not, in general, words in any     
 2047    language.  The recommendations of the IETF policy on character sets     
 2048    and languages (BCP 18 [RFC2277]) are applicable to situations in        
 2049    which language identification is used to provide language-specific      
 2050    contexts.  The DNS is, by contrast, global and international and        
 2051    ultimately has nothing to do with languages.  Adding languages (or      
 2052    similar context) to IDNs generally, or to DNS matching in particular,   
 2053    would imply context-dependent matching in DNS, which would be a very    
 2054    significant change to the DNS protocol itself.  It would also imply     
 2055    that users would need to identify the language associated with a        
 2056    particular label in order to look that label up.  That knowledge is     
 2057    generally not available because many labels are not words in any        
 2058    language and some may be words in more than one.                        
 2059                                                                            
 2060 10.  IANA Considerations                                                   
 2061                                                                            
 2062    This section gives an overview of IANA registries required for IDNA.    
 2063    The actual definitions of, and specifications for, the first two,       
 2064    which have been newly created for IDNA2008, appear in the Tables        
 2065    document [RFC5892].  This document describes the registries, but it     
 2066    does not specify any IANA actions.                                      
 2067                                                                            
 2068 10.1.  IDNA Character Registry                                             
 2069                                                                            
 2070    The distinction among the major categories "UNASSIGNED",                
 2071    "DISALLOWED", "PROTOCOL-VALID", and "CONTEXTUAL RULE REQUIRED" is       
 2072    made by special categories and rules that are integral elements of      
 2073    the Tables document.  While not normative, an IANA registry of          
 2074    characters and scripts and their categories, updated for each new       
 2075    version of Unicode and the characters it contains, is convenient for    
 2076    programming and validation purposes.  The details of this registry      
 2077    are specified in the Tables document.                                   
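
   To suggest how such a registry might be used in programming, the
   sketch below hard-codes a handful of entries.  The names Category,
   SAMPLE_TABLE, categorize, and needs_contextual_rule are
   hypothetical, and the table is a tiny hand-picked sample; the
   authoritative values are those derived under the Tables document
   [RFC5892], which splits "CONTEXTUAL RULE REQUIRED" into CONTEXTJ and
   CONTEXTO.

```python
from enum import Enum


class Category(Enum):
    PVALID = "PROTOCOL-VALID"
    CONTEXTJ = "CONTEXTUAL RULE REQUIRED (join controls)"
    CONTEXTO = "CONTEXTUAL RULE REQUIRED (other)"
    DISALLOWED = "DISALLOWED"
    UNASSIGNED = "UNASSIGNED"


# Illustrative sample only; a real implementation would load the full
# table for a specific Unicode version.
SAMPLE_TABLE = {
    0x002D: Category.PVALID,      # HYPHEN-MINUS
    0x0041: Category.DISALLOWED,  # LATIN CAPITAL LETTER A
    0x0061: Category.PVALID,      # LATIN SMALL LETTER A
    0x00B7: Category.CONTEXTO,    # MIDDLE DOT
    0x00DF: Category.PVALID,      # LATIN SMALL LETTER SHARP S
    0x200D: Category.CONTEXTJ,    # ZERO WIDTH JOINER
}


def categorize(label: str):
    """Pair each code point with its sample category (None if the code
    point is not in this small sample)."""
    return [(f"U+{ord(ch):04X}", SAMPLE_TABLE.get(ord(ch))) for ch in label]


def needs_contextual_rule(label: str) -> bool:
    """True if any code point in the label requires a contextual rule."""
    contextual = (Category.CONTEXTJ, Category.CONTEXTO)
    return any(SAMPLE_TABLE.get(ord(ch)) in contextual for ch in label)


for code_point, category in categorize("a\u00dfA\u00b7"):
    print(code_point, category.value if category else "not in sample")
```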
 2078                                                                            
 2091 10.2.  IDNA Context Registry                                               
 2092                                                                            
 2093    IANA has created and now maintains a list of approved contextual        
 2094    rules for characters that are defined in the IDNA Character Registry    
 2095    list as requiring a Contextual Rule (i.e., the types of rules           
 2096    described in Section 3.1.2).  The details for those rules appear in     
 2097    the Tables document.                                                    
 2098                                                                            
 2099 10.3.  IANA Repository of IDN Practices of TLDs                            
 2100                                                                            
 2101    This registry, historically described as the "IANA Language Character   
 2102    Set Registry" or "IANA Script Registry" (both somewhat misleading       
 2103    terms), is maintained by IANA at the request of ICANN.  It is used to   
 2104    provide a central documentation repository of the IDN policies used     
 2105    by top level domain (TLD) registries who volunteer to contribute to     
 2106    it and is used in conjunction with ICANN Guidelines for IDN use.        
 2107                                                                            
 2108    It is not an IETF-managed registry and, while the protocol changes      
 2109    specified here may call for some revisions to the tables, IDNA2008      
 2110    has no direct effect on that registry and no IANA action is required    
 2111    as a result.                                                            
 2112                                                                            
 2113 11.  Security Considerations                                               
 2114                                                                            
 2115 11.1.  General Security Issues with IDNA                                   
 2116                                                                            
 2117    This document is purely explanatory and informational and               
 2118    consequently introduces no new security issues.  It would, of course,   
 2119    be a poor idea for someone to try to implement from it; such an         
 2120    attempt would almost certainly lead to interoperability problems and    
 2121    might lead to security ones.  A discussion of security issues with      
 2122    IDNA, including some relevant history, appears in the Definitions       
 2123    document [RFC5890].                                                     
 2124                                                                            
 2125 12.  Acknowledgments                                                       
 2126                                                                            
 2127    The editor and contributors would like to express their thanks to       
 2128    those who contributed significant early (pre-working group) review      
 2129    comments, sometimes accompanied by text: Paul Hoffman, Simon            
 2130    Josefsson, and Sam Weiler.  In addition, some specific ideas were       
 2131    incorporated from suggestions, text, or comments about sections that    
 2132    were unclear supplied by Vint Cerf, Frank Ellerman, Michael Everson,    
 2133    Asmus Freytag, Erik van der Poel, Michel Suignard, and Ken Whistler.    
 2134    Thanks are also due to Vint Cerf, Lisa Dusseault, Debbie Garside, and   
 2135    Jefsey Morfin for conversations that led to considerable improvements   
 2136    in the content of this document and to several others, including Ben    
 2146    Campbell, Martin Duerst, Subramanian Moonesamy, Peter Saint-Andre,      
 2147    and Dan Winship, for catching specific errors and recommending          
 2148    corrections.                                                            
 2149                                                                            
 2150    A meeting was held on 30 January 2008 to attempt to reconcile           
 2151    differences in perspective and terminology about this set of            
 2152    specifications between the design team and members of the Unicode       
 2153    Technical Consortium.  The discussions at and subsequent to that        
 2154    meeting were very helpful in focusing the issues and in refining the    
 2155    specifications.  The active participants at that meeting were (in       
 2156    alphabetic order, as usual) Harald Alvestrand, Vint Cerf, Tina Dam,     
 2157    Mark Davis, Lisa Dusseault, Patrik Faltstrom (by telephone), Cary       
 2158    Karp, John Klensin, Warren Kumari, Lisa Moore, Erik van der Poel,       
 2159    Michel Suignard, and Ken Whistler.  We express our thanks to Google     
 2160    for support of that meeting and to the participants for their           
 2161    contributions.                                                          
 2162                                                                            
 2163    Useful comments and text on the working group versions of the draft     
 2164    were received from many participants in the IETF "IDNABIS" working      
 2165    group, and a number of document changes resulted from discussions on    
 2166    that group's mailing list.  Marcos Sanz provided specific               
 2167    analysis and suggestions that were exceptionally helpful in refining    
 2168    the text, as did Vint Cerf, Martin Duerst, Andrew Sullivan, and Ken     
 2169    Whistler.  Lisa Dusseault provided extensive editorial suggestions      
 2170    during the spring of 2009, most of which were incorporated.             
 2171                                                                            
 2172 13.  Contributors                                                          
 2173                                                                            
 2174    While the listed editor held the pen, the core of this document and     
 2175    the initial working group version represent the joint work and          
 2176    conclusions of an ad hoc design team consisting of the editor and, in   
 2177    alphabetic order, Harald Alvestrand, Tina Dam, Patrik Faltstrom, and    
 2178    Cary Karp.  Considerable material describing mapping principles has     
 2179    been incorporated from a draft of the Mapping document                  
 2180    [IDNA2008-Mapping] by Pete Resnick and Paul Hoffman.  In addition,      
 2181    there were many specific contributions and helpful comments from        
 2182    those listed in the Acknowledgments section and others who have         
 2183    contributed to the development and use of the IDNA protocols.           
 2184                                                                            
 2185 14.  References                                                            
 2186                                                                            
 2187 14.1.  Normative References                                                
 2188                                                                            
 2189    [RFC3490]    Faltstrom, P., Hoffman, P., and A. Costello,               
 2190                 "Internationalizing Domain Names in Applications           
 2191                 (IDNA)", RFC 3490, March 2003.                             
 2192                                                                            
 2201    [RFC3492]    Costello, A., "Punycode: A Bootstring encoding of          
 2202                 Unicode for Internationalized Domain Names in              
 2203                 Applications (IDNA)", RFC 3492, March 2003.                
 2204                                                                            
 2205    [RFC5890]    Klensin, J., "Internationalized Domain Names for           
 2206                 Applications (IDNA): Definitions and Document              
 2207                 Framework", RFC 5890, August 2010.                         
 2208                                                                            
 2209    [RFC5891]    Klensin, J., "Internationalized Domain Names in            
 2210                 Applications (IDNA): Protocol", RFC 5891, August 2010.     
 2211                                                                            
 2212    [RFC5892]    Faltstrom, P., "The Unicode Code Points and                
 2213                 Internationalized Domain Names for Applications (IDNA)",   
 2214                 RFC 5892, August 2010.                                     
 2215                                                                            
 2216    [RFC5893]    Alvestrand, H. and C. Karp, "Right-to-Left Scripts for     
 2217                 Internationalized Domain Names for Applications (IDNA)",   
 2218                 RFC 5893, August 2010.                                     
 2219                                                                            
 2220    [Unicode52]  The Unicode Consortium.  The Unicode Standard, Version     
 2221                 5.2.0, defined by: "The Unicode Standard, Version          
 2222                 5.2.0", (Mountain View, CA: The Unicode Consortium,        
 2223                 2009. ISBN 978-1-936213-00-9).                             
 2224                 <http://www.unicode.org/versions/Unicode5.2.0/>.           
 2225                                                                            
 2226 14.2.  Informative References                                              
 2227                                                                            
 2228    [IDNA2008-Mapping]                                                      
 2229                 Resnick, P. and P. Hoffman, "Mapping Characters in         
 2230                 Internationalized Domain Names for Applications (IDNA)",   
 2231                 Work in Progress, April 2010.                              
 2232                                                                            
 2233    [RFC0952]    Harrenstien, K., Stahl, M., and E. Feinler, "DoD           
 2234                 Internet host table specification", RFC 952,               
 2235                 October 1985.                                              
 2236                                                                            
 2237    [RFC1034]    Mockapetris, P., "Domain names - concepts and              
 2238                 facilities", STD 13, RFC 1034, November 1987.              
 2239                                                                            
 2240    [RFC1035]    Mockapetris, P., "Domain names - implementation and        
 2241                 specification", STD 13, RFC 1035, November 1987.           
 2242                                                                            
 2243    [RFC1123]    Braden, R., "Requirements for Internet Hosts -             
 2244                 Application and Support", STD 3, RFC 1123, October 1989.   
 2245                                                                            
 2246    [RFC2136]    Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,         
 2247                 "Dynamic Updates in the Domain Name System (DNS            
 2248                 UPDATE)", RFC 2136, April 1997.                            
 2249                                                                            
 2256    [RFC2181]    Elz, R. and R. Bush, "Clarifications to the DNS            
 2257                 Specification", RFC 2181, July 1997.                       
 2258                                                                            
 2259    [RFC2277]    Alvestrand, H., "IETF Policy on Character Sets and         
 2260                 Languages", BCP 18, RFC 2277, January 1998.                
 2261                                                                            
 2262    [RFC2671]    Vixie, P., "Extension Mechanisms for DNS (EDNS0)",         
 2263                 RFC 2671, August 1999.                                     
 2264                                                                            
 2265    [RFC2782]    Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for   
 2266                 specifying the location of services (DNS SRV)",            
 2267                 RFC 2782, February 2000.                                   
 2268                                                                            
 2269    [RFC3454]    Hoffman, P. and M. Blanchet, "Preparation of               
 2270                 Internationalized Strings ("stringprep")", RFC 3454,       
 2271                 December 2002.                                             
 2272                                                                            
 2273    [RFC3491]    Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep       
 2274                 Profile for Internationalized Domain Names (IDN)",         
 2275                 RFC 3491, March 2003.                                      
 2276                                                                            
 2277    [RFC3743]    Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint        
 2278                 Engineering Team (JET) Guidelines for Internationalized    
 2279                 Domain Names (IDN) Registration and Administration for     
 2280                 Chinese, Japanese, and Korean", RFC 3743, April 2004.      
 2281                                                                            
 2282    [RFC3987]    Duerst, M. and M. Suignard, "Internationalized Resource    
 2283                 Identifiers (IRIs)", RFC 3987, January 2005.               
 2284                                                                            
 2285    [RFC4290]    Klensin, J., "Suggested Practices for Registration of      
 2286                 Internationalized Domain Names (IDN)", RFC 4290,           
 2287                 December 2005.                                             
 2288                                                                            
 2289    [RFC4343]    Eastlake, D., "Domain Name System (DNS) Case               
 2290                 Insensitivity Clarification", RFC 4343, January 2006.      
 2291                                                                            
 2292    [RFC4690]    Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review     
 2293                 and Recommendations for Internationalized Domain Names     
 2294                 (IDNs)", RFC 4690, September 2006.                         
 2295                                                                            
 2296    [RFC4713]    Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,       
 2297                 "Registration and Administration Recommendations for       
 2298                 Chinese Domain Names", RFC 4713, October 2006.             
 2299                                                                            
 2311    [Unicode-UAX31]                                                         
 2312                 The Unicode Consortium, "Unicode Standard Annex #31:       
 2313                 Unicode Identifier and Pattern Syntax, Revision 11",       
 2314                 September 2009,                                            
 2315                 <http://www.unicode.org/reports/tr31/tr31-11.html>.        
 2316                                                                            
 2317    [Unicode-UTS39]                                                         
 2318                 The Unicode Consortium, "Unicode Technical Standard #39:   
 2319                 Unicode Security Mechanisms, Revision 2", August 2006,     
 2320                 <http://www.unicode.org/reports/tr39/tr39-2.html>.         
 2321                                                                            
 2322 Author's Address                                                           
 2323                                                                            
 2324    John C Klensin                                                          
 2325    1770 Massachusetts Ave, Ste 322                                         
 2326    Cambridge, MA  02140                                                    
 2327    USA                                                                     
 2328                                                                            
 2329    Phone: +1 617 245 1457                                                  
 2330    EMail: john+ietf@jck.com                                                
 2331                                                                            