Independent Submission                                        P. Resnick
Request for Comments: 5895                         Qualcomm Incorporated
Category: Informational                                       P. Hoffman
ISSN: 2070-1721                                           VPN Consortium
                                                          September 2010


                         Mapping Characters for
       Internationalized Domain Names in Applications (IDNA) 2008

Abstract

   In the original version of the Internationalized Domain Names in
   Applications (IDNA) protocol, any Unicode code points taken from user
   input were mapped into a set of Unicode code points that "made
   sense", and then encoded and passed to the domain name system (DNS).
   The IDNA2008 protocol (described in RFCs 5890, 5891, 5892, and 5893)
   presumes that the input to the protocol comes from a set of
   "permitted" code points, which it then encodes and passes to the DNS,
   but does not specify what to do with the result of user input.  This
   document describes the actions that can be taken by an implementation
   between receiving user input and passing permitted code points to the
   new IDNA protocol.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This is a contribution to the RFC Series, independently of any other
   RFC stream.  The RFC Editor has chosen to publish this document at
   its discretion and makes no statement about its value for
   implementation or deployment.  Documents approved for publication by
   the RFC Editor are not a candidate for any level of Internet
   Standard; see Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc5895.

Resnick & Hoffman             Informational                     [Page 1]

RFC 5895                      IDNA Mapping                September 2010

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

1.  Introduction

   This document describes the operations that can be applied to user
   input in order to get it into a form that is acceptable to the
   Internationalized Domain Names in Applications (IDNA) protocol
   [IDNA2008protocol].  It includes a general implementation procedure
   for mapping.

   It should be noted that this document does not specify the behavior
   of a protocol that appears "on the wire".  It describes an operation
   that is to be applied to user input in order to prepare that user
   input for use in an "on the network" protocol.  As unusual as this
   may be for a document concerning Internet protocols, it is necessary
   to describe this operation for implementors who may have designed
   around the original IDNA protocol (herein referred to as IDNA2003),
   which conflates this user-input operation into the protocol.

   It is very important to note that there are many potential valid
   mappings of characters from user input.  The mapping described in
   this document is the basis for other mappings, and is not likely to
   be useful without modification.  Any useful mapping will have
   features designed to reduce the surprise for users and is likely to
   be slightly (or sometimes radically) different depending on the
   locale of the user, the type of input being used (such as typing,
   copy-and-paste, voice, and so on), the type of application used, etc.
   Although most common mappings will probably produce similar results
   for the same input, there will be subtle differences between
   applications.

1.1.  The Dividing Line between User Interface and Protocol

   The user interface to applications is much more complicated than most
   network implementers think.  When we say "the user enters an
   internationalized domain name in the application", we are talking
   about a very complex process that encompasses everything from the
   user formulating the name and deciding which symbols to use to
   express that name, to the user entering the symbols into the computer
   using some input method (be it a keyboard, a stylus, or even a voice
   recognition program), to the computer interpreting that input (be it
   keyboard scan codes, a graphical representation, or digitized sounds)
   into some representation of those symbols, and finally to
   normalizing those symbols into a particular character repertoire in
   an encoding recognizable to IDNA processes and the domain name
   system.

   Considerations for a user interface for internationalized domain
   names involve taking into account culture, context, and locale for
   any given user.  A simple and well-known example is the lowercasing
   of the letter LATIN CAPITAL LETTER I (U+0049) when it is used in
   Turkish and other languages.  A capital "I" in Turkish is properly
   lowercased to a LATIN SMALL LETTER DOTLESS I (U+0131), not to a LATIN
   SMALL LETTER I (U+0069).  This lowercasing is clearly dependent on
   the locale of the system and/or the locale of the user.  Using a
   single context-free mapping without considering the user interface
   properties has the potential of doing exactly the wrong thing for the
   user.

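   The Turkish example can be made concrete with a short sketch.  It
   is an illustration only: the function names are hypothetical, and
   the Turkish-aware variant handles just the two letter pairs needed
   for the example rather than a complete set of locale rules.

```python
def lowercase_context_free(text: str) -> str:
    """Lowercase with Python's built-in, locale-independent rules."""
    return text.lower()


def lowercase_turkish(text: str) -> str:
    """Hypothetical Turkish-aware lowercasing (illustration only)."""
    # Handle the Turkish-specific pairs before the generic mapping:
    #   U+0049 (I)                -> U+0131 (dotless i)
    #   U+0130 (I with dot above) -> U+0069 (i)
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


word = "ILIK"
print(lowercase_context_free(word))  # ilik
print(lowercase_turkish(word))       # ılık
```

   The same input yields two different strings, so a user interface
   that knows the user's locale may need to choose between them before
   any IDNA processing takes place.
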
   The original version of IDNA conflated user interface processing and
   protocol.  It took whatever characters the user produced in whatever
   encoding the application used, assumed some conversion to Unicode
   code points, and then without regard to context, locale, or anything
   about the user's intentions, mapped them into a particular set of
   other characters, and then re-encoded them in Punycode, in order to
   have the entire operation be contained within the protocol.  Ignoring
   context, locale, and user preference in the IDNA protocol made life
   significantly less complicated for the application developer, but at
   the expense of violating the principle of "least user surprise" for
   consumers and producers of domain names.

   In IDNA2008, the dividing line between "user interface" and
   "protocol" is clear.  The IDNA2008 specification defines the protocol
   part of IDNA: it explicitly does not deal with the user interface.
   Mappings such as the one described in this document explicitly deal
   with the user interface and not the protocol.  That is, a mapping is
   only to be applied before a string of characters is treated as a
   domain name (in the "user interface") and is never to be applied
   during domain name processing (in the "protocol").

1.2.  The Design of This Mapping

   The user interface mapping in this document is a set of expansions to
   IDNA2008 that are meant to be sensible and friendly and mostly
   obvious to people throughout the world when using typical
   applications with domain names that are entered by hand.  It is also
   designed to let applications be mostly backwards compatible with
   IDNA2003.  By definition, it cannot meet all of those design goals
   for all people, and in fact is known to fail on some of those goals
   for quite large populations of people.

   A good mapping in the real world might use the "sensible and friendly
   and mostly obvious" design goal but come up with a different
   algorithm.  Many algorithms will have results that are close to what
   is described here, but will differ in assumptions about the users'
   way of thinking or typing.  Having said that, it is likely that some
   mappings will be significantly different.  For example, a mapping
   might apply to a spoken user interface instead of a typed one.
   Another example is that a mapping might be different for users who
   are typing than for users who are copying-and-pasting from different
   applications.  Yet another example is that a user interface that
   allows typed input that is transliterated from Latin characters could
   have very different mappings than one that applies to typing in other
   character sets; this would be typical in a Pinyin input method for
   Chinese characters.

2.  The General Procedure

   This section defines a general algorithm that applications ought to
   implement in order to produce Unicode code points that will be valid
   under the IDNA protocol.  An application might implement the full
   mapping as described below, or it can choose a different mapping.
   This mapping is very general and was designed to be acceptable to the
   widest user community, but as stated above, it does not take into
   account any particular context, culture, or locale.

   The general algorithm that an application (or the input method
   provided by an operating system) ought to use is relatively
   straightforward:

   1.  Uppercase characters are mapped to their lowercase equivalents by
       using the algorithm for mapping case in Unicode characters.  This
       step was chosen because the output will behave more like ASCII
       host names behave.

   2.  Fullwidth and halfwidth characters (those defined with
       Decomposition Types <wide> and <narrow>) are mapped to their
       decomposition mappings as shown in the Unicode character
       database.  This step was chosen because many input mechanisms,
       particularly in Asia, do not allow users to easily enter
       characters in the form used by IDNA2008.  Even if they do allow
       the correct character form, the user might not know which form
       they are entering.

   3.  All characters are mapped using Unicode Normalization Form C
       (NFC).  This step was chosen because it maps combinations of
       combining characters into canonical composed form.  As with the
       fullwidth/halfwidth mapping, users are not generally aware of the
       particular form of characters that they are entering, and
       IDNA2008 requires that only the canonical composed forms from NFC
       be used.

   4.  [IDNA2008protocol] is specified such that the protocol acts on
       the individual labels of the domain name.  If an implementation
       of this mapping also separates a domain name into labels by
       using the FULL STOP character (U+002E), the IDEOGRAPHIC FULL
       STOP character (U+3002) can be mapped to the FULL STOP before
       label separation occurs.  This step was chosen because some
       input mechanisms do not allow the user to easily enter proper
       label separators.  Other characters are used as "full stops" and
       could be considered for mapping as label separators, but only
       the IDEOGRAPHIC FULL STOP is added in this mapping because the
       authors have not fully investigated the applicability of the
       other characters or the environments where they should and
       should not be considered domain name label separators.

   Note that the steps above are ordered.

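   The four ordered steps can be sketched in a few lines of Python.
   This is a minimal sketch under stated assumptions, not a reference
   implementation: the function names are hypothetical, Python's
   str.lower() is assumed to be an acceptable stand-in for the Unicode
   lowercase mapping of step 1, and the standard unicodedata module
   uses whatever Unicode version ships with the interpreter rather
   than Unicode 5.2.

```python
import unicodedata

IDEOGRAPHIC_FULL_STOP = "\u3002"


def map_width(ch: str) -> str:
    """Step 2: map a <wide> or <narrow> character to its decomposition mapping."""
    decomposition = unicodedata.decomposition(ch)  # e.g. "<wide> 0041", or ""
    if decomposition.startswith(("<wide>", "<narrow>")):
        return "".join(chr(int(cp, 16)) for cp in decomposition.split()[1:])
    return ch


def map_user_input(text: str) -> str:
    """Apply steps 1 to 3, in order, to a whole domain name string."""
    text = text.lower()                            # step 1: lowercase
    text = "".join(map_width(ch) for ch in text)   # step 2: wide/narrow
    return unicodedata.normalize("NFC", text)      # step 3: NFC


def split_labels(text: str) -> list[str]:
    """Step 4: treat U+3002 as U+002E, then separate the labels."""
    mapped = map_user_input(text)
    return mapped.replace(IDEOGRAPHIC_FULL_STOP, ".").split(".")


# Fullwidth "EXAMPLE", an ideographic full stop, then fullwidth "COM".
name = "\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25\u3002\uff23\uff2f\uff2d"
print(split_labels(name))  # ['example', 'com']
```

   Because the steps are ordered, the width mapping runs before NFC and
   the label separator is replaced last.  As Section 3 stresses, a real
   application would adapt these steps to its users rather than apply
   them unchanged.
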
   Definitions for the rules in this algorithm can be found in
   [Unicode52].  Specifically:

   o  Unicode Normalization Form C is defined in Unicode Standard
      Annex #15 [Unicode-UAX15].

   o  In order to map uppercase characters to their lowercase
      equivalents (defined in Section 3.13 of [Unicode52]), first map
      characters to the "Lowercase_Mapping" property (the "<lower>"
      entry in the second column) in
      <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>, if any.
      Then, map characters to the "Simple_Lowercase_Mapping" property
      (the fourteenth column) in
      <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.

   o  In order to map fullwidth and halfwidth characters to their
      decomposition mappings, map any character whose
      "Decomposition_Type" (contained in the first part of the sixth
      column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
      is either "<wide>" or "<narrow>" to the "Decomposition_Mapping" of
      that character (contained in the second part of the sixth column
      of the same file).

   o  The Unicode Character Database [TR44] has useful descriptions of
      the contents of these files.

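   The column references above can be illustrated by reading the
   relevant fields from lines in the UnicodeData.txt layout.  The
   sketch below is an assumption-laden illustration: the function
   names are hypothetical, and the two sample lines were written out
   by hand in the semicolon-separated layout rather than read from the
   published file.

```python
def parse_unicode_data_line(line: str) -> dict:
    """Extract the columns used by the mapping from one UnicodeData.txt line."""
    fields = line.strip().split(";")
    decomposition = fields[5]   # sixth column: optional "<type>" tag, then code points
    simple_lower = fields[13]   # fourteenth column: Simple_Lowercase_Mapping
    decomp_type = None
    if decomposition.startswith("<"):
        # Split "<wide> 0041" into the type tag and the mapping part.
        decomp_type, _, decomposition = decomposition.partition(" ")
    return {
        "char": chr(int(fields[0], 16)),
        "decomposition_type": decomp_type,
        "decomposition_mapping": "".join(
            chr(int(cp, 16)) for cp in decomposition.split()
        ),
        "simple_lowercase": chr(int(simple_lower, 16)) if simple_lower else None,
    }


def width_mapping(entry: dict) -> str:
    """Return the mapping for <wide>/<narrow> entries, else the character itself."""
    if entry["decomposition_type"] in ("<wide>", "<narrow>"):
        return entry["decomposition_mapping"]
    return entry["char"]


# Hand-written sample lines (assumed to match the layout of the real file).
SAMPLE_WIDE = "FF21;FULLWIDTH LATIN CAPITAL LETTER A;Lu;0;L;<wide> 0041;;;;N;;;;FF41;"
SAMPLE_NARROW = "FF76;HALFWIDTH KATAKANA LETTER KA;Lo;0;L;<narrow> 30AB;;;;N;;;;;"

for sample in (SAMPLE_WIDE, SAMPLE_NARROW):
    entry = parse_unicode_data_line(sample)
    print(entry["decomposition_type"], repr(width_mapping(entry)))
```

   Characters whose sixth column carries no type tag, or a tag other
   than "<wide>" or "<narrow>", are left untouched by this step; NFC in
   step 3 deals with canonical forms separately.
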
   If the mappings in this document are applied to versions of Unicode
   later than Unicode 5.2, the later versions of the Unicode Standard
   should be consulted.

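   An implementation can at least detect when its character data is
   newer than Unicode 5.2.  The sketch below assumes Python's standard
   unicodedata module as the source of character data; the constant
   and function names are hypothetical.

```python
import unicodedata

BASELINE = (5, 2, 0)  # the Unicode version this mapping was written against


def bundled_unicode_version() -> tuple:
    """Return the Unicode version of Python's character database as a tuple."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


if bundled_unicode_version() > BASELINE:
    print("Character data is newer than Unicode 5.2; consult Unicode",
          unicodedata.unidata_version)
```

   Such a check only flags the difference; deciding whether the later
   character data changes the mapping still requires reading the later
   version of the standard.
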
   These mappings form a minimal set that an application should
   strongly consider applying.  Of course, there are many others that
   might be applied.

3.  Implementing This Mapping

   If you are implementing a mapping for an application or operating
   system by using exactly the four steps in Section 2, the authors of
   this document have a request: please don't.  We mean it.  Section 2
   does not describe a universal mapping algorithm because, as we said,
   there is no universally applicable mapping algorithm.

   If you read the material in Section 2 without reading Section 1, go
   back and carefully read all of Section 1; in many ways, Section 1 is
   more important than Section 2.  Further, you can probably think of
   user interface considerations that we did not list in Section 1.  If
   you did read Section 1 but somehow decided that the algorithm in
   Section 2 is completely correct for the intended users of your
   application or operating system, you are probably not thinking hard
   enough about your intended users.

4.  Security Considerations

   This document suggests creating mappings that might cause confusion
   for some users while alleviating confusion for other users.  Such
   confusion is not covered in any depth in this document (nor in the
   other IDNA-related documents).

5.  Acknowledgements

   This document is the product of many contributions from numerous
   people in the IETF.

6.  Normative References

   [IDNA2008protocol]  Klensin, J., "Internationalized Domain Names in
                       Applications (IDNA): Protocol", RFC 5891,
                       August 2010.

   [TR44]              The Unicode Consortium, "Unicode Technical Report
                       #44: Unicode Character Database", September 2009,
                       <http://www.unicode.org/reports/tr44/
                       tr44-4.html>.

   [Unicode-UAX15]     The Unicode Consortium, "Unicode Standard Annex
                       #15: Unicode Normalization Forms, Revision 31",
                       September 2009, <http://www.unicode.org/reports/
                       tr15/tr15-31.html>.

   [Unicode52]         The Unicode Consortium.  The Unicode Standard,
                       Version 5.2.0, defined by: "The Unicode Standard,
                       Version 5.2.0", (Mountain View, CA: The Unicode
                       Consortium, 2009. ISBN 978-1-936213-00-9).
                       <http://www.unicode.org/versions/Unicode5.2.0/>.

Authors' Addresses

   Peter W. Resnick
   Qualcomm Incorporated
   5775 Morehouse Drive
   San Diego, CA  92121-1714
   US

   Phone: +1 858 651 4478
   EMail: presnick@qualcomm.com
   URI:   http://www.qualcomm.com/~presnick/


   Paul Hoffman
   VPN Consortium
   127 Segre Place
   Santa Cruz, CA  95060
   US

   Phone: 1-831-426-9827
   EMail: paul.hoffman@vpnc.org