RFC 5893

    1 Internet Engineering Task Force (IETF)                H. Alvestrand, Ed.   
    2 Request for Comments: 5893                                        Google   
    3 Category: Standards Track                                        C. Karp   
    4 ISSN: 2070-1721                        Swedish Museum of Natural History   
    5                                                              August 2010   
    6                                                                            
    7                                                                            
    8                        Right-to-Left Scripts for                           
    9          Internationalized Domain Names for Applications (IDNA)            
   10                                                                            
   11 Abstract                                                                   
   12                                                                            
   13    The use of right-to-left scripts in Internationalized Domain Names      
   14    (IDNs) has presented several challenges.  This memo provides a new      
   15    Bidi rule for Internationalized Domain Names for Applications (IDNA)    
   16    labels, based on the encountered problems with some scripts and some    
   17    shortcomings in the 2003 IDNA Bidi criterion.                           
   18                                                                            
   19 Status of This Memo                                                        
   20                                                                            
   21    This is an Internet Standards Track document.                           
   22                                                                            
   23    This document is a product of the Internet Engineering Task Force       
   24    (IETF).  It represents the consensus of the IETF community.  It has     
   25    received public review and has been approved for publication by the     
   26    Internet Engineering Steering Group (IESG).  Further information on     
   27    Internet Standards is available in Section 2 of RFC 5741.               
   28                                                                            
   29    Information about the current status of this document, any errata,      
   30    and how to provide feedback on it may be obtained at                    
   31    http://www.rfc-editor.org/info/rfc5893.                                 
   32                                                                            
   33 Copyright Notice                                                           
   34                                                                            
   35    Copyright (c) 2010 IETF Trust and the persons identified as the         
   36    document authors.  All rights reserved.                                 
   37                                                                            
   38    This document is subject to BCP 78 and the IETF Trust's Legal           
   39    Provisions Relating to IETF Documents                                   
   40    (http://trustee.ietf.org/license-info) in effect on the date of         
   41    publication of this document.  Please review these documents            
   42    carefully, as they describe your rights and restrictions with respect   
   43    to this document.  Code Components extracted from this document must    
   44    include Simplified BSD License text as described in Section 4.e of      
   45    the Trust Legal Provisions and are provided without warranty as         
   46    described in the Simplified BSD License.                                
   47                                                                            
   52 Alvestrand & Karp            Standards Track                    [Page 1]   

   53 RFC 5893                   IDNA Right to Left                August 2010   
   54                                                                            
   55                                                                            
   56 Table of Contents                                                          
   57                                                                            
         1.  Introduction
           1.1.  Purpose and Applicability
           1.2.  Background and History
           1.3.  Structure of the Rest of This Document
           1.4.  Terminology
         2.  The Bidi Rule
         3.  The Requirement Set for the Bidi Rule
         4.  Examples of Issues Found with RFC 3454
           4.1.  Dhivehi
           4.2.  Yiddish
           4.3.  Strings with Numbers
         5.  Troublesome Situations and Guidelines
         6.  Other Issues in Need of Resolution
         7.  Compatibility Considerations
           7.1.  Backwards Compatibility Considerations
           7.2.  Forward Compatibility Considerations
         8.  Security Considerations
         9.  Acknowledgements
         10. References
           10.1. Normative References
           10.2. Informative References
   79                                                                            
   80 1.  Introduction                                                           
   81                                                                            
   82 1.1.  Purpose and Applicability                                            
   83                                                                            
   84    The purpose of this document is to establish a rule that can be         
   85    applied to Internationalized Domain Name (IDN) labels in Unicode form   
   86    (U-labels) containing characters from scripts that are written from     
   87    right to left.  It is part of the revised IDNA protocol [RFC5891].      
   88                                                                            
   89    When labels satisfy the rule, and when certain other conditions are     
   90    satisfied, there is only a minimal chance of these labels being         
   91    displayed in a confusing way by the Unicode bidirectional display       
   92    algorithm.                                                              
   93                                                                            
   94    The other normative documents in the IDNA2008 document set establish    
   95    criteria for valid labels, including listing the permitted              
   96    characters.  This document establishes additional validity criteria     
   97    for labels in scripts normally written from right to left.              
   98                                                                            
   99    This specification is not intended to place any requirements on         
  100    domain names that do not contain characters from such scripts.          
  101                                                                            
  111 1.2.  Background and History                                               
  112                                                                            
  113    The "Stringprep" specification [RFC3454], part of IDNA2003, made the    
  114    following statement in its Section 6 on the Bidi algorithm:             
  115                                                                            
  116       3) If a string contains any RandALCat character, a RandALCat         
  117       character MUST be the first character of the string, and a           
  118       RandALCat character MUST be the last character of the string.        
  119                                                                            
  120    (A RandALCat character is a character with unambiguously                
  121    right-to-left directionality.)                                          
  122                                                                            
  123    The reasoning behind this prohibition was to ensure that every          
  124    component of a displayed domain name has an unambiguously preferred     
  125    direction.  However, this made certain words in languages written       
  126    with right-to-left scripts invalid as IDN labels, and in at least one   
  127    case (Dhivehi) meant that all the words of an entire language were      
  128    forbidden as IDN labels.                                                
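
         As a rough illustration of item 3 quoted above, the following
         Python sketch checks the first-and-last criterion.  It is not the
         Stringprep algorithm: RFC 3454 defines RandALCat through its own
         tables, and here the R and AL classes reported by Python's
         unicodedata module stand in for them as an assumption.  The
         function names are hypothetical.

```python
import unicodedata


def is_randalcat(ch: str) -> bool:
    # Assumption: approximate RFC 3454's RandALCat tables with the
    # Bidi classes R and AL from the interpreter's Unicode data.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def satisfies_rfc3454_item3(s: str) -> bool:
    """Check only item 3 of RFC 3454, Section 6 (first/last character)."""
    if not any(is_randalcat(ch) for ch in s):
        return True  # the item applies only to strings with RandALCat
    return is_randalcat(s[0]) and is_randalcat(s[-1])


# A Thaana letter (U+0780) followed by a Thaana vowel sign (U+07A6, NSM).
# The pair is built from code points for illustration only.
thaana_sample = "\u0780\u07a6"
print(satisfies_rfc3454_item3(thaana_sample))
```

         The sample string is rejected because its last character is a
         nonspacing mark rather than a RandALCat character, which is the
         kind of failure described above.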
  129                                                                            
  130    This is illustrated below with examples taken from the Dhivehi and      
  131    Yiddish languages, as written with the Thaana and Hebrew scripts,       
  132    respectively.                                                           
  133                                                                            
  134    RFC 3454 did not explicitly state the requirement to be fulfilled.      
  135    Therefore, it is impossible to determine whether a simple relaxation    
  136    of the rule would continue to fulfill the requirement.                  
  137                                                                            
         While this document specifies rules quite different from RFC 3454,
         most reasonable labels that were allowed under RFC 3454 will also be
         allowed under this specification, so the operational impact of using
         the new rule in the updated IDNA specification is limited.  The most
         important examples of labels that are no longer permitted are labels
         that mix Arabic and European digits (AN and EN) inside an RTL label,
         and labels that use AN in an LTR label (see Section 1.4 for
         terminology).
  145                                                                            
  146 1.3.  Structure of the Rest of This Document                               
  147                                                                            
  148    Section 2 defines a rule, the "Bidi rule", which can be used on a       
  149    domain name label to check how safe it is to use in a domain name of    
  150    possibly mixed directionality.  The primary initial use of this rule    
  151    is as part of the IDNA2008 protocol [RFC5891].                          
  152                                                                            
  153    Section 3 sets out the requirements for defining the Bidi rule.         
  154                                                                            
  155    Section 4 gives detailed examples that serve as justification for the   
  156    new rule.                                                               
  157                                                                            
         Sections 5 through 8 describe various situations that can occur
         when dealing with domain names with characters of different
         directionality.
  169                                                                            
  170    Only Section 1.4 and Section 2 are normative.                           
  171                                                                            
  172 1.4.  Terminology                                                          
  173                                                                            
  174    The terminology used to describe IDNA concepts is defined in the        
  175    Definitions document [RFC5890].                                         
  176                                                                            
  177    The terminology used for the Bidi properties of Unicode characters is   
  178    taken from the Unicode Standard [Unicode52].                            
  179                                                                            
  180    The Unicode Standard specifies a Bidi property for each character.      
  181    That property controls the character's behavior in the Unicode          
  182    bidirectional algorithm [Unicode-UAX9].  For reference, here are the    
  183    values that the Unicode Bidi property can have:                         
  184                                                                            
  185    o  L - Left to right - most letters in LTR scripts                      
  186                                                                            
  187    o  R - Right to left - most letters in non-Arabic RTL scripts           
  188                                                                            
  189    o  AL - Arabic letters - most letters in the Arabic script              
  190                                                                            
  191    o  EN - European Number (0-9, and Extended Arabic-Indic numbers)        
  192                                                                            
  193    o  ES - European Number Separator (+ and -)                             
  194                                                                            
  195    o  ET - European Number Terminator (currency symbols, the hash sign,    
  196       the percent sign and so on)                                          
  197                                                                            
  198    o  AN - Arabic Number; this encompasses the Arabic-Indic numbers, but   
  199       not the Extended Arabic-Indic numbers                                
  200                                                                            
  201    o  CS - Common Number Separator (. , / : et al)                         
  202                                                                            
  203    o  NSM - Nonspacing Mark - most combining accents                       
  204                                                                            
  205    o  BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)   
  206                                                                            
  207    o  B - Paragraph Separator                                              
  208                                                                            
  209    o  S - Segment Separator                                                
  210                                                                            
  211    o  WS - Whitespace, including the SPACE character                       
  212                                                                            
  213    o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT         
  214                                                                            
  221    o  LRE, LRO, RLE, RLO, PDF - these are "directional control             
  222       characters" and are not used in IDNA labels.                         
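
         The Bidi property of a given character can be looked up
         programmatically.  The sketch below uses Python's standard
         unicodedata module; the values it reports come from the Unicode
         version bundled with the interpreter, which may be newer than the
         version this document references.  The helper name and the choice
         of sample characters are illustrative.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the Bidi class abbreviation of one character, e.g. 'L'."""
    return unicodedata.bidirectional(ch)


# (description, character) pairs chosen to cover the classes listed above.
SAMPLES = [
    ("LATIN SMALL LETTER A", "a"),
    ("HEBREW LETTER ALEF", "\u05d0"),
    ("ARABIC LETTER ALEF", "\u0627"),
    ("DIGIT SEVEN", "7"),
    ("EXTENDED ARABIC-INDIC DIGIT SEVEN", "\u06f7"),
    ("ARABIC-INDIC DIGIT SEVEN", "\u0667"),
    ("HYPHEN-MINUS", "-"),
    ("DOLLAR SIGN", "$"),
    ("FULL STOP", "."),
    ("COMBINING ACUTE ACCENT", "\u0301"),
    ("ZERO WIDTH NON-JOINER", "\u200c"),
    ("SPACE", " "),
    ("COMMERCIAL AT", "@"),
    ("MIDDLE DOT", "\u00b7"),
]

for description, ch in SAMPLES:
    print(f"U+{ord(ch):04X} {description}: {bidi_class(ch)}")
```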
  223                                                                            
  224    In this memo, we use "network order" to describe the sequence of        
  225    characters as transmitted on the wire or stored in a file; the terms    
  226    "first", "next", "previous", "beginning", "end", "before", and          
  227    "after" are used to refer to the relationship of characters and         
  228    labels in network order.                                                
  229                                                                            
  230    We use "display order" to talk about the sequence of characters as      
  231    imaged on a display medium; the terms "left" and "right" are used to    
  232    refer to the relationship of characters and labels in display order.    
  233                                                                            
  234    Most of the time, the examples use the abbreviations for the Unicode    
  235    Bidi classes to denote the directionality of the characters; the        
  236    example string CS L consists of one character of class CS and one       
  237    character of class L.  In some examples, the convention that            
  238    uppercase characters are of class R or AL, and lowercase characters     
  239    are of class L is used -- thus, the example string ABC.abc would        
  240    consist of three right-to-left characters and three left-to-right       
  241    characters.                                                             
  242                                                                            
  243    The directionality of such examples is determined by context -- for     
  244    instance, in the sentence "ABC.abc is displayed as CBA.abc", the        
  245    first example string is in network order, the second example string     
  246    is in display order.                                                    
  247                                                                            
  248    The term "paragraph" is used in the sense of the Unicode Bidi           
  249    specification [Unicode-UAX9].  It means "a block of text that has an    
  250    overall direction, either left to right or right to left",              
  251    approximately; see the "Unicode Bidirectional Algorithm"                
  252    [Unicode-UAX9] for details.                                             
  253                                                                            
  254    "RTL" and "LTR" are abbreviations for "right to left" and "left to      
  255    right", respectively.                                                   
  256                                                                            
  257    An RTL label is a label that contains at least one character of type    
  258    R, AL, or AN.                                                           
  259                                                                            
  260    An LTR label is any label that is not an RTL label.                     
  261                                                                            
  262    A "Bidi domain name" is a domain name that contains at least one RTL    
  263    label.  (Note: This definition includes domain names containing only    
  264    dots and right-to-left characters.  Providing a separate category of    
  265    "RTL domain names" would not make this specification simpler, so it     
  266    has not been done.)                                                     
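
         The three definitions above translate directly into code.  The
         following is a minimal sketch, assuming that labels are separated
         by FULL STOP only and that Bidi classes are taken from Python's
         unicodedata module; the function names are hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label: str) -> bool:
    """An RTL label contains at least one character of type R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def is_ltr_label(label: str) -> bool:
    """An LTR label is any label that is not an RTL label."""
    return not is_rtl_label(label)


def is_bidi_domain_name(domain: str) -> bool:
    """A Bidi domain name contains at least one RTL label."""
    # Assumption: labels are delimited by U+002E FULL STOP only.
    return any(is_rtl_label(label) for label in domain.split("."))
```

         Note that a label consisting of Latin letters and a single
         Arabic-Indic digit (class AN) counts as an RTL label under this
         definition, even though it contains no right-to-left letter.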
  267                                                                            
  276 2.  The Bidi Rule                                                          
  277                                                                            
  278    The following rule, consisting of six conditions, applies to labels     
  279    in Bidi domain names.  The requirements that this rule satisfies are    
  280    described in Section 3.  All of the conditions must be satisfied for    
  281    the rule to be satisfied.                                               
  282                                                                            
  283    1.  The first character must be a character with Bidi property L, R,    
  284        or AL.  If it has the R or AL property, it is an RTL label; if it   
  285        has the L property, it is an LTR label.                             
  286                                                                            
  287    2.  In an RTL label, only characters with the Bidi properties R, AL,    
  288        AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.                     
  289                                                                            
  290    3.  In an RTL label, the end of the label must be a character with      
  291        Bidi property R, AL, EN, or AN, followed by zero or more            
  292        characters with Bidi property NSM.                                  
  293                                                                            
  294    4.  In an RTL label, if an EN is present, no AN may be present, and     
  295        vice versa.                                                         
  296                                                                            
  297    5.  In an LTR label, only characters with the Bidi properties L, EN,    
  298        ES, CS, ET, ON, BN, or NSM are allowed.                             
  299                                                                            
  300    6.  In an LTR label, the end of the label must be a character with      
  301        Bidi property L or EN, followed by zero or more characters with     
  302        Bidi property NSM.                                                  
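
         The six conditions above can be sketched as a single check on one
         label.  This is an illustrative sketch, not a reference
         implementation: it covers only the Bidi rule (none of the other
         IDNA2008 validity criteria), it takes Bidi classes from Python's
         unicodedata module, and the function name is hypothetical.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label: str) -> bool:
    """Return True if the label meets all six conditions of the rule."""
    if not label:
        return False
    classes = [unicodedata.bidirectional(ch) for ch in label]

    # Condition 1: the first character decides the label direction.
    if classes[0] in ("R", "AL"):
        rtl = True
    elif classes[0] == "L":
        rtl = False
    else:
        return False

    # Conditions 3 and 6 look at the last character that is not NSM.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1  # terminates: classes[0] is never NSM at this point
    last = classes[end - 1]

    if rtl:
        if not set(classes) <= RTL_ALLOWED:           # condition 2
            return False
        if last not in ("R", "AL", "EN", "AN"):       # condition 3
            return False
        if "EN" in classes and "AN" in classes:       # condition 4
            return False
        return True

    if not set(classes) <= LTR_ALLOWED:               # condition 5
        return False
    return last in ("L", "EN")                        # condition 6
```

         Under this sketch, an ASCII label such as "1abc" fails condition
         1 because its first character has class EN, while a label made
         of a Thaana letter followed by a Thaana vowel sign (U+0780
         U+07A6) passes because trailing NSM characters are permitted.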
  303                                                                            
  304    The following guarantees can be made based on the above:                
  305                                                                            
  306    o  In a domain name consisting of only labels that satisfy the rule,    
  307       the requirements of Section 3 are satisfied.  Note that even LTR     
  308       labels and pure ASCII labels have to be tested.                      
  309                                                                            
  310    o  In a domain name consisting of only LDH labels (as defined in the    
  311       Definitions document [RFC5890]) and labels that satisfy the rule,    
  312       the requirements of Section 3 are satisfied as long as a label       
  313       that starts with an ASCII digit does not come after a                
  314       right-to-left label.                                                 
  315                                                                            
  316    No guarantee is given for other combinations.                           
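
         The extra condition in the second guarantee concerns the order of
         labels within a name.  The sketch below flags names in which a
         label starting with an ASCII digit directly follows an RTL label.
         Reading "come after" as "directly after" is an assumption made
         here; a stricter reading would flag any later label.  The
         function names are hypothetical, and labels are assumed to be
         given as a list in network order.

```python
import unicodedata
from typing import List

ASCII_DIGITS = "0123456789"


def is_rtl_label(label: str) -> bool:
    """An RTL label contains at least one character of type R, AL, or AN."""
    return any(
        unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in label
    )


def digit_label_follows_rtl(labels: List[str]) -> bool:
    """True if a label starting with an ASCII digit directly follows an
    RTL label (assumed reading of "come after")."""
    for prev, nxt in zip(labels, labels[1:]):
        starts_with_digit = nxt != "" and nxt[0] in ASCII_DIGITS
        if is_rtl_label(prev) and starts_with_digit:
            return True
    return False
```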
  317                                                                            
  318 3.  The Requirement Set for the Bidi Rule                                  
  319                                                                            
         This document, unlike RFC 3454 [RFC3454], provides an explicit
         justification for the Bidi rule, and states a set of requirements
         against which the modified rule can be tested.
  324                                                                            
  331    All the text in this document assumes that text containing the labels   
  332    under consideration will be displayed using the Unicode bidirectional   
  333    algorithm [Unicode-UAX9].                                               
  334                                                                            
  335    The requirements proposed are these:                                    
  336                                                                            
  337    o  Label Uniqueness: No two labels, when presented in display order     
  338       in the same paragraph, should have the same sequence of characters   
  339       without also having the same sequence of characters in network       
  340       order, both when the paragraph has LTR direction and when the        
  341       paragraph has RTL direction.  (This is the criterion that is         
            explicit in RFC 3454.)  (Note that a label displayed in an RTL
  343       paragraph may display the same as a different label displayed in     
  344       an LTR paragraph and still satisfy this criterion.)                  
  345                                                                            
  346    o  Character Grouping: When displaying a string of labels, using the    
  347       Unicode Bidi algorithm to reorder the characters for display, the    
  348       characters of each label should remain grouped between the           
  349       characters delimiting the labels, both when the string is embedded   
  350       in a paragraph with LTR direction and when it is embedded in a       
  351       paragraph with RTL direction.                                        
  352                                                                            
  353    Several stronger statements were considered and rejected, because       
  354    they seem to be impossible to fulfill within the constraints of the     
  355    Unicode bidirectional algorithm.  These include:                        
  356                                                                            
  357    o  The appearance of a label should be unaffected by its embedding      
  358       context.  This proved impossible even for ASCII labels; the label    
  359       "123-A" will have a different display order in an RTL context than   
  360       in an LTR context.  (This particular example is, however,            
  361       disallowed anyway.)                                                  
  362                                                                            
  363    o  The sequence of labels should be consistent with network order.      
  364       This proved impossible -- a domain name consisting of the labels     
  365       (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in   
  366       an LTR context.  (In an RTL context, it will be displayed as         
            L4.R3.R2.L1.)
  368                                                                            
  369    o  No two domain names should be displayed the same, even under         
  370       differing directionality.  This was shown to be unsound, since the   
  371       domain name (in network order) ABC.abc will have display order       
  372       CBA.abc in an LTR context and abc.CBA in an RTL context, while the   
  373       domain name (network) abc.ABC will have display order abc.CBA in     
  374       an LTR context and CBA.abc in an RTL context.                        
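
         The display orders quoted in the last two bullets can be
         reproduced with a toy model.  This is a simplified sketch, not an
         implementation of the Unicode bidirectional algorithm: it handles
         only the convention of Section 1.4 (uppercase is R, lowercase is
         L), treats FULL STOP as the only neutral character, and ignores
         digits, nonspacing marks, and directional controls.  The function
         name is hypothetical.

```python
def toy_display_order(text: str, rtl_paragraph: bool) -> str:
    """Reorder a network-order string of letters and dots for display."""
    base = "R" if rtl_paragraph else "L"

    # Step 1: strong direction per character; None marks a neutral dot.
    dirs = []
    for ch in text:
        if ch == ".":
            dirs.append(None)
        elif ch.isupper():
            dirs.append("R")
        elif ch.islower():
            dirs.append("L")
        else:
            raise ValueError(f"unsupported character: {ch!r}")

    # Step 2: a run of neutrals takes the direction of its neighbours if
    # they agree, otherwise the paragraph direction.
    n = len(dirs)
    i = 0
    while i < n:
        if dirs[i] is not None:
            i += 1
            continue
        j = i
        while j < n and dirs[j] is None:
            j += 1
        before = dirs[i - 1] if i > 0 else base
        after = dirs[j] if j < n else base
        fill = before if before == after else base
        for k in range(i, j):
            dirs[k] = fill
        i = j

    # Step 3: embedding levels (even is LTR, odd is RTL).
    base_level = 1 if rtl_paragraph else 0
    levels = [base_level if d == base else base_level + 1 for d in dirs]

    # Step 4: from the highest level down to 1, reverse each maximal run
    # of characters at that level or higher.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)
```

         With this model, toy_display_order("ABC.abc", False) yields
         "CBA.abc" and toy_display_order("abc.ABC", True) yields the same
         string, which is the collision described in the last bullet.
         The four-label string "ab.CD.EF.gh" comes out as "ab.FE.DC.gh"
         in an LTR paragraph, matching the L1.R3.R2.L4 ordering above.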
  375                                                                            
  386    One possible requirement was thought to be problematic, but turned      
  387    out to be satisfied by a string that obeys the proposed rules:          
  388                                                                            
  389    o  The Character Grouping requirement should be satisfied when          
  390       directional controls (LRE, RLE, RLO, LRO, PDF) are used in the       
  391       same paragraph (outside of the labels).  Because these controls      
  392       affect presentation order in non-obvious ways, by affecting the      
  393       "sor" and "eor" properties of the Unicode Bidi algorithm, the        
  394       conditions above require extra testing in order to figure out        
  395       whether or not they influence the display of the domain name.        
  396       Testing found that for the strings allowed under the rule            
  397       presented in this document, directional controls do not influence    
  398       the display of the domain name.                                      
  399                                                                            
  400    This is still not stated as a requirement, since it did not seem as     
  401    important as the stated requirements, but it is useful to know that     
  402    Bidi domain names where the labels satisfy the rule have this           
  403    property.                                                               
  404                                                                            
  405    In the following descriptions, first-level bullets are used to          
  406    indicate rules or normative statements; second-level bullets are        
  407    commentary.                                                             
  408                                                                            
  409    The Character Grouping requirement can be more formally stated as:      
  410                                                                            
  411    o  Let "Delimiterchars" be a set of characters with the Unicode Bidi    
  412       properties CS, WS, ON.  (These are commonly used to delimit labels   
  413       -- both the FULL STOP and the space are included.  They are not      
  414       allowed in domain labels.)                                           
  415                                                                            
  416       *  ET, though it commonly occurs next to domain names in practice,   
  417          is problematic: the context R CS L EN ET (for instance A.a1%)     
  418          makes the label L EN not satisfy the character grouping           
  419          requirement.                                                      
  420                                                                            
  421       *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be   
  422          used as a delimiter (for instance, the plus sign).  It is left    
  423          out here.                                                         
  424                                                                            
  425    o  Let "unproblematic label" be a label that either satisfies the       
  426       requirements or does not contain any character with the Bidi         
  427       properties R, AL, or AN and does not begin with a character with     
  428       the Bidi property EN.  (Informally, "it does not start with a        
  429       number".)                                                            
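
   The two definitions above can be illustrated with a minimal sketch
   built on Python's standard unicodedata module.  The function names
   are hypothetical, and the second function covers only the branch of
   "unproblematic label" that does not depend on the rule itself (no
   R, AL, or AN characters and no leading EN).  Bidi classes come from
   whatever Unicode version the Python runtime ships, which is an
   assumption of this sketch.

```python
import unicodedata

# Bidi classes that make up "Delimiterchars" in the definition above.
DELIMITER_BIDI_CLASSES = {"CS", "WS", "ON"}


def is_delimiter_char(ch: str) -> bool:
    """True if the character's Bidi class is CS, WS, or ON."""
    return unicodedata.bidirectional(ch) in DELIMITER_BIDI_CLASSES


def is_plain_unproblematic(label: str) -> bool:
    """Second branch of "unproblematic label" only.

    True when the label contains no R, AL, or AN character and does not
    begin with an EN character. The empty string is handled separately
    in the requirement, so it is rejected here.
    """
    if not label:
        return False
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if any(c in {"R", "AL", "AN"} for c in classes):
        return False
    return classes[0] != "EN"


for ch in ". %-+":
    print(repr(ch), unicodedata.bidirectional(ch), is_delimiter_char(ch))
```

   The loop shows why ET and ES are left out: the percent sign is
   class ET, and both HYPHEN-MINUS and the plus sign are class ES, so
   none of them counts as a delimiter here.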
  430                                                                            
  441    A label X satisfies the Character Grouping requirement when, for any
  442    D1 and D2 in Delimiterchars, and for any S1 and S2 that are each an
  443    unproblematic label or an empty string, the following holds true:
  444                                                                            
  445    If the string formed by concatenating S1, D1, X, D2, and S2 is          
  446    reordered according to the Bidi algorithm, then all the characters of   
  447    X in the reordered string are between D1 and D2, and no other           
  448    characters are between D1 and D2, both if the overall paragraph         
  449    direction is LTR and if the overall paragraph direction is RTL.         
  450                                                                            
  451    Note that the definition is self-referential, since S1 and S2 are       
  452    constrained to be "legal" by this definition.  This makes testing       
  453    changes to proposed rules a little complex, but does not create         
  454    problems for testing whether or not a given proposed rule satisfies     
  455    the criterion.                                                          
  456                                                                            
  457    The "zero-length" case (S1 or S2 empty) represents the case where a
  458    domain name is next to something that isn't a domain name, separated
  459    by a delimiter character.
  460                                                                            
  461    Note about the position of BN: The Unicode bidirectional algorithm      
  462    specifies that a BN has an effect on the adjoining characters in        
  463    network order, not in display order, and is therefore treated as if
  464    removed during Bidi processing ([Unicode-UAX9], Section 3.3.2, rule     
  465    X9 and Section 5.3).  Therefore, the question of "what position does    
  466    a BN have after reordering" is not meaningful.  It has been ignored     
  467    while developing the rules here.                                        
  468                                                                            
  469    The Label Uniqueness requirement can be formally stated as:             
  470                                                                            
  471    If two non-identical labels X and Y, embedded as for the test above,    
  472    displayed in paragraphs with the same directionality, are reordered     
  473    by the Bidi algorithm into the same sequence of code points, the        
  474    labels X and Y cannot both be legal.                                    
  475                                                                            
  476 4.  Examples of Issues Found with RFC 3454                                 
  477                                                                            
  478 4.1.  Dhivehi                                                              
  479                                                                            
  480    Dhivehi, the official language of the Maldives, is written with the     
  481    Thaana script.  This script displays some of the characteristics of     
  482    the Arabic script, including its directional properties, and the        
  483    indication of vowels by the diacritical marking of consonantal base     
  484    characters.  This marking is obligatory, and both two consecutive       
  485    vowels and syllable-final consonants are indicated with unvoiced        
  486    combining marks.  Every Dhivehi word therefore ends with a combining    
  487    mark.                                                                   
  488                                                                            
  496    The word for "computer", which is romanized as "konpeetaru", is         
  497    written with the following sequence of Unicode code points:             
  498                                                                            
  499       U+0786 THAANA LETTER KAAFU (AL)                                      
  500                                                                            
  501       U+07AE THAANA OBOFILI (NSM)                                          
  502                                                                            
  503       U+0782 THAANA LETTER NOONU (AL)                                      
  504                                                                            
  505       U+07B0 THAANA SUKUN (NSM)                                            
  506                                                                            
  507       U+0795 THAANA LETTER PAVIYANI (AL)                                   
  508                                                                            
  509       U+07A9 THAANA EEBEEFILI (NSM)
  510                                                                            
  511       U+0793 THAANA LETTER TAVIYANI (AL)                                   
  512                                                                            
  513       U+07A6 THAANA ABAFILI (NSM)                                          
  514                                                                            
  515       U+0783 THAANA LETTER RAA (AL)                                        
  516                                                                            
  517       U+07AA THAANA UBUFILI (NSM)                                          
  518                                                                            
  519    The directionality class of U+07AA in the Unicode database              
  520    [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a           
  521    conformant implementation of the IDNA2003 algorithm will say that       
  522    "this is not in RandALCat" and refuse to encode the string.             
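
   The rejection can be reproduced with a minimal sketch of the two
   Bidi requirements in RFC 3454, Section 6 that concern RandALCat
   characters.  The function name is hypothetical, and the sketch
   assumes the Bidi classes reported by Python's unicodedata module
   (a newer Unicode version than the one Stringprep is tied to); it
   does not implement the rest of Stringprep.

```python
import unicodedata

RANDALCAT = {"R", "AL"}


def passes_rfc3454_bidi_check(label: str) -> bool:
    """Sketch of the RandALCat requirements of RFC 3454, Section 6.

    If the string contains any R or AL character, it must contain no L
    character, and both its first and its last character must be R or AL.
    """
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RANDALCAT for c in classes):
        return True  # the RandALCat requirements do not apply
    if "L" in classes:
        return False
    return classes[0] in RANDALCAT and classes[-1] in RANDALCAT


# "konpeetaru" as listed above; the final code point is U+07AA (NSM).
konpeetaru = "\u0786\u07ae\u0782\u07b0\u0795\u07a9\u0793\u07a6\u0783\u07aa"
print(unicodedata.bidirectional(konpeetaru[-1]))
print(passes_rfc3454_bidi_check(konpeetaru))
```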
  523                                                                            
  524 4.2.  Yiddish                                                              
  525                                                                            
  526    Yiddish is one of several languages written with the Hebrew script      
  527    (others include Hebrew and Ladino).  This is basically a consonantal    
  528    alphabet (also termed an "abjad"), but Yiddish is written using an      
  529    extended form that is fully vocalic.  The vowels are indicated in       
  530    several ways, one of which is by repurposing letters that are           
  531    consonants in Hebrew.  Other letters are used both as vowels and        
  532    consonants, with combining marks, called "points", used to              
  533    differentiate between them.  Finally, some base characters can          
  534    indicate several different vowels, which are also disambiguated by      
  535    combining marks.  Pointed characters can appear in word-final           
  536    position and may therefore also be needed at the end of labels.  This   
  537    is not an invariable attribute of a Yiddish string and there is thus    
  538    greater latitude here than there is with Dhivehi.                       
  539                                                                            
  540    The organization now known as the "YIVO Institute for Jewish            
  541    Research" developed orthographic rules for modern Standard Yiddish      
  542    during the 1930s on the basis of work conducted in several venues       
  543    since earlier in that century.  These are given in "The Standardized
  551    Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken    
  552    as normatively descriptive of modern Standard Yiddish in any context    
  553    where that notion is deemed relevant.  They have been applied           
  554    exclusively in all formal Yiddish dictionaries published since their    
  555    establishment, and are similarly dominant in academic and               
  556    bibliographic regards.                                                  
  557                                                                            
  558    It therefore appears appropriate for this repertoire also to be         
  559    supported fully by IDNA.  This presents no difficulty with characters   
  560    in initial and medial positions, but pointed characters are regularly   
  561    used in final position as well.  All of the characters in the SYO       
  562    repertoire appear in both marked and unmarked form with one             
  563    exception: the HEBREW LETTER PE (U+05E4).  The SYO only permits this    
  564    with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent   
  565    to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent    
  566    to the Latin letter "f".  There is, however, a separate unpointed       
  567    allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter          
  568    character when it appears in final position.  The constraint on the     
  569    use of the SYO repertoire resulting from the proscription of            
  570    combining marks at the end of RTL strings thus reduces to nothing       
  571    more, or less, than the equivalent of saying that a string of Latin     
  572    characters cannot end with the letter "p".  It must also be noted       
  573    that the HEBREW LETTER PE with the HEBREW POINT DAGESH is               
  574    characteristic of almost all traditional Yiddish orthographies that     
  575    predate (or remain in use in parallel to) the SYO, being the first      
  576    pointed character to appear in any of them.                             
  577                                                                            
  578    A more general instantiation of the basic problem can be seen in the    
  579    representation of the YIVO acronym.  This acronym is written with the   
  580    Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and       
  581    QAMATS are combining points.  The distinct code points used are:
  582                                                                            
  583       U+05D9 HEBREW LETTER YOD (R)                                         
  584                                                                            
  585       U+05B4 HEBREW POINT HIRIQ (NSM)                                      
  586                                                                            
  587       U+05D5 HEBREW LETTER VAV (R)                                         
  588                                                                            
  589       U+05D0 HEBREW LETTER ALEF (R)                                        
  590                                                                            
  591       U+05B8 HEBREW POINT QAMATS (NSM)                                     
  592                                                                            
  593    The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode   
  594    database is NSM, which again causes the IDNA2003 algorithm to reject    
  595    the string.                                                             
  596                                                                            
  606    It may also be noted that all of the combined characters mentioned      
  607    above exist in precomposed form at separate positions in the Unicode    
  608    chart.  However, by invoking Stringprep, the IDNA2003 algorithm also    
  609    rejects those code points, for reasons not discussed here.              
  610                                                                            
  611 4.3.  Strings with Numbers                                                 
  612                                                                            
  613    By requiring that both the first and the last character of a string
  614    be members of category R or AL, the Stringprep specification
  615    [RFC3454] prohibited a string containing right-to-left characters
  616    from ending with a number.
  617                                                                            
  618    Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5     
  619    ALEF.  Displayed in an LTR context, the first one will be displayed     
  620    from left to right as 5 ALEF (with the 5 being considered right to      
  621    left because of the leading ALEF), while 5 ALEF will be displayed in    
  622    exactly the same order (5 taking the direction from context).           
  623    Clearly, only one of those should be permitted as a registered label,   
  624    but barring them both seems unnecessary.                                
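
   A minimal sketch of how the rule in Section 2 of this document
   separates the two strings follows.  It encodes the six conditions
   of that rule as the editor understands them; the function name is
   hypothetical, and Bidi classes are taken from Python's unicodedata
   module, which is an assumption about the Unicode version in use.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label: str) -> bool:
    """Sketch of the six conditions of the Bidi rule for a single label."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False

    # Find the last character that is not a trailing NSM.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1] if end > 0 else None

    first = classes[0]
    if first in {"R", "AL"}:  # RTL label
        if not set(classes) <= RTL_ALLOWED:
            return False
        if last not in {"R", "AL", "EN", "AN"}:
            return False
        # EN and AN must not both occur in the same label.
        return not ("EN" in classes and "AN" in classes)
    if first == "L":  # LTR label
        return set(classes) <= LTR_ALLOWED and last in {"L", "EN"}
    return False  # the first character must be L, R, or AL


alef_5 = "\u05d05"  # HEBREW LETTER ALEF + DIGIT FIVE
five_alef = "5\u05d0"  # DIGIT FIVE + HEBREW LETTER ALEF
print(satisfies_bidi_rule(alef_5), satisfies_bidi_rule(five_alef))
```

   Under this sketch, ALEF 5 is accepted because it starts with an R
   character and ends with EN, while 5 ALEF is rejected because its
   first character is EN.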
  625                                                                            
  626 5.  Troublesome Situations and Guidelines                                  
  627                                                                            
  628    There are situations in which labels that satisfy the rule above will   
  629    be displayed in a surprising fashion.  The most important of these is   
  630    the case where a label ending in a character with Bidi property AL,     
  631    AN, or R occurs before a label beginning with a character of Bidi       
  632    property EN.  In that case, the number will appear to move into the     
  633    label containing the right-to-left character, violating the Character   
  634    Grouping requirement.                                                   
  635                                                                            
  636    If the label that occurs after the right-to-left label itself           
  637    satisfies the Bidi criterion, the requirements will be satisfied in     
  638    all cases (this is the reason why the criterion talks about strings     
  639    containing L in some cases).  However, the IDNABIS WG concluded that    
  640    this could not be required for several reasons:                         
  641                                                                            
  642    o  There is a large current deployment of ASCII domain names starting   
  643       with digits.  These cannot possibly be invalidated.                  
  644                                                                            
  645    o  Domain names are often constructed piecemeal, for instance, by       
  646       combining a string with the content of a search list.  This may      
  647       occur after IDNA processing, and thus in part of the code that is    
  648       not IDNA-aware, making detection of the undesirable combination      
  649       impossible.                                                          
  650                                                                            
  661    o  Even if a label is registered under a "safe" label, there may be a   
  662       DNAME [RFC2672] with an "unsafe" label that points to the "safe"     
  663       label, thus creating seemingly valid names that would not satisfy    
  664       the criterion.                                                       
  665                                                                            
  666    o  Wildcards create the odd situation where a label is "valid" (can     
  667       be looked up successfully) without the zone owner knowing that       
  668       this label exists.  So an owner of a zone whose name starts with a   
  669       digit and contains a wildcard has no way of controlling whether or   
  670       not names with RTL labels in them are looked up in that zone.
  671                                                                            
  672    Rather than trying to suggest rules that disallow all such              
  673    undesirable situations, this document merely warns about the            
  674    possibility, and leaves it to application developers to take whatever   
  675    measures they deem appropriate to avoid problematic situations.         
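
   As one illustration of such a measure, an application could scan a
   domain name for the label pairing described at the start of this
   section.  The sketch below is an assumption about how that might be
   done, not a procedure defined by this document: the function name
   is hypothetical, labels are split on FULL STOP only, and trailing
   NSM characters are skipped on the assumption that they behave like
   the character they follow.

```python
import unicodedata


def risky_label_boundaries(name: str) -> list:
    """Return (left, right) pairs of adjacent labels, in network order,
    where `left` ends in an R, AL, or AN character and `right` begins
    with an EN character."""
    labels = name.split(".")
    hits = []
    for left, right in zip(labels, labels[1:]):
        if not left or not right:
            continue
        # Assumption: ignore trailing nonspacing marks on the left label.
        end = len(left)
        while end > 1 and unicodedata.bidirectional(left[end - 1]) == "NSM":
            end -= 1
        left_class = unicodedata.bidirectional(left[end - 1])
        right_class = unicodedata.bidirectional(right[0])
        if left_class in {"R", "AL", "AN"} and right_class == "EN":
            hits.append((left, right))
    return hits


# Two Hebrew letters followed by a label that starts with a digit.
print(risky_label_boundaries("\u05d0\u05d1.1abc.example"))
```

   A display component might use a non-empty result to fall back to
   A-label form or to warn the user; which reaction is appropriate is
   left open here, as it is in the text above.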
  676                                                                            
  677 6.  Other Issues in Need of Resolution                                     
  678                                                                            
  679    This document concerns itself only with the rules that are needed       
  680    when dealing with domain names with characters that have differing      
  681    Bidi properties, and considers characters only in terms of their Bidi   
  682    properties.  All other issues with scripts that are written from        
  683    right to left must be considered in other contexts.                     
  684                                                                            
  685    One such issue is the need to keep numbers separate.  Several scripts   
  686    are used with multiple sets of numbers -- most commonly they use        
  687    Latin numbers and a script-specific set of numbers, but in the case     
  688    of Arabic, there are two sets of "Arabic-Indic" digits involved.        
  689                                                                            
  690    The algorithm in this document disallows occurrences of AN-class        
  691    characters ("Arabic-Indic digits", U+0660 to U+0669) together with      
  692    EN-class characters (which include "European" digits, U+0030 to
  693    U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but       
  694    does not help in preventing the mixing of, for instance, Bengali        
  695    digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),       
  696    both of which have Bidi class L.  A registry or script community that   
  697    wishes to create rules restricting the mixing of digits in a label      
  698    will be able to specify these restrictions at the registry level.       
  699    Some rules are also specified at the protocol level.                    
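
   A registry-level restriction of this kind could be sketched as
   follows.  This is an illustration only, not a rule from this
   document: the function names are hypothetical, and the sketch
   assumes that each script's decimal digits occupy ten consecutive
   code points in ascending order, so that a digit's code point minus
   its numeric value identifies its digit set.

```python
import unicodedata


def digit_sets(label: str) -> set:
    """Return the code point of the "zero" of every digit set in `label`."""
    zeros = set()
    for ch in label:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        if value is not None:
            zeros.add(ord(ch) - value)
    return zeros


def mixes_digit_sets(label: str) -> bool:
    """True if the label contains decimal digits from more than one set."""
    return len(digit_sets(label)) > 1


bengali_one = "\u09e7"
gujarati_two = "\u0ae8"
mixed = bengali_one + gujarati_two
print([unicodedata.bidirectional(ch) for ch in mixed])
print(mixes_digit_sets(mixed))
```

   Both digits in the example report Bidi class L, so a check based on
   Bidi properties alone cannot tell them apart, while grouping by
   digit set can.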
  700                                                                            
  701    Another set of issues concerns the proper display of IDNs with a        
  702    mixture of LTR and RTL labels, or only RTL labels.                      
  703                                                                            
  704    It is unrealistic to expect that applications will display domain       
  705    names using embedded formatting codes between their labels (for one     
  706    thing, no reliable algorithms for identifying domain names in running   
  707    text exist); thus, the display order will be determined by the Bidi     
  708    algorithm.  Thus, a sequence (in network order) of R1.R2.ltr will be    
  716    displayed in the order 2R.1R.ltr in an LTR context, which might         
  717    surprise someone expecting to see labels displayed in hierarchical      
  718    order.  People used to working with text that mixes LTR and RTL         
  719    strings might not be so surprised by this.  Again, this memo does not   
  720    attempt to suggest a solution to this problem.                          
  721                                                                            
  722 7.  Compatibility Considerations                                           
  723                                                                            
  724 7.1.  Backwards Compatibility Considerations                               
  725                                                                            
  726    As with any change to an existing standard, it is important to          
  727    consider what happens with existing implementations when the change     
  728    is introduced.  Some troublesome cases include:                         
  729                                                                            
  730    o  An old program is used to input the newly allowed label.  If the old
  731       program checks the input against RFC 3454, some labels will not be   
  732       allowed, and domain names containing those labels will remain        
  733       inaccessible.                                                        
  734                                                                            
  735    o  An old program is asked to display the newly allowed label, and      
  736       checks it against RFC 3454 before displaying.  The program will      
  737       perform some kind of fallback, most likely displaying the label in   
  738       A-label form.                                                        
  739                                                                            
  740    o  An old program tries to display the newly allowed label.  If the     
  741       old program has code for displaying the last character of a label    
  742       that is different from the code used to display the characters in    
  743       the middle of the label, the display may be inconsistent and cause   
  744       confusion.                                                           
  745                                                                            
  746    One particular example of the last case is if a program chooses to      
  747    examine the last character (in network order) of a string in order to   
  748    determine its directionality, rather than its first.  If it finds an    
  749    NSM character and tries to display the string as if it were a
  750    left-to-right string, the resulting display may be interesting, but     
  751    not useful.                                                             
  752                                                                            
  753    The editors believe that these cases will have a less harmful impact    
  754    in practice than continuing to deny the use of words from the           
  755    languages for which these strings are necessary as IDN labels.          
  756                                                                            
  757    This specification does not forbid using leading European digits in     
  758    ASCII-only labels, since this would conflict with a large installed     
  759    base of such labels, and would increase the scope of the                
  760    specification from RTL labels to all labels.  The harm resulting from   
  761    this limitation of scope is described in Section 5.  Registries and     
  762    private zone managers can check for this particular condition before    
  763    they allow registration of any RTL label.  Generally, it is best to     
  771    disallow registration of any right-to-left strings in a zone where      
  772    the label at the level above begins with a digit.                       
  773                                                                            
  774 7.2.  Forward Compatibility Considerations                                 
  775                                                                            
  776    This text is intentionally specified strictly in terms of the Unicode   
  777    Bidi properties.  The determination that the condition is sufficient    
  778    to fulfill the criteria depends on the Unicode Bidi algorithm; it is    
  779    unlikely that drastic changes will be made to this algorithm.           
  780                                                                            
  781    However, the determination of validity for any string depends on the    
  782    Unicode Bidi property values, which are not declared immutable by the   
  783    Unicode Consortium.  Furthermore, the behavior of the algorithm for     
  784    any given character is likely to be linguistically and culturally       
  785    sensitive, so while it should occur rarely, it is possible that later   
  786    versions of the Unicode Standard may change the Bidi properties         
  787    assigned to certain Unicode characters.                                 
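
         A change of this kind can at least be detected, though not
         resolved, by recording the Bidi properties that were in effect
         when a label was validated.  The sketch below assumes Python's
         standard unicodedata module as the source of property values;
         the function names and the snapshot layout are hypothetical.

```python
import unicodedata


def bidi_snapshot(label: str) -> dict:
    """Record the Unicode version and the Bidi class of each character."""
    return {
        "unicode_version": unicodedata.unidata_version,
        "classes": [unicodedata.bidirectional(ch) for ch in label],
    }


def bidi_classes_changed(label: str, snapshot: dict) -> bool:
    """True if any character's Bidi class differs from the stored snapshot."""
    return bidi_snapshot(label)["classes"] != snapshot["classes"]
```

         A snapshot stored at registration time could be compared again
         after the underlying Unicode data is upgraded; a difference
         would indicate that the label needs to be re-examined.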
  788                                                                            
  789    This memo does not propose a solution for this problem.                 
  790                                                                            
  791 8.  Security Considerations                                                
  792                                                                            
  793    The display behavior of mixed-direction text can be extremely
  794    surprising to users who are not used to it.  For instance, cutting
  795    and pasting a piece of text can cause it to display differently at
  796    the destination if the destination has a different directionality
  797    context, and adding a character at one place in a text can cause
  798    characters some distance from the point of insertion to change their
  799    display position.  This phenomenon is, however, not unique to the
  800    display of domain names.
  801                                                                            
  802    The new IDNA protocol, and particularly these new Bidi rules, will      
  803    allow some strings to be used in IDNA contexts that are not allowed     
  804    today.  It is possible that differences in the interpretation of        
  805    labels between implementations of IDNA2003 and IDNA2008 could pose a    
  806    security risk, but it is difficult to envision any specific             
  807    instantiation of this.                                                  
  808                                                                            
  809    Any rational attempt to compute, for instance, a hash over an           
  810    identifier processed by IDNA would use network order for its            
  811    computation, and thus be unaffected by the new rules proposed here.     
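
         The point can be illustrated with a short sketch.  The hash
         input is the sequence of code points as stored and transmitted
         (network order), which display reordering does not alter.  The
         function name and the sample label are hypothetical, and the
         reversed string is only a crude stand-in for a displayed form;
         actual display order is produced by the Unicode Bidi algorithm.

```python
import hashlib


def identifier_hash(label: str) -> str:
    """Hash the label's code points in network (logical) order."""
    return hashlib.sha256(label.encode("utf-8")).hexdigest()


# Hypothetical label: Hebrew alef, Hebrew bet, digit one, in network order.
network_order = "\u05d0\u05d1" + "1"

# Crude stand-in for a display-order copy of the same label (assumption).
display_order = network_order[::-1]

# The hash is computed over the stored form only, so it is stable no
# matter how the label is rendered on screen.
stable_digest = identifier_hash(network_order)
```

         Hashing display_order instead would yield a different digest,
         which is why the computation is expected to use network order.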
  812                                                                            
  813    While this is not believed to pose a problem, display routines
  814    written with specific knowledge of the RFC 3454 IDNA prohibitions
  815    could be affected by the potential problems noted under "Backwards
  816    Compatibility Considerations", which could cause new kinds of
  817    confusion.
  818                                                                            
  826 9.  Acknowledgements                                                       
  827                                                                            
  828    While the listed editors held the pen, this document represents the     
  829    joint work and conclusions of an ad hoc design team.  In addition to    
  830    the editors, this consisted of, in alphabetic order, Tina Dam, Patrik   
  831    Faltstrom, and John Klensin.  Many further specific contributions and   
  832    helpful comments were received from the people listed below, and        
  833    others who have contributed to the development and use of the IDNA      
  834    protocols.                                                              
  835                                                                            
  836    The particular formulation of the Bidi rule in Section 2 was            
  837    suggested by Matitiahu Allouche.                                        
  838                                                                            
  839    The team wishes, in particular, to thank Roozbeh Pournader for          
  840    calling its attention to the issue with the Thaana script, Paul         
  841    Hoffman for pointing out the need to be explicit about backwards        
  842    compatibility considerations, Ken Whistler for suggesting the basis     
  843    of the formalized "Character Grouping" requirement, Mark Davis for      
  844    commentary, Erik van der Poel for careful review, comments, and         
  845    verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete    
  846    Resnick for reviews, and Vint Cerf for chairing the working group and   
  847    contributing massively to getting the documents finished.               
  848                                                                            
  849 10.  References                                                            
  850                                                                            
  851 10.1.  Normative References                                                
  852                                                                            
  853    [RFC5890]      Klensin, J., "Internationalized Domain Names for         
  854                   Applications (IDNA): Definitions and Document            
  855                   Framework", RFC 5890, August 2010.                       
  856                                                                            
  857    [Unicode-UAX9] The Unicode Consortium, "Unicode Standard Annex #9:      
  858                   Unicode Bidirectional Algorithm", September 2009,        
  859                   <http://www.unicode.org/reports/tr9/>.                   
  860                                                                            
  861    [Unicode52]    The Unicode Consortium.  The Unicode Standard, Version   
  862                   5.2.0, defined by: "The Unicode Standard, Version        
  863                   5.2.0", (Mountain View, CA: The Unicode Consortium,      
  864                   2009. ISBN 978-1-936213-00-9).                           
  865                   <http://www.unicode.org/versions/Unicode5.2.0/>.         
  866                                                                            
  881 10.2.  Informative References                                              
  882                                                                            
  883    [RFC2672]      Crawford, M., "Non-Terminal DNS Name Redirection",       
  884                   RFC 2672, August 1999.                                   
  885                                                                            
  886    [RFC3454]      Hoffman, P. and M. Blanchet, "Preparation of             
  887                   Internationalized Strings ("stringprep")", RFC 3454,     
  888                   December 2002.                                           
  889                                                                            
  890    [RFC5891]      Klensin, J., "Internationalized Domain Names in          
  891                   Applications (IDNA): Protocol", RFC 5891, August 2010.   
  892                                                                            
  893    [SYO]          "The Standardized Yiddish Orthography: Rules of          
  894                   Yiddish Spelling, 6th ed., New York, ISBN                
  895                   0-914512-25-0", 1999.                                    
  896                                                                            
  897 Authors' Addresses                                                         
  898                                                                            
  899    Harald Tveit Alvestrand (editor)                                        
  900    Google                                                                  
  901    Beddingen 10                                                            
  902    Trondheim,   7014                                                       
  903    Norway                                                                  
  904                                                                            
  905    EMail: harald@alvestrand.no                                             
  906                                                                            
  907                                                                            
  908    Cary Karp                                                               
  909    Swedish Museum of Natural History                                       
  910    Frescativ. 40                                                           
  911    Stockholm,   10405                                                      
  912    Sweden                                                                  
  913                                                                            
  914    Phone: +46 8 5195 4055                                                  
  915    Fax:                                                                    
  916    EMail: ck@nic.museum                                                    