RFC 3492

    1 Network Working Group                                        A. Costello   
    2 Request for Comments: 3492                 Univ. of California, Berkeley   
    3 Category: Standards Track                                     March 2003   
    4                                                                            
    5                                                                            
    6               Punycode: A Bootstring encoding of Unicode                   
    7        for Internationalized Domain Names in Applications (IDNA)           
    8                                                                            
    9 Status of this Memo                                                        
   10                                                                            
   11    This document specifies an Internet standards track protocol for the    
   12    Internet community, and requests discussion and suggestions for         
   13    improvements.  Please refer to the current edition of the "Internet     
   14    Official Protocol Standards" (STD 1) for the standardization state      
   15    and status of this protocol.  Distribution of this memo is unlimited.   
   16                                                                            
   17 Copyright Notice                                                           
   18                                                                            
   19    Copyright (C) The Internet Society (2003).  All Rights Reserved.        
   20                                                                            
   21 Abstract                                                                   
   22                                                                            
   23    Punycode is a simple and efficient transfer encoding syntax designed    
   24    for use with Internationalized Domain Names in Applications (IDNA).     
   25    It uniquely and reversibly transforms a Unicode string into an ASCII    
   26    string.  ASCII characters in the Unicode string are represented         
   27    literally, and non-ASCII characters are represented by ASCII            
   28    characters that are allowed in host name labels (letters, digits, and   
   29    hyphens).  This document defines a general algorithm called             
   30    Bootstring that allows a string of basic code points to uniquely        
   31    represent any string of code points drawn from a larger set.            
   32    Punycode is an instance of Bootstring that uses particular parameter    
   33    values specified by this document, appropriate for IDNA.                
   34                                                                            
   35 Table of Contents                                                          
   36                                                                            
   37    1. Introduction...............................................2         
   38        1.1 Features..............................................2         
   39        1.2 Interaction of protocol parts.........................3         
   40    2. Terminology................................................3         
   41    3. Bootstring description.....................................4         
   42        3.1 Basic code point segregation..........................4         
   43        3.2 Insertion unsort coding...............................4         
   44        3.3 Generalized variable-length integers..................5         
   45        3.4 Bias adaptation.......................................7         
   46    4. Bootstring parameters......................................8         
   47    5. Parameter values for Punycode..............................8         
   48    6. Bootstring algorithms......................................9         
   49                                                                            
   50                                                                            
   51                                                                            
   52 Costello                    Standards Track                     [Page 1]   

   53 RFC 3492                     IDNA Punycode                    March 2003   
   54                                                                            
   55                                                                            
   56        6.1 Bias adaptation function.............................10         
   57        6.2 Decoding procedure...................................11         
   58        6.3 Encoding procedure...................................12         
   59        6.4 Overflow handling....................................13         
   60    7. Punycode examples.........................................14         
   61        7.1 Sample strings.......................................14         
   62        7.2 Decoding traces......................................17         
   63        7.3 Encoding traces......................................19         
   64    8. Security Considerations...................................20         
   65    9. References................................................21         
   66        9.1 Normative References.................................21         
   67        9.2 Informative References...............................21         
   68    A. Mixed-case annotation.....................................22         
   69    B. Disclaimer and license....................................22         
   70    C. Punycode sample implementation............................23         
   71    Author's Address.............................................34         
   72    Full Copyright Statement.....................................35         
   73                                                                            
   74 1. Introduction                                                            
   75                                                                            
   76    [IDNA] describes an architecture for supporting internationalized       
   77    domain names.  Labels containing non-ASCII characters can be            
   78    represented by ACE labels, which begin with a special ACE prefix and    
   79    contain only ASCII characters.  The remainder of the label after the    
   80    prefix is a Punycode encoding of a Unicode string satisfying certain    
   81    constraints.  For the details of the prefix and constraints, see        
   82    [IDNA] and [NAMEPREP].                                                  
   83                                                                            
   84    Punycode is an instance of a more general algorithm called              
   85    Bootstring, which allows strings composed from a small set of "basic"   
   86    code points to uniquely represent any string of code points drawn       
   87    from a larger set.  Punycode is Bootstring with particular parameter    
   88    values appropriate for IDNA.                                            
   89                                                                            
   90 1.1 Features                                                               
   91                                                                            
   92    Bootstring has been designed to have the following features:            
   93                                                                            
   94    *  Completeness:  Every extended string (sequence of arbitrary code     
   95       points) can be represented by a basic string (sequence of basic      
   96       code points).  Restrictions on what strings are allowed, and on      
   97       length, can be imposed by higher layers.                             
   98                                                                            
   99    *  Uniqueness:  There is at most one basic string that represents a     
  100       given extended string.                                               
  101                                                                            
  102    *  Reversibility:  Any extended string mapped to a basic string can     
  103       be recovered from that basic string.                                 
  104                                                                            
  105                                                                            
  106                                                                            
  107 Costello                    Standards Track                     [Page 2]   

  108 RFC 3492                     IDNA Punycode                    March 2003   
  109                                                                            
  110                                                                            
  111    *  Efficient encoding:  The ratio of basic string length to extended    
  112       string length is small.  This is important in the context of         
  113       domain names because RFC 1034 [RFC1034] restricts the length of a    
  114       domain label to 63 characters.                                       
  115                                                                            
  116    *  Simplicity:  The encoding and decoding algorithms are reasonably     
  117       simple to implement.  The goals of efficiency and simplicity are     
  118       at odds; Bootstring aims at a good balance between them.             
  119                                                                            
  120    *  Readability:  Basic code points appearing in the extended string     
  121       are represented as themselves in the basic string (although the      
  122       main purpose is to improve efficiency, not readability).             
  123                                                                            
  124    Punycode can also support an additional feature that is not used by     
  125    the ToASCII and ToUnicode operations of [IDNA].  When extended          
  126    strings are case-folded prior to encoding, the basic string can use     
  127    mixed case to tell how to convert the folded string into a mixed-case   
  128    string.  See appendix A "Mixed-case annotation".                        
  129                                                                            
  130 1.2 Interaction of protocol parts                                          
  131                                                                            
  132    Punycode is used by the IDNA protocol [IDNA] for converting domain      
  133    labels into ASCII; it is not designed for any other purpose.  It is     
  134    explicitly not designed for processing arbitrary free text.             
  135                                                                            
  136 2. Terminology                                                             
  137                                                                            
  138    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",     
  139    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this    
  140    document are to be interpreted as described in BCP 14, RFC 2119         
  141    [RFC2119].                                                              
  142                                                                            
  143    A code point is an integral value associated with a character in a      
  144    coded character set.                                                    
  145                                                                            
  146    As in the Unicode Standard [UNICODE], Unicode code points are denoted   
  147    by "U+" followed by four to six hexadecimal digits, while a range of    
  148    code points is denoted by two hexadecimal numbers separated by "..",    
  149    with no prefixes.                                                       
  150                                                                            
  151    The operators div and mod perform integer division; (x div y) is the    
  152    quotient of x divided by y, discarding the remainder, and (x mod y)     
  153    is the remainder, so (x div y) * y + (x mod y) == x.  Bootstring uses   
  154    these operators only with nonnegative operands, so the quotient and     
  155    remainder are always nonnegative.                                       
  156                                                                            
  157    The break statement jumps out of the innermost loop (as in C).          
  158                                                                            
  159                                                                            
  160                                                                            
  161                                                                            
  162 Costello                    Standards Track                     [Page 3]   

  163 RFC 3492                     IDNA Punycode                    March 2003   
  164                                                                            
  165                                                                            
  166    An overflow is an attempt to compute a value that exceeds the maximum   
  167    value of an integer variable.                                           
  168                                                                            
  169 3. Bootstring description                                                  
  170                                                                            
  171    Bootstring represents an arbitrary sequence of code points (the         
  172    "extended string") as a sequence of basic code points (the "basic       
  173    string").  This section describes the representation.  Section 6        
  174    "Bootstring algorithms" presents the algorithms as pseudocode.          
  175    Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the      
  176    algorithms for sample inputs.                                           
  177                                                                            
  178    The following sections describe the four techniques used in             
  179    Bootstring.  "Basic code point segregation" is a very simple and        
  180    efficient encoding for basic code points occurring in the extended      
  181    string: they are simply copied all at once.  "Insertion unsort          
  182    coding" encodes the non-basic code points as deltas, and processes      
  183    the code points in numerical order rather than in order of              
  184    appearance, which typically results in smaller deltas.  The deltas      
  185    are represented as "generalized variable-length integers", which use    
  186    basic code points to represent nonnegative integers.  The parameters    
  187    of this integer representation are dynamically adjusted using "bias     
  188    adaptation", to improve efficiency when consecutive deltas have         
  189    similar magnitudes.                                                     
  190                                                                            
  191 3.1 Basic code point segregation                                           
  192                                                                            
  193    All basic code points appearing in the extended string are              
  194    represented literally at the beginning of the basic string, in their    
  195    original order, followed by a delimiter if (and only if) the number     
  196    of basic code points is nonzero.  The delimiter is a particular basic   
  197    code point, which never appears in the remainder of the basic string.   
  198    The decoder can therefore find the end of the literal portion (if       
  199    there is one) by scanning for the last delimiter.                       
  200                                                                            
  201 3.2 Insertion unsort coding                                                
  202                                                                            
  203    The remainder of the basic string (after the last delimiter if there    
  204    is one) represents a sequence of nonnegative integral deltas as         
  205    generalized variable-length integers, described in section 3.3.  The    
  206    meaning of the deltas is best understood in terms of the decoder.       
  207                                                                            
  208    The decoder builds the extended string incrementally.  Initially, the   
  209    extended string is a copy of the literal portion of the basic string    
  210    (excluding the last delimiter).  The decoder inserts non-basic code     
  211    points, one for each delta, into the extended string, ultimately        
  212    arriving at the final decoded string.                                   
  213                                                                            
  214                                                                            
  215                                                                            
  216                                                                            
  217 Costello                    Standards Track                     [Page 4]   

  218 RFC 3492                     IDNA Punycode                    March 2003   
  219                                                                            
  220                                                                            
  221    At the heart of this process is a state machine with two state          
  222    variables: an index i and a counter n.  The index i refers to a         
  223    position in the extended string; it ranges from 0 (the first            
  224    position) to the current length of the extended string (which refers    
  225    to a potential position beyond the current end).  If the current        
  226    state is <n,i>, the next state is <n,i+1> if i is less than the         
  227    length of the extended string, or <n+1,0> if i equals the length of     
  228    the extended string.  In other words, each state change causes i to     
  229    increment, wrapping around to zero if necessary, and n counts the       
  230    number of wrap-arounds.                                                 
  231                                                                            
  232    Notice that the state always advances monotonically (there is no way    
  233    for the decoder to return to an earlier state).  At each state, an      
  234    insertion is either performed or not performed.  At most one            
  235    insertion is performed in a given state.  An insertion inserts the      
  236    value of n at position i in the extended string.  The deltas are a      
  237    run-length encoding of this sequence of events: they are the lengths    
  238    of the runs of non-insertion states preceeding the insertion states.    
  239    Hence, for each delta, the decoder performs delta state changes, then   
  240    an insertion, and then one more state change.  (An implementation       
  241    need not perform each state change individually, but can instead use    
  242    division and remainder calculations to compute the next insertion       
  243    state directly.)  It is an error if the inserted code point is a        
  244    basic code point (because basic code points were supposed to be         
  245    segregated as described in section 3.1).                                
  246                                                                            
  247    The encoder's main task is to derive the sequence of deltas that will   
  248    cause the decoder to construct the desired string.  It can do this by   
  249    repeatedly scanning the extended string for the next code point that    
  250    the decoder would need to insert, and counting the number of state      
  251    changes the decoder would need to perform, mindful of the fact that     
  252    the decoder's extended string will include only those code points       
  253    that have already been inserted.  Section 6.3 "Encoding procedure"      
  254    gives a precise algorithm.                                              
  255                                                                            
  256 3.3 Generalized variable-length integers                                   
  257                                                                            
  258    In a conventional integer representation the base is the number of      
  259    distinct symbols for digits, whose values are 0 through base-1.  Let    
  260    digit_0 denote the least significant digit, digit_1 the next least      
  261    significant, and so on.  The value represented is the sum over j of     
  262    digit_j * w(j), where w(j) = base^j is the weight (scale factor) for    
  263    position j.  For example, in the base 8 integer 437, the digits are     
  264    7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 +      
  265    3*8 + 4*64 = 287.  This representation has two disadvantages:  First,   
  266    there are multiple encodings of each value (because there can be        
  267    extra zeros in the most significant positions), which is inconvenient   
  268                                                                            
  269                                                                            
  270                                                                            
  271                                                                            
  272 Costello                    Standards Track                     [Page 5]   

  273 RFC 3492                     IDNA Punycode                    March 2003   
  274                                                                            
  275                                                                            
  276    when unique encodings are needed.  Second, the integer is not self-     
  277    delimiting, so if multiple integers are concatenated the boundaries     
  278    between them are lost.                                                  
  279                                                                            
  280    The generalized variable-length representation solves these two         
  281    problems.  The digit values are still 0 through base-1, but now the     
  282    integer is self-delimiting by means of thresholds t(j), each of which   
  283    is in the range 0 through base-1.  Exactly one digit, the most          
  284    significant, satisfies digit_j < t(j).  Therefore, if several           
  285    integers are concatenated, it is easy to separate them, starting with   
  286    the first if they are little-endian (least significant digit first),    
  287    or starting with the last if they are big-endian (most significant      
  288    digit first).  As before, the value is the sum over j of digit_j *      
  289    w(j), but the weights are different:                                    
  290                                                                            
  291       w(0) = 1                                                             
  292       w(j) = w(j-1) * (base - t(j-1)) for j > 0                            
  293                                                                            
  294    For example, consider the little-endian sequence of base 8 digits       
  295    734251...  Suppose the thresholds are 2, 3, 5, 5, 5, 5...  This         
  296    implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) =   
  297    90, 90*(8-5) = 270, and so on.  7 is not less than 2, and 3 is not      
  298    less than 3, but 4 is less than 5, so 4 is the last digit.  The value   
  299    of 734 is 7*1 + 3*6 + 4*30 = 145.  The next integer is 251, with        
  300    value 2*1 + 5*6 + 1*30 = 62.  Decoding this representation is very      
  301    similar to decoding a conventional integer:  Start with a current       
  302    value of N = 0 and a weight w = 1.  Fetch the next digit d and          
  303    increase N by d * w.  If d is less than the current threshold (t)       
  304    then stop, otherwise increase w by a factor of (base - t), update t     
  305    for the next position, and repeat.                                      
  306                                                                            
  307    Encoding this representation is similar to encoding a conventional      
  308    integer:  If N < t then output one digit for N and stop, otherwise      
  309    output the digit for t + ((N - t) mod (base - t)), then replace N       
  310    with (N - t) div (base - t), update t for the next position, and        
  311    repeat.                                                                 
  312                                                                            
  313    For any particular set of values of t(j), there is exactly one          
  314    generalized variable-length representation of each nonnegative          
  315    integral value.                                                         
  316                                                                            
  317    Bootstring uses little-endian ordering so that the deltas can be        
  318    separated starting with the first.  The t(j) values are defined in      
  319    terms of the constants base, tmin, and tmax, and a state variable       
  320    called bias:                                                            
  321                                                                            
  322       t(j) = base * (j + 1) - bias,                                        
  323       clamped to the range tmin through tmax                               
  324                                                                            
  325                                                                            
  326                                                                            
  327 Costello                    Standards Track                     [Page 6]   

  328 RFC 3492                     IDNA Punycode                    March 2003   
  329                                                                            
  330                                                                            
  331    The clamping means that if the formula yields a value less than tmin    
  332    or greater than tmax, then t(j) = tmin or tmax, respectively.  (In      
  333    the pseudocode in section 6 "Bootstring algorithms", the expression     
  334    base * (j + 1) is denoted by k for performance reasons.)  These t(j)    
  335    values cause the representation to favor integers within a particular   
  336    range determined by the bias.                                           
  337                                                                            
  338 3.4 Bias adaptation                                                        
  339                                                                            
  340    After each delta is encoded or decoded, bias is set for the next        
  341    delta as follows:                                                       
  342                                                                            
  343    1. Delta is scaled in order to avoid overflow in the next step:         
  344                                                                            
  345          let delta = delta div 2                                           
  346                                                                            
  347       But when this is the very first delta, the divisor is not 2, but     
  348       instead a constant called damp.  This compensates for the fact       
  349       that the second delta is usually much smaller than the first.        
  350                                                                            
  351    2. Delta is increased to compensate for the fact that the next delta    
  352       will be inserting into a longer string:                              
  353                                                                            
  354          let delta = delta + (delta div numpoints)                         
  355                                                                            
  356       numpoints is the total number of code points encoded/decoded so      
  357       far (including the one corresponding to this delta itself, and       
  358       including the basic code points).                                    
  359                                                                            
  360    3. Delta is repeatedly divided until it falls within a threshold, to    
  361       predict the minimum number of digits needed to represent the next    
  362       delta:                                                               
  363                                                                            
  364          while delta > ((base - tmin) * tmax) div 2                        
  365          do let delta = delta div (base - tmin)                            
  366                                                                            
  367    4. The bias is set:                                                     
  368                                                                            
  369          let bias =                                                        
  370            (base * the number of divisions performed in step 3) +          
  371            (((base - tmin + 1) * delta) div (delta + skew))                
  372                                                                            
  373       The motivation for this procedure is that the current delta          
  374       provides a hint about the likely size of the next delta, and so      
  375       t(j) is set to tmax for the more significant digits starting with    
  376       the one expected to be last, tmin for the less significant digits    
  377       up through the one expected to be third-last, and somewhere          
  378       between tmin and tmax for the digit expected to be second-last       
  379                                                                            
  380                                                                            
  381                                                                            
  382 Costello                    Standards Track                     [Page 7]   

  383 RFC 3492                     IDNA Punycode                    March 2003   
  384                                                                            
  385                                                                            
  386       (balancing the hope of the expected-last digit being unnecessary     
  387       against the danger of it being insufficient).                        
  388                                                                            
  389 4. Bootstring parameters                                                   
  390                                                                            
  391    Given a set of basic code points, one needs to be designated as the     
  392    delimiter.  The base cannot be greater than the number of               
  393    distinguishable basic code points remaining.  The digit-values in the   
  394    range 0 through base-1 need to be associated with distinct non-         
  395    delimiter basic code points.  In some cases multiple code points need   
  396    to have the same digit-value; for example, uppercase and lowercase      
  397    versions of the same letter need to be equivalent if basic strings      
  398    are case-insensitive.                                                   
  399                                                                            
  400    The initial value of n cannot be greater than the minimum non-basic     
  401    code point that could appear in extended strings.                       
  402                                                                            
  403    The remaining five parameters (tmin, tmax, skew, damp, and the          
  404    initial value of bias) need to satisfy the following constraints:       
  405                                                                            
  406       0 <= tmin <= tmax <= base-1                                          
  407       skew >= 1                                                            
  408       damp >= 2                                                            
  409       initial_bias mod base <= base - tmin                                 
  410                                                                            
  411    Provided the constraints are satisfied, these five parameters affect    
  412    efficiency but not correctness.  They are best chosen empirically.      
  413                                                                            
  414    If support for mixed-case annotation is desired (see appendix A),       
  415    make sure that the code points corresponding to 0 through tmax-1 all    
  416    have both uppercase and lowercase forms.                                
  417                                                                            
  418 5. Parameter values for Punycode                                           
  419                                                                            
  420    Punycode uses the following Bootstring parameter values:                
  421                                                                            
  422       base         = 36                                                    
  423       tmin         = 1                                                     
  424       tmax         = 26                                                    
  425       skew         = 38                                                    
  426       damp         = 700                                                   
  427       initial_bias = 72                                                    
  428       initial_n    = 128 = 0x80                                            
  429                                                                            
  430    Although the only restriction Punycode imposes on the input integers    
  431    is that they be nonnegative, these parameters are especially designed   
  432    to work well with Unicode [UNICODE] code points, which are integers     
  433    in the range 0..10FFFF (but not D800..DFFF, which are reserved for      
  434                                                                            
  435                                                                            
  436                                                                            
  437 Costello                    Standards Track                     [Page 8]   

  438 RFC 3492                     IDNA Punycode                    March 2003   
  439                                                                            
  440                                                                            
  441    use by the UTF-16 encoding of Unicode).  The basic code points are      
  442    the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the       
  443    delimiter, and some of the others have digit-values as follows:         
  444                                                                            
  445       code points    digit-values                                          
  446       ------------   ----------------------                                
  447       41..5A (A-Z) =  0 to 25, respectively                                
  448       61..7A (a-z) =  0 to 25, respectively                                
  449       30..39 (0-9) = 26 to 35, respectively                                
  450                                                                            
  451    Using hyphen-minus as the delimiter implies that the encoded string     
  452    can end with a hyphen-minus only if the Unicode string consists         
  453    entirely of basic code points, but IDNA forbids such strings from       
  454    being encoded.  The encoded string can begin with a hyphen-minus, but   
  455    IDNA prepends a prefix.  Therefore IDNA using Punycode conforms to      
  456    the RFC 952 rule that host name labels neither begin nor end with a     
  457    hyphen-minus [RFC952].                                                  
  458                                                                            
  459    A decoder MUST recognize the letters in both uppercase and lowercase    
  460    forms (including mixtures of both forms).  An encoder SHOULD output     
  461    only uppercase forms or only lowercase forms, unless it uses mixed-     
  462    case annotation (see appendix A).                                       
  463                                                                            
  464    Presumably most users will not manually write or type encoded strings   
  465    (as opposed to cutting and pasting them), but those who do will need    
  466    to be alert to the potential visual ambiguity between the following     
  467    sets of characters:                                                     
  468                                                                            
  469       G 6                                                                  
  470       I l 1                                                                
  471       O 0                                                                  
  472       S 5                                                                  
  473       U V                                                                  
  474       Z 2                                                                  
  475                                                                            
  476    Such ambiguities are usually resolved by context, but in a Punycode     
  477    encoded string there is no context apparent to humans.                  
  478                                                                            
  479 6. Bootstring algorithms                                                   
  480                                                                            
  481    Some parts of the pseudocode can be omitted if the parameters satisfy   
  482    certain conditions (for which Punycode qualifies).  These parts are     
  483    enclosed in {braces}, and notes immediately following the pseudocode    
  484    explain the conditions under which they can be omitted.                 
  485                                                                            
  486                                                                            
  487                                                                            
  488                                                                            
  489                                                                            
  490                                                                            
  491                                                                            
  492 Costello                    Standards Track                     [Page 9]   

  493 RFC 3492                     IDNA Punycode                    March 2003   
  494                                                                            
  495                                                                            
  496    Formally, code points are integers, and hence the pseudocode assumes    
  497    that arithmetic operations can be performed directly on code points.    
  498    In some programming languages, explicit conversion between code         
  499    points and integers might be necessary.                                 
  500                                                                            
  501 6.1 Bias adaptation function                                               
  502                                                                            
  503    function adapt(delta,numpoints,firsttime):                              
  504      if firsttime then let delta = delta div damp                          
  505      else let delta = delta div 2                                          
  506      let delta = delta + (delta div numpoints)                             
  507      let k = 0                                                             
  508      while delta > ((base - tmin) * tmax) div 2 do begin                   
  509        let delta = delta div (base - tmin)                                 
  510        let k = k + base                                                    
  511      end                                                                   
  512      return k + (((base - tmin + 1) * delta) div (delta + skew))           
  513                                                                            
  514    It does not matter whether the modifications to delta and k inside      
  515    adapt() affect variables of the same name inside the                    
  516    encoding/decoding procedures, because after calling adapt() the         
  517    caller does not read those variables before overwriting them.           
  518                                                                            
  519                                                                            
  520                                                                            
  521                                                                            
  522                                                                            
  523                                                                            
  524                                                                            
  525                                                                            
  526                                                                            
  527                                                                            
  528                                                                            
  529                                                                            
  530                                                                            
  531                                                                            
  532                                                                            
  533                                                                            
  534                                                                            
  535                                                                            
  536                                                                            
  537                                                                            
  538                                                                            
  539                                                                            
  540                                                                            
  541                                                                            
  542                                                                            
  543                                                                            
  544                                                                            
  545                                                                            
  546                                                                            
  547 Costello                    Standards Track                    [Page 10]   

  548 RFC 3492                     IDNA Punycode                    March 2003   
  549                                                                            
  550                                                                            
  551 6.2 Decoding procedure                                                     
  552                                                                            
  553    let n = initial_n                                                       
  554    let i = 0                                                               
  555    let bias = initial_bias                                                 
  556    let output = an empty string indexed from 0                             
  557    consume all code points before the last delimiter (if there is one)     
  558      and copy them to output, fail on any non-basic code point             
  559    if more than zero code points were consumed then consume one more       
  560      (which will be the last delimiter)                                    
  561    while the input is not exhausted do begin                               
  562      let oldi = i                                                          
  563      let w = 1                                                             
  564      for k = base to infinity in steps of base do begin                    
  565        consume a code point, or fail if there was none to consume          
  566        let digit = the code point's digit-value, fail if it has none       
  567        let i = i + digit * w, fail on overflow                             
  568        let t = tmin if k <= bias {+ tmin}, or                              
  569                tmax if k >= bias + tmax, or k - bias otherwise             
  570        if digit < t then break                                             
  571        let w = w * (base - t), fail on overflow                            
  572      end                                                                   
  573      let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)       
  574      let n = n + i div (length(output) + 1), fail on overflow              
  575      let i = i mod (length(output) + 1)                                    
  576      {if n is a basic code point then fail}                                
  577      insert n into output at position i                                    
  578      increment i                                                           
  579    end                                                                     
  580                                                                            
  581    The full statement enclosed in braces (checking whether n is a basic    
  582    code point) can be omitted if initial_n exceeds all basic code points   
  583    (which is true for Punycode), because n is never less than initial_n.   
  584                                                                            
  585    In the assignment of t, where t is clamped to the range tmin through    
  586    tmax, "+ tmin" can always be omitted.  This makes the clamping          
  587    calculation incorrect when bias < k < bias + tmin, but that cannot      
  588    happen because of the way bias is computed and because of the           
  589    constraints on the parameters.                                          
  590                                                                            
  591    Because the decoder state can only advance monotonically, and there     
  592    is only one representation of any delta, there is therefore only one    
  593    encoded string that can represent a given sequence of integers.  The    
  594    only error conditions are invalid code points, unexpected end-of-       
  595    input, overflow, and basic code points encoded using deltas instead     
  596    of appearing literally.  If the decoder fails on these errors as        
  597    shown above, then it cannot produce the same output for two distinct    
  598    inputs.  Without this property it would have been necessary to re-      
  599                                                                            
  600                                                                            
  601                                                                            
  602 Costello                    Standards Track                    [Page 11]   

  603 RFC 3492                     IDNA Punycode                    March 2003   
  604                                                                            
  605                                                                            
  606    encode the output and verify that it matches the input in order to      
  607    guarantee the uniqueness of the encoding.                               
  608                                                                            
  609 6.3 Encoding procedure                                                     
  610                                                                            
  611    let n = initial_n                                                       
  612    let delta = 0                                                           
  613    let bias = initial_bias                                                 
  614    let h = b = the number of basic code points in the input                
  615    copy them to the output in order, followed by a delimiter if b > 0      
  616    {if the input contains a non-basic code point < n then fail}            
  617    while h < length(input) do begin                                        
  618      let m = the minimum {non-basic} code point >= n in the input          
  619      let delta = delta + (m - n) * (h + 1), fail on overflow               
  620      let n = m                                                             
  621      for each code point c in the input (in order) do begin                
  622        if c < n {or c is basic} then increment delta, fail on overflow     
  623        if c == n then begin                                                
  624          let q = delta                                                     
  625          for k = base to infinity in steps of base do begin                
  626            let t = tmin if k <= bias {+ tmin}, or                          
  627                    tmax if k >= bias + tmax, or k - bias otherwise         
  628            if q < t then break                                             
  629            output the code point for digit t + ((q - t) mod (base - t))    
  630            let q = (q - t) div (base - t)                                  
  631          end                                                               
  632          output the code point for digit q                                 
  633          let bias = adapt(delta, h + 1, test h equals b?)                  
  634          let delta = 0                                                     
  635          increment h                                                       
  636        end                                                                 
  637      end                                                                   
  638      increment delta and n                                                 
  639    end                                                                     
  640                                                                            
  641    The full statement enclosed in braces (checking whether the input       
  642    contains a non-basic code point less than n) can be omitted if all      
  643    code points less than initial_n are basic code points (which is true    
  644    for Punycode if code points are unsigned).                              
  645                                                                            
  646    The brace-enclosed conditions "non-basic" and "or c is basic" can be    
  647    omitted if initial_n exceeds all basic code points (which is true for   
  648    Punycode), because the code point being tested is never less than       
  649    initial_n.                                                              
  650                                                                            
  651    In the assignment of t, where t is clamped to the range tmin through    
  652    tmax, "+ tmin" can always be omitted.  This makes the clamping          
  653    calculation incorrect when bias < k < bias + tmin, but that cannot      
  654                                                                            
  655                                                                            
  656                                                                            
  657 Costello                    Standards Track                    [Page 12]   

  658 RFC 3492                     IDNA Punycode                    March 2003   
  659                                                                            
  660                                                                            
  661    happen because of the way bias is computed and because of the           
  662    constraints on the parameters.                                          
  663                                                                            
  664    The checks for overflow are necessary to avoid producing invalid        
  665    output when the input contains very large values or is very long.       
  666                                                                            
  667    The increment of delta at the bottom of the outer loop cannot           
  668    overflow because delta < length(input) before the increment, and        
  669    length(input) is already assumed to be representable.  The increment    
  670    of n could overflow, but only if h == length(input), in which case      
  671    the procedure is finished anyway.                                       
  672                                                                            
  673 6.4 Overflow handling                                                      
  674                                                                            
  675    For IDNA, 26-bit unsigned integers are sufficient to handle all valid   
  676    IDNA labels without overflow, because any string that needed a 27-bit   
  677    delta would have to exceed either the code point limit (0..10FFFF) or   
  678    the label length limit (63 characters).  However, overflow handling     
  679    is necessary because the inputs are not necessarily valid IDNA          
  680    labels.                                                                 
  681                                                                            
  682    If the programming language does not provide overflow detection, the    
  683    following technique can be used.  Suppose A, B, and C are               
  684    representable nonnegative integers and C is nonzero.  Then A + B        
  685    overflows if and only if B > maxint - A, and A + (B * C) overflows if

top ICANNDNS RFC Annotations project

The IETF is responsible for the creation and maintenance of the DNS RFCs. The ICANN DNS RFC annotation project provides a forum for collecting community annotations on these RFCs as an aid to understanding for implementers and any interested parties. The annotations displayed here are not the result of the IETF consensus process.

This RFC is included in the DNS RFCs annotation project whose home page is here.

GLOBAL UPDATED

Updated by RFC5891

GLOBAL HAS ERRATA

Has errata: #265, #3026

GLOBAL V. Risk, ISC.orgBIND 9 implementation note2022-08-15

This RFC is implemented in BIND 9.18 (all versions).

GLOBAL P. Hoffman, ICANNLoose update by RFC 5891

RFC5891 says that it updates RFC 3492, but it doesn't really do so. Instead, Section 4.4 says:

This document does not update or alter the
Punycode algorithm specified in RFC 3492 in any way.  RFC 3492 does
make a non-normative reference to the information about the value and
construction of the ACE prefix that appears in RFC 3490 or Nameprep
[RFC3491].  For consistency and reader convenience, IDNA2008
effectively updates that reference to point to this document.  That
change does not alter the prefix itself.  The prefix, "xn--", is the
same in both sets of documents.

  796    (I) Russian (Cyrillic):                                                 
  797        U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E      
  798        u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440      
  799        u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A      
  800        u+0438                                                              
  801        Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l                          
  802                                                                            
  803    (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol     
  804        U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070      
  805        u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070      
  806        u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061      
  807        u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070      
  808        u+0061 u+00F1 u+006F u+006C                                         
  809        Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a             
  810                                                                            
  811    (K) Vietnamese:                                                         
  812        T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\        
  813        <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t            
  814        U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B      
  815        u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068      
  816        u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067      
  817        U+0056 u+0069 u+1EC7 u+0074                                         
  818        Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g              
  819                                                                            
  820                                                                            
  821                                                                            
  822 Costello                    Standards Track                    [Page 15]   

  823 RFC 3492                     IDNA Punycode                    March 2003   
  824                                                                            
  825                                                                            
  826    The next several examples are all names of Japanese music artists,      
  827    song titles, and TV programs, just because the author happens to have   
  828    them handy (but Japanese is useful for providing examples of single-    
  829    row text, two-row text, ideographic text, and various mixtures          
  830    thereof).                                                               
  831                                                                            
  832    (L) 3<nen>B<gumi><kinpachi><sensei>                                     
  833        u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F             
  834        Punycode: 3B-ww4c5e180e575a65lsy2b                                  
  835                                                                            
  836    (M) <amuro><namie>-with-SUPER-MONKEYS                                   
  837        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074      
  838        u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D      
  839        U+004F U+004E U+004B U+0045 U+0059 U+0053                           
  840        Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n                  
  841                                                                            
  842    (N) Hello-Another-Way-<sorezore><no><basho>                             
  843        U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F      
  844        u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D      
  845        u+305D u+308C u+305E u+308C u+306E u+5834 u+6240                    
  846        Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b                  
  847                                                                            
  848    (O) <hitotsu><yane><no><shita>2                                         
  849        u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032             
  850        Punycode: 2-u9tlzr9756bt3uc0v                                       
  851                                                                            
  852    (P) Maji<de>Koi<suru>5<byou><mae>                                       
  853        U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059      
  854        u+308B u+0035 u+79D2 u+524D                                         
  855        Punycode: MajiKoi5-783gue6qz075azm5e                                
  856                                                                            
  857    (Q) <pafii>de<runba>                                                    
  858        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0      
  859        Punycode: de-jg4avhby1noc0d                                         
  860                                                                            
  861    (R) <sono><supiido><de>                                                 
  862        u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067                    
  863        Punycode: d9juau41awczczp                                           
  864                                                                            
  865    The last example is an ASCII string that breaks the existing rules      
  866    for host name labels.  (It is not a realistic example for IDNA,         
  867    because IDNA never encodes pure ASCII labels.)                          
  868                                                                            
  869    (S) -> $1.00 <-                                                         
  870        u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020      
  871        u+003C u+002D                                                       
  872        Punycode: -> $1.00 <--                                              
  873                                                                            
  874                                                                            
  875                                                                            
  876                                                                            
  877 Costello                    Standards Track                    [Page 16]   

  878 RFC 3492                     IDNA Punycode                    March 2003   
  879                                                                            
  880                                                                            
  881 7.2 Decoding traces                                                        
  882                                                                            
  883    In the following traces, the evolving state of the decoder is shown     
  884    as a sequence of hexadecimal values, representing the code points in    
  885    the extended string.  An asterisk appears just after the most           
  886    recently inserted code point, indicating both n (the value preceeding   
  887    the asterisk) and i (the position of the value just after the           
  888    asterisk).  Other numerical values are decimal.                         
  889                                                                            
  890    Decoding trace of example B from section 7.1:                           
  891                                                                            
  892    n is 128, i is 0, bias is 72                                            
  893    input is "ihqwcrb4cv8a8dqg056pqjye"                                     
  894    there is no delimiter, so extended string starts empty                  
  895    delta "ihq" decodes to 19853                                            
  896    bias becomes 21                                                         
  897    4E0D *                                                                  
  898    delta "wc" decodes to 64                                                
  899    bias becomes 20                                                         
  900    4E0D 4E2D *                                                             
  901    delta "rb" decodes to 37                                                
  902    bias becomes 13                                                         
  903    4E3A * 4E0D 4E2D                                                        
  904    delta "4c" decodes to 56                                                
  905    bias becomes 17                                                         
  906    4E3A 4E48 * 4E0D 4E2D                                                   
  907    delta "v8a" decodes to 599                                              
  908    bias becomes 32                                                         
  909    4E3A 4EC0 * 4E48 4E0D 4E2D                                              
  910    delta "8d" decodes to 130                                               
  911    bias becomes 23                                                         
  912    4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D                                         
  913    delta "qg" decodes to 154                                               
  914    bias becomes 25                                                         
  915    4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D                                    
  916    delta "056p" decodes to 46301                                           
  917    bias becomes 84                                                         
  918    4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *                               
  919    delta "qjye" decodes to 88531                                           
  920    bias becomes 90                                                         
  921    4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587                          
  922                                                                            
  923                                                                            
  924                                                                            
  925                                                                            
  926                                                                            
  927                                                                            
  928                                                                            
  929                                                                            
  930                                                                            
  931                                                                            
  932 Costello                    Standards Track                    [Page 17]   

  933 RFC 3492                     IDNA Punycode                    March 2003   
  934                                                                            
  935                                                                            
  936    Decoding trace of example L from section 7.1:                           
  937                                                                            
  938    n is 128, i is 0, bias is 72                                            
  939    input is "3B-ww4c5e180e575a65lsy2b"                                     
  940    literal portion is "3B-", so extended string starts as:                 
  941    0033 0042                                                               
  942    delta "ww4c" decodes to 62042                                           
  943    bias becomes 27                                                         
  944    0033 0042 5148 *                                                        
  945    delta "5e" decodes to 139                                               
  946    bias becomes 24                                                         
  947    0033 0042 516B * 5148                                                   
  948    delta "180e" decodes to 16683                                           
  949    bias becomes 67                                                         
  950    0033 5E74 * 0042 516B 5148                                              
  951    delta "575a" decodes to 34821                                           
  952    bias becomes 82                                                         
  953    0033 5E74 0042 516B 5148 751F *                                         
  954    delta "65l" decodes to 14592                                            
  955    bias becomes 67                                                         
  956    0033 5E74 0042 7D44 * 516B 5148 751F                                    
  957    delta "sy2b" decodes to 42088                                           
  958    bias becomes 84                                                         
  959    0033 5E74 0042 7D44 91D1 * 516B 5148 751F                               
  960                                                                            
  961                                                                            
  962                                                                            
  963                                                                            
  964                                                                            
  965                                                                            
  966                                                                            
  967                                                                            
  968                                                                            
  969                                                                            
  970                                                                            
  971                                                                            
  972                                                                            
  973                                                                            
  974                                                                            
  975                                                                            
  976                                                                            
  977                                                                            
  978                                                                            
  979                                                                            
  980                                                                            
  981                                                                            
  982                                                                            
  983                                                                            
  984                                                                            
  985                                                                            
  986                                                                            
  987 Costello                    Standards Track                    [Page 18]   

  988 RFC 3492                     IDNA Punycode                    March 2003   
  989                                                                            
  990                                                                            
  991 7.3 Encoding traces                                                        
  992                                                                            
  993    In the following traces, code point values are hexadecimal, while       
  994    other numerical values are decimal.                                     
  995                                                                            
  996    Encoding trace of example B from section 7.1:                           
  997                                                                            
  998    bias is 72                                                              
  999    input is:                                                               
 1000    4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587                            
 1001    there are no basic code points, so no literal portion                   
 1002    next code point to insert is 4E0D                                       
 1003    needed delta is 19853, encodes as "ihq"                                 
 1004    bias becomes 21                                                         
 1005    next code point to insert is 4E2D                                       
 1006    needed delta is 64, encodes as "wc"                                     
 1007    bias becomes 20                                                         
 1008    next code point to insert is 4E3A                                       
 1009    needed delta is 37, encodes as "rb"                                     
 1010    bias becomes 13                                                         
 1011    next code point to insert is 4E48                                       
 1012    needed delta is 56, encodes as "4c"                                     
 1013    bias becomes 17                                                         
 1014    next code point to insert is 4EC0                                       
 1015    needed delta is 599, encodes as "v8a"                                   
 1016    bias becomes 32                                                         
 1017    next code point to insert is 4ED6                                       
 1018    needed delta is 130, encodes as "8d"                                    
 1019    bias becomes 23                                                         
 1020    next code point to insert is 4EEC                                       
 1021    needed delta is 154, encodes as "qg"                                    
 1022    bias becomes 25                                                         
 1023    next code point to insert is 6587                                       
 1024    needed delta is 46301, encodes as "056p"                                
 1025    bias becomes 84                                                         
 1026    next code point to insert is 8BF4                                       
 1027    needed delta is 88531, encodes as "qjye"                                
 1028    bias becomes 90                                                         
 1029    output is "ihqwcrb4cv8a8dqg056pqjye"                                    
 1030                                                                            
 1031                                                                            
 1032                                                                            
 1033                                                                            
 1034                                                                            
 1035                                                                            
 1036                                                                            
 1037                                                                            
 1038                                                                            
 1039                                                                            
 1040                                                                            
 1041                                                                            
 1042 Costello                    Standards Track                    [Page 19]   

 1043 RFC 3492                     IDNA Punycode                    March 2003   
 1044                                                                            
 1045                                                                            
 1046    Encoding trace of example L from section 7.1:                           
 1047                                                                            
 1048    bias is 72                                                              
 1049    input is:                                                               
 1050    0033 5E74 0042 7D44 91D1 516B 5148 751F                                 
 1051    basic code points (0033, 0042) are copied to literal portion: "3B-"     
 1052    next code point to insert is 5148                                       
 1053    needed delta is 62042, encodes as "ww4c"                                
 1054    bias becomes 27                                                         
 1055    next code point to insert is 516B                                       
 1056    needed delta is 139, encodes as "5e"                                    
 1057    bias becomes 24                                                         
 1058    next code point to insert is 5E74                                       
 1059    needed delta is 16683, encodes as "180e"                                
 1060    bias becomes 67                                                         
 1061    next code point to insert is 751F                                       
 1062    needed delta is 34821, encodes as "575a"                                
 1063    bias becomes 82                                                         
 1064    next code point to insert is 7D44                                       
 1065    needed delta is 14592, encodes as "65l"                                 
 1066    bias becomes 67                                                         
 1067    next code point to insert is 91D1                                       
 1068    needed delta is 42088, encodes as "sy2b"                                
 1069    bias becomes 84                                                         
 1070    output is "3B-ww4c5e180e575a65lsy2b"                                    
 1071                                                                            
 1072 8. Security Considerations                                                 
 1073                                                                            
 1074    Users expect each domain name in DNS to be controlled by a single       
 1075    authority.  If a Unicode string intended for use as a domain label      
 1076    could map to multiple ACE labels, then an internationalized domain      
 1077    name could map to multiple ASCII domain names, each controlled by a     
 1078    different authority, some of which could be spoofs that hijack          
 1079    service requests intended for another.  Therefore Punycode is           
 1080    designed so that each Unicode string has a unique encoding.             
 1081                                                                            
 1082    However, there can still be multiple Unicode representations of the     
 1083    "same" text, for various definitions of "same".  This problem is        
 1084    addressed to some extent by the Unicode standard under the topic of     
 1085    canonicalization, and this work is leveraged for domain names by        
 1086    Nameprep [NAMEPREP].                                                    
 1087                                                                            
 1088                                                                            
 1089                                                                            
 1090                                                                            
 1091                                                                            
 1092                                                                            
 1093                                                                            
 1094                                                                            
 1095                                                                            
 1096                                                                            
 1097 Costello                    Standards Track                    [Page 20]   

 1098 RFC 3492                     IDNA Punycode                    March 2003   
 1099                                                                            
 1100                                                                            
 1101 9. References                                                              
 1102                                                                            
 1103 9.1 Normative References                                                   
 1104                                                                            
 1105    [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate        
 1106                 Requirement Levels", BCP 14, RFC 2119, March 1997.         
 1107                                                                            
 1108 9.2 Informative References                                                 
 1109                                                                            
 1110    [RFC952]     Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet   
 1111                 Host Table Specification", RFC 952, October 1985.          
 1112                                                                            
 1113    [RFC1034]    Mockapetris, P., "Domain Names - Concepts and              
 1114                 Facilities", STD 13, RFC 1034, November 1987.              
 1115                                                                            
 1116    [IDNA]       Faltstrom, P., Hoffman, P. and A. Costello,                
 1117                 "Internationalizing Domain Names in Applications           
 1118                 (IDNA)", RFC 3490, March 2003.                             
 1119                                                                            
 1120    [NAMEPREP]   Hoffman, P. and  M. Blanchet, "Nameprep: A Stringprep      
 1121                 Profile for Internationalized Domain Names (IDN)", RFC     
 1122                 3491, March 2003.                                          
 1123                                                                            
 1124    [ASCII]      Cerf, V., "ASCII format for Network Interchange", RFC      
 1125                 20, October 1969.                                          
 1126                                                                            
 1127    [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page",        
 1128                 http://www.trigeminal.com/samples/provincial.html.         
 1129                                                                            
 1130    [UNICODE]    The Unicode Consortium, "The Unicode Standard",            
 1131                 http://www.unicode.org/unicode/standard/standard.html.     
 1132                                                                            
 1133                                                                            
 1134                                                                            
 1135                                                                            
 1136                                                                            
 1137                                                                            
 1138                                                                            
 1139                                                                            
 1140                                                                            
 1141                                                                            
 1142                                                                            
 1143                                                                            
 1144                                                                            
 1145                                                                            
 1146                                                                            
 1147                                                                            
 1148                                                                            
 1149                                                                            
 1150                                                                            
 1151                                                                            
 1152 Costello                    Standards Track                    [Page 21]   

 1153 RFC 3492                     IDNA Punycode                    March 2003   
 1154                                                                            
 1155                                                                            
 1156 A. Mixed-case annotation                                                   
 1157                                                                            
 1158    In order to use Punycode to represent case-insensitive strings,         
 1159    higher layers need to case-fold the strings prior to Punycode           
 1160    encoding.  The encoded string can use mixed case as an annotation       
 1161    telling how to convert the folded string into a mixed-case string for   
 1162    display purposes.  Note, however, that mixed-case annotation is not     
 1163    used by the ToASCII and ToUnicode operations specified in [IDNA], and   
 1164    therefore implementors of IDNA can disregard this appendix.             
 1165                                                                            
 1166    Basic code points can use mixed case directly, because the decoder      
 1167    copies them verbatim, leaving lowercase code points lowercase, and      
 1168    leaving uppercase code points uppercase.  Each non-basic code point     
 1169    is represented by a delta, which is represented by a sequence of        
 1170    basic code points, the last of which provides the annotation.  If it    
 1171    is uppercase, it is a suggestion to map the non-basic code point to     
 1172    uppercase (if possible); if it is lowercase, it is a suggestion to      
 1173    map the non-basic code point to lowercase (if possible).                
 1174                                                                            
 1175    These annotations do not alter the code points returned by decoders;    
 1176    the annotations are returned separately, for the caller to use or       
 1177    ignore.  Encoders can accept annotations in addition to code points,    
 1178    but the annotations do not alter the output, except to influence the    
 1179    uppercase/lowercase form of ASCII letters.                              
 1180                                                                            
 1181    Punycode encoders and decoders need not support these annotations,      
 1182    and higher layers need not use them.                                    
 1183                                                                            
 1184 B. Disclaimer and license                                                  
 1185                                                                            
 1186    Regarding this entire document or any portion of it (including the      
 1187    pseudocode and C code), the author makes no guarantees and is not       
 1188    responsible for any damage resulting from its use.  The author grants   
 1189    irrevocable permission to anyone to use, modify, and distribute it in   
 1190    any way that does not diminish the rights of anyone else to use,        
 1191    modify, and distribute it, provided that redistributed derivative       
 1192    works do not contain misleading author or version information.          
 1193    Derivative works need not be licensed under similar terms.              
 1194                                                                            
 1195                                                                            
 1196                                                                            
 1197                                                                            
 1198                                                                            
 1199                                                                            
 1200                                                                            
 1201                                                                            
 1202                                                                            
 1203                                                                            
 1204                                                                            
 1205                                                                            
 1206                                                                            
 1207 Costello                    Standards Track                    [Page 22]   

 1208 RFC 3492                     IDNA Punycode                    March 2003   
 1209                                                                            
 1210                                                                            
 1211 C. Punycode sample implementation                                          
 1212                                                                            
 1213 /*                                                                         
 1214 punycode.c from RFC 3492                                                   
 1215 http://www.nicemice.net/idn/                                               
 1216 Adam M. Costello                                                           
 1217 http://www.nicemice.net/amc/                                               
 1218                                                                            
 1219 This is ANSI C code (C89) implementing Punycode (RFC 3492).                
 1220                                                                            
 1221 */                                                                         
 1222                                                                            
 1223                                                                            
 1224 /************************************************************/             
 1225 /* Public interface (would normally go in its own .h file): */             
 1226                                                                            
 1227 #include <limits.h>                                                        
 1228                                                                            
 1229 enum punycode_status {                                                     
 1230   punycode_success,                                                        
 1231   punycode_bad_input,   /* Input is invalid.                       */      
 1232   punycode_big_output,  /* Output would exceed the space provided. */      
 1233   punycode_overflow     /* Input needs wider integers to process.  */      
 1234 };                                                                         
 1235                                                                            
 1236 #if UINT_MAX >= (1 << 26) - 1                                              
 1237 typedef unsigned int punycode_uint;                                        
 1238 #else                                                                      
 1239 typedef unsigned long punycode_uint;                                       
 1240 #endif                                                                     
 1241                                                                            
 1242 enum punycode_status punycode_encode(                                      
 1243   punycode_uint input_length,                                              
 1244   const punycode_uint input[],                                             
 1245   const unsigned char case_flags[],                                        
 1246   punycode_uint *output_length,                                            
 1247   char output[] );                                                         
 1248                                                                            
 1249     /* punycode_encode() converts Unicode to Punycode.  The input     */   
 1250     /* is represented as an array of Unicode code points (not code    */   
 1251     /* units; surrogate pairs are not allowed), and the output        */   
 1252     /* will be represented as an array of ASCII code points.  The     */   
 1253     /* output string is *not* null-terminated; it will contain        */   
 1254     /* zeros if and only if the input contains zeros.  (Of course     */   
 1255     /* the caller can leave room for a terminator and add one if      */   
 1256     /* needed.)  The input_length is the number of code points in     */   
 1257     /* the input.  The output_length is an in/out argument: the       */   
 1258     /* caller passes in the maximum number of code points that it     */   
 1259                                                                            
 1260                                                                            
 1261                                                                            
 1262 Costello                    Standards Track                    [Page 23]   

 1263 RFC 3492                     IDNA Punycode                    March 2003   
 1264                                                                            
 1265                                                                            
 1266     /* can receive, and on successful return it will contain the      */   
 1267     /* number of code points actually output.  The case_flags array   */   
 1268     /* holds input_length boolean values, where nonzero suggests that */   
 1269     /* the corresponding Unicode character be forced to uppercase     */   
 1270     /* after being decoded (if possible), and zero suggests that      */   
 1271     /* it be forced to lowercase (if possible).  ASCII code points    */   
 1272     /* are encoded literally, except that ASCII letters are forced    */   
 1273     /* to uppercase or lowercase according to the corresponding       */   
 1274     /* uppercase flags.  If case_flags is a null pointer then ASCII   */   
 1275     /* letters are left as they are, and other code points are        */   
 1276     /* treated as if their uppercase flags were zero.  The return     */   
 1277     /* value can be any of the punycode_status values defined above   */   
 1278     /* except punycode_bad_input; if not punycode_success, then       */   
 1279     /* output_size and output might contain garbage.                  */   
 1280                                                                            
 1281 enum punycode_status punycode_decode(                                      
 1282   punycode_uint input_length,                                              
 1283   const char input[],                                                      
 1284   punycode_uint *output_length,                                            
 1285   punycode_uint output[],                                                  
 1286   unsigned char case_flags[] );                                            
 1287                                                                            
 1288     /* punycode_decode() converts Punycode to Unicode.  The input is  */   
 1289     /* represented as an array of ASCII code points, and the output   */   
 1290     /* will be represented as an array of Unicode code points.  The   */   
 1291     /* input_length is the number of code points in the input.  The   */   
 1292     /* output_length is an in/out argument: the caller passes in      */   
 1293     /* the maximum number of code points that it can receive, and     */   
 1294     /* on successful return it will contain the actual number of      */   
 1295     /* code points output.  The case_flags array needs room for at    */   
 1296     /* least output_length values, or it can be a null pointer if the */   
 1297     /* case information is not needed.  A nonzero flag suggests that  */   
 1298     /* the corresponding Unicode character be forced to uppercase     */   
 1299     /* by the caller (if possible), while zero suggests that it be    */   
 1300     /* forced to lowercase (if possible).  ASCII code points are      */   
 1301     /* output already in the proper case, but their flags will be set */   
 1302     /* appropriately so that applying the flags would be harmless.    */   
 1303     /* The return value can be any of the punycode_status values      */   
 1304     /* defined above; if not punycode_success, then output_length,    */   
 1305     /* output, and case_flags might contain garbage.  On success, the */   
 1306     /* decoder will never need to write an output_length greater than */   
 1307     /* input_length, because of how the encoding is defined.          */   
 1308                                                                            
 1309 /**********************************************************/               
 1310 /* Implementation (would normally go in its own .c file): */               
 1311                                                                            
 1312 #include <string.h>                                                        
 1313                                                                            
 1314                                                                            
 1315                                                                            
 1316                                                                            
 1317 Costello                    Standards Track                    [Page 24]   

 1318 RFC 3492                     IDNA Punycode                    March 2003   
 1319                                                                            
 1320                                                                            
 1321 /*** Bootstring parameters for Punycode ***/                               
 1322                                                                            
 1323 enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,              
 1324        initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };            
 1325                                                                            
 1326 /* basic(cp) tests whether cp is a basic code point: */                    
 1327 #define basic(cp) ((punycode_uint)(cp) < 0x80)                             
 1328                                                                            
 1329 /* delim(cp) tests whether cp is a delimiter: */                           
 1330 #define delim(cp) ((cp) == delimiter)                                      
 1331                                                                            
 1332 /* decode_digit(cp) returns the numeric value of a basic code */           
 1333 /* point (for use in representing integers) in the range 0 to */           
 1334 /* base-1, or base if cp is does not represent a value.       */           
 1335                                                                            
 1336 static punycode_uint decode_digit(punycode_uint cp)                        
 1337 {                                                                          
 1338   return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :               
 1339           cp - 97 < 26 ? cp - 97 :  base;                                  
 1340 }                                                                          
 1341                                                                            
 1342 /* encode_digit(d,flag) returns the basic code point whose value      */   
 1343 /* (when used for representing integers) is d, which needs to be in   */   
 1344 /* the range 0 to base-1.  The lowercase form is used unless flag is  */   
 1345 /* nonzero, in which case the uppercase form is used.  The behavior   */   
 1346 /* is undefined if flag is nonzero and digit d has no uppercase form. */   
 1347                                                                            
 1348 static char encode_digit(punycode_uint d, int flag)                        
 1349 {                                                                          
 1350   return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);                      
 1351   /*  0..25 map to ASCII a..z or A..Z */                                   
 1352   /* 26..35 map to ASCII 0..9         */                                   
 1353 }                                                                          
 1354                                                                            
 1355 /* flagged(bcp) tests whether a basic code point is flagged */             
 1356 /* (uppercase).  The behavior is undefined if bcp is not a  */             
 1357 /* basic code point.                                        */             
 1358                                                                            
 1359 #define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)                      
 1360                                                                            
 1361 /* encode_basic(bcp,flag) forces a basic code point to lowercase */        
 1362 /* if flag is zero, uppercase if flag is nonzero, and returns    */        
 1363 /* the resulting code point.  The code point is unchanged if it  */        
 1364 /* is caseless.  The behavior is undefined if bcp is not a basic */        
 1365 /* code point.                                                   */        
 1366                                                                            
 1367 static char encode_basic(punycode_uint bcp, int flag)                      
 1368 {                                                                          
 1369                                                                            
 1370                                                                            
 1371                                                                            
 1372 Costello                    Standards Track                    [Page 25]   

 1373 RFC 3492                     IDNA Punycode                    March 2003   
 1374                                                                            
 1375                                                                            
 1376   bcp -= (bcp - 97 < 26) << 5;                                             
 1377   return bcp + ((!flag && (bcp - 65 < 26)) << 5);                          
 1378 }                                                                          
 1379                                                                            
 1380 /*** Platform-specific constants ***/                                      
 1381                                                                            
 1382 /* maxint is the maximum value of a punycode_uint variable: */             
 1383 static const punycode_uint maxint = -1;                                    
 1384 /* Because maxint is unsigned, -1 becomes the maximum value. */            
 1385                                                                            
 1386 /*** Bias adaptation function ***/                                         
 1387                                                                            
 1388 static punycode_uint adapt(                                                
 1389   punycode_uint delta, punycode_uint numpoints, int firsttime )            
 1390 {                                                                          
 1391   punycode_uint k;                                                         
 1392                                                                            
 1393   delta = firsttime ? delta / damp : delta >> 1;                           
 1394   /* delta >> 1 is a faster way of doing delta / 2 */                      
 1395   delta += delta / numpoints;                                              
 1396                                                                            
 1397   for (k = 0;  delta > ((base - tmin) * tmax) / 2;  k += base) {           
 1398     delta /= base - tmin;                                                  
 1399   }                                                                        
 1400                                                                            
 1401   return k + (base - tmin + 1) * delta / (delta + skew);                   
 1402 }                                                                          
 1403                                                                            
 1404 /*** Main encode function ***/                                             
 1405                                                                            
 1406 enum punycode_status punycode_encode(                                      
 1407   punycode_uint input_length,                                              
 1408   const punycode_uint input[],                                             
 1409   const unsigned char case_flags[],                                        
 1410   punycode_uint *output_length,                                            
 1411   char output[] )                                                          
 1412 {                                                                          
 1413   punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;         
 1414                                                                            
 1415   /* Initialize the state: */                                              
 1416                                                                            
 1417   n = initial_n;                                                           
 1418   delta = out = 0;                                                         
 1419   max_out = *output_length;                                                
 1420   bias = initial_bias;                                                     
 1421                                                                            
 1422   /* Handle the basic code points: */                                      
 1423                                                                            
 1424                                                                            
 1425                                                                            
 1426                                                                            
 1427 Costello                    Standards Track                    [Page 26]   

 1428 RFC 3492                     IDNA Punycode                    March 2003   
 1429                                                                            
 1430                                                                            
 1431   for (j = 0;  j < input_length;  ++j) {                                   
 1432     if (basic(input[j])) {                                                 
 1433       if (max_out - out < 2) return punycode_big_output;                   
 1434       output[out++] =                                                      
 1435         case_flags ?  encode_basic(input[j], case_flags[j]) : input[j];    
 1436     }                                                                      
 1437     /* else if (input[j] < n) return punycode_bad_input; */                
 1438     /* (not needed for Punycode with unsigned code points) */              
 1439   }                                                                        
 1440                                                                            
 1441   h = b = out;                                                             
 1442                                                                            
 1443   /* h is the number of code points that have been handled, b is the  */   
 1444   /* number of basic code points, and out is the number of characters */   
 1445   /* that have been output.                                           */   
 1446                                                                            
 1447   if (b > 0) output[out++] = delimiter;                                    
 1448                                                                            
 1449   /* Main encoding loop: */                                                
 1450                                                                            
 1451   while (h < input_length) {                                               
 1452     /* All non-basic code points < n have been     */                      
 1453     /* handled already.  Find the next larger one: */                      
 1454                                                                            
 1455     for (m = maxint, j = 0;  j < input_length;  ++j) {                     
 1456       /* if (basic(input[j])) continue; */                                 
 1457       /* (not needed for Punycode) */                                      
 1458       if (input[j] >= n && input[j] < m) m = input[j];                     
 1459     }                                                                      
 1460                                                                            
 1461     /* Increase delta enough to advance the decoder's    */                
 1462     /* <n,i> state to <m,0>, but guard against overflow: */                
 1463                                                                            
 1464     if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;      
 1465     delta += (m - n) * (h + 1);                                            
 1466     n = m;                                                                 
 1467                                                                            
 1468     for (j = 0;  j < input_length;  ++j) {                                 
 1469       /* Punycode does not need to check whether input[j] is basic: */     
 1470       if (input[j] < n /* || basic(input[j]) */ ) {                        
 1471         if (++delta == 0) return punycode_overflow;                        
 1472       }                                                                    
 1473                                                                            
 1474       if (input[j] == n) {                                                 
 1475         /* Represent delta as a generalized variable-length integer: */    
 1476                                                                            
 1477         for (q = delta, k = base;  ;  k += base) {                         
 1478           if (out >= max_out) return punycode_big_output;                  
 1479                                                                            
 1480                                                                            
 1481                                                                            
 1482 Costello                    Standards Track                    [Page 27]   

 1483 RFC 3492                     IDNA Punycode                    March 2003   
 1484                                                                            
 1485                                                                            
 1486           t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */   
 1487               k >= bias + tmax ? tmax : k - bias;                          
 1488           if (q < t) break;                                                
 1489           output[out++] = encode_digit(t + (q - t) % (base - t), 0);       
 1490           q = (q - t) / (base - t);                                        
 1491         }                                                                  
 1492                                                                            
 1493         output[out++] = encode_digit(q, case_flags && case_flags[j]);      
 1494         bias = adapt(delta, h + 1, h == b);                                
 1495         delta = 0;                                                         
 1496         ++h;                                                               
 1497       }                                                                    
 1498     }                                                                      
 1499                                                                            
 1500     ++delta, ++n;                                                          
 1501   }                                                                        
 1502                                                                            
 1503   *output_length = out;                                                    
 1504   return punycode_success;                                                 
 1505 }                                                                          
 1506                                                                            
 1507 /*** Main decode function ***/                                             
 1508                                                                            
 1509 enum punycode_status punycode_decode(                                      
 1510   punycode_uint input_length,                                              
 1511   const char input[],                                                      
 1512   punycode_uint *output_length,                                            
 1513   punycode_uint output[],                                                  
 1514   unsigned char case_flags[] )                                             
 1515 {                                                                          
 1516   punycode_uint n, out, i, max_out, bias,                                  
 1517                  b, j, in, oldi, w, k, digit, t;                           
 1518                                                                            
 1519   /* Initialize the state: */                                              
 1520                                                                            
 1521   n = initial_n;                                                           
 1522   out = i = 0;                                                             
 1523   max_out = *output_length;                                                
 1524   bias = initial_bias;                                                     
 1525                                                                            
 1526   /* Handle the basic code points:  Let b be the number of input code */   
 1527   /* points before the last delimiter, or 0 if there is none, then    */   
 1528   /* copy the first b code points to the output.                      */   
 1529                                                                            
 1530   for (b = j = 0;  j < input_length;  ++j) if (delim(input[j])) b = j;     
 1531   if (b > max_out) return punycode_big_output;                             
 1532                                                                            
 1533   for (j = 0;  j < b;  ++j) {                                              
 1534                                                                            
 1535                                                                            
 1536                                                                            
 1537 Costello                    Standards Track                    [Page 28]   

 1538 RFC 3492                     IDNA Punycode                    March 2003   
 1539                                                                            
 1540                                                                            
 1541     if (case_flags) case_flags[out] = flagged(input[j]);                   
 1542     if (!basic(input[j])) return punycode_bad_input;                       
 1543     output[out++] = input[j];                                              
 1544   }                                                                        
 1545                                                                            
 1546   /* Main decoding loop:  Start just after the last delimiter if any  */   
 1547   /* basic code points were copied; start at the beginning otherwise. */   
 1548                                                                            
 1549   for (in = b > 0 ? b + 1 : 0;  in < input_length;  ++out) {               
 1550                                                                            
 1551     /* in is the index of the next character to be consumed, and */        
 1552     /* out is the number of code points in the output array.     */        
 1553                                                                            
 1554     /* Decode a generalized variable-length integer into delta,  */        
 1555     /* which gets added to i.  The overflow checking is easier   */        
 1556     /* if we increase i as we go, then subtract off its starting */        
 1557     /* value at the end to obtain delta.                         */        
 1558                                                                            
 1559     for (oldi = i, w = 1, k = base;  ;  k += base) {                       
 1560       if (in >= input_length) return punycode_bad_input;                   
 1561       digit = decode_digit(input[in++]);                                   
 1562       if (digit >= base) return punycode_bad_input;                        
 1563       if (digit > (maxint - i) / w) return punycode_overflow;              
 1564       i += digit * w;                                                      
 1565       t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */       
 1566           k >= bias + tmax ? tmax : k - bias;                              
 1567       if (digit < t) break;                                                
 1568       if (w > maxint / (base - t)) return punycode_overflow;               
 1569       w *= (base - t);                                                     
 1570     }                                                                      
 1571                                                                            
 1572     bias = adapt(i - oldi, out + 1, oldi == 0);                            
 1573                                                                            
 1574     /* i was supposed to wrap around from out+1 to 0,   */                 
 1575     /* incrementing n each time, so we'll fix that now: */                 
 1576                                                                            
 1577     if (i / (out + 1) > maxint - n) return punycode_overflow;              
 1578     n += i / (out + 1);                                                    
 1579     i %= (out + 1);                                                        
 1580                                                                            
 1581     /* Insert n at position i of the output: */                            
 1582                                                                            
 1583     /* not needed for Punycode: */                                         
 1584     /* if (decode_digit(n) <= base) return punycode_invalid_input; */      
 1585     if (out >= max_out) return punycode_big_output;                        
 1586                                                                            
 1587     if (case_flags) {                                                      
 1588       memmove(case_flags + i + 1, case_flags + i, out - i);                
 1589                                                                            
 1590                                                                            
 1591                                                                            
 1592 Costello                    Standards Track                    [Page 29]   

 1593 RFC 3492                     IDNA Punycode                    March 2003   
 1594                                                                            
 1595                                                                            
 1596       /* Case of last character determines uppercase flag: */              
 1597       case_flags[i] = flagged(input[in - 1]);                              
 1598     }                                                                      
 1599                                                                            
 1600     memmove(output + i + 1, output + i, (out - i) * sizeof *output);       
 1601     output[i++] = n;                                                       
 1602   }                                                                        
 1603                                                                            
 1604   *output_length = out;                                                    
 1605   return punycode_success;                                                 
 1606 }                                                                          
 1607                                                                            
 1608 /******************************************************************/       
 1609 /* Wrapper for testing (would normally go in a separate .c file): */       
 1610                                                                            
 1611 #include <assert.h>                                                        
 1612 #include <stdio.h>                                                         
 1613 #include <stdlib.h>                                                        
 1614 #include <string.h>                                                        
 1615                                                                            
 1616 /* For testing, we'll just set some compile-time limits rather than */     
 1617 /* use malloc(), and set a compile-time option rather than using a  */     
 1618 /* command-line option.                                             */     
 1619                                                                            
 1620 enum {                                                                     
 1621   unicode_max_length = 256,                                                
 1622   ace_max_length = 256                                                     
 1623 };                                                                         
 1624                                                                            
 1625 static void usage(char **argv)                                             
 1626 {                                                                          
 1627   fprintf(stderr,                                                          
 1628     "\n"                                                                   
 1629     "%s -e reads code points and writes a Punycode string.\n"              
 1630     "%s -d reads a Punycode string and writes code points.\n"              
 1631     "\n"                                                                   
 1632     "Input and output are plain text in the native character set.\n"       
 1633     "Code points are in the form u+hex separated by whitespace.\n"         
 1634     "Although the specification allows Punycode strings to contain\n"      
 1635     "any characters from the ASCII repertoire, this test code\n"           
 1636     "supports only the printable characters, and needs the Punycode\n"     
 1637     "string to be followed by a newline.\n"                                
 1638     "The case of the u in u+hex is the force-to-uppercase flag.\n"         
 1639     , argv[0], argv[0]);                                                   
 1640   exit(EXIT_FAILURE);                                                      
 1641 }                                                                          
 1642                                                                            
 1643 static void fail(const char *msg)                                          
 1644                                                                            
 1645                                                                            
 1646                                                                            
 1647 Costello                    Standards Track                    [Page 30]   

 1648 RFC 3492                     IDNA Punycode                    March 2003   
 1649                                                                            
 1650                                                                            
 1651 {                                                                          
 1652   fputs(msg,stderr);                                                       
 1653   exit(EXIT_FAILURE);                                                      
 1654 }                                                                          
 1655                                                                            
 1656 static const char too_big[] =                                              
 1657   "input or output is too large, recompile with larger limits\n";          
 1658 static const char invalid_input[] = "invalid input\n";                     
 1659 static const char overflow[] = "arithmetic overflow\n";                    
 1660 static const char io_error[] = "I/O error\n";                              
 1661                                                                            
 1662 /* The following string is used to convert printable */                    
 1663 /* characters between ASCII and the native charset:  */                    
 1664                                                                            
 1665 static const char print_ascii[] =                                          
 1666   "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"                                       
 1667   "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"                                       
 1668   " !\"#$%&'()*+,-./"                                                      
 1669   "0123456789:;<=>?"                                                       
 1670   "@ABCDEFGHIJKLMNO"                                                       
 1671   "PQRSTUVWXYZ[\\]^_"                                                      
 1672   "`abcdefghijklmno"                                                       
 1673   "pqrstuvwxyz{|}~\n";                                                     
 1674                                                                            
 1675 int main(int argc, char **argv)                                            
 1676 {                                                                          
 1677   enum punycode_status status;                                             
 1678   int r;                                                                   
 1679   unsigned int input_length, output_length, j;                             
 1680   unsigned char case_flags[unicode_max_length];                            
 1681                                                                            
 1682   if (argc != 2) usage(argv);                                              
 1683   if (argv[1][0] != '-') usage(argv);                                      
 1684   if (argv[1][2] != 0) usage(argv);                                        
 1685                                                                            
 1686   if (argv[1][1] == 'e') {                                                 
 1687     punycode_uint input[unicode_max_length];                               
 1688     unsigned long codept;                                                  
 1689     char output[ace_max_length+1], uplus[3];                               
 1690     int c;                                                                 
 1691                                                                            
 1692     /* Read the input code points: */                                      
 1693                                                                            
 1694     input_length = 0;                                                      
 1695                                                                            
 1696     for (;;) {                                                             
 1697       r = scanf("%2s%lx", uplus, &codept);                                 
 1698       if (ferror(stdin)) fail(io_error);                                   
 1699                                                                            
 1700                                                                            
 1701                                                                            
 1702 Costello                    Standards Track                    [Page 31]   

 1703 RFC 3492                     IDNA Punycode                    March 2003   
 1704                                                                            
 1705                                                                            
 1706       if (r == EOF || r == 0) break;                                       
 1707                                                                            
 1708       if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {       
 1709         fail(invalid_input);                                               
 1710       }                                                                    
 1711                                                                            
 1712       if (input_length == unicode_max_length) fail(too_big);               
 1713                                                                            
 1714       if (uplus[0] == 'u') case_flags[input_length] = 0;                   
 1715       else if (uplus[0] == 'U') case_flags[input_length] = 1;              
 1716       else fail(invalid_input);                                            
 1717                                                                            
 1718       input[input_length++] = codept;                                      
 1719     }                                                                      
 1720                                                                            
 1721     /* Encode: */                                                          
 1722                                                                            
 1723     output_length = ace_max_length;                                        
 1724     status = punycode_encode(input_length, input, case_flags,              
 1725                              &output_length, output);                      
 1726     if (status == punycode_bad_input) fail(invalid_input);                 
 1727     if (status == punycode_big_output) fail(too_big);                      
 1728     if (status == punycode_overflow) fail(overflow);                       
 1729     assert(status == punycode_success);                                    
 1730                                                                            
 1731     /* Convert to native charset and output: */                            
 1732                                                                            
 1733     for (j = 0;  j < output_length;  ++j) {                                
 1734       c = output[j];                                                       
 1735       assert(c >= 0 && c <= 127);                                          
 1736       if (print_ascii[c] == 0) fail(invalid_input);                        
 1737       output[j] = print_ascii[c];                                          
 1738     }                                                                      
 1739                                                                            
 1740     output[j] = 0;                                                         
 1741     r = puts(output);                                                      
 1742     if (r == EOF) fail(io_error);                                          
 1743     return EXIT_SUCCESS;                                                   
 1744   }                                                                        
 1745                                                                            
 1746   if (argv[1][1] == 'd') {                                                 
 1747     char input[ace_max_length+2], *p, *pp;                                 
 1748     punycode_uint output[unicode_max_length];                              
 1749                                                                            
 1750     /* Read the Punycode input string and convert to ASCII: */             
 1751                                                                            
 1752     fgets(input, ace_max_length+2, stdin);                                 
 1753     if (ferror(stdin)) fail(io_error);                                     
 1754                                                                            
 1755                                                                            
 1756                                                                            
 1757 Costello                    Standards Track                    [Page 32]   

 1758 RFC 3492                     IDNA Punycode                    March 2003   
 1759                                                                            
 1760                                                                            
 1761     if (feof(stdin)) fail(invalid_input);                                  
 1762     input_length = strlen(input) - 1;                                      
 1763     if (input[input_length] != '\n') fail(too_big);                        
 1764     input[input_length] = 0;                                               
 1765                                                                            
 1766     for (p = input;  *p != 0;  ++p) {                                      
 1767       pp = strchr(print_ascii, *p);                                        
 1768       if (pp == 0) fail(invalid_input);                                    
 1769       *p = pp - print_ascii;                                               
 1770     }                                                                      
 1771                                                                            
 1772     /* Decode: */                                                          
 1773                                                                            
 1774     output_length = unicode_max_length;                                    
 1775     status = punycode_decode(input_length, input, &output_length,          
 1776                              output, case_flags);                          
 1777     if (status == punycode_bad_input) fail(invalid_input);                 
 1778     if (status == punycode_big_output) fail(too_big);                      
 1779     if (status == punycode_overflow) fail(overflow);                       
 1780     assert(status == punycode_success);                                    
 1781                                                                            
 1782     /* Output the result: */                                               
 1783                                                                            
 1784     for (j = 0;  j < output_length;  ++j) {                                
 1785       r = printf("%s+%04lX\n",                                             
 1786                  case_flags[j] ? "U" : "u",                                
 1787                  (unsigned long) output[j] );                              
 1788       if (r < 0) fail(io_error);                                           
 1789     }                                                                      
 1790                                                                            
 1791     return EXIT_SUCCESS;                                                   
 1792   }                                                                        
 1793                                                                            
 1794   usage(argv);                                                             
 1795   return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */     
 1796 }                                                                          
 1797                                                                            
 1798                                                                            
 1799                                                                            
 1800                                                                            
 1801                                                                            
 1802                                                                            
 1803                                                                            
 1804                                                                            
 1805                                                                            
 1806                                                                            
 1807                                                                            
 1808                                                                            
 1809                                                                            
 1810                                                                            
 1811                                                                            
 1812 Costello                    Standards Track                    [Page 33]   

 1813 RFC 3492                     IDNA Punycode                    March 2003   
 1814                                                                            
 1815                                                                            
 1816 Author's Address                                                           
 1817                                                                            
 1818    Adam M. Costello                                                        
 1819    University of California, Berkeley                                      
 1820    http://www.nicemice.net/amc/                                            
 1821                                                                            
 1822                                                                            
 1823                                                                            
 1824                                                                            
 1825                                                                            
 1826                                                                            
 1827                                                                            
 1828                                                                            
 1829                                                                            
 1830                                                                            
 1831                                                                            
 1832                                                                            
 1833                                                                            
 1834                                                                            
 1835                                                                            
 1836                                                                            
 1837                                                                            
 1838                                                                            
 1839                                                                            
 1840                                                                            
 1841                                                                            
 1842                                                                            
 1843                                                                            
 1844                                                                            
 1845                                                                            
 1846                                                                            
 1847                                                                            
 1848                                                                            
 1849                                                                            
 1850                                                                            
 1851                                                                            
 1852                                                                            
 1853                                                                            
 1854                                                                            
 1855                                                                            
 1856                                                                            
 1857                                                                            
 1858                                                                            
 1859                                                                            
 1860                                                                            
 1861                                                                            
 1862                                                                            
 1863                                                                            
 1864                                                                            
 1865                                                                            
 1866                                                                            
 1867 Costello                    Standards Track                    [Page 34]   

 1868 RFC 3492                     IDNA Punycode                    March 2003   
 1869                                                                            
 1870                                                                            
 1871 Full Copyright Statement                                                   
 1872                                                                            
 1873    Copyright (C) The Internet Society (2003).  All Rights Reserved.        
 1874                                                                            
 1875    This document and translations of it may be copied and furnished to     
 1876    others, and derivative works that comment on or otherwise explain it    
 1877    or assist in its implementation may be prepared, copied, published      
 1878    and distributed, in whole or in part, without restriction of any        
 1879    kind, provided that the above copyright notice and this paragraph are   
 1880    included on all such copies and derivative works.  However, this        
 1881    document itself may not be modified in any way, such as by removing     
 1882    the copyright notice or references to the Internet Society or other     
 1883    Internet organizations, except as needed for the purpose of             
 1884    developing Internet standards in which case the procedures for          
 1885    copyrights defined in the Internet Standards process must be            
 1886    followed, or as required to translate it into languages other than      
 1887    English.                                                                
 1888                                                                            
 1889    The limited permissions granted above are perpetual and will not be     
 1890    revoked by the Internet Society or its successors or assigns.           
 1891                                                                            
 1892    This document and the information contained herein is provided on an    
 1893    "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING     
 1894    TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING      
 1895    BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION         
 1896    HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF        
 1897    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.                    
 1898                                                                            
 1899 Acknowledgement                                                            
 1900                                                                            
 1901    Funding for the RFC Editor function is currently provided by the        
 1902    Internet Society.                                                       
 1903                                                                            
 1904                                                                            
 1905                                                                            
 1906                                                                            
 1907                                                                            
 1908                                                                            
 1909                                                                            
 1910                                                                            
 1911                                                                            
 1912                                                                            
 1913                                                                            
 1914                                                                            
 1915                                                                            
 1916                                                                            
 1917                                                                            
 1918                                                                            
 1919                                                                            
 1920                                                                            
 1921                                                                            
 1922 Costello                    Standards Track                    [Page 35]   
 1923

line-796 Mathias Bynens(Technical Erratum #3026) [Held for Document Update]

based on outdated version

   (I) Russian (Cyrillic):
       U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
       u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
       u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
       u+0438
       Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l

It should say:

   (I) Russian (Cyrillic):
       U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
       u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
       u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
       u+0438
       Punycode: b1abfaaepdrnnbgefbaDdotcwatmq2g4l

The uppercase `D` in the encoded string appears to be a typo.