1 Network Working Group A. Costello
2 Request for Comments: 3492 Univ. of California, Berkeley
3 Category: Standards Track March 2003
4
5
6 Punycode: A Bootstring encoding of Unicode
7 for Internationalized Domain Names in Applications (IDNA)
8
9 Status of this Memo
10
11 This document specifies an Internet standards track protocol for the
12 Internet community, and requests discussion and suggestions for
13 improvements. Please refer to the current edition of the "Internet
14 Official Protocol Standards" (STD 1) for the standardization state
15 and status of this protocol. Distribution of this memo is unlimited.
16
17 Copyright Notice
18
19 Copyright (C) The Internet Society (2003). All Rights Reserved.
20
21 Abstract
22
23 Punycode is a simple and efficient transfer encoding syntax designed
24 for use with Internationalized Domain Names in Applications (IDNA).
25 It uniquely and reversibly transforms a Unicode string into an ASCII
26 string. ASCII characters in the Unicode string are represented
27 literally, and non-ASCII characters are represented by ASCII
28 characters that are allowed in host name labels (letters, digits, and
29 hyphens). This document defines a general algorithm called
30 Bootstring that allows a string of basic code points to uniquely
31 represent any string of code points drawn from a larger set.
32 Punycode is an instance of Bootstring that uses particular parameter
33 values specified by this document, appropriate for IDNA.
34
35 Table of Contents
36
37 1. Introduction...............................................2
38 1.1 Features..............................................2
39 1.2 Interaction of protocol parts.........................3
40 2. Terminology................................................3
41 3. Bootstring description.....................................4
42 3.1 Basic code point segregation..........................4
43 3.2 Insertion unsort coding...............................4
44 3.3 Generalized variable-length integers..................5
45 3.4 Bias adaptation.......................................7
46 4. Bootstring parameters......................................8
47 5. Parameter values for Punycode..............................8
48 6. Bootstring algorithms......................................9
49
50
51
52 Costello Standards Track [Page 1]
53 RFC 3492 IDNA Punycode March 2003
54
55
56 6.1 Bias adaptation function.............................10
57 6.2 Decoding procedure...................................11
58 6.3 Encoding procedure...................................12
59 6.4 Overflow handling....................................13
60 7. Punycode examples.........................................14
61 7.1 Sample strings.......................................14
62 7.2 Decoding traces......................................17
63 7.3 Encoding traces......................................19
64 8. Security Considerations...................................20
65 9. References................................................21
66 9.1 Normative References.................................21
67 9.2 Informative References...............................21
68 A. Mixed-case annotation.....................................22
69 B. Disclaimer and license....................................22
70 C. Punycode sample implementation............................23
71 Author's Address.............................................34
72 Full Copyright Statement.....................................35
73
74 1. Introduction
75
76 [IDNA] describes an architecture for supporting internationalized
77 domain names. Labels containing non-ASCII characters can be
78 represented by ACE labels, which begin with a special ACE prefix and
79 contain only ASCII characters. The remainder of the label after the
80 prefix is a Punycode encoding of a Unicode string satisfying certain
81 constraints. For the details of the prefix and constraints, see
82 [IDNA] and [NAMEPREP].
83
84 Punycode is an instance of a more general algorithm called
85 Bootstring, which allows strings composed from a small set of "basic"
86 code points to uniquely represent any string of code points drawn
87 from a larger set. Punycode is Bootstring with particular parameter
88 values appropriate for IDNA.
89
90 1.1 Features
91
92 Bootstring has been designed to have the following features:
93
94 * Completeness: Every extended string (sequence of arbitrary code
95 points) can be represented by a basic string (sequence of basic
96 code points). Restrictions on what strings are allowed, and on
97 length, can be imposed by higher layers.
98
99 * Uniqueness: There is at most one basic string that represents a
100 given extended string.
101
102 * Reversibility: Any extended string mapped to a basic string can
103 be recovered from that basic string.
104
105
106
107 Costello Standards Track [Page 2]
108 RFC 3492 IDNA Punycode March 2003
109
110
111 * Efficient encoding: The ratio of basic string length to extended
112 string length is small. This is important in the context of
113 domain names because RFC 1034 [RFC1034] restricts the length of a
114 domain label to 63 characters.
115
116 * Simplicity: The encoding and decoding algorithms are reasonably
117 simple to implement. The goals of efficiency and simplicity are
118 at odds; Bootstring aims at a good balance between them.
119
120 * Readability: Basic code points appearing in the extended string
121 are represented as themselves in the basic string (although the
122 main purpose is to improve efficiency, not readability).
123
124 Punycode can also support an additional feature that is not used by
125 the ToASCII and ToUnicode operations of [IDNA]. When extended
126 strings are case-folded prior to encoding, the basic string can use
127 mixed case to tell how to convert the folded string into a mixed-case
128 string. See appendix A "Mixed-case annotation".
129
130 1.2 Interaction of protocol parts
131
132 Punycode is used by the IDNA protocol [IDNA] for converting domain
133 labels into ASCII; it is not designed for any other purpose. It is
134 explicitly not designed for processing arbitrary free text.
135
136 2. Terminology
137
138 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
139 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
140 document are to be interpreted as described in BCP 14, RFC 2119
141 [RFC2119].
142
143 A code point is an integral value associated with a character in a
144 coded character set.
145
146 As in the Unicode Standard [UNICODE], Unicode code points are denoted
147 by "U+" followed by four to six hexadecimal digits, while a range of
148 code points is denoted by two hexadecimal numbers separated by "..",
149 with no prefixes.
150
151 The operators div and mod perform integer division; (x div y) is the
152 quotient of x divided by y, discarding the remainder, and (x mod y)
153 is the remainder, so (x div y) * y + (x mod y) == x. Bootstring uses
154 these operators only with nonnegative operands, so the quotient and
155 remainder are always nonnegative.
156
157 The break statement jumps out of the innermost loop (as in C).
158
159
160
161
162 Costello Standards Track [Page 3]
163 RFC 3492 IDNA Punycode March 2003
164
165
166 An overflow is an attempt to compute a value that exceeds the maximum
167 value of an integer variable.
168
169 3. Bootstring description
170
171 Bootstring represents an arbitrary sequence of code points (the
172 "extended string") as a sequence of basic code points (the "basic
173 string"). This section describes the representation. Section 6
174 "Bootstring algorithms" presents the algorithms as pseudocode.
175 Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
176 algorithms for sample inputs.
177
178 The following sections describe the four techniques used in
179 Bootstring. "Basic code point segregation" is a very simple and
180 efficient encoding for basic code points occurring in the extended
181 string: they are simply copied all at once. "Insertion unsort
182 coding" encodes the non-basic code points as deltas, and processes
183 the code points in numerical order rather than in order of
184 appearance, which typically results in smaller deltas. The deltas
185 are represented as "generalized variable-length integers", which use
186 basic code points to represent nonnegative integers. The parameters
187 of this integer representation are dynamically adjusted using "bias
188 adaptation", to improve efficiency when consecutive deltas have
189 similar magnitudes.
190
191 3.1 Basic code point segregation
192
193 All basic code points appearing in the extended string are
194 represented literally at the beginning of the basic string, in their
195 original order, followed by a delimiter if (and only if) the number
196 of basic code points is nonzero. The delimiter is a particular basic
197 code point, which never appears in the remainder of the basic string.
198 The decoder can therefore find the end of the literal portion (if
199 there is one) by scanning for the last delimiter.
200
201 3.2 Insertion unsort coding
202
203 The remainder of the basic string (after the last delimiter if there
204 is one) represents a sequence of nonnegative integral deltas as
205 generalized variable-length integers, described in section 3.3. The
206 meaning of the deltas is best understood in terms of the decoder.
207
208 The decoder builds the extended string incrementally. Initially, the
209 extended string is a copy of the literal portion of the basic string
210 (excluding the last delimiter). The decoder inserts non-basic code
211 points, one for each delta, into the extended string, ultimately
212 arriving at the final decoded string.
213
214
215
216
217 Costello Standards Track [Page 4]
218 RFC 3492 IDNA Punycode March 2003
219
220
221 At the heart of this process is a state machine with two state
222 variables: an index i and a counter n. The index i refers to a
223 position in the extended string; it ranges from 0 (the first
224 position) to the current length of the extended string (which refers
225 to a potential position beyond the current end). If the current
226 state is <n,i>, the next state is <n,i+1> if i is less than the
227 length of the extended string, or <n+1,0> if i equals the length of
228 the extended string. In other words, each state change causes i to
229 increment, wrapping around to zero if necessary, and n counts the
230 number of wrap-arounds.
231
232 Notice that the state always advances monotonically (there is no way
233 for the decoder to return to an earlier state). At each state, an
234 insertion is either performed or not performed. At most one
235 insertion is performed in a given state. An insertion inserts the
236 value of n at position i in the extended string. The deltas are a
237 run-length encoding of this sequence of events: they are the lengths
238 of the runs of non-insertion states preceeding the insertion states.
239 Hence, for each delta, the decoder performs delta state changes, then
240 an insertion, and then one more state change. (An implementation
241 need not perform each state change individually, but can instead use
242 division and remainder calculations to compute the next insertion
243 state directly.) It is an error if the inserted code point is a
244 basic code point (because basic code points were supposed to be
245 segregated as described in section 3.1).
246
247 The encoder's main task is to derive the sequence of deltas that will
248 cause the decoder to construct the desired string. It can do this by
249 repeatedly scanning the extended string for the next code point that
250 the decoder would need to insert, and counting the number of state
251 changes the decoder would need to perform, mindful of the fact that
252 the decoder's extended string will include only those code points
253 that have already been inserted. Section 6.3 "Encoding procedure"
254 gives a precise algorithm.
255
256 3.3 Generalized variable-length integers
257
258 In a conventional integer representation the base is the number of
259 distinct symbols for digits, whose values are 0 through base-1. Let
260 digit_0 denote the least significant digit, digit_1 the next least
261 significant, and so on. The value represented is the sum over j of
262 digit_j * w(j), where w(j) = base^j is the weight (scale factor) for
263 position j. For example, in the base 8 integer 437, the digits are
264 7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 +
265 3*8 + 4*64 = 287. This representation has two disadvantages: First,
266 there are multiple encodings of each value (because there can be
267 extra zeros in the most significant positions), which is inconvenient
268
269
270
271
272 Costello Standards Track [Page 5]
273 RFC 3492 IDNA Punycode March 2003
274
275
276 when unique encodings are needed. Second, the integer is not self-
277 delimiting, so if multiple integers are concatenated the boundaries
278 between them are lost.
279
280 The generalized variable-length representation solves these two
281 problems. The digit values are still 0 through base-1, but now the
282 integer is self-delimiting by means of thresholds t(j), each of which
283 is in the range 0 through base-1. Exactly one digit, the most
284 significant, satisfies digit_j < t(j). Therefore, if several
285 integers are concatenated, it is easy to separate them, starting with
286 the first if they are little-endian (least significant digit first),
287 or starting with the last if they are big-endian (most significant
288 digit first). As before, the value is the sum over j of digit_j *
289 w(j), but the weights are different:
290
291 w(0) = 1
292 w(j) = w(j-1) * (base - t(j-1)) for j > 0
293
294 For example, consider the little-endian sequence of base 8 digits
295 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This
296 implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) =
297 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is not
298 less than 3, but 4 is less than 5, so 4 is the last digit. The value
299 of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, with
300 value 2*1 + 5*6 + 1*30 = 62. Decoding this representation is very
301 similar to decoding a conventional integer: Start with a current
302 value of N = 0 and a weight w = 1. Fetch the next digit d and
303 increase N by d * w. If d is less than the current threshold (t)
304 then stop, otherwise increase w by a factor of (base - t), update t
305 for the next position, and repeat.
306
307 Encoding this representation is similar to encoding a conventional
308 integer: If N < t then output one digit for N and stop, otherwise
309 output the digit for t + ((N - t) mod (base - t)), then replace N
310 with (N - t) div (base - t), update t for the next position, and
311 repeat.
312
313 For any particular set of values of t(j), there is exactly one
314 generalized variable-length representation of each nonnegative
315 integral value.
316
317 Bootstring uses little-endian ordering so that the deltas can be
318 separated starting with the first. The t(j) values are defined in
319 terms of the constants base, tmin, and tmax, and a state variable
320 called bias:
321
322 t(j) = base * (j + 1) - bias,
323 clamped to the range tmin through tmax
324
325
326
327 Costello Standards Track [Page 6]
328 RFC 3492 IDNA Punycode March 2003
329
330
331 The clamping means that if the formula yields a value less than tmin
332 or greater than tmax, then t(j) = tmin or tmax, respectively. (In
333 the pseudocode in section 6 "Bootstring algorithms", the expression
334 base * (j + 1) is denoted by k for performance reasons.) These t(j)
335 values cause the representation to favor integers within a particular
336 range determined by the bias.
337
338 3.4 Bias adaptation
339
340 After each delta is encoded or decoded, bias is set for the next
341 delta as follows:
342
343 1. Delta is scaled in order to avoid overflow in the next step:
344
345 let delta = delta div 2
346
347 But when this is the very first delta, the divisor is not 2, but
348 instead a constant called damp. This compensates for the fact
349 that the second delta is usually much smaller than the first.
350
351 2. Delta is increased to compensate for the fact that the next delta
352 will be inserting into a longer string:
353
354 let delta = delta + (delta div numpoints)
355
356 numpoints is the total number of code points encoded/decoded so
357 far (including the one corresponding to this delta itself, and
358 including the basic code points).
359
360 3. Delta is repeatedly divided until it falls within a threshold, to
361 predict the minimum number of digits needed to represent the next
362 delta:
363
364 while delta > ((base - tmin) * tmax) div 2
365 do let delta = delta div (base - tmin)
366
367 4. The bias is set:
368
369 let bias =
370 (base * the number of divisions performed in step 3) +
371 (((base - tmin + 1) * delta) div (delta + skew))
372
373 The motivation for this procedure is that the current delta
374 provides a hint about the likely size of the next delta, and so
375 t(j) is set to tmax for the more significant digits starting with
376 the one expected to be last, tmin for the less significant digits
377 up through the one expected to be third-last, and somewhere
378 between tmin and tmax for the digit expected to be second-last
379
380
381
382 Costello Standards Track [Page 7]
383 RFC 3492 IDNA Punycode March 2003
384
385
386 (balancing the hope of the expected-last digit being unnecessary
387 against the danger of it being insufficient).
388
389 4. Bootstring parameters
390
391 Given a set of basic code points, one needs to be designated as the
392 delimiter. The base cannot be greater than the number of
393 distinguishable basic code points remaining. The digit-values in the
394 range 0 through base-1 need to be associated with distinct non-
395 delimiter basic code points. In some cases multiple code points need
396 to have the same digit-value; for example, uppercase and lowercase
397 versions of the same letter need to be equivalent if basic strings
398 are case-insensitive.
399
400 The initial value of n cannot be greater than the minimum non-basic
401 code point that could appear in extended strings.
402
403 The remaining five parameters (tmin, tmax, skew, damp, and the
404 initial value of bias) need to satisfy the following constraints:
405
406 0 <= tmin <= tmax <= base-1
407 skew >= 1
408 damp >= 2
409 initial_bias mod base <= base - tmin
410
411 Provided the constraints are satisfied, these five parameters affect
412 efficiency but not correctness. They are best chosen empirically.
413
414 If support for mixed-case annotation is desired (see appendix A),
415 make sure that the code points corresponding to 0 through tmax-1 all
416 have both uppercase and lowercase forms.
417
418 5. Parameter values for Punycode
419
420 Punycode uses the following Bootstring parameter values:
421
422 base = 36
423 tmin = 1
424 tmax = 26
425 skew = 38
426 damp = 700
427 initial_bias = 72
428 initial_n = 128 = 0x80
429
430 Although the only restriction Punycode imposes on the input integers
431 is that they be nonnegative, these parameters are especially designed
432 to work well with Unicode [UNICODE] code points, which are integers
433 in the range 0..10FFFF (but not D800..DFFF, which are reserved for
434
435
436
437 Costello Standards Track [Page 8]
438 RFC 3492 IDNA Punycode March 2003
439
440
441 use by the UTF-16 encoding of Unicode). The basic code points are
442 the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the
443 delimiter, and some of the others have digit-values as follows:
444
445 code points digit-values
446 ------------ ----------------------
447 41..5A (A-Z) = 0 to 25, respectively
448 61..7A (a-z) = 0 to 25, respectively
449 30..39 (0-9) = 26 to 35, respectively
450
451 Using hyphen-minus as the delimiter implies that the encoded string
452 can end with a hyphen-minus only if the Unicode string consists
453 entirely of basic code points, but IDNA forbids such strings from
454 being encoded. The encoded string can begin with a hyphen-minus, but
455 IDNA prepends a prefix. Therefore IDNA using Punycode conforms to
456 the RFC 952 rule that host name labels neither begin nor end with a
457 hyphen-minus [RFC952].
458
459 A decoder MUST recognize the letters in both uppercase and lowercase
460 forms (including mixtures of both forms). An encoder SHOULD output
461 only uppercase forms or only lowercase forms, unless it uses mixed-
462 case annotation (see appendix A).
463
464 Presumably most users will not manually write or type encoded strings
465 (as opposed to cutting and pasting them), but those who do will need
466 to be alert to the potential visual ambiguity between the following
467 sets of characters:
468
469 G 6
470 I l 1
471 O 0
472 S 5
473 U V
474 Z 2
475
476 Such ambiguities are usually resolved by context, but in a Punycode
477 encoded string there is no context apparent to humans.
478
479 6. Bootstring algorithms
480
481 Some parts of the pseudocode can be omitted if the parameters satisfy
482 certain conditions (for which Punycode qualifies). These parts are
483 enclosed in {braces}, and notes immediately following the pseudocode
484 explain the conditions under which they can be omitted.
485
486
487
488
489
490
491
492 Costello Standards Track [Page 9]
493 RFC 3492 IDNA Punycode March 2003
494
495
496 Formally, code points are integers, and hence the pseudocode assumes
497 that arithmetic operations can be performed directly on code points.
498 In some programming languages, explicit conversion between code
499 points and integers might be necessary.
500
501 6.1 Bias adaptation function
502
503 function adapt(delta,numpoints,firsttime):
504 if firsttime then let delta = delta div damp
505 else let delta = delta div 2
506 let delta = delta + (delta div numpoints)
507 let k = 0
508 while delta > ((base - tmin) * tmax) div 2 do begin
509 let delta = delta div (base - tmin)
510 let k = k + base
511 end
512 return k + (((base - tmin + 1) * delta) div (delta + skew))
513
514 It does not matter whether the modifications to delta and k inside
515 adapt() affect variables of the same name inside the
516 encoding/decoding procedures, because after calling adapt() the
517 caller does not read those variables before overwriting them.
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547 Costello Standards Track [Page 10]
548 RFC 3492 IDNA Punycode March 2003
549
550
551 6.2 Decoding procedure
552
553 let n = initial_n
554 let i = 0
555 let bias = initial_bias
556 let output = an empty string indexed from 0
557 consume all code points before the last delimiter (if there is one)
558 and copy them to output, fail on any non-basic code point
559 if more than zero code points were consumed then consume one more
560 (which will be the last delimiter)
561 while the input is not exhausted do begin
562 let oldi = i
563 let w = 1
564 for k = base to infinity in steps of base do begin
565 consume a code point, or fail if there was none to consume
566 let digit = the code point's digit-value, fail if it has none
567 let i = i + digit * w, fail on overflow
568 let t = tmin if k <= bias {+ tmin}, or
569 tmax if k >= bias + tmax, or k - bias otherwise
570 if digit < t then break
571 let w = w * (base - t), fail on overflow
572 end
573 let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
574 let n = n + i div (length(output) + 1), fail on overflow
575 let i = i mod (length(output) + 1)
576 {if n is a basic code point then fail}
577 insert n into output at position i
578 increment i
579 end
580
581 The full statement enclosed in braces (checking whether n is a basic
582 code point) can be omitted if initial_n exceeds all basic code points
583 (which is true for Punycode), because n is never less than initial_n.
584
585 In the assignment of t, where t is clamped to the range tmin through
586 tmax, "+ tmin" can always be omitted. This makes the clamping
587 calculation incorrect when bias < k < bias + tmin, but that cannot
588 happen because of the way bias is computed and because of the
589 constraints on the parameters.
590
591 Because the decoder state can only advance monotonically, and there
592 is only one representation of any delta, there is therefore only one
593 encoded string that can represent a given sequence of integers. The
594 only error conditions are invalid code points, unexpected end-of-
595 input, overflow, and basic code points encoded using deltas instead
596 of appearing literally. If the decoder fails on these errors as
597 shown above, then it cannot produce the same output for two distinct
598 inputs. Without this property it would have been necessary to re-
599
600
601
602 Costello Standards Track [Page 11]
603 RFC 3492 IDNA Punycode March 2003
604
605
606 encode the output and verify that it matches the input in order to
607 guarantee the uniqueness of the encoding.
608
609 6.3 Encoding procedure
610
611 let n = initial_n
612 let delta = 0
613 let bias = initial_bias
614 let h = b = the number of basic code points in the input
615 copy them to the output in order, followed by a delimiter if b > 0
616 {if the input contains a non-basic code point < n then fail}
617 while h < length(input) do begin
618 let m = the minimum {non-basic} code point >= n in the input
619 let delta = delta + (m - n) * (h + 1), fail on overflow
620 let n = m
621 for each code point c in the input (in order) do begin
622 if c < n {or c is basic} then increment delta, fail on overflow
623 if c == n then begin
624 let q = delta
625 for k = base to infinity in steps of base do begin
626 let t = tmin if k <= bias {+ tmin}, or
627 tmax if k >= bias + tmax, or k - bias otherwise
628 if q < t then break
629 output the code point for digit t + ((q - t) mod (base - t))
630 let q = (q - t) div (base - t)
631 end
632 output the code point for digit q
633 let bias = adapt(delta, h + 1, test h equals b?)
634 let delta = 0
635 increment h
636 end
637 end
638 increment delta and n
639 end
640
641 The full statement enclosed in braces (checking whether the input
642 contains a non-basic code point less than n) can be omitted if all
643 code points less than initial_n are basic code points (which is true
644 for Punycode if code points are unsigned).
645
646 The brace-enclosed conditions "non-basic" and "or c is basic" can be
647 omitted if initial_n exceeds all basic code points (which is true for
648 Punycode), because the code point being tested is never less than
649 initial_n.
650
651 In the assignment of t, where t is clamped to the range tmin through
652 tmax, "+ tmin" can always be omitted. This makes the clamping
653 calculation incorrect when bias < k < bias + tmin, but that cannot
654
655
656
657 Costello Standards Track [Page 12]
658 RFC 3492 IDNA Punycode March 2003
659
660
661 happen because of the way bias is computed and because of the
662 constraints on the parameters.
663
664 The checks for overflow are necessary to avoid producing invalid
665 output when the input contains very large values or is very long.
666
667 The increment of delta at the bottom of the outer loop cannot
668 overflow because delta < length(input) before the increment, and
669 length(input) is already assumed to be representable. The increment
670 of n could overflow, but only if h == length(input), in which case
671 the procedure is finished anyway.
672
673 6.4 Overflow handling
674
675 For IDNA, 26-bit unsigned integers are sufficient to handle all valid
676 IDNA labels without overflow, because any string that needed a 27-bit
677 delta would have to exceed either the code point limit (0..10FFFF) or
678 the label length limit (63 characters). However, overflow handling
679 is necessary because the inputs are not necessarily valid IDNA
680 labels.
681
682 If the programming language does not provide overflow detection, the
683 following technique can be used. Suppose A, B, and C are
684 representable nonnegative integers and C is nonzero. Then A + B
685 overflows if and only if B > maxint - A, and A + (B * C) overflows if
The IETF is responsible for the creation and maintenance of the DNS RFCs. The ICANN DNS RFC annotation project provides a forum for collecting community annotations on these RFCs as an aid to understanding for implementers and any interested parties. The annotations displayed here are not the result of the IETF consensus process.
This RFC is included in the DNS RFCs annotation project whose home page is here.
This RFC is implemented in BIND 9.18 (all versions).
RFC5891 says that it updates RFC 3492, but it doesn't really do so.
Instead, Section 4.4 says:
This document does not update or alter the Punycode algorithm specified in RFC 3492 in any way. RFC 3492 does make a non-normative reference to the information about the value and construction of the ACE prefix that appears in RFC 3490 or Nameprep [RFC3491]. For consistency and reader convenience, IDNA2008 effectively updates that reference to point to this document. That change does not alter the prefix itself. The prefix, "xn--", is the same in both sets of documents.
686 and only if B > (maxint - A) div C, where maxint is the greatest
687 integer for which maxint + 1 cannot be represented. Refer to
688 appendix C "Punycode sample implementation" for demonstrations of
689 this technique in the C language.
690
691 The decoding and encoding algorithms shown in sections 6.2 and 6.3
692 handle overflow by detecting it whenever it happens. Another
693 approach is to enforce limits on the inputs that prevent overflow
694 from happening. For example, if the encoder were to verify that no
695 input code points exceed M and that the input length does not exceed
696 L, then no delta could ever exceed (M - initial_n) * (L + 1), and
697 hence no overflow could occur if integer variables were capable of
698 representing values that large. This prevention approach would
699 impose more restrictions on the input than the detection approach
700 does, but might be considered simpler in some programming languages.
701
702 In theory, the decoder could use an analogous approach, limiting the
703 number of digits in a variable-length integer (that is, limiting the
704 number of iterations in the innermost loop). However, the number of
705 digits that suffice to represent a given delta can sometimes
706 represent much larger deltas (because of the adaptation), and hence
707 this approach would probably need integers wider than 32 bits.
708
709
710
711
712 Costello Standards Track [Page 13]
713 RFC 3492 IDNA Punycode March 2003
714
715
716 Yet another approach for the decoder is to allow overflow to occur,
717 but to check the final output string by re-encoding it and comparing
718 to the decoder input. If and only if they do not match (using a
719 case-insensitive ASCII comparison) overflow has occurred. This
720 delayed-detection approach would not impose any more restrictions on
721 the input than the immediate-detection approach does, and might be
722 considered simpler in some programming languages.
723
724 In fact, if the decoder is used only inside the IDNA ToUnicode
725 operation [IDNA], then it need not check for overflow at all, because
726 ToUnicode performs a higher level re-encoding and comparison, and a
727 mismatch has the same consequence as if the Punycode decoder had
728 failed.
729
730 7. Punycode examples
731
732 7.1 Sample strings
733
734 In the Punycode encodings below, the ACE prefix is not shown.
735 Backslashes show where line breaks have been inserted in strings too
736 long for one line.
737
738 The first several examples are all translations of the sentence "Why
739 can't they just speak in <language>?" (courtesy of Michael Kaplan's
740 "provincial" page [PROVINCIAL]). Word breaks and punctuation have
741 been removed, as is often done in domain names.
742
743 (A) Arabic (Egyptian):
744 u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
745 u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
746 Punycode: egbpdaj6bu4bxfgehfvwxn
747
748 (B) Chinese (simplified):
749 u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
750 Punycode: ihqwcrb4cv8a8dqg056pqjye
751
752 (C) Chinese (traditional):
753 u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
754 Punycode: ihqwctvzc91f659drss3x8bo0yb
755
756 (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
757 U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
758 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
759 u+0065 u+0073 u+006B u+0079
760 Punycode: Proprostnemluvesky-uyb24dma41a
761
762
763
764
765
766
767 Costello Standards Track [Page 14]
768 RFC 3492 IDNA Punycode March 2003
769
770
771 (E) Hebrew:
772 u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
773 u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
774 u+05D1 u+05E8 u+05D9 u+05EA
775 Punycode: 4dbcagdahymbxekheh6e0a7fei0b
776
777 (F) Hindi (Devanagari):
778 u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
779 u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
780 u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
781 u+0939 u+0948 u+0902
782 Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
783
784 (G) Japanese (kanji and hiragana):
785 u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
786 u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
787 Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
788
789 (H) Korean (Hangul syllables):
790 u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
791 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
792 u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
793 Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
794 psd879ccm6fea98c
795
where maxint is the greatest integer for which maxint + 1 cannot be represented.
where maxint is the maximum value of an integer variable.
796 (I) Russian (Cyrillic):
797 U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
798 u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
799 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
800 u+0438
801 Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
802
803 (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
804 U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
805 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
806 u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
807 u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
808 u+0061 u+00F1 u+006F u+006C
809 Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a
810
811 (K) Vietnamese:
812 T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
813 <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
814 U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
815 u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
816 u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
817 U+0056 u+0069 u+1EC7 u+0074
818 Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g
819
820
821
822 Costello Standards Track [Page 15]
823 RFC 3492 IDNA Punycode March 2003
824
825
826 The next several examples are all names of Japanese music artists,
827 song titles, and TV programs, just because the author happens to have
828 them handy (but Japanese is useful for providing examples of single-
829 row text, two-row text, ideographic text, and various mixtures
830 thereof).
831
832 (L) 3<nen>B<gumi><kinpachi><sensei>
833 u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
834 Punycode: 3B-ww4c5e180e575a65lsy2b
835
836 (M) <amuro><namie>-with-SUPER-MONKEYS
837 u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
838 u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
839 U+004F U+004E U+004B U+0045 U+0059 U+0053
840 Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n
841
842 (N) Hello-Another-Way-<sorezore><no><basho>
843 U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
844 u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
845 u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
846 Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b
847
848 (O) <hitotsu><yane><no><shita>2
849 u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
850 Punycode: 2-u9tlzr9756bt3uc0v
851
852 (P) Maji<de>Koi<suru>5<byou><mae>
853 U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
854 u+308B u+0035 u+79D2 u+524D
855 Punycode: MajiKoi5-783gue6qz075azm5e
856
857 (Q) <pafii>de<runba>
858 u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
859 Punycode: de-jg4avhby1noc0d
860
861 (R) <sono><supiido><de>
862 u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
863 Punycode: d9juau41awczczp
864
865 The last example is an ASCII string that breaks the existing rules
866 for host name labels. (It is not a realistic example for IDNA,
867 because IDNA never encodes pure ASCII labels.)
868
869 (S) -> $1.00 <-
870 u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
871 u+003C u+002D
872 Punycode: -> $1.00 <--
873
874
875
876
877 Costello Standards Track [Page 16]
878 RFC 3492 IDNA Punycode March 2003
879
880
881 7.2 Decoding traces
882
883 In the following traces, the evolving state of the decoder is shown
884 as a sequence of hexadecimal values, representing the code points in
885 the extended string. An asterisk appears just after the most
886 recently inserted code point, indicating both n (the value preceeding
887 the asterisk) and i (the position of the value just after the
888 asterisk). Other numerical values are decimal.
889
890 Decoding trace of example B from section 7.1:
891
892 n is 128, i is 0, bias is 72
893 input is "ihqwcrb4cv8a8dqg056pqjye"
894 there is no delimiter, so extended string starts empty
895 delta "ihq" decodes to 19853
896 bias becomes 21
897 4E0D *
898 delta "wc" decodes to 64
899 bias becomes 20
900 4E0D 4E2D *
901 delta "rb" decodes to 37
902 bias becomes 13
903 4E3A * 4E0D 4E2D
904 delta "4c" decodes to 56
905 bias becomes 17
906 4E3A 4E48 * 4E0D 4E2D
907 delta "v8a" decodes to 599
908 bias becomes 32
909 4E3A 4EC0 * 4E48 4E0D 4E2D
910 delta "8d" decodes to 130
911 bias becomes 23
912 4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
913 delta "qg" decodes to 154
914 bias becomes 25
915 4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
916 delta "056p" decodes to 46301
917 bias becomes 84
918 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
919 delta "qjye" decodes to 88531
920 bias becomes 90
921 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
922
923
924
925
926
927
928
929
930
931
932 Costello Standards Track [Page 17]
933 RFC 3492 IDNA Punycode March 2003
934
935
936 Decoding trace of example L from section 7.1:
937
938 n is 128, i is 0, bias is 72
939 input is "3B-ww4c5e180e575a65lsy2b"
940 literal portion is "3B-", so extended string starts as:
941 0033 0042
942 delta "ww4c" decodes to 62042
943 bias becomes 27
944 0033 0042 5148 *
945 delta "5e" decodes to 139
946 bias becomes 24
947 0033 0042 516B * 5148
948 delta "180e" decodes to 16683
949 bias becomes 67
950 0033 5E74 * 0042 516B 5148
951 delta "575a" decodes to 34821
952 bias becomes 82
953 0033 5E74 0042 516B 5148 751F *
954 delta "65l" decodes to 14592
955 bias becomes 67
956 0033 5E74 0042 7D44 * 516B 5148 751F
957 delta "sy2b" decodes to 42088
958 bias becomes 84
959 0033 5E74 0042 7D44 91D1 * 516B 5148 751F
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987 Costello Standards Track [Page 18]
988 RFC 3492 IDNA Punycode March 2003
989
990
991 7.3 Encoding traces
992
993 In the following traces, code point values are hexadecimal, while
994 other numerical values are decimal.
995
996 Encoding trace of example B from section 7.1:
997
998 bias is 72
999 input is:
1000 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
1001 there are no basic code points, so no literal portion
1002 next code point to insert is 4E0D
1003 needed delta is 19853, encodes as "ihq"
1004 bias becomes 21
1005 next code point to insert is 4E2D
1006 needed delta is 64, encodes as "wc"
1007 bias becomes 20
1008 next code point to insert is 4E3A
1009 needed delta is 37, encodes as "rb"
1010 bias becomes 13
1011 next code point to insert is 4E48
1012 needed delta is 56, encodes as "4c"
1013 bias becomes 17
1014 next code point to insert is 4EC0
1015 needed delta is 599, encodes as "v8a"
1016 bias becomes 32
1017 next code point to insert is 4ED6
1018 needed delta is 130, encodes as "8d"
1019 bias becomes 23
1020 next code point to insert is 4EEC
1021 needed delta is 154, encodes as "qg"
1022 bias becomes 25
1023 next code point to insert is 6587
1024 needed delta is 46301, encodes as "056p"
1025 bias becomes 84
1026 next code point to insert is 8BF4
1027 needed delta is 88531, encodes as "qjye"
1028 bias becomes 90
1029 output is "ihqwcrb4cv8a8dqg056pqjye"
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042 Costello Standards Track [Page 19]
1043 RFC 3492 IDNA Punycode March 2003
1044
1045
1046 Encoding trace of example L from section 7.1:
1047
1048 bias is 72
1049 input is:
1050 0033 5E74 0042 7D44 91D1 516B 5148 751F
1051 basic code points (0033, 0042) are copied to literal portion: "3B-"
1052 next code point to insert is 5148
1053 needed delta is 62042, encodes as "ww4c"
1054 bias becomes 27
1055 next code point to insert is 516B
1056 needed delta is 139, encodes as "5e"
1057 bias becomes 24
1058 next code point to insert is 5E74
1059 needed delta is 16683, encodes as "180e"
1060 bias becomes 67
1061 next code point to insert is 751F
1062 needed delta is 34821, encodes as "575a"
1063 bias becomes 82
1064 next code point to insert is 7D44
1065 needed delta is 14592, encodes as "65l"
1066 bias becomes 67
1067 next code point to insert is 91D1
1068 needed delta is 42088, encodes as "sy2b"
1069 bias becomes 84
1070 output is "3B-ww4c5e180e575a65lsy2b"
1071
1072 8. Security Considerations
1073
1074 Users expect each domain name in DNS to be controlled by a single
1075 authority. If a Unicode string intended for use as a domain label
1076 could map to multiple ACE labels, then an internationalized domain
1077 name could map to multiple ASCII domain names, each controlled by a
1078 different authority, some of which could be spoofs that hijack
1079 service requests intended for another. Therefore Punycode is
1080 designed so that each Unicode string has a unique encoding.
1081
1082 However, there can still be multiple Unicode representations of the
1083 "same" text, for various definitions of "same". This problem is
1084 addressed to some extent by the Unicode standard under the topic of
1085 canonicalization, and this work is leveraged for domain names by
1086 Nameprep [NAMEPREP].
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097 Costello Standards Track [Page 20]
1098 RFC 3492 IDNA Punycode March 2003
1099
1100
1101 9. References
1102
1103 9.1 Normative References
1104
1105 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1106 Requirement Levels", BCP 14, RFC 2119, March 1997.
1107
1108 9.2 Informative References
1109
1110 [RFC952] Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet
1111 Host Table Specification", RFC 952, October 1985.
1112
1113 [RFC1034] Mockapetris, P., "Domain Names - Concepts and
1114 Facilities", STD 13, RFC 1034, November 1987.
1115
1116 [IDNA] Faltstrom, P., Hoffman, P. and A. Costello,
1117 "Internationalizing Domain Names in Applications
1118 (IDNA)", RFC 3490, March 2003.
1119
1120 [NAMEPREP] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1121 Profile for Internationalized Domain Names (IDN)", RFC
1122 3491, March 2003.
1123
1124 [ASCII] Cerf, V., "ASCII format for Network Interchange", RFC
1125 20, October 1969.
1126
1127 [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page",
1128 http://www.trigeminal.com/samples/provincial.html.
1129
1130 [UNICODE] The Unicode Consortium, "The Unicode Standard",
1131 http://www.unicode.org/unicode/standard/standard.html.
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152 Costello Standards Track [Page 21]
1153 RFC 3492 IDNA Punycode March 2003
1154
1155
1156 A. Mixed-case annotation
1157
1158 In order to use Punycode to represent case-insensitive strings,
1159 higher layers need to case-fold the strings prior to Punycode
1160 encoding. The encoded string can use mixed case as an annotation
1161 telling how to convert the folded string into a mixed-case string for
1162 display purposes. Note, however, that mixed-case annotation is not
1163 used by the ToASCII and ToUnicode operations specified in [IDNA], and
1164 therefore implementors of IDNA can disregard this appendix.
1165
1166 Basic code points can use mixed case directly, because the decoder
1167 copies them verbatim, leaving lowercase code points lowercase, and
1168 leaving uppercase code points uppercase. Each non-basic code point
1169 is represented by a delta, which is represented by a sequence of
1170 basic code points, the last of which provides the annotation. If it
1171 is uppercase, it is a suggestion to map the non-basic code point to
1172 uppercase (if possible); if it is lowercase, it is a suggestion to
1173 map the non-basic code point to lowercase (if possible).
1174
1175 These annotations do not alter the code points returned by decoders;
1176 the annotations are returned separately, for the caller to use or
1177 ignore. Encoders can accept annotations in addition to code points,
1178 but the annotations do not alter the output, except to influence the
1179 uppercase/lowercase form of ASCII letters.
1180
1181 Punycode encoders and decoders need not support these annotations,
1182 and higher layers need not use them.
1183
1184 B. Disclaimer and license
1185
1186 Regarding this entire document or any portion of it (including the
1187 pseudocode and C code), the author makes no guarantees and is not
1188 responsible for any damage resulting from its use. The author grants
1189 irrevocable permission to anyone to use, modify, and distribute it in
1190 any way that does not diminish the rights of anyone else to use,
1191 modify, and distribute it, provided that redistributed derivative
1192 works do not contain misleading author or version information.
1193 Derivative works need not be licensed under similar terms.
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207 Costello Standards Track [Page 22]
1208 RFC 3492 IDNA Punycode March 2003
1209
1210
1211 C. Punycode sample implementation
1212
1213 /*
1214 punycode.c from RFC 3492
1215 http://www.nicemice.net/idn/
1216 Adam M. Costello
1217 http://www.nicemice.net/amc/
1218
1219 This is ANSI C code (C89) implementing Punycode (RFC 3492).
1220
1221 */
1222
1223
1224 /************************************************************/
1225 /* Public interface (would normally go in its own .h file): */
1226
1227 #include <limits.h>
1228
1229 enum punycode_status {
1230 punycode_success,
1231 punycode_bad_input, /* Input is invalid. */
1232 punycode_big_output, /* Output would exceed the space provided. */
1233 punycode_overflow /* Input needs wider integers to process. */
1234 };
1235
1236 #if UINT_MAX >= (1 << 26) - 1
1237 typedef unsigned int punycode_uint;
1238 #else
1239 typedef unsigned long punycode_uint;
1240 #endif
1241
1242 enum punycode_status punycode_encode(
1243 punycode_uint input_length,
1244 const punycode_uint input[],
1245 const unsigned char case_flags[],
1246 punycode_uint *output_length,
1247 char output[] );
1248
1249 /* punycode_encode() converts Unicode to Punycode. The input */
1250 /* is represented as an array of Unicode code points (not code */
1251 /* units; surrogate pairs are not allowed), and the output */
1252 /* will be represented as an array of ASCII code points. The */
1253 /* output string is *not* null-terminated; it will contain */
1254 /* zeros if and only if the input contains zeros. (Of course */
1255 /* the caller can leave room for a terminator and add one if */
1256 /* needed.) The input_length is the number of code points in */
1257 /* the input. The output_length is an in/out argument: the */
1258 /* caller passes in the maximum number of code points that it */
1259
1260
1261
1262 Costello Standards Track [Page 23]
1263 RFC 3492 IDNA Punycode March 2003
1264
1265
1266 /* can receive, and on successful return it will contain the */
1267 /* number of code points actually output. The case_flags array */
1268 /* holds input_length boolean values, where nonzero suggests that */
1269 /* the corresponding Unicode character be forced to uppercase */
1270 /* after being decoded (if possible), and zero suggests that */
1271 /* it be forced to lowercase (if possible). ASCII code points */
1272 /* are encoded literally, except that ASCII letters are forced */
1273 /* to uppercase or lowercase according to the corresponding */
1274 /* uppercase flags. If case_flags is a null pointer then ASCII */
1275 /* letters are left as they are, and other code points are */
1276 /* treated as if their uppercase flags were zero. The return */
1277 /* value can be any of the punycode_status values defined above */
1278 /* except punycode_bad_input; if not punycode_success, then */
1279 /* output_size and output might contain garbage. */
1280
1281 enum punycode_status punycode_decode(
1282 punycode_uint input_length,
1283 const char input[],
1284 punycode_uint *output_length,
1285 punycode_uint output[],
1286 unsigned char case_flags[] );
1287
1288 /* punycode_decode() converts Punycode to Unicode. The input is */
1289 /* represented as an array of ASCII code points, and the output */
1290 /* will be represented as an array of Unicode code points. The */
1291 /* input_length is the number of code points in the input. The */
1292 /* output_length is an in/out argument: the caller passes in */
1293 /* the maximum number of code points that it can receive, and */
1294 /* on successful return it will contain the actual number of */
1295 /* code points output. The case_flags array needs room for at */
1296 /* least output_length values, or it can be a null pointer if the */
1297 /* case information is not needed. A nonzero flag suggests that */
1298 /* the corresponding Unicode character be forced to uppercase */
1299 /* by the caller (if possible), while zero suggests that it be */
1300 /* forced to lowercase (if possible). ASCII code points are */
1301 /* output already in the proper case, but their flags will be set */
1302 /* appropriately so that applying the flags would be harmless. */
1303 /* The return value can be any of the punycode_status values */
1304 /* defined above; if not punycode_success, then output_length, */
1305 /* output, and case_flags might contain garbage. On success, the */
1306 /* decoder will never need to write an output_length greater than */
1307 /* input_length, because of how the encoding is defined. */
1308
1309 /**********************************************************/
1310 /* Implementation (would normally go in its own .c file): */
1311
1312 #include <string.h>
1313
1314
1315
1316
1317 Costello Standards Track [Page 24]
1318 RFC 3492 IDNA Punycode March 2003
1319
1320
1321 /*** Bootstring parameters for Punycode ***/
1322
1323 enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
1324 initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
1325
1326 /* basic(cp) tests whether cp is a basic code point: */
1327 #define basic(cp) ((punycode_uint)(cp) < 0x80)
1328
1329 /* delim(cp) tests whether cp is a delimiter: */
1330 #define delim(cp) ((cp) == delimiter)
1331
1332 /* decode_digit(cp) returns the numeric value of a basic code */
1333 /* point (for use in representing integers) in the range 0 to */
1334 /* base-1, or base if cp is does not represent a value. */
1335
1336 static punycode_uint decode_digit(punycode_uint cp)
1337 {
1338 return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
1339 cp - 97 < 26 ? cp - 97 : base;
1340 }
1341
1342 /* encode_digit(d,flag) returns the basic code point whose value */
1343 /* (when used for representing integers) is d, which needs to be in */
1344 /* the range 0 to base-1. The lowercase form is used unless flag is */
1345 /* nonzero, in which case the uppercase form is used. The behavior */
1346 /* is undefined if flag is nonzero and digit d has no uppercase form. */
1347
1348 static char encode_digit(punycode_uint d, int flag)
1349 {
1350 return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
1351 /* 0..25 map to ASCII a..z or A..Z */
1352 /* 26..35 map to ASCII 0..9 */
1353 }
1354
1355 /* flagged(bcp) tests whether a basic code point is flagged */
1356 /* (uppercase). The behavior is undefined if bcp is not a */
1357 /* basic code point. */
1358
1359 #define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
1360
1361 /* encode_basic(bcp,flag) forces a basic code point to lowercase */
1362 /* if flag is zero, uppercase if flag is nonzero, and returns */
1363 /* the resulting code point. The code point is unchanged if it */
1364 /* is caseless. The behavior is undefined if bcp is not a basic */
1365 /* code point. */
1366
1367 static char encode_basic(punycode_uint bcp, int flag)
1368 {
1369
1370
1371
1372 Costello Standards Track [Page 25]
1373 RFC 3492 IDNA Punycode March 2003
1374
1375
1376 bcp -= (bcp - 97 < 26) << 5;
1377 return bcp + ((!flag && (bcp - 65 < 26)) << 5);
1378 }
1379
1380 /*** Platform-specific constants ***/
1381
1382 /* maxint is the maximum value of a punycode_uint variable: */
1383 static const punycode_uint maxint = -1;
1384 /* Because maxint is unsigned, -1 becomes the maximum value. */
1385
1386 /*** Bias adaptation function ***/
1387
1388 static punycode_uint adapt(
1389 punycode_uint delta, punycode_uint numpoints, int firsttime )
1390 {
1391 punycode_uint k;
1392
1393 delta = firsttime ? delta / damp : delta >> 1;
1394 /* delta >> 1 is a faster way of doing delta / 2 */
1395 delta += delta / numpoints;
1396
1397 for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) {
1398 delta /= base - tmin;
1399 }
1400
1401 return k + (base - tmin + 1) * delta / (delta + skew);
1402 }
1403
1404 /*** Main encode function ***/
1405
1406 enum punycode_status punycode_encode(
1407 punycode_uint input_length,
1408 const punycode_uint input[],
1409 const unsigned char case_flags[],
1410 punycode_uint *output_length,
1411 char output[] )
1412 {
1413 punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
1414
1415 /* Initialize the state: */
1416
1417 n = initial_n;
1418 delta = out = 0;
1419 max_out = *output_length;
1420 bias = initial_bias;
1421
1422 /* Handle the basic code points: */
1423
1424
1425
1426
1427 Costello Standards Track [Page 26]
1428 RFC 3492 IDNA Punycode March 2003
1429
1430
1431 for (j = 0; j < input_length; ++j) {
1432 if (basic(input[j])) {
1433 if (max_out - out < 2) return punycode_big_output;
1434 output[out++] =
1435 case_flags ? encode_basic(input[j], case_flags[j]) : input[j];
1436 }
1437 /* else if (input[j] < n) return punycode_bad_input; */
1438 /* (not needed for Punycode with unsigned code points) */
1439 }
1440
1441 h = b = out;
1442
1443 /* h is the number of code points that have been handled, b is the */
1444 /* number of basic code points, and out is the number of characters */
1445 /* that have been output. */
1446
1447 if (b > 0) output[out++] = delimiter;
1448
1449 /* Main encoding loop: */
1450
1451 while (h < input_length) {
1452 /* All non-basic code points < n have been */
1453 /* handled already. Find the next larger one: */
1454
1455 for (m = maxint, j = 0; j < input_length; ++j) {
1456 /* if (basic(input[j])) continue; */
1457 /* (not needed for Punycode) */
1458 if (input[j] >= n && input[j] < m) m = input[j];
1459 }
1460
1461 /* Increase delta enough to advance the decoder's */
1462 /* <n,i> state to <m,0>, but guard against overflow: */
1463
1464 if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
1465 delta += (m - n) * (h + 1);
1466 n = m;
1467
1468 for (j = 0; j < input_length; ++j) {
1469 /* Punycode does not need to check whether input[j] is basic: */
1470 if (input[j] < n /* || basic(input[j]) */ ) {
1471 if (++delta == 0) return punycode_overflow;
1472 }
1473
1474 if (input[j] == n) {
1475 /* Represent delta as a generalized variable-length integer: */
1476
1477 for (q = delta, k = base; ; k += base) {
1478 if (out >= max_out) return punycode_big_output;
1479
1480
1481
1482 Costello Standards Track [Page 27]
1483 RFC 3492 IDNA Punycode March 2003
1484
1485
1486 t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
1487 k >= bias + tmax ? tmax : k - bias;
1488 if (q < t) break;
1489 output[out++] = encode_digit(t + (q - t) % (base - t), 0);
1490 q = (q - t) / (base - t);
1491 }
1492
1493 output[out++] = encode_digit(q, case_flags && case_flags[j]);
1494 bias = adapt(delta, h + 1, h == b);
1495 delta = 0;
1496 ++h;
1497 }
1498 }
1499
1500 ++delta, ++n;
1501 }
1502
1503 *output_length = out;
1504 return punycode_success;
1505 }
1506
1507 /*** Main decode function ***/
1508
1509 enum punycode_status punycode_decode(
1510 punycode_uint input_length,
1511 const char input[],
1512 punycode_uint *output_length,
1513 punycode_uint output[],
1514 unsigned char case_flags[] )
1515 {
1516 punycode_uint n, out, i, max_out, bias,
1517 b, j, in, oldi, w, k, digit, t;
1518
1519 /* Initialize the state: */
1520
1521 n = initial_n;
1522 out = i = 0;
1523 max_out = *output_length;
1524 bias = initial_bias;
1525
1526 /* Handle the basic code points: Let b be the number of input code */
1527 /* points before the last delimiter, or 0 if there is none, then */
1528 /* copy the first b code points to the output. */
1529
1530 for (b = j = 0; j < input_length; ++j) if (delim(input[j])) b = j;
1531 if (b > max_out) return punycode_big_output;
1532
1533 for (j = 0; j < b; ++j) {
1534
1535
1536
1537 Costello Standards Track [Page 28]
1538 RFC 3492 IDNA Punycode March 2003
1539
1540
1541 if (case_flags) case_flags[out] = flagged(input[j]);
1542 if (!basic(input[j])) return punycode_bad_input;
1543 output[out++] = input[j];
1544 }
1545
1546 /* Main decoding loop: Start just after the last delimiter if any */
1547 /* basic code points were copied; start at the beginning otherwise. */
1548
1549 for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) {
1550
1551 /* in is the index of the next character to be consumed, and */
1552 /* out is the number of code points in the output array. */
1553
1554 /* Decode a generalized variable-length integer into delta, */
1555 /* which gets added to i. The overflow checking is easier */
1556 /* if we increase i as we go, then subtract off its starting */
1557 /* value at the end to obtain delta. */
1558
1559 for (oldi = i, w = 1, k = base; ; k += base) {
1560 if (in >= input_length) return punycode_bad_input;
1561 digit = decode_digit(input[in++]);
1562 if (digit >= base) return punycode_bad_input;
1563 if (digit > (maxint - i) / w) return punycode_overflow;
1564 i += digit * w;
1565 t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
1566 k >= bias + tmax ? tmax : k - bias;
1567 if (digit < t) break;
1568 if (w > maxint / (base - t)) return punycode_overflow;
1569 w *= (base - t);
1570 }
1571
1572 bias = adapt(i - oldi, out + 1, oldi == 0);
1573
1574 /* i was supposed to wrap around from out+1 to 0, */
1575 /* incrementing n each time, so we'll fix that now: */
1576
1577 if (i / (out + 1) > maxint - n) return punycode_overflow;
1578 n += i / (out + 1);
1579 i %= (out + 1);
1580
1581 /* Insert n at position i of the output: */
1582
1583 /* not needed for Punycode: */
1584 /* if (decode_digit(n) <= base) return punycode_invalid_input; */
1585 if (out >= max_out) return punycode_big_output;
1586
1587 if (case_flags) {
1588 memmove(case_flags + i + 1, case_flags + i, out - i);
1589
1590
1591
1592 Costello Standards Track [Page 29]
1593 RFC 3492 IDNA Punycode March 2003
1594
1595
1596 /* Case of last character determines uppercase flag: */
1597 case_flags[i] = flagged(input[in - 1]);
1598 }
1599
1600 memmove(output + i + 1, output + i, (out - i) * sizeof *output);
1601 output[i++] = n;
1602 }
1603
1604 *output_length = out;
1605 return punycode_success;
1606 }
1607
1608 /******************************************************************/
1609 /* Wrapper for testing (would normally go in a separate .c file): */
1610
1611 #include <assert.h>
1612 #include <stdio.h>
1613 #include <stdlib.h>
1614 #include <string.h>
1615
1616 /* For testing, we'll just set some compile-time limits rather than */
1617 /* use malloc(), and set a compile-time option rather than using a */
1618 /* command-line option. */
1619
1620 enum {
1621 unicode_max_length = 256,
1622 ace_max_length = 256
1623 };
1624
1625 static void usage(char **argv)
1626 {
1627 fprintf(stderr,
1628 "\n"
1629 "%s -e reads code points and writes a Punycode string.\n"
1630 "%s -d reads a Punycode string and writes code points.\n"
1631 "\n"
1632 "Input and output are plain text in the native character set.\n"
1633 "Code points are in the form u+hex separated by whitespace.\n"
1634 "Although the specification allows Punycode strings to contain\n"
1635 "any characters from the ASCII repertoire, this test code\n"
1636 "supports only the printable characters, and needs the Punycode\n"
1637 "string to be followed by a newline.\n"
1638 "The case of the u in u+hex is the force-to-uppercase flag.\n"
1639 , argv[0], argv[0]);
1640 exit(EXIT_FAILURE);
1641 }
1642
1643 static void fail(const char *msg)
1644
1645
1646
1647 Costello Standards Track [Page 30]
1648 RFC 3492 IDNA Punycode March 2003
1649
1650
1651 {
1652 fputs(msg,stderr);
1653 exit(EXIT_FAILURE);
1654 }
1655
1656 static const char too_big[] =
1657 "input or output is too large, recompile with larger limits\n";
1658 static const char invalid_input[] = "invalid input\n";
1659 static const char overflow[] = "arithmetic overflow\n";
1660 static const char io_error[] = "I/O error\n";
1661
1662 /* The following string is used to convert printable */
1663 /* characters between ASCII and the native charset: */
1664
1665 static const char print_ascii[] =
1666 "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1667 "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1668 " !\"#$%&'()*+,-./"
1669 "0123456789:;<=>?"
1670 "@ABCDEFGHIJKLMNO"
1671 "PQRSTUVWXYZ[\\]^_"
1672 "`abcdefghijklmno"
1673 "pqrstuvwxyz{|}~\n";
1674
1675 int main(int argc, char **argv)
1676 {
1677 enum punycode_status status;
1678 int r;
1679 unsigned int input_length, output_length, j;
1680 unsigned char case_flags[unicode_max_length];
1681
1682 if (argc != 2) usage(argv);
1683 if (argv[1][0] != '-') usage(argv);
1684 if (argv[1][2] != 0) usage(argv);
1685
1686 if (argv[1][1] == 'e') {
1687 punycode_uint input[unicode_max_length];
1688 unsigned long codept;
1689 char output[ace_max_length+1], uplus[3];
1690 int c;
1691
1692 /* Read the input code points: */
1693
1694 input_length = 0;
1695
1696 for (;;) {
1697 r = scanf("%2s%lx", uplus, &codept);
1698 if (ferror(stdin)) fail(io_error);
1699
1700
1701
1702 Costello Standards Track [Page 31]
1703 RFC 3492 IDNA Punycode March 2003
1704
1705
1706 if (r == EOF || r == 0) break;
1707
1708 if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {
1709 fail(invalid_input);
1710 }
1711
1712 if (input_length == unicode_max_length) fail(too_big);
1713
1714 if (uplus[0] == 'u') case_flags[input_length] = 0;
1715 else if (uplus[0] == 'U') case_flags[input_length] = 1;
1716 else fail(invalid_input);
1717
1718 input[input_length++] = codept;
1719 }
1720
1721 /* Encode: */
1722
1723 output_length = ace_max_length;
1724 status = punycode_encode(input_length, input, case_flags,
1725 &output_length, output);
1726 if (status == punycode_bad_input) fail(invalid_input);
1727 if (status == punycode_big_output) fail(too_big);
1728 if (status == punycode_overflow) fail(overflow);
1729 assert(status == punycode_success);
1730
1731 /* Convert to native charset and output: */
1732
1733 for (j = 0; j < output_length; ++j) {
1734 c = output[j];
1735 assert(c >= 0 && c <= 127);
1736 if (print_ascii[c] == 0) fail(invalid_input);
1737 output[j] = print_ascii[c];
1738 }
1739
1740 output[j] = 0;
1741 r = puts(output);
1742 if (r == EOF) fail(io_error);
1743 return EXIT_SUCCESS;
1744 }
1745
1746 if (argv[1][1] == 'd') {
1747 char input[ace_max_length+2], *p, *pp;
1748 punycode_uint output[unicode_max_length];
1749
1750 /* Read the Punycode input string and convert to ASCII: */
1751
1752 fgets(input, ace_max_length+2, stdin);
1753 if (ferror(stdin)) fail(io_error);
1754
1755
1756
1757 Costello Standards Track [Page 32]
1758 RFC 3492 IDNA Punycode March 2003
1759
1760
1761 if (feof(stdin)) fail(invalid_input);
1762 input_length = strlen(input) - 1;
1763 if (input[input_length] != '\n') fail(too_big);
1764 input[input_length] = 0;
1765
1766 for (p = input; *p != 0; ++p) {
1767 pp = strchr(print_ascii, *p);
1768 if (pp == 0) fail(invalid_input);
1769 *p = pp - print_ascii;
1770 }
1771
1772 /* Decode: */
1773
1774 output_length = unicode_max_length;
1775 status = punycode_decode(input_length, input, &output_length,
1776 output, case_flags);
1777 if (status == punycode_bad_input) fail(invalid_input);
1778 if (status == punycode_big_output) fail(too_big);
1779 if (status == punycode_overflow) fail(overflow);
1780 assert(status == punycode_success);
1781
1782 /* Output the result: */
1783
1784 for (j = 0; j < output_length; ++j) {
1785 r = printf("%s+%04lX\n",
1786 case_flags[j] ? "U" : "u",
1787 (unsigned long) output[j] );
1788 if (r < 0) fail(io_error);
1789 }
1790
1791 return EXIT_SUCCESS;
1792 }
1793
1794 usage(argv);
1795 return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
1796 }
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812 Costello Standards Track [Page 33]
1813 RFC 3492 IDNA Punycode March 2003
1814
1815
1816 Author's Address
1817
1818 Adam M. Costello
1819 University of California, Berkeley
1820 http://www.nicemice.net/amc/
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867 Costello Standards Track [Page 34]
1868 RFC 3492 IDNA Punycode March 2003
1869
1870
1871 Full Copyright Statement
1872
1873 Copyright (C) The Internet Society (2003). All Rights Reserved.
1874
1875 This document and translations of it may be copied and furnished to
1876 others, and derivative works that comment on or otherwise explain it
1877 or assist in its implementation may be prepared, copied, published
1878 and distributed, in whole or in part, without restriction of any
1879 kind, provided that the above copyright notice and this paragraph are
1880 included on all such copies and derivative works. However, this
1881 document itself may not be modified in any way, such as by removing
1882 the copyright notice or references to the Internet Society or other
1883 Internet organizations, except as needed for the purpose of
1884 developing Internet standards in which case the procedures for
1885 copyrights defined in the Internet Standards process must be
1886 followed, or as required to translate it into languages other than
1887 English.
1888
1889 The limited permissions granted above are perpetual and will not be
1890 revoked by the Internet Society or its successors or assigns.
1891
1892 This document and the information contained herein is provided on an
1893 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
1894 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
1895 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
1896 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
1897 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1898
1899 Acknowledgement
1900
1901 Funding for the RFC Editor function is currently provided by the
1902 Internet Society.
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922 Costello Standards Track [Page 35]
1923
(I) Russian (Cyrillic): U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A u+0438 Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
(I) Russian (Cyrillic): U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A u+0438 Punycode: b1abfaaepdrnnbgefbaDdotcwatmq2g4l