1 Internet Engineering Task Force (IETF) H. Alvestrand, Ed.
2 Request for Comments: 5893 Google
3 Category: Standards Track C. Karp
4 ISSN: 2070-1721 Swedish Museum of Natural History
5 August 2010
6
7
8 Right-to-Left Scripts for
9 Internationalized Domain Names for Applications (IDNA)
10
11 Abstract
12
13 The use of right-to-left scripts in Internationalized Domain Names
14 (IDNs) has presented several challenges. This memo provides a new
15 Bidi rule for Internationalized Domain Names for Applications (IDNA)
16 labels, based on the encountered problems with some scripts and some
17 shortcomings in the 2003 IDNA Bidi criterion.
18
19 Status of This Memo
20
21 This is an Internet Standards Track document.
22
23 This document is a product of the Internet Engineering Task Force
24 (IETF). It represents the consensus of the IETF community. It has
25 received public review and has been approved for publication by the
26 Internet Engineering Steering Group (IESG). Further information on
27 Internet Standards is available in Section 2 of RFC 5741.
28
29 Information about the current status of this document, any errata,
30 and how to provide feedback on it may be obtained at
31 http://www.rfc-editor.org/info/rfc5893.
32
33 Copyright Notice
34
35 Copyright (c) 2010 IETF Trust and the persons identified as the
36 document authors. All rights reserved.
37
38 This document is subject to BCP 78 and the IETF Trust's Legal
39 Provisions Relating to IETF Documents
40 (http://trustee.ietf.org/license-info) in effect on the date of
41 publication of this document. Please review these documents
42 carefully, as they describe your rights and restrictions with respect
43 to this document. Code Components extracted from this document must
44 include Simplified BSD License text as described in Section 4.e of
45 the Trust Legal Provisions and are provided without warranty as
46 described in the Simplified BSD License.
47
48
49
50
51
52 Alvestrand & Karp Standards Track [Page 1]
53 RFC 5893 IDNA Right to Left August 2010
54
55
56 Table of Contents
57
58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2
59 1.1. Purpose and Applicability . . . . . . . . . . . . . . . . 2
60 1.2. Background and History . . . . . . . . . . . . . . . . . . 3
61 1.3. Structure of the Rest of This Document . . . . . . . . . . 3
62 1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4
63 2. The Bidi Rule . . . . . . . . . . . . . . . . . . . . . . . . 6
64 3. The Requirement Set for the Bidi Rule . . . . . . . . . . . . 6
65 4. Examples of Issues Found with RFC 3454 . . . . . . . . . . . . 9
66 4.1. Dhivehi . . . . . . . . . . . . . . . . . . . . . . . . . 9
67 4.2. Yiddish . . . . . . . . . . . . . . . . . . . . . . . . . 10
68 4.3. Strings with Numbers . . . . . . . . . . . . . . . . . . . 12
69 5. Troublesome Situations and Guidelines . . . . . . . . . . . . 12
70 6. Other Issues in Need of Resolution . . . . . . . . . . . . . . 13
71 7. Compatibility Considerations . . . . . . . . . . . . . . . . . 14
72 7.1. Backwards Compatibility Considerations . . . . . . . . . . 14
73 7.2. Forward Compatibility Considerations . . . . . . . . . . . 15
74 8. Security Considerations . . . . . . . . . . . . . . . . . . . 15
75 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 16
76 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
77 10.1. Normative References . . . . . . . . . . . . . . . . . . . 16
78 10.2. Informative References . . . . . . . . . . . . . . . . . . 17
79
80 1. Introduction
81
82 1.1. Purpose and Applicability
83
84 The purpose of this document is to establish a rule that can be
85 applied to Internationalized Domain Name (IDN) labels in Unicode form
86 (U-labels) containing characters from scripts that are written from
87 right to left. It is part of the revised IDNA protocol [RFC5891].
88
89 When labels satisfy the rule, and when certain other conditions are
90 satisfied, there is only a minimal chance of these labels being
91 displayed in a confusing way by the Unicode bidirectional display
92 algorithm.
93
94 The other normative documents in the IDNA2008 document set establish
95 criteria for valid labels, including listing the permitted
96 characters. This document establishes additional validity criteria
97 for labels in scripts normally written from right to left.
98
99 This specification is not intended to place any requirements on
100 domain names that do not contain characters from such scripts.
101
111 1.2. Background and History
112
113 The "Stringprep" specification [RFC3454], part of IDNA2003, made the
114 following statement in its Section 6 on the Bidi algorithm:
115
116 3) If a string contains any RandALCat character, a RandALCat
117 character MUST be the first character of the string, and a
118 RandALCat character MUST be the last character of the string.
119
120 (A RandALCat character is a character with unambiguously
121 right-to-left directionality.)
122
123 The reasoning behind this prohibition was to ensure that every
124 component of a displayed domain name has an unambiguously preferred
125 direction. However, this made certain words in languages written
126 with right-to-left scripts invalid as IDN labels, and in at least one
127 case (Dhivehi) meant that all the words of an entire language were
128 forbidden as IDN labels.
129
130 This is illustrated below with examples taken from the Dhivehi and
131 Yiddish languages, as written with the Thaana and Hebrew scripts,
132 respectively.
133
134 RFC 3454 did not explicitly state the requirement to be fulfilled.
135 Therefore, it is impossible to determine whether a simple relaxation
136 of the rule would continue to fulfill the requirement.
137
138 While this document specifies rules quite different from RFC 3454,
139 most reasonable labels that were allowed under RFC 3454 are also
140 allowed under this specification, so the operational impact of using
141 the new rule in the updated IDNA specification is limited. The most
142 important newly disallowed labels are those that mix Arabic and
143 European digits (AN and EN) inside an RTL label and those that use AN
144 in an LTR label (see Section 1.4 for terminology).
145
146 1.3. Structure of the Rest of This Document
147
148 Section 2 defines a rule, the "Bidi rule", which can be used on a
149 domain name label to check how safe it is to use in a domain name of
150 possibly mixed directionality. The primary initial use of this rule
151 is as part of the IDNA2008 protocol [RFC5891].
152
153 Section 3 sets out the requirements for defining the Bidi rule.
154
155 Section 4 gives detailed examples that serve as justification for the
156 new rule.
157
166 Section 5 to Section 8 describe various situations that can occur
167 when dealing with domain names with characters of different
168 directionality.
169
170 Only Section 1.4 and Section 2 are normative.
171
172 1.4. Terminology
173
174 The terminology used to describe IDNA concepts is defined in the
175 Definitions document [RFC5890].
176
177 The terminology used for the Bidi properties of Unicode characters is
178 taken from the Unicode Standard [Unicode52].
179
180 The Unicode Standard specifies a Bidi property for each character.
181 That property controls the character's behavior in the Unicode
182 bidirectional algorithm [Unicode-UAX9]. For reference, here are the
183 values that the Unicode Bidi property can have:
184
185 o L - Left to right - most letters in LTR scripts
186
187 o R - Right to left - most letters in non-Arabic RTL scripts
188
189 o AL - Arabic letters - most letters in the Arabic script
190
191 o EN - European Number (0-9, and Extended Arabic-Indic numbers)
192
193 o ES - European Number Separator (+ and -)
194
195 o ET - European Number Terminator (currency symbols, the hash sign,
196 the percent sign and so on)
197
198 o AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
199 not the Extended Arabic-Indic numbers
200
201 o CS - Common Number Separator (. , / : et al)
202
203 o NSM - Nonspacing Mark - most combining accents
204
205 o BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)
206
207 o B - Paragraph Separator
208
209 o S - Segment Separator
210
211 o WS - Whitespace, including the SPACE character
212
213 o ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT
214
221 o LRE, LRO, RLE, RLO, PDF - these are "directional control
222 characters" and are not used in IDNA labels.
223
224 In this memo, we use "network order" to describe the sequence of
225 characters as transmitted on the wire or stored in a file; the terms
226 "first", "next", "previous", "beginning", "end", "before", and
227 "after" are used to refer to the relationship of characters and
228 labels in network order.
229
230 We use "display order" to talk about the sequence of characters as
231 imaged on a display medium; the terms "left" and "right" are used to
232 refer to the relationship of characters and labels in display order.
233
234 Most of the time, the examples use the abbreviations for the Unicode
235 Bidi classes to denote the directionality of the characters; the
236 example string CS L consists of one character of class CS and one
237 character of class L. In some examples, the convention that
238 uppercase characters are of class R or AL, and lowercase characters
239 are of class L is used -- thus, the example string ABC.abc would
240 consist of three right-to-left characters and three left-to-right
241 characters.
242
243 The directionality of such examples is determined by context -- for
244 instance, in the sentence "ABC.abc is displayed as CBA.abc", the
245 first example string is in network order, the second example string
246 is in display order.
247
248 The term "paragraph" is used in the sense of the Unicode Bidi
249 specification [Unicode-UAX9]. It means "a block of text that has an
250 overall direction, either left to right or right to left",
251 approximately; see the "Unicode Bidirectional Algorithm"
252 [Unicode-UAX9] for details.
253
254 "RTL" and "LTR" are abbreviations for "right to left" and "left to
255 right", respectively.
256
257 An RTL label is a label that contains at least one character of type
258 R, AL, or AN.
259
260 An LTR label is any label that is not an RTL label.
261
262 A "Bidi domain name" is a domain name that contains at least one RTL
263 label. (Note: This definition includes domain names containing only
264 dots and right-to-left characters. Providing a separate category of
265 "RTL domain names" would not make this specification simpler, so it
266 has not been done.)
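
The definitions of "RTL label" and "Bidi domain name" above can be sketched as follows. This is only an illustration: the function names are hypothetical, labels are assumed to be U-labels separated by FULL STOP, and the Bidi classes come from Python's unicodedata module, which reflects the Unicode version of the running interpreter rather than [Unicode52].

```python
import unicodedata

# Bidi classes that make a label an RTL label (Section 1.4).
RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label: str) -> bool:
    """True if the label contains at least one character of class R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def is_bidi_domain_name(name: str) -> bool:
    """True if at least one dot-separated label of the name is an RTL label.

    Assumption: labels are separated by U+002E FULL STOP only.
    """
    return any(is_rtl_label(label) for label in name.split("."))
```

Every label for which is_rtl_label() returns False is an LTR label in the sense used here.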
267
276 2. The Bidi Rule
277
278 The following rule, consisting of six conditions, applies to labels
279 in Bidi domain names. The requirements that this rule satisfies are
280 described in Section 3. All of the conditions must be satisfied for
281 the rule to be satisfied.
282
283 1. The first character must be a character with Bidi property L, R,
284 or AL. If it has the R or AL property, it is an RTL label; if it
285 has the L property, it is an LTR label.
286
287 2. In an RTL label, only characters with the Bidi properties R, AL,
288 AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.
289
290 3. In an RTL label, the end of the label must be a character with
291 Bidi property R, AL, EN, or AN, followed by zero or more
292 characters with Bidi property NSM.
293
294 4. In an RTL label, if an EN is present, no AN may be present, and
295 vice versa.
296
297 5. In an LTR label, only characters with the Bidi properties L, EN,
298 ES, CS, ET, ON, BN, or NSM are allowed.
299
300 6. In an LTR label, the end of the label must be a character with
301 Bidi property L or EN, followed by zero or more characters with
302 Bidi property NSM.
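
The six conditions can be checked mechanically. The following is a minimal sketch, not a reference implementation: the name satisfies_bidi_rule is hypothetical, Bidi classes are taken from Python's unicodedata module (whose Unicode version may differ from [Unicode52]), and rejecting the empty label is an assumption, since the rule does not address it.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label: str) -> bool:
    """Check one label against conditions 1-6 of the Bidi rule."""
    if not label:
        return False  # assumption: an empty label is treated as failing

    classes = [unicodedata.bidirectional(ch) for ch in label]

    # Condition 1: the first character must be L, R, or AL.
    first = classes[0]
    if first not in ("L", "R", "AL"):
        return False
    rtl = first in ("R", "AL")

    # Find the last character that is not a trailing NSM.
    # The first character is never NSM here, so the loop stops at index >= 1.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]

    if rtl:
        # Condition 2: only these classes may appear in an RTL label.
        if not set(classes) <= RTL_ALLOWED:
            return False
        # Condition 3: must end in R, AL, EN, or AN (plus optional NSMs).
        if last not in ("R", "AL", "EN", "AN"):
            return False
        # Condition 4: EN and AN must not both be present.
        return not ("EN" in classes and "AN" in classes)

    # Condition 5: only these classes may appear in an LTR label.
    if not set(classes) <= LTR_ALLOWED:
        return False
    # Condition 6: must end in L or EN (plus optional NSMs).
    return last in ("L", "EN")
```

For example, satisfies_bidi_rule("a1") holds, while satisfies_bidi_rule("1a") fails condition 1 because the first character has Bidi class EN.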
303
304 The following guarantees can be made based on the above:
305
306 o In a domain name consisting of only labels that satisfy the rule,
307 the requirements of Section 3 are satisfied. Note that even LTR
308 labels and pure ASCII labels have to be tested.
309
310 o In a domain name consisting of only LDH labels (as defined in the
311 Definitions document [RFC5890]) and labels that satisfy the rule,
312 the requirements of Section 3 are satisfied as long as a label
313 that starts with an ASCII digit does not come after a
314 right-to-left label.
315
316 No guarantee is given for other combinations.
317
318 3. The Requirement Set for the Bidi Rule
319
320 This document, unlike RFC 3454 [RFC3454], provides an explicit
321 justification for the Bidi rule, and states a set of requirements
322 against which it is possible to test whether or not the rule
323 fulfills them.
324
331 All the text in this document assumes that text containing the labels
332 under consideration will be displayed using the Unicode bidirectional
333 algorithm [Unicode-UAX9].
334
335 The requirements proposed are these:
336
337 o Label Uniqueness: No two labels, when presented in display order
338 in the same paragraph, should have the same sequence of characters
339 without also having the same sequence of characters in network
340 order, both when the paragraph has LTR direction and when the
341 paragraph has RTL direction. (This is the criterion that is
342 explicit in RFC 3454). (Note that a label displayed in an RTL
343 paragraph may display the same as a different label displayed in
344 an LTR paragraph and still satisfy this criterion.)
345
346 o Character Grouping: When displaying a string of labels, using the
347 Unicode Bidi algorithm to reorder the characters for display, the
348 characters of each label should remain grouped between the
349 characters delimiting the labels, both when the string is embedded
350 in a paragraph with LTR direction and when it is embedded in a
351 paragraph with RTL direction.
352
353 Several stronger statements were considered and rejected, because
354 they seem to be impossible to fulfill within the constraints of the
355 Unicode bidirectional algorithm. These include:
356
357 o The appearance of a label should be unaffected by its embedding
358 context. This proved impossible even for ASCII labels; the label
359 "123-A" will have a different display order in an RTL context than
360 in an LTR context. (This particular example is, however,
361 disallowed anyway.)
362
363 o The sequence of labels should be consistent with network order.
364 This proved impossible -- a domain name consisting of the labels
365 (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
366 an LTR context. (In an RTL context, it will be displayed as
367 L4.R3.R2.L1).
368
369 o No two domain names should be displayed the same, even under
370 differing directionality. This was shown to be unsound, since the
371 domain name (in network order) ABC.abc will have display order
372 CBA.abc in an LTR context and abc.CBA in an RTL context, while the
373 domain name (network) abc.ABC will have display order abc.CBA in
374 an LTR context and CBA.abc in an RTL context.
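
The display orders quoted in the last two items can be reproduced with a deliberately simplified model. The sketch below is not the Unicode bidirectional algorithm: it only handles the uppercase/lowercase convention of Section 1.4 (uppercase is treated as R, lowercase as L) and treats every other character as a neutral, with no support for digits, combining marks, or directional controls. The function name is hypothetical.

```python
def toy_display_order(text: str, paragraph_rtl: bool) -> str:
    """Reorder text for display under a toy model of Bidi reordering."""
    base_dir = "R" if paragraph_rtl else "L"

    def strong(ch):
        if ch.isupper():
            return "R"
        if ch.islower():
            return "L"
        return None  # neutral (for instance the dot between labels)

    dirs = [strong(ch) for ch in text]
    n = len(text)

    # Resolve neutrals: take the direction of the surrounding strong
    # characters if they agree, otherwise the paragraph direction.
    resolved = []
    for i, d in enumerate(dirs):
        if d is None:
            prev = next((dirs[j] for j in range(i - 1, -1, -1) if dirs[j]), base_dir)
            nxt = next((dirs[j] for j in range(i + 1, n) if dirs[j]), base_dir)
            d = prev if prev == nxt else base_dir
        resolved.append(d)

    # Embedding levels: LTR paragraph -> L=0, R=1; RTL paragraph -> R=1, L=2.
    levels = [1 if d == "R" else (2 if paragraph_rtl else 0) for d in resolved]

    # Reverse runs from the highest level down to level 1.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] >= level:
                j = i
                while j < n and levels[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                levels[i:j] = levels[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)


print(toy_display_order("ABC.abc", paragraph_rtl=False))  # CBA.abc
print(toy_display_order("abc.ABC", paragraph_rtl=True))   # CBA.abc
```

Under this model the two different domain names ABC.abc and abc.ABC yield the same display string when the paragraph directions differ, which is the collision described in the last item above.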
375
386 One possible requirement was thought to be problematic, but turned
387 out to be satisfied by a string that obeys the proposed rules:
388
389 o The Character Grouping requirement should be satisfied when
390 directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
391 same paragraph (outside of the labels). Because these controls
392 affect presentation order in non-obvious ways, by affecting the
393 "sor" and "eor" properties of the Unicode Bidi algorithm, the
394 conditions above require extra testing in order to figure out
395 whether or not they influence the display of the domain name.
396 Testing found that for the strings allowed under the rule
397 presented in this document, directional controls do not influence
398 the display of the domain name.
399
400 This is still not stated as a requirement, since it did not seem as
401 important as the stated requirements, but it is useful to know that
402 Bidi domain names where the labels satisfy the rule have this
403 property.
404
405 In the following descriptions, first-level bullets are used to
406 indicate rules or normative statements; second-level bullets are
407 commentary.
408
409 The Character Grouping requirement can be more formally stated as:
410
411 o Let "Delimiterchars" be a set of characters with the Unicode Bidi
412 properties CS, WS, ON. (These are commonly used to delimit labels
413 -- both the FULL STOP and the space are included. They are not
414 allowed in domain labels.)
415
416 * ET, though it commonly occurs next to domain names in practice,
417 is problematic: the context R CS L EN ET (for instance A.a1%)
418 makes the label L EN not satisfy the character grouping
419 requirement.
420
421 * ES commonly occurs in labels as HYPHEN-MINUS, but could also be
422 used as a delimiter (for instance, the plus sign). It is left
423 out here.
424
425 o Let "unproblematic label" be a label that either satisfies the
426 requirements or does not contain any character with the Bidi
427 properties R, AL, or AN and does not begin with a character with
428 the Bidi property EN. (Informally, "it does not start with a
429 number".)
430
441 A label X satisfies the Character Grouping requirement when the
442 following holds true for any delimiter characters D1 and D2 and for any
443 S1 and S2 that are each an unproblematic label or an empty string:
444
445 If the string formed by concatenating S1, D1, X, D2, and S2 is
446 reordered according to the Bidi algorithm, then all the characters of
447 X in the reordered string are between D1 and D2, and no other
448 characters are between D1 and D2, both if the overall paragraph
449 direction is LTR and if the overall paragraph direction is RTL.
450
451 Note that the definition is self-referential, since S1 and S2 are
452 constrained to be "legal" by this definition. This makes testing
453 changes to proposed rules a little complex, but does not create
454 problems for testing whether or not a given proposed rule satisfies
455 the criterion.
456
457 The "zero-length" case represents the case where a domain name is
458 next to something that isn't a domain name, separated by a delimiter
459 character.
460
461 Note about the position of BN: The Unicode bidirectional algorithm
462 specifies that BN characters have an effect on the adjoining
463 characters in network order, not in display order, and are therefore
464 treated as if removed during Bidi processing ([Unicode-UAX9],
465 Section 3.3.2, rule X9 and Section 5.3). Therefore, the question of
466 "what position does a BN have after reordering" is not meaningful.
467 It has been ignored while developing the rules here.
468
469 The Label Uniqueness requirement can be formally stated as:
470
471 If two non-identical labels X and Y, embedded as for the test above,
472 displayed in paragraphs with the same directionality, are reordered
473 by the Bidi algorithm into the same sequence of code points, the
474 labels X and Y cannot both be legal.
475
476 4. Examples of Issues Found with RFC 3454
477
478 4.1. Dhivehi
479
480 Dhivehi, the official language of the Maldives, is written with the
481 Thaana script. This script displays some of the characteristics of
482 the Arabic script, including its directional properties, and the
483 indication of vowels by the diacritical marking of consonantal base
484 characters. This marking is obligatory, and both two consecutive
485 vowels and syllable-final consonants are indicated with unvoiced
486 combining marks. Every Dhivehi word therefore ends with a combining
487 mark.
488
496 The word for "computer", which is romanized as "konpeetaru", is
497 written with the following sequence of Unicode code points:
498
499 U+0786 THAANA LETTER KAAFU (AL)
500
501 U+07AE THAANA OBOFILI (NSM)
502
503 U+0782 THAANA LETTER NOONU (AL)
504
505 U+07B0 THAANA SUKUN (NSM)
506
507 U+0795 THAANA LETTER PAVIYANI (AL)
508
509 U+07A9 THAANA EEBEEFILI (NSM)
510
511 U+0793 THAANA LETTER TAVIYANI (AL)
512
513 U+07A6 THAANA ABAFILI (NSM)
514
515 U+0783 THAANA LETTER RAA (AL)
516
517 U+07AA THAANA UBUFILI (NSM)
518
519 The directionality class of U+07AA in the Unicode database
520 [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a
521 conformant implementation of the IDNA2003 algorithm will say that
522 "this is not in RandALCat" and refuse to encode the string.
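
The rejection can be reproduced with a sketch of the quoted RFC 3454 requirement. The helper below is hypothetical and covers only the first-and-last-character requirement quoted in Section 1.2. It also assumes that the Bidi classes reported by Python's unicodedata module match the RandALCat tables of RFC 3454 for these characters.

```python
import unicodedata

# "konpeetaru", the code point sequence listed above.
KONPEETARU = "\u0786\u07ae\u0782\u07b0\u0795\u07a9\u0793\u07a6\u0783\u07aa"


def is_randalcat(ch: str) -> bool:
    """Approximation of RandALCat: Bidi class R or AL."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def passes_rfc3454_bidi_check(s: str) -> bool:
    """Sketch of the quoted requirement: if any RandALCat character is
    present, the first and the last character must both be RandALCat."""
    if not any(is_randalcat(ch) for ch in s):
        return True
    return is_randalcat(s[0]) and is_randalcat(s[-1])


print([unicodedata.bidirectional(ch) for ch in KONPEETARU])
print(passes_rfc3454_bidi_check(KONPEETARU))  # False: the last character is NSM
```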
523
524 4.2. Yiddish
525
526 Yiddish is one of several languages written with the Hebrew script
527 (others include Hebrew and Ladino). This is basically a consonantal
528 alphabet (also termed an "abjad"), but Yiddish is written using an
529 extended form that is fully vocalic. The vowels are indicated in
530 several ways, one of which is by repurposing letters that are
531 consonants in Hebrew. Other letters are used both as vowels and
532 consonants, with combining marks, called "points", used to
533 differentiate between them. Finally, some base characters can
534 indicate several different vowels, which are also disambiguated by
535 combining marks. Pointed characters can appear in word-final
536 position and may therefore also be needed at the end of labels. This
537 is not an invariable attribute of a Yiddish string and there is thus
538 greater latitude here than there is with Dhivehi.
539
540 The organization now known as the "YIVO Institute for Jewish
541 Research" developed orthographic rules for modern Standard Yiddish
542 during the 1930s on the basis of work conducted in several venues
543 since earlier in that century. These are given in, "The Standardized
551 Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
552 as normatively descriptive of modern Standard Yiddish in any context
553 where that notion is deemed relevant. They have been applied
554 exclusively in all formal Yiddish dictionaries published since their
555 establishment, and are similarly dominant in academic and
556 bibliographic regards.
557
558 It therefore appears appropriate for this repertoire also to be
559 supported fully by IDNA. This presents no difficulty with characters
560 in initial and medial positions, but pointed characters are regularly
561 used in final position as well. All of the characters in the SYO
562 repertoire appear in both marked and unmarked form with one
563 exception: the HEBREW LETTER PE (U+05E4). The SYO only permits this
564 with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
565 to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
566 to the Latin letter "f". There is, however, a separate unpointed
567 allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
568 character when it appears in final position. The constraint on the
569 use of the SYO repertoire resulting from the proscription of
570 combining marks at the end of RTL strings thus reduces to nothing
571 more, or less, than the equivalent of saying that a string of Latin
572 characters cannot end with the letter "p". It must also be noted
573 that the HEBREW LETTER PE with the HEBREW POINT DAGESH is
574 characteristic of almost all traditional Yiddish orthographies that
575 predate (or remain in use in parallel to) the SYO, being the first
576 pointed character to appear in any of them.
577
578 A more general instantiation of the basic problem can be seen in the
579 representation of the YIVO acronym. This acronym is written with the
580 Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
581 QAMATS are combining points. The distinct Unicode code points are:
582
583 U+05D9 HEBREW LETTER YOD (R)
584
585 U+05B4 HEBREW POINT HIRIQ (NSM)
586
587 U+05D5 HEBREW LETTER VAV (R)
588
589 U+05D0 HEBREW LETTER ALEF (R)
590
591 U+05B8 HEBREW POINT QAMATS (NSM)
592
593 The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
594 database is NSM, which again causes the IDNA2003 algorithm to reject
595 the string.
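
A listing of this kind can be generated for any label. The sketch below is illustrative only: the helper name describe is hypothetical, the string is built from the letter sequence given above (YOD YOD HIRIQ VAV VAV ALEF QAMATS), and names and Bidi classes come from Python's unicodedata module.

```python
import unicodedata

# YOD YOD HIRIQ VAV VAV ALEF QAMATS, in network order.
YIVO = "\u05d9\u05d9\u05b4\u05d5\u05d5\u05d0\u05b8"


def describe(label: str):
    """Return (code point, character name, Bidi class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.bidirectional(ch))
        for ch in label
    ]


for code_point, name, bidi_class in describe(YIVO):
    print(code_point, name, f"({bidi_class})")
```

The final row reports class NSM, which is the property that the first-and-last-character test of RFC 3454 trips over.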
596
606 It may also be noted that all of the combined characters mentioned
607 above exist in precomposed form at separate positions in the Unicode
608 chart. However, by invoking Stringprep, the IDNA2003 algorithm also
609 rejects those code points, for reasons not discussed here.
610
611 4.3. Strings with Numbers
612
613 By requiring that both the first and the last character of a string
614 be a member of category R or AL, the Stringprep specification
615 [RFC3454] prohibited a string containing right-to-left characters
616 from ending with a number.
617
618 Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
619 ALEF. Displayed in an LTR context, the first one will be displayed
620 from left to right as 5 ALEF (with the 5 being considered right to
621 left because of the leading ALEF), while 5 ALEF will be displayed in
622 exactly the same order (5 taking the direction from context).
623 Clearly, only one of those should be permitted as a registered label,
624 but barring them both seems unnecessary.
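
Under the rule in Section 2, the two strings are told apart by their first character alone: ALEF 5 starts with a character of class R, while 5 ALEF starts with a character of class EN and so fails condition 1. A small sketch, with a hypothetical helper name and Bidi classes taken from Python's unicodedata module:

```python
import unicodedata

ALEF_5 = "\u05d05"   # HEBREW LETTER ALEF + DIGIT FIVE
FIVE_ALEF = "5\u05d0"  # DIGIT FIVE + HEBREW LETTER ALEF


def first_character_allowed(label: str) -> bool:
    """Condition 1 of the Bidi rule: the first character is L, R, or AL."""
    return unicodedata.bidirectional(label[0]) in ("L", "R", "AL")


print(first_character_allowed(ALEF_5))     # True
print(first_character_allowed(FIVE_ALEF))  # False
```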
625
626 5. Troublesome Situations and Guidelines
627
628 There are situations in which labels that satisfy the rule above will
629 be displayed in a surprising fashion. The most important of these is
630 the case where a label ending in a character with Bidi property AL,
631 AN, or R occurs before a label beginning with a character of Bidi
632 property EN. In that case, the number will appear to move into the
633 label containing the right-to-left character, violating the Character
634 Grouping requirement.
635
636 If the label that occurs after the right-to-left label itself
637 satisfies the Bidi criterion, the requirements will be satisfied in
638 all cases (this is the reason why the criterion talks about strings
639 containing L in some cases). However, the IDNABIS WG concluded that
640 this could not be required for several reasons:
641
642 o There is a large current deployment of ASCII domain names starting
643 with digits. These cannot possibly be invalidated.
644
645 o Domain names are often constructed piecemeal, for instance, by
646 combining a string with the content of a search list. This may
647 occur after IDNA processing, and thus in part of the code that is
648 not IDNA-aware, making detection of the undesirable combination
649 impossible.
650
661 o Even if a label is registered under a "safe" label, there may be a
662 DNAME [RFC2672] with an "unsafe" label that points to the "safe"
663 label, thus creating seemingly valid names that would not satisfy
664 the criterion.
665
666 o Wildcards create the odd situation where a label is "valid" (can
667 be looked up successfully) without the zone owner knowing that
668 this label exists. So an owner of a zone whose name starts with a
669 digit and contains a wildcard has no way of controlling whether or
670 not names with RTL labels in them are looked up in that zone.
671
672 Rather than trying to suggest rules that disallow all such
673 undesirable situations, this document merely warns about the
674 possibility, and leaves it to application developers to take whatever
675 measures they deem appropriate to avoid problematic situations.
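
One measure an application could take is to flag the label boundary described at the start of this section. The following is a sketch only. The function names are hypothetical, labels are assumed to be separated by FULL STOP, and skipping trailing NSM characters when looking at the end of a label is an assumption made here, not something this document specifies.

```python
import unicodedata


def last_base_class(label: str) -> str:
    """Bidi class of the last character that is not a trailing NSM."""
    for ch in reversed(label):
        cls = unicodedata.bidirectional(ch)
        if cls != "NSM":
            return cls
    return ""


def risky_boundaries(name: str):
    """Return (before, after) label pairs where a label ending in R, AL,
    or AN is immediately followed by a label starting with EN."""
    labels = name.split(".")
    found = []
    for before, after in zip(labels, labels[1:]):
        if not before or not after:
            continue  # skip empty labels, e.g. from a trailing dot
        if (last_base_class(before) in ("R", "AL", "AN")
                and unicodedata.bidirectional(after[0]) == "EN"):
            found.append((before, after))
    return found
```

An empty result does not mean a name is safe in every context; it only means this particular boundary pattern is absent.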
676
677 6. Other Issues in Need of Resolution
678
679 This document concerns itself only with the rules that are needed
680 when dealing with domain names with characters that have differing
681 Bidi properties, and considers characters only in terms of their Bidi
682 properties. All other issues with scripts that are written from
683 right to left must be considered in other contexts.
684
685 One such issue is the need to keep numbers separate. Several scripts
686 are used with multiple sets of numbers -- most commonly they use
687 Latin numbers and a script-specific set of numbers, but in the case
688 of Arabic, there are two sets of "Arabic-Indic" digits involved.
689
690 The algorithm in this document disallows occurrences of AN-class
691 characters ("Arabic-Indic digits", U+0660 to U+0669) together with
692 EN-class characters (which includes "European" digits, U+0030 to
693 U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
694 does not help in preventing the mixing of, for instance, Bengali
695 digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
696 both of which have Bidi class L. A registry or script community that
697 wishes to create rules restricting the mixing of digits in a label
698 will be able to specify these restrictions at the registry level.
699 Some rules are also specified at the protocol level.
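
The limits of a check based purely on Bidi classes can be seen by inspecting the digits mentioned above. This sketch uses Python's unicodedata module; the helper name is hypothetical.

```python
import unicodedata

SAMPLES = {
    "European digit zero": "\u0030",
    "Arabic-Indic digit zero": "\u0660",
    "Extended Arabic-Indic digit zero": "\u06f0",
    "Bengali digit zero": "\u09e6",
    "Gujarati digit zero": "\u0ae6",
}

for description, ch in SAMPLES.items():
    print(f"U+{ord(ch):04X} {description}: {unicodedata.bidirectional(ch)}")


def mixes_an_and_en(label: str) -> bool:
    """True if the label contains both AN-class and EN-class characters."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    return {"AN", "EN"} <= classes
```

Because Bengali and Gujarati digits both report class L, mixes_an_and_en() cannot see a label that mixes them. A registry that wants to prevent such mixing needs a separate, script-aware check.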
700
701 Another set of issues concerns the proper display of IDNs with a
702 mixture of LTR and RTL labels, or only RTL labels.
703
704 It is unrealistic to expect that applications will display domain
705 names using embedded formatting codes between their labels (for one
706 thing, no reliable algorithms for identifying domain names in running
707 text exist); thus, the display order will be determined by the Bidi
708 algorithm. Thus, a sequence (in network order) of R1.R2.ltr will be
716 displayed in the order 2R.1R.ltr in an LTR context, which might
717 surprise someone expecting to see labels displayed in hierarchical
718 order. People used to working with text that mixes LTR and RTL
719 strings might not be so surprised by this. Again, this memo does not
720 attempt to suggest a solution to this problem.
721
722 7. Compatibility Considerations
723
724 7.1. Backwards Compatibility Considerations
725
726 As with any change to an existing standard, it is important to
727 consider what happens with existing implementations when the change
728 is introduced. Some troublesome cases include:
729
730 o An old program is used to input the newly allowed label. If the old
731 program checks the input against RFC 3454, some labels will not be
732 allowed, and domain names containing those labels will remain
733 inaccessible.
734
735 o An old program is asked to display the newly allowed label, and
736 checks it against RFC 3454 before displaying. The program will
737 perform some kind of fallback, most likely displaying the label in
738 A-label form.
739
740 o An old program tries to display the newly allowed label. If the
741 old program has code for displaying the last character of a label
742 that is different from the code used to display the characters in
743 the middle of the label, the display may be inconsistent and cause
744 confusion.
745
746 One particular example of the last case is if a program chooses to
747 examine the last character (in network order) of a string in order to
748 determine its directionality, rather than its first. If it finds an
749 NSM character and tries to display the string as if it were a
750 left-to-right string, the resulting display may be interesting, but
751 not useful.
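
A minimal sketch of this failure mode follows. It assumes a
hypothetical program that falls back to left-to-right whenever the
character it inspects is not a strong directional character. The
helper names are illustrative only, and the Bidi classes come from
Python's unicodedata module, whose Unicode version may differ from
the one this memo references. The sample string is Thaana-script text
that ends in a vowel sign.

```python
import unicodedata

STRONG_RTL = {"R", "AL"}


def direction_of(ch):
    """Return 'rtl' or 'ltr' for a strong character, otherwise None."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in STRONG_RTL:
        return "rtl"
    if bidi == "L":
        return "ltr"
    return None


def direction_from_first(label):
    # Inspect the first character in network order.
    return direction_of(label[0]) or "ltr"


def direction_from_last(label):
    # Hypothetical old behavior: inspect the last character instead,
    # and fall back to LTR when it is not a strong character.
    return direction_of(label[-1]) or "ltr"


# Thaana letters (class AL) with vowel signs (class NSM); ends in an NSM.
LABEL = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"

print(unicodedata.bidirectional(LABEL[-1]))
print(direction_from_first(LABEL), direction_from_last(LABEL))
```

The two functions disagree for this string, which is the kind of
inconsistency the paragraph above describes.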
752
753 The editors believe that these cases will have a less harmful impact
754 in practice than continuing to deny the use of words from the
755 languages for which these strings are necessary as IDN labels.
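
The first two cases in the list above arise because an RFC 3454 style
check rejects labels that the new rule accepts. The sketch below
contrasts the two under stated assumptions: it approximates the
RFC 3454 character categories with the Bidi classes reported by
Python's unicodedata module (a newer Unicode version than RFC 3454
used), and it implements only the end-of-label condition of the rule
in Section 2, not the full rule. Both function names are hypothetical.

```python
import unicodedata


def bidi_classes(label):
    return [unicodedata.bidirectional(c) for c in label]


def passes_rfc3454_bidi_approx(label):
    """Approximation of the RFC 3454 Bidi check.

    If the string contains any R or AL character, it must contain no
    L character, and its first and last characters must be R or AL.
    """
    classes = bidi_classes(label)
    if not any(c in ("R", "AL") for c in classes):
        return True
    if "L" in classes:
        return False
    return classes[0] in ("R", "AL") and classes[-1] in ("R", "AL")


def end_allowed_by_new_rule(label):
    """End-of-label condition only: R, AL, EN, or AN, then zero or more NSM."""
    classes = bidi_classes(label)
    i = len(classes)
    while i > 0 and classes[i - 1] == "NSM":
        i -= 1
    return i > 0 and classes[i - 1] in ("R", "AL", "EN", "AN")


thaana = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"  # ends in a vowel sign (NSM)
hebrew = "\u05d0\u05d1"                          # ends in a letter (R)

for label in (thaana, hebrew):
    print(passes_rfc3454_bidi_approx(label), end_allowed_by_new_rule(label))
```

A string ending in a combining mark fails the older check but
satisfies the newer end condition, while a string ending in a letter
satisfies both.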
756
757 This specification does not forbid using leading European digits in
758 ASCII-only labels, since this would conflict with a large installed
759 base of such labels, and would increase the scope of the
760 specification from RTL labels to all labels. The harm resulting from
761 this limitation of scope is described in Section 5. Registries and
762 private zone managers can check for this particular condition before
763 they allow registration of any RTL label. Generally, it is best to
771 disallow registration of any right-to-left strings in a zone where
772 the label at the level above begins with a digit.
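
A registry-side check along these lines might look like the following
sketch. It is illustrative only. It assumes that labels are available
as Unicode strings in network order, that "digit" means a character
of Bidi class EN, and that an RTL label is one containing at least
one character of class R, AL, or AN. The function names and the zone
names in the usage lines are hypothetical.

```python
import unicodedata


def is_rtl_label(label):
    # Assumption: a label is RTL if it contains any R, AL, or AN character.
    return any(unicodedata.bidirectional(c) in ("R", "AL", "AN")
               for c in label)


def parent_starts_with_digit(zone):
    # The "label at the level above" is the leftmost label of the zone name.
    parent = zone.split(".")[0]
    return bool(parent) and unicodedata.bidirectional(parent[0]) == "EN"


def may_register(new_label, zone):
    """Reject an RTL label when the zone's own label begins with a digit."""
    return not (is_rtl_label(new_label) and parent_starts_with_digit(zone))


rtl = "\u05d0\u05d1"  # two Hebrew letters
print(may_register(rtl, "4u.example"))    # RTL label under a digit-initial label
print(may_register(rtl, "shop.example"))  # RTL label under a letter-initial label
print(may_register("www", "4u.example"))  # LTR label is unaffected
```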
773
774 7.2. Forward Compatibility Considerations
775
776 This text is intentionally specified strictly in terms of the Unicode
777 Bidi properties. The determination that the condition is sufficient
778 to fulfill the criteria depends on the Unicode Bidi algorithm; it is
779 unlikely that drastic changes will be made to this algorithm.
780
781 However, the determination of validity for any string depends on the
782 Unicode Bidi property values, which are not declared immutable by the
783 Unicode Consortium. Furthermore, the behavior of the algorithm for
784 any given character is likely to be linguistically and culturally
785 sensitive, so while it should occur rarely, it is possible that later
786 versions of the Unicode Standard may change the Bidi properties
787 assigned to certain Unicode characters.
788
789 This memo does not propose a solution for this problem.
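
Although no solution is proposed, an implementation can at least make
the dependency visible by recording which Unicode data version its
Bidi classes came from. The sketch below does this with Python's
unicodedata module. The function name and the shape of the returned
record are assumptions made for illustration.

```python
import unicodedata


def bidi_snapshot(label):
    """Record the Bidi class of each character and the data version used."""
    return {
        "unicode_version": unicodedata.unidata_version,
        "classes": [unicodedata.bidirectional(c) for c in label],
    }


snap = bidi_snapshot("a1")
print(snap["unicode_version"], snap["classes"])
```

Comparing such records across library upgrades would show whether a
property change has affected a stored label.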
790
791 8. Security Considerations
792
793 The display behavior of mixed-direction text can be extremely
794 surprising to users who are not used to it; for instance, cut and
795 paste of a piece of text can cause the text to display differently at
796 the destination, if the destination is in another directionality
797 context, and adding a character in one place of a text can cause
798 characters some distance from the point of insertion to change their
799 display position. This is, however, not a phenomenon unique to the
800 display of domain names.
801
802 The new IDNA protocol, and particularly these new Bidi rules, will
803 allow some strings to be used in IDNA contexts that are not allowed
804 today. It is possible that differences in the interpretation of
805 labels between implementations of IDNA2003 and IDNA2008 could pose a
806 security risk, but it is difficult to envision any specific
807 instantiation of this.
808
809 Any rational attempt to compute, for instance, a hash over an
810 identifier processed by IDNA would use network order for its
811 computation, and thus be unaffected by the new rules proposed here.
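
As a small illustration, the sketch below hashes an identifier held in
network order. It assumes the identifier is a Python string in
network order and that UTF-8 is an acceptable byte encoding for the
hash input. Both are assumptions of this sketch, not requirements of
this memo, and the function name is hypothetical.

```python
import hashlib


def identifier_hash(name_in_network_order):
    # The input is the stored character sequence, never a rendered form.
    data = name_in_network_order.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


NAME = "\u05d0\u05d1.example"
digest = identifier_hash(NAME)
print(digest)
```

Because the hash input is the stored sequence, the way a renderer
reorders the characters for display has no influence on the result.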
812
813 While it is not believed to pose a problem, if display routines had
814 been written with specific knowledge of the RFC 3454 IDNA
815 prohibitions, it is possible that the potential problems noted under
816 "Backwards Compatibility Considerations" (Section 7.1) could cause new
817 kinds of confusion.
818
826 9. Acknowledgements
827
828 While the listed editors held the pen, this document represents the
829 joint work and conclusions of an ad hoc design team. In addition to
830 the editors, this consisted of, in alphabetical order, Tina Dam, Patrik
831 Faltstrom, and John Klensin. Many further specific contributions and
832 helpful comments were received from the people listed below, and
833 others who have contributed to the development and use of the IDNA
834 protocols.
835
836 The particular formulation of the Bidi rule in Section 2 was
837 suggested by Matitiahu Allouche.
838
839 The team wishes, in particular, to thank Roozbeh Pournader for
840 calling its attention to the issue with the Thaana script, Paul
841 Hoffman for pointing out the need to be explicit about backwards
842 compatibility considerations, Ken Whistler for suggesting the basis
843 of the formalized "Character Grouping" requirement, Mark Davis for
844 commentary, Erik van der Poel for careful review, comments, and
845 verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete
846 Resnick for reviews, and Vint Cerf for chairing the working group and
847 contributing massively to getting the documents finished.
848
849 10. References
850
851 10.1. Normative References
852
853 [RFC5890] Klensin, J., "Internationalized Domain Names for
854 Applications (IDNA): Definitions and Document
855 Framework", RFC 5890, August 2010.
856
857 [Unicode-UAX9] The Unicode Consortium, "Unicode Standard Annex #9:
858 Unicode Bidirectional Algorithm", September 2009,
859 <http://www.unicode.org/reports/tr9/>.
860
861 [Unicode52] The Unicode Consortium, "The Unicode Standard, Version
862 5.2.0", Mountain View, CA, ISBN 978-1-936213-00-9, 2009,
863 <http://www.unicode.org/versions/Unicode5.2.0/>.
866
881 10.2. Informative References
882
883 [RFC2672] Crawford, M., "Non-Terminal DNS Name Redirection",
884 RFC 2672, August 1999.
885
886 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
887 Internationalized Strings ("stringprep")", RFC 3454,
888 December 2002.
889
890 [RFC5891] Klensin, J., "Internationalized Domain Names in
891 Applications (IDNA): Protocol", RFC 5891, August 2010.
892
893 [SYO] "The Standardized Yiddish Orthography: Rules of Yiddish
894 Spelling", 6th ed., New York, ISBN 0-914512-25-0, 1999.
896
897 Authors' Addresses
898
899 Harald Tveit Alvestrand (editor)
900 Google
901 Beddingen 10
902 Trondheim, 7014
903 Norway
904
905 EMail: harald@alvestrand.no
906
907
908 Cary Karp
909 Swedish Museum of Natural History
910 Frescativ. 40
911 Stockholm, 10405
912 Sweden
913
914 Phone: +46 8 5195 4055
916 EMail: ck@nic.museum