Independent Submission                                        P. Resnick
Request for Comments: 5895                         Qualcomm Incorporated
Category: Informational                                       P. Hoffman
ISSN: 2070-1721                                           VPN Consortium
                                                          September 2010


Mapping Characters for
Internationalized Domain Names in Applications (IDNA) 2008

Abstract

In the original version of the Internationalized Domain Names in
Applications (IDNA) protocol, any Unicode code points taken from user
input were mapped into a set of Unicode code points that "made
sense", and then encoded and passed to the domain name system (DNS).
The IDNA2008 protocol (described in RFCs 5890, 5891, 5892, and 5893)
presumes that the input to the protocol comes from a set of
"permitted" code points, which it then encodes and passes to the DNS,
but does not specify what to do with the result of user input. This
document describes the actions that can be taken by an implementation
between receiving user input and passing permitted code points to the
new IDNA protocol.

Status of This Memo

This document is not an Internet Standards Track specification; it is
published for informational purposes.

This is a contribution to the RFC Series, independently of any other
RFC stream. The RFC Editor has chosen to publish this document at
its discretion and makes no statement about its value for
implementation or deployment. Documents approved for publication by
the RFC Editor are not a candidate for any level of Internet
Standard; see Section 2 of RFC 5741.

Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc5895.

Resnick & Hoffman             Informational                     [Page 1]
RFC 5895                      IDNA Mapping                September 2010

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.

1. Introduction

This document describes the operations that can be applied to user
input in order to get it into a form that is acceptable to the
Internationalized Domain Names in Applications (IDNA) protocol
[IDNA2008protocol]. It includes a general implementation procedure
for mapping.

Note that this document does not specify the behavior of a protocol
that appears "on the wire". It describes an operation that is to be
applied to user input in order to prepare that user input for use in
an "on the network" protocol. As unusual as this may be for a
document concerning Internet protocols, it is necessary to describe
this operation for implementers who may have designed around the
original IDNA protocol (herein referred to as IDNA2003), which
conflates this user-input operation with the protocol.

It is very important to note that there are many potentially valid
mappings of characters from user input. The mapping described in
this document is the basis for other mappings, and is not likely to
be useful without modification. Any useful mapping will have
features designed to reduce the surprise for users and is likely to
be slightly (or sometimes radically) different depending on the
locale of the user, the type of input being used (such as typing,
copy-and-paste, voice, and so on), the type of application used, etc.
Although most common mappings will probably produce similar results
for the same input, there will be subtle differences between
applications.

1.1. The Dividing Line between User Interface and Protocol

The user interface to applications is much more complicated than most
network implementers think. When we say "the user enters an
internationalized domain name in the application", we are talking
about a very complex process that encompasses everything from the
user formulating the name and deciding which symbols to use to
express that name, to the user entering the symbols into the computer
using some input method (be it a keyboard, a stylus, or even a voice
recognition program), to the computer interpreting that input (be it
keyboard scan codes, a graphical representation, or digitized sounds)
into some representation of those symbols, and finally to
normalizing those symbols into a particular character repertoire in
an encoding recognizable to IDNA processes and the domain name
system.

Considerations for a user interface for internationalized domain
names involve taking into account culture, context, and locale for
any given user. A simple and well-known example is the lowercasing
of LATIN CAPITAL LETTER I (U+0049) when it is used in Turkish and
other languages. A capital "I" in Turkish is properly
lowercased to a LATIN SMALL LETTER DOTLESS I (U+0131), not to a LATIN
SMALL LETTER I (U+0069). This lowercasing is clearly dependent on
the locale of the system and/or the locale of the user. Using a
single context-free mapping without considering the user interface
properties has the potential of doing exactly the wrong thing for the
user.
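The locale dependence described above can be illustrated with a minimal
Python sketch. The function name lower_for_locale and its handling of
locale tags are hypothetical, chosen only for illustration. Python's
default str.lower() is locale-independent, so the Turkish behavior has
to be added explicitly. Applying the same rule to Azerbaijani is an
assumption based on the Unicode special-casing data.

```python
def lower_for_locale(text: str, locale: str = "") -> str:
    """Lowercase text, applying the dotted/dotless I rule when asked.

    Hypothetical helper: 'locale' is assumed to be a tag such as
    "tr_TR" or "tr-TR"; only the language part is inspected.
    """
    language = locale.replace("-", "_").split("_")[0].lower()
    if language in ("tr", "az"):
        # U+0130 (capital I with dot above) lowercases to plain "i",
        # and U+0049 (capital I) lowercases to U+0131 (dotless i).
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


# ascii() keeps the output readable on terminals without UTF-8 support.
print(ascii(lower_for_locale("TITLE")))           # context-free mapping
print(ascii(lower_for_locale("TITLE", "tr_TR")))  # Turkish-aware mapping
```

The two calls return different strings for the same input, which is
exactly the kind of difference a single context-free mapping cannot
express.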

The original version of IDNA conflated user interface processing and
protocol. It took whatever characters the user produced in whatever
encoding the application used, assumed some conversion to Unicode
code points, and then without regard to context, locale, or anything
about the user's intentions, mapped them into a particular set of
other characters, and then re-encoded them in Punycode, in order to
have the entire operation be contained within the protocol. Ignoring
context, locale, and user preference in the IDNA protocol made life
significantly less complicated for the application developer, but at
the expense of violating the principle of "least user surprise" for
consumers and producers of domain names.
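As an illustration of this conflation, Python's built-in "idna" codec
implements the original IDNA (RFC 3490) and therefore maps and encodes
in a single step, with no parameter for locale or context. The sketch
below only demonstrates that behavior; the variable names are
illustrative and the codec is not part of the mapping described in this
document.

```python
# Python's standard "idna" codec implements IDNA2003: the Nameprep
# mapping (case folding, normalization) and the Punycode encoding
# happen together inside one call.
mixed_case = "B\u00dcCHER.example"   # "BUCHER" with U+00DC (U umlaut)
lower_case = "b\u00fccher.example"   # same name, already lowercase

encoded_mixed = mixed_case.encode("idna")
encoded_lower = lower_case.encode("idna")

# Both inputs yield the same ASCII form; the caller never sees, and
# cannot influence, the mapping that was applied along the way.
print(encoded_mixed)
print(encoded_lower)
```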

In IDNA2008, the dividing line between "user interface" and
"protocol" is clear. The IDNA2008 specification defines the protocol
part of IDNA: it explicitly does not deal with the user interface.
Mappings such as the one described in this document explicitly deal
with the user interface and not the protocol. That is, a mapping is
only to be applied before a string of characters is treated as a
domain name (in the "user interface") and is never to be applied
during domain name processing (in the "protocol").
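The split described above can be sketched as two separate functions.
This is a rough illustration under stated assumptions, not an IDNA2008
implementation: the names ui_map and protocol_encode_label are
hypothetical, the mapping is a deliberately simple stand-in, and the
two checks on the protocol side are only a small part of the rules in
RFCs 5891, 5892, and 5893.

```python
import unicodedata


def ui_map(user_input: str) -> str:
    """User-interface side: free to map (a simple stand-in mapping)."""
    return unicodedata.normalize("NFC", user_input.lower())


def protocol_encode_label(label: str) -> str:
    """Protocol side: checks and encodes one label, but never maps.

    The checks below are a crude, partial stand-in for IDNA2008
    validation; a real implementation applies many more rules.
    """
    if label.isascii():
        return label  # plain ASCII labels are passed through unchanged
    if label != unicodedata.normalize("NFC", label):
        raise ValueError("label is not in Normalization Form C")
    if any(ch.isupper() for ch in label):
        raise ValueError("label contains uppercase characters")
    return "xn--" + label.encode("punycode").decode("ascii")


# The mapping runs first, in the user interface; the protocol step
# then receives input it can accept without changing it.
print(protocol_encode_label(ui_map("B\u00dcCHER")))
```

Passing the unmapped string straight to protocol_encode_label raises
ValueError instead of silently repairing the input, which is the
behavior the dividing line implies.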

1.2. The Design of This Mapping

The user interface mapping in this document is a set of expansions to
IDNA2008 that are meant to be sensible and friendly and mostly
obvious to people throughout the world when using typical
applications with domain names that are entered by hand. It is also
designed to let applications be mostly backwards compatible with
IDNA2003. By definition, it cannot meet all of those design goals
for all people, and in fact is known to fail on some of those goals
for quite large populations of people.

A good mapping in the real world might use the "sensible and friendly
and mostly obvious" design goal but come up with a different
algorithm. Many algorithms will have results that are close to what
is described here, but will differ in assumptions about the users'
way of thinking or typing. Having said that, it is likely that some
mappings will be significantly different. For example, a mapping
might apply to a spoken user interface instead of a typed one.
Another example is that a mapping might be different for users who
are typing than for users who are copying-and-pasting from different
applications. Yet another example is that a user interface that
allows typed input that is transliterated from Latin characters could
have very different mappings than one that applies to typing in other
character sets; this would be typical in a Pinyin input method for
Chinese characters.

2. The General Procedure

This section defines a general algorithm that applications ought to
implement in order to produce Unicode code points that will be valid
under the IDNA protocol. An application might implement the full
mapping as described below, or it can choose a different mapping.
This mapping is very general and was designed to be acceptable to the
widest user community, but as stated above, it does not take into
account any particular context, culture, or locale.

The general algorithm that an application (or the input method
provided by an operating system) ought to use is relatively
straightforward:

1. Uppercase characters are mapped to their lowercase equivalents by
   using the algorithm for mapping case in Unicode characters. This
   step was chosen because the output will behave more like ASCII
   host names.

2. Fullwidth and halfwidth characters (those defined with
   Decomposition Types <wide> and <narrow>) are mapped to their
   decomposition mappings as shown in the Unicode character
   database. This step was chosen because many input mechanisms,
   particularly in Asia, do not allow you to easily enter characters
   in the form used by IDNA2008. Even if they do allow the correct
   character form, the user might not know which form they are
   entering.

3. All characters are mapped using Unicode Normalization Form C
   (NFC). This step was chosen because it maps combinations of
   combining characters into canonical composed form. As with the
   fullwidth/halfwidth mapping, users are not generally aware of the
   particular form of characters that they are entering, and
   IDNA2008 requires that only the canonical composed forms from NFC
   be used.

4. [IDNA2008protocol] is specified such that the protocol acts on
   the individual labels of the domain name. If an implementation
   of this mapping is also performing the step of separation of the
   parts of a domain name into labels by using the FULL STOP
   character (U+002E), the IDEOGRAPHIC FULL STOP character (U+3002)
   can be mapped to the FULL STOP before label separation occurs.
   There are other characters that are used as "full stops" that one
   could consider mapping as label separators, but their use as such
   has not been investigated thoroughly. This step was chosen
   because some input mechanisms do not allow the user to easily
   enter proper label separators. Only the IDEOGRAPHIC FULL STOP
   character (U+3002) is added in this mapping because the authors
   have not fully investigated the applicability of other characters
   and the environments where they should and should not be
   considered domain name label separators.

Note that the steps above are ordered.
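The four ordered steps above can be sketched in Python as follows.
This is a minimal sketch under stated assumptions, not a reference
implementation: the function names are hypothetical, str.lower() is
used as an approximation of the case-mapping step (it applies Python's
own full case mapping, including a final-sigma rule), and the
unicodedata module reflects whatever Unicode version ships with the
interpreter rather than Unicode 5.2.

```python
import unicodedata


def map_width(text: str) -> str:
    """Step 2: replace <wide>/<narrow> characters by their mappings."""
    out = []
    for ch in text:
        # decomposition() returns e.g. "<wide> 0041", or "" if none.
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<wide>", "<narrow>")):
            code_points = decomp.split()[1:]
            ch = "".join(chr(int(cp, 16)) for cp in code_points)
        out.append(ch)
    return "".join(out)


def map_user_input(text: str) -> str:
    """Apply the four steps in order (hypothetical helper)."""
    text = text.lower()                        # step 1: lowercase
    text = map_width(text)                     # step 2: width mapping
    text = unicodedata.normalize("NFC", text)  # step 3: NFC
    text = text.replace("\u3002", ".")         # step 4: U+3002 -> "."
    return text


# Fullwidth "EXAMPLE", IDEOGRAPHIC FULL STOP, fullwidth "COM".
sample = "\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25\u3002\uff23\uff2f\uff2d"
print(map_user_input(sample))
```

The ordering matters in this sketch: a halfwidth katakana letter
followed by a halfwidth voiced sound mark only composes under NFC after
step 2 has replaced both by their ordinary forms, and a HALFWIDTH
IDEOGRAPHIC FULL STOP (U+FF61) only reaches step 4 as U+3002 because
step 2 ran first.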

Definitions for the rules in this algorithm can be found in
[Unicode52]. Specifically:

o Unicode Normalization Form C is defined in Unicode Standard
  Annex #15 [Unicode-UAX15].

o In order to map uppercase characters to their lowercase
  equivalents (defined in Section 3.13 of [Unicode52]), first map
  each character to the value of its "Lowercase_Mapping" property
  (the "<lower>" entry in the second column) in
  <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>, if any.
  Then, map each character to the value of its
  "Simple_Lowercase_Mapping" property (the fourteenth column) in
  <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.

o In order to map fullwidth and halfwidth characters to their
  decomposition mappings, map any character whose
  "Decomposition_Type" (contained in the first part of the sixth
  column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
  is either "<wide>" or "<narrow>" to the "Decomposition_Mapping" of
  that character (contained in the second part of the same column).

o The Unicode Character Database [TR44] has useful descriptions of
  the contents of these files.
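The column positions named in the list above can be made concrete with
a short Python sketch that reads single lines in the format of the two
data files. The function names and the returned structures are
hypothetical, and the sample lines are reproduced here as an assumption
about the files' contents; an implementation would read the actual
files for the Unicode version it targets.

```python
# Sample lines in the semicolon-separated format of the data files.
UNICODEDATA_SAMPLE = (
    "FF21;FULLWIDTH LATIN CAPITAL LETTER A;Lu;0;L;<wide> 0041;;;;N;;;;FF41;"
)
SPECIALCASING_SAMPLE = (
    "0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE"
)


def parse_unicodedata_line(line: str) -> dict:
    """Extract the fields this mapping uses from a UnicodeData.txt line."""
    fields = line.strip().split(";")
    decomposition = fields[5]            # sixth column
    decomp_type, decomp_mapping = "", decomposition
    if decomposition.startswith("<"):
        # First part is the type tag, the rest is the mapping.
        decomp_type, _, decomp_mapping = decomposition.partition(" ")
    return {
        "code_point": int(fields[0], 16),
        "decomposition_type": decomp_type,
        "decomposition_mapping": [
            int(cp, 16) for cp in decomp_mapping.split()
        ],
        # Fourteenth column: Simple_Lowercase_Mapping, possibly empty.
        "simple_lowercase": int(fields[13], 16) if fields[13] else None,
    }


def parse_specialcasing_line(line: str) -> tuple:
    """Return (code point, <lower> mapping, conditions) for one line."""
    data = line.split("#", 1)[0]         # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    lower = [int(cp, 16) for cp in fields[1].split()]  # second column
    conditions = fields[4] if len(fields) > 4 else ""
    return int(fields[0], 16), lower, conditions


print(parse_unicodedata_line(UNICODEDATA_SAMPLE))
print(parse_specialcasing_line(SPECIALCASING_SAMPLE))
```

Entries in SpecialCasing.txt that carry a condition (a non-empty third
element in the tuple above) are context- or language-dependent, so a
context-free mapping would need to decide how to treat them.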

If the mappings in this document are applied to versions of Unicode
later than Unicode 5.2, the later versions of the Unicode Standard
should be consulted.

These form a minimal set of mappings that an application should
strongly consider applying. Of course, there are many others that
might be applied.

3. Implementing This Mapping

If you are implementing a mapping for an application or operating
system by using exactly the four steps in Section 2, the authors of
this document have a request: please don't. We mean it. Section 2
does not describe a universal mapping algorithm because, as we said,
there is no universally applicable mapping algorithm.

If you read the material in Section 2 without reading Section 1, go
back and carefully read all of Section 1; in many ways, Section 1 is
more important than Section 2. Further, you can probably think of
user interface considerations that we did not list in Section 1. If
you did read Section 1 but somehow decided that the algorithm in
Section 2 is completely correct for the intended users of your
application or operating system, you are probably not thinking hard
enough about your intended users.

4. Security Considerations

This document suggests creating mappings that might cause confusion
for some users while alleviating confusion for other users. Such
confusion is not covered in any depth in this document (nor in the
other IDNA-related documents).

5. Acknowledgements

This document is the product of many contributions from numerous
people in the IETF.

6. Normative References

[IDNA2008protocol] Klensin, J., "Internationalized Domain Names in
                   Applications (IDNA): Protocol", RFC 5891,
                   August 2010.

[TR44]             The Unicode Consortium, "Unicode Technical Report
                   #44: Unicode Character Database", September 2009,
                   <http://www.unicode.org/reports/tr44/tr44-4.html>.

[Unicode-UAX15]    The Unicode Consortium, "Unicode Standard Annex
                   #15: Unicode Normalization Forms, Revision 31",
                   September 2009,
                   <http://www.unicode.org/reports/tr15/tr15-31.html>.

[Unicode52]        The Unicode Consortium. The Unicode Standard,
                   Version 5.2.0, defined by: "The Unicode Standard,
                   Version 5.2.0", (Mountain View, CA: The Unicode
                   Consortium, 2009. ISBN 978-1-936213-00-9).
                   <http://www.unicode.org/versions/Unicode5.2.0/>.

Authors' Addresses

Peter W. Resnick
Qualcomm Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
US

Phone: +1 858 651 4478
EMail: presnick@qualcomm.com
URI: http://www.qualcomm.com/~presnick/


Paul Hoffman
VPN Consortium
127 Segre Place
Santa Cruz, CA 95060
US

Phone: 1-831-426-9827
EMail: paul.hoffman@vpnc.org