Internet-Draft                                   Editor: Kurt D. Zeilenga
Intended Category: Standards Track                    OpenLDAP Foundation
Expires in six months                                          4 May 2003

              LDAP: Internationalized String Preparation
                <draft-zeilenga-ldapbis-strprep-00.txt>

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Distribution of this memo is unlimited. Technical discussion of this
document will take place on the IETF LDAP Revision Working Group
mailing list <ietf-ldapbis@openldap.org>. Please send editorial
comments directly to the author <Kurt@OpenLDAP.org>.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

The list of current Internet-Drafts can be accessed at
<http://www.ietf.org/ietf/1id-abstracts.txt>. The list of
Internet-Draft Shadow Directories can be accessed at
<http://www.ietf.org/shadow.html>.

Copyright 2003, The Internet Society. All Rights Reserved.

Please see the Copyright section near the end of this document for
more information.

Abstract

The previous Lightweight Directory Access Protocol (LDAP) technical
specifications did not precisely define how string matching is to be
performed. This led to a number of usability and interoperability
problems. This document defines string preparation algorithms for
matching rules defined for use in LDAP.

Zeilenga                        LDAPprep                        [Page 1]

Internet-Draft      draft-ietf-ldapbis-strprep-00             4 May 2003

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14 [RFC2119].

Character names in this document use the notation for code points and
names from the Unicode Standard [UNICODE] and ISO/IEC 10646-1
[ISO10646]. For example, the letter "a" may be represented as either
<U+0061> or <LATIN SMALL LETTER A>. In the lists of mappings and the
prohibited characters, the "U+" is left off to make the lists easier
to read. The comments for character ranges are shown in square
brackets (such as "[CONTROL CHARACTERS]") and do not come from the
standard.

Note: a glossary of terms used in Unicode and ISO/IEC 10646 can be
found in [GLOSSARY]. Information on the ISO/IEC 10646/Unicode
character encoding model can be found in [UTR17].

1. Introduction

1.1. Background

An LDAP matching rule [Syntaxes] defines an algorithm for determining
whether a presented value matches an attribute value in accordance
with the criteria defined for the rule. The proposition may be
evaluated to True, False, or Undefined.

   True - the attribute contains a matching value,

   False - the attribute contains no matching value,

   Undefined - it cannot be determined whether the attribute contains
   a matching value or not.

For instance, the caseIgnoreMatch matching rule may be used to
determine whether the commonName attribute contains a particular value
without regard for case and insignificant spaces.

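The three outcomes above can be sketched in Python as follows. This is
an illustration only, not part of the specification. The names Result,
toy_prepare, and evaluate are hypothetical, toy_prepare is a crude
stand-in for real string preparation, and treating an unpreparable
stored value as Undefined (when no other value matches) is an
assumption of this sketch.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class Result(Enum):
    TRUE = "True"
    FALSE = "False"
    UNDEFINED = "Undefined"


def toy_prepare(value: str) -> Optional[str]:
    """Hypothetical stand-in for preparation: fold case, squeeze spaces."""
    if not value:
        return None  # treat an unusable value as a preparation failure
    return " ".join(value.casefold().split())


def evaluate(presented: str,
             attribute_values: Iterable[str],
             prepare: Callable[[str], Optional[str]] = toy_prepare) -> Result:
    """Evaluate an equality-style match to True, False, or Undefined."""
    wanted = prepare(presented)
    if wanted is None:
        return Result.UNDEFINED
    undetermined = False
    for stored in attribute_values:
        prepared = prepare(stored)
        if prepared is None:
            undetermined = True      # cannot tell whether this value matches
        elif prepared == wanted:
            return Result.TRUE       # the attribute contains a matching value
    return Result.UNDEFINED if undetermined else Result.FALSE
```

For example, evaluate("john  SMITH", ["John Smith"]) yields Result.TRUE
under this toy preparation, while an empty presented value yields
Result.UNDEFINED.
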
1.2. X.500 String Matching Rules

"X.520: Selected attribute types" [X.520] provides (amongst other
things) value syntaxes and matching rules for comparing values
commonly used in the Directory. These specifications are inadequate
for strings composed of characters from the Universal Character Set
(UCS) [ISO10646], a superset of Unicode [UNICODE].

The caseIgnoreMatch matching rule [X.520], for example, is simply
defined as being a case insensitive comparison where insignificant
spaces are ignored. For PrintableString, there is only one space
character and case mapping is bijective, hence this definition is
sufficient. However, for UCS-based string types such as
UniversalString, this is not sufficient. For example, a case
insensitive matching implementation which folded lower case characters
to upper case would yield different results than an implementation
which used upper case to lower case folding. Or one implementation
may view space as referring to only SPACE (U+0020), a second
implementation may view any character with the space separator (Zs)
property as a space, and another implementation may view any character
with the whitespace (WS) category as a space.

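Both ambiguities can be demonstrated with a short Python sketch. The
sample strings and the helper name space_views are chosen here for
illustration; str.isspace() is used only as an approximation of the
"whitespace" view and is not the Unicode WS property itself.

```python
import unicodedata

# Folding direction matters: LATIN SMALL LETTER SHARP S upper-cases to "SS"
# but has no single-character lower-case counterpart to fold back to.
a, b = "stra\u00dfe", "STRASSE"
upper_match = a.upper() == b.upper()   # compare after folding to upper case
lower_match = a.lower() == b.lower()   # compare after folding to lower case


def space_views(ch: str) -> tuple:
    """Return (is SPACE only, has category Zs, Python whitespace)."""
    return (ch == "\u0020",
            unicodedata.category(ch) == "Zs",
            ch.isspace())


for ch in ("\u0020", "\u00a0", "\u0009"):
    print("U+%04X" % ord(ch), space_views(ch))
print("upper-case folding matches:", upper_match)
print("lower-case folding matches:", lower_match)
```

NO-BREAK SPACE (U+00A0) is a space under the second and third views but
not the first, and CHARACTER TABULATION (U+0009) only under the third,
so three implementations can disagree on the same pair of values.
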
The lack of precise specification for string matching has led to
significant interoperability problems. When used in certificate chain
validation, security vulnerabilities can arise. To address these
problems, this document defines precise algorithms for preparing
strings for matching.

1.3. Relationship to "stringprep"

The string preparation algorithms described in this document are based
upon the "stringprep" approach [RFC3454]. In "stringprep", presented
and stored values are first prepared for comparison so that a
character-by-character comparison yields the "correct" result.

The approach used here is a refinement of the "stringprep" [RFC3454]
approach. Each algorithm involves two additional preparation steps.

   a) prior to applying the Unicode string preparation steps outlined in
      "stringprep", the string is transcoded to Unicode;

   b) after applying the Unicode string preparation steps outlined in
      "stringprep", characters insignificant to the matching rules are
      removed.

Hence, preparation of strings for X.500 matching involves the
following steps:

   1) Transcode
   2) Map
   3) Normalize
   4) Prohibit
   5) Check Bidi (Bidirectional)
   6) Insignificant Character Removal

These steps are described in Section 2.

1.4. Relationship to the LDAP Technical Specification

This document is an integral part of the LDAP technical specification
[Roadmap] which obsoletes the previously defined LDAP technical
specification [RFC3377] in its entirety.

This document details LDAP internationalized string preparation
algorithms used by [Syntaxes] and possibly other technical
specifications defining LDAP syntaxes and/or matching rules.

1.5. Relationship to X.500

LDAP is defined [Roadmap] in X.500 terms as an X.500 access mechanism.
As such, there is a strong desire for alignment between LDAP and X.500
syntax and semantics. The string preparation algorithms described in
this document are based upon the "Internationalized String Matching
Rules for X.500" [XMATCH] proposal to ITU/ISO Joint Study Group 2.

2. String Preparation

The following six-step process SHALL be applied to each presented and
attribute value in preparation for string match rule evaluation.

   1) Transcode
   2) Map
   3) Normalize
   4) Prohibit
   5) Check Bidi
   6) Insignificant Character Removal

Failure in any step causes the assertion to be Undefined.

The character repertoire of this process is Unicode 3.2 [UNICODE].

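The control flow of the process can be sketched as a chain of step
functions. This is a minimal sketch under stated assumptions: the names
Step and prepare are hypothetical, each step is modelled as a function
from a Python str to either a str or None, and None stands for a failed
step, which makes the assertion Undefined. Transcoding from an encoded
form is assumed to have happened before the chain, or to be supplied as
the first step.

```python
from typing import Callable, Optional, Sequence

# A step returns the output string, or None if the step fails.
Step = Callable[[str], Optional[str]]


def prepare(value: str, steps: Sequence[Step]) -> Optional[str]:
    """Apply each step in order; stop at the first failure."""
    current = value
    for step in steps:
        result = step(current)
        if result is None:
            return None          # failure in any step: Undefined
        current = result
    return current


def identity(s: str) -> Optional[str]:
    """Placeholder step that accepts every string unchanged."""
    return s


def always_fail(s: str) -> Optional[str]:
    """Placeholder step that rejects every string."""
    return None
```

A caller would pass the six steps in the order listed above. With the
placeholders, prepare("abc", [identity] * 6) returns the input
unchanged, while any chain containing always_fail returns None.
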
2.1. Transcode

Each non-Unicode string value is transcoded to Unicode.

TeletexString values are transcoded to Unicode as described in
Appendix A.

PrintableString values are transcoded directly to Unicode.

UniversalString, UTF8String, and BMPString values need not be
transcoded as they are Unicode-based strings (in the case of
BMPString, restricted to a subset of Unicode).

If the implementation is unable or unwilling to perform the
transcoding as described above, or the transcoding fails, this step
fails and the assertion is evaluated to Undefined.

The transcoded string is the output string.

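A minimal Python sketch of this step is shown below. It assumes the
input arrives as the raw value octets of the ASN.1 string, that
BMPString and UniversalString are carried as big-endian UCS-2 and
UCS-4, and it uses the hypothetical names transcode and _CODECS. For
the Unicode-based types, decoding merely moves the value into Python's
own string representation. The sketch does not check the PrintableString
repertoire, and it treats TeletexString as "unable to transcode"
because the Python standard library has no T.61 codec.

```python
from typing import Optional

# Assumed mapping from ASN.1 string type to a Python codec name.
_CODECS = {
    "PrintableString": "ascii",
    "UTF8String": "utf-8",
    "BMPString": "utf-16-be",
    "UniversalString": "utf-32-be",
}


def transcode(raw: bytes, string_type: str) -> Optional[str]:
    """Return the Unicode string, or None if the step fails."""
    codec = _CODECS.get(string_type)
    if codec is None:
        return None              # unable to transcode (e.g. TeletexString)
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None              # transcoding failed: Undefined
```
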
2.2. Map

SOFT HYPHEN (U+00AD) and MONGOLIAN TODO SOFT HYPHEN (U+1806) code
points are mapped to nothing. COMBINING GRAPHEME JOINER (U+034F) and
VARIATION SELECTORs (U+180B-180D,FE00-FE0F) code points are also
mapped to nothing. The OBJECT REPLACEMENT CHARACTER (U+FFFC) is
mapped to nothing.

CHARACTER TABULATION (U+0009), LINE FEED (LF) (U+000A), LINE
TABULATION (U+000B), FORM FEED (FF) (U+000C), CARRIAGE RETURN (CR)
(U+000D), and NEXT LINE (NEL) (U+0085) are mapped to SPACE (U+0020).

All other control code points (e.g., Cc) or code points with a control
function (e.g., Cf) are mapped to nothing.

ZERO WIDTH SPACE (U+200B) is mapped to nothing. All other code points
with Separator (space, line, or paragraph) property (e.g., Zs, Zl, or
Zp) are mapped to SPACE (U+0020).

For case ignore, numeric, and stored prefix string matching rules,
characters are case folded per B.2 of [RFC3454].

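The mapping rules above can be sketched in Python as follows. This is
an illustration under assumptions, not a reference implementation: the
names map_step and fold_case are hypothetical, general categories are
taken from the Unicode 3.2 database that CPython ships as
unicodedata.ucd_3_2_0, the variation selector range is read as
U+FE00-FE0F, and case folding uses the standard library's
stringprep.map_table_b2, which implements table B.2 of [RFC3454].

```python
import stringprep
import unicodedata

_UCD = unicodedata.ucd_3_2_0     # Unicode 3.2 character database

_TO_NOTHING = {0x00AD, 0x1806, 0x034F, 0xFFFC, 0x200B}
_TO_NOTHING.update(range(0x180B, 0x180E))    # U+180B-180D
_TO_NOTHING.update(range(0xFE00, 0xFE10))    # U+FE00-FE0F
_TO_SPACE = {0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0085}


def map_step(s: str, fold_case: bool = False) -> str:
    """Apply the Map step; fold_case selects B.2 case folding."""
    out = []
    for ch in s:
        cp = ord(ch)
        if cp in _TO_NOTHING:
            continue                          # mapped to nothing
        if cp in _TO_SPACE:
            out.append(" ")                   # mapped to SPACE
            continue
        category = _UCD.category(ch)
        if category in ("Cc", "Cf"):
            continue                          # other control code points
        if category in ("Zs", "Zl", "Zp"):
            out.append(" ")                   # other separators
            continue
        out.append(stringprep.map_table_b2(ch) if fold_case else ch)
    return "".join(out)
```

The explicit tables are consulted before the category tests, so that
the tabulation and line-ending controls become SPACE rather than being
dropped with the other Cc code points.
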
2.3. Normalize

The input string is to be normalized to Unicode Form KC (compatibility
composed) as described in [UAX15].

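In Python, Form KC normalization is available from the standard
library. The sketch below uses the Unicode 3.2 database object
(unicodedata.ucd_3_2_0) to stay within the repertoire named in Section
2; the function name normalize_step is hypothetical.

```python
import unicodedata


def normalize_step(s: str) -> str:
    """Normalize to Form KC using the Unicode 3.2 database."""
    return unicodedata.ucd_3_2_0.normalize("NFKC", s)


# Compatibility characters are replaced and sequences are composed.
ligature = normalize_step("\ufb01")       # LATIN SMALL LIGATURE FI
composed = normalize_step("e\u0301")      # e + COMBINING ACUTE ACCENT
fullwidth = normalize_step("\uff21")      # FULLWIDTH LATIN CAPITAL LETTER A
```

Here ligature becomes the two letters "fi", composed becomes the single
code point U+00E9, and fullwidth becomes the ASCII letter "A".
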
2.4. Prohibit

All Unassigned, Private Use, and non-character code points are
prohibited. Surrogate codes (U+D800-DFFF) are prohibited.

The REPLACEMENT CHARACTER (U+FFFD) code point is prohibited.

The first code point of a string is prohibited from being a combining
character.

Empty strings are prohibited.

The step fails and the assertion is evaluated to Undefined if the
input string contains any prohibited code point. The output string is
the input string.

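The prohibition checks can be sketched in Python as follows. The
function name prohibit is hypothetical, and the sketch assumes that the
general categories Cn (unassigned and non-character), Co (private use),
and Cs (surrogate) in the Unicode 3.2 database identify the prohibited
classes, and that "combining character" means any code point whose
category begins with "M".

```python
import unicodedata
from typing import Optional

_UCD = unicodedata.ucd_3_2_0     # Unicode 3.2 character database


def prohibit(s: str) -> Optional[str]:
    """Return the input unchanged, or None if it is prohibited."""
    if not s:
        return None                          # empty strings are prohibited
    if _UCD.category(s[0]).startswith("M"):
        return None                          # leading combining character
    for ch in s:
        if ch == "\ufffd":
            return None                      # REPLACEMENT CHARACTER
        if _UCD.category(ch) in ("Cn", "Co", "Cs"):
            return None                      # unassigned, private, surrogate
    return s
```

Because the Unicode 3.2 database is used, code points assigned in later
versions of Unicode are reported as unassigned and therefore rejected.
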
2.5. Check Bidi

There are no bidirectional restrictions. The output string is the
input string.

2.6. Insignificant Character Removal

In this step, characters insignificant to the matching rule are to be
removed. The characters to be removed differ from matching rule to
matching rule.

Section 2.6.1 applies to case ignore and exact string matching.
Section 2.6.2 applies to numericString matching.
Section 2.6.3 applies to telephoneNumber matching.

2.6.1. Insignificant Space Removal

For the purposes of this section, a space is defined to be the SPACE
(U+0020) code point followed by no combining marks.

   NOTE - The previous steps ensure that the string cannot contain
          any code points in the separator class, other than SPACE
          (U+0020).

The following spaces are regarded as not significant and are to be
removed:

   - leading spaces (i.e. those preceding the first character that is
     not a space),

   - trailing spaces (i.e. those following the last character that is
     not a space),

   - multiple consecutive spaces (these are taken as equivalent to a
     single space character).

(A string consisting entirely of spaces is equivalent to a string
containing exactly one space.)

For example, removal of spaces from the Form KC string:
    "<SPACE><SPACE>foo<SPACE><SPACE>bar<SPACE><SPACE>"
would result in the output string:
    "foo<SPACE>bar"
and the Form KC string:
    "<SPACE><SPACE><SPACE>"
would result in the output string:
    "<SPACE>".

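The space rules of this section, applied literally, can be sketched in
Python as follows. The function name remove_insignificant_spaces is
hypothetical, and a "combining mark" is assumed to be any code point
whose general category begins with "M". A SPACE that is followed by a
combining mark is not a space in the sense of this section and is kept.

```python
import unicodedata


def remove_insignificant_spaces(s: str) -> str:
    """Drop leading and trailing spaces and collapse runs of spaces."""
    # Pair each code point with a flag: is it a space as defined here?
    tokens = []
    for i, ch in enumerate(s):
        followed_by_mark = (i + 1 < len(s)
                            and unicodedata.category(s[i + 1]).startswith("M"))
        tokens.append((ch, ch == " " and not followed_by_mark))

    # A string consisting entirely of spaces is equivalent to one space.
    if tokens and all(is_space for _, is_space in tokens):
        return " "

    while tokens and tokens[0][1]:
        tokens.pop(0)                    # leading spaces
    while tokens and tokens[-1][1]:
        tokens.pop()                     # trailing spaces

    out = []
    previous_was_space = False
    for ch, is_space in tokens:
        if is_space and previous_was_space:
            continue                     # collapse consecutive spaces
        out.append(ch)
        previous_was_space = is_space
    return "".join(out)
```
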
2.6.2. NumericString Insignificant Character Removal

For the purposes of this section, a space is defined to be the SPACE
(U+0020) code point followed by no combining marks.

All spaces are regarded as not significant and are to be removed.

For example, removal of spaces from the Form KC string:
    "<SPACE><SPACE>123<SPACE><SPACE>456<SPACE><SPACE>"
would result in the output string:
    "123456"
and the Form KC string:
    "<SPACE><SPACE><SPACE>"
would result in an empty output string.

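A Python sketch of this rule is given below. The function name
remove_numeric_spaces is hypothetical, and a "combining mark" is
assumed to be any code point whose general category begins with "M".

```python
import unicodedata


def remove_numeric_spaces(s: str) -> str:
    """Remove every SPACE that is not followed by a combining mark."""
    out = []
    for i, ch in enumerate(s):
        followed_by_mark = (i + 1 < len(s)
                            and unicodedata.category(s[i + 1]).startswith("M"))
        if ch == " " and not followed_by_mark:
            continue                     # insignificant space
        out.append(ch)
    return "".join(out)
```

Unlike Section 2.6.1, a string consisting entirely of spaces reduces to
the empty string here.
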
2.6.3. TelephoneNumber Insignificant Character Removal

For the purposes of this section, a hyphen is defined to be a
HYPHEN-MINUS (U+002D), ARMENIAN HYPHEN (U+058A), HYPHEN (U+2010),
NON-BREAKING HYPHEN (U+2011), MINUS SIGN (U+2212), SMALL HYPHEN-MINUS
(U+FE63), or FULLWIDTH HYPHEN-MINUS (U+FF0D) code point followed by no
combining marks, and a space is defined to be the SPACE (U+0020) code
point followed by no combining marks.

All hyphens and spaces are regarded as not significant and are to be
removed.

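A Python sketch of this rule follows. The names remove_telephone_noise
and _INSIGNIFICANT are hypothetical, and a "combining mark" is assumed
to be any code point whose general category begins with "M". The set
lists the seven hyphen code points named above together with SPACE.

```python
import unicodedata

# SPACE plus the hyphen code points listed in this section.
_INSIGNIFICANT = {
    "\u0020", "\u002d", "\u058a", "\u2010",
    "\u2011", "\u2212", "\ufe63", "\uff0d",
}


def remove_telephone_noise(s: str) -> str:
    """Remove hyphens and spaces that are not followed by a combining mark."""
    out = []
    for i, ch in enumerate(s):
        followed_by_mark = (i + 1 < len(s)
                            and unicodedata.category(s[i + 1]).startswith("M"))
        if ch in _INSIGNIFICANT and not followed_by_mark:
            continue                     # insignificant hyphen or space
        out.append(ch)
    return "".join(out)
```

For instance, "+1 408-555-1234" and "+14085551234" prepare to the same
string under this sketch.
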
3. Security Considerations

"Preparation of Internationalized Strings ('stringprep')" [RFC3454]
security considerations generally apply to the algorithms described
in this document.

The approach used in this document is based upon design principles and
algorithms described in "Preparation of Internationalized Strings
('stringprep')" [RFC3454] by Paul Hoffman and Marc Blanchet. Some
additional guidance was drawn from Unicode Technical Standards,
Technical Reports, and Notes.

E-mail: <kurt@openldap.org>

6.1. Normative References

[RFC2119]  S. Bradner, "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14 (also RFC 2119), March 1997.

[RFC3454]  P. Hoffman, M. Blanchet, "Preparation of Internationalized
           Strings ('stringprep')", RFC 3454, December 2002.

[Roadmap]  K. Zeilenga, "LDAP: Technical Specification Road Map",
           draft-ietf-ldapbis-roadmap-xx.txt, a work in progress.

[Syntaxes] S. Legg (editor), "LDAP: Syntaxes and Matching Rules",
           draft-ietf-ldapbis-syntaxes-xx.txt, a work in progress.

[ISO10646] Universal Multiple-Octet Coded Character Set (UCS) -
           Architecture and Basic Multilingual Plane, ISO/IEC 10646-1

[UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
           3.2.0" is defined by "The Unicode Standard, Version 3.0"
           (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as
           amended by the "Unicode Standard Annex #27: Unicode 3.1"
           (http://www.unicode.org/reports/tr27/) and by the "Unicode
           Standard Annex #28: Unicode 3.2"
           (http://www.unicode.org/reports/tr28/).

[UAX15]    M. Davis, M. Duerst, "Unicode Standard Annex #15: Unicode
           Normalization Forms, Version 3.2.0".
           <http://www.unicode.org/unicode/reports/tr15/tr15-22.html>,

6.2. Informative References

[X.500]    International Telecommunication Union, "The Directory:
           Overview of Concepts, Models and Service", X.500, 2000.

[X.501]    International Telecommunication Union, "The Directory: The
           Models",

[X.520]    International Telecommunication Union, "The Directory:
           Selected Attribute Types", X.520, 2000.

[XMATCH]   K. Zeilenga, "Internationalized String Matching
           Rules for X.500", draft-zeilenga-ldapbis-strmatch-xx.txt, a
           work in progress.

[GLOSSARY] The Unicode Consortium, "Unicode Glossary",
           <http://www.unicode.org/glossary/>.

[UTR17]    K. Whistler, M. Davis, "Unicode Technical Report
           #17, Character Encoding Model", UTR17,
           <http://www.unicode.org/unicode/reports/tr17/>, August

Copyright 2003, The Internet Society. All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be followed,
or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE AUTHORS, THE INTERNET SOCIETY, AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Appendix A. Teletex (T.61) to Unicode