gomp-20050608-branch/libjava/classpath/doc/unicode/UnicodeData-3.0.0.html

   1 <html>
   2
   3
   4
   5 <head>
   6
   7 <meta NAME="GENERATOR" CONTENT="Microsoft FrontPage 4.0">
   8
   9 <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
  10
  11 <link REL="stylesheet" HREF="http://www.unicode.org/unicode.css" TYPE="text/css">
  12
  13 <title>UnicodeData File Format</title>
  14
  15 </head>
  16
  17
  18
  19 <body>
  20
  21
  22
  23 <h1>UnicodeData File Format<br>
  24 Version 3.0.0</h1>
  25
  26
  27
  28 <table BORDER="1" CELLSPACING="2" CELLPADDING="0" HEIGHT="87" WIDTH="100%">
  29
  30   <tr>
  31
  32     <td VALIGN="TOP" width="144">Revision</td>
  33
  34     <td VALIGN="TOP">3.0.0</td>
  35
  36   </tr>
  37
  38   <tr>
  39
  40     <td VALIGN="TOP" width="144">Authors</td>
  41
  42     <td VALIGN="TOP">Mark Davis and Ken Whistler</td>
  43
  44   </tr>
  45
  46   <tr>
  47
  48     <td VALIGN="TOP" width="144">Date</td>
  49
  50     <td VALIGN="TOP">1999-09-12</td>
  51
  52   </tr>
  53
  54   <tr>
  55
  56     <td VALIGN="TOP" width="144">This Version</td>
  57
  58     <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
  59
  60   </tr>
  61
  62   <tr>
  63
  64     <td VALIGN="TOP" width="144">Previous Version</td>
  65
  66     <td VALIGN="TOP">n/a</td>
  67
  68   </tr>
  69
  70   <tr>
  71
  72     <td VALIGN="TOP" width="144">Latest Version</td>
  73
  74     <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
  75
  76   </tr>
  77
  78 </table>
  79
  80
  81
  82 <p align="center">Copyright © 1995-1999 Unicode, Inc. All Rights reserved.<br>
  83
  84 <i>For more information, including Disclaimer and Limitations, see <a HREF="UnicodeCharacterDatabase-3.0.0.html">UnicodeCharacterDatabase-3.0.0.html</a> </i></p>
  85
  86
  87
  88 <p>This document describes the format of the UnicodeData.txt file, which is one of the
  89
  90 files in the Unicode Character Database. The document is divided into the following
  91
  92 sections:
  93
  94
  95
  96 <ul>
  97
  98   <li><a HREF="#Field Formats">Field Formats</a> <ul>
  99
 100       <li><a HREF="#General Category">General Category</a> </li>
 101
 102       <li><a HREF="#Bidirectional Category">Bidirectional Category</a> </li>
 103
 104       <li><a HREF="#Character Decomposition">Character Decomposition Mapping</a> </li>
 105
 106       <li><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </li>
 107
 108       <li><a HREF="#Decompositions and Normalization">Decompositions and Normalization</a> </li>
 109
 110       <li><a HREF="#Case Mappings">Case Mappings</a> </li>
 111
 112     </ul>
 113
 114   </li>
 115
 116   <li><a HREF="#Property Invariants">Property Invariants</a> </li>
 117
 118   <li><a HREF="#Modification History">Modification History</a> </li>
 119
 120 </ul>
 121
 122
 123
 124 <p><b>Warning: </b>the information in this file does not completely describe the use and
 125
 126 interpretation of Unicode character properties and behavior. It must be used in
 127
 128 conjunction with the data in the other files in the Unicode Character Database, and relies
 129
 130 on the notation and definitions supplied in <i><a href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html"> The Unicode
 131 Standard</a></i>. All chapter references
 132
 133 are to Version 3.0 of the standard.</p>
 134
 135
 136
 137 <h2><a NAME="Field Formats"></a>Field Formats</h2>
 138
 139
 140
 141 <p>The file consists of lines containing fields terminated by semicolons. Each line
 142
 143 represents the data for one encoded character in the Unicode Standard. Every encoded
 144
 145 character has a data entry, with the exception of certain special ranges, as detailed
 146
 147 below.
 148
 149
 150
 151 <ul>
 152
 153   <li>There are seven special ranges of characters that are represented only by their start and
 154
 155     end characters, since the properties in the file are uniform, except for code values
 156
 157     (which are all sequential and assigned). </li>
 158
 159   <li>The names of CJK ideograph characters and the names and decompositions of Hangul
 160
 161     syllable characters are algorithmically derivable. (See the Unicode Standard and <a
 162
 163     HREF="http://www.unicode.org/unicode/reports/tr15/">Unicode Technical Report #15</a> for
 164
 165     more information). </li>
 166
 167   <li>Surrogate code values and private use characters have no names. </li>
 168
 169   <li>The Private Use characters outside of the BMP (U+F0000..U+FFFFD, U+100000..U+10FFFD) are
 170
 171     not listed. These correspond to surrogate pairs where the first surrogate is in the High
 172
 173     Surrogate Private Use section. </li>
 174
 175 </ul>
 176
 177
 178
 179 <p>The exact ranges represented by start and end characters are:
 180
 181
 182
 183 <ul>
 184
 185   <li>CJK Ideographs Extension A (U+3400 - U+4DB5) </li>
 186
 187   <li>CJK Ideographs (U+4E00 - U+9FA5) </li>
 188
 189   <li>Hangul Syllables (U+AC00 - U+D7A3) </li>
 190
 191   <li>Non-Private Use High Surrogates (U+D800 - U+DB7F) </li>
 192
 193   <li>Private Use High Surrogates (U+DB80 - U+DBFF) </li>
 194
 195   <li>Low Surrogates (U+DC00 - U+DFFF) </li>
 196
 197   <li>The Private Use Area (U+E000 - U+F8FF) </li>
 198
 199 </ul>
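<p>As a minimal sketch of how an implementation might test whether a code value falls in one of
the ranges listed above, the following Python fragment copies the range boundaries from that
list. The names <code>SPECIAL_RANGES</code> and <code>special_range</code> are hypothetical and
are not defined by the Unicode Character Database.</p>

```python
# Range boundaries copied from the list above. The labels and all names here
# are illustrative choices for this sketch, not part of the data file format.
SPECIAL_RANGES = [
    (0x3400, 0x4DB5, "CJK Ideographs Extension A"),
    (0x4E00, 0x9FA5, "CJK Ideographs"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
    (0xD800, 0xDB7F, "Non-Private Use High Surrogates"),
    (0xDB80, 0xDBFF, "Private Use High Surrogates"),
    (0xDC00, 0xDFFF, "Low Surrogates"),
    (0xE000, 0xF8FF, "Private Use Area"),
]


def special_range(code_value):
    """Return the label of the special range containing code_value, or None."""
    for first, last, label in SPECIAL_RANGES:
        if first <= code_value <= last:
            return label
    return None
```

<p>A caller could use such a lookup to supply the uniform properties of a range for code values
that have no individual entry in the file.</p>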
 200
 201
 202
 203 <p>The following table describes the format and meaning of each field in a data entry in
 204
 205 the UnicodeData file. Fields which contain normative information are so indicated.</p>
 206
 207
 208
 209 <table BORDER="1" CELLSPACING="2" CELLPADDING="2">
 210
 211   <tr>
 212
 213     <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Field</th>
 214
 215     <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Name</th>
 216
 217     <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Status</th>
 218
 219     <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Explanation</th>
 220
 221   </tr>
 222
 223   <tr>
 224
 225     <th VALIGN="top">0</th>
 226
 227     <td VALIGN="top">Code value</td>
 228
 229     <td VALIGN="top">normative</td>
 230
 231     <td VALIGN="top">Code value in 4-digit hexadecimal format.</td>
 232
 233   </tr>
 234
 235   <tr>
 236
 237     <th VALIGN="top">1</th>
 238
 239     <td VALIGN="top">Character name</td>
 240
 241     <td VALIGN="top">normative</td>
 242
 243     <td VALIGN="top">These names match exactly the names published in Chapter 14 of the
 244
 245     Unicode Standard, Version 3.0.</td>
 246
 247   </tr>
 248
 249   <tr>
 250
 251     <th VALIGN="top">2</th>
 252
 253     <td VALIGN="top"><a HREF="#General Category">General Category</a> </td>
 254
 255     <td VALIGN="top">normative / informative<br>
 256
 257     (see below)</td>
 258
 259     <td VALIGN="top">This is a useful breakdown into various &quot;character types&quot; which
 260
 261     can be used as a default categorization in implementations. See below for a brief
 262
 263     explanation.</td>
 264
 265   </tr>
 266
 267   <tr>
 268
 269     <th VALIGN="top">3</th>
 270
 271     <td VALIGN="top"><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </td>
 272
 273     <td VALIGN="top">normative</td>
 274
 275     <td VALIGN="top">The classes used for the Canonical Ordering Algorithm in the Unicode
 276
 277     Standard. These classes are also printed in Chapter 4 of the Unicode Standard.</td>
 278
 279   </tr>
 280
 281   <tr>
 282
 283     <th VALIGN="top">4</th>
 284
 285     <td VALIGN="top"><a HREF="#Bidirectional Category">Bidirectional Category</a> </td>
 286
 287     <td VALIGN="top">normative</td>
 288
 289     <td VALIGN="top">See the list below for an explanation of the abbreviations used in this
 290
 291     field. These are the categories required by the Bidirectional Behavior Algorithm in the
 292
 293     Unicode Standard. These categories are summarized in Chapter 3 of the Unicode Standard.</td>
 294
 295   </tr>
 296
 297   <tr>
 298
 299     <th VALIGN="top">5</th>
 300
 301     <td VALIGN="top"><a HREF="#Character Decomposition">Character Decomposition
 302       Mapping</a></td>
 303
 304     <td VALIGN="top">normative</td>
 305
 306     <td VALIGN="top">In the Unicode Standard, not all of the mappings are full (maximal)
 307
 308     decompositions. Recursive application of look-up for decompositions will, in all cases,
 309
 310     lead to a maximal decomposition. The decomposition mappings match exactly the
 311
 312     decomposition mappings published with the character names in the Unicode Standard.</td>
 313
 314   </tr>
 315
 316   <tr>
 317
 318     <th VALIGN="top">6</th>
 319
 320     <td VALIGN="top">Decimal digit value</td>
 321
 322     <td VALIGN="top">normative</td>
 323
 324     <td VALIGN="top">This is a numeric field. If the character has the decimal digit property,
 325
 326     as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented
 327
 328     with an integer value in this field.</td>
 329
 330   </tr>
 331
 332   <tr>
 333
 334     <th VALIGN="top">7</th>
 335
 336     <td VALIGN="top">Digit value</td>
 337
 338     <td VALIGN="top">normative</td>
 339
 340     <td VALIGN="top">This is a numeric field. If the character represents a digit, not
 341
 342     necessarily a decimal digit, the value is here. This covers digits which do not form
 343
 344     decimal radix forms, such as the compatibility superscript digits.</td>
 345
 346   </tr>
 347
 348   <tr>
 349
 350     <th VALIGN="top">8</th>
 351
 352     <td VALIGN="top">Numeric value</td>
 353
 354     <td VALIGN="top">normative</td>
 355
 356     <td VALIGN="top">This is a numeric field. If the character has the numeric property, as
 357
 358     specified in Chapter 4 of the Unicode Standard, the value of that character is represented
 359
 360     with an integer or rational number in this field. This includes fractions such as
 361
 362     &quot;1/5&quot; for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values
 363
 364     for compatibility characters such as circled numbers.</td>
 365
 366   </tr>
 367
 368   <tr>
 369
 370     <th VALIGN="top">9</th>
 371
 372     <td VALIGN="top">Mirrored</td>
 373
 374     <td VALIGN="top">normative</td>
 375
 376     <td VALIGN="top">If the character has been identified as a &quot;mirrored&quot; character
 377
 378     in bidirectional text, this field has the value &quot;Y&quot;; otherwise &quot;N&quot;.
 379
 380     The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard.</td>
 381
 382   </tr>
 383
 384   <tr>
 385
 386     <th VALIGN="top">10</th>
 387
 388     <td VALIGN="top">Unicode 1.0 Name</td>
 389
 390     <td VALIGN="top">informative</td>
 391
 392     <td VALIGN="top">This is the old name as published in Unicode 1.0. This name is only
 393
 394     provided when it is significantly different from the Unicode 3.0 name for the character.</td>
 395
 396   </tr>
 397
 398   <tr>
 399
 400     <th VALIGN="top">11</th>
 401
 402     <td VALIGN="top">10646 comment field</td>
 403
 404     <td VALIGN="top">informative</td>
 405
 406     <td VALIGN="top">This is the ISO 10646 comment field. It is in parentheses in the 10646
 407
 408     names list.</td>
 409
 410   </tr>
 411
 412   <tr>
 413
 414     <th VALIGN="top">12</th>
 415
 416     <td VALIGN="top"><a HREF="#Case Mappings">Uppercase Mapping</a></td>
 417
 418     <td VALIGN="top">informative</td>
 419
 420     <td VALIGN="top">Upper case equivalent mapping. If a character is part of an alphabet with
 421
 422     case distinctions, and has an upper case equivalent, then the upper case equivalent is in
 423
 424     this field. See the explanation below on case distinctions. These mappings are always
 425
 426     one-to-one, not one-to-many or many-to-one. This field is informative.</td>
 427
 428   </tr>
 429
 430   <tr>
 431
 432     <th VALIGN="top">13</th>
 433
 434     <td VALIGN="top"><a HREF="#Case Mappings">Lowercase Mapping</a></td>
 435
 436     <td VALIGN="top">informative</td>
 437
 438     <td VALIGN="top">Similar to Uppercase mapping</td>
 439
 440   </tr>
 441
 442   <tr>
 443
 444     <th VALIGN="top">14</th>
 445
 446     <td VALIGN="top"><a HREF="#Case Mappings">Titlecase Mapping</a></td>
 447
 448     <td VALIGN="top">informative</td>
 449
 450     <td VALIGN="top">Similar to Uppercase mapping</td>
 451
 452   </tr>
 453
 454 </table>
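<p>The field layout in the table above can be sketched in Python as follows. This is an
illustration under stated assumptions, not a reference parser: the key names in
<code>FIELD_NAMES</code> and the function <code>parse_entry</code> are hypothetical, the two
sample lines were written for this sketch in the file's format, and only a few fields are
converted from text.</p>

```python
from fractions import Fraction

# One key per field, in the order of the table above (fields 0 to 14).
# The key names are hypothetical; the file itself carries no field names.
FIELD_NAMES = [
    "code_value", "name", "general_category", "combining_class",
    "bidi_category", "decomposition", "decimal_digit", "digit",
    "numeric", "mirrored", "unicode_1_name", "comment_10646",
    "uppercase", "lowercase", "titlecase",
]


def parse_entry(line):
    """Split one semicolon-delimited data entry into a dict of fields."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError("expected %d fields, got %d"
                         % (len(FIELD_NAMES), len(fields)))
    entry = dict(zip(FIELD_NAMES, fields))
    entry["code_value"] = int(entry["code_value"], 16)      # hexadecimal
    entry["combining_class"] = int(entry["combining_class"])
    entry["mirrored"] = entry["mirrored"] == "Y"            # "Y" or "N"
    # The numeric field holds an integer or a rational such as "1/5".
    entry["numeric"] = Fraction(entry["numeric"]) if entry["numeric"] else None
    return entry


# Sample lines written for this sketch.
SAMPLE_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
SAMPLE_FIFTH = ("2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
                "<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;")
```

<p>The remaining fields are left as strings so that an empty field stays distinguishable from
a value.</p>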
 455
 456
 457
 458 <h3><a NAME="General Category"></a>General Category</h3>
 459
 460
 461
 462 <p>The values in this field are abbreviations for the following. Some of the values are
 463
 464 normative, and some are informative. For more information, see the Unicode Standard.</p>
 465
 466
 467
 468 <p><b>Note:</b> the standard does not assign information to control characters (except for
 469
 470 certain cases in the Bidirectional Algorithm). Implementations will generally also assign
 471
 472 categories to certain control characters, notably CR and LF, according to platform
 473
 474 conventions.</p>
 475
 476
 477
 478 <h4>Normative Categories</h4>
 479
 480
 481
 482 <table BORDER="0" CELLSPACING="2" CELLPADDING="0">
 483
 484   <tr>
 485
 486     <th><p ALIGN="LEFT">Abbr.</th>
 487
 488     <th><p ALIGN="LEFT">Description</th>
 489
 490   </tr>
 491
 492   <tr>
 493
 494     <td ALIGN="CENTER">Lu</td>
 495
 496     <td>Letter, Uppercase</td>
 497
 498   </tr>
 499
 500   <tr>
 501
 502     <td ALIGN="CENTER">Ll</td>
 503
 504     <td>Letter, Lowercase</td>
 505
 506   </tr>
 507
 508   <tr>
 509
 510     <td ALIGN="CENTER">Lt</td>
 511
 512     <td>Letter, Titlecase</td>
 513
 514   </tr>
 515
 516   <tr>
 517
 518     <td ALIGN="CENTER">Mn</td>
 519
 520     <td>Mark, Non-Spacing</td>
 521
 522   </tr>
 523
 524   <tr>
 525
 526     <td ALIGN="CENTER">Mc</td>
 527
 528     <td>Mark, Spacing Combining</td>
 529
 530   </tr>
 531
 532   <tr>
 533
 534     <td ALIGN="CENTER">Me</td>
 535
 536     <td>Mark, Enclosing</td>
 537
 538   </tr>
 539
 540   <tr>
 541
 542     <td ALIGN="CENTER">Nd</td>
 543
 544     <td>Number, Decimal Digit</td>
 545
 546   </tr>
 547
 548   <tr>
 549
 550     <td ALIGN="CENTER">Nl</td>
 551
 552     <td>Number, Letter</td>
 553
 554   </tr>
 555
 556   <tr>
 557
 558     <td ALIGN="CENTER">No</td>
 559
 560     <td>Number, Other</td>
 561
 562   </tr>
 563
 564   <tr>
 565
 566     <td ALIGN="CENTER">Zs</td>
 567
 568     <td>Separator, Space</td>
 569
 570   </tr>
 571
 572   <tr>
 573
 574     <td ALIGN="CENTER">Zl</td>
 575
 576     <td>Separator, Line</td>
 577
 578   </tr>
 579
 580   <tr>
 581
 582     <td ALIGN="CENTER">Zp</td>
 583
 584     <td>Separator, Paragraph</td>
 585
 586   </tr>
 587
 588   <tr>
 589
 590     <td ALIGN="CENTER">Cc</td>
 591
 592     <td>Other, Control</td>
 593
 594   </tr>
 595
 596   <tr>
 597
 598     <td ALIGN="CENTER">Cf</td>
 599
 600     <td>Other, Format</td>
 601
 602   </tr>
 603
 604   <tr>
 605
 606     <td ALIGN="CENTER">Cs</td>
 607
 608     <td>Other, Surrogate</td>
 609
 610   </tr>
 611
 612   <tr>
 613
 614     <td ALIGN="CENTER">Co</td>
 615
 616     <td>Other, Private Use</td>
 617
 618   </tr>
 619
 620   <tr>
 621
 622     <td ALIGN="CENTER">Cn</td>
 623
 624     <td>Other, Not Assigned (no characters in the file have this property)</td>
 625
 626   </tr>
 627
 628 </table>
 629
 630
 631
 632 <h4>Informative Categories</h4>
 633
 634
 635
 636 <table BORDER="0" CELLSPACING="2" CELLPADDING="0">
 637
 638   <tr>
 639
 640     <th><p ALIGN="LEFT">Abbr.</th>
 641
 642     <th><p ALIGN="LEFT">Description</th>
 643
 644   </tr>
 645
 646   <tr>
 647
 648     <td ALIGN="CENTER">Lm</td>
 649
 650     <td>Letter, Modifier</td>
 651
 652   </tr>
 653
 654   <tr>
 655
 656     <td ALIGN="CENTER">Lo</td>
 657
 658     <td>Letter, Other</td>
 659
 660   </tr>
 661
 662   <tr>
 663
 664     <td ALIGN="CENTER">Pc</td>
 665
 666     <td>Punctuation, Connector</td>
 667
 668   </tr>
 669
 670   <tr>
 671
 672     <td ALIGN="CENTER">Pd</td>
 673
 674     <td>Punctuation, Dash</td>
 675
 676   </tr>
 677
 678   <tr>
 679
 680     <td ALIGN="CENTER">Ps</td>
 681
 682     <td>Punctuation, Open</td>
 683
 684   </tr>
 685
 686   <tr>
 687
 688     <td ALIGN="CENTER">Pe</td>
 689
 690     <td>Punctuation, Close</td>
 691
 692   </tr>
 693
 694   <tr>
 695
 696     <td ALIGN="CENTER">Pi</td>
 697
 698     <td>Punctuation, Initial quote (may behave like Ps or Pe depending on usage)</td>
 699
 700   </tr>
 701
 702   <tr>
 703
 704     <td ALIGN="CENTER">Pf</td>
 705
 706     <td>Punctuation, Final quote (may behave like Ps or Pe depending on usage)</td>
 707
 708   </tr>
 709
 710   <tr>
 711
 712     <td ALIGN="CENTER">Po</td>
 713
 714     <td>Punctuation, Other</td>
 715
 716   </tr>
 717
 718   <tr>
 719
 720     <td ALIGN="CENTER">Sm</td>
 721
 722     <td>Symbol, Math</td>
 723
 724   </tr>
 725
 726   <tr>
 727
 728     <td ALIGN="CENTER">Sc</td>
 729
 730     <td>Symbol, Currency</td>
 731
 732   </tr>
 733
 734   <tr>
 735
 736     <td ALIGN="CENTER">Sk</td>
 737
 738     <td>Symbol, Modifier</td>
 739
 740   </tr>
 741
 742   <tr>
 743
 744     <td ALIGN="CENTER">So</td>
 745
 746     <td>Symbol, Other</td>
 747
 748   </tr>
 749
 750 </table>
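<p>The two tables above can be captured in a small lookup, sketched here in Python. The set
names and the function <code>describe_category</code> are hypothetical. The major class is
taken from the first letter of each abbreviation, as the descriptions in the tables
suggest.</p>

```python
# Abbreviations copied from the Normative and Informative tables above.
NORMATIVE = {
    "Lu", "Ll", "Lt", "Mn", "Mc", "Me", "Nd", "Nl", "No",
    "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
}
INFORMATIVE = {
    "Lm", "Lo", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
}

# First letter of the abbreviation, mapped to the leading word of the
# description in the tables.
MAJOR_CLASS = {
    "L": "Letter", "M": "Mark", "N": "Number", "Z": "Separator",
    "C": "Other", "P": "Punctuation", "S": "Symbol",
}


def describe_category(abbr):
    """Return (major class, status) for a General Category abbreviation."""
    if abbr in NORMATIVE:
        status = "normative"
    elif abbr in INFORMATIVE:
        status = "informative"
    else:
        raise ValueError("unknown General Category: %r" % abbr)
    return MAJOR_CLASS[abbr[0]], status
```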
 751
 752
 753
 754 <h3><a NAME="Bidirectional Category"></a>Bidirectional Category</h3>
 755
 756
 757
 758 <p>Please refer to Chapter 3 for an explanation of the algorithm for Bidirectional
 759
 760 Behavior and an explanation of the significance of these categories. An up-to-date version
 761
 762 can be found on <a HREF="http://www.unicode.org/unicode/reports/tr9/">Unicode Technical
 763
 764 Report #9: The Bidirectional Algorithm</a>. These values are normative.</p>
 765
 766
 767
 768 <table BORDER="0" CELLPADDING="2">
 769
 770   <tr>
 771
 772     <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Type</th>
 773
 774     <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Description</th>
 775
 776   </tr>
 777
 778   <tr>
 779
 780     <td VALIGN="TOP"><b>L</b></td>
 781
 782     <td VALIGN="TOP">Left-to-Right</td>
 783
 784   </tr>
 785
 786   <tr>
 787
 788     <td VALIGN="TOP"><b>LRE</b></td>
 789
 790     <td VALIGN="TOP">Left-to-Right Embedding</td>
 791
 792   </tr>
 793
 794   <tr>
 795
 796     <td VALIGN="TOP"><b>LRO</b></td>
 797
 798     <td VALIGN="TOP">Left-to-Right Override</td>
 799
 800   </tr>
 801
 802   <tr>
 803
 804     <td VALIGN="TOP"><b>R</b></td>
 805
 806     <td VALIGN="TOP">Right-to-Left</td>
 807
 808   </tr>
 809
 810   <tr>
 811
 812     <td VALIGN="TOP"><b>AL</b></td>
 813
 814     <td VALIGN="TOP">Right-to-Left Arabic</td>
 815
 816   </tr>
 817
 818   <tr>
 819
 820     <td VALIGN="TOP"><b>RLE</b></td>
 821
 822     <td VALIGN="TOP">Right-to-Left Embedding</td>
 823
 824   </tr>
 825
 826   <tr>
 827
 828     <td VALIGN="TOP"><b>RLO</b></td>
 829
 830     <td VALIGN="TOP">Right-to-Left Override</td>
 831
 832   </tr>
 833
 834   <tr>
 835
 836     <td VALIGN="TOP"><b>PDF</b></td>
 837
 838     <td VALIGN="TOP">Pop Directional Format</td>
 839
 840   </tr>
 841
 842   <tr>
 843
 844     <td VALIGN="TOP"><b>EN</b></td>
 845
 846     <td VALIGN="TOP">European Number</td>
 847
 848   </tr>
 849
 850   <tr>
 851
 852     <td VALIGN="TOP"><b>ES</b></td>
 853
 854     <td VALIGN="TOP">European Number Separator</td>
 855
 856   </tr>
 857
 858   <tr>
 859
 860     <td VALIGN="TOP"><b>ET</b></td>
 861
 862     <td VALIGN="TOP">European Number Terminator</td>
 863
 864   </tr>
 865
 866   <tr>
 867
 868     <td VALIGN="TOP"><b>AN</b></td>
 869
 870     <td VALIGN="TOP">Arabic Number</td>
 871
 872   </tr>
 873
 874   <tr>
 875
 876     <td VALIGN="TOP"><b>CS</b></td>
 877
 878     <td VALIGN="TOP">Common Number Separator</td>
 879
 880   </tr>
 881
 882   <tr>
 883
 884     <td VALIGN="TOP"><b>NSM</b></td>
 885
 886     <td VALIGN="TOP">Non-Spacing Mark</td>
 887
 888   </tr>
 889
 890   <tr>
 891
 892     <td VALIGN="TOP"><b>BN</b></td>
 893
 894     <td VALIGN="TOP">Boundary Neutral</td>
 895
 896   </tr>
 897
 898   <tr>
 899
 900     <td VALIGN="TOP"><b>B</b></td>
 901
 902     <td VALIGN="TOP">Paragraph Separator</td>
 903
 904   </tr>
 905
 906   <tr>
 907
 908     <td VALIGN="TOP"><b>S</b></td>
 909
 910     <td VALIGN="TOP">Segment Separator</td>
 911
 912   </tr>
 913
 914   <tr>
 915
 916     <td VALIGN="TOP"><b>WS</b></td>
 917
 918     <td VALIGN="TOP">Whitespace</td>
 919
 920   </tr>
 921
 922   <tr>
 923
 924     <td VALIGN="TOP"><b>ON</b></td>
 925
 926     <td VALIGN="TOP">Other Neutrals</td>
 927
 928   </tr>
 929
 930 </table>
 931
 932
 933
 934 <h3><a NAME="Character Decomposition"></a>Character Decomposition Mapping</h3>
 935
 936
 937
 938 <p>The decomposition is a normative property of a character. The tags supplied with
 939
 940 certain decomposition mappings generally indicate formatting information. Where no such
 941
 942 tag is given, the mapping is designated as canonical. Conversely, the presence of a
 943
 944 formatting tag also indicates that the mapping is a compatibility mapping and not a
 945
 946 canonical mapping. In the absence of other formatting information in a compatibility
 947
 948 mapping, the tag is used to distinguish it from canonical mappings.</p>
 949
 950
 951
 952 <p>In some instances a canonical mapping or a compatibility mapping may consist of a
 953
 954 single character. For a canonical mapping, this indicates that the character is a
 955
 956 canonical equivalent of another single character. For a compatibility mapping, this
 957
 958 indicates that the character is a compatibility equivalent of another single character.
 959
 960 The compatibility formatting tags used are:</p>
 961
 962
 963
 964 <table BORDER="0" CELLSPACING="2" CELLPADDING="0">
 965
 966   <tr>
 967
 968     <th>Tag</th>
 969
 970     <th><p ALIGN="LEFT">Description</th>
 971
 972   </tr>
 973
 974   <tr>
 975
 976     <td ALIGN="CENTER">&lt;font&gt;&nbsp;&nbsp;</td>
 977
 978     <td>A font variant (e.g. a blackletter form).</td>
 979
 980   </tr>
 981
 982   <tr>
 983
 984     <td ALIGN="CENTER">&lt;noBreak&gt;&nbsp;&nbsp;</td>
 985
 986     <td>A no-break version of a space or hyphen.</td>
 987
 988   </tr>
 989
 990   <tr>
 991
 992     <td ALIGN="CENTER">&lt;initial&gt;&nbsp;&nbsp;</td>
 993
 994     <td>An initial presentation form (Arabic).</td>
 995
 996   </tr>
 997
 998   <tr>
 999
1000     <td ALIGN="CENTER">&lt;medial&gt;&nbsp;&nbsp;</td>
1001
1002     <td>A medial presentation form (Arabic).</td>
1003
1004   </tr>
1005
1006   <tr>
1007
1008     <td ALIGN="CENTER">&lt;final&gt;&nbsp;&nbsp;</td>
1009
1010     <td>A final presentation form (Arabic).</td>
1011
1012   </tr>
1013
1014   <tr>
1015
1016     <td ALIGN="CENTER">&lt;isolated&gt;&nbsp;&nbsp;</td>
1017
1018     <td>An isolated presentation form (Arabic).</td>
1019
1020   </tr>
1021
1022   <tr>
1023
1024     <td ALIGN="CENTER">&lt;circle&gt;&nbsp;&nbsp;</td>
1025
1026     <td>An encircled form.</td>
1027
1028   </tr>
1029
1030   <tr>
1031
1032     <td ALIGN="CENTER">&lt;super&gt;&nbsp;&nbsp;</td>
1033
1034     <td>A superscript form.</td>
1035
1036   </tr>
1037
1038   <tr>
1039
1040     <td ALIGN="CENTER">&lt;sub&gt;&nbsp;&nbsp;</td>
1041
1042     <td>A subscript form.</td>
1043
1044   </tr>
1045
1046   <tr>
1047
1048     <td ALIGN="CENTER">&lt;vertical&gt;&nbsp;&nbsp;</td>
1049
1050     <td>A vertical layout presentation form.</td>
1051
1052   </tr>
1053
1054   <tr>
1055
1056     <td ALIGN="CENTER">&lt;wide&gt;&nbsp;&nbsp;</td>
1057
1058     <td>A wide (or zenkaku) compatibility character.</td>
1059
1060   </tr>
1061
1062   <tr>
1063
1064     <td ALIGN="CENTER">&lt;narrow&gt;&nbsp;&nbsp;</td>
1065
1066     <td>A narrow (or hankaku) compatibility character.</td>
1067
1068   </tr>
1069
1070   <tr>
1071
1072     <td ALIGN="CENTER">&lt;small&gt;&nbsp;&nbsp;</td>
1073
1074     <td>A small variant form (CNS compatibility).</td>
1075
1076   </tr>
1077
1078   <tr>
1079
1080     <td ALIGN="CENTER">&lt;square&gt;&nbsp;&nbsp;</td>
1081
1082     <td>A CJK squared font variant.</td>
1083
1084   </tr>
1085
1086   <tr>
1087
1088     <td ALIGN="CENTER">&lt;fraction&gt;&nbsp;&nbsp;</td>
1089
1090     <td>A vulgar fraction form.</td>
1091
1092   </tr>
1093
1094   <tr>
1095
1096     <td ALIGN="CENTER">&lt;compat&gt;&nbsp;&nbsp;</td>
1097
1098     <td>Otherwise unspecified compatibility character.</td>
1099
1100   </tr>
1101
1102 </table>
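<p>A decomposition mapping field can be split into its optional tag and its code values as
sketched below. The function name <code>parse_decomposition</code> is hypothetical. The sketch
assumes the tag, when present, is the first space-separated token and is enclosed in angle
brackets, as in the table above.</p>

```python
def parse_decomposition(field):
    """Split a decomposition mapping field into (tag, code values).

    tag is None for a canonical mapping and the tag name (without angle
    brackets) for a compatibility mapping. An empty field yields (None, []).
    """
    if not field:
        return None, []
    parts = field.split()
    tag = None
    if parts[0].startswith("<") and parts[0].endswith(">"):
        tag = parts[0][1:-1]
        parts = parts[1:]
    return tag, [int(part, 16) for part in parts]
```

<p>For example, <code>parse_decomposition("&lt;compat&gt; 0066 0069")</code> yields the tag
<code>compat</code> with two code values, while an untagged field yields a tag of
<code>None</code>, which marks the mapping as canonical.</p>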
1103
1104
1105
1106 <p><b>Reminder: </b>There is a difference between decomposition and decomposition mapping.
1107
1108 The decomposition mappings are defined in the UnicodeData, while the decomposition (also
1109
1110 termed &quot;full decomposition&quot;) is defined in Chapter 3 to use those mappings
1111 <i>
1112
1113 recursively.</i>
1114
1115
1116
1117 <ul>
1118
1119   <li>The canonical decomposition is formed by recursively applying the canonical mappings,
1120
1121     then applying the canonical reordering algorithm. </li>
1122
1123   <li>The compatibility decomposition is formed by recursively applying the canonical <em>and</em>
1124
1125     compatibility mappings, then applying the canonical reordering algorithm. </li>
1126
1127 </ul>
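<p>The recursive application of mappings described above can be sketched as follows. The table
<code>MAPPINGS</code> is a small hypothetical excerpt assembled for this example, and
<code>expand</code> is a hypothetical name. The sketch covers only the recursive step; the
canonical reordering step is not shown, and Hangul syllables, whose decompositions are derived
algorithmically, are not handled.</p>

```python
# Hypothetical excerpt: code value -> (tag, mapped code values).
# A tag of None marks a canonical mapping; any other tag marks a
# compatibility mapping.
MAPPINGS = {
    0x1E69: (None, [0x1E63, 0x0307]),      # s with dot below and dot above
    0x1E63: (None, [0x0073, 0x0323]),      # s with dot below
    0xFB01: ("compat", [0x0066, 0x0069]),  # fi ligature
}


def expand(code_value, compatibility=False):
    """Recursively apply decomposition mappings to one code value.

    With compatibility=False only canonical (untagged) mappings are applied;
    with compatibility=True tagged mappings are applied as well.
    """
    tag, targets = MAPPINGS.get(code_value, (None, []))
    if not targets or (tag is not None and not compatibility):
        return [code_value]
    result = []
    for target in targets:
        result.extend(expand(target, compatibility))
    return result
```

<p>Here <code>expand(0x1E69)</code> takes two levels of look-up to reach 0073 0323 0307, which
illustrates why a single mapping is not necessarily a full decomposition.</p>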
1128
1129
1130
1131 <h3><a NAME="Canonical Combining Classes"></a>Canonical Combining Classes</h3>
1132
1133
1134
1135 <table BORDER="0" CELLSPACING="2" CELLPADDING="0">
1136
1137   <tr>
1138
1139     <th><p ALIGN="LEFT">Value</th>
1140
1141     <th><p ALIGN="LEFT">Description</th>
1142
1143   </tr>
1144
1145   <tr>
1146
1147     <td ALIGN="RIGHT">0:</td>
1148
1149     <td>Spacing, split, enclosing, reordrant, and Tibetan subjoined</td>
1150
1151   </tr>
1152
1153   <tr>
1154
1155     <td ALIGN="RIGHT">1:</td>
1156
1157     <td>Overlays and interior</td>
1158
1159   </tr>
1160
1161   <tr>
1162
1163     <td ALIGN="RIGHT">7:</td>
1164
1165     <td>Nuktas</td>
1166
1167   </tr>
1168
1169   <tr>
1170
1171     <td ALIGN="RIGHT">8:</td>
1172
1173     <td>Hiragana/Katakana voicing marks</td>
1174
1175   </tr>
1176
1177   <tr>
1178
1179     <td ALIGN="RIGHT">9:</td>
1180
1181     <td>Viramas</td>
1182
1183   </tr>
1184
1185   <tr>
1186
1187     <td ALIGN="RIGHT">10:</td>
1188
1189     <td>Start of fixed position classes</td>
1190
1191   </tr>
1192
1193   <tr>
1194
1195     <td ALIGN="RIGHT">199:</td>
1196
1197     <td>End of fixed position classes</td>
1198
1199   </tr>
1200
1201   <tr>
1202
1203     <td ALIGN="RIGHT">200:</td>
1204
1205     <td>Below left attached</td>
1206
1207   </tr>
1208
1209   <tr>
1210
1211     <td ALIGN="RIGHT">202:</td>
1212
1213     <td>Below attached</td>
1214
1215   </tr>
1216
1217   <tr>
1218
1219     <td ALIGN="RIGHT">204:</td>
1220
1221     <td>Below right attached</td>
1222
1223   </tr>
1224
1225   <tr>
1226
1227     <td ALIGN="RIGHT">208:</td>
1228
1229     <td>Left attached (reordrant around single base character)</td>
1230
1231   </tr>
1232
1233   <tr>
1234
1235     <td ALIGN="RIGHT">210:</td>
1236
1237     <td>Right attached</td>
1238
1239   </tr>
1240
1241   <tr>
1242
1243     <td ALIGN="RIGHT">212:</td>
1244
1245     <td>Above left attached</td>
1246
1247   </tr>
1248
1249   <tr>
1250
1251     <td ALIGN="RIGHT">214:</td>
1252
1253     <td>Above attached</td>
1254
1255   </tr>
1256
1257   <tr>
1258
1259     <td ALIGN="RIGHT">216:</td>
1260
1261     <td>Above right attached</td>
1262
1263   </tr>
1264
1265   <tr>
1266
1267     <td ALIGN="RIGHT">218:</td>
1268
1269     <td>Below left</td>
1270
1271   </tr>
1272
1273   <tr>
1274
1275     <td ALIGN="RIGHT">220:</td>
1276
1277     <td>Below</td>
1278
1279   </tr>
1280
1281   <tr>
1282
1283     <td ALIGN="RIGHT">222:</td>
1284
1285     <td>Below right</td>
1286
1287   </tr>
1288
1289   <tr>
1290
1291     <td ALIGN="RIGHT">224:</td>
1292
1293     <td>Left (reordrant around single base character)</td>
1294
1295   </tr>
1296
1297   <tr>
1298
1299     <td ALIGN="RIGHT">226:</td>
1300
1301     <td>Right</td>
1302
1303   </tr>
1304
1305   <tr>
1306
1307     <td ALIGN="RIGHT">228:</td>
1308
1309     <td>Above left</td>
1310
1311   </tr>
1312
1313   <tr>
1314
1315     <td ALIGN="RIGHT">230:</td>
1316
1317     <td>Above</td>
1318
1319   </tr>
1320
1321   <tr>
1322
1323     <td ALIGN="RIGHT">232:</td>
1324
1325     <td>Above right</td>
1326
1327   </tr>
1328
1329   <tr>
1330
1331     <td ALIGN="RIGHT">233:</td>
1332
1333     <td>Double below</td>
1334
1335   </tr>
1336
1337   <tr>
1338
1339     <td ALIGN="RIGHT">234:</td>
1340
1341     <td>Double above</td>
1342
1343   </tr>
1344
1345   <tr>
1346
1347     <td ALIGN="RIGHT">240:</td>
1348
1349     <td>Below (iota subscript)</td>
1350
1351   </tr>
1352
1353 </table>
1354
1355
1356
1357 <p><strong>Note: </strong>some of the combining classes in this list do not currently have
1358
1359 members but are specified here for completeness.</p>
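<p>The combining classes drive canonical ordering: within each run of characters whose class is
non-zero, characters are arranged in ascending class order, and characters of equal class keep
their relative order. The following Python sketch illustrates the idea. The table
<code>COMBINING_CLASS</code> is a two-entry hypothetical excerpt, the function name is an
assumption, and Chapter 3 of the Unicode Standard remains the authoritative definition.</p>

```python
# Hypothetical excerpt of field 3; any code value not listed is treated
# as class 0 for the purposes of this sketch.
COMBINING_CLASS = {
    0x0301: 230,  # COMBINING ACUTE ACCENT (above)
    0x0323: 220,  # COMBINING DOT BELOW (below)
}


def canonical_reorder(code_points):
    """Sort each run of non-zero-class characters by combining class."""
    result = list(code_points)
    start = 0
    while start < len(result):
        if COMBINING_CLASS.get(result[start], 0) == 0:
            start += 1
            continue
        # Find the end of the run of characters with non-zero class.
        end = start
        while end < len(result) and COMBINING_CLASS.get(result[end], 0) != 0:
            end += 1
        # sorted() is stable, so marks of equal class keep their order.
        result[start:end] = sorted(result[start:end],
                                   key=lambda cp: COMBINING_CLASS[cp])
        start = end
    return result
```

<p>Characters of class 0 are never moved, so reordering never crosses a base character.</p>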
1360
1361
1362
1363 <h3><a NAME="Decompositions and Normalization"></a>Decompositions and Normalization</h3>
1364
1365
1366
1367 <p>Decomposition is specified in Chapter 3. <a href="http://www.unicode.org/unicode/reports/tr15/"><i>Unicode Technical Report #15:
1368
1369 Normalization Forms</i></a> specifies the interaction between decomposition and normalization. The
1370
1371 most up-to-date version is found on <a HREF="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15/</a>.
1372
1373 That report specifies how the decompositions defined in UnicodeData.txt are used to derive
1374
1375 normalized forms of Unicode text.</p>
1376
1377
1378
1379 <p>Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions
1380
1381 in the UnicodeData.txt file can be used to recursively derive the full decomposition in
1382
1383 canonical order, without the need to separately apply canonical reordering. However,
1384
1385 canonical reordering of combining character sequences must still be applied in
1386
1387 decomposition when normalizing source text which contains any combining marks.</p>
1388
1389
1390
1391 <h3><a NAME="Case Mappings"></a>Case Mappings</h3>
1392
1393
1394
1395 <p>The case mapping is an informative, default mapping. Case itself, on the other hand,
1396
1397 has normative status. Thus, for example, 0041 LATIN CAPITAL LETTER A is normatively
1398
1399 uppercase, but its lowercase mapping to 0061 LATIN SMALL LETTER A is informative. The
1400
1401 reason for this is that case can be considered to be an inherent property of a particular
1402
1403 character (and is usually, but not always, derivable from the presence of the terms
1404
1405 &quot;CAPITAL&quot; or &quot;SMALL&quot; in the character name), but case mappings between
1406
1407 characters are occasionally influenced by local conventions. For example, certain
1408
1409 languages, such as Turkish, German, French, or Greek may have small deviations from the
1410
1411 default mappings listed in UnicodeData.</p>
1412
1413
1414
1415 <p>In addition to uppercase and lowercase, because of the inclusion of certain composite
1416
1417 characters for compatibility, such as 01F1 LATIN CAPITAL LETTER DZ, there is a third case,
1418
1419 called <i>titlecase</i>, which is used where the first letter of a word is to be
1420
1421 capitalized (e.g. UPPERCASE, Titlecase, lowercase). An example of such a titlecase letter
1422
1423 is 01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.</p>
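<p>The three case mapping fields can be consulted as sketched below. The table
<code>CASE_FIELDS</code> is a hypothetical excerpt written for illustration and should be
checked against the actual data file. The function name <code>map_case</code> and the choice to
return the character itself when a field is empty are assumptions of this sketch.</p>

```python
# Hypothetical excerpt: code value -> (uppercase, lowercase, titlecase)
# fields as hexadecimal strings. An empty string means no mapping is listed.
CASE_FIELDS = {
    0x01F1: ("", "01F3", "01F2"),      # DZ
    0x01F2: ("01F1", "01F3", ""),      # Dz
    0x01F3: ("01F1", "", "01F2"),      # dz
}

FIELD_INDEX = {"upper": 0, "lower": 1, "title": 2}


def map_case(code_value, which):
    """Return the default one-to-one case mapping of code_value.

    which is "upper", "lower" or "title". When the field is empty the
    character is returned unchanged (an assumption of this sketch).
    """
    fields = CASE_FIELDS.get(code_value, ("", "", ""))
    field = fields[FIELD_INDEX[which]]
    return int(field, 16) if field else code_value
```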
1424
1425
1426
1427 <p>The uppercase, titlecase and lowercase fields are only included for characters that
1428
1429 have a single corresponding character of that type. Composite characters (such as
1430
1431 &quot;339D SQUARE CM&quot;) that do not have a single corresponding character of that type
1432
1433 can be cased by decomposition.</p>
1434
1435
1436
1437 <p>For compatibility with existing parsers, UnicodeData only contains case mappings for
1438
1439 characters where they are one-to-one mappings; it also omits information about
1440
1441 context-sensitive case mappings. Information about these special cases can be found in a
1442
1443 separate data file, SpecialCasing.txt,
1444
1445 which has been added starting with the 2.1.8 update to the Unicode data files.
1446
1447 SpecialCasing.txt contains additional informative case mappings that are either not
1448
1449 one-to-one or which are context-sensitive.</p>
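<p>The three case forms and the limits of one-to-one mappings can be illustrated with a
minimal Python sketch. It relies on the case tables built into the Python runtime, which
follow a much later Unicode version than 3.0.0, so it is an illustration rather than a
reading of this release's data. The helper name <tt>case_forms</tt> is hypothetical.</p>

```python
import unicodedata

# The DZ digraph exists in all three case forms.
UPPER_DZ = "\u01F1"  # LATIN CAPITAL LETTER DZ
TITLE_DZ = "\u01F2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
LOWER_DZ = "\u01F3"  # LATIN SMALL LETTER DZ


def case_forms(ch):
    """Return the (upper, title, lower) forms of a single character."""
    return ch.upper(), ch.title(), ch.lower()


# Each member of the triple maps to the same three forms.
for ch in (UPPER_DZ, TITLE_DZ, LOWER_DZ):
    print(unicodedata.name(ch), case_forms(ch) == (UPPER_DZ, TITLE_DZ, LOWER_DZ))

# A mapping that is not one-to-one: uppercasing LATIN SMALL LETTER SHARP S
# yields two characters. Entries of this kind belong in SpecialCasing.txt,
# not in the single-character case fields of UnicodeData.txt.
print(len("\u00DF".upper()))
```

<p>A parser that reads only the UnicodeData case fields would see no uppercase entry for
00DF; the two-character result comes from the supplementary special-casing data.</p>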
1450
1451
1452
1453 <h2><a NAME="Property Invariants"></a>Property Invariants</h2>
1454
1455
1456
1457 <p>Values in UnicodeData.txt are subject to correction as errors are found; however, some
1458
1459 characteristics of the categories themselves can be considered invariants. Applications
1460
1461 may wish to take these invariants into account when choosing how to implement character
1462
1463 properties. The following is a partial list of known invariants for the Unicode Character
1464
1465 Database.</p>
1466
1467
1468
1469 <h4>Database Fields</h4>
1470
1471
1472
1473 <ul>
1474
1475   <li>The number of fields in UnicodeData.txt is fixed. </li>
1476
1477   <li>The order of the fields is also fixed. <ul>
1478
1479       <li>Any additional information about character properties to be added in the future will
1480
1481         appear in separate data tables, rather than being added on to the existing table or by
1482
1483         subdivision or reinterpretation of existing fields. </li>
1484
1485     </ul>
1486
1487   </li>
1488
1489 </ul>
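<p>Because the number and order of fields are fixed, a parser can validate each record
against a constant field count. The sketch below assumes 15 semicolon-separated fields
(numbered 0 through 14, as in the field table earlier in this document). The field names
in <tt>FIELD_NAMES</tt> and the function <tt>parse_record</tt> are hypothetical labels
chosen for illustration, and the sample line is illustrative.</p>

```python
# Assumption: 15 fields per record, in the fixed order of the field table.
FIELD_NAMES = (
    "code", "name", "general_category", "combining_class", "bidi_category",
    "decomposition", "decimal_digit", "digit", "numeric", "mirrored",
    "unicode_1_name", "comment", "uppercase", "lowercase", "titlecase",
)
FIELD_COUNT = len(FIELD_NAMES)


def parse_record(line):
    """Split one UnicodeData.txt line into a dict keyed by field name."""
    fields = line.rstrip("\r\n").split(";")
    if len(fields) != FIELD_COUNT:
        raise ValueError(
            "expected %d fields, found %d" % (FIELD_COUNT, len(fields))
        )
    return dict(zip(FIELD_NAMES, fields))


SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_record(SAMPLE)
print(record["general_category"], record["lowercase"])
```

<p>Rejecting records with an unexpected field count, rather than guessing, fits the
invariant: new properties are expected to arrive in separate data files, not as extra
fields.</p>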
1490
1491
1492
1493 <h4>General Category</h4>
1494
1495
1496
1497 <ul>
1498
1499   <li>There will never be more than 32 General Category values. <ul>
1500
1501       <li>It is very unlikely that the Unicode Technical Committee will subdivide the General
1502
1503         Category partition any further, since that can cause implementations to misbehave. Because
1504
1505         the General Category is limited to 32 values, 5 bits can be used to represent the
1506
1507         information, and a 32-bit integer can be used as a bitmask to represent arbitrary sets of
1508
1509         categories. </li>
1510
1511     </ul>
1512
1513   </li>
1514
1515 </ul>
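<p>The bitmask representation mentioned above can be sketched as follows. The bit order in
<tt>CATEGORIES</tt> is an arbitrary choice made for this example, not something defined by
the database, and the lookups use Python's built-in <tt>unicodedata</tt> module, which
reflects a later Unicode version than 3.0.0.</p>

```python
import unicodedata

# Arbitrary bit assignment; what matters is that there are at most 32 values.
CATEGORIES = (
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
)
assert len(CATEGORIES) <= 32

BIT = {name: 1 << index for index, name in enumerate(CATEGORIES)}


def category_set(*names):
    """Build a 32-bit mask representing a set of General Category values."""
    mask = 0
    for name in names:
        mask |= BIT[name]
    return mask


def in_set(ch, mask):
    """Test whether the General Category of ch belongs to the set."""
    return bool(BIT[unicodedata.category(ch)] & mask)


CASED = category_set("Lu", "Ll", "Lt")
print(in_set("A", CASED), in_set("1", CASED))
```

<p>Set union, intersection, and membership then reduce to single integer operations.</p>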
1516
1517
1518
1519 <h4>Combining Classes</h4>
1520
1521
1522
1523 <ul>
1524
1525   <li>Combining classes are limited to the values 0 to 255. <ul>
1526
1527       <li>In practice, there are far fewer than 256 values used. Implementations may take
1528
1529         advantage of this fact for compression, since only the ordering of the non-zero values
1530
1531         matters for the Canonical Reordering Algorithm. It is possible for up to 256 values to be
1532
1533         used in the future; however, UTC decisions in the future may restrict the number of values
1534
1535         to 128, since this has implementation advantages. [Signed bytes can be used without
1536
1537         widening to ints in Java, for example.] </li>
1538
1539     </ul>
1540
1541   </li>
1542
1543   <li>All characters other than those of General Category M* have the combining class 0. <ul>
1544
1545       <li>Currently, all characters other than those of General Category Mn have the value 0.
1546
1547         However, some characters of General Category Me or Mc may be given non-zero values in the
1548
1549         future. </li>
1550
1551       <li>The precise values above the value 0 are not invariant--only the relative ordering is
1552
1553         considered normative. For example, it is not guaranteed in future versions that the class
1554
1555         of U+05B4 will be precisely 14. </li>
1556
1557     </ul>
1558
1559   </li>
1560
1561 </ul>
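<p>The statement that only the relative ordering of non-zero combining classes matters
can be seen in a small sketch of canonical reordering. This is a simplified illustration,
not the normative algorithm text: it performs a stable sort on each run of characters
with non-zero combining class, using the values from Python's <tt>unicodedata</tt> module
(a later Unicode version than 3.0.0). The function name <tt>canonical_reorder</tt> is
hypothetical.</p>

```python
import unicodedata


def canonical_reorder(text):
    """Stable-sort each run of non-zero combining class characters."""
    out = []
    run = []  # pending characters whose combining class is non-zero
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class-0 character ends the run; sorted() is stable, so marks
            # with equal classes keep their original relative order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# COMBINING ACUTE ACCENT precedes COMBINING DOT BELOW in the input;
# reordering places the mark with the lower class first.
source = "a\u0301\u0323"
print([hex(ord(c)) for c in canonical_reorder(source)])
```

<p>Because the sort compares values only against each other, replacing the classes with
any order-preserving compact numbering would give the same result, which is what makes
the compression mentioned above possible.</p>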
1562
1563
1564
1565 <h4>Case</h4>
1566
1567
1568
1569 <ul>
1570
1571   <li>Characters of type Lu, Lt, or Ll are called <i>cased</i>. All characters with an Upper,
1572
1573     Lower, or Titlecase mapping are cased characters. <ul>
1574
1575       <li>However, characters with the General Categories of Lu, Ll, or Lt may not always have
1576
1577         case mappings, and case mappings may vary by locale. (See
1578
1579         ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt). </li>
1580
1581     </ul>
1582
1583   </li>
1584
1585 </ul>
1586
1587
1588
1589 <h4>Canonical Decomposition</h4>
1590
1591
1592
1593 <ul>
1594
1595   <li>Canonical mappings are always in canonical order. </li>
1596
1597   <li>In a canonical mapping to a pair of characters, only the first of the pair may decompose further. </li>
1598
1599   <li>Canonical decompositions are &quot;transparent&quot; to other character data: <ul>
1600
1601       <li><tt>BIDI(a) = BIDI(principal(canonicalDecomposition(a)))</tt> </li>
1602
1603       <li><tt>Category(a) = Category(principal(canonicalDecomposition(a)))</tt> </li>
1604
1605       <li><tt>CombiningClass(a) = CombiningClass(principal(canonicalDecomposition(a)))</tt><br>
1606
1607         where principal(a) is the first character not of type Mn, or the first character if all
1608
1609         characters are of type Mn. </li>
1610
1611     </ul>
1612
1613   </li>
1614
1615   <li>However, because there are sometimes missing case pairs, and because of some legacy
1616
1617     characters, it is only generally true that: <ul>
1618
1619       <li><tt>upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))</tt> </li>
1620
1621       <li><tt>lower(canonicalDecomposition(a)) = canonicalDecomposition(lower(a))</tt> </li>
1622
1623       <li><tt>title(canonicalDecomposition(a)) = canonicalDecomposition(title(a))</tt> </li>
1624
1625     </ul>
1626
1627   </li>
1628
1629 </ul>
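<p>The transparency conditions can be spot-checked with a short sketch. It assumes the
one-level canonical decomposition as exposed by Python's <tt>unicodedata</tt> module (a
later Unicode version than 3.0.0); the function names
<tt>canonical_decomposition</tt>, <tt>principal</tt>, and <tt>is_transparent</tt> are
hypothetical and simply mirror the notation used in the list above.</p>

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the one-level canonical decomposition of ch, or ch if none."""
    entry = unicodedata.decomposition(ch)
    # An entry starting with "<tag>" is a compatibility mapping, not canonical.
    if not entry or entry.startswith("<"):
        return ch
    return "".join(chr(int(code, 16)) for code in entry.split())


def principal(sequence):
    """First character not of type Mn, or the first character if all are Mn."""
    for ch in sequence:
        if unicodedata.category(ch) != "Mn":
            return ch
    return sequence[0]


def is_transparent(ch):
    """Check the three property equalities for a single character."""
    p = principal(canonical_decomposition(ch))
    return (
        unicodedata.bidirectional(ch) == unicodedata.bidirectional(p)
        and unicodedata.category(ch) == unicodedata.category(p)
        and unicodedata.combining(ch) == unicodedata.combining(p)
    )


# LATIN CAPITAL LETTER A WITH RING ABOVE decomposes to A + COMBINING RING ABOVE.
print(is_transparent("\u00C5"))
```

<p>Running the check over every character with a canonical decomposition is one way to
test a locally maintained property table against these invariants.</p>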
1630
1631
1632
1633 <h2><a NAME="Modification History"></a>Modification History</h2>
1634
1635
1636
1637 <p>This section provides a summary of the changes between update versions of the Unicode
1638
1639 Standard.</p>
1640
1641
1642
1643 <h3><a href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.0"> Unicode 3.0.0</a></h3>
1644
1645
1646
1647 <p>Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and
1648
1649 a number of property changes. These are summarized in Appendix D of <em>The Unicode
1650
1651 Standard, Version 3.0.</em></p>
1652
1653
1654
1655 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.9">Unicode 2.1.9</a> </h3>
1656
1657
1658
1659 <p>Modifications made for Version 2.1.9 of UnicodeData.txt include:
1660
1661
1662
1663 <ul>
1664
1665   <li>Corrected combining class for U+05AE HEBREW ACCENT ZINOR. </li>
1666
1667   <li>Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE. </li>
1668
1669   <li>Corrected combining class for U+0F35 and U+0F37 to 220. </li>
1670
1671   <li>Corrected combining class for U+0F71 to 129. </li>
1672
1673   <li>Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR. </li>
1674
1675   <li>Added decompositions for several Greek symbol letters: U+03D0..U+03D2, U+03D5,
1676
1677     U+03D6, U+03F0..U+03F2. </li>
1678
1679   <li>Removed decompositions from the conjoining jamo block: U+1100..U+11F8. </li>
1680
1681   <li>Changes to decomposition mappings for some Tibetan vowels for consistency in
1682
1683     normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81) </li>
1684
1685   <li>Updated the decomposition mappings for several Vietnamese characters with two diacritics
1686
1687     (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, U+1ED9), so that the recursive
1688
1689     decomposition can be generated directly in canonically reordered form (not a normative
1690
1691     change). </li>
1692
1693   <li>Updated the decomposition mappings for several Arabic compatibility characters involving
1694
1695     shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin characters (U+1E1C, U+1E1D), so
1696
1697     that the decompositions are generated directly in canonically reordered form (not a
1698
1699     normative change). </li>
1700
1701   <li>Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+2028 LINE
1702
1703     SEPARATOR. </li>
1704
1705   <li>Changed BIDI category for extenders of General Category Lm: U+3005, U+3021..U+3035,
1706
1707     U+FF9E, U+FF9F. </li>
1708
1709   <li>Changed General Category and BIDI category for the Greek numeral signs: U+0374, U+0375. </li>
1710
1711   <li>Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL. </li>
1712
1713   <li>Added Unicode 1.0 names for many Tibetan characters (informative). </li>
1714
1715 </ul>
1716
1717
1718
1719 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.8">Unicode 2.1.8</a> </h3>
1720
1721
1722
1723 <p>Modifications made for Version 2.1.8 of UnicodeData.txt include:
1724
1725
1726
1727 <ul>
1728
1729   <li>Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI so that
1730
1731     decompositions involving iota subscript are derivable directly in canonically reordered
1732
1733     form; this also has a bearing on simplification of casing of polytonic Greek. </li>
1734
1735   <li>Changes in decompositions related to Greek tonos. These result from the clarification
1736
1737     that monotonic Greek &quot;tonos&quot; should be equated with U+0301 COMBINING ACUTE,
1738
1739     rather than with U+030D COMBINING VERTICAL LINE ABOVE. (All Greek characters in the Greek
1740
1741     block involving &quot;tonos&quot;; some polytonic Greek characters in the
1742
1743     1FXX block.) </li>
1744
1745   <li>Changed decompositions involving dialytika tonos. (U+0390, U+03B0) </li>
1746
1747   <li>Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) These changes
1748
1749     simplify normalization. </li>
1750
1751   <li>Removed canonical decomposition for Latin Candrabindu. (U+0310) </li>
1752
1753   <li>Corrected error in canonical decomposition for U+1FF4. </li>
1754
1755   <li>Added compatibility decompositions to clarify collation tables. (U+2100, U+2101, U+2105,
1756
1757     U+2106, U+1E9A) </li>
1758
1759   <li>A series of general category changes to assist the convergence of the Unicode definition
1760
1761     of identifier with ISO TR 10176: <ul>
1762
1763       <li>So &gt; Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B </li>
1764
1765       <li>Po &gt; Lo: U+0E2F, U+0EAF, U+3006 </li>
1766
1767       <li>Lm &gt; Sk: U+309B, U+309C </li>
1768
1769       <li>Po &gt; Pc: U+30FB, U+FF65 </li>
1770
1771       <li>Ps/Pe &gt; Mn: U+0F3E, U+0F3F </li>
1772
1773     </ul>
1774
1775   </li>
1776
1777   <li>A series of bidi property changes for consistency. <ul>
1778
1779       <li>L &gt; ET: U+09F2, U+09F3 </li>
1780
1781       <li>ON &gt; L: U+3007 </li>
1782
1783       <li>L &gt; ON: U+0F3A..U+0F3D, U+037E, U+0387 </li>
1784
1785     </ul>
1786
1787   </li>
1788
1789   <li>Added case mapping: U+01A6 &lt;-&gt; U+0280 </li>
1790
1791   <li>Updated symmetric swapping value for guillemets: U+00AB, U+00BB, U+2039, U+203A. </li>
1792
1793   <li>Changes to combining class values. Most Indic fixed position class non-spacing marks
1794
1795     were changed to combining class 0. This fixes some inconsistencies in how canonical
1796
1797     reordering would apply to Indic scripts, including Tibetan. Indic interacting top/bottom
1798
1799     fixed position classes were merged into single (non-zero) classes as part of this change.
1800
1801     Tibetan subjoined consonants are changed from combining class 6 to combining class 0. Thai
1802
1803     pinthu (U+0E3A) moved to combining class 9. Moved two Devanagari stress marks into generic
1804
1805     above and below combining classes (U+0951, U+0952). </li>
1806
1807   <li>Corrected placement of semicolon near symmetric swapping field. (U+FA0E, etc., scattered
1808
1809     positions to U+FA29) </li>
1810
1811 </ul>
1812
1813
1814
1815 <h3>Version 2.1.7</h3>
1816
1817
1818
1819 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1820
1821
1822
1823 <h3>Version 2.1.6</h3>
1824
1825
1826
1827 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1828
1829
1830
1831 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.5">Unicode 2.1.5</a> </h3>
1832
1833
1834
1835 <p>Modifications made for Version 2.1.5 of UnicodeData.txt include:
1836
1837
1838
1839 <ul>
1840
1841   <li>Changed decomposition for U+FF9E and U+FF9F so that correct collation weighting will
1842
1843     automatically result from the canonical equivalences. </li>
1844
1845   <li>Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9, U+04E0, U+04E1,
1846
1847     U+04E8, U+04E9 (the implication being that no canonical equivalence is claimed between
1848
1849     these 8 characters and similar Latin letters), and updated 4 canonical decompositions for
1850
1851     U+04DB, U+04DC, U+04EA, U+04EB to reflect the implied difference in the base character. </li>
1852
1853   <li>Added Pi and Pf categories and assigned the relevant quotation marks to those
1854
1855     categories, based on the Unicode Technical Corrigendum on Quotation Characters. </li>
1856
1857   <li>Updating of many bidi properties, following the advice of the ad hoc committee on bidi,
1858
1859     and to make the bidi properties of compatibility characters more consistent. </li>
1860
1861   <li>Changed category of several Tibetan characters: U+0F3E, U+0F3F, U+0F88..U+0F8B to make
1862
1863     them non-combining, reflecting the combined opinion of Tibetan experts. </li>
1864
1865   <li>Added case mapping for U+03F2. </li>
1866
1867   <li>Corrected case mapping for U+0275. </li>
1868
1869   <li>Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0..U+03F2. </li>
1870
1871   <li>Corrected compatibility label for U+2121. </li>
1872
1873   <li>Added specific entries for all the CJK compatibility ideographs, U+F900..U+FA2D, so the
1874
1875     canonical decomposition for each (the URO character it is equivalent to) can be carried in
1876
1877     the database. </li>
1878
1879 </ul>
1880
1881
1882
1883 <h3>Version 2.1.4</h3>
1884
1885
1886
1887 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1888
1889
1890
1891 <h3>Version 2.1.3</h3>
1892
1893
1894
1895 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1896
1897
1898
1899 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.2">Unicode 2.1.2</a> </h3>
1900
1901
1902
1903 <p>Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode
1904
1905 Standard, Version 2.1 (from Version 2.0) include:
1906
1907
1908
1909 <ul>
1910
1911   <li>Added two characters (U+20AC and U+FFFC). </li>
1912
1913   <li>Amended bidi properties for U+0026, U+002E, U+0040, U+2007. </li>
1914
1915   <li>Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275, U+03C2, U+1E9B. </li>
1916
1917   <li>Changed combining order class for U+0F71. </li>
1918
1919   <li>Corrected canonical decompositions for U+0F73, U+1FBE. </li>
1920
1921   <li>Changed decomposition for U+FB1F from compatibility to canonical. </li>
1922
1923   <li>Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB. </li>
1924
1925   <li>Corrected compatibility decompositions for U+2469, U+246A, U+3358. </li>
1926
1927 </ul>
1928
1929
1930
1931 <h3>Version 2.1.1</h3>
1932
1933
1934
1935 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1936
1937
1938
1939 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.0.0">Unicode 2.0.0</a> </h3>
1940
1941
1942
1943 <p>The modifications made in updating UnicodeData.txt for the Unicode
1944
1945 Standard, Version 2.0 include:
1946
1947
1948
1949 <ul>
1950
1951   <li>Fixed decompositions with TONOS to use correct NSM: 030D. </li>
1952
1953   <li>Removed old Hangul Syllables; mappings to new characters are in a separate table. </li>
1954
1955   <li>Marked compatibility decompositions with additional tags. </li>
1956
1957   <li>Changed old tag names for clarity. </li>
1958
1959   <li>Revision of decompositions to use first-level decomposition, instead of maximal
1960
1961     decomposition. </li>
1962
1963   <li>Correction of all known errors in decompositions from earlier versions. </li>
1964
1965   <li>Added control code names (as old Unicode names). </li>
1966
1967   <li>Added Hangul Jamo decompositions. </li>
1968
1969   <li>Added Number category to match properties list in book. </li>
1970
1971   <li>Fixed categories of Koranic Arabic marks. </li>
1972
1973   <li>Fixed categories of precomposed characters to match decomposition where possible. </li>
1974
1975   <li>Added Hebrew cantillation marks and the Tibetan script. </li>
1976
1977   <li>Added place holders for ranges such as CJK Ideographic Area and the Private Use Area. </li>
1978
1979   <li>Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the
1980
1981     database. </li>
1982
1983 </ul>
1984
1985 </body>
1986
1987 </html>
1988