admin/notes/unicode

   1                                             -*-mode: text; coding: utf-8;-*-
   2
   3 Copyright (C) 2002-2016 Free Software Foundation, Inc.
   4 See the end of the file for license conditions.
   5
   6 Importing a new Unicode Standard version into Emacs
   7 -------------------------------------------------------------
   8
   9 Emacs uses the following files from the Unicode Character Database
  10 (a.k.a. "UCD"):
  11
  12   . UnicodeData.txt
  13   . Blocks.txt
  14   . BidiMirroring.txt
  15   . BidiBrackets.txt
  16   . IVD_Sequences.txt
  17   . BidiCharacterTest.txt
  18
  19 First, copy the first five files into admin/unidata/, and then
  20 rebuild Emacs for them to take effect.  Rebuilding
  21 Emacs updates several derived files elsewhere in the Emacs source
  22 tree, mainly in lisp/international/.
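
The copy step above can be sketched as a small Python helper.  This
is only an illustration, not part of the Emacs build: the function
name copy_ucd_files and its two directory arguments are hypothetical,
and it assumes admin/unidata/ already exists in the source tree.

```python
import shutil
from pathlib import Path

# The five files the notes say belong in admin/unidata/.
UNIDATA_FILES = [
    "UnicodeData.txt",
    "Blocks.txt",
    "BidiMirroring.txt",
    "BidiBrackets.txt",
    "IVD_Sequences.txt",
]


def copy_ucd_files(ucd_dir, emacs_root):
    """Copy the five files from UCD_DIR into EMACS_ROOT/admin/unidata/.

    Returns the list of destination paths.  Raises FileNotFoundError
    before copying anything if a source file is missing.
    """
    src = Path(ucd_dir)
    dest = Path(emacs_root) / "admin" / "unidata"
    missing = [name for name in UNIDATA_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError("missing UCD files: " + ", ".join(missing))
    copied = []
    for name in UNIDATA_FILES:
        copied.append(Path(shutil.copy2(src / name, dest / name)))
    return copied
```

Checking for missing files first avoids leaving admin/unidata/ with a
mix of old and new versions.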
  23
  24 When Emacs is rebuilt for the first time after importing the new
  25 files, pay attention to any warning or error messages.  In particular,
  26 admin/unidata/unidata-gen.el will complain if UnicodeData.txt defines
  27 new bidirectional attributes of characters, because unidata-gen.el,
  28 bidi.c and dispextern.h need to be updated in that case; failure to do
  29 so will cause aborts in redisplay.
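
To look for new bidirectional attributes before rebuilding, something
like the following Python sketch could be used.  It is not what
unidata-gen.el does internally.  The set of known classes is
hard-coded here from the Unicode Standard (6.3 and later) as an
assumption, and would have to mirror what bidi.c actually handles.

```python
# Bidi_Class values assumed known; adjust to match bidi.c.
KNOWN_BIDI_CLASSES = frozenset(
    "L R AL EN ES ET AN CS NSM BN B S WS ON "
    "LRE LRO RLE RLO PDF LRI RLI FSI PDI".split()
)


def unknown_bidi_classes(lines, known=KNOWN_BIDI_CLASSES):
    """Return the set of Bidi_Class values in LINES that are not in KNOWN.

    LINES is an iterable of UnicodeData.txt lines.  Field 4 (counting
    from 0) of each semicolon-separated line is the Bidi_Class.
    """
    found = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        if len(fields) > 4 and fields[4]:
            found.add(fields[4])
    return found - known
```

An empty result means no class outside the assumed set appears in the
file.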
  30
  31 Next, review the changes in UnicodeData.txt vs the previous version
  32 used by Emacs.  Any changes, be it introduction of new scripts or
  33 addition of codepoints to existing scripts, might need corresponding
  34 changes in the data used for filling the category-table, case-table,
  35 and char-width-table.  The additional scripts should cause automatic
  36 updates in charscript.el, but it is a good idea to look at the results
  37 and see if any changes in admin/unidata/blocks.awk are required.
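
As a rough aid for this review, the following Python sketch lists the
codepoints that were added or whose properties changed between two
versions of UnicodeData.txt.  The function names are hypothetical.
Ranges given as "First"/"Last" line pairs are treated as two ordinary
entries, so a range whose end moves shows up as one removed and one
added endpoint rather than as a list of every new codepoint.

```python
def parse_unicode_data(lines):
    """Map each code point (int) to a tuple of its remaining fields."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        table[int(fields[0], 16)] = tuple(fields[1:])
    return table


def diff_unicode_data(old_lines, new_lines):
    """Return (added, changed) lists of code points, both sorted."""
    old = parse_unicode_data(old_lines)
    new = parse_unicode_data(new_lines)
    added = sorted(cp for cp in new if cp not in old)
    changed = sorted(cp for cp in new if cp in old and new[cp] != old[cp])
    return added, changed
```

The added list is the place to start when checking whether the
category-table, case-table or char-width-table data need changes.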
  38
  39 Any new scripts added by UnicodeData.txt will also need updates to
  40 script-representative-chars defined in fontset.el, and also the list
  41 of OTF script tags in otf-script-alist, whose source is on this page:
  42
  43   https://www.microsoft.com/typography/otspec/scripttags.htm
  44
  45 Other databases in fontset.el might also need to be updated.
  46
  47 The function 'ucs-names', defined in lisp/international/mule-cmds.el,
  48 might need to be updated because it knows about used and unused ranges
  49 of Unicode codepoints, which a new release of the Unicode Standard
  50 could change.
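
The ranges that the new UnicodeData.txt actually lists can be computed
with a sketch like the one below and then compared by hand with the
ranges in 'ucs-names'.  How 'ucs-names' stores its ranges is not shown
here, and it may intentionally skip some listed ranges, so this only
produces the input for that comparison.  The code assumes the lines
are in ascending codepoint order, as they are in UnicodeData.txt.

```python
def assigned_ranges(lines):
    """Return a list of (first, last) ranges of listed code points.

    Handles the "<Name, First>" / "<Name, Last>" line pairs that
    UnicodeData.txt uses for large ranges, and merges adjacent ranges.
    """
    ranges = []
    range_start = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue
        cp = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            range_start = cp  # remember the start, wait for "Last"
            continue
        if name.endswith(", Last>") and range_start is not None:
            first, range_start = range_start, None
        else:
            first = cp
        if ranges and first <= ranges[-1][1] + 1:
            # Contiguous with the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], cp))
        else:
            ranges.append((first, cp))
    return ranges
```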
  51
  52 The file BidiCharacterTest.txt should be copied to the test suite, and
  53 if its format has changed, the file biditest.el there should be
  54 modified to follow suit.
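
A quick way to notice a format change is to parse every data line and
fail on the first one that does not fit.  The Python sketch below
assumes the five-field layout described in the header comments of
BidiCharacterTest.txt (codepoints; paragraph direction; resolved
paragraph level; resolved levels, with "x" for removed characters;
visual order).  It is not the parser used by biditest.el.

```python
def parse_bidi_test_line(line):
    """Parse one line of BidiCharacterTest.txt.

    Returns None for blank and comment lines.  Raises ValueError if
    the line does not have the five semicolon-separated fields assumed
    here, which would suggest that the file format has changed.
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    fields = line.split(";")
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d: %r" % (len(fields), line))
    codepoints = [int(cp, 16) for cp in fields[0].split()]
    paragraph_direction = int(fields[1])
    paragraph_level = int(fields[2])
    # "x" marks a character removed by the algorithm; keep it as None.
    levels = [None if lv == "x" else int(lv) for lv in fields[3].split()]
    visual_order = [int(i) for i in fields[4].split()]
    if len(levels) != len(codepoints):
        raise ValueError("level count does not match code point count")
    return codepoints, paragraph_direction, paragraph_level, levels, visual_order
```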
  55
  56 Problems, fixmes and other unicode-related issues
  57 -------------------------------------------------------------
  58
  59 Notes by fx to record various things of variable importance.  Handa
  60 needs to check them -- don't take too seriously, especially with
  61 regard to completeness.
  62
  63  * SINGLE_BYTE_CHAR_P returns true for Latin-1 characters, which has
  64    undesirable effects.  E.g.:
  65    (multibyte-string-p (let ((s "x")) (aset s 0 ?£) s)) => nil
  66    (multibyte-string-p (concat [?£])) => nil
  67    (text-char-description ?£) => "M-#"
  68
  69         These examples are all fixed by the change of 2002-10-14, but
  70         there still exist questionable SINGLE_BYTE_CHAR_P in the
  71         code (keymap.c and print.c).
  72
  73  * Rationalize character syntax and its relationship to the Unicode
  74    database.  (Applies mainly to symbol and punctuation syntax.)
  75
  76  * Fontset handling and customization needs work.  We want to relate
  77    fonts to scripts, probably based on the Unicode blocks.  The
  78    presence of small-repertoire 10646-encoded fonts in XFree 4 is a
  79    pain, not currently worked round.
  80
  81         With the change on 2002-07-26, multiple fonts can be
  82         specified in a fontset for a specific range of characters.
  83         Each range can also be specified by script.  Before using
  84         ISO10646 fonts, Emacs checks their repertoires to avoid
  85         fonts that lack a glyph for a specific character.
  86
  87         fx has worked on fontset customization, but was stymied by
  88         basic problems with the way the default face is dealt with
  89         (and something else, I think).  This needs revisiting.
  90
  91  * Work is also needed on charset and coding system priorities.
  92
  93  * The relevant bits of latin1-disp.el need porting (and probably
  94    re-naming/updating).  See also cyril-util.el.
  95
  96  * Quail files need more work now the encoding is largely irrelevant.
  97
  98  * What to do with the old coding categories stuff?
  99
 100  * The preferred-coding-system property of charsets should probably be
 101    junked unless it can be made more useful now.
 102
 103  * find-multibyte-characters needs looking at.
 104
 105  * Implement Korean cp949/UHC, BIG5-HKSCS and any other important missing
 106    charsets.
 107
 108  * Lazy-load tables for unify-charset somehow?
 109
 110         Actually, Emacs clears out all charset maps and unify-map just
 111         before dumping, and they are loaded again on demand by the
 112         dumped emacs.  But, those maps (char tables) generated while
 113         temacs is running can't be removed from the dumped emacs.
 114
 115  * iso-2022 charsets get unified on i/o.
 116
 117         With the change on 2003-01-06, decoding routines put the 'charset'
 118         property onto decoded text, and the iso-2022 encoder pays attention
 119         to it.  Thus, for instance, reading and writing by
 120         iso-2022-7bit preserve the original designation sequences.
 121         The property name 'preferred-charset' may be better?
 122
 123         We may have to utilize this property to decide a font.
 124
 125  * Revisit locale processing: look at treating the language and
 126    charset parts separately.  (Language should affect things like
 127    spelling and calendar, but that's not a Unicode issue.)
 128
 129  * Handle Unicode combining characters usefully, e.g. diacritics, and
 130    handle more scripts specifically (à la Devanagari).  There are
 131    issues with canonicalization.
 132
 133  * We need tabular input methods, e.g. for maths symbols.  (Not
 134    specific to Unicode.)
 135
 136  * Need multibyte text in menus, e.g. for the above.  (Not specific to
 137    Unicode -- see Emacs etc/TODO, but now mostly works with gtk.)
 138
 139  * There's currently no support for Unicode normalization.
 140
 141  * Populate char-width-table correctly for Unicode characters and
 142    worry about what happens when double-width charsets covering
 143    non-CJK characters are unified.
 144
 145  * There are type errors lurking, e.g. in
 146    Fcheck_coding_systems_region.  Define ENABLE_CHECKING to find them.
 147
 148  * Old auto-save files, and similar files, such as Gnus drafts,
 149    containing non-ASCII characters probably won't be re-read correctly.
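
To make the combining-character and normalization items in the list
above concrete, here is a small illustration of canonical equivalence.
It uses Python's standard unicodedata module, not Emacs code, and the
helper name canonically_equal is hypothetical.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a, b):
    """Return True if A and B are equal after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two strings display alike but differ code point by code point.
assert composed != decomposed
# After normalization they compare equal.
assert canonically_equal(composed, decomposed)
# NFC composes the pair back into the single precomposed character.
assert unicodedata.normalize("NFC", decomposed) == composed
```

Without normalization, searching or comparing text treats the two
forms as different strings, which is the canonicalization issue the
item on combining characters refers to.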
 150
 151
 152 Source file encoding
 153 --------------------
 154
 155 Most Emacs source files are encoded in UTF-8 (or in ASCII, which is a
 156 subset), but there are a few exceptions, listed below.  Perhaps
 157 someday many of these files will be converted to UTF-8, for
 158 convenience when using tools like 'grep -r', but this might need
 159 nontrivial changes to the build process.
 160
 161  * chinese-big5
 162
 163      These are verbatim copies of files taken from external sources.
 164      They haven't been converted to UTF-8.
 165
 166         leim/CXTERM-DIC/4Corner.tit
 167         leim/CXTERM-DIC/ARRAY30.tit
 168         leim/CXTERM-DIC/ECDICT.tit
 169         leim/CXTERM-DIC/ETZY.tit
 170         leim/CXTERM-DIC/PY-b5.tit
 171         leim/CXTERM-DIC/Punct-b5.tit
 172         leim/CXTERM-DIC/QJ-b5.tit
 173         leim/CXTERM-DIC/ZOZY.tit
 174         leim/MISC-DIC/CTLau-b5.html
 175         leim/MISC-DIC/cangjie-table.b5
 176
 177  * chinese-iso-8bit
 178
 179      These are verbatim copies of files taken from external sources.
 180      They haven't been converted to UTF-8.
 181
 182         leim/CXTERM-DIC/CCDOSPY.tit
 183         leim/CXTERM-DIC/Punct.tit
 184         leim/CXTERM-DIC/QJ.tit
 185         leim/CXTERM-DIC/SW.tit
 186         leim/CXTERM-DIC/TONEPY.tit
 187         leim/MISC-DIC/CTLau.html
 188         leim/MISC-DIC/pinyin.map
 189         leim/MISC-DIC/ziranma.cin
 190
 191  * cp850
 192
 193      This file contains non-ASCII characters in unibyte strings.  When
 194      editing a keyboard layout it's more convenient to see 'é' than
 195      '\202', and the MS-DOS compiler requires the single byte if a
 196      backslash escape is not being used.
 197
 198         src/msdos.c
 199
 200  * iso-2022-cn-ext
 201
 202      This file is externally generated from leim/MISC-DIC/cangjie-table.b5
 203      by a Big5->CNS converter.  It hasn't been converted to UTF-8.
 204
 205         leim/MISC-DIC/cangjie-table.cns
 206
 207  * japanese-iso-8bit
 208
 209      SKK-JISYO.L is a verbatim copy of a file taken from an external source.
 210      It hasn't been converted to UTF-8.
 211
 212         leim/SKK-DIC/SKK-JISYO.L
 213
 214  * japanese-shift-jis
 215
 216      This is a verbatim copy of a file taken from an external source.
 217      It hasn't been converted to UTF-8.
 218
 219         admin/charsets/mapfiles/cns2ucsdkw.txt
 220
 221  * iso-2022-7bit
 222
 223      This file switches between CJK charsets; UTF-8 cannot encode the switching.
 224
 225         etc/HELLO
 226
 227      Each of these files contains just one CJK charset, but Emacs
 228      currently has no easy way to specify set-charset-priority on a
 229      per-file basis, so converting any of these files to UTF-8 might
 230      change the file's appearance when viewed by an Emacs that is
 231      operating in some other language environment.
 232
 233         etc/tutorials/TUTORIAL.ja
 234         lisp/international/ja-dic-cnv.el
 235         lisp/international/ja-dic-utl.el
 236         lisp/international/kinsoku.el
 237         lisp/international/kkc.el
 238         lisp/international/titdic-cnv.el
 239         lisp/language/japan-util.el
 240         lisp/language/japanese.el
 241         lisp/leim/quail/cyril-jis.el
 242         lisp/leim/quail/hanja-jis.el
 243         lisp/leim/quail/japanese.el
 244         lisp/leim/quail/py-punct.el
 245         lisp/leim/quail/pypunct-b5.el
 246
 247      This file contains just Chinese characters, and has the same problem.
 248      Also, it contains characters that cannot be encoded in UTF-8.
 249
 250         lisp/international/titdic-cnv.el
 251
 252  * utf-8-emacs
 253
 254      These files contain characters that cannot be encoded in UTF-8.
 255
 256         lisp/language/ethio-util.el
 257         lisp/language/ethiopic.el
 258         lisp/language/ind-util.el
 259         lisp/language/tibet-util.el
 260         lisp/language/tibetan.el
 261         lisp/leim/quail/ethiopic.el
 262         lisp/leim/quail/tibetan.el
 263
 264  * binary files
 265
 266      These files contain binary data, and are not text files.
 267      Some of the entries in this list are patterns, and stand for any
 268      files with the listed extension.
 269
 270         *.gz
 271         *.icns
 272         *.ico
 273         *.pbm
 274         *.pdf
 275         *.png
 276         *.sig
 277         etc/e/eterm-color
 278         etc/package-keyring.gpg
 279         msdos/emacs.pif
 280         nextstep/GNUstep/Emacs.base/Resources/emacs.tiff
 281         nt/icons/hand.cur
 282
 283 \f
 284 This file is part of GNU Emacs.
 285
 286 GNU Emacs is free software: you can redistribute it and/or modify
 287 it under the terms of the GNU General Public License as published by
 288 the Free Software Foundation, either version 3 of the License, or
 289 (at your option) any later version.
 290
 291 GNU Emacs is distributed in the hope that it will be useful,
 292 but WITHOUT ANY WARRANTY; without even the implied warranty of
 293 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 294 GNU General Public License for more details.
 295
 296 You should have received a copy of the GNU General Public License
 297 along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.