man/pre-preconv.1.in

   1 .ig
   2 @ @L_P_PRECONV@.1
   3
   4 Copyright (c) 2014 - 2015 Steffen (Daode) Nurpmeso <sdaoden@users.sf.net>.
   5
   6 Copyright (C) 2006 - 2008 Free Software Foundation, Inc.
   7
   8 Permission is granted to make and distribute verbatim copies of
   9 this manual provided the copyright notice and this permission notice
  10 are preserved on all copies.
  11
  12 Permission is granted to copy and distribute modified versions of this
  13 manual under the conditions for verbatim copying, provided that the
  14 entire resulting derived work is distributed under the terms of a
  15 permission notice identical to this one.
  16
  17 Permission is granted to copy and distribute translations of this
  18 manual into another language, under the above conditions for modified
  19 versions, except that this permission notice may be included in
  20 translations approved by the Free Software Foundation instead of in
  21 the original English.
  22 ..
  23 .
  24 .TH @U_P_PRECONV@ @MAN1EXT@ "@MDATE@" "@T_ROFF@ v@VERSION@"
  25 .
  26 .
  27 .SH NAME
  28 @L_P_PRECONV@ \- convert encoding of input files
  29 .
  30 .
  31 .SH SYNOPSIS
  32 .SY @L_P_PRECONV@
  33 .OP \-dr
  34 .OP \-e encoding
  35 .RI [ files
  36 .IR .\|.\|. ]
  37 .
  38 .SY @L_P_PRECONV@
  39 .B \-h
  40 |
  41 .B \-\-help
  42 .
  43 .SY @L_P_PRECONV@
  44 .B \-v
  45 |
  46 .B \-\-version
  47 .YS
  48 .
  49 .PP
  50 It is possible to have whitespace between the
  51 .B \-e
  52 command line option and its parameter.
  53 .
  54 .
  55 .SH DESCRIPTION
  56 .B @L_P_PRECONV@
  57 reads
  58 .I files
  59 and converts their encoding(s) to a form
  60 .BR @L_TROFF@ (@MAN1EXT@)
  61 can process, sending the data to standard output.
  62 Currently, this means ASCII characters and `\e[uXXXX]' entities, where
  63 `XXXX' is a hexadecimal number with four to six digits, representing a
  64 Unicode input code.
  65 Normally,
  66 .B @L_P_PRECONV@
  67 should be invoked with the
  68 .B \-k
  69 and
  70 .B \-K
  71 options of
  72 .BR @L_ROFF@ .
  73 .
  74 .
  75 .SH OPTIONS
  76 .TP
  77 .B \-d
  78 Emit debugging messages to standard error (mainly the encoding used).
  79 .
  80 .TP
  81 .BI \-D encoding
  82 Specify default encoding if everything fails (see below).
  83 .
  84 .TP
  85 .BI \-e encoding
  86 Specify input encoding explicitly, overriding all other methods.
  87 This corresponds to
  88 .BR @L_ROFF@ 's
  89 .BI \-K encoding
  90 option.
  91 Without this switch,
  92 .B @L_P_PRECONV@
  93 uses the algorithm described below to select the input encoding.
  94 .
  95 .TP
  96 .B \-\-help
  97 .TQ
  98 .B \-h
  99 Print help message.
 100 .
 101 .TP
 102 .B \-r
 103 Do not add .lf requests.
 104 .
 105 .TP
 106 .B \-\-version
 107 .TQ
 108 .B \-v
 109 Print version number.
 110 .
 111 .
 112 .SH USAGE
 113 .B @L_P_PRECONV@
 114 tries to find the input encoding with the following algorithm.
 115 .
 116 .IP 1.
 117 If the input encoding has been explicitly specified with option
 118 .BR \-e ,
 119 use it.
 120 .
 121 .IP 2.
 122 Otherwise, check whether the input starts with a
 123 .I Byte Order Mark
 124 (BOM, see below).
 125 If found, use it.
 126 .
 127 .IP 3.
 128 Finally, check whether there is a known
 129 .I coding tag
 130 (see below) in either the first or second input line.
 131 If found, use it.
 132 .
 133 .IP 4.
 134 If everything fails, use a default encoding as given with option
 135 .BR \-D ,
 136 by the current locale, or `latin1' if the locale is set to `C',
 137 `POSIX', or empty (in that order).
 138 .
 139 .PP
 140 Note that the
 141 .B @L_ROFF@
 142 program supports a
 143 .B @U_ROFF@_ENCODING
 144 environment variable which is eventually expanded to option
 145 .BR \-k .
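.
.PP
As an illustration of step 1 above, the following sketch converts a file
whose encoding is known in advance.
The file names
.I input.1
and
.I output.1
are hypothetical, and the example assumes that
.B \-e
accepts the same names as the coding tags listed further down, such as
\%`iso-8859-2'.
.
.RS
.PP
.EX
@L_P_PRECONV@ \-e iso-8859-2 input.1 > output.1
.EE
.RE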
 146 .
 147 .SS "Byte Order Mark"
 148 The Unicode Standard defines character U+FEFF as the Byte Order Mark
 149 (BOM).
 150 On the other hand, value U+FFFE is guaranteed not to be a Unicode character at
 151 all.
 152 This makes it possible to detect the byte order within the data stream (either
 153 big-endian or little-endian), and the MIME encodings \%`UTF-16' and
 154 \%`UTF-32' mandate that the data stream starts with U+FEFF.
 155 Similarly, the data stream encoded as \%`UTF-8' might start with a BOM (to
 156 ease the conversion from and to \%UTF-16 and \%UTF-32).
 157 In all cases, the byte order mark is
 158 .I not
 159 part of the data but part of the encoding protocol; in other words,
 160 .BR @L_P_PRECONV@ 's
 161 output doesn't contain it.
 162 .
 163 .PP
 164 Note that U+FEFF not at the start of the input data is actually emitted;
 165 it then has the meaning of a `zero width no-break space' character \[en]
 166 something not needed normally in
 167 .BR @L_ROFF@ .
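.
.PP
For reference, these are the byte sequences that U+FEFF yields at the start
of a data stream in the Unicode encoding schemes.
They follow from the Unicode Standard and are not a statement about which
of them
.B @L_P_PRECONV@
actually checks for.
.
.RS
.PP
.EX
UTF-8      EF BB BF
UTF-16BE   FE FF
UTF-16LE   FF FE
UTF-32BE   00 00 FE FF
UTF-32LE   FF FE 00 00
.EE
.RE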
 168 .
 169 .SS "Coding Tags"
 170 Editors which support more than a single character encoding need tags
 171 within the input files to mark the file's encoding.
 172 While it is possible to guess the right input encoding with the help of
 173 heuristic algorithms for data which represents a greater amount of a natural
 174 language, it is still just a guess.
 175 Additionally, all algorithms fail easily for input which is either too short
 176 or doesn't represent a natural language.
 177 .
 178 .PP
 179 For these reasons,
 180 .B @L_P_PRECONV@
 181 supports the coding tag convention (with some restrictions) as used by
 182 .B "GNU Emacs"
 183 and
 184 .B XEmacs
 185 (and probably other programs too).
 186 .
 187 .PP
 188 Coding tags in
 189 .B "GNU Emacs"
 190 and
 191 .B XEmacs
 192 are stored in so-called
 193 .IR "File Variables" .
 194 .B @L_P_PRECONV@
 195 recognizes the following syntax form which must be put into a troff comment
 196 in the first or second line.
 197 .
 198 .RS
 199 .PP
 200 \-*\-
 201 .IR tag1 :
 202 .IR value1 ;
 203 .IR tag2 :
 204 .IR value2 ;
 205 \&.\|.\|.\& \-*\-
 206 .RE
 207 .
 208 .PP
 209 The only relevant tag for
 210 .B @L_P_PRECONV@
 211 is `coding' which can take the values listed below.
 212 Here is an example line which tells
 213 .B Emacs
 214 to edit a file in troff mode, and to use \%latin-2 as its encoding.
 215 .
 216 .RS
 217 .PP
 218 .EX
 219 \&.\[rs]" \-*\- mode: troff; coding: latin-2 \-*\-
 220 .EE
 221 .RE
 222 .
 223 .PP
 224 The following list gives all MIME coding tags (either lowercase or
 225 uppercase) supported by
 226 .BR @L_P_PRECONV@ ;
 227 this list is hard-coded in the source.
 228 .
 229 .RS
 230 .PP
 231 .ad l
 232 \%big5, \%cp1047, \%euc-jp, \%euc-kr, \%gb2312, \%iso-8859-1, \%iso-8859-2,
 233 \%iso-8859-5, \%iso-8859-7, \%iso-8859-9, \%iso-8859-13, \%iso-8859-15,
 234 \%koi8-r, \%us-ascii, \%utf-8, \%utf-16, \%utf-16be, \%utf-16le
 235 .ad
 236 .RE
 237 .
 238 .PP
 239 In addition, the following hard-coded list of other tags is recognized which
 240 eventually map to values from the list above.
 241 .
 242 .RS
 243 .PP
 244 .ad l
 245 \%ascii, \%chinese-big5, \%chinese-euc, \%chinese-iso-8bit, \%cn-big5,
 246 \%cn-gb, \%cn-gb-2312, \%cp878, \%csascii, \%csisolatin1,
 247 \%cyrillic-iso-8bit, \%cyrillic-koi8, \%euc-china, \%euc-cn, \%euc-japan,
 248 \%euc-japan-1990, \%euc-korea, \%greek-iso-8bit, \%iso-10646/utf8,
 249 \%iso-10646/utf-8, \%iso-latin-1, \%iso-latin-2, \%iso-latin-5,
 250 \%iso-latin-7, \%iso-latin-9, \%japanese-euc, \%japanese-iso-8bit, \%jis8,
 251 \%koi8, \%korean-euc, \%korean-iso-8bit, \%latin-0, \%latin1, \%latin-1,
 252 \%latin-2, \%latin-5, \%latin-7, \%latin-9, \%mule-utf-8, \%mule-utf-16,
 253 \%mule-utf-16be, \%mule-utf-16-be, \%mule-utf-16be-with-signature,
 254 \%mule-utf-16le, \%mule-utf-16-le, \%mule-utf-16le-with-signature, \%utf8,
 255 \%utf-16-be, \%utf-16-be-with-signature, \%utf-16be-with-signature,
 256 \%utf-16-le, \%utf-16-le-with-signature, \%utf-16le-with-signature
 257 .ad
 258 .RE
 259 .
 260 .PP
 261 Those tags are taken from
 262 .B "GNU Emacs"
 263 and
 264 .BR XEmacs ,
 265 together with some aliases.
 266 Trailing \%`-dos', \%`-unix', and \%`-mac' suffixes of coding tags (which
 267 give the end-of-line convention used in the file) are stripped off before
 268 the comparison with the above tags happens.
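.
.PP
As a sketch of the suffix rule just described, a tag line such as the
following would be treated like \%`utf-8', because the trailing \%`-unix'
is stripped before the lookup.
.
.RS
.PP
.EX
\&.\[rs]" \-*\- coding: utf-8-unix \-*\-
.EE
.RE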
 269 .
 270 .SS "Iconv Issues"
 271 .B @L_P_PRECONV@
 272 by itself only supports three encodings: \%latin-1, cp1047, and \%UTF-8;
 273 all other encodings are passed to the
 274 .B iconv
 275 library functions.
 276 At compile time, the build system searches for and tests a valid
 277 .B iconv
 278 implementation; a call to `@L_P_PRECONV@ \-\-version' shows whether
 279 .B iconv
 280 is used.
 281 .
 282 .
 283 .SH BUGS
 284 .B @L_P_PRECONV@
 285 doesn't support
 286 .I "local variable lists"
 287 yet.
 288 This is a different syntax form to specify local variables at the end of a
 289 file.
 290 .
 291 .
 292 .SH "SEE ALSO"
 293 .BR @L_ROFF@ (@MAN1EXT@)
 294 .\" s-ts-mode